Apache DataFu

Apache DataFu is a collection of libraries for working with large-scale data in Hadoop and Pig. It is currently an Apache Incubator project. For more details, visit the Apache DataFu project website.

The Apache DataFu Pig library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics in Apache Pig. It is used at LinkedIn in many of our offline workflows for data-derived products such as “People You May Know” and “Skills & Endorsements”. It contains functions for:

  • PageRank
  • Statistics (e.g. quantiles, median, variance)
  • Sampling (e.g. weighted, reservoir)
  • Sessionization
  • Convenience bag functions (e.g. enumerating items)
  • Convenience utility functions (e.g. assertions, easier writing of EvalFuncs)
  • Set operations (intersect, union)
  • and more
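To make one of these categories concrete, here is a minimal Python sketch of the general idea behind sessionization: grouping a time-ordered stream of events into sessions separated by periods of inactivity. It is an illustration only, not DataFu's implementation or API; the function name `sessionize`, the `timeout` parameter, and the sample timestamps are all assumptions made for this example.

```python
from typing import List, Tuple


def sessionize(timestamps: List[int], timeout: int) -> List[Tuple[int, int]]:
    """Pair each timestamp with a session number (hypothetical helper).

    Assumes `timestamps` is sorted in ascending order. A new session starts
    whenever the gap to the previous event exceeds `timeout`, which is
    expressed in the same units as the timestamps.
    """
    result = []
    session = 0
    previous = None
    for ts in timestamps:
        # Start a new session when the inactivity gap exceeds the timeout.
        if previous is not None and ts - previous > timeout:
            session += 1
        result.append((ts, session))
        previous = ts
    return result


# Made-up event times for one user, in arbitrary time units.
events = [0, 5, 12, 50, 55, 120]
print(sessionize(events, timeout=10))
# prints [(0, 0), (5, 0), (12, 0), (50, 1), (55, 1), (120, 2)]
```

In a Pig workflow the same logic would typically be applied per user after grouping and ordering events by time; the sketch leaves that step out to keep the core rule visible.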

Every function in the library is unit tested, and code coverage is tracked for the entire library.

There is also a separate library named Hourglass that provides a framework for incrementally processing data in Hadoop.

Apache DataFu Project Website