Apache DataFu

Apache DataFu is a collection of libraries for working with large-scale data in Hadoop and Pig. It is currently an Apache Incubator project. For more details, visit the main project site:

Apache DataFu Project Website

The DataFu Pig library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics in Apache Pig. It is used at LinkedIn in many of our offline workflows for data-derived products like “People You May Know” and “Skills & Endorsements”. It contains functions for:

  • PageRank
  • Statistics (e.g. quantiles, median, variance, etc.)
  • Sampling (e.g. weighted, reservoir, etc.)
  • Sessionization
  • Convenience bag functions (e.g. enumerating items)
  • Convenience utility functions (e.g. assertions, easier writing of EvalFuncs)
  • Set operations (intersect, union)
  • and more

Each function is unit tested, and code coverage is tracked for the entire library.
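
To give a feel for one of the listed capabilities, here is a minimal sketch of sessionization in plain Python. It is not DataFu's implementation or API: the function name `sessionize`, the `timeout` parameter, and the 10-minute default are all assumptions chosen for illustration. The idea shown is simply that a new session starts whenever the gap between consecutive events for a user exceeds a timeout.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, timeout=timedelta(minutes=10)):
    """Assign a session index to each event timestamp of a single user.

    Hypothetical helper for illustration only (not part of DataFu).
    A new session begins when the gap since the previous event is
    strictly greater than `timeout`. The input does not need to be
    sorted; the result is ordered by time.
    """
    session = 0
    previous = None
    result = []
    for ts in sorted(timestamps):
        # Start a new session if this event is too far from the last one.
        if previous is not None and ts - previous > timeout:
            session += 1
        result.append((ts, session))
        previous = ts
    return result


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)
    events = [start, start + timedelta(minutes=5), start + timedelta(minutes=30)]
    for ts, session in sessionize(events):
        print(ts.isoformat(), session)
```

In a Pig workflow the equivalent logic would typically run per user inside a grouped, time-ordered bag; the sketch above covers only the single-user case.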

The DataFu Hourglass library provides a framework for incrementally processing data in Hadoop.

Presentations

Blog Posts
