Apache DataFu is a collection of libraries for working with large-scale data in Hadoop and Pig. It is currently an Apache Incubator project. For more details, visit the main project site:
The DataFu Pig library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics in Apache Pig. It is used at LinkedIn in many of our offline workflows for data-derived products like “People You May Know” and “Skills & Endorsements”. It contains functions for:
Each function is unit tested, and code coverage is tracked for the entire library.
The DataFu Hourglass library provides a framework for incrementally processing data in Hadoop.