The Apache DataFu Pig library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics in Apache Pig. It is used at LinkedIn in many of our offline workflows for data-derived products such as “People You May Know” and “Skills & Endorsements”. It contains functions for:
Each function is unit tested, and code coverage is tracked for the entire library.
There is also a separate library named Hourglass that provides a framework for incrementally processing data in Hadoop.