We're the Data Team

The data team at LinkedIn works on LinkedIn's information retrieval systems, the social graph system, data driven features, and supporting data infrastructure.

October 3, 2012

LinkedIn’s DataFu library, a collection of useful UDFs for Pig, is now a standard part of Cloudera’s Hadoop distribution (CDH) starting with version 4.1:

Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations. Big thanks to the LinkedIn team!!!

Read the full announcement here. Pig users around the world rejoice!

September 27, 2012

LinkedIn has many analytical insight products such as "Who's Viewed My Profile?" and "Who's Viewed This Job?". At their core, these are multidimensional queries. For example, "Who's Viewed My Profile?" takes someone's profile views and breaks them down by industry, geography, company, school, etc., to show the richness of people who viewed their profile.

Online analytical processing (OLAP) has been the traditional approach to solve these multi-dimensional analytical problems. However, for our use cases, we had to build a solution that can answer these queries in milliseconds across 175+ million members. We called this solution Avatara:

Avatara: OLAP for Web-scale Analytical Products

Avatara is LinkedIn's scalable, low-latency, and highly available OLAP system for "sharded" multi-dimensional queries within the time constraints of a request/response loop. It has been successfully powering "Who's Viewed My Profile?" and other analytical products for the past two and a half years. We recently presented a paper on Avatara at the 38th International Conference on Very Large Data Bases (VLDB).


At LinkedIn, we need an OLAP solution that can serve our 175+ million members around the world. It must be able to answer queries in tens of milliseconds, as our members are waiting for the page to load on their browsers. An interesting insight for LinkedIn's use cases is that queries span relatively few – usually tens to at most a hundred – dimensions, so this data can be sharded across a primary dimension. For "Who's Viewed My Profile?", we can shard the cube by the member herself, as the product does not allow analyzing profile views of anyone other than the member currently logged in. We call this the many, small cubes problem.

Here's a brief overview of how it works. As shown in the figure below, Avatara consists of two components:

  1. An offline engine that computes cubes in batch
  2. An online engine that serves queries in real time

The offline engine computes cubes with high throughput by leveraging Hadoop for batch processing. It then writes cubes to Voldemort, LinkedIn's open-source key-value store. The online engine queries the Voldemort store when a member loads a page. Every piece in this architecture runs on commodity hardware and can be easily scaled horizontally.

Offline Engine

The offline batch engine processes data through a pipeline that has three phases:

  1. Preprocessing
  2. Projections and joins
  3. Cubification

Each phase runs one or more Hadoop jobs and produces output that is the input for the subsequent phase. We use Hadoop for its built-in high throughput, fault tolerance, and horizontal scalability. The pipeline preprocesses raw data as needed, projects out dimensions of interest, performs user-defined joins, and finally transforms the data into cubes. The result of the batch engine is a set of small, sharded cubes represented as key-value pairs, where each key identifies a shard (for example, member_id for "Who's Viewed My Profile?") and the value is the cube for that shard.
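As a rough illustration of the cubification step, here is a minimal Python sketch. It is not Avatara's actual code: the record layout, the two dimensions (industry and region), and the `cubify` function are all hypothetical, and a plain in-memory loop stands in for the Hadoop jobs. It only shows how joined records can be turned into one small cube per shard key.

```python
from collections import Counter, defaultdict

# Hypothetical output of the preprocessing and join phases:
# (viewed_member_id, viewer_industry, viewer_region)
PROFILE_VIEWS = [
    (1, "Software", "US"),
    (1, "Software", "US"),
    (1, "Finance", "UK"),
    (2, "Software", "IN"),
]


def cubify(records):
    """Build one small cube per shard key (here, the viewed member's id).

    Each cube maps a tuple of dimension values to a view count.
    """
    cubes = defaultdict(Counter)
    for member_id, industry, region in records:
        cubes[member_id][(industry, region)] += 1
    # Convert to plain dicts so each cube is a simple key-value payload.
    return {member_id: dict(cube) for member_id, cube in cubes.items()}


cubes = cubify(PROFILE_VIEWS)
for member_id, cube in sorted(cubes.items()):
    print(member_id, cube)
```

Each entry of `cubes` corresponds to one key-value pair of the kind described above: the key is the shard (the member) and the value is that member's cube.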

Online Engine

All cubes are bulk loaded into Voldemort. The online query engine retrieves and processes data from Voldemort and returns results to the client. It provides SQL-like operators, such as select, where, and group by, plus some math operations. The widespread adoption of SQL makes it easy for application developers to interact with Avatara. With Avatara, 80% of queries are satisfied within 10 ms, and 95% of queries are answered within 25 ms for "Who's Viewed My Profile?" on a high-traffic day.
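To make the SQL-like operators concrete, here is a small Python sketch of a "where" plus "group by" query over a single member's cube. It is an assumption-laden illustration rather than Avatara's real query API: the `query` function, the `DIMENSIONS` tuple, and the sample cube are hypothetical, and the cube is a local dict instead of a value fetched from Voldemort.

```python
from collections import Counter

# Hypothetical dimension order used by the cube keys below.
DIMENSIONS = ("industry", "region")

# Hypothetical cube for one member, as it might look after being fetched
# from the key-value store: dimension tuple -> view count.
cube = {
    ("Software", "US"): 2,
    ("Software", "IN"): 3,
    ("Finance", "UK"): 1,
}


def query(cube, group_by, where=None):
    """Sum counts grouped by one dimension, with optional equality filters."""
    where = where or {}
    group_index = DIMENSIONS.index(group_by)
    totals = Counter()
    for dims, count in cube.items():
        row = dict(zip(DIMENSIONS, dims))
        # Keep only cells matching every filter, like a SQL WHERE clause.
        if all(row[name] == value for name, value in where.items()):
            totals[dims[group_index]] += count
    return dict(totals)


by_industry = query(cube, group_by="industry")
us_only = query(cube, group_by="industry", where={"region": "US"})
print(by_industry)
print(us_only)
```

Because each cube covers only one member and a limited number of dimensions, a scan like this touches very little data per request, which fits the request/response time budget described above.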

Learn more

If you're interested in more detail about how the system works, please see the full paper: Avatara: OLAP for Web-scale Analytical Products.

September 26, 2012

Sharing knowledge is part of our core culture at LinkedIn, whether it’s through hackdays or contributions to open-source projects. We actively participate in academic conferences, such as KDD, SIGIR, RecSys, and CIKM, as well as industry conferences like QCon and Strata.

Beyond sharing our own knowledge, we provide a platform for researchers and practitioners to share their insights with the technical community. We host a Tech Talk series at our Mountain View headquarters that we open up to the general public.

We recently hosted Panos Ipeirotis, a professor at NYU and one of the world’s top experts on crowdsourcing. He talked about “Crowdsourcing: Achieving Data Quality with Imperfect Humans”. If you weren’t able to attend the talk in person or watch the stream, I encourage you to watch the recording, which we’ve posted to our Tech Talks channel on YouTube and embedded above. Panos delivers an insightful and entertaining tour of the benefits and pitfalls of crowdsourcing applied to real-world data products. Enjoy!

September 23, 2012

LinkedIn is an industry leader in the area of recommender systems – a place where big data meets clever algorithms and content meets social. If you’re one of the 175M+ people using LinkedIn, you’ve probably noticed some of our recommendation products, such as People You May Know, Jobs You Might Be Interested In, and LinkedIn Today.

So it’s no surprise that we showed up in force at the 6th ACM International Conference on Recommender Systems (RecSys 2012), which took place in Dublin September 9th-13th.

Here’s what we presented:

All in all, it was an excellent conference. LinkedIn and other industry participants comprised about a third of attendees, and there was a strong conversation bridging the gap between academic research and industry practice.

September 7, 2012

I am glad to announce the availability of the Voldemort 0.96 open source release.

What’s happening

We really appreciate the support, patience, and valuable feedback from the open source community and our users. Since the last Voldemort open source release, the Voldemort team at LinkedIn has been actively working on improving the manageability and the operability of Voldemort clusters. In particular, in order to keep up with Voldemort’s high adoption rate at LinkedIn, one of the primary goals is to be able to operate multi-tenant clusters, thus reducing the total number of clusters that we need to manage.

To simplify resource management in a multi-tenant Voldemort cluster and to provide faster response times, the Voldemort team has been tuning Voldemort servers, especially the BDB JE storage engine, to run on SSDs. However, it has been a long journey for the team, mainly due to the large-scale deployment of Voldemort at LinkedIn, the intricacy of memory management in JVM-based systems, and the subtle issues encountered in the newer versions of BDB JE (4.1.17 and 5). More details on the tuning work can be found in a separate blog post we published here.

Even though our work on performance tuning Voldemort for SSDs continues, we are doing a release now to share our initial improvements and to share other significant enhancements we have made.

About this release

In addition to many small features, enhancements, and bug fixes, this release includes significant improvements in Rebalancing, the Read-only Pipeline, and performance monitoring for both client and server (including extensive stats on the BDB storage engine) through MBeans. Please refer to the release notes for more details, including an overview of the new Rebalancing work and the Read-only Pipeline data fetch optimization.

Next play

Although the Voldemort open source community has been quiet, the Voldemort team at LinkedIn has been actively working to make Voldemort better. Our current development roadmap is centered on how to build and operate Voldemort as a Service (VaaS). The roadmap includes resource management (I/O, memory, network bandwidth and connections, etc.), metadata management, a GUI-based management console (currently with some LinkedIn-specific components), performance lab infrastructure, and a source-agnostic streaming-in platform. These features will be made available to our users through a few open source releases in the next 6 – 12 months.

We would also like feedback from the open source community on how useful it would be to have a consistent event change log from a Voldemort cluster, similar to the redo log of an Oracle instance or the MySQL binary log. (There won’t be any transactional semantics in this log, of course :). The consistent event change log could enable streaming data between Voldemort clusters as well as from a Voldemort cluster to other data sources.

To the open source community

The Voldemort team values the open source community dearly. We would like to fix bugs found by our open source users, hear and discuss suggestions on how to make Voldemort faster or more capable, and have open source contributors more involved in our development process.

To consolidate the open source issue tracking with the actual source code, we are cleaning up and moving tickets from Google Code to GitHub.

At the end of this blog, I’d like to ask the open source community the following questions. You can either directly comment on the blog post or post on the project forum.

  • Besides the features/enhancements mentioned in the “Next play” section, what is the one feature/enhancement you would like to see in Voldemort?
  • How often would you like to see development updates from the Voldemort team for features/enhancements initiated by the team at LinkedIn? How can we improve communication between the Voldemort team at LinkedIn and the open source community?


Lei Gao