We're the Data Team

The data team at LinkedIn works on LinkedIn's information retrieval systems, the social graph system, data driven features, and supporting data infrastructure.

  • team photo
  • team photo
  • team photo
September 26, 2012

Sharing knowledge is part of our core culture at LinkedIn, whether it’s through hackdays or contributions to open-source projects. We actively participate in academic conferences, such as KDDSIGIR, RecSys, and CIKM, as well as industry conferences like QCON and Strata.

Beyond sharing our own knowledge, we provide a platform for researchers and practitioners to share their insights with the technical community. We host an Tech Talk series at our Mountain View headquarters that we open up to the general public.

We recently hosted Panos Ipeirotis, a professor at NYU and one of the world’s top experts on crowdsourcing. He talked about “Crowdsourcing: Achieving Data Quality with Imperfect Humans”. If you weren’t able to attend the talk in person or watch the stream, I encourage you to watch the recording, which we’ve posted to our Tech Talks channel on YouTube and embedded above. Panos delivers an insightful and entertaining tour of the benefits and pitfalls of crowdsourcing applied to real-world data products. Enjoy!

September 23, 2012

LinkedIn is an industry leader in the area of recommender systems – a place where big data meets clever algorithms and content meets social. If you’re one of the 175M+ people using LinkedIn, you’ve probably noticed some of our recommendation products, such People You May KnowJobs You Might Be Interested In, and LinkedIn Today.

So it’s no surprise that we showed up in force at the 6th ACM International Conference on Recommender Systems (RecSys 2012), which took place in Dublin September 9th-13th.

Here’s what we presented:

All in all, it was an excellent conference. LinkedIn and other industry participants comprised about a third of attendees, and there was a strong conversation bridging the gap between academic research and industry practice.

September 7, 2012

I am gladly announcing the availability of Voldemort 0.96 open source release.

What’s happening

We really appreciate the support, patience, and valuable feedback from the open source community and our users. Since the last Voldemort open source release, the Voldemort team at LinkedIn has been actively working on improving the manageability and the operability of Voldemort clusters. In particular, in order to keep up with Voldemort’s high adoption rate at LinkedIn, one of the primary goals is to be able to operate multi-tenant clusters, thus reducing the total number of clusters that we need to manage.

To simplify the resource management in a multi-tenant Voldemort cluster and to provide faster response time, the Voldemort team has been tuning Voldemort servers, especially the BDB JE storage engine, to run on SSD. However, it has been a long journey for the team to tune the system to run on SSD, mainly due to the large scale deployment of Voldemort at LinkedIn, the intricacy of the memory management of the JVM based systems, and the subtle issues encountered in the newer versions of BDB JE (4.1.17 and 5). More details on the tuning work can be found in a separate blog post we published here.

Even though our work on performance tuning Voldemort for SSDs continues, we are doing a release now to share our initial improvements and to share other significant enhancements we have made.

About this release

In addition to many small features, enhancements, and bug fixes, this release includes significant improvements in Rebalancing, Read-only Pipeline, and performance monitoring for both client and server (including extensive stats on the BDB storage engine) through mbeans. Please refer to the release notes for more details, including an overview on the new Rebalancing work and Read-only pipeline data fetch optimization.

Next play

Although the Voldemort open source community has been quiet, the Voldemort team at LinkedIn had been actively working on Voldemort to make it better. Our current development roadmap is centered around how to build and operate Voldemort as a Service, namely VaaS. The roadmap includes the resource management (I/O, memory, network bandwidth and connections, etc), metadata management, GUI-based management console (currently with some LinkedIn specific components), performance lab infrastructure, and source-agnostic streaming-in platform. Those features will be made available to our users through a few open source releases in the next 6 – 12 months.

We also need some feedback from the open source community on how useful to have a consistent event change log from a Voldemort cluster, similar to the redo log of an Oracle instance or the MySql bin log. (There won’t be any transactional semantics in this log, of course :). The consistent event change log could enable streaming data between Voldemort clusters as well as from a Voldemort cluster to other data sources.

To the open source community

The Voldemort team values the open source community dearly. We like to fix bugs found by our open source users; Hear and discuss suggestions on how to improve Voldemort to run faster or be more capable; And have open source contributors to be more involved in our development process.

To consolidate the open source issue tracking with the actual source code, we are cleaning up and moving tickets from google code to github.

At the end of this blog, I’d like to ask the open source community the following questions. You can either directly comment on the blog post or post on the project forum.

  • Besides the features/enhancements mentioned in “Next play” section, what is the one feature/enhancement you like to see in Voldemort?
  • How often would you like to see the development updates from the Voldemort team – for features/enhancements initiated by the team at LinkedIn? How to improve the communication between the Voldemort team at LinkedIn and the open source community?

==============================================================

Lei Gao

August 1, 2012

Welcome to the data team’s new site!

July 19, 2011

I’m thrilled to announce that we have finally cut off a branch and are ready to do the 0.90 open source release for Project Voldemort. For folks still enchanted by the Potter mania, no, I am not talking about Harry’s nemesis. Project Voldemort is an open-source distributed key-value store being used here at LinkedIn and at various other companies. This is one of our biggest open-source releases and contains features that we have worked on and deployed at LinkedIn in the last one year.

Over the past year our user base internally has grown tremendously. About 3 years ago Voldemort was deployed as a small cluster to serve our first product Who viewed my profile. Over time we have grown to multiple clusters, some of which now span over LinkedIn’s two data-centers. At its crux a single cluster serves either one of the two types of applications – those dealing with read-only static data (computed offline in Hadoop after some number crunching) or read-write data. Now with around 200 stores overall in production we have been serving various critical user facing features. Some examples of applications that use us include People you may know, Jobs you may be interested in, Skills, LinkedIn Share Button, Referral Engine, Company Follow, LinkedIn Today and more. With new users ramping in nearly every other week our clusters have collectively started doing over a billion queries per day.

picture-48_01

What’s new?

One of the most important upgrades we have done in production recently has been switching all our clients and servers from the legacy thread-per-socket blocking I/O approach to the new non-blocking implementation which multiplexes using just a fixed number of threads (usually set in proportion to the number of CPU cores on the machine). This is good from an operations perspective on the server because we no longer have to manually keep bumping up the maximum number of threads when new clients are added. From the client’s perspective we now won’t need to worry about thread pool exhaustion due to slow responses from slow servers.

While upgrading the client side logic, we also redesigned the routing layer to model it as a pipelined finite state machine (we call it the pipeline routed store). For example, a put() request on the client is now modeled as a series of following states:

  1. Generate list of nodes to put to
  2. Put sequentially till first success
  3. Put in parallel to the rest of the nodes
  4. Increment vector clock.

Designing every client request as states with transitions makes it easy to extend the pipeline and add new features (states) with minimum hassle. This enabled us to add support for a long-standing feature request – hinted-handoff, an additional consistency mechanism that handles cases of transient failures by using other live nodes as a backup system. Also in preparation for our new datacenter we were able to quickly plug in a new topology aware routing strategy that we call zone aware routing. For this strategy we cluster nodes into logical groups called zones (in our case a zone = data-center). Our routing strategy is then just a simple extension of Amazon Dynamo’s partitioning algorithm with special constraints on how we jump the ring.

The other important feature that we use at LinkedIn is the read-only stores pipeline. Our largest read-only stores cluster powers most of the recommendation features and fetches around 3 TB of data every day from Hadoop. Some of the initial work we did in this area was to make the fetch + swap pipeline more efficient by migrating off the old servlet based tool to a new administrative service based tool. This gave us better visibility into progress of data transfer (fetch phase) along with more control for swaps. Besides optimizing the data-transfer pipeline, we also spent some time iterating on the underlying storage format and data layout. The final storage format that we have come up with has a better memory footprint, supports iterators and has been tweaked for making rebalancing of read-only stores as simple as possible.

Besides working a lot on our Java client, we also updated some clients relying on Voldemort’s server-side routing. In particular the Python and Ruby clients have gone through some iterations. Many of our internal customers use these for quick prototyping of their ideas during our monthly Hackdays. In fact one of the by-products of this inDay has resulted in Voldemort having a nice GUI, more about which you can read here.

What’s next?

With some of the core pieces of the system in place now, our road-map is to pay more emphasis on the operation / management aspects. For example, we want to provide better tools to make ‘migration’ or ‘rebalancing’ of clusters as easy as pressing a button. Doing so would allow us to have larger clusters thereby decreasing operational overhead of maintaining many small clusters. We’re also slowly changing our old philosophy of adding new clusters for every new application and instead plan to add better support for multi-tenancy.

Fork / Download Voldemort, read all about updating to the new version, find an interesting new sub-project, submit a patch and come join the Death Eaters.

{{error}}