Espresso is a horizontally scalable, indexed, timeline-consistent, document-oriented, highly available NoSQL data store. As LinkedIn grows, our requirements for primary source-of-truth data are exceeding the capabilities of a traditional RDBMS. More than a key-value store, Espresso provides consistency, lookups on secondary fields, full-text search, limited transaction support, and the ability to feed a change capture service for easy integration with our other online, nearline, and offline data ecosystems. To support our highly innovative and agile environment, we need on-the-fly schema changes and, for operability, the ability to add capacity incrementally with no downtime.
Espresso is in production today, and we are aggressively migrating many applications to use it as their source of truth. Examples include member-to-member messages, social gestures such as updates and article shares, member profiles, company profiles, news articles, and many more. Espresso is currently the source of truth for tens of terabytes of primary data.
As we support these applications, we are working through many interesting problems, such as consistency/availability tradeoffs, latency optimization, efficient use of system resources, performance benchmarking and optimization, and lots more.
Espresso uses existing technology where it makes sense. For example, we chose MySQL/InnoDB as our storage engine. MySQL provides a customizable framework for replication and a pluggable storage engine, both of which can be replaced for different requirements and workloads. For full-text search, Espresso relies on Lucene. To deliver the change capture functionality that allows our applications to integrate with each other and with our nearline and offline data ecosystems, Espresso uses Databus for replication. For more information, see Databus. Managing a fault-tolerant cluster, including handling failures, load balancing, and scheduling background tasks, is non-trivial, so Espresso uses Helix for cluster management. For more details, see Helix.
In addition to moving applications to Espresso, we expect to release it as an open-source project in 2014. We still have plenty of interesting work to do, such as self-service provisioning, multi-tenancy at scale, integration with Hadoop, more flexible secondary indexing, materialized views, server-side aggregation, and more.