Just came back from the 1st ACM Symposium on Cloud Computing (SOCC) in Indianapolis. The conference was collocated with SIGMOD and lasted a day and a half. A total of 7 people from LinkedIn were at SOCC, and the post below reflects the notes that we took collectively. There were three keynote speeches, all of which were excellent (the slides will be made available on the conference website).
1. Keynote by Jeff Dean from Google:
- Google has already started using flash disks in its clusters.
- Bigtable: it now runs only as a service within Google (i.e., teams no longer install Bigtable themselves). Each SSTable is validated against its checksum immediately after it’s written, which helps detect corruption early; this catches about 1 corruption per 5.4 PB of data. A coprocessor daemon runs on each tablet and splits when the tablet splits. It is used to do some processing over a set of rows, and seems to me like a low-overhead MapReduce job, since there is no cost in starting up mappers and reducers.
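The validate-after-write idea can be illustrated with a minimal sketch (the framing format and function names here are my own assumptions, not Bigtable’s): each block is written together with a checksum, and the writer immediately reads it back through the same checking path, so corruption is caught while the original data is still available to rewrite.

```python
import hashlib


def write_block(f, data: bytes) -> int:
    """Append a length-prefixed data block followed by its checksum.

    Returns the block's offset so it can be read back for validation.
    """
    offset = f.tell()
    f.write(len(data).to_bytes(4, "big"))
    f.write(data)
    f.write(hashlib.sha256(data).digest())  # 32-byte checksum trailer
    return offset


def read_block(f, offset: int) -> bytes:
    """Read a block and verify it against the stored checksum."""
    f.seek(offset)
    size = int.from_bytes(f.read(4), "big")
    data = f.read(size)
    stored = f.read(32)
    if hashlib.sha256(data).digest() != stored:
        raise IOError("checksum mismatch: corruption detected")
    return data
```

A writer that calls `read_block` right after `write_block` turns silent disk corruption into an immediate, recoverable error instead of a failure discovered much later.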
- Google is working on Spanner, a virtualized storage service that spans data centers. We didn’t get many details; the key point seems to be the ability to move storage across data centers.
- A few key design patterns that have worked well in Google’s infrastructure: (1) 1 master/1000 workers, which simplifies the design; (2) canary requests: to avoid an unknown type of request bringing down every worker, first send the request to 1 worker, and only if that succeeds send it to everyone; (3) distributing requests through a tree of nodes instead of broadcasting directly; (4) backup requests to improve the performance of stragglers; (5) multiple small units per machine for better load balancing (no need to split a unit; each unit is moved as a whole); (6) range partitioning instead of hash partitioning.
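The canary-request pattern (2) is simple enough to sketch. This is my own simplified, sequential rendition, not Google’s code: one worker absorbs a potentially poisonous request, and the fan-out to the rest happens only if the canary survives.

```python
def broadcast_with_canary(request, workers, handle):
    """Send `request` to one canary worker first; fan out to the rest
    only if the canary handles it without crashing.

    `handle(worker, request)` is whatever RPC the caller uses; if the
    request is a worker-killer, only the canary is lost.
    """
    canary, rest = workers[0], workers[1:]
    results = [handle(canary, request)]   # canary call: may raise
    # Canary succeeded, so the request is presumed safe to broadcast.
    results.extend(handle(w, request) for w in rest)
    return results
```

In a real system the fan-out would be parallel (and likely tree-shaped, per pattern (3)); the point here is only the ordering: one worker first, everyone else after.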
2. Keynote from Facebook:
- Facebook has 8 data centers.
- The core infrastructure is based on sharded MySQL. The biggest pain is that logical migration is hard (it requires splitting a database). The solution is to over-partition: keep multiple databases per node and only ever move a whole database (similar to the design pattern used at Google). They find a 3-to-1 file size/RAM ratio ideal for MySQL (with InnoDB); if the ratio is larger, MySQL’s performance drops significantly. There are no distributed transactions across partitions; multi-node updates (e.g., two people becoming friends) are delivered with best effort.
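The over-partitioning idea is worth spelling out. A minimal sketch (class and method names are mine, purely illustrative): many small logical databases are mapped onto a few physical nodes, so rebalancing means reassigning a whole logical database rather than splitting one in place, and keys never change which logical database they belong to.

```python
class ShardedCluster:
    """Over-partitioned cluster: many logical databases, few nodes.

    Migration moves an entire logical database to another node; the
    key -> logical-database mapping is fixed, so no split is needed.
    """

    def __init__(self, num_shards, nodes):
        self.nodes = list(nodes)
        # Round-robin initial placement of logical shards onto nodes.
        self.placement = {s: self.nodes[s % len(self.nodes)]
                          for s in range(num_shards)}

    def node_for_key(self, key):
        # The shard for a key never changes; only its hosting node can.
        shard = hash(key) % len(self.placement)
        return self.placement[shard]

    def migrate(self, shard, new_node):
        # Move the whole logical database; routing updates atomically.
        self.placement[shard] = new_node
```

With, say, 8 logical shards on 2 nodes, adding a third node just means reassigning a few entries of `placement` and copying those databases wholesale.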
- Heavy use of memcache, with multiple replicas maintained. They want to solve the problem of double caching between memcache and MySQL (the problem being that you can’t take away too much memory from MySQL, even though MySQL mostly handles the write load). Memcache is partitioned differently from MySQL, so that if a memcache server goes down, the read requests it was serving are spread across all MySQL shards.
- Facebook is building an FB object/association model and a system called TAO on top of memcache. TAO is API-aware and supports write-through on updates (instead of an invalidation followed by a read).
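The difference between write-through and invalidate-on-write is easy to show in miniature (this is a generic sketch of the two cache policies, not TAO’s actual design): write-through keeps the cache warm after an update, while invalidation forces the next reader to pay for a database round trip.

```python
class WriteThroughCache:
    """On update, refresh the cached value directly."""

    def __init__(self, db):
        self.db = db          # backing store (a dict stands in for MySQL)
        self.cache = {}

    def read(self, key):
        if key not in self.cache:        # miss: one DB read, then cache
            self.cache[key] = self.db[key]
        return self.cache[key]

    def write(self, key, value):
        self.db[key] = value
        self.cache[key] = value          # write-through: stays cached


class InvalidatingCache(WriteThroughCache):
    """On update, drop the cached value; next read misses."""

    def write(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)        # invalidate instead of refresh
```

For read-heavy workloads like a social graph, avoiding that extra miss-and-refetch after every update is exactly the point of making the cache API-aware.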
- Facebook has several specialized services: search, ads, PYMK (People You May Know), and multifeed. It uses the search engine to serve complex “search-like” queries over structured data.
- Facebook is building Unicorn, which seems to be an event publishing system. Today, it receives batch updates from Hive. It’s moving towards more real-time operation by taking SQL updates and applying them directly in Unicorn.
3. Keynote from Salesforce:
- The most amazing thing to me: there are 4,000 people at Salesforce and only 200 of them are engineers (the rest are in sales and marketing).
- It uses the Resin application server, Lucene, and (only) an 8-way Oracle RAC.
- Dell and Harrah’s are among the big customers.
- It uses an Apex governor for service protection, to prevent a particular tenant from using too many resources. Apex limits things like heap and stack size.
- Flex schema: everything is stored as varchar, with separate accelerator tables (indexes) that carry the real data types.
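The flex-schema idea can be sketched with SQLite (table and column names here are my own invention, not Salesforce’s actual schema): tenant data lives in a wide all-varchar table, while a typed “accelerator” table holds a copy of one field so that range queries and indexing work on real types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Flex schema: every custom field is stored as text in one wide table.
conn.execute("""CREATE TABLE data (
    tenant_id TEXT, row_id TEXT, val0 TEXT, val1 TEXT)""")

# Accelerator: a typed copy of one field, acting as a secondary index.
conn.execute("""CREATE TABLE accel_num (
    tenant_id TEXT, row_id TEXT, num_value REAL)""")
conn.execute("CREATE INDEX accel_num_idx ON accel_num(tenant_id, num_value)")


def insert(tenant, row, name, amount):
    conn.execute("INSERT INTO data VALUES (?, ?, ?, ?)",
                 (tenant, row, name, str(amount)))   # everything varchar
    conn.execute("INSERT INTO accel_num VALUES (?, ?, ?)",
                 (tenant, row, float(amount)))       # typed for range scans


insert("t1", "r1", "widget", 9.5)
insert("t1", "r2", "gadget", 120.0)

# A range query filters on the typed accelerator, then joins back.
rows = conn.execute("""
    SELECT d.val0 FROM accel_num a
    JOIN data d ON d.tenant_id = a.tenant_id AND d.row_id = a.row_id
    WHERE a.tenant_id = 't1' AND a.num_value > 100""").fetchall()
```

Without the accelerator, `val1 > '100'` would compare strings ('9.5' > '100' lexicographically), which is exactly the pitfall a typed index table avoids.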
- It serves both OLTP and reporting on the same database.
- It uses omniscient tracing for debugging: logs are collected on the operational system so that one can debug locally.
- It uses Chatter for real-time collaboration.
- Comet: Batched Stream Processing for Data Intensive Distributed Computing. This paper is about sharing work through multi-query optimization. For example, if you schedule a weekly job and a daily job on the same data, the weekly results can be derived from the daily jobs. How to share and what to share is determined through static analysis at a planning phase. The system is built on top of Microsoft’s Dryad.
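The weekly-from-daily example can be made concrete with a toy aggregate (my own illustration of the sharing idea, not Comet’s actual plan language): the weekly job sums the daily job’s outputs instead of rescanning the raw events.

```python
from collections import defaultdict


def daily_totals(events):
    """Daily job: aggregate raw (day, value) events into per-day totals."""
    totals = defaultdict(int)
    for day, value in events:
        totals[day] += value
    return dict(totals)


def weekly_total_from_daily(daily, days):
    """Weekly job: reuse the daily job's output rather than the raw
    events -- the work-sharing opportunity Comet detects at plan time."""
    return sum(daily.get(d, 0) for d in days)
```

The sharing is only valid because summation is decomposable over the daily partitions; identifying which sub-computations have this property is what the planning-phase static analysis decides.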
- Stateful Bulk Processing for Incremental Analytics. The motivating example is how to do incremental crawling.
- Towards Automatic Optimization of MapReduce Programs. This is a position paper and it argues that many database query optimization techniques (both static and dynamic) can be applied to MapReduce. Actually, some of those optimizations have already been applied in systems like Scope.
- Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing. This is an alternative parallel computing engine to MapReduce. It offers many operators (e.g., join, filter, aggregation) for constructing a data flow, instead of just a map and a reduce operator.
There are several papers related to key-value stores (aka NoSQL databases).
- The Case for PIQL: A Performance Insightful Query Language. It describes a declarative query language for querying key-value stores. The system automatically maintains and selects secondary indexes.
- Benchmarking Cloud Serving Systems with YCSB. It’s a benchmark for comparing various key-value stores (e.g., HBase, Cassandra, sharded MySQL, and PNUTS). The benchmark currently focuses on performance comparison.
- G-Store: A Scalable Data Store for Transactional Multi Key Access in the Cloud. This paper adds multi-row transaction support on top of a key-value store that only supports single-row transactions. The approach is similar to Google’s Megastore. The prototype is built on top of HBase, an open source implementation of Bigtable.
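The key-grouping idea in G-Store can be sketched in a few lines (a much-simplified, in-memory model of my own; the real protocol handles distribution, failures, and log shipping): keys temporarily join a group owned by one leader, so a multi-key update becomes a sequence of ordinary single-row writes applied by that single owner.

```python
class KeyGroupStore:
    """Sketch of key grouping over a single-row store.

    While keys are grouped under one leader, the leader alone may
    update them, giving multi-key atomicity without cross-node
    transactions in the underlying store.
    """

    def __init__(self):
        self.rows = {}    # the underlying single-row key-value store
        self.owner = {}   # key -> leader key, while the group exists

    def form_group(self, leader, keys):
        for k in keys:
            if k in self.owner:
                raise ValueError(f"{k} already belongs to a group")
        for k in keys:
            self.owner[k] = leader

    def multi_put(self, leader, updates):
        # Every key must be owned by this leader; the writes are then
        # plain single-row puts executed at one place.
        if not all(self.owner.get(k) == leader for k in updates):
            raise PermissionError("leader does not own all keys")
        for k, v in updates.items():
            self.rows[k] = v

    def disband(self, leader):
        for k in [k for k, o in self.owner.items() if o == leader]:
            del self.owner[k]
```

The trade-off is that group formation and disbanding cost extra round trips, which pays off when a set of keys is accessed together repeatedly (e.g., items in one auction).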
Some other papers that I took notes of.
- An Operating System for Multicore and Clouds: Mechanisms and Implementation. This is a paper from MIT on building a virtual cloud OS (called fos) across multiple cores and multiple machines. There are definitely challenges in hiding the latency across machines.
- Hermes: Clustering Users in Large-Scale E-Mail Services. This is a paper about clustering users based on their email exchanges. The motivating application is to save space: if users who exchange mail are collocated on the same partition of the backend email server, a message they share need only be stored once.
- Automated Software Testing as a Service. It’s about using the cloud for testing software.