June 16, 2010

Just came back from the 1st ACM Symposium on Cloud Computing at Indianapolis. The conference is collocated with Sigmod and lasts a day and half. A total of 7 people from LinkedIn were at SOCC and the blog below reflects the notes that we took collectively. There were three keynote speeches, all of which are excellent (the slides will be made available at the conference website).

1. Keynote by Jeff Dean from Google:

  • Google already started using flash disks in their clusters.
  • Bigtable : It now only runs as a service within google (i.e., one doesn’t install Bigtable himself any more); Each SSTable is validated against the checksum immediately after it’s written, which helps detecting corruption early. This catches 1 corruption/5.4PB data; A coprocessor daemon runs on each tablet and it splits as a tablet gets split. This is used to do some processing on a set of rows and seems to me like a low-overhead MapReduce job since there is no overhead in starting the mappers and reducers.
  • Google is  working on Spanner, a virtualized storage service across data centers. Didn’t get too much details. The key point seems to be the capability of moving storage across data centers.
  • A few key design patterns that worked well in google’s infrastructure: (1) 1 master/1000 workers, simplified design; (2) canary requests: to void an unknown type of request that brings down every worker, first send the request to 1 worker. If successful, send it to everyone; (3) distributing requests through a tree of nodes, instead of direct broadcasting; (4) backup requests to improve the performance of stragglers; (5) multiple small units per machine for better load balancing (don’t need to split a unit and move each unit as a whole); (6) range partitioning instead of hash
2. Keynote by Jason Sobel from Facebook:
  • Facebook has 8 data centers.
  • Core infrastructure based on sharded MySQL. The biggest pain is that it’s hard to do logical migration (need to split a database). The solution is to over-partition and have multiple databases per node and only move a whole database (similar to the design pattern used at Google). They find that a 3-to-1 file size/RAM ratio ideal for MySQL (with innodb). If the ratio is larger, MySQL’s performance drops significantly. No distributed transactions across partitions. Multi-node updates (e.g., two people becoming friends) are delivered with best effort.
  • Heavy use of memcache. Maintain multiple replicas. Want to solve the problem of double caching between memcache and MySQL (the problem is that you can’t take away too much memory from MySQL even though MySQL mostly handles the write load). Memcache is partitioned differently from MySQL. That way, if a memcache server goes down, the read requests on that server are spread across on all MySQL shards.
  • Facebook is building FB object/association and a system called TAO on top of memcache. TAO is API-aware and supports write-through on updates (instead of an invalidation followed by a read).
  • Facebook has several specialized services: search, ads, PYMK, and multifeed. It uses the search engine to serve complex “search-like” queries over structured data.
  • Facebook is building Unicorn. Seems like an event publishing system. Today, it receives batch updates from Hive. It’s moving towards more real-time by taking SQL updates and apply them directly in Unicorn.
3. Keynote by Rob Woollen from SalesForce:
  • Most amazing thing to me: there are 4000 people at SalesForce and only 200 of them are engineers (the rest are sales and marketing people).
  • It’s using Resin App Server, Lucene, and (only) an 8-way Oracle RAC.
  • Dell and Harah’s are among the big customers
  • It uses Apex Governor for service protection to prevent a particular tenant from using too much resource. Apex limits things like heap and stack size.
  • Flex schema: everything varchar; separate tables for accelerator (indexes), with data types.
  • It serves both OLTP and reporting on the same database.
  • It uses Ominiscent tracing for debugging : collects log on an operational system so that one can debug locally.
  • It uses Chatter for real-time collaboration.
The papers are kind of mixed, likely because this is the very first symposium and nobody knows exactly what kind of papers really fit in. First of all, there are a bunch of papers on MapReduce related stuff.

There are several papers related to key-value stores (aka NoSQL databases).

Some other papers that I took notes of.