June 17, 2009

A well-known lesson in scalability is that writes are roughly 40x more expensive than reads, and if your application becomes write-intensive, as easily happens once you are dealing with a sufficiently large number of users, you will be in trouble unless you design for scale. For example, if you are using MySQL, you will most likely follow the conventional scaling path of introducing one of many replication schemes, such as a master server with several slave servers: all writes go to a single master, which then replicates out to the read-only slaves. This works well until you exceed the maximum number of writes any one server can handle, and replication lag rears its ugly head. If we ignore the specifics and brainstorm for a bit, we might conclude that updating databases while they are online can be detrimental to the service’s performance, and that we should instead batch these updates and deliver them in bulk to the slaves. This is exactly what the team behind Project Voldemort has done, in addition to the default read-write store available today: a read-only store that lets you build its index and data files in an offline system (like Hadoop) and, once they are built, provides a mechanism for swapping the current store for the fresher version.

At Lookery, we have been working very hard to transform most of our data processing tasks into batch-oriented workflows in order to cope with growth. For example, we were already using Hadoop to compute the index and data files for our largest database, but serving that information took too many network hops (load balancers, reverse proxies and Amazon S3). So as soon as I learned that Project Voldemort supported offline building of distributed stores, I decided to try it, and we are now running it in production. The rest of this post is a walkthrough of building Voldemort read-only index and data files using Hadoop. The goal is to deploy a Voldemort cluster storing words and their counts as calculated by the canonical Hadoop example: WordCount.

Step 1: Build (or download) Voldemort:

    > git clone git://github.com/voldemort/voldemort.git
    > cd voldemort
    > ant

Step 2: Configure Cluster

Now let’s create a single-node cluster for our wordcounts store. Create a directory for the node and place three files inside a ‘config’ subdirectory: server.properties, cluster.xml and stores.xml. The listings below are minimal examples reconstructed for this walkthrough; element names and defaults can vary between Voldemort versions, so compare them against the sample configuration shipped in the source tree.

server.properties identifies the node:

    # The ID of *this* particular cluster node
    node.id=0

cluster.xml describes the (single-node) cluster and its partitions:

    <cluster>
        <name>wordcount-cluster</name>
        <server>
            <id>0</id>
            <host>localhost</host>
            <http-port>8081</http-port>
            <socket-port>6666</socket-port>
            <partitions>0, 1</partitions>
        </server>
    </cluster>

stores.xml declares the read-only store with a string key and an integer value, matching the mapper we will write in Step 4:

    <stores>
        <store>
            <name>wordcounts</name>
            <persistence>read-only</persistence>
            <routing>client</routing>
            <replication-factor>1</replication-factor>
            <required-reads>1</required-reads>
            <required-writes>1</required-writes>
            <key-serializer>
                <type>json</type>
                <schema-info>"string"</schema-info>
            </key-serializer>
            <value-serializer>
                <type>json</type>
                <schema-info>"int32"</schema-info>
            </value-serializer>
        </store>
    </stores>
Let’s now make sure that everything is working as expected. Voldemort will automatically create a blank read-only store in a subdirectory called data if an existing one is not found.

    > $VOLDEMORT_HOME/bin/voldemort-server.sh . &> /tmp/voldemort.log &
    > $VOLDEMORT_HOME/bin/voldemort-shell.sh wordcounts tcp://localhost:6666
    Established connection to wordcounts via tcp://localhost:6666
    > get "voldemort"

Step 3: Run WordCount to turn your input data into word and count records.

Before you continue, you must already have Hadoop configured in either pseudo-distributed or fully-distributed mode. Unfortunately, I’m going to punt on this part of the setup because there’s plenty of good documentation on the Hadoop wiki itself. Once you have your cluster running, let’s proceed to run the wordcount example.

    > bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>
    > bin/hadoop jar hadoop-*-examples.jar wordcount [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>

We now have tab-separated records of the form “<word>TAB<count>” stored in the <out-dir> directory.
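To make the record format concrete, here is a tiny, self-contained sketch of the split we will perform on each line (the helper class name is hypothetical and purely illustrative; the real work happens inside the Hadoop mapper in the next step):

```java
// Hypothetical helper, for illustration only: splits one WordCount output
// record of the form "<word>TAB<count>" into its two fields.
public class WordCountRecord {

    // Returns the word, i.e. the field before the TAB.
    static String word(String line) {
        return line.split("\t")[0];
    }

    // Returns the count, i.e. the field after the TAB, parsed as an int.
    static int count(String line) {
        return Integer.parseInt(line.split("\t")[1]);
    }

    public static void main(String[] args) {
        String record = "voldemort\t42";  // one example record
        System.out.println(word(record) + " -> " + count(record));  // prints "voldemort -> 42"
    }
}
```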

Step 4: Use your favorite Java IDE/Editor to implement your Hadoop mapper

Project Voldemort provides most of the code necessary to build the store. The only missing part is how to extract the key and value objects from your input files. Since our WordCount example outputs data in a simple key/value format, our mapper only needs to split each line on the TAB character and return a String object for the key and an Integer object for the value, just as declared in our store definition inside stores.xml. You will need to add the Voldemort jars and their dependencies to your CLASSPATH for this to compile. When you are done, export this class into a jar called wordcount-mapper.jar and place it inside a ‘lib’ directory alongside your ‘config’ directory.

    package com.lookery;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    import voldemort.store.readonly.mr.AbstractHadoopStoreBuilderMapper;

    public class HadoopStoreMapper extends AbstractHadoopStoreBuilderMapper<LongWritable, Text> {

        // The key is the word: everything before the TAB.
        @Override
        public Object makeKey(LongWritable key, Text value) {
            return value.toString().split("\t")[0];
        }

        // The value is the count: everything after the TAB, parsed as an Integer.
        @Override
        public Object makeValue(LongWritable key, Text value) {
            return Integer.parseInt(value.toString().split("\t")[1]);
        }
    }

Step 5: Build your read-only store from the command-line

The next step is to call Project Voldemort’s build script as shown in the example below. Most of the parameters are self-explanatory; if not, you can always call the script with ‘--help’ for full details on every available parameter. A couple are worth explaining here. First, the chunk size. The main reason for building large read-only stores with Hadoop is its ability to run tasks in parallel, so Project Voldemort builds the index and data files for each node in chunks, letting you maximize the number of reducers your Hadoop cluster dedicates to the build. You should tune the chunk size until the number of chunks lands within a small multiple of your maximum reducer availability; in this case, I set it to 1 gigabyte. The second option worth mentioning is the replication factor. This setting adds redundancy to your storage so the cluster can keep serving values even after some node failures. Please see the rest of the documentation to choose the best replication factor for your application scenario.

    > $VOLDEMORT_HOME/bin/hadoop-build-readonly-store.sh --input <input_dir> \
        --output wordcounts --tmpdir tmp-build --mapper com.lookery.HadoopStoreMapper \
        --jar lib/wordcount-mapper.jar --cluster config/cluster.xml \
        --storename wordcounts --storedefinitions config/stores.xml \
        --chunksize 1073741824 --replication 2
    09/06/17 20:24:38 INFO mr.HadoopStoreBuilder: Data size = 17934, ...
    09/06/17 20:24:38 INFO mr.HadoopStoreBuilder: Number of reduces: 1
    09/06/17 20:24:38 INFO mr.HadoopStoreBuilder: Building store...
    09/06/17 20:24:38 INFO mapred.FileInputFormat: Total input paths to process : 1
    09/06/17 20:24:38 INFO mapred.FileInputFormat: Total input paths to process : 1
    09/06/17 20:24:38 INFO mapred.JobClient: Running job: job_200906171849_0002
    09/06/17 20:24:39 INFO mapred.JobClient:  map 0% reduce 0%
    09/06/17 20:24:44 INFO mapred.JobClient:  map 100% reduce 0%
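As a back-of-the-envelope check on the chunk-size choice, the sketch below illustrates the relationship between data size, chunk size and parallelism: each node’s data is cut into roughly ceil(bytesPerNode / chunkSize) chunks, and each chunk gives Hadoop a unit of reduce work. This is only an illustration, not Voldemort’s exact internal formula, but it is consistent with the log above, where 17934 bytes and a 1 GiB chunk size yield a single reduce:

```java
// Rough model (not Voldemort's exact formula) of how chunk size bounds the
// parallelism of the store build: smaller chunks mean more chunks, and
// therefore more reducers the build can keep busy.
public class ChunkMath {

    // Ceiling division, with a floor of one chunk per node.
    static long chunksPerNode(long bytesPerNode, long chunkSize) {
        return Math.max(1, (bytesPerNode + chunkSize - 1) / chunkSize);
    }

    public static void main(String[] args) {
        long chunkSize = 1073741824L;  // 1 GiB, as passed via --chunksize above
        long bytesPerNode = 17934L;    // the tiny demo data set from the log
        // With so little data, everything fits in a single chunk/reducer:
        System.out.println(chunksPerNode(bytesPerNode, chunkSize));  // prints 1
    }
}
```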

Step 6: Ask the cluster to fetch the newly built read-only store from Hadoop

Now the easiest part: ask each node in the cluster to fetch its index and data chunks, and only after all fetches have succeeded do the nodes atomically swap their stores to the latest version. This has to be one of the coolest features of the read-only store, and we hope you like it too. A couple of parameters in the server.properties file can further enhance this capability. First, you can specify a temporary folder for downloading the files from Hadoop HDFS (hdfs.fetcher.tmp.dir). It is important that the read-only data store and this temporary folder live on the same device, both to avoid an extra copy of the downloaded files and to make the swap truly atomic, as far as the underlying filesystem permits. The second option (fetcher.max.bytes.per.sec) tells the HdfsFetcher to throttle its download rate so the fetch does not interfere with online requests from Voldemort clients. There are also other options for the swap-store.sh command, such as a timeout that helps when downloads to the cluster nodes take a long time. Other than that, it should be straightforward to update your online read-only store with this command.

    > $VOLDEMORT_HOME/bin/swap-store.sh --cluster config/cluster.xml \
        --file hdfs://localhost:54310/user/${user.name}/wordcounts --name wordcounts
    [2009-06-17 23:35:23,946] INFO Invoking fetch for node 0 for hdfs://localhost:54310/user/elias/node-0
    [2009-06-17 23:35:24,045] INFO Fetch succeeded on node 0 (voldemort.store.readonly.StoreSwapper)
    [2009-06-17 23:35:24,046] INFO Attempting swap for node 0 dir = /tmp/hdfs-fetcher/hdfs-fetcher/node-0
    [2009-06-17 23:35:24,060] INFO Swap succeeded for node 0 (voldemort.store.readonly.StoreSwapper)
    [2009-06-17 23:35:24,060] INFO Swap succeeded on all nodes in 0 seconds.
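The same-device advice above comes down to rename semantics. The following self-contained sketch (illustrative only, not Voldemort’s fetcher code) shows the publish-by-rename pattern: when source and target share a filesystem, the final step is a single rename, which is atomic on POSIX filesystems and replaces the existing target; across devices it would degrade to a copy.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustration of why hdfs.fetcher.tmp.dir should sit on the same filesystem
// as the read-only store: publishing a new version is then one atomic rename.
public class AtomicSwapSketch {

    // Publish the freshly fetched version over the live one in a single
    // rename (atomic on POSIX, where rename() replaces an existing target).
    static void swap(Path fetched, Path live) throws IOException {
        Files.move(fetched, live, StandardCopyOption.ATOMIC_MOVE);
    }

    // Tiny end-to-end demo: returns the live content after the swap.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("swap-demo");
            Path live = Files.write(dir.resolve("store"), "old".getBytes());
            Path fetched = Files.write(dir.resolve("store.new"), "new".getBytes());
            swap(fetched, live);  // readers now see the new version
            return new String(Files.readAllBytes(live));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());  // prints "new"
    }
}
```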

Step 7: Verify that your data was loaded successfully

    > bin/voldemort-shell.sh wordcounts tcp://localhost:6666
    Established connection to wordcounts via tcp://localhost:6666
    > get "voldemort"
    version(): 2

We have been running this setup at Lookery for almost a month now and have been very pleased with the results. It gives us a Voldemort cluster that is continuously refreshed with new data, without manual intervention. If you have any corrections or suggestions on how we can improve either the batch indexer or this tutorial, please don’t hesitate to email the mailing list.