June 16, 2010

Quick, what is the fastest way to search a sorted array?

Binary search, right?

Wrong. There is actually a method called interpolation search, in which, rather than pessimistically looking in the middle of the array, you use a model of the key distribution to predict the location of the key and look there.

Here is a simple example: assume you have an array of length 10, which contains a uniform sample of numbers between 0 and 99. Interpolation search works the same way a person would search. If asked to guess the location of 3 you would probably guess the 1st slot, if asked to guess the location of 85 you might guess the 9th slot, etc.

Okay, so this approach seems to make better guesses, but does it actually require asymptotically fewer iterations on average?

The answer is that in fact it requires exponentially fewer iterations–it runs in lg lg N time where N is the length of the array. The analysis is a little tricky–it appeared 19 years after the algorithm was formally published (according to the Knuth Search and Sorting book). I had to look it up (and before I could look it up I first had to realize I didn’t invent it and figure out what the name of it was), but essentially each step reduces the range to N^0.5 instead of 0.5 * N which yields the better asymptotic runtime. This is an average case result, so it is worth noting that the variance in the number of comparisons is lg lg N as well. This means that assuming your keys are well distributed you will almost certainly get the average case time or very close to it (I tried it on random arrays and the theory is scarily accurate).

So why isn’t this used in practice? Probably because lg N is already really small. After all, if you have an array of length 2^32 this only drops you from ~32 to ~5 comparisons which in practical terms probably isn’t a big speed up for searching arrays.

But we found a really great use for it in Voldemort. One use case we support is serving really large read-only files as Voldemort stores. This allows us to support a big batch datacycle run out of Hadoop as described here. The data structure for these uses a large sorted index file to do lookups, what is stored in this file is an MD5 of the key. Since the MD5s are used for the sort, the file is guaranteed to be uniformly distributed over the key space, and can often be many GBs in size. These files are memory mapped to help reduce the cost of a read, but the improved search algorithm can help to greatly reduce the number of seeks when the index is not fully memory resident. A disk seek comes at a price of around 10ms, so saving even one or two is a huge performance win.

Sometimes it is nice to see what these things are like in real life.  For uniformly distributed values, given a key to search for, an array to search in, and a minimum and maximum value it might look something like this:

int interpolationSearch(int key, int[] array, int min, int max) {
    int low = 0;
    int high = array.length - 1;
    while(true) {
        if(low > high || key < min || key > max)
            return -1;
        // make a guess of the location
        int guess;
        if(high == low) {
            guess = high;
        } else {
            int size = high - low;
            int offset = (int) (((size - 1) * ((long) key - min)) / (max - min));
            guess = low + offset;
        // maybe we found it?
        if(array[guess] == key)
            return guess;
        // if we didn't find it and we are out of space to look, give up
        if(guess == 0 || guess == array.length - 1)
            return -1;
        // if we guessed to high, guess lower or vice versa
        if(array[guess] > key) {
            high = guess - 1;
            max = array[guess-1];
        } else {
            low = guess + 1;
            min = array[guess + 1];

You can see the real deal implementation in Voldemort here–it is a little trickier as it uses non-integer keys and is searching a file but the basic outline is the same.