Kamikaze is a utility package for compressing sorted integer arrays and performing highly efficient operations on the compressed arrays. It represents each compressed array as an integer set called a docIdSet (a concept similar to the one used in Lucene). Kamikaze achieves very fast decompression with a decent compression ratio on sorted arrays. It can efficiently find the intersection or the union of N docIdSets, quickly test whether a given integer is present in a docIdSet, and so on.
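To make the idea concrete, here is a minimal sketch of why sorted integer arrays compress well and how set operations work on them. It is not Kamikaze's API or its actual compression scheme: the class `DeltaDocIdSketch` and its methods are hypothetical names chosen for illustration, and the encoding shown is plain gap (delta) encoding with variable-length bytes, assuming strictly increasing, non-negative ids.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Illustrative only: hypothetical class, not part of Kamikaze.
class DeltaDocIdSketch {

    // Store each id as the gap to the previous id, 7 bits per byte.
    // Assumes sortedIds is strictly increasing and non-negative.
    static byte[] compress(int[] sortedIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : sortedIds) {
            int gap = id - prev;                 // small gaps need fewer bytes
            while (gap >= 0x80) {
                out.write((gap & 0x7F) | 0x80);  // low 7 bits, continuation bit set
                gap >>>= 7;
            }
            out.write(gap);                      // last byte, continuation bit clear
            prev = id;
        }
        return out.toByteArray();
    }

    // Rebuild the original sorted array by summing the decoded gaps.
    static int[] decompress(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if ((b & 0x80) == 0) count++;        // one terminating byte per id
        }
        int[] ids = new int[count];
        int pos = 0, prev = 0;
        for (int i = 0; i < count; i++) {
            int gap = 0, shift = 0;
            byte b;
            do {
                b = data[pos++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += gap;
            ids[i] = prev;
        }
        return ids;
    }

    // Membership test that scans the compressed bytes and stops early
    // once the running value passes the target.
    static boolean contains(byte[] data, int target) {
        int pos = 0, value = 0;
        while (pos < data.length) {
            int gap = 0, shift = 0;
            byte b;
            do {
                b = data[pos++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            value += gap;
            if (value == target) return true;
            if (value > target) return false;
        }
        return false;
    }

    // Linear merge-style intersection of two sorted arrays.
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[n++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] first = {3, 7, 8, 150, 1000};
        int[] second = {7, 150, 999, 1000};
        byte[] packed = compress(first);
        System.out.println(packed.length + " bytes instead of " + 4 * first.length);
        System.out.println(contains(packed, 150));
        System.out.println(Arrays.toString(intersect(decompress(packed), second)));
    }
}
```

Because neighbouring ids in a sorted array tend to be close together, most gaps fit in a single byte, which is where the space saving comes from. A production library would typically use a block-based scheme and skip structures to avoid decoding everything during intersection; this sketch only shows the basic principle.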
Traditionally, compression techniques have been used to save storage space on disk. More interestingly, in large-scale distributed systems they can also reduce the high cost of I/O and network traffic. Various compression techniques for sorted integer arrays are widely used in commercial search engines such as Google and Yahoo!, and in the open-source search engine Lucene. Such large-scale systems have shown that compression can significantly improve overall system performance, even though it introduces the additional CPU cost of decompressing the data.
At LinkedIn, Kamikaze has been used by the distributed graph team and the search team to represent over 100 million members.