July 3, 2010

Faceted search has been fully rolled out late last year, we wanted to give you some insights into how it came to be, some of its challenges and what is in the future.

faceted search

At scale and with relevance, faceted search makes a lot of sense on the rich structured data we have here at LinkedIn.  The fundamental paradigm was to provide individuals with an easy and natural way to slice and dice through search results or simply content.  We thought that if the content being indexed had some structural dimensions (say like in eCommerce) or could be augmented by implicitly deriving these dimensions with quality, then a Faceted search paradigm would be ideal not only for retrieval but also for Navigation and Discovery.  At Linkedin since a member profile does have these rich structural dimensions, along with rich text data, it seemed that it would be only a matter of time to create such an interface.

At Linkedin we had first developed a prototype exhibiting the traditional pattern; a click on a facet value would be similar to a filtering of search results through that value.  More explicitly, you would search for “John” and later clicked on the “San Francisco” facet value to get only people in San Francisco called John, i.e. “John” + facet_value(“San Francisco”) = “John AND location:(San Francisco)”.  Since a facet value would be displayed only if it contained results, it would never lead you to a dead end while you were navigating through results.

Our next steps were to bring it in front of users through a very nice useability lab (read forthcomming blog).  Feedback was really good with respect to the idea but with respect to facet interactions, users really wanted the ability to select multiple values within a facet.  More explicitly, the ability through facets to look for say “John” in San Francisco or Los Angeles at the same time i.e. “John AND (location:(San Francisco) OR location:(Los Angeles)”.  This with an interface that would naturally provide that behavior and understandable counts.  As you can imagine, we were very excited with the positive feedback.

What we have implemented is essentially a query engine for the following type of query:

SELECT f1,f2…fn FROM members

WHERE c1 AND c2 AND c3..

MATCH (fulltext query, e.g. “java engineer”)

GROUP BY fx,fy,fz…

ORDER BY fa,fb…

LIMIT offset,count

deferring this query to a traditional RDBMS on 10s – 100s millions of rows with sub-second query latency SLA is not feasible. Hence we built a distributed system that handles the above query at internet scale.

The technical challenge is really a caching and counting exercise.  Naively one can view the operation of determining scores for Facet Values as iterating over all search results and tabulating the right Facet Value.  Of course when you have more than a hundred thousand Facet Values spread across tens of Facets, the naive approach doesn’t really scale. With multi-select options counting can also be rather tricky.  More specifically, selected values do not participate in counting their respective facets. A naive solution would be to do N (the number of facets) iterations over the result set for each multi-selectable field. But performance wouldn’t be good and would degrade as we add more such facets. To deal with it, we devised a solution that only iterates the result set once by integrating hit counting and hit validation into a single step.

One item we wanted to highlight in this post is that not all Facets are equal.  If you take a closer look at your results you’d notice that facets can be:

location facet- Static:  A static Facet is ‘Location‘ ; it reflects a static property of the search record and its value is not affect by the searcher neither the search query. In this case it simply counts and tabulates, by location, the number of individuals within the entire search hits.  Furthermore some static facets values are not mutually exclusive.  In the case of ‘previous company’ a member can have multiple values if he had two or more previous employers.

Joined Faceted

- Dynamic: A dynamic Facet is ‘Recently Joined‘ ; it reflects a static property of the search record but its value is affected by the query.  In the case of the “When Joined” facet, it tracks the difference between the time of the query and when the member joined, rounded to the day.  Note here that we are not re-indexing the joined date on a daily basis.

Network Facet

- Personal: A personal Facet is ‘Relationship‘,  it depends on the distance in the social graph between the member found and the searcher.  We are a social network site hence social distance is a very important facet to provide to our users.  Its nature is actually very dynamic as new connections occur every milliseconds which in turn transform the social network of everyone involved to the 3rd degree.  This makes the social graph so dynamic, that we do not store the relationship between a member and a potential searcher within the index!.

To be good citizens of the community, we have contributed the technology to the open-source world:

What is next?  Fundamentally it is all about speed so next play is to make it even faster and more intelligent :-)