As mobile and internet penetration improve worldwide, users from developing regions are increasingly participating in online communities. Several studies of web usage patterns in developing countries indicate that internet users in these areas are highly engaged with online social networking and communication tools, spending a significant portion of their online time on them. These observations have been made across several usage scenarios, ranging from educational institutions in urban India to remote internet access sites in Africa and Latin America. In addition, press releases from social networking sites indicate that users from emerging economies have been driving worldwide membership growth.
In our paper, we provide the first large-scale analysis of social networking usage in developing countries. This analysis is based on profile and activity data from LinkedIn’s 150 million+ strong global membership. LinkedIn has members from every country in the world, including several million in Africa. It also has a strong presence in Asia and Latin America, with countries like India and Brazil among the most active in the world. This gives us a rich set of demographic and activity information for understanding social networking usage in emerging markets.
The analysis is presented in the form of several themes that illustrate the different dimensions of social networking use in emerging markets compared to the developed world. Some of the patterns we find are unique to the developing world, often shaped by economic, social and cultural factors, or the brief history and attributes of internet citizenship for many members in these environments. Other patterns transcend geographic and economic barriers, and derive from basic human social behaviors of sharing, communication and interaction.
The paper describing this work was published at ICTD 2012!
Azarias Reda (PhD '12), University of Michigan
By the time members have scrolled to the bottom of the search page, it’s probably fair to assume they haven’t found what they are looking for, at least not on the first page. This is when search recommendations are most important: they allow people to refine their search or explore alternatives, while increasing overall search engagement. Over the summer, we worked on making search recommendations, presented to the member as “related searches”, more relevant and useful. Using query and search logs, we compute related searches for queries by combining three different techniques: collaborative filtering, agglomerative clustering, and text matching.
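As a rough illustration of the collaborative filtering component (a minimal sketch with hypothetical class and method names, not our production code), queries issued within the same search session can be treated as related, with candidates ranked by co-occurrence count:

```java
import java.util.*;

// Sketch: session-based co-occurrence for related search candidates.
public class RelatedSearches {
    private final Map<String, Map<String, Integer>> cooccur = new HashMap<>();

    // Record one session's queries from the search logs.
    public void addSession(List<String> sessionQueries) {
        for (String q1 : sessionQueries) {
            for (String q2 : sessionQueries) {
                if (!q1.equals(q2)) {
                    cooccur.computeIfAbsent(q1, k -> new HashMap<>())
                           .merge(q2, 1, Integer::sum);
                }
            }
        }
    }

    // Top-k related queries for a given query, by raw co-occurrence count.
    public List<String> relatedTo(String query, int k) {
        Map<String, Integer> counts = cooccur.getOrDefault(query, Map.of());
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }
}
```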
The paper describing this work was published at CIKM 2012!
Azarias Reda (PhD '12) & Yubin Park (PhD '15), University of Michigan & University of Texas, Austin
This summer I will investigate how the Apache Giraph graph processing framework can scale to become practical for LinkedIn’s social graph. Before my arrival in Mountain View, I was able to study the Giraph code and contribute a few accepted patches.
Eli Reisman (BS '13), University of Washington
Log analysis showed that the click-through rate (CTR) on 3rd degree search results was higher than the CTR on out-of-network (OON) results. Therefore, 3rd degree users should be ranked above OON users. However, each person can have millions or tens of millions of 3rd degree connections, which is too expensive to index in memory. Thus, we designed a statistical tool to estimate the likelihood that a person is 3 hops away. We examined 11 features (e.g. the number of connections, the number of common groups, age difference, and geographic location), and were able to achieve a 76% prediction accuracy rate.
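The write-up does not name the model behind the estimator, so the following is only a sketch assuming a logistic regression over the extracted features; the class name and weights are hypothetical:

```java
// Sketch: score a candidate as a likely 3rd degree connection.
// Assumes a logistic regression with weights learned offline.
public class ThirdDegreeScorer {
    private final double[] weights; // one weight per feature
    private final double bias;

    public ThirdDegreeScorer(double[] weights, double bias) {
        this.weights = weights;
        this.bias = bias;
    }

    // features: e.g. connection count, common groups, age difference, ...
    public double probabilityThreeHops(double[] features) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z)); // sigmoid -> likelihood in [0, 1]
    }
}
```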
Xufei Wang (PhD '12), Arizona State University
I worked on two distinct projects: evaluating secondary indexing within ESPRESSO and real-time analytics using a column-oriented storage engine.
The first project compared Sphinx, a high-performance open-source search engine implemented in C++ and well integrated with MySQL, against Lucene, the de facto standard and the incumbent choice for ESPRESSO.
I integrated Sphinx into ESPRESSO, then ran side-by-side benchmarks against Lucene, using Comm (mailbox) data as the representative dataset and workload. The result was a search engine that ran faster at smaller database sizes, but degraded as document count grew, even at levels well below ESPRESSO’s anticipated per-node workload. We determined that this was due to the index size within Sphinx: for our dataset, it was about seven times larger than Lucene’s, which caused significant degradation once the server’s RAM was exhausted.
Chris Beavers (BS '12), Harvey Mudd College
The virtual profiles project aims to design a better representation of member profiles and to implement effective, efficient algorithms that fill in profile fields beyond the information members have already provided.
The first approach is a “value pair-voting algorithm”: its basic idea is to compute a score for each possible value of the target field by evaluating that value’s correlation with each value in the base fields. The second approach we tried is “Bayesian estimation using conditional probabilities”: the main idea is to score each possible value of the target field using the conditional probabilities of the base field values given that value. Our framework is designed to predict any field in a member’s profile, so the choice of field does not limit its applicability.
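As a rough sketch of the second approach, assuming a naive Bayes formulation (the exact scoring function is not spelled out in this post; all names here are hypothetical), each candidate value is scored by its prior times the conditional probabilities of the observed base field values:

```java
import java.util.*;

// Sketch: naive Bayes prediction of a profile field.
// score(v) = log P(v) + sum_i log P(baseValue_i | v)
public class FieldPredictor {
    private final Map<String, Double> prior;                    // P(v)
    private final Map<String, Map<String, Double>> conditional; // P(b | v), keyed by v then b

    public FieldPredictor(Map<String, Double> prior,
                          Map<String, Map<String, Double>> conditional) {
        this.prior = prior;
        this.conditional = conditional;
    }

    public String predict(List<String> baseValues) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : prior.entrySet()) {
            double logScore = Math.log(e.getValue());
            for (String b : baseValues) {
                // Tiny floor probability smooths unseen (value, base) pairs.
                double p = conditional.getOrDefault(e.getKey(), Map.of())
                                      .getOrDefault(b, 1e-6);
                logScore += Math.log(p);
            }
            if (logScore > bestScore) {
                bestScore = logScore;
                best = e.getKey();
            }
        }
        return best;
    }
}
```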
Chenguang Pan (MS '12), University of California, Davis
Our recommendation engine produces a flurry of personalized recommendations for LinkedIn. Systems like ours generally work from first principles and are optimized to show users the most relevant content matching their interests, profiles or preferences. While this serves as a good starting point, a recommender system can become the victim of its own success: users start to experience monotony as the system keeps showing more of the same kind of content they have already seen. I set out to explore how to address this issue using a case study of one of our beloved recommendation products, “Groups You May Like”.
For diversification, we first need to decide which features to diversify recommendations on. For groups, we used two kinds of dimensions:
Extrinsic Features: These are a group’s extrinsic properties, like group activity, trending groups (high user join rates), popular groups in terms of member count, and diverse-interest groups; for example, “TED Talks” is a group that appeals to a more diverse set of people than, say, “Machine Learning Connection.”
Topical Features: These are related to the topics of the group, based on its content or its members’ attributes. We used members’ skills to extract the skills (topics) of a group using mutual information. Additionally, we came up with a Bayesian combination criterion for combining diversity and accuracy metrics, providing a fair tradeoff between showing new, interesting content and keeping it relevant (see the sketch below). The results look promising, and we will be launching this system to production in the coming months.
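The exact form of our Bayesian combination criterion is beyond the scope of this post; as a stand-in, the sketch below shows one standard way to trade relevance against diversity, a greedy MMR-style reranker (all names are hypothetical):

```java
import java.util.*;
import java.util.function.BiFunction;

// Sketch: greedy reranking where each next item maximizes
// lambda * relevance - (1 - lambda) * max similarity to items already picked.
public class DiverseReranker {
    public static <T> List<T> rerank(List<T> candidates,
                                     Map<T, Double> relevance,
                                     BiFunction<T, T, Double> similarity,
                                     double lambda, int k) {
        List<T> selected = new ArrayList<>();
        List<T> remaining = new ArrayList<>(candidates);
        while (selected.size() < k && !remaining.isEmpty()) {
            T best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (T c : remaining) {
                double maxSim = 0.0;
                for (T s : selected) {
                    maxSim = Math.max(maxSim, similarity.apply(c, s));
                }
                double score = lambda * relevance.get(c) - (1 - lambda) * maxSim;
                if (score > bestScore) {
                    bestScore = score;
                    best = c;
                }
            }
            selected.add(best);
            remaining.remove(best);
        }
        return selected;
    }
}
```

Setting lambda close to 1 recovers pure relevance ranking; lowering it penalizes showing several near-identical groups in one slate.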
Anuj Goyal (MS '12), Carnegie Mellon University
Many companies have names formed by concatenating two words, e.g. Facebook, LinkedIn and OpenTable. When searching for these companies, some users will type the name with a space inserted between the two words, which can cause subpar results to be returned. Although users typically realize what they did wrong and retry, it would be best if the right results were returned the first time. Some previous work has been done on this for company search. However, it is still an issue in other search domains, such as people and job search.
To remedy this issue, changes were made to the way that certain fields are indexed and searched. Under the new approach, key fields have all compound words decomposed (e.g. “LinkedIn” becomes “Linked In”), and the result is indexed and stored into a separate field. This allows the component words to be queried for on their own.
Queries from the user are rewritten to search on both the original field and the expanded one. However, naively summing the scores from both fields gives rise to unfavorable boosting of some documents over others. To combat this, only the score from one of the fields is allowed to contribute to the total score for the document.
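For concreteness, one way to realize this in Lucene is a DisjunctionMaxQuery with a tie-breaker of zero, which scores a document by the maximum of its per-field scores rather than their sum. This is only a sketch: the field names are hypothetical and multi-term queries are ignored.

```java
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Sketch: query both the original and the decomposed field, but let only
// the best-scoring field contribute to each document's score.
public class CompoundNameQuery {
    public static Query build(String userTerm) {
        Query original = new TermQuery(new Term("companyName", userTerm));
        Query decomposed = new TermQuery(new Term("companyNameDecomposed", userTerm));
        // Tie-breaker 0.0 => score = max(field scores), no double counting.
        return new DisjunctionMaxQuery(List.of(original, decomposed), 0.0f);
    }
}
```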
Initial testing indicates that this approach performs well, both in terms of finding results and ranking them well. Using this method, the correct document is likely to be found regardless of inconsistencies in spacing.
Mark Wagner (BA '12), University of California, Santa Cruz
I worked on enhancing the job recommendation model, or “Jobs You May Be Interested In”, as it is known on the website. Specifically, I computed region similarities to help rank relevant jobs according to the migratory patterns observed in LinkedIn members. This required inferring the geographic locations of member positions, since most members do not provide that information. Region similarity is now one of the many dimensions we can use to establish whether or not a member will find a given job relevant.
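As a rough sketch of how migratory patterns can yield a region similarity (this is illustrative, not the production model; all names are hypothetical), one can count migrations between consecutive positions and normalize by the source region’s total outflow:

```java
import java.util.*;

// Sketch: region similarity derived from observed member migrations.
public class RegionSimilarity {
    private final Map<String, Map<String, Integer>> migrations = new HashMap<>();

    // regions: a member's inferred position locations in chronological order.
    public void addMemberHistory(List<String> regions) {
        for (int i = 0; i + 1 < regions.size(); i++) {
            String from = regions.get(i), to = regions.get(i + 1);
            if (!from.equals(to)) {
                migrations.computeIfAbsent(from, k -> new HashMap<>())
                          .merge(to, 1, Integer::sum);
            }
        }
    }

    // Fraction of migrations out of `from` that land in `to`.
    public double similarity(String from, String to) {
        Map<String, Integer> out = migrations.getOrDefault(from, Map.of());
        int total = out.values().stream().mapToInt(Integer::intValue).sum();
        return total == 0 ? 0.0 : (double) out.getOrDefault(to, 0) / total;
    }
}
```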
Mario Rodriguez (MS '12), University of California, Santa Cruz
This summer I worked on Nimbus, a service which provides a database-like interface to LinkedIn’s social graph. Nimbus contains a class called EdgeSet, which stores the connections of a member, ANet or company to other nodes in the graph. Like most data structures in Nimbus, EdgeSets are immutable copy-on-write structures, which makes it easy to run queries in parallel with no locking. EdgeSet previously stored edges in a flat array, so inserting a new edge required copying all existing edges into a new array. And after an update was applied in memory, Nimbus would flush all of the node’s edges to disk. Both of these operations can cause dangerous activity spikes. I split EdgeSet’s flat array into small chunks, so only one chunk needs to be copied when an update is applied. I then made similar changes to our storage layer, so only a small chunk of edges needs to be flushed to disk following an update.
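The sketch below illustrates the chunked copy-on-write idea, not Nimbus’s actual EdgeSet: an insert copies only the affected chunk plus the small chunk table, while all other chunks are shared with the previous version.

```java
import java.util.Arrays;

// Sketch: chunked copy-on-write edge storage.
public final class ChunkedEdgeSet {
    private static final int CHUNK_SIZE = 1024;
    private final long[][] chunks; // each chunk is an immutable run of neighbor ids
    private final int size;

    private ChunkedEdgeSet(long[][] chunks, int size) {
        this.chunks = chunks;
        this.size = size;
    }

    public static ChunkedEdgeSet empty() {
        return new ChunkedEdgeSet(new long[][] { new long[0] }, 0);
    }

    // Returns a new EdgeSet sharing every chunk except the one that changed.
    public ChunkedEdgeSet withEdge(long nodeId) {
        int last = chunks.length - 1;
        long[][] newChunks;
        if (chunks[last].length < CHUNK_SIZE) {
            // Copy only the chunk table and the last chunk, appending the edge.
            newChunks = Arrays.copyOf(chunks, chunks.length);
            long[] copy = Arrays.copyOf(chunks[last], chunks[last].length + 1);
            copy[copy.length - 1] = nodeId;
            newChunks[last] = copy;
        } else {
            // Last chunk is full: start a fresh one; existing chunks are shared.
            newChunks = Arrays.copyOf(chunks, chunks.length + 1);
            newChunks[chunks.length] = new long[] { nodeId };
        }
        return new ChunkedEdgeSet(newChunks, size + 1);
    }

    public int size() { return size; }
}
```

The same sharing structure lets the storage layer flush just the dirty chunk rather than a node’s entire edge list.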
Daniel Lubarov (BS '12), Harvey Mudd College
Voldemort is a Dynamo-style NoSQL key-value store powering a large portion of LinkedIn’s web services. Voldemort operates at enormous scale, serving hundreds of thousands of operations per second across hundreds of servers with several hundred terabytes of data, at very low latency. The goal of this project was to explore strategies for gracefully handling performance degradation and load spikes by determining if and when requests have to be rejected to avoid overloading the system. The project directly benefits all of the services accessing Voldemort by improving the overall stability of the storage server amidst degradation.
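As one illustrative point in this strategy space (hypothetical names, not Voldemort’s actual mechanism), a server can bound its in-flight requests and fail fast once saturated, rather than queueing work it cannot finish in time:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: simple admission control for load shedding.
public class AdmissionController {
    private final int maxInFlight;
    private final AtomicInteger inFlight = new AtomicInteger();

    public AdmissionController(int maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    // Returns false if the request should be rejected immediately.
    public boolean tryAcquire() {
        if (inFlight.incrementAndGet() > maxInFlight) {
            inFlight.decrementAndGet();
            return false; // shed load: caller returns an overload error
        }
        return true;
    }

    // Call when the admitted request completes (success or failure).
    public void release() {
        inFlight.decrementAndGet();
    }
}
```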
Bhavani Sudha Saktheeswaran (MS '13), University of Colorado at Boulder
Espresso is a distributed document store under development at LinkedIn, expected to serve as the primary storage system in the near future. The project examined whether we can leverage the Espresso infrastructure to build highly scalable, low-latency, near-real-time analytics extensions using a combination of row and column storage layouts with time-partitioning. The project would greatly benefit a broad class of applications demanding interactive online analytics by providing more flexible, more efficient counts, aggregates and other standard statistics (e.g. for personalization).
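To illustrate the time-partitioning idea (a toy sketch, not Espresso’s design; all names are hypothetical), counts can be bucketed into hourly partitions so that recent partitions stay write-friendly while older, immutable ones can be compacted for scans:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: time-partitioned counters for near-real-time analytics.
public class TimePartitionedCounter {
    private static final long HOUR_MS = 3_600_000L;
    // partition start time -> (dimension value -> count)
    private final Map<Long, ConcurrentHashMap<String, Long>> partitions =
            new ConcurrentHashMap<>();

    public void increment(String dimension, long timestampMs) {
        long partition = (timestampMs / HOUR_MS) * HOUR_MS;
        partitions.computeIfAbsent(partition, k -> new ConcurrentHashMap<>())
                  .merge(dimension, 1L, Long::sum);
    }

    // Aggregate a dimension over [fromMs, toMs) by summing hourly partitions.
    public long count(String dimension, long fromMs, long toMs) {
        long total = 0;
        for (long p = (fromMs / HOUR_MS) * HOUR_MS; p < toMs; p += HOUR_MS) {
            Map<String, Long> bucket = partitions.get(p);
            if (bucket != null) {
                total += bucket.getOrDefault(dimension, 0L);
            }
        }
        return total;
    }
}
```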
Bailu Ding (PhD '16), Cornell
LinkedIn uses Hadoop extensively to meet offline data analytics needs. The Hadoop grid at LinkedIn consists of hundreds of nodes, housing petabytes of data, running hundreds of jobs every day. The goal of this project was to build an analysis tool to provide performance breakdowns for an individual Hadoop job as well as jobs scheduled by a Pig script. The project shed light on how we were using our Hadoop infrastructure by providing historical trends for insightful metrics like mapper/reducer times and IO activity.
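To give a flavor of the breakdowns such a tool produces (extraction of per-task durations from the actual job history files is omitted; names are hypothetical), one can summarize mapper and reducer times so that skew and stragglers stand out:

```java
import java.util.List;
import java.util.LongSummaryStatistics;

// Sketch: per-phase timing summary for one Hadoop job.
public class JobBreakdown {
    public static void summarize(List<Long> mapMillis, List<Long> reduceMillis) {
        print("mappers", mapMillis);
        print("reducers", reduceMillis);
    }

    private static void print(String phase, List<Long> durations) {
        LongSummaryStatistics s = durations.stream()
                .mapToLong(Long::longValue)
                .summaryStatistics();
        System.out.printf("%s: count=%d min=%dms avg=%.0fms max=%dms%n",
                phase, s.getCount(), s.getMin(), s.getAverage(), s.getMax());
    }
}
```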
Sahil Takiar (BS '13), Berkeley
LinkedIn uses Pig extensively as a high-level data manipulation tool for writing large event processing ETL (Extract-Transform-Load) workflows. Often a full ETL workflow involves more than one Pig script, some of which share their input/output touch points. Though Pig is a high-level language, understanding the business logic embedded in the code is non-trivial. The goal of this project was to build parsers for the input/output and transformation logic of a collection of Pig scripts, adding visualization capabilities to help understand data lineage and business rules. The project provides a whole new understanding of the complex data flow graphs underneath some of the most critical ETL jobs.
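To give a flavor of the lineage-extraction step (a real parser must handle parameters, comments and macros; this toy version does not), scanning a script for LOAD and STORE statements recovers its input/output touch points:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: recover a Pig script's inputs and outputs from LOAD/STORE statements.
public class PigLineage {
    private static final Pattern LOAD =
            Pattern.compile("\\bLOAD\\s+'([^']+)'", Pattern.CASE_INSENSITIVE);
    private static final Pattern STORE =
            Pattern.compile("\\bSTORE\\s+\\w+\\s+INTO\\s+'([^']+)'", Pattern.CASE_INSENSITIVE);

    public static List<String> inputs(String script) {
        return matches(LOAD, script);
    }

    public static List<String> outputs(String script) {
        return matches(STORE, script);
    }

    private static List<String> matches(Pattern p, String script) {
        List<String> paths = new ArrayList<>();
        Matcher m = p.matcher(script);
        while (m.find()) {
            paths.add(m.group(1));
        }
        return paths;
    }
}
```

Scripts whose output paths appear among another script’s input paths then become edges in the workflow’s data flow graph.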
Zhaonan Sun (MS '14), Rice
Espresso is a distributed document storage system under development at LinkedIn. As LinkedIn becomes increasingly popular, it is imperative for us to operate actively out of multiple data centers to provide the best experience to our members. Espresso, being a major storage system at LinkedIn, needed a mechanism to reconcile conflicting writes that happen at different data centers. This project prototyped a basic form of conflict detection and resolution, with support for cross-colo routing. The project delivered a critical piece of technology that enables LinkedIn to operate seamlessly out of multiple data centers.
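The post does not say which resolution policy the prototype used; the sketch below shows one basic form, last writer wins, where each write carries an origin timestamp and data center id and ties are broken deterministically so every colo converges on the same winner:

```java
// Sketch: last-writer-wins conflict resolution between replicas.
public class LastWriterWins {
    public record Version(byte[] value, long timestampMs, String originColo) {}

    // Given the local version and a replicated remote version, keep the newer.
    public static Version resolve(Version local, Version remote) {
        if (remote.timestampMs() != local.timestampMs()) {
            return remote.timestampMs() > local.timestampMs() ? remote : local;
        }
        // Tie-break on data center id so resolution is deterministic everywhere.
        return remote.originColo().compareTo(local.originColo()) > 0 ? remote : local;
    }
}
```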
Peter Bailis (PhD '16), Berkeley
Apache Helix is a generic cluster management framework that simplifies building distributed systems. Apache YARN is a generic resource manager and controller for large-scale distributed applications. The goal of this project was to explore how we could leverage the automated service deployment provided by YARN to auto-scale services managed by Helix, based on a set of predefined service-level objectives (SLOs), e.g. latency tolerance. The project has immense value for the LinkedIn infrastructure by making services more elastic to traffic trends.
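As a sketch of the SLO-driven decision itself (Helix’s service management and YARN’s container allocation are abstracted away; all names are hypothetical), a controller can compare an observed latency percentile against the objective:

```java
// Sketch: decide a desired instance count from a latency SLO.
public class SloAutoScaler {
    private final double latencySloMs;

    public SloAutoScaler(double latencySloMs) {
        this.latencySloMs = latencySloMs;
    }

    // Returns the desired instance count given the current state.
    public int desiredInstances(int current, double observedP99Ms) {
        if (observedP99Ms > latencySloMs) {
            return current + 1; // scale out: SLO is being violated
        }
        if (observedP99Ms < 0.5 * latencySloMs && current > 1) {
            return current - 1; // scale in: ample headroom
        }
        return current;
    }
}
```

In the project’s setting, an increase here would translate into a YARN container request, with Helix rebalancing the service across the new instances.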
Alexander Pucher (PhD '16), UCSB
Apache Kafka is a distributed messaging system in very wide use both within and outside LinkedIn. Kafka consumer clients used a distributed, complex coordination protocol that depended heavily on ZooKeeper for failure detection and partition rebalancing. Such an architecture limits scalability and performance, and is error-prone due to its split-brain nature. The project built a prototype for a thinner, more scalable consumer client architecture with a highly available coordinator for load balancing. In addition to improving performance and scalability, this work also paved the way for easier multi-lingual consumer client support by removing all external dependencies, like ZooKeeper, from the client.
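To illustrate the coordinator’s core job (the strategy below is simple round-robin, not Kafka’s actual protocol), a rebalance becomes a single deterministic computation over the group’s live consumers instead of a ZooKeeper-mediated agreement among all clients:

```java
import java.util.*;

// Sketch: centralized round-robin partition assignment for a consumer group.
public class PartitionAssigner {
    public static Map<String, List<Integer>> assign(List<String> consumers,
                                                    int numPartitions) {
        Map<String, List<Integer>> assignment = new HashMap<>();
        List<String> sorted = new ArrayList<>(consumers);
        Collections.sort(sorted); // deterministic across coordinator failovers
        for (int p = 0; p < numPartitions; p++) {
            String owner = sorted.get(p % sorted.size());
            assignment.computeIfAbsent(owner, k -> new ArrayList<>()).add(p);
        }
        return assignment;
    }
}
```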
Guozhang Wang (PhD '13), Cornell