2013

Social Search in a Professional Context
Daniel Tunkelang

In the 22nd International Conference on Information and Knowledge Management (CIKM 2013)

Abstract

Social networks bring a new dimension to search. Instead of looking for web pages or text documents, LinkedIn members search a world of entities connected by a rich graph of relationships. Search is a fundamental part of the LinkedIn ecosystem, as it helps our members find and be found. Unlike most search applications, LinkedIn’s search experience is highly personalized: two LinkedIn members performing the same search query are likely to see completely different results. Delivering the right results to the right person depends on our ability to leverage each member’s unique professional identity and network. In this talk, I’ll describe the kinds of search behavior we see on LinkedIn, and some of the approaches we’ve taken to help our members address their information needs.

BiBTeX

@article{tunkelang2013,
  title={Social Search in a Professional Context},
  author={Tunkelang, D.},
  year={2013}
}
Structural Diversity in Social Recommender Systems
Xinyi Huang, Mitul Tiwari and Sam Shah

In the 5th ACM RecSys Workshop on Recommender Systems and the Social Web (RSWeb 2013)

Abstract

Online social networks have become important for sharing, discovery, communication, and networking. Recommender systems are an essential part of any social network. For example, recommending people to connect with is essential for the growth of the network since an online social network is only partially observed and two people might know each other but may not be connected. In this paper, we analyze data from LinkedIn, the largest online professional social network, which recommends other members to connect with through its “People You May Know” feature. Analyzing the effect of structural diversity on the invitation rate from such member recommendations, we find that higher connection density and lower structural diversity result in a higher connection invitation rate. We also analyze and study the effects of structural diversity of members’ connection networks on their engagement on the LinkedIn network.

Hourglass: a Library for Incremental Processing on Hadoop
Matthew Hayes and Sam Shah

In the 2013 IEEE International Conference on Big Data (IEEE BigData 2013)

Abstract

Hadoop enables processing of large data sets through its relatively easy-to-use semantics. However, jobs are often written inefficiently for tasks that could be computed incrementally due to the burdensome incremental state management for the programmer. This paper introduces Hourglass, a library for developing incremental monoid computations on Hadoop. It runs on unmodified Hadoop and provides an accumulator-based interface for programmers to store and use state across successive runs; the framework ensures that only the necessary subcomputations are performed. It is successfully used at LinkedIn, one of the largest online social networks, for many use cases in dashboarding and machine learning. Hourglass is open source and freely available.
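The incremental idea in this abstract can be illustrated with a minimal Python sketch. It is not Hourglass's API and does not run on Hadoop; the function names, the `state` layout, and the day keys are all hypothetical. It only shows why a monoid (an associative merge with an identity element, here `Counter` addition) lets a later run fold in new input without recomputing old input.

```python
from collections import Counter


def count_events(events):
    """Per-day aggregate: event name -> count (hypothetical example job)."""
    return Counter(events)


def incremental_run(state, daily_input):
    """Fold only unseen days into the saved aggregate.

    state: dict with 'total' (Counter) and 'seen' (set of day keys).
    daily_input: dict mapping day key -> list of event names.
    Returns (new_state, days_processed).
    """
    total = Counter(state["total"])  # copy so the old state is untouched
    seen = set(state["seen"])
    processed = []
    for day in sorted(daily_input):
        if day in seen:
            continue  # already folded in by an earlier run
        total.update(count_events(daily_input[day]))  # associative merge
        seen.add(day)
        processed.append(day)
    return {"total": total, "seen": seen}, processed


empty_state = {"total": Counter(), "seen": set()}
state1, done1 = incremental_run(
    empty_state, {"d1": ["view", "click"], "d2": ["view"]}
)
# The second run sees the same two days plus one new day.
state2, done2 = incremental_run(
    state1, {"d1": ["view", "click"], "d2": ["view"], "d3": ["click"]}
)
```

In this sketch the second run touches only `d3`, because the merged total for `d1` and `d2` is carried over in the saved state.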

Large-Scale Graph Mining for Recommendations
Sam Shah

In the 11th Workshop on Mining and Learning with Graphs (MLG 2013)

Abstract

The availability and affordability of large-scale data processing is transforming graph mining into a core production use case, especially in the consumer web space. At LinkedIn, the largest professional online social network with 225+ million members, a crucial characteristic is the use of static and temporal network features for many applications, particularly recommendations. These include “People You May Know”, a link prediction system to find other members on the network; “Endorsements”, a lightweight skill reputation product; “Related Searches”, query recommendations in our search engine; and more. How do we perform this graph mining at scale? What are some of the challenges we face? Besides the social graph, what about other interesting, but potentially more complex and larger graphs? In this talk, I will illustrate several of LinkedIn’s solutions in large scale graph mining.

Find and be Found: Information Retrieval at LinkedIn
Shakti Sinha and Daniel Tunkelang

In the 36th Annual International ACM Conference on Research & Development on Information Retrieval (SIGIR 2013)

Abstract

LinkedIn has a unique data collection: the 200M+ members who use LinkedIn are also the most valuable entities in our corpus, which consists of people, companies, jobs, and a rich content ecosystem. Our members use LinkedIn to satisfy a diverse set of navigational and exploratory information needs, which we address by leveraging semi-structured and social content to understand their query intent and deliver a personalized search experience. In this talk, we will discuss some of the unique challenges we face in building the LinkedIn search platform, the solutions we’ve developed so far, and the open problems we see ahead of us.

BiBTeX

@inproceedings{sinha2013,
title={Find and be Found: Information Retrieval at LinkedIn},
author={Sinha, S. and Tunkelang, D.},
booktitle={Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '13)},
year={2013}
}
Using Set Cover to Optimize a Large-Scale Low Latency Distributed Graph
Rui Wang, Christopher Conrad, and Sam Shah

In the 5th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 2013)

Abstract

Social networks often require the ability to perform low latency graph computations in the user request path. For example, at LinkedIn, we show the graph distance and common connections when we show a profile in any context on the site. To do this, we have developed a distributed and partitioned graph system that scales to hundreds of millions of members and their connections, handling hundreds of thousands of queries per second.

To accomplish this scaling, real time distributed graph traversal is converted into set intersections that are accomplished in a scatter/gather manner. A network performance bottleneck forms on the gather node as it must merge partial results from many machines. In this paper, we present a modified greedy set cover algorithm that is used to locate the minimal set of machines that can serve the partial results. Our results indicate that we are able to save 25% in the 99th percentile latency of these graph distance calculations for LinkedIn’s social graph workloads.
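For orientation, here is a minimal Python sketch of the textbook greedy set cover heuristic applied to machine selection. It is not the modified algorithm the paper presents, and the machine names, partition ids, and function name are hypothetical. Greedy selection gives an approximation to the smallest cover, not a guaranteed minimum.

```python
def greedy_machine_cover(needed, hosted):
    """Pick machines until every needed partition is served.

    needed: set of partition ids the query must touch (hypothetical).
    hosted: dict mapping machine name -> set of partition ids it holds.
    Returns a list of (machine, partitions_assigned) pairs.
    """
    uncovered = set(needed)
    plan = []
    while uncovered:
        # Choose the machine covering the most still-uncovered partitions.
        # Iterating in sorted order makes ties break on machine name.
        best = max(sorted(hosted), key=lambda m: len(hosted[m] & uncovered))
        gain = hosted[best] & uncovered
        if not gain:
            raise ValueError(f"no machine hosts partitions {sorted(uncovered)}")
        plan.append((best, gain))
        uncovered -= gain
    return plan


# Hypothetical replica layout: each partition lives on two machines.
hosted = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
    "d": {1, 6},
}
plan = greedy_machine_cover({1, 2, 3, 4, 5, 6}, hosted)
machines = [machine for machine, _ in plan]
```

Fewer machines in the plan means fewer partial results for the gather node to merge, which is the bottleneck the abstract describes.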

The "Big Data" ecosystem at LinkedIn
Roshan Sumbaly, Jay Kreps, and Sam Shah

In the 2013 International Conference on Management of Data (SIGMOD 2013)

Abstract

The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn’s Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the “last mile” issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.

BiBTeX

@inproceedings{sumbaly13bigdata,
 author = {Sumbaly, Roshan and Kreps, Jay and Shah, Sam},
 title = {The "big data" ecosystem at LinkedIn},
 booktitle = {Proceedings of the 2013 International Conference on Management of Data},
 series = {SIGMOD '13},
 year = {2013},
 location = {New York, New York, USA},
 pages = {1125--1134},
 numpages = {10},
 address = {New York, NY, USA},
}
Root cause detection in a service-oriented architecture
Myunghwan Kim, Roshan Sumbaly, and Sam Shah

In the 2013 International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2013)

Abstract

Large-scale websites are predominantly built as a service-oriented architecture. Here, services are specialized for a certain task, run on multiple machines, and communicate with each other to serve a user’s request. An anomalous change in a metric of one service can propagate to other services during this communication, resulting in overall degradation of the request. As any such degradation is revenue impacting, maintaining correct functionality is of paramount concern: it is important to find the root cause of any anomaly as quickly as possible. This is challenging because there are numerous metrics or sensors for a given service, and a modern website is usually composed of hundreds of services running on thousands of machines in multiple data centers.

This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked order list of possible root causes for monitoring teams to investigate. MonitorRank uses the historical and current time-series metrics of each sensor as its input, along with the call graph generated between sensors to build an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, show a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.

BiBTeX

@inproceedings{kim13rootcause,
 author = {Kim, Myunghwan and Sumbaly, Roshan and Shah, Sam},
 title = {Root cause detection in a service-oriented architecture},
 booktitle = {Proceedings of the International Conference on Measurement and Modeling of Computer Systems},
 series = {SIGMETRICS '13},
 year = {2013},
 location = {Pittsburgh, PA, USA},
 pages = {93--104},
 numpages = {12},
} 
Organizational overlap on social networks and its applications
Cho-Jui Hsieh, Mitul Tiwari, Deepak Agarwal, Xinyi (Lisa) Huang, and Sam Shah

In the 22nd International World Wide Web Conference (WWW 2013)

Abstract

Online social networks have become important for networking, communication, sharing, and discovery. A considerable challenge these networks face is the fact that an online social network is partially observed because two individuals might know each other, but may not have established a connection on the site. Therefore, link prediction and recommendations are important tasks for any online social network. In this paper, we address the problem of computing edge affinity between two users on a social network, based on the users belonging to organizations such as companies, schools, and online groups. We present experimental insights from social network data on organizational overlap, a novel mathematical model to compute the probability of connection between two people based on organizational overlap, and experimental validation of this model based on real social network data. We also present novel ways in which the organizational overlap model can be applied to link prediction and community detection, which in itself could be useful for recommending entities to follow and generating a personalized news feed.

BiBTeX

@inproceedings{hsieh13orgoverlap,
 author = {Hsieh, Cho-Jui and Tiwari, Mitul and Agarwal, Deepak and Huang, Xinyi (Lisa) and Shah, Sam},
 title = {Organizational overlap on social networks and its applications},
 booktitle = {Proceedings of the 22nd International Conference on World Wide Web},
 series = {WWW '13},
 year = {2013},
 location = {Rio de Janeiro, Brazil},
 pages = {571--582},
 numpages = {12},
} 
LinkedIn Endorsements: Reputation, Virality, and Social Tagging
Sam Shah and Pete Skomoroch

Strata 2013

Abstract

Endorsements are a one-click system to recognize someone for their skills and expertise on LinkedIn, the largest professional online social network. This is one of the latest “data features” in LinkedIn’s portfolio, and the endorsement ecosystem generates a large graph of reputation signals and viral user activity.

Underneath this feature, there are several interesting and difficult data questions:

  1. How do you automatically create a taxonomy of skills in the professional context?

  2. How do you disambiguate between different contexts of skills? For instance, “search” could mean information retrieval, search & seizure, search & rescue, among others.

  3. How can you leverage data to determine someone’s authoritativeness in a skill?

  4. How do you use that authoritativeness to recommend people to endorse?

  5. How do you optimize a complex large scale machine learning system for viral growth & engagement?

In this talk, we’ll examine the practical aspects of building a data feature like Endorsements. We’ll talk about marrying product design and data, deep diving into several of the lessons we’ve learned along the way - all using skills & endorsements as an empirical case study. We’ll include technical detail on our approaches and how we combine crowdsourcing, machine learning, and large scale distributed systems to recommend topics to users.

We’ll also show interesting results on how members are using the endorsements feature and how it’s spread across the network.


2012

Data By The People, For The People
Daniel Tunkelang

In the 21st International Conference on Information and Knowledge Management (CIKM 2012)

Abstract

LinkedIn has a unique data collection: the 160M+ members who use LinkedIn are also the content those same members access using our information retrieval products. LinkedIn members performed over 4 billion professionally-oriented searches in 2011, most of those to find and discover other people. Every LinkedIn search and recommendation is deeply personalized, reflecting the user’s current employment, career history, and professional network. In this talk, I will describe some of the challenges and opportunities that arise from working with this unique corpus. I will discuss work we are doing in the areas of relevance, recommendation, and reputation, as well as the ecosystem we have developed to incent people to provide the high-quality semi-structured profiles that make LinkedIn so useful.

BiBTeX

@article{tunkelang2012,
  title={Data By The People, For The People},
  author={Tunkelang, D.},
  year={2012}
}
Metaphor: A System for Related Search Recommendations
Azarias Reda, Yubin Park, Mitul Tiwari, Christian Posse, and Sam Shah

In the 21st International Conference on Information and Knowledge Management (CIKM 2012)

Abstract

Search plays an important role in online social networks as it provides an essential mechanism for discovering members and content on the network. Related search recommendation is one of several mechanisms used for improving members’ search experience in finding relevant results to their queries. This paper describes the design, implementation and deployment of Metaphor, the related search recommendation system on LinkedIn, a professional social networking site with over 160 million members worldwide. Metaphor builds on a number of signals and filters that capture several dimensions of relatedness across member search activity. The system, which has been in live operation for over a year, has gone through multiple iterations and evaluation cycles. This paper makes three contributions. First, we provide a discussion of a large-scale related search recommendation system. Second, we describe a mechanism for effectively combining several signals in building a unified dataset for related search recommendations. Third, we introduce a query length model for capturing bias in recommendation click behavior. We also discuss some of the practical concerns in deploying related search recommendations.

BiBTeX

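@inproceedings{redametaphor,
  title={Metaphor: a system for related search recommendations},
  author={Reda, A. and Park, Y. and Tiwari, M. and Posse, C. and Shah, S.},
  booktitle={21st International Conference on Information and Knowledge Management (CIKM 2012)},
  year={2012}
}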
Bridging Offline and Online Social Graph Dynamics
Manuel Gomez Rodriguez and Monica Rogati

In the 21st International Conference on Information and Knowledge Management (CIKM 2012)

Abstract

The online and offline worlds are converging. Location-based services, ubiquitous mobile devices and on-the-go social network accessibility are blurring the distinction between in-person activities and their virtual counterpart. An important effect of this convergence is the rapid and powerful impact of offline events (meetings, conferences) on the evolution and temporal dynamics of the online connectivity between members of social and professional networks. However, these effects have been largely unexplored.

We study these effects by using data from LinkedIn, a popular business-related social networking site. We find that online events may induce connectivity changes in the online network – there is a dramatic increase in the number of connections between event attendees shortly after the date of the event. Building on these insights, we describe a non-supervised method that exploits connectivity changes temporally correlated to real world events to successfully infer more than 40% of specific event attendees. Finally, we revisit the link prediction problem by including user contributed information about online events to achieve higher link prediction performance.

BiBTeX

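@inproceedings{gomezbridging,
  title={Bridging Offline and Online Social Graph Dynamics},
  author={Gomez-Rodriguez, M. and Rogati, M.},
  booktitle={21st International Conference on Information and Knowledge Management (CIKM 2012)},
  year={2012}
}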
Untangling Cluster Management With Helix
Kishore Gopalakrishna, Shi Lu, Zhen Zhang, Adam Silberstein, Kapil Surlaker, Ramesh Subramonian, Bob Schulman

In Proceedings of the 3rd ACM Symposium on Cloud Computing (SoCC 2012)

Abstract

Distributed data systems are used in a variety of settings like online serving, offline analytics, data transport, and search, among other use cases. They let organizations scale out their workloads using cost-effective commodity hardware, while retaining key properties like fault tolerance and scalability. At LinkedIn we have built a number of such systems. A key pattern we observe is that even though they may serve different purposes, they tend to have a lot of common functionality, and tend to use common building blocks in their architectures. One such building block that is just beginning to receive attention is cluster management, which addresses the complexity of handling a dynamic, large-scale system with many servers. Such systems must handle software and hardware failures, setup tasks such as bootstrapping data, and operational issues such as data placement, load balancing, planned upgrades, and cluster expansion.

All of this shared complexity, which we see in all of our systems, motivates us to build a cluster management framework, Helix, to solve these problems once in a general way.

Helix provides an abstraction for a system developer to separate coordination and management tasks from component functional tasks of a distributed system. The developer defines the system behavior via a state model that enumerates the possible states of each component, the transitions between those states, and constraints that govern the system’s valid settings. Helix does the heavy lifting of ensuring the system satisfies that state model in the distributed setting, while also meeting the system’s goals on load balancing and throttling state changes. We detail several Helix-managed production distributed systems at LinkedIn and how Helix has helped them avoid building custom management components. We describe the Helix design and implementation and present an experimental study that demonstrates its performance and functionality.
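The notion of a state model (states, allowed transitions, and constraints) can be illustrated with a minimal Python sketch. This is not Helix's API; the state names, the transition table, the per-state limit, and `apply_transition` are hypothetical and describe replicas of a single partition on one machine's view, with no distribution involved.

```python
# Hypothetical state model for the replicas of one partition.
TRANSITIONS = {
    ("OFFLINE", "SLAVE"),
    ("SLAVE", "MASTER"),
    ("MASTER", "SLAVE"),
    ("SLAVE", "OFFLINE"),
}
# Constraint: at most this many replicas may be in a given state.
MAX_PER_STATE = {"MASTER": 1}


def apply_transition(replicas, node, target):
    """Return a new replica map with `node` moved to `target`.

    replicas: dict mapping node name -> current state.
    Raises ValueError if the move is not in the transition table or
    would violate a per-state limit.
    """
    current = replicas[node]
    if (current, target) not in TRANSITIONS:
        raise ValueError(f"illegal transition {current} -> {target}")
    limit = MAX_PER_STATE.get(target)
    if limit is not None:
        in_target = sum(1 for state in replicas.values() if state == target)
        if in_target >= limit:
            raise ValueError(f"limit of {limit} replica(s) in {target} reached")
    updated = dict(replicas)
    updated[node] = target
    return updated


replicas = {"n1": "SLAVE", "n2": "SLAVE", "n3": "OFFLINE"}
promoted = apply_transition(replicas, "n1", "MASTER")
```

Declaring the table and limits as data, rather than scattering checks through component code, is the separation of coordination from functional logic that the abstract describes.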

All Aboard the Databus! LinkedIn's Scalable Consistent Change Data Capture Platform
Shirshanka Das, Chavdar Botev, Kapil Surlaker, Bhaskar Ghosh, Balaji Varadarajan, Sunil Nagaraj, David Zhang, Lei Gao, Jemiah Westerman, Phanindra Ganti, Boris Shkolnik, Sajid Topiwala, Alexander Pachev, Naveen Somasundaram and Subbu Subramaniam

In Proceedings of the 3rd ACM Symposium on Cloud Computing (SoCC 2012)

Abstract

In Internet architectures, data systems are typically categorized into source-of-truth systems that serve as primary stores for the user-generated writes, and derived data stores or indexes which serve reads and other complex queries. The data in these secondary stores is often derived from the primary data through custom transformations, sometimes involving complex processing driven by business logic. Similarly, data in caching tiers is derived from reads against the primary data store, but needs to get invalidated or refreshed when the primary data gets mutated. A fundamental requirement emerging from these kinds of data architectures is the need to reliably capture, flow and process primary data changes.

We have built Databus, a source-agnostic distributed change data capture system, which is an integral part of LinkedIn’s data processing pipeline. The Databus transport layer provides latencies in the low milliseconds and handles throughput of thousands of events per second per server while supporting infinite look back capabilities and rich subscription functionality. This paper covers the design, implementation and trade-offs underpinning the latest generation of Databus technology. We also present experimental results from stress-testing the system and describe our experience supporting a wide range of LinkedIn production applications built on top of Databus.

Beyond Ratings and Followers
Anmol Bhasin

In the 6th ACM International Conference on Recommender Systems (RecSys 2012)

Abstract

The pervasiveness of social networks has magnified the utility of recommender systems, and all three classical dimensions (users, items, and modes of interaction such as click or buy) have exploded in scale: more users, more heterogeneous items, and diverse interactions.

In this talk we present the challenges and opportunities of applying simple to sophisticated machine learning, data mining, and statistical modeling techniques to the world of recommender problems in social networks. Using real world example applications deployed on LinkedIn, we build from foundational literature on content based recommendations, collaborative filtering, and behavioral targeting techniques to arrive at the formalism of Social Filtering. We then cover critical aspects of developing web-scale social recommender systems, including infrastructure, feature engineering, and model fitting. We describe some of the most fascinating challenges faced in the real-world setting of operating recommender systems, including scalability, offline vs. online tradeoffs, A/B testing, and multiple objective optimization. Finally, we conclude with some new and unique paradigms of virtual profiling, social referral, and intent-interest modeling in the context of the LinkedIn recommender system.

BiBTeX

@inproceedings{bhasin2012industry,
  title={Beyond Ratings and Followers},
  author={Bhasin, A},
  booktitle={Proceedings of the sixth ACM conference on Recommender systems},
  year={2012},
  organization={ACM}
}
Social Referral: Leveraging network connections to deliver recommendations
Mohammad Amin, Baoshi Yan, Sripad Sriram, Anmol Bhasin and Christian Posse

In Proceedings of the Sixth ACM Conference on Recommender Systems (RecSys 2012)

Abstract

Much work has been done to study the interplay between recommender systems and social networks. This creates a very powerful coupling in presenting highly relevant recommendations to the users. However, to our knowledge, little attention has been paid to leveraging a user’s social network to deliver these recommendations. We present a novel approach to aid delivery of recommendations using the recipient’s friends or connections. Our contributions with this study are 1) a novel recommendation delivery paradigm called Social Referral, which utilizes a user’s social network for the delivery of relevant content; 2) an implementation of the paradigm in a real industrial production setting of a large online professional network; and 3) a study of the interaction between the trifecta of the recommender system, the trusted connections and the end consumer of the recommendation by comparing and contrasting the proposed approach’s performance with the direct recommender system. Our experiments indicate that Social Referral is a promising mechanism for recommendation delivery. The experiments show that a significant portion of users are receptive to passing along relevant recommendations to their social networks, and that recommendations delivered through users’ social networks are much more likely to be accepted than those directly delivered to users.

BiBTeX

@inproceedings{amin2012social,
  title={Social referral: leveraging network connections to deliver recommendations},
  author={Amin, M.S. and Yan, B. and Sriram, S. and Bhasin, A. and Posse, C.},
  booktitle={Proceedings of the sixth ACM conference on Recommender systems},
  pages={273--276},
  year={2012},
  organization={ACM}
}
Multiple Objective Optimization in Recommendation Systems
Mario Rodriguez, Christian Posse and Ethan Zhang

In Proceedings of the Sixth ACM Conference on Recommender Systems (RecSys 2012)

Abstract

We address the problem of optimizing recommender systems for multiple relevance objectives that are not necessarily aligned. Specifically, given a recommender system that optimizes for one aspect of relevance, semantic matching (as defined by any notion of similarity between source and target of recommendation; usually trained on CTR), we want to enhance the system with additional relevance signals that will increase the utility of the recommender system, but that may simultaneously sacrifice the quality of the semantic match. The issue is that semantic matching is only one relevance aspect of the utility function that drives the recommender system, albeit a significant aspect.

BiBTeX

@inproceedings{rodriguez2012multiple,
  title={Multiple objective optimization in recommender systems},
  author={Rodriguez, M. and Posse, C. and Zhang, E.},
  booktitle={Proceedings of the sixth ACM conference on Recommender systems},
  pages={11--18},
  year={2012},
  organization={ACM}
}
Content, Connections, and Context
Daniel Tunkelang

In the 4th ACM RecSys Workshop on Recommender Systems and the Social Web (in conjunction with RecSys 2012)

Abstract

Recommender systems for the social web combine three kinds of signals to relate the subject and object of recommendations: content, connections, and context.

Content comes first - we need to understand what we are recommending and to whom we are recommending it in order to decide whether the recommendation is relevant. Connections supply a social dimension, both as inputs to improve relevance and as social proof to explain the recommendations. Finally, context determines where and when a recommendation is appropriate.

I’ll talk about how we use these three kinds of signals in LinkedIn’s recommender systems, as well as the challenges we see in delivering social recommendations and measuring their relevance.

BiBTeX

@inproceedings{mobasher20124th,
  title={4th ACM RecSys workshop on recommender systems and the social web},
  author={Mobasher, B. and Jannach, D. and Geyer, W. and Hotho, A.},
  booktitle={Proceedings of the sixth ACM conference on Recommender systems},
  pages={345--346},
  year={2012},
  organization={ACM}
}
Avatara: OLAP for Web-scale Analytics Products
Lili Wu, Roshan Sumbaly, Chris Riccomini, Gordon Koo, Hyung Jin Kim, Jay Kreps, and Sam Shah

In the 38th International Conference on Very Large Databases (VLDB 2012)

Abstract

Multidimensional data generated by members on websites has seen massive growth in recent years. OLAP is a well-suited solution for mining and analyzing this data. Providing insights derived from this analysis has become crucial for these websites to give members greater value. For example, LinkedIn, the largest professional social network, provides its professional members rich analytics features like “Who’s Viewed My Profile?” and “Who’s Viewed This Job?” The data behind these features form cubes that must be efficiently served at scale, and can be neatly sharded to do so. To serve our growing 160 million member base, we built a scalable and fast OLAP serving system called Avatara to solve this many, small cubes problem. At LinkedIn, Avatara has been powering several analytics features on the site for the past two years.

BiBTeX

@article{avatara,
  author    = {Lili Wu and
               Roshan Sumbaly and
               Chris Riccomini and
               Gordon Koo and
               Hyung Jin Kim and
               Jay Kreps and
               Sam Shah},
  title     = {Avatara: OLAP for Web-scale Analytics Products},
  journal   = {PVLDB},
  volume    = {5},
  number    = {12},
  year      = {2012},
  pages     = {1874--1877},
  url       = {http://vldb.org/pvldb/vol5/p1874_liliwu_vldb2012.pdf},
}
Key Lessons Learned Building Recommender Systems for Large-Scale Social Networks
Christian Posse

In the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2012)

Abstract

By helping members to connect, discover and share relevant content or find a new career opportunity, recommender systems have become a critical component of user growth and engagement for social networks. The multidimensional nature of engagement and diversity of members on large-scale social networks have generated new infrastructure and modeling challenges and opportunities in the development, deployment and operation of recommender systems. This presentation will address some of these issues, focusing on the modeling side, for which new research is much needed, while describing a recommendation platform that enables real-time recommendation updates at scale as well as batch computations, and cross-leverage between different product recommendations. Topics covered on the modeling side will include optimizing for multiple competing objectives, solving contradicting business goals, modeling user intent and interest to maximize placement and timeliness of the recommendations, utility metrics beyond CTR that leverage real-time tracking of both explicit and implicit user feedback, gathering training data for new product recommendations, virality-preserving online testing, and virtual profiling.

BiBTeX

@inproceedings{Posse:2012:KLL:2339530.2339625,
 author = {Posse, Christian},
 title = {Key lessons learned building recommender systems for large-scale social networks},
 booktitle = {Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining},
 series = {KDD '12},
 year = {2012},
 isbn = {978-1-4503-1462-6},
 location = {Beijing, China},
 pages = {587--587},
 numpages = {1},
 url = {http://doi.acm.org/10.1145/2339530.2339625},
 doi = {10.1145/2339530.2339625},
 acmid = {2339625},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {multi-objective optimization, online testing, real-time updates, recommender systems, user intent modeling},
} 
Learning to rank social update streams
Liangjie Hong, Ron Bekkerman, Joseph Adler, and Brian Davison

In the 35th Annual International ACM Conference on Research & Development on Information Retrieval (SIGIR 2012)

Abstract

As online social media integrates deeper into our lives, we spend more time consuming social update streams that come from our online connections. Although social update streams provide a tremendous opportunity for us to access information on-the-fly, we often complain about their relevance. Some of us are flooded with a steady stream of information and simply cannot process it in full. Ranking the incoming content becomes the only solution for the overwhelmed users. For some others, in contrast, the incoming information stream is pretty weak, and they have to actively search for relevant information, which is quite tedious. For these users, augmenting their incoming content flow with relevant information from outside their first-degree network would be a viable solution. In that case, the problem of relevance becomes even more prominent. In this paper, we start an open discussion on how to build effective systems for ranking social updates from a unique perspective of LinkedIn – the largest professional network in the world. More specifically, we address this problem as an intersection of learning to rank, collaborative filtering, and clickthrough modeling, while leveraging ideas from information retrieval and recommender systems. We propose a novel probabilistic latent factor model with regressions on explicit features and compare it with a number of non-trivial baselines. In addition to demonstrating superior performance of our model, we shed some light on the nature of social updates on LinkedIn and how users interact with them, which might be applicable to social update streams in general.

BiBTeX

@inproceedings{hong2012learning,
  title={Learning to rank social update streams},
  author={Hong, L. and Bekkerman, R. and Adler, J. and Davison, B.D.},
  booktitle={Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval},
  pages={651--660},
  year={2012},
  organization={ACM}
}
Bimodal invitation-navigation fair bets model for authority identification in a social network
Suratna Budalakoti and Ron Bekkerman

In the 21st International World Wide Web Conference (WWW 2012)

Abstract

We consider the problem of identifying the most respected, authoritative members of a large-scale online social network (OSN) by constructing a global ranked list of its members. The problem is distinct from the problem of identifying influencers: we are interested in identifying members who are influential in the real world, even when not necessarily so on the OSN. We focus on two sources for information about user authority: (a) invitations to connect, which are usually sent to people whom the inviter respects, and (b) members’ browsing behavior, as profiles of more important people are viewed more often than others’. We construct two directed graphs over the same set of nodes (representing member profiles): the invitation graph and the navigation graph respectively. We show that the standard PageRank algorithm, a baseline in web page ranking, is not effective in people ranking, and develop a social capital based model, called the fair bets model, as a viable solution. We then propose a novel approach, called bimodal fair bets, for combining information from two (or more) endorsement graphs drawn from the same OSN, by simultaneously using the authority scores of nodes in one graph to inform the other, and vice versa, in a mutually reinforcing fashion. We evaluate the ranking results on the LinkedIn social network using this model, where members who have Wikipedia profiles are assumed to be authoritative. Experimental results show that our approach outperforms the baseline approach by a large margin.

BiBTeX

@inproceedings{budalakoti2012bimodal,
  title={Bimodal invitation-navigation fair bets model for authority identification in a social network},
  author={Budalakoti, S. and Bekkerman, R.},
  booktitle={Proceedings of the 21st international conference on World Wide Web},
  pages={709--718},
  year={2012},
  organization={ACM}
}
Data Infrastructure at LinkedIn
Bhaskar Ghosh, Shirshanka Das, Jay Kreps, Kapil Surlaker, Jun Rao, et al.

In 28th IEEE International Conference on Data Engineering (ICDE 2012)

Abstract

LinkedIn is among the largest social networking sites in the world. As the company has grown, our core data sets and request processing requirements have grown as well. In this paper, we describe a few selected data infrastructure projects at LinkedIn that have helped us accommodate this increasing scale. Most of those projects build on existing open source projects and are themselves available as open source. The projects covered in this paper include: (1) Voldemort: a scalable and fault tolerant key-value store, (2) Databus: a framework for delivering database changes to downstream applications, (3) Espresso: a distributed data store that supports flexible schemas and secondary indexing, and (4) Kafka: a scalable and efficient messaging system for collecting various user activity events and log data.

BiBTeX

@inproceedings{auradkar2012data,
  title={Data Infrastructure at LinkedIn},
  author={Auradkar, A. and Botev, C. and Das, S. and De Maagd, D. and Feinberg, A. and Ganti, P. and Gao, L. and Ghosh, B. and Gopalakrishna, K. and Harris, B. and others},
  booktitle={2012 IEEE 28th International Conference on Data Engineering},
  pages={1370--1381},
  year={2012},
  organization={IEEE}
}
Social Networking in Developing Regions
Azarias Reda, Sam Shah, Mitul Tiwari, Anita Lillie, and Brian Noble

In the 5th International Conference on Information and Communications Technologies and Development (ICTD 2012)

Abstract

Online social networks have enjoyed significant growth over the past several years. With improvements in mobile and Internet penetration, developing countries are participating in increasing numbers in online communities. This paper provides the first large scale and detailed analysis of social networking usage in developing country contexts. The analysis is based on data from LinkedIn, a professional social network with over 120 million members worldwide. LinkedIn has members from every country in the world, including millions in Africa, Asia, and South America. The goal of this paper is to provide researchers a detailed look at the growth, adoption, and other characteristics of social networking usage in developing countries compared to the developed world. To this end, we discuss several themes that illustrate different dimensions of social networking use, ranging from interconnectedness of members in geographic regions to the impact of local languages on social network participation.

BiBTeX

@inproceedings{Reda:2012:SND:2160673.2160686,
  author = {Reda, Azarias and Shah, Sam and Tiwari, Mitul and Lillie, Anita and Noble, Brian},
  title = {Social networking in developing regions},
  booktitle = {Proceedings of the Fifth International Conference on Information and Communication Technologies and Development},
  series = {ICTD '12},
  year = {2012},
  isbn = {978-1-4503-1045-1},
  location = {Atlanta, Georgia},
  pages = {94--103},
  numpages = {10},
  url = {http://doi.acm.org/10.1145/2160673.2160686},
  doi = {10.1145/2160673.2160686},
  acmid = {2160686},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {developing regions, emerging markets, social networks},
}
Serving Large-Scale Batch Computed Data with Project Voldemort
Roshan Sumbaly, Jay Kreps, Lei Gao, Alex Feinberg, Chinmay Soman, and Sam Shah

In the 10th USENIX conference on File and Storage Technologies (FAST 2012)

Abstract

Current serving systems lack the ability to bulk load massive immutable data sets without affecting serving performance. The performance degradation is largely due to index creation and modification as CPU and memory resources are shared with request serving. We have extended Project Voldemort, a general-purpose distributed storage and serving system inspired by Amazon’s Dynamo, to support bulk loading terabytes of read-only data. This extension constructs the index offline, by leveraging the fault tolerance and parallelism of Hadoop. Compared to MySQL, our compact storage format and data deployment pipeline scale to twice the request throughput while maintaining sub 5 ms median latency. At LinkedIn, the largest professional social network, this system has been running in production for more than 2 years and serves many of the data-intensive social features on the site.

BiBTeX

@inproceedings{Sumbaly:2012:SLB:2208461.2208479,
 author = {Sumbaly, Roshan and Kreps, Jay and Gao, Lei and Feinberg, Alex and Soman, Chinmay and Shah, Sam},
 title = {Serving large-scale batch computed data with {Project Voldemort}},
 booktitle = {Proceedings of the 10th USENIX conference on File and Storage Technologies},
 series = {FAST'12},
 year = {2012},
 location = {San Jose, CA},
 numpages = {13},
 url = {http://dl.acm.org/citation.cfm?id=2208461.2208479},
 acmid = {2208479},
 publisher = {USENIX Association},
 address = {Berkeley, CA, USA},
} 

2011

Scaling up machine learning: Parallel and distributed approaches
Ron Bekkerman, Misha Bilenko, and John Langford

Cambridge University Press (2011)

Abstract

This book presents an integrated collection of representative approaches for scaling up machine learning and data mining methods on parallel and distributed computing platforms. Demand for parallelizing learning algorithms is highly task-specific: in some settings it is driven by the enormous dataset sizes, in others by model complexity or by real-time performance requirements. Making task-appropriate algorithm and platform choices for large-scale machine learning requires understanding the benefits, trade-offs, and constraints of the available options. Solutions presented in the book cover a range of parallelization platforms from FPGAs and GPUs to multi-core systems and commodity clusters, concurrent programming frameworks including CUDA, MPI, MapReduce, and DryadLINQ, and learning settings (supervised, unsupervised, semi-supervised, and online learning). Extensive coverage of parallelization of boosted trees, SVMs, spectral clustering, belief propagation and other popular learning algorithms and deep dives into several applications make the book equally useful for researchers, students, and practitioners.

BiBTeX

@book{bekkerman2011scaling,
  title={Scaling up machine learning: Parallel and distributed approaches},
  author={Bekkerman, R. and Bilenko, M. and Langford, J.},
  year={2011},
  publisher={Cambridge Univ Pr}
}
HCIR 2011: the fifth international workshop on human-computer interaction and information retrieval
Robert Capra, Gene Golovchinsky, Bill Kules, Dan Russell, Catherine Smith, Daniel Tunkelang, and Ryen White

In SIGIR Forum 45(2)

Abstract

This report describes the 2011 Workshop on Human-Computer Interaction and Information Retrieval. Now in its fifth year, the workshop was held in October, in Mountain View, CA. The event brings together researchers from academia, industry, and government and a range of disciplines for in-depth discussions in an informal atmosphere. The workshop continues to grow, with around 100 attendees this year. We continued the HCIR Challenge, this time focusing on the problem of information availability, with four in-depth system demonstrations, and audience selection of a challenge winner.

BiBTeX

@article{capra2012hcir,
  title={HCIR 2011: the fifth international workshop on human-computer interaction and information retrieval},
  author={Capra, R. and Golovchinsky, G. and Kules, B. and Russell, D. and Smith, C.L. and Tunkelang, D. and White, R.W.},
  journal={ACM SIGIR Forum},
  volume={45},
  number={2},
  pages={102--107},
  year={2012},
  publisher={ACM}
}
Databus: A System for Timeline-Consistent Low-Latency Change Capture
Chavdar Botev

In the ACM 20th Conference on Information and Knowledge Management (CIKM 2011)

Abstract

LinkedIn’s rich social data allows it to successfully connect its more than 100 million members, professionals from around the world, with new economic opportunities. This data is predominantly stored in relational database systems such as Oracle and MySQL. While such systems provide a reliable and consistent data storage layer, they have limited capabilities for dealing with graph, unstructured or semi-structured data. Therefore, for efficient and effective processing of such data, LinkedIn relies on external systems, such as its graph index Dgraph and real-time full-text search index Zoie. This approach poses the problem of keeping external systems up-to-date with constantly changing data in the primary store.

Databus solves this problem by providing a data change capture mechanism from the primary stores to external subscribers in user space. The main challenges are in providing (a) transactional semantics with strong reliability and ordering guarantees for timeline consistency of the subscribers in a distributed asynchronous environment, (b) low latency and high throughput for (near) real-time updates, and (c) scalability to hundreds of subscribers which can dynamically join, leave, fall behind and catch up with little impact on the primary store and the rest of the system.

In this talk, we will describe how Databus addresses the above challenges and some of the lessons learned from its many uses at LinkedIn, such as the aforementioned external indexes, replication, cache invalidation, and view materialization across multiple databases.

BiBTeX

@inproceedings{botev2011databus,
  title={Databus: A System for Timeline-Consistent Low-Latency Change Capture},
  author={Botev, C.},
  booktitle={Proceedings of the 20th ACM conference on Information and knowledge management},
  year={2011},
  organization={ACM}
}
Recommendations as a conversation with the user
Daniel Tunkelang

In the 5th ACM International Conference on Recommender Systems (RecSys 2011)

Abstract

Recommender systems aim to provide users with products or content that satisfy the users’ stated or inferred needs. The primary evaluation measures for recommender systems emphasize either the perceived relevance of the recommendations or the actions associated with those recommendations (e.g., purchases or clicks). Unfortunately, this transactional emphasis neglects how users interact with recommendations in the context of information seeking tasks. The effectiveness of this interaction determines the user’s experience beyond a single transaction. This tutorial explores the role of recommendations as part of a conversation between the user and an information seeking system. The tutorial does not require any special background in interfaces or usability, and will focus on practical techniques to make recommender systems most effective for users.

BiBTeX

@inproceedings{tunkelang2011recommendations,
  title={Recommendations as a Conversation with the User},
  author={Tunkelang, D.},
  booktitle={Proceedings of the fifth ACM conference on Recommender systems},
  pages={11--12},
  year={2011},
  organization={ACM}
}
Social Navigation: A Position Paper
Daniel Tunkelang, Jonathan Koren, Paul Ogilvie, and John Wang

In the Fifth Workshop on Human-Computer Interaction and Information Retrieval (HCIR 2011)

Abstract

In this position paper, we propose social navigation as a paradigm for information access. We define social navigation as navigation through explicit manipulation of a social lens and offer examples of its application.

BiBTeX

@inproceedings{tunkelang2011social,
  title={Social Navigation: A Position Paper},
  author={Tunkelang, D. and Koren, J. and Ogilvie, P. and Wang, J.},
  booktitle={Proc. of the Fifth Workshop on Human-Computer Interaction and Information Retrieval},
  pages={4},
  year={2011}
}
Is it Time to Abandon Abandonment?
Abhimanyu Lad and Daniel Tunkelang

In the Fifth Workshop on Human-Computer Interaction and Information Retrieval (HCIR 2011)

Abstract

Commonly used click-based measures like abandonment and mean reciprocal rank (MRR) present an incomplete, and often misleading, picture of search performance, especially in rich user interfaces that support a wide range of search behaviors. We propose a search utility framework that is based on a holistic view of the information-seeking process. First, we go beyond the use of clicks as indicators of relevance, taking into account actions performed on search results as more reliable indicators of completion of the user’s underlying task. Second, instead of looking only at individual queries, we consider the entire search session comprising multiple queries that are meant to address a single information need. We argue that an evaluation metric that combines these two features more accurately reflects the effectiveness of the system as perceived by the user. Finally, we propose future experiments to operationalize as well as validate this framework.

BiBTeX

@inproceedings{lad2011abandonment,
  title={Is it Time to Abandon Abandonment?},
  author={Lad, A. and Tunkelang, D.},
  booktitle={Proc. of the Fifth Workshop on Human-Computer Interaction and Information Retrieval},
  pages={4},
  year={2011}
}
Faceted Search Log Query Analysis
Jonathan Koren

In the Fifth Workshop on Human-Computer Interaction and Information Retrieval (HCIR 2011)

Abstract

In this paper we present an analysis of search logs of a faceted search interface of LinkedIn, a popular social network with approximately 100 million users. 115 million search sessions from 22 million distinct users were collected. This analysis focuses on how users utilize facets in concert with traditional keyword search, and what types of facets users are more likely to find useful.

This analysis can be used to improve facet and facet-value ranking algorithms and improve models of user behavior. We believe that this is one of a few public analyses of a faceted search system, the first analysis of this size, and the first that considers social network search.

BiBTeX

@inproceedings{koren2011faceted,
  title={Faceted Search Log Query Analysis},
  author={Koren, J.},
  booktitle={Proc. of the Fifth Workshop on Human-Computer Interaction and Information Retrieval},
  pages={4},
  year={2011}
}
Using Paxos to build a scalable, consistent, and highly available datastore
Jun Rao, Eugene Shekita, and Sandeep Tata

In the 37th International Conference on Very Large Data Bases (VLDB 2011)

Abstract

Spinnaker is an experimental datastore that is designed to run on a large cluster of commodity servers in a single datacenter. It features key-based range partitioning, 3-way replication, and a transactional get-put API with the option to choose either strong or timeline consistency on reads. This paper describes Spinnaker’s Paxos-based replication protocol. The use of Paxos ensures that a data partition in Spinnaker will be available for reads and writes as long as a majority of its replicas are alive. Unlike traditional master-slave replication, this is true regardless of the failure sequence that occurs. We show that Paxos replication can be competitive with alternatives that provide weaker consistency guarantees. Compared to an eventually consistent datastore, we show that Spinnaker can be as fast or even faster on reads and only 5% to 10% slower on writes.

BiBTeX

@article{rao2011using,
  title={Using Paxos to build a scalable, consistent, and highly available datastore},
  author={Rao, J. and Shekita, E.J. and Tata, S.},
  journal={Proceedings of the VLDB Endowment},
  volume={4},
  number={4},
  pages={243--254},
  year={2011},
  publisher={VLDB Endowment}
}
High-precision phrase-based document classification on a modern scale
Ron Bekkerman and Matan Gavish

In the 17th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2011)

Abstract

We present a document classification system that employs lazy learning from labeled phrases, and argue that the system can be highly effective whenever the following property holds: most of the information on document labels is captured in phrases. We call this property near sufficiency. Our research contribution is twofold: (a) we quantify the near sufficiency property using the Information Bottleneck principle and show that it is easy to check on a given dataset; (b) we reveal that in all practical cases—from small-scale to very large-scale—manual labeling of phrases is feasible: the natural language constrains the number of common phrases composed of a vocabulary to grow linearly with the size of the vocabulary. Both these contributions provide a firm foundation for the applicability of the phrase-based classification (PBC) framework to a variety of large-scale tasks. We deployed the PBC system on the task of job title classification, as a part of LinkedIn’s data standardization effort. The system significantly outperforms its predecessor both in terms of precision and coverage. It is currently being used in LinkedIn’s ad targeting product, and more applications are being developed. We argue that PBC excels in high explainability of the classification results, as well as in low development and low maintenance costs. We benchmark PBC against existing high-precision document classification algorithms and conclude that it is most useful in multilabel classification.

BiBTeX

@inproceedings{bekkerman2011high,
  title={High-precision phrase-based document classification on a modern scale},
  author={Bekkerman, R. and Gavish, M.},
  booktitle={Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining},
  pages={231--239},
  year={2011},
  organization={ACM}
}
Identifying similar people in professional social networks with discriminative probabilistic models
Suleyman Cetintas, Monica Rogati, Luo Si, and Yi Fang

In the 34th Annual International ACM Conference on Research & Development on Information Retrieval (SIGIR 2011)

Abstract

Identifying similar professionals is an important task for many core services in professional social networks. Information about users can be obtained from heterogeneous information sources, and different sources provide different insights on user similarity.

This paper proposes a discriminative probabilistic model that identifies latent content and graph classes for people with similar profile content and social graph similarity patterns, and learns a specialized similarity model for each latent class. To the best of our knowledge, this is the first work on identifying similar professionals in professional social networks, and the first work that identifies latent classes to learn a separate similarity model for each latent class. Experiments on a real-world dataset demonstrate the effectiveness of the proposed discriminative learning model.

BiBTeX

@inproceedings{cetintas2011identifying,
  title={Identifying similar people in professional social networks with discriminative probabilistic models},
  author={Cetintas, S. and Rogati, M. and Si, L. and Fang, Y.},
  booktitle={Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval},
  pages={1209--1210},
  year={2011},
  organization={ACM}
}
Efficiently evaluating graph constraints in content-based publish/subscribe
Andrei Broder, Shirshanka Das, Marcus Fontoura, Bhaskar Ghosh, Vanja Josifovski, Jayavel Shanmugasundaram, and Sergei Vassilvitskii

In the 20th International World Wide Web Conference (WWW 2011)

Abstract

We introduce the problem of evaluating graph constraints in content-based publish/subscribe (pub/sub) systems. This problem formulation extends traditional content-based pub/sub systems in the following manner: publishers and subscribers are connected via a (logical) directed graph G with node and edge constraints, which limits the set of valid paths between them. Such graph constraints can be used to model a Web advertising exchange (where there may be restrictions on how advertising networks can connect advertisers and publishers) and content delivery problems in social networks (where there may be restrictions on how information can be shared via the social graph). In this context, we develop efficient algorithms for evaluating graph constraints over arbitrary directed graphs G. We also present experimental results that demonstrate the effectiveness and scalability of the proposed algorithms using a realistic dataset from Yahoo!’s Web advertising exchange.

BiBTeX

@inproceedings{broder2011efficiently,
  title={Efficiently evaluating graph constraints in content-based publish/subscribe},
  author={Broder, A. and Das, S. and Fontoura, M. and Ghosh, B. and Josifovski, V. and Shanmugasundaram, J. and Vassilvitskii, S.},
  booktitle={Proceedings of the 20th international conference on World wide web},
  pages={497--506},
  year={2011},
  organization={ACM}
}

2010

Modeling relationship strength in online social networks
Rongjing Xiang, Jennifer Neville, and Monica Rogati

In the 19th International World Wide Web Conference (WWW 2010)

Abstract

Previous work analyzing social networks has mainly focused on binary friendship relations. However, in online social networks the low cost of link formation can lead to networks with heterogeneous relationship strengths (e.g., acquaintances and best friends mixed together). In this case, the binary friendship indicator provides only a coarse representation of relationship information. In this work, we develop an unsupervised model to estimate relationship strength from interaction activity (e.g., communication, tagging) and user similarity. More specifically, we formulate a link-based latent variable model, along with a coordinate ascent optimization procedure for the inference. We evaluate our approach on real-world data from Facebook and LinkedIn, showing that the estimated link weights result in higher autocorrelation and lead to improved classification accuracy.

BiBTeX

@inproceedings{xiang2010modeling,
  title={Modeling relationship strength in online social networks},
  author={Xiang, R. and Neville, J. and Rogati, M.},
  booktitle={Proceedings of the 19th international conference on World wide web},
  pages={981--990},
  year={2010},
  organization={ACM}
}

2008

The social (open) workspace
David A. Evans, Susan Feldman, Ed H. Chi, Nataša Milic-Frayling, and Igor Perisic

In the ACM 17th Conference on Information and Knowledge Management (CIKM 2008)

Abstract

Social networking promises individuals new dimensions of freedom to interact, associate, and give expression to their talents. Recently, systems such as Mechanical Turk have started to facilitate self-organizing collaboration on work-related tasks. Such developments raise interesting questions. Is it possible to create (and sustain) businesses that do not have traditional, formal structure - without traditional “employees”? Can we find and organize (and optimize) talent on the web for task-oriented work - spontaneously and efficiently? How do people relate to one another in possibly evanescent workgroups? One aspect of the challenge in the Social Workspace is understanding and modeling the user behavior and the economic basis for creating, preserving, and exchanging value in the marketplace when workgroup identity, orientations to property, recruiting and managing appropriate talent are not organized under traditional company structures. Another aspect is the technology needed to support virtual organizations and work. The panel will discuss trends in social work and the evolving (scientific) basis of our understanding of new models of workers and organizations.

BiBTeX

@inproceedings{evans2008social,
  title={The social (open) workspace},
  author={Evans, D.A. and Feldman, S. and Chi, E.H. and Milic-Frayling, N. and Perisic, I.},
  booktitle={Proceedings of the 17th ACM conference on Information and knowledge management},
  pages={1529--1529},
  year={2008},
  organization={ACM}
}