Improving Twitter Search with Real-Time Human Computation

(This is a post from the Twitter Engineering Blog that I wrote with Alpa Jain.)

One of the magical things about Twitter is that it opens a window to the world in real-time. An event happens, and just seconds later, it’s shared for people across the planet to see.

Consider, for example, what happened when Flight 1549 crashed in the Hudson.


When Osama bin Laden was killed.


Or when Mitt Romney mentioned binders during the presidential debates.


When each of these events happened, people instantly came to Twitter – and, in particular, Twitter search – to discover what was happening.

From a search and advertising perspective, however, these sudden events pose several challenges:

  1. The queries people perform have never before been seen, so it’s impossible to know beforehand what they mean. How would you know that #bindersfullofwomen refers to politics, and not office accessories, or that people searching for “Horses and Bayonets” are interested in the debates?
  2. Since these spikes in search queries are so short-lived, there’s only a short window of opportunity to learn what they mean.

So an event happens, people instantly come to Twitter to search for the event, and we need to teach our systems what these queries mean as quickly as we can, because in just a few hours those searches will be gone.

How do we do this? We’ll describe a novel real-time human computation engine we built that allows us to find search queries as soon as they’re trending, send these queries to real humans to be judged, and finally incorporate these human annotations into our backend models.

Overview

Before we delve into the details, here’s an overview of how the system works.

(1) First, we monitor which search queries are currently popular.

Behind the scenes: we run a Storm topology that tracks statistics on search queries.

For example: the query “Big Bird” may be averaging zero searches a day, but at 6pm on October 3, we suddenly see a spike in searches from the US.

(2) Next, as soon as we discover a new popular search query, we send it to our human evaluation systems, where judges are asked a variety of questions about the query.

Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon’s Mechanical Turk service, and then polls Mechanical Turk for a response.

For example: as soon as we notice “Big Bird” spiking, we may ask human judges to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant tweets and ads.

(3) Finally, after a response from a judge is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our human judges tell us that “Big Bird” is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.
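
As a toy illustration of that last step, here’s how an annotation might steer ad candidates (hard-coded data and hypothetical names; real ad selection uses full machine learning models rather than a category lookup):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy illustration only: the annotations and the ad inventory are hard-coded.
    public class AnnotationAwareAdPicker {
        public static void main(String[] args) {
            // Human judgment pushed from the pipeline: "big bird" is currently political.
            Map<String, String> queryCategory = new HashMap<>();
            queryCategory.put("big bird", "politics");

            // A tiny ad inventory, keyed by category.
            Map<String, List<String>> adsByCategory = new HashMap<>();
            adsByCategory.put("politics", List.of("@barackobama", "@mittromney"));
            adsByCategory.put("entertainment", List.of("ad for Dora the Explorer"));

            // Next time someone searches "big bird", prefer ads matching the judged category.
            String category = queryCategory.getOrDefault("big bird", "entertainment");
            System.out.println("Candidate ads: " + adsByCategory.get(category));
        }
    }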

Let’s now explore the first two steps above in more detail.

Monitoring for popular queries

Storm is a distributed system for real-time computation. In contrast to batch systems like Hadoop, which often introduce delays of hours or more, Storm allows us to run online data processing algorithms to discover search spikes as soon as they happen.

In brief, running a job on Storm involves creating a Storm topology that describes the processing steps that must occur, and deploying this topology to a Storm cluster. A topology itself consists of three things (a minimal wiring sketch follows the list):

  • Tuple streams of data. In our case, these may be tuples of (search query, timestamp).
  • Spouts that produce these tuple streams. In our case, we attach spouts to our search logs, which get written to every time a search occurs.
  • Bolts that process tuple streams. In our case, we use bolts for operations like updating total query counts, filtering out non-English queries, and checking whether an ad is currently being served up for the query.
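
To make these pieces concrete, here is a minimal, self-contained sketch of how such a topology could be wired together with the open-source Storm Java API. The component names (SearchLogSpout, QueryCountBolt) and the hard-coded tuples are illustrative stand-ins, not our production code; in production the spout reads from a Kafka queue, as described in the walkthrough below.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    public class PopularQueryTopologySketch {

        // Hypothetical spout: stands in for the Kafka-backed spout over the search
        // logs, emitting (query, timestamp) tuples.
        public static class SearchLogSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;

            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            public void nextTuple() {
                Utils.sleep(1000); // throttle the demo stream
                collector.emit(new Values("big bird", System.currentTimeMillis()));
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("query", "timestamp"));
            }
        }

        // Hypothetical bolt: keeps a running count per query and re-emits
        // (query, count) for downstream popularity checks.
        public static class QueryCountBolt extends BaseBasicBolt {
            private final Map<String, Long> counts = new HashMap<>();

            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String query = tuple.getStringByField("query");
                long count = counts.merge(query, 1L, Long::sum);
                collector.emit(new Values(query, count));
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("query", "count"));
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("search-logs", new SearchLogSpout());
            // fieldsGrouping sends every tuple for a given query to the same task,
            // so that query's count is maintained in one place.
            builder.setBolt("query-counter", new QueryCountBolt())
                   .fieldsGrouping("search-logs", new Fields("query"));

            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("popular-queries", new Config(), builder.createTopology());
            Utils.sleep(10_000);
            cluster.shutdown();
        }
    }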

Here’s a step-by-step walkthrough of how our popular query topology works:

  1. Whenever you perform a search on Twitter, the search request gets logged to a Kafka queue.
  2. The Storm topology attaches a spout to this Kafka queue, and the spout emits a tuple containing the query and other metadata (e.g., the time the query was issued and its location) to a bolt for processing.
  3. This bolt updates the count of the number of times we’ve seen this query, checks whether the query is “currently popular” (using various statistics like time-decayed counts, the geographic distribution of the query, and the last time this query was sent for annotations), and dispatches it to our human computation pipeline if so (a toy sketch of this bookkeeping follows the list).
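
The details of our scoring are internal, but as a rough illustration of the time-decayed counts and re-judging cooldown mentioned above, here is a toy decayed counter; the half-life, threshold, and cooldown constants are made up for the example:

    import java.util.HashMap;
    import java.util.Map;

    // Toy illustration of a time-decayed counter with a re-judge cooldown; the
    // half-life, threshold, and cooldown constants are made up for the example.
    public class DecayedQueryStats {
        private static final double HALF_LIFE_SECONDS = 600.0;     // 10-minute half-life
        private static final double POPULARITY_THRESHOLD = 50.0;   // decayed-count cutoff
        private static final long REJUDGE_COOLDOWN_MS = 3_600_000; // 1 hour between judgments

        private static class Entry {
            double decayedCount = 0.0;
            long lastUpdateMs = 0;
            long lastJudgedMs = 0;
        }

        private final Map<String, Entry> stats = new HashMap<>();

        // Record one search for `query` at time `nowMs`; return true if the query
        // should be dispatched to the human computation pipeline.
        public boolean observe(String query, long nowMs) {
            Entry e = stats.computeIfAbsent(query, q -> new Entry());

            // Exponentially decay the old count, then add this search.
            double elapsedSec = (nowMs - e.lastUpdateMs) / 1000.0;
            e.decayedCount = e.decayedCount * Math.pow(0.5, elapsedSec / HALF_LIFE_SECONDS) + 1.0;
            e.lastUpdateMs = nowMs;

            // Dispatch if the query is spiking and we haven't judged it recently
            // (search intent drifts, so previously judged queries do come back).
            if (e.decayedCount >= POPULARITY_THRESHOLD
                    && nowMs - e.lastJudgedMs >= REJUDGE_COOLDOWN_MS) {
                e.lastJudgedMs = nowMs;
                return true;
            }
            return false;
        }
    }

In the real topology, this kind of bookkeeping lives inside a bolt and also folds in signals like the geographic distribution of the query.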

One interesting feature of our popularity algorithm is that we often rejudge queries that have been annotated before, since the intent of a search can change. For example, perhaps people normally search for “Clint Eastwood” because they’re interested in his movies, but during the Republican National Convention users may have wanted to see tweets that were more political in nature.

Human evaluation of popular search queries

At Twitter, we use human computation for a variety of tasks. (See also Clockwork Raven, an open-source crowdsourcing platform we built that makes launching tasks easier.) For example, we run experiments to measure ad relevance and search quality, and we gather data to train and evaluate our machine learning models; in this section, we’ll describe how we use human computation to boost our understanding of popular search queries.

So suppose that our Storm topology has detected that the query “Big Bird” is suddenly spiking. Since the query may remain popular for only a few hours, we send it off to live humans, who can help us quickly understand what it means; this dispatch is performed via a Thrift service that allows us to design our tasks in a web frontend, and later programmatically submit them to crowdsourcing platforms like Mechanical Turk using any of the different languages we use across Twitter.
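
The Thrift interface itself is internal, so the service and types below (QueryAnnotationService, submitQuery, getResult, QueryAnnotation) are hypothetical placeholders; only the transport and protocol setup is standard Apache Thrift client boilerplate in Java. The shape of the interaction, though, is the one described above: submit the query as a task, then poll until a judge responds.

    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.protocol.TProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class AnnotationDispatchSketch {
        public static void main(String[] args) throws Exception {
            // Standard Thrift client setup; the host and port are placeholders.
            TTransport transport = new TSocket("annotation-service.example.com", 9090);
            transport.open();
            TProtocol protocol = new TBinaryProtocol(transport);

            // QueryAnnotationService is a hypothetical Thrift-generated client.
            QueryAnnotationService.Client client = new QueryAnnotationService.Client(protocol);

            // Submit the spiking query as a crowdsourcing task (e.g., on Mechanical Turk)...
            String taskId = client.submitQuery("Big Bird");

            // ...then poll until a judge responds, and hand the annotation to the backend.
            // Hypothetical behavior: getResult returns null until a judge has responded
            // (a real Thrift service might instead return a status field or throw).
            QueryAnnotation annotation = client.getResult(taskId);
            while (annotation == null) {
                Thread.sleep(5000); // poll every few seconds
                annotation = client.getResult(taskId);
            }
            System.out.println("Judged category: " + annotation.getCategory());
            transport.close();
        }
    }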

On our crowdsourcing platforms, judges are asked several questions about the query that help us serve better ads. Without going into the exact questions, here is the flavor of a few possibilities (a hypothetical sketch of the resulting annotation follows the list):

  • What category does the query belong to? For example, “Stanford” may typically be an education-related query, but perhaps there’s a football game between Stanford and Berkeley at the moment, in which case the current search intent would be sports.
  • Does the query refer to a person? If so, who, and what is their Twitter handle if they have one? For example, the query “Happy Birthday Harry” may be trending, but it’s hard to know beforehand which of the numerous celebrities named Harry it’s referring to. Is it One Direction’s Harry Styles, in which case the searcher is likely to be interested in teen pop? Harry Potter, in which case the searcher is likely to be interested in fantasy novels? Or someone else entirely?
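
Fleshing out the hypothetical QueryAnnotation type from the earlier Thrift sketch, a single judgment might carry fields along these lines (again, illustrative, not our actual schema):

    // Hypothetical payload for one human judgment; field names are illustrative.
    public class QueryAnnotation {
        private final String query;           // e.g., "Happy Birthday Harry"
        private final String category;        // e.g., "music", "politics", "sports"
        private final boolean refersToPerson; // does the query name a person?
        private final String twitterHandle;   // the person's handle, or null if none
        private final long judgedAtMillis;    // when the judge responded

        public QueryAnnotation(String query, String category, boolean refersToPerson,
                               String twitterHandle, long judgedAtMillis) {
            this.query = query;
            this.category = category;
            this.refersToPerson = refersToPerson;
            this.twitterHandle = twitterHandle;
            this.judgedAtMillis = judgedAtMillis;
        }

        public String getQuery() { return query; }
        public String getCategory() { return category; }
        public boolean refersToPerson() { return refersToPerson; }
        public String getTwitterHandle() { return twitterHandle; }
        public long getJudgedAtMillis() { return judgedAtMillis; }
    }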

Turkers in the machine

Since humans are core to this system, let’s describe how our workforce was designed to give us fast, reliable results.

To ensure high quality, we complete all our tasks using a small custom pool of judges. Other typical possibilities in the crowdsourcing world are to use a static set of in-house judges, to use the standard worker filters that Amazon provides, or to go through an outside company like Crowdflower. We’ve experimented with these other solutions, and while they have their own benefits, we found that a custom pool fit our needs best for a few reasons:

  • In-house judges can provide high-quality work as well, but they usually work standard hours (for example, 9 to 5 if they work onsite, or a relatively fixed and limited set of hours if they work from home); it can be difficult to communicate with them and schedule them for work; and it’s hard to scale the hiring of more judges.
  • Using Crowdflower or Amazon’s standard filters makes it easy to scale the workforce, but their trust algorithms aren’t perfect, so an endless problem is that spammy workers get through and many of the judgments end up being of very poor quality. Two methods of combating low quality are to seed gold-standard examples (for which you know the true response) throughout your task, and to use statistical analysis to determine which workers are the good ones; but these can be time-consuming and expensive to apply, and we often run free-response, researchy tasks for which they don’t work. Another problem is that using these filters gives you a fluid, constantly changing set of workers, which makes them hard to train.

In contrast:

  • Our custom pool of judges works virtually all day. For many of them, this is a full-time job, and they’re geographically distributed, so our tasks complete quickly at all hours; we can easily ask for thousands of judgments before lunch and have them finished by the time we get back, which makes iterating on our experiments much easier.
  • We have several forums, mailing lists, and even live chatrooms set up, all of which makes it easy for judges to ask us questions and to respond to feedback. Our judges will even give us suggestions on how to improve our tasks; for example, when we run categorization tasks, they’ll often report helpful categories that we should add.
  • Since we only launch tasks on demand, and Amazon provides a ready source of workers if we ever need more, our judges are never idly twiddling their thumbs waiting for tasks or completing busywork, and our jobs are rarely backlogged.
  • Because our judges are culled from the best of the crowdsourcing world, they’re experts at the kinds of tasks we send, and they can often provide higher quality at a faster rate than even in-house judges. For example, they’ll often use the forums and chatrooms to collaborate amongst themselves to give us the best judgments, and they’re already familiar with the Firefox and Chrome scripts that help them work most efficiently.

All the benefits described above are especially valuable in this real-time search annotation case:

  • Having highly trusted workers means we don’t need to wait for multiple annotations on a single search query to confirm validity, so we can send responses to our backend as soon as a single judge responds. This entire pipeline is designed for real-time, after all, so the lower the latency of the human evaluation step, the better.
  • The static nature of our custom pool means that the judges are already familiar with our questions, and don’t need to be trained again.
  • Because our workers aren’t limited to a fixed schedule or location, they can work anywhere, anytime – which is a requirement for this system, since global event spikes on Twitter are not beholden to a 9-to-5.
  • And with the multiple easy avenues of communication we have set up, it’s easy for us to answer questions that might arise when we add new questions or modify existing ones.

Thanks

Thanks to everyone on the Revenue and Storm teams, as well as our Turkers, for helping us launch this project.
