Moving Beyond CTR: Better Recommendations Through Human Evaluation

[This article was first published on Edwin Chen's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Imagine you’re building a recommendation algorithm for your new online site. How do you measure its quality, to make sure that it’s sending users relevant and personalized content? Click-through rate may be your initial hope…but after a bit of thought, it’s not clear that it’s the best metric after all.

Take Google’s search engine. In many cases, improving the quality of search results will decrease CTR! For example, the ideal scenario for queries like When was Barack Obama born? is that users never have to click, since the question should be answered on the page itself.

Or take Twitter, who one day might want to recommend you interesting tweets. Metrics like CTR, or even number of favorites and retweets, will probably optimize for showing quick one-liners and pictures of funny cats. But is a Reddit-like site what Twitter really wants to be? Twitter, for many people, started out as a news site, so users may prefer seeing links to deeper and more interesting content, even if they’re less likely to click on each individual suggestion overall.

Or take eBay, who wants to help you find the products you want to buy. Is CTR a good measure? Perhaps not: more clicks may be an indication that you’re having trouble finding what you’re looking for. What about revenue? That might not be ideal either: from the user perspective, you want to make your purchases at the lowest possible price, and by optimizing for revenue, eBay may be turning you towards more expensive products that make you a less frequent customer in the long run.

And so on.

So on many online sites, it’s unclear how to measure the quality of personalization and recommendations using metrics like CTR, or revenue, or dwell time, or whatever. What’s an engineer to do?

Well, consider the fact that many of these are relevance algorithms. Google wants to show you relevant search results. Twitter wants to show you relevant tweets and ads. Netflix wants to recommend you relevant movies. LinkedIn wants to find you relevant people to follow. So why, so often, do we never try to measure the relevance of our models?

I’m a big fan of man-in-the-machine techniques, so to get around this problem, I’m going to talk about a human evaluation approach to measuring the performance of personalization and discovery products. In particular, I’ll use the example of related book suggestions on Amazon as I walk through the rest of this post.

Amazon, and Moving Beyond Log-Based Metrics

(Let’s continue motivating a bit why log-based metrics are often imperfect measures of relevance and quality, as this is an important but difficult-to-understand point.)

So take Amazon’s Customers Who Bought This Item Also Bought feature, which tries to show you related books.

Amazon Related Books

To measure its effectiveness, the standard approach is to run a live experiment and measure the change in metrics like revenue or CTR.

But imagine we suddenly replace all of Amazon’s book suggestions with pornography. What’s going to happen? CTR’s likely to shoot through the roof!

Or suppose that we replace all of Amazon’s related books with shinier, more expensive items. Again, CTR and revenue are likely to increase, as the flashier content draws eyeballs. But is this anything but a short-term boost? Perhaps the change decreases total sales in the long run, as customers start to find Amazon too expensive for their tastes and move to other marketplaces.

Scenarios like these are the machine learning analogue of turning ads into blinking marquees. While they might increase clicks and views initially, they’re probably not optimizing user happiness or site quality for the future. So how can we avoid them, and ensure that the quality of our suggestions remains consistently high? This is a related books algorithm, after all – so why, by sticking to a live experiment and metrics like CTR, are we nowhere inspecting the relatedness of our recommendations?

Human Evaluation

Solution: let’s inject humans into the process. Computers can’t measure relatedness (if they could, we’d be done), but people of course can.

For example, in the screenshot below, I asked a worker (on a crowdsourcing platform I’ve built on my own) to rate the first three Customers Who Bought This Also Bought suggestions for a book on

Fool's Assassin

(Copying the text from the screenshot:

  • Fool’s Assassin, by Robin Hobb. (original book) “I only just read this recently, but it’s one of my favorites of this year’s releases. This book was so beautifully written. I’ve always been a fan of the Fantasy genre, but often times it’s all stylistically similar. Hobb has a wonderful way with characters and narration. The story was absolutely heartbreaking and lead me wanting another book.”
  • The Broken Eye (Lightbringer Series #3), by Brent Weeks. (first related book) “It’s a good, but not great, recommendation. This book sounds interesting enough. It’s third in a series though, so I’d have to see how I like the first couple of books first.”
  • Dust and Light: A Sanctuary Novel, by Carol Berg. (second related book) “It’s an okay recommendation. I’m not so familiar with this author, but I like some of the premise of the book. I know the saying ‘Don’t judge a book by its cover…’ but the cover art doesn’t really appeal to me and kind of turns me off of the book.”
  • Tower Lord, by Anthony Ryan. (third related book) “It’s a good, but not great, recommendation. Another completely unfamiliar author to me (kind of like this though, it shows me new places to look for books!) This is also a sequel, though, so I’d have to see how I liked the first book before purchasing this one.”)

The recommendations are decent, but already we see a couple ways to improve them:

  • First, two of the recommendations (The Broken Eye and Tower Lord) are each part of a series, but not Book #1. So one improvement would be to ensure that only series introductions get displayed unless they’re followups to the main book.
  • Book covers matter! Indeed, the second suggestion looks more like a fantasy romance novel than the kind of fantasy that Robin Hobb tends to write. (So perhaps B&N should invest in some deep learning…)

CTR and revenue certainly wouldn’t give us this level of information, and it’s not clear that they could even tell us our algorithms are producing irrelevant suggestions in the first place. Nowhere does the related scroll panel make it clear that two of the books are part of a series, so the CTR on those two books would be just as high as if they were indeed the series introductions. And if revenue is low, it’s not clear whether it’s because the suggestions are bad or because, separately, our pricing algorithms need improvement.

So in general, here’s one way to understand the quality of a Do This Next algorithm:

  1. Take a bunch of items (e.g., books if we’re Amazon or Barnes & Noble), and generate their related brethren.
  2. Send these pairs off to a bunch of judges (say, by using a crowdsourcing platform like Mechanical Turk), and ask them to rate their relevance.
  3. Analyze the data that comes back.

Algorithmic Relevance on Amazon, Barnes & Noble, and Google

Let’s make this process concrete. Pretend I’m a newly minted VP of What Customers Also Buy at Amazon, and I want to understand the product’s flaws and stars.

I started by going to Mechanical Turk and asking a couple hundred workers to take a book they enjoyed in the past year, and to find it on Amazon. They’d then take the first three related book suggestions from a different author, rate them on the following scale, and explain their ratings.

  • Great suggestion. I’d definitely buy it. (Very positive)
  • Decent suggestion. I might buy it. (Mildly positive)
  • Not a great suggestion. I probably wouldn’t buy it. (Mildly negative)
  • Terrible suggestion. I definitely wouldn’t buy it. (Very negative)

(Note: I usually prefer a three-point or five-point Likert scale with a “neutral” option, but I was feeling a little wild.)

For example, here’s how a rater reviewed the related books for Anne Rice’s The Wolves of Midwinter.

Anne Rice

So how good are Amazon’s recommendations? Quite good, in fact: 47% of raters said they’d definitely buy the first related book, another 29% said it was good enough that they might buy it, and only 24% of raters disliked the suggestion.

A1 Ratings

The second and third book suggestions, while a bit worse, seem to perform pretty well too: around 65% of raters rated them positive.

What can we learn from the bad ratings? I ran a follow-up task that asked workers to categorize the bad related books, and here’s the breakdown.

Bad Suggestions Breakdown

  • Related but different-subtopic. These were book suggestions that were generally related to the original book, but that were in a different sub-topic that the rater wasn’t interested in. For example, the first related book for Sun Tzu’s The Art of War (a book nominally about war, but which nowadays has become more of a life hack book) was On War (a war treatise), but the rater wasn’t actually interested in the military aspects: “I would not buy this book, because it only focuses on military war. I am not interested in that. I am interested in mental tactics that will help me prosper in life.”
  • Completely unrelated. These were book suggestions that were completely unrelated to the original book. For example, a Scrabble dictionary appearing on The Art of War‘s page.
  • Uninteresting. These were suggestions that were related, but whose storylines didn’t appeal to the rater. “The storyline doesn’t seem that exciting. I am not a dog fan and it’s about a dog.”
  • Wrong audience. These were book suggestions whose target audiences were quite different from the original book’s audiences. In many cases, for example, a related book suggestion would be a children’s book, but the original book would be geared towards adults. “This seems to be a children’s book. If I had a child I would definitely buy this; alas I do not, so I have no need for it.”
  • Wrong book type. Suggestions in this category were items like textbooks or appearing alongside novels.
  • Disliked author. These recommendations were by similar authors, but one that the rater disliked. “I do not like Amber Portwood. I would definitely not want to read a book by and about her.”
  • Not first in series. Some book recommendations would be for an interesting series the rater wasn’t familiar with, but they wouldn’t be the first book in the series.
  • Bad rating. These were book suggestions that had a particularly low Amazon rating.

So to improve their recommendations, Amazon could try improving its topic models, add age-based features to its books, distinguish between textbooks and novels, and invest in series detectors. (Of course, for all I know, they do all this already.)

Competitive Analysis

We now have a general grasp of Amazon’s related book suggestions and how they could be improved, and just like we could quote a metric like a CTR of 6.2% or whatnot, we can also now quote a relevance score of 0.62 (or whatever). So let’s turn to the question of how Amazon compares to other online booksellers like Barnes & Noble and Google Play.

I took the same task I used above, but this time asked raters to review the related suggestions on those two sites as well.

Amazon vs. B&N vs. Google

In short,

  • Barnes & Nobles’s algorithm are almost as good as Amazon’s: the first three suggestions were rated positive 58% of the time, compared to 68% on Amazon.
  • But Play Store recommendations are atrocious : a whopping 51% of Google’s related book recommendations were marked terrible.

Why are the Play Store’s suggestions so bad? Let’s look at a couple examples.

Here’s the Play Store page for John Green’s The Fault in Our Stars, a critics-loved-it book about cancer and romance (and now also a movie).

Fault in Our Stars

Two of the suggestions are completely random: a poorly-rated Excel manual and a poorly-reviewed textbook on sexual health. The others are completely unrelated cowboy books, by a different John Green.

Here’s the page for The Strain. In this case, all the suggestions are in a different language! And there are only four of them.

The Strain

Once again asking raters to categorize all of the Play Store’s bad recommendations…

Google Atrocity

  • 45% of the time, the related book suggestions were completely unrelated to the original book in question. For example: displaying a physics textbook on the page for a romance novel.
  • 32% of the time, there simply wasn’t a book suggestion at all. (I’m guessing the Play Bookstore’s catalog is pretty limited.)
  • 14% of the time, the related books were in a different language.

So despite Google’s state-of-the-art machine learning elsewhere, its Play Store suggestions couldn’t really get much worse.


Let’s step back a bit. So far I’ve been focusing on an absolute judgments paradigm, in which judges rate how relevant a book is to the original on an absolute scale. This model is great for understanding the overall quality of Amazon’s related book algorithm.

In many cases, though, we want to use human evaluation to compare experiments. For example, it’s common at many companies to:

  1. Launch a “human evaluation A/B test” before a live experiment, both to avoid accidentally sending out incredibly buggy experiments to users, as well as to avoid the long wait required in live tests.
  2. Use a human-generated relevance score as a supplement to live experiment metrics when making launch decisions.

For these kinds of tasks, what’s preferable is often a side-by-side model, wherein judges are given two items and asked which one is better. After all, comparative judgments are often much easier to make than absolute ones, and we might want to detect differences at a finer level than what’s available on an absolute scale.

The idea is that we can assign a score to each rating (negative, say, if the rater prefers the control item; positive if the rater prefers the experiment), and we aggregate these to form an overall score for the side-by-side. Then in much the same way that drops in CTR may block an experiment launch, a negative human evaluation score should also give much cause for concern.

Unfortunately, I don’t have an easy way to generate data for a side-by-side (though I could perform a side-by-side on Amazon vs. Barnes & Noble), so I’ll omit an example, but the idea should be pretty clear.


Here’s another subtlety. In my examples above, I asked raters to pick a starting book themselves (one that they read and loved in the past year), and then rate whether they personally would want to read the related suggestions.

Another approach is to pick the starting books for the raters, and then have the rate the related suggestions more objectively, by trying to put themselves in the shoes of someone who’d be interested in the starting item.

Which approach is better? As you can probably guess, there’s no clear answer – it depends on the task and goals at hand.

Pros of the first approach:

  • It’s much more nuanced. It can often be hard to put yourself in someone else’s shoes: would someone reading Harry Potter be interested in Twilight? On the one hand, they’re both fantasy books; on the other hand, Twilight seems a bit more geared towards female audiences.

Pros of the second approach:

  • Sometimes, objectivity is a good thing. (Should you really care if someone dislikes Twilight simply because Edward reminds her of her boyfriend?)
  • Allowing people to choose their starting items may bias certain metrics. For instance, people are much more likely to choose popular books to rate, whereas we might want to measure the quality of Amazon’s suggestions across a broader, more representative slice of its catalog.


Let’s review what I’ve discussed so far.

  1. Online, log-based metrics like CTR and revenue aren’t necessarily the best measures of a discovery algorithm’s quality. Items with high CTR, for example, may simply be racy and flashy, not the most relevant.
  2. So instead of relying on these proxies, let’s directly measure the relevance of our recommendations by asking a pool of human raters.
  3. There are a couple different approaches for sending which items to be judged. We can let the raters choose the items (an approach which is often necessary for personalized algorithms), or we can generate the items ourselves (often useful for more objective tasks like search; this also has the benefit of making it easier to derive popularity-weighted or uniformly-weighted metrics of relevance). We can also take an absolute judgment approach, or use side-by-sides.
  4. By analyzing the data from our evaluations, we can make better launch decisions, discover examples where our algorithms do very well or very poorly, and find patterns for improvement.

What are some of the benefits and applications?

  1. As mentioned, log-based metrics like CTR and revenue don’t always capture the signals we want, so human-generated scores of relevance (or other dimensions) are useful complements.
  2. Human evaluations can make iteration quicker and easier. A algorithm change might normally require weeks of live testing before we gather enough data to know how it affects users, but we can easily have a task judged by humans in a couple hours.
  3. Imagine we’re an advertising company, and we choose which ads to show based on a combination of CTR and revenue. Once we gather hundreds of thousands of relevance judgments from our evaluations, we can build a relevance-based machine learning model to add to the mix, thereby injecting a more direct measure of quality into the system.
  4. How can we decide what improvements need to be made to our models? Evaluations give us very concrete feedback and specific examples about what’s working and what’s wrong.

In the spirit of item-item similarities, here are some other posts readers of this post might also want to read.

And finally, I’ll end with a call for information. Do you run any evaluations, or use crowdsourcing platforms like Mechanical Turk or Crowdflower (whether for evaluation purposes or not)? Or do you want to? I’d love to talk to you to learn more about what you’re doing, so please feel free to send me an email and hit me up!

To leave a comment for the author, please follow the link and comment on their blog: Edwin Chen's Blog. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)