Why build a recommender system?
The most wonderful and most frustrating characteristic of the Internet is its excessive supply of content. As a result, many of today’s commercial giants are not content providers, but content distributors. The success of companies such as Amazon, Netflix, YouTube and Spotify relies on their ability to effectively deliver relevant and novel content to users. However, with such a vast array of content at their fingertips, the search space becomes near impossible to navigate with traditional search methods. It is therefore essential for businesses to exploit the data at their disposal to find similarities between products and user behaviours, in order to make relevant recommendations to users.
The importance of this is further emphasised by phenomena such as the Long Tail, a term popularised by Chris Anderson’s iconic 2004 blog post. This refers to the fact that a large percentage of online distributors’ revenue comes from the sale of less popular items, for which they are able to find a market thanks to their recommendation engines. “If the Amazon statistics are any guide, the market for books that are not even sold in the average bookstore is larger than the market for those that are”.
Another interesting example is Spotify, a company which invests heavily in recommendation, since one of their selling points is their ability to build perfectly curated playlists for individual users. A lesser-known ulterior motive of Spotify’s recommendations is their need to reduce their licensing costs, which are currently growing at a faster rate than their revenue. By recommending relevant songs by emerging artists, towards which Spotify can pay lower licensing fees, the company can reduce their average cost per listen. Similarly, any business with a large product range might find a recommendation engine useful for identifying which products to push to certain customers.
Figure 1 – Anatomy of the Long Tail (https://wired.com/2004/10/tail/)
Now, I hear you cry, “what considerations does one have to make when building a recommender system?” Well, I’m glad you asked!
What metric are we optimising for?
First and foremost, we are trying to solve a business need. Just because an algorithm perfectly predicts a user’s movie rating, this might not necessarily translate to a higher level business metric. What are we optimising for? User retention? An increase in sales? How many items have the average user bookmarked or purchased? How many recommended items users have clicked on? These goals will vary for different business contexts. Even with a well-defined business objective, these are only observations we can make after a model has been trained and deployed, and many successive iterations of A/B testing will be required to establish the usefulness of a model.
Different user profiles
To add further complexity, it is possible that some users respond better to one type of model over another. Then the question arises: do we use varying algorithms for different user profiles, and how do we identify those profiles? This is where a weighted hybrid recommender system might come in. A more adventurous user might prefer more exploratory recommendations, whereas a conservative user may only respond to recommendations which closely relate to their browsing history. How do we balance customer satisfaction with the need to push new content on them? A content distributor may be satisfied with suiting the needs of their customer, whereas a content provider may also want to increase the sale of their less popular items, alongside increasing their customer retention.
Operational costs & algorithm selection
Furthermore, we need to determine whether the operational costs of developing and maintaining an advanced recommender system are worth the potentially marginal improvements in content suggestions. Aside from the cost of hiring researchers and engineers, there can also be large costs associated with training an advanced recommendation engine on the cloud, such as that of Amazon or Spotify. As the size of the user base and item database increase, so will the operational costs. An algorithm which has to compare an item to the whole user database to perform a recommendation (such as memory-based collaborative filtering) is not as scalable as one which uses item properties and metadata to identify similar items (such as content-based recommendations). However, it might also be that a more complex algorithm (such as matrix factorization) – popularised by the Netflix Prize would be able to extract better features from the data, with the caveat of requiring much more time and compute power to train.
This highlights the importance of clearly defining business goals and evaluation metrics before launching into a venture such as this, since A/B testing might reveal that an expensive recommender system which offers marginally better recommendations might not have any better impact on the bottom line than a simple one.
Data Availability / Sparsity
Figure 2 – Data Sparsity (https://ebaytech.berlin/deep-learning-for-recommender-systems-48c786a20e1a)
The second major challenge we face is the sparsity of our dataset. The average user’s activity only provides a limited amount of data relating to their likes and dislikes. The biggest mistake we can make is to assume that a user who has not clicked or rated an item necessarily dislikes that item. The more likely explanation is that the user has not yet discovered it. As a result, missing values need to be ignored, rather than included as dislikes or 0 ratings. However, this results in a very sparse dataset in which users have only interacted with a fraction of the available items. This leads to a few issues – can we guarantee that we have a full picture of this user? How do we make predictions for a new user, for whom we have no data available? This is also known as the “cold start” problem. Potential solutions include recommending the most popular items (YouTube and Amazon home page), user-inductions which request information from a new user (such as Reddit or Quora) or extracting metadata from items to compare them, such as Spotify.
For this reason, implicit feedback is oftentimes preferred. This refers to the use of data such as number of clicks, shares and streaming time. The advantage of this over explicit feedback is that it allows businesses to collect more data on their users, who may otherwise be unwilling to give explicit ratings. It also removes any potential bias towards users who may be particularly expressive of their opinions but do not represent the majority.
However, implicit feedback brings its own set of problems. Whereas a 5-star rating has a predetermined scale, which allows us to adjust any bias towards users who are more critical or complementary than average, implicit feedback is more difficult to deal with. How do we determine the relative value between a click, a like or a sale? In addition, how do we deal with data from a user who may have listened to their favourite song 99 times but also has a special place in their heart for that song they only listen to once a month?
Some algorithms simply ignore values such as play count and transform them into binary 1s and 0s, whereas others use them as a confidence metric for how much a user likes an item. This may be part of the reason why YouTube and Netflix have switched to a like/dislike based system rather than 5-star ratings. Likes might have a better usage than 5-star ratings, and oftentimes confer the same amount of information to a recommender system as a 5-star rating.
In follow up posts, I will explore the different types of recommender systems, followed by an implementation of these using recent technologies such as PyTorch.