Hybrid content-based and collaborative filtering recommendations with {ordinal} logistic regression (1): Feature engineering

[This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


I will use {ordinal} clm() (and other cool R packages such as {text2vec}
as well) here to develop a hybrid content-based, collaborative
filtering, and (obivously) model-based approach to solve the
recommendation problem on the MovieLens 100K dataset in R. All R code
used in this project can be obtained from the respective GitHub repository; the chunks of code present in the body of the post illustrate the essential steps only. The MovieLens 100K dataset can be obtained from the GroupLens research laboratory
of the Department of Computer Science and Engineering at the University
of Minnesota. The first part of the study introduces the new approach
and refers to the feature engineering steps that are performed by the OrdinalRecommenders_1.R script (found on GitHub). The second part, to be published soon, relies on the R code in  OrdinalRecommenders_3.R and presents the model training, cross-validation, and analyses steps. The  OrdinalRecommenders_2.R
script encompasses some tireless for-looping in R (a bad habbit indeed)
across the dataset only in order to place the information from the
dataset in the format needed for the modeling phase. The study aims at
(a) the demonstration of the improvement in predicted ratings for
recommending on a well-known dataset, and (b) attempts to shedd light on
the importance of various types of information in the work of
recommendation engines. Consequently, the code is not suited for use in
production; additional optimizations are straightforward, simple, and
necessary as well.


Avoiding the formal exposition of the problem
altogether, the recommendation problem in the context of contemporary
online markets can expressed in the following way: given (A) the
information on the user’s past ratings of various items (i.e. the  user-item ratings;
for example, the ratings on a 5-point Likert type scale of the movies
or TV shows that she or he has watched in the past, the ratings on
similar scales of the items purchased, and similar), (B) the information
on the user-item ratings of other users, © the attributes of users
and/or items (c.f. the user’s gender, geographical location, age,
occupation, and similar; the genre of the movie, product type, and
similar for items), provide a prediction of the user’s rating of a novel, previously unassessed item.
The predictions are then used to select the novel items (or lists of
items) that a particular user would – by assumption – enjoy to assess,
and eventualy purchase, in the future. The cognitive systems used for
making such predictions are known as recommendation engines, or recommender systems, and are widely used nowadays across the Internet business.

It is generally recognized that recommendation engines can be grouped in two broad categories: (1) content-based systems,  or (2) collaborative filtering systems, with the later additionally described as (1.1) neighborhood or (1.2) model-based methods [1]. Content based
systems (CF) rely on a typical description of items over feature
vectors, and then recommend novel items to users by computing some
similarity metric between them and the items that the user has already
rated. Collaborative filtering systems, on the
other hand, rely on the assumption that the covariations between the
ratings (a) from different users over the same set of items, or (b) to
different items from the same set of users, implicitly carry the
information on item and user attributes. The information about the true
user and item attributes are thus latent in CF. In the neighborhood (or memory) variant of CF, these approaches use the user-item ratings directly and proceed by aggregating the ratings of similar users (user-based CF) and/or similar items (item-based CF)
in order to provide an estimate of the user’s rating of a novel item,
typically weighting the known ratings of similar users/items by the
respective proximity/similarity measures. In the model-based
variant of CF, the systems tries to discover the structure of the
latent factors from the observed behavioral measures (ie. known
user-item ratings), and then a myriad of machine learning algortihms and
statistical approaches can be used to train a predictive model for the
novel user-item ratings of novel items.

I will here introduce a new hybrid approach to
recommender systems with the following characteristics. First, the
system is both content-based and CF. The content-based component of the
system encompasses two matrices: the user-user and the item-item
proximity matrices, both obtained from applying the relevant distance
metric over a set of features that characterize users and items,
respectively. The CF component of the system, relies on the typical
user-user and item-item similarity matrices computed from the known,
past user-item ratings, providing for a memory component of the
recommender. Second, the system is model-based in the sense that it
combines the information from these four matrices in an ordinal logistic
regression model to predict the ratings of novel items on a 5-point
Likert scale. The underlying logic of this approach is to try to express
the recommendation problem as a discrete choice problem, where the
alternatives are provided by ordinal information from a 5-point Likert
scale, while the decision makers and the choice context are described by
a set of content-based and memory-based features, all referring to some
degree of closeness between the unknown user/item combination with the
known user/item in two spaces: the attribute (feature) space and the
neighborhood (memory) space. Because the CF approaches rely on
users-user (or item-item) proximity, the question of how many users (or
items) that are similar to the context of the present, novel user-item
rating should be used in prediction arises naturally. By expressing the
recommender problem as a regression problem essentially, we become able
to judge the significance of additional user-user and/or item-item
similarity information from the change in regression coefficients
obtained by progressively adding more and more information to nested
ordinal logistic models and training them. Thus the new approach also
provides and excellent opportunity for a comparison between
content-based information and memory-based information (the later as
being typically used in CF): which one is more important? Given the
assumption that the former is implicitly present in the later, i.e. that
it presents latent information in the CF approach, does the usage of
manifest memory-based information on user-item ratings completelly
eliminates the need for a content-based description? If not, how the two
should be combined? I will demonstrate how the present approach
resolves this dilemma while resulting in highly improved recommendations
on a well-known and widely used MovieLens dataset, at the same time
reducing drastically the number of features that are needed to build the
relevant prediction model.


Consider the following user-item ratings matrix, encompassing 10 items (columns) and three users (rows):


In this rating matrix, the ratings of user1 and user2
have a maximal Pearson correlation of 1; however, the ratings of user2
and user3 have a linear correlation of .085, sharing around .007 % of
variance only. Upon a closer look, user1 and user2 have only three
common ratings (for items i1, i4, and i5), while user2 and user3 have
rated exactly the same items (i1, i3, i4, i5, i8, i9, and i10). Imagine
that the items rated by user2 and user 3 are all Sci-Fi movies; we can
imagine these two users chating and challenging each other’s views on
the contemporary Sci-Fi production on online forums or fan conferences,
right? They do not have to have perfectly similar tastes, right, but in
fact  they are similar: simply because they both enjoy Sci-Fi. The Jaccard distance
measures the proximity of two vectors A and B, both given over binary
features (e.g. present vs. not present), by computing the ratio of (a)
the difference between the union of A and B and the intersection of A
and B to the (b) union of A and B; it ranges from zero to one and can be
viewed as a proportion of common features that the two vectors share in
the total number of features present in both. The Jaccard similarity
coefficient is given by one minus the Jaccard distance; we will use the
{proxy} dist() function to compute these from our ratings matrix:

### --- Illustration
user1 <- c(5, NA, NA, 4, 3, NA, 1, NA, NA, NA)
user2 <- c(4, NA, 3, 3, 2, NA, NA, 1, 2, 4)
# -- User1 and User2 have a high Pearson Correlation Coefficient:
cor(user1, user2, use = "pairwise.complete.obs")
user3 <- c(3, NA, 1, 5, 4, NA, NA, 2, 5, 4)
# -- User2 and User3 have a low Pearson Correlation Coefficient:
cor(user2, user3, use = "pairwise.complete.obs")
# -- User2 and User3 have a high Jaccard similarity
# -- i.e. 1 minus the Jaccard distance between them:
as.numeric(1 - dist(t(user2), t(user3), method = "jaccard"))
as.numeric(1 - dist(t(user1), t(user2), method = "jaccard"))
as.numeric(1 - dist(t(user1), t(user3), method = "jaccard"))

User2 and User3 who have provided the ratings for the
same items exactly have a Jaccard similarity index of 1; they both have
the Jaccard similarity index of .375 with User1. Then who is the user
similar to User2 - the one whose ratings could be used to approximate
the experiences of User2 with previously unseen items: (a) User1, with
whom he has a memory-based Pearson correlation of 1 on three shared
items, or (b) User3, with whom he has a Pearson correlation of only
.085, but with whom he had shared the same experiences and thus achieved
a maximum level of Jaccard similarity?

The value of the linear correlation coefficient (or
cosine similarity, or any other similar metric) will not preserve the
information on the number of shared items; that information is imlicitly
present in the Type I Error test (for example), not in the extent of
covariance itself. The Jaccard distance, on the other hand, neglects the
covariances from the rating scales altogether, preserving only the
information about the extent of the shared ratings. If we think about
the items rated by a particular user, neglecting the particular rating
values, we understand that we can describe some aspects of the user’s
taste by binarizing his or her ratings; we can know, for example, the
frequency distribution of the item types that he or she has decided to
assess at all. If we additionally compute a distance or similarity
matrix over that information for all users, we express the content-based
approach in terms of the attributes that describe a particular use by
its distance or similarity from other users in terms of the very same
attributes - in a content-based description space. We can add the
avaialable demographic information to the binary vectors before
computing the similarities too and thus enrich the representation. And
we can do the same for items, of course. As we already have the
similarity matrices from linear correlations between the paired users’
ratings at our disposal, we can combine the information from these two
approaches - the content-based and the memory, or neighborhood-based -
and try to predict the unobserved user-item ratings from a unified

The details of feature engineering along these lines in R are provided in the respective section below.


The MovieLens 100K dataset [2] comprises 100,000
ratings (Likert type 1 to 5 scale) from 943 users on 1682 movies;
additionally, there are at least 20 movies rated per user. Some
demographic information for the users is present - age, gender,
occupation, zip - as well as the genre and the release dates for movies.
The dataset can be downloaded from the GroupLens research laboratory
of the Department of Computer Science and Engineering at the University
of Minnesota. Historically, it is among the most popular datasets used
to test the new approaches to the development of recommendation engines.
MovieLens 100K can be also obtained from Kaggle and Datahub. Our  OrdinalRecommenders_1.R
script reads in the components of this dataset and computes the desired
user-user and item-item distance and similarity matrices following some
data reshaping.

Feature engineering

Part 1.A of the  OrdinalRecommenders_1.R
script is used merely to read the original MovieLens 100K dataset
files, rename the columns present there, and save as .csv files; you
will need to run it too in order to use the code from part 1.B and the
remaining scripts. In Part 1.B we begin to operate over the following
three data.frames: (1) the ratingsData, which comprises the ratings on
5-point scales for all user-item pairs from the dataset, (2) usersData,
which comprises demographic information on users (Gender, Age,
Occupation, Zip-code), and (3) moviesData, which comprises all item
information (Title, Release Date, and Genre). Part 1.B of this R script
performs the essentialy feature engineering operations in order to
produce the following four matrices:

Matrix P1 - the Item-Item Proximity Matrix:
the Jaccard distances between the items, computed from the binarized
features encompassing relase decade (derived from the original release
year information, provided in the movie Title column) and movie genre.
This matrix plays the role of the content-based representation of items. The dist2() function from {text2vec} is used on a sparse Matrix class to compute the Jaccard distances here:

### --- Compute moviesDistance w. Jaccard {text2vec} from movie genres
moviesData <- moviesData %>% separate(col = Date, 
                                      into = c('Day', 'Month','Year'), 
                                      sep = "-")
moviesData$Day <- NULL
moviesData$Month <- NULL
moviesData$Year <- as.numeric(moviesData$Year)
# - that would be: [1] 1922 1998
# - Introduce Movie Decade in place of Year:
decadeBoundary <- seq(1920, 2000, by = 10)
moviesData$Year <- sapply(moviesData$Year, function(x) {
  wL <- x < decadeBoundary
  wU <- x >= decadeBoundary
  if (sum(wL) == length(decadeBoundary))  {
  } else if (sum(wU) == length(decadeBoundary)) {
  } else {
    decadeBoundary[max(which(wL-wU == -1))]
# - Match moviesData$Year with ratingsData:
mD <- moviesData %>% 
  select(MovieID, Year)
ratingsData <- merge(ratingsData, mD,
                     by = 'MovieID')
# - Movie Year (now Decade) as binary: 
moviesData <- moviesData %>%
  spread(key = Year,
         value = Year,
         fill = 0,
         sep = "_")
# - compute moviesDistance:
moviesDistance <- moviesData[, 3:ncol(moviesData)]
w <- which(moviesDistance > 0, arr.ind = T)
moviesDistance[w] <- 1
moviesDistance <- dist2(Matrix(as.matrix(moviesData[, 4:ncol(moviesData)])), 
                        method = "jaccard")
moviesDistance <- as.matrix(moviesDistance)
rm(moviesData); gc()
# - save objects and clear:
numMovies <- length(unique(ratingsData$MovieID))
          path = paste0(getwd(),'/moviesDistance.csv'), 
          append = F, col_names = T)
rm(moviesDistance); gc()

Matrix P2 - the User-User Proximity Matrix:
the Jaccard distances between the users, computed from the binarized
features representing whether a particular user has or has not rated a
particular item only. This matrix plays the role of the content-based representation of users:

### --- produce binary User-Item Matrix (who rated what only):
userItemMat <- matrix(rep(0, dim(usersData)[1]*numMovies),
                      nrow = dim(usersData)[1],
                      ncol = numMovies)
userItemMat[as.matrix(ratingsData[c('UserID', 'MovieID')])] <- 1
rm('w', 'ratingsData', 'usersData'); gc()
### --- Compute userDistance w. Jaccard {text2vec}
userItemMat <- Matrix(userItemMat)
usersDistance <- dist2(userItemMat, 
                       method = "jaccard")
rm(userItemMat); gc()
usersDistance <- as.matrix(usersDistance)
          path = paste0(getwd(),'/usersDistance.csv'), 
          append = F, col_names = T)
rm(usersDistance); gc()

Note that I’ve missed to use the user demographics
available in MovieLens 100K. In this project, I will not use them at
all, but the general ratio is: because I am going after an ordinal
logistic regression model, I can use these information as regressors
there. Also, as I have computed the user-user distance by looking at
what movies ratings they have in common, presumeably capturing some
information on shared tastes, I could have done the some for the items,
by inspecting what movies have been rated by the same users. However, a
hunch told me that I might be end up using redundand feature
representations in the same model; I’ve decided to describe the movie
similarities from content-based features directly (genre and the decade
of production), and user similarities from content-based features that
encompass the similarity in tastes (the common movie ratings among

The following two similarity matrices are those
typically used in user and item-based CF; they comprise Pearson
correlation coefficients computed directly from the ratings matrix for
1681 items (note: one movie was ‘unknown’, so I have removed it from the
dataset and cleaned up from the ratings matrix accordingly) and 943
users. Given the moderate size of the desired correlation matrices, I
have used the {base} cor() function on pairwise complete observations to
compute them. For digging out correlation or cosine matrices from
problems larger than the MovieLens 100K, you might want to take a look
at the functions provided by my colleague Stefan Nikolić in his “Recommender Systems: Matrix operations for fast calculation of similarities” (supporting the computations for CF in an order of magnitude lesser time than the popular {recommenderlab} R package). The code produces the remaining two matrices:

Matrix S1 - the User-User Similarity Matrix:
the pairwise complete Pearson correlation coefficients between the
users’ ratings from the ratings matrix directly. This matrix plays the
role of the memory-based representation of users:

### --- Compute User-User and Item-Item Ratings Similarity Matrices
ratingsData <- read_csv("ratingsData_Model.csv",
                        col_names = T)
ratingsData$X1 <- NULL
# - User-Item Ratings Matrix
ratingsData$Timestamp <- NULL
ratingsData <- ratingsData %>% 
  spread(key = MovieID,
         value = Rating, 
         sep = "_") %>% 
# - Pearson Correlations: User-User Sim Matrix
UserUserSim <- ratingsData %>% 
UserUserSim <- t(UserUserSim)
UserUserSim <- cor(UserUserSim, 
                   use = 'pairwise.complete.obs')
UserUserSim <- as.data.frame(UserUserSim)
          path = paste0(getwd(),'/UserUserSim.csv'), 
          append = F, col_names = T)
rm(UserUserSim); gc()

Matrix S2 - the Item-Item Similarity Matrix:
the pairwise complete Pearson correlation coefficients between the
item’ ratings from the ratings matrix directly. This matrix plays the
role of the memory-based representation of items:

# - Pearson Correlations: Item-Item Sim Matrix
ItemItemSim <- ratingsData %>% 
rm(ratingsData); gc()
ItemItemSim <- cor(ItemItemSim,
                   use = 'pairwise.complete.obs')
ItemItemSim <- as.data.frame(as.matrix(ItemItemSim))
          path = paste0(getwd(),'/ItemItemSim.csv'), 
          append = F, col_names = T)
rm(ItemItemSim); gc()


The idea is to combine the content-based with
memory-based information in a unified model to solve the recommendation
problem for online recommender systems. I have used the Jaccard
distances to describe the content-based proximity between users, based
on what items they have rated or not; on the other hand, the Jaccard
distances were also used to described the proximity of items in a
content-based space derived from the “essentialistic” item features like
movie genres and the decade of their production. Additionally, the
typical Pearson correlation similarity matrices - computed directly from
the ratings matrix - were produced for both users and items. The two
user-user and item-item matrices share the same respective
dimensionality. In the next step, the idea is to use the neighborhoods
of different size from these matrices (say, take p similar users from the content-based proximity and the memory-based user-user similarity matrices, and q
similar items from the content-based proximity and the memory-based
item-item matrices) as regressors in an ordinal logistic model to
predict the ordinal 5-point ratings of novel user-item combinations. By
inspecting the regression coefficients from these predictive models
while varying the size of the neighborhood sets of users and items whose
ratings will be used as regressors, I hope to be able to provide some
conclusion on the relative importance of content-based and memory-based
information for building efficient recommendation engines. We’ll see.
But before skipping to  OrdinalRecommenders_3.R to play with the beautiful {ordinal} models, you might wish to run  OrdinalRecommenders_2.R
that will prepare the first 100 neighbours from the proximity and
similarity matrices for you. As I told you, there’s a for loop there -
and a bad one.


[1] Desrosiers, C. & Karypis, G. (2010). A comprehensive survey of neighborhood-based recommendation methods. In Ricci, F., Rokach, L., Shapira, B. & Kantor, P. B. (Eds), Recommender Systems Handbook. Springer: US. DOI:10.1007/978-0-387-85820-3.

[2] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872.

Goran S. Milovanović, PhD
Data Science Consultant, SmartCat

To leave a comment for the author, please follow the link and comment on their blog: The Exactness of Mind.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)