Hybrid content-based and collaborative filtering recommendations with {ordinal} logistic regression (2): Recommendation as discrete choice

April 14, 2017

In this continuation of “Hybrid content-based and collaborative filtering recommendations with {ordinal} logistic regression (1): Feature engineering”, I describe the application of the {ordinal} package's clm() function to test a new, hybrid content-based, collaborative filtering approach to recommender engines, by fitting a class of ordinal logistic (aka ordered logit) models to ratings data from the MovieLens 100K dataset. All R code used in this project can be obtained from the respective GitHub repository; the chunks of code in the body of the post illustrate the essential steps only. The MovieLens 100K dataset can be obtained from the GroupLens research laboratory of the Department of Computer Science and Engineering at the University of Minnesota. This second part of the study relies on the R code in OrdinalRecommenders_3.R and presents the model training, cross-validation, and the analyses. But before we approach the recommendation problem from the viewpoint of discrete choice modeling, let me briefly remind you of the results of the feature engineering phase and explain what happens in OrdinalRecommenders_2.R, which prepares the data frames that are passed to clm().

Enter explanatory variables

In the feature engineering phase (OrdinalRecommenders_1.R) we produced four matrices: two user-user and two item-item. Of these four matrices, two are proximity matrices with Jaccard distances derived from binarized content-based features: for users, we used binary-featured vectors that describe each user by what movies he or she has or has not rated, while for items we derived the proximity matrix from binarized information on movie genre and decade of production. In addition, we have two Pearson correlation matrices derived directly from the ratings matrix, one describing the user-user similarity and the other the similarity between the items. The latter two matrices represent the memory-based component of the model, while the former two carry the content-based information. In OrdinalRecommenders_2.R, these four matrices are (a minimal sketch of how such matrices can be computed follows the list):

  • usersDistance – user-user Jaccard distances,
  • UserUserSim – user-user Pearson correlations,
  • itemsDistance – item-item Jaccard distances, and
  • ItemItemSim – item-item Pearson correlations.
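
As a minimal sketch of how such matrices can be produced – the objects userBin (a binary user x feature matrix) and ratings (a user x item ratings matrix with NAs for missing ratings) are illustrative assumptions, not the repository's:

# - dist() with method = "binary" yields Jaccard distances:
usersDistance <- as.matrix(dist(userBin, method = "binary"))
# - cor() with pairwise deletion yields Pearson correlations between
# - users; cor(ratings, ...) would give the item-item analogue:
UserUserSim <- cor(t(ratings), use = "pairwise.complete.obs")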

OrdinalRecommenders_2.R essentially performs only the following operations over these matrices: for each user-item rating present in ratingsData (the full ratings matrix),

  • from the user-user proximity/similarity matrix,
    select the 100 users who have rated the item under consideration and
    who are closest to the user under consideration;
  • place them in increasing order of distance/decreasing order of similarity from the current user;
  • copy their respective ratings of the item under consideration.

For the item-item matrices, the procedure is similar (a minimal sketch of the selection step follows the lists):

  • from the item-item proximity/similarity matrix,
    select the 100 items that the user under consideration has rated and that
    are closest to the item under consideration;
  • place them in increasing order of distance/decreasing order of similarity from the current item;
  • copy the user's respective ratings of those items.
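
Here is a minimal sketch of the user-user selection step for a single user-item pair. The objects userUserDist (a user-user distance matrix) and ratings (a user x item ratings matrix with NAs for missing entries) are illustrative assumptions, not objects from the repository:

nearestRatings <- function(user, item, userUserDist, ratings, n = 100) {
  # - users who have rated the item, excluding the target user:
  raters <- setdiff(which(!is.na(ratings[, item])), user)
  # - order them by increasing distance from the target user:
  raters <- raters[order(userUserDist[user, raters])]
  # - copy the (up to) n closest neighbours' ratings of the item,
  # - padding with NA when fewer than n such neighbours exist:
  out <- rep(NA_real_, n)
  picked <- head(raters, n)
  out[seq_along(picked)] <- ratings[picked, item]
  out
}

The item-item step is analogous, with the roles of users and items swapped.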

Note that this procedure of selecting the nearest neighbours’ ratings produces missing data almost by necessity: a particular user need not have rated 100 items similar to the item under consideration, and for a particular user and item we will not necessarily find 100 similar users’ ratings of the same item. This would be reflected in a decrease of the sample size as one builds progressively more feature-encompassing ordered logit models. To avoid this problem, I first reduced the sample size so as to keep only complete observations for the most feature-encompassing model that I planned to test: the one relying on 30 nearest neighbours from all four matrices (i.e., 4 * 30 = 120 regressors, plus the four threshold (intercept) parameters needed for the ordered logit over a 5-point rating scale); 87,237 observations from the dataset were kept.

For the sake of completeness, I have also copied the respective distance and similarity measures from the matrices; while the first 100 columns of the output matrices contain ratings only, the following 100 columns hold the respective distance/similarity measures. The latter will not be used in the remainder of the study. The resulting four matrices are:

  • proxUsersRatingsFrame – for each
    user-item combination in the data set, these are the ratings of the item
    under consideration from the 100 closest neighbours in the user-user
    content-based proximity matrix;
  • simUsersRatingsFrame – for each
    user-item combination in the data set, these are the ratings of the item
    under consideration from the 100 closest neighbours in the user-user
    memory-based similarity matrix;
  • proxItemsRatingsFrame – for each
    user-item combination in the data set, these are the ratings from the
    user under consideration of the 100 closest neighbours in the item-item
    content-based proximity matrix; and
  • simItemsRatingsFrame – for
    each user-item combination in the data set, these are the ratings from
    the user under consideration of the 100 closest neighbours in the
    item-item memory-based similarity matrix.

Of course, this can be done faster than by for-looping over 100,000 ratings. Next, the strategy: progressively select a larger numSims – the number of nearest neighbours taken from proxUsersRatingsFrame, simUsersRatingsFrame, proxItemsRatingsFrame, and simItemsRatingsFrame – and pool them together with the ratings from ratingsData. For example, if numSims == 5, we obtain a predictive model Rating ~ 4 matrices * 5 regressors, i.e. 20 regressors in total; if numSims == 10, we obtain Rating ~ 4 matrices * 10 regressors, i.e. 40 regressors in total, and so on. A minimal sketch of this assembly step follows.
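
The sketch assumes the four frames and ratingsData are loaded, and that the neighbours' ratings occupy the first 100 columns of each frame, as described above (mFrame5 is an illustrative name):

numSims <- 5
mFrame5 <- cbind(proxUsersRatingsFrame[, 1:numSims],
                 simUsersRatingsFrame[, 1:numSims],
                 proxItemsRatingsFrame[, 1:numSims],
                 simItemsRatingsFrame[, 1:numSims],
                 Rating = ratingsData$Rating)
dim(mFrame5)[2] - 1  # 4 * 5 = 20 regressors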

Enter the ordered logit model

The idea of treating recommendation as a discrete choice problem is certainly not new; look at the very introductory lines of [1], where the Netflix competition problem is used to motivate the introduction of ordered regressions in discrete choice. We assume that each user has an unobservable, continuous utility function that maps onto his or her item preferences, and of which we learn only by observing the discrete information captured by his or her ratings on a 5-point scale. We also assume that the mapping of the underlying utility onto the categories of the rating scale is controlled by threshold parameters: for a 5-point scale, we estimate the four thresholds for the 1→2, 2→3, 3→4, and 4→5 scale adjustments. These play the role of category-specific intercepts and are usually considered nuisance parameters in ordinal logistic regression. In terms of the unobserved user's utility for items, the threshold parameters “dissect” the hypothesized continuous utility scale into fragments that map onto the discrete rating responses (see Chapters 3 and 4.8 in [1]). Additionally, we estimate a vector of regression coefficients, β, with one coefficient for each explanatory variable in the model: in our case, the respective ratings from the numSims closest neighbours from the similarity and proximity matrices described in the previous section. In the ordered logit setting – and with the useful convention, used in {ordinal}, of writing a minus in front of the xiTβ term (see [2]) – we have:

logit( P(Yi ≤ j) ) = θj − xiᵀβ,   j = 1, …, J − 1

and given that

P(Yi ≤ j) = πi1 + πi2 + … + πij

the logit is cumulative; θj stands for the j-th response category-specific threshold, while i indexes the observations in the dataset. The vectors xi and β play their usual roles. The interpretation of the regression coefficients in β becomes clear once one realizes that the cumulative odds ratios take the form:

[ P(Yi ≥ j | x + 1) / P(Yi < j | x + 1) ] / [ P(Yi ≥ j | x) / P(Yi < j | x) ] = exp(β)

which is independent of the response category j, so that for a unit change in x the cumulative odds of P(Yi ≥ j) increase by a multiplicative factor of exp(β). Back to the recommendation problem: by progressively adding more nearest neighbours’ ratings from the user-user and item-item proximity and similarity matrices, and tracking the behavior of the regression coefficients and predictions from the respective ordered logit models, one can figure out the relative importance of the memory-based information (from the similarity matrices) and the content-based information (from the proximity matrices).
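
To make the interpretation concrete, here is a small, self-contained illustration with the wine data set that ships with {ordinal} – the package vignette's own example, not the MovieLens model:

library(ordinal)
data(wine)
# - rating is an ordered factor (5-point scale); temp and contact
# - are two experimental factors:
fmWine <- clm(rating ~ temp + contact, data = wine)
summary(fmWine)
# - cumulative odds ratios for the two regressors: a unit change in a
# - predictor multiplies the odds of a higher rating by exp(beta),
# - independently of the response category j:
exp(coef(fmWine)[c("tempwarm", "contactyes")])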

I have varied numSims (the number of nearest neighbours used) as seq(5, 30, by = 5) to develop six ordinal models with 20, 40, 60, 80, 100, and 120 regressors, respectively, and performed a 10-fold cross-validation over the 87,237 user-item ratings from MovieLens 100K (OrdinalRecommenders_3.R):

 
### --- 10-fold cross-validation for each:
### --- numSims <- seq(5, 30, by = 5)
numSims <- seq(5, 30, by = 5)
meanRMSE <- numeric(length(numSims))
totalN <- dim(ratingsData)[1]
n <- numeric()
ct <- 0
## -- Prepare modelFrame:
# - select variables so to match the size needed for the most encompassing clm() model:
f1 <- select(proxUsersRatingsFrame,
            starts_with('proxUsersRatings_')[1:numSims[length(numSims)]])
f2 <- select(simUsersRatingsFrame,
            starts_with('simUsersRatings_')[1:numSims[length(numSims)]])
f3 <- select(proxItemsRatingsFrame,
            starts_with('proxItemsRatings_')[1:numSims[length(numSims)]])
f4 <- select(simItemsRatingsFrame,
            starts_with('simItemsRatings_')[1:numSims[length(numSims)]])
# - modelFrame:
mFrame <- cbind(f1, f2, f3, f4, ratingsData$Rating)
colnames(mFrame)[dim(mFrame)[2]] <- 'Rating'
# - Keep complete observations only;
# - to match the size needed for the most encompassing clm() model:
mFrame <- mFrame[complete.cases(mFrame), ]
# - store sample size:
n <- dim(mFrame)[1]
# - Rating as ordered factor for clm():
mFrame$Rating <- factor(mFrame$Rating, ordered = T)
# - clean up a bit:
rm('f1', 'f2', 'f3', 'f4'); gc()
## -- 10-fold cross-validation
set.seed(10071974)
# - folds:
foldSize <- round(length(mFrame$Rating)/10)
foldRem <- length(mFrame$Rating) - 10*foldSize
foldSizes <- rep(foldSize, 9)
foldSizes[10] <- foldSize + foldRem
foldInx <- numeric()
for (i in 1:length(foldSizes)) {
 foldInx <- append(foldInx, rep(i,foldSizes[i]))
}
foldInx <- sample(foldInx)
# CV loop:
for (k in numSims) {
 
 ## -- loop counter
 ct <- ct + 1
 
 ## -- report
 print(paste0("Ordinal Logistic Regression w. ",
              k, " nearest neighbours running:"))
 
 ### --- select k neighbours:
 modelFrame <- mFrame[, c(1:k, 31:(30+k), 61:(60+k), 91:(90+k))]
 modelFrame$Rating <- mFrame$Rating
 
 # - model for the whole data set (no CV):
 mFitAll <- clm(Rating ~ .,
                data = modelFrame)
 saveRDS(mFitAll, paste0("OrdinalModel_NoCV_", k, "Feats.Rds"))
 
 RMSE <- numeric(10)
 for (i in 1:10) {
   # - train and test data sets
   trainFrame <- modelFrame[which(foldInx != i), ]
   testFrame <- modelFrame[which(foldInx == i), ]
   # - model
   mFit <- clm(Rating ~ .,
               data = trainFrame)
   
   # - predict test set:
   predictions <- predict(mFit, newdata = testFrame,
                          type = "class")$fit
   
   # - error on the test fold; note that, as computed here, this is
   # - the mean squared error (no square root is taken):
   # - factor levels 1..5 coincide with the rating values, so
   # - as.numeric() on the predicted classes is safe:
   dataPoints <- length(testFrame$Rating)
   RMSE[i] <-
     sum((as.numeric(predictions) - as.numeric(testFrame$Rating))^2)/dataPoints
   print(paste0(i, "-th fold, RMSE: ", RMSE[i]))
 }
 
 ## -- store mean RMSE over 10 folds:
 meanRMSE[ct] <- mean(RMSE)
 print(meanRMSE[ct])
 
 ## -- clean up:
 rm('testFrame', 'trainFrame', 'modelFrame'); gc()
}

The ten cross-validation folds are assigned once, before the loops. The outer loop varies k in numSims (the number of nearest neighbours selected), selects the respective columns of the prepared dataset, and fits the ordered logit to the whole dataset before the cross-validation; the inner loop performs the 10-fold cross-validation and stores the RMSE from each test fold; the average RMSE over the ten folds is recorded for each k in numSims. This procedure is later modified to test models that work with the proximity or similarity matrices exclusively.
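
As an aside, the fold assignment can be written more compactly; the following sketch is equivalent in effect to the explicit fold-size bookkeeping in the listing (though not byte-identical in how the remainder observations are distributed across folds):

set.seed(10071974)
# - one of ten fold labels per observation, in random order:
foldInx <- sample(rep(1:10, length.out = nrow(mFrame)))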

Results

The following figure presents the average RMSE from
10-fold CV after selecting 5, 10, 15, 20, 25, and 30 nearest neighbours
from the user-user and item-item proximity and similarity matrices. Once
again, the similarity matrices represent the memory-based information,
while the proximity matrices carry the content-based information. The
four different sources of features are termed “feature categories” in
the graph.


Figure 1. The average RMSE from the 10-fold cross-validations from clm()
with 5, 10, 15, 20, 25, and 30 nearest neighbours.

The model with only 20 features and exactly 24 parameters (4 threshold/intercept parameters plus the coefficients for 5 neighbours' ratings from each of the four matrices) achieves an average RMSE of around .52. For comparison, in a recent 10-fold cross-validation over MovieLens 100K with {recommenderlab}, the CF algorithms achieved an RMSE of .991 with 100 nearest neighbours in user-based CF, and an RMSE of .940 with 100 nearest neighbours in item-based CF: our approach almost halves the error with a fivefold reduction in the number of features used. With 120 features (30 nearest neighbours from all four matrices) and 124 parameters, the model achieves an RMSE of approximately .43. With these results, the new approach surpasses many recommender engines that were tested on MovieLens 100K in previous years; for example, compare the results presented here – taking into consideration the fact that the model with only 20 features achieves an RMSE of .517 on the whole dataset (i.e. without cross-validation) – with the results from the application of several different approaches presented by the Recommender Systems project of Carleton College, Northfield, MN.
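
For the record, the arithmetic behind this comparison:

# - relative error reduction vs. the {recommenderlab} baselines:
(0.991 - 0.52) / 0.991  # ~ .48 vs. user-based CF, 100 neighbours
(0.940 - 0.52) / 0.940  # ~ .45 vs. item-based CF, 100 neighbours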

What about the relative importance of memory-based and content-based information in the prediction of user ratings? The following figure presents the median values and ranges of exp(β) for the progressively more complex ordinal models tested here:


Figure 2. The median exp(β) from the clm() ordered logit models estimated
without cross-validation and encompassing 5, 10, 15, 20, 25, and 30 nearest neighbours from the memory-based and content-based matrices.

As is readily observed, the memory-based features (simItemsCoeff for the item-item similarity neighbours and simUsersCoeff for the user-user similarity neighbours in the graph legend) weigh more heavily in the ratings prediction than the content-based features (proxItemsCoeff for the item-item proximity neighbours and proxUsersCoeff for the user-user proximity neighbours). I have performed another round of the experiment by fitting ordinal models with the information from the memory-based, similarity matrices only; Figure 3 presents the average RMSE obtained from 10-fold CVs of these models:


Figure 3. The average RMSE from the 10-fold cross-validations from clm()
with 5, 10, 15, 20, 25, and 30 nearest neighbours for models with memory-based
information only.

The results seem to indicate almost no change in the
RMSEs across the models (but note how the model starts overfitting when
more than 25 features per matrix are used). Could it be that the
content-based information – encoded by the proximity matrices – is not a
necessary ingredient in my predictive model at all? No. But we will
certainly not use RMSE to support this conclusion:

### --- compare the Sim+Prox and Sim-only models via likelihood ratio tests:
modelFiles <- list.files()
modelRes <- matrix(rep(0, 6*7), nrow = 6, ncol = 7)
numSims <- seq(5, 30, by = 5)
ct <- 0
for (k in numSims) {
  ct <- ct + 1
  # - the full (Sim+Prox) model fitted without cross-validation:
  fileNameSimProx <- modelFiles[which(grepl(paste0("CV_", k, "F"), modelFiles, fixed = T))]
  modelSimProx <- readRDS(fileNameSimProx)
  # - the nested, memory-based (Sim-only) model:
  fileNameSim <- modelFiles[which(grepl(paste0("Sim_", k, "F"), modelFiles, fixed = T))]
  modelSim <- readRDS(fileNameSim)
  # - likelihood ratio test for the nested clm() models:
  testRes <- anova(modelSimProx, modelSim)
  modelRes[ct, ] <- c(testRes$no.par, testRes$AIC, testRes$LR.stat[2],
                      testRes$df[2], testRes$`Pr(>Chisq)`[2])
}
modelRes <- as.data.frame(modelRes)
colnames(modelRes) <- c('Parameters_Model1', 'Parameters_Model2',
                        'AIC_Model1', 'AIC_Model2',
                        'LRTest', 'DF', 'Pr_TypeIErr')
write.csv(modelRes, 'ModelSelection.csv')

The results of the likelihood ratio tests, together with the values of the Akaike Information Criterion (AIC) for all models, are provided in Table 1; a comparison of the AIC values is provided in Figure 4. I have compared the models that encompass the top neighbours from both the memory-based and the content-based matrices, on the one hand, with the models encompassing information from the memory-based matrices only, on the other. As you can see, the AIC is consistently lower for the models that incorporate both the content-based and the memory-based information (with the significant LRTs testifying that the contribution of the content-based information cannot be rejected).

[Table 1. Likelihood ratio tests and AIC values for all models.]

Figure 4. Model selection: AIC for ordered logit models
with 5, 10, 15, 20, 25, and 30 nearest neighbours: memory-based
information only (Sim) vs. memory-based and content-based models (Sim+Prox).

Summary

Some say that comparing model-based recommendation systems with CF approaches is not fair in the first place. True. However, achieving a reduction in error of almost 50% while using 80% fewer features is telling. I think the result has two important stories to tell, in fact. The first one is that the comparison is really not fair :-). The second story is, however, much more important. It is not about the application of the relevant model: since the development of discrete choice models in economics, any behavioral scientist who sees a Likert-type response scale probably tends to think of some appropriate ordinal model for the problem at hand. It is about the formulation of the problem: as ever in science, the crucial question turns out to be the choice of the most relevant description of the problem; in cognitive and data science, the question of what representation enables its solution by the simplest possible means of mathematical modeling.

References

[1] Greene, W. H. & Hensher, D. A. (2010). Modeling Ordered Choices. Cambridge: Cambridge University Press.

[2] Christensen, R. H. B. (June 28, 2015). Analysis of ordinal data with cumulative link models – estimation with the R-package ordinal. Accessed March 3, 2017, from The Comprehensive R Archive Network. URL: https://cran.r-project.org/web/packages/ordinal/vignettes/clm_intro.pdf

Goran S. Milovanović, PhD
Data Science Consultant, SmartCat

