**R on Yixuan's Blog**, and kindly contributed to R-bloggers)

# A Quick View of Recommender System

The main task of recommender system is to predict unknown entries in the

rating matrix based on observed values, as is shown in the table below:

Each cell with number in it is the rating given by some user on a specific

item, while those marked with question marks are unknown ratings that need

to be predicted. In some other literatures, this problem may be named

collaborative filtering, matrix completion, matrix recovery, etc.

A popular technique to solve the recommender system problem is the matrix

factorization method. The idea is to approximate the whole rating matrix

$R_{m\times n}$ by the product of two matrices of lower dimensions,

$P_{k\times m}$ and $Q_{k\times n}$, such that

$$R\approx P^\prime Q$$

Let $p_u$ be the $u$-th column of $P$, and $q_v$ be the

$v$-th column of $Q$, then the rating given by user $u$ on item $v$

would be predicted as $p^\prime_u q_v$.

A typical solution for $P$ and $Q$ is given by the following optimization

problem [1; 2]:

$$\min_{P,Q} \sum_{(u,v)\in R} \left[f(p_u,q_v;r_{u,v})+\mu_P||p_u||_1+\mu_Q||q_v||_1+\frac{\lambda_P}{2} ||p_u||_2^2+\frac{\lambda_Q}{2} ||q_v||_2^2\right]$$

where $(u,v)$ are locations of observed entries in $R$, $r_{u,v}$ is

the observed rating, $f$ is the loss function, and

$\mu_P,\mu_Q,\lambda_P,\lambda_Q$ are penalty parameters

to avoid overfitting.

The process of solving the matrices $P$ and $Q$ is called

model training, and the selection of penalty parameters is

parameter tuning. After obtaining $P$ and $Q$, we can then do

the prediction of $\hat{R}_{u,v}=p^\prime_u q_v$.

# LIBMF and recosystem

LIBMF

is an open source C++ library for recommender system using parallel

matrix factorization, developed by

Dr. Chih-Jen Lin and his research group.

[3]

LIBMF is a parallelized library, meaning that

users can take advantage of multi-core CPUs to speed up the computation.

It also utilizes some advanced CPU features to further improve the performance.

recosystem

(Github) is an R wrapper of

the LIBMF library that inherits most of its features. Additionally, this

package provides a number of user-friendly R functions to

simplify data processing and model building. Also, unlike most other R packages

for statistical modeling that store the whole dataset and model object in

memory, LIBMF (and hence `recosystem`

) can significantly reduce memory use,

for instance the constructed model that contains information for prediction

can be stored in the hard disk, and output result can also be directly

written into a file rather than be kept in memory.

# Overview of recosystem

The usage of `recosystem`

is quite simple, mainly consisting of the following steps:

- Create a model object (a Reference Class object in R) by calling
`Reco()`

. - Specify the data source, either from a data file or from R objects in memory.
- Train the model by calling the
`$train()`

method. A number of parameters

can be set inside the function. - (Optionally) Call the
`$tune()`

method to select best tuning parameters

along a set of candidate values, in order to achieve better model performance. - (Optionally) Export the model via
`$output()`

, i.e. write the factorization matrices

$P$ and $Q$ into files or return them as R objects. - Use the
`$predict()`

method to compute predicted values.

More details are covered in the package

vignette

and the help pages `?recosystem::Reco`

, `?recosystem::data_source`

, `?recosystem::train`

,

`?recosystem::tune`

, `?recosystem::output`

, and `?recosystem::predict`

.

In the next section we will demonstrate how to use `recosystem`

to analyze a

real movie recommendation data set.

# MovieLens Data

The MovieLens website collected many

movie rating data for research use. [4] In this article we download the

MovieLens 1M Dataset

from grouplens,

which contains 1 million ratings from 6000 users and 4000 movies.

The rating data file, `ratings.dat`

, looks like below:

```
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
...
```

Each line has the format `UserID::MovieID::Rating::Timestamp`

,

for example the first line says that User #1 gave Movie #1193 a rating of

5 at certain time point.

In `recosystem`

, we will not use the time information, and the required data

format is `UserID MovieID Rating`

, i.e., the columns are space-separated, and

columns after `Rating`

will be ignored.

Therefore, we first transform the data file into a format that is supported by

`recosystem`

. On Unix-like OS’s, we can use the `sed`

command to replace `::`

by a space:

```
sed -e 's/::/ /g' ratings.dat > ratings2.dat
```

Then we can start to train a recommender, as the following code shows:

```
library(recosystem) # 1
r = Reco() # 2
train_set = data_file("ratings2.dat", index1 = TRUE) # 3
r$train(train_set, opts = list(dim = 20, # 4
costp_l1 = 0, costp_l2 = 0.01, # 5
costq_l1 = 0, costq_l2 = 0.01, # 6
niter = 10, # 7
nthread = 4)) # 8
```

In the code above, line 2 creates a model object such that the training

function `$train()`

can be called from it. Line 3 specifies the data

source – a data file on hard disk. Since in our data user ID and movie ID

start from 1 rather than 0, we use the `index1 = TRUE`

options in the function.

The data can also be read from memory, if the UserID, MovieID and Rating

columns are stored as R vectors. Below shows an alternative way to provide

the training set.

```
dat = read.table("ratings2.dat", sep = " ", header = FALSE,
colClasses = c(rep("integer", 3), "NULL"))
train_set = data_memory(user_index = dat[, 1],
item_index = dat[, 2],
rating = dat[, 3], index1 = TRUE)
```

Line 4 to line 6 set the relevant model parameters: $k, \mu_P,\mu_Q,\lambda_P$,

and $\lambda_Q$, and Line 7 gives the number of iterations. Finally as I have

mentioned previously, LIBMF is a parallelized library, so

users can specify the number of threads that will be working

simultaneously via the `nthread`

parameter. However, when `nthread > 1`

,

the training result is **NOT** guaranteed to be reproducible, even if

a random seed is set.

Now everything looks good, except one inadequacy: the setting of tuning

parameters is ad-hoc, which may make the model sub-optimal.

To tune these parameters, we can call the `$tune()`

function to test

a set of candidate values and use cross validation

to evaluate their performance. Below shows this process:

```
opts_tune = r$tune(train_set, # 9
opts = list(dim = c(10, 20, 30), # 10
costp_l2 = c(0.01, 0.1), # 11
costq_l2 = c(0.01, 0.1), # 12
costp_l1 = 0, # 13
costq_l1 = 0, # 14
lrate = c(0.01, 0.1), # 15
nthread = 4, # 16
niter = 10, # 17
verbose = TRUE)) # 18
r$train(train_set, opts = c(opts_tune$min, # 19
niter = 100, nthread = 4)) # 20
```

The options in line 9 to line 15 are tuning parameters. The tuning function

will evaluate each combination of them and calculate the associated

cross-validated RMSE. The parameter set with the smallest RMSE will be contained

in the returned value, which can then be passed to `$train()`

(Line 19-20).

Finally, we can use the model object to do predictions. The code

below shows how to predict ratings given by the first 20 users

on the first 20 movies.

```
user = 1:20
movie = 1:20
pred = expand.grid(user = user, movie = movie)
test_set = data_memory(pred$user, pred$movie, index1 = TRUE)
pred$rating = r$predict(test_set, out_memory())
library(ggplot2)
ggplot(pred, aes(x = movie, y = user, fill = rating)) +
geom_raster() +
scale_fill_gradient("Rating", low = "#d6e685", high = "#1e6823") +
xlab("Movie ID") + ylab("User ID") +
coord_fixed() +
theme_bw(base_size = 22)
```

# Performance

To make the best use of `recosystem`

, the parallel computing option `nthread`

should be used in the training and tuning step. Also, LIBMF and recosystem can

make use of some advanced CPU features to speed-up computation, if you

compile the package from source and turn on some compiler options.

To build `recosystem`

, one needs a C++ compiler that supports

the C++11 standard. Then you can edit `src/Makevars`

(`src/Makevars.win`

for Windows system)

according to the following guideline:

- The default
`Makevars`

provides generic options that should apply to most

CPUs. - If your CPU supports SSE3

(a list of supported CPUs), add

```
PKG_CPPFLAGS += -DUSESSE
PKG_CXXFLAGS += -msse3
```

- If not only SSE3 is supported but also AVX

(a list of supported CPUs), add

```
PKG_CPPFLAGS += -DUSEAVX
PKG_CXXFLAGS += -mavx
```

After editing the `Makevars`

file, run `R CMD INSTALL recosystem`

to install `recosystem`

.

The plot below shows the effect of parallel computing and the compiler option on

the performance of computation. The y axis is the elapsed time of the model tuning

procedure in the previous example.

## References

[1] Chin, Wei-Sheng, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. 2015a. *A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems*. ACM TIST.

[2] Chin, Wei-Sheng, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. 2015b. *A Learning-Rate Schedule for Stochastic Gradient Methods to Matrix Factorization*. ACM TIST.

[3] Lin, Chih-Jen, Yu-Chin Juan, Yong Zhuang, and Wei-Sheng Chin. 2015. *LIBMF: A Matrix-Factorization Library for Recommender Systems*.

[4] F. Maxwell Harper and Joseph A. Konstan. 2015. *The MovieLens Datasets: History
and Context*. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4,

Article 19 (December 2015), 19 pages.

**leave a comment**for the author, please follow the link and comment on their blog:

**R on Yixuan's Blog**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...