**Mad (Data) Scientist**, and kindly contributed to R-bloggers)

Recently I attended a talk by Stanford’s Art Owen, presenting work done with his student, Katelyn Gao. This talk touched on a number of my interests, both mathematical and computational. What particularly struck me was that Art and Katelyn are applying a very old — many would say very boring — method to a very modern, trendy application: recommender systems. (See also a recent paper by P.O. Perry.)

The typical example of a recommender system involves movie review data, such as that in Netflix and MovieLens, with various users rating various movies. We have a matrix, with rows representing users and columns for movies, but it is a very sparse matrix, because most users don’t rate most movies. So, we’d like to guess the missing elements, a process sometimes termed the *matrix completion problem*.

There are many methods for this, such as matrix factorization, as exemplified in the R package **NMF**. I much prefer the approach taken by Art and Katelyn, as it is more statistical and thus potentially can provide more insight into the underlying processes.

I was inspired by their research to start a project that takes a new approach to random effects models in general. I’ll explain that later, but let’s first address the major computational problems that can occur with random effects models when applied to big data. Note carefully that not only is this a problem with long run times, but also with memory limitations. For example, Pennell and Dunson wrote, in Fitting Semiparametric Random Effects Models to Large Data Sets, *Biostatistics*, 2007, “SAS PROC MIXED ran out of memory when we attempted to fit a model with random smoking effects.”

One of the features of my **partools** package is what I call Software Alchemy, with the term alluding to the fact that the procedure turns a statistical algorithm into a statistically equivalent version that is easily parallelized. The idea is simple: Break i.i.d. data into chunks, apply the original algorithm to each chunk, then average the results. It’s easy to prove that the statistical accuracy of this method matches that of the original one, and one not only gets a speedup but also may solve the memory problem, since each chunk is smaller. See my paper on this, to appear soon in the *Journal of Statistical Software*, but also outlined in my book on parallel computation in data science. (Chunking has been tried in random effects contexts before; see references in the Perry paper above.)

Here is an example of all this, applied to the MovieLens data, 1-million records version. I’ll use Art and Katelyn’s model here, which writes the rating by user i of movie j as

Y_{ij} = μ + a_{i} + b_{j} + ε_{ij}

where μ is an unknown constant and a_{i} + b_{j} + ε_{ij} are independent, mean-0 variables (not necessarily normal) with variances σ_{a}^{2}, σ_{b}^{2} and σ_{ε}^{2}. One can also add covariates, but we’ll keep things simple here. In addition to estimating the variances, one can estimate the a_{i} and b_{j}, and thus predict the missing Y_{ij}. (See the **b** component in the output of **lmer()**, to be discussed below. It lists the estimated a_{i} and b_{j}, in that order.)

Art and Katelyn (as well as Perry) use the Method of Moments rather than the standard methods for random effects models, so that (a) they reduce computation, by deriving explicit solutions and (b) are not tied to assumptions of normality. I use MM in my project too, but here I’ll use the famous R package for random effects models, **lme4**. I ran the experiments here on a quad-core machine, using an R **parallel** cluster of that size.

I first ran without chunking:

> system.time(lmerout <- lmer(V3 ~ (1|V1) + (1|V2),ml1m))

user system elapsed

1105.129 3.692 1108.946

> getME(lmerout,'theta')

V1.(Intercept) V2.(Intercept)

0.3994240 0.6732281

Not good — almost 20 minutes! So, I turned to **partools**.The main Software Alchemy function is **cabase()**. For a given statistical algorithm, one needs to specify the function that implements the algorithm and a function to extract the estimated parameters from the output of the algorithm. There is already a wrapper for **lm()** in the package, so I just modified it for **lmer()**:

```
calmer <- function (cls, lmerargs)
{
ovf <- function(u) {
tmp <- paste("lmer(", lmerargs, ")", collapse = "")
docmd(tmp)
}
estf <- function(lmerout) {
getME(lmerout,'theta')
}
cabase(cls, ovf, estf)
}
```

I had earlier used **partools **function **distribsplit()** to distribute my data frame **m1m** to the cluster nodes, in chunks of the same name. So, the call to **cabase()** does the estimation on each chunk, collects the results, and averages them. Here is what happened:

```
> system.time(print(calmer(cls,'V3 ~ (1|V1) + (1|V2),ml1m')))
$thts
V2.(Intercept) V1.(Intercept)
[1,] 0.6935344 0.3921150
[2,] 0.6623718 0.3979066
[3,] 0.6505459 0.3967739
[4,] 0.6825985 0.4103681
$tht
V2.(Intercept) V1.(Intercept)
0.6722626 0.3992909
user system elapsed
0.812 0.080 189.744
```

We got an almost 6-fold improvement! And note that we used only 4 cores; this is the *superlinear* effect that I discuss in my Software Alchemy paper. Just as important, we got essentially the same estimates as from the non-chunked computation. (Please note: The chunked version can be shown to have the same asymptotic variance as the original one; neither is an “approximation” to the other.)

But wait a minute. Didn’t I say that Software Alchemy is intended for i.i.d. data? I did. In the paper I said that the theory could easily be extended to the case of independent but non identically-distributed data, but it would be difficult to state general conditions for this.

Random effects models generally do NOT assume the data are identically distributed. In Art and Katelyn’s paper, for instance, they define indicator variables z_{ij} representing that user i rated movie j, and treat these as constants. So, for example, the row counts — numbers of movies rated by each user — are treated as constants.

It’s my contention that such quantities should be treated as random, just like other user behavior. (We haven’t brought in variables such as movie genre here, but if we did, a similar statement would hold.) And once we do that, then voila!, we now have i.i.d. data, and Software Alchemy applies.

In my paper on arXiv, I also show why it doesn’t matter: The MM estimators are going to turn out essentially indentical, regardless of assuming fixed or random quantities of this type.

Moreover, the advantage of doing things this way **goes beyond simply being able to use Software Alchemy**. Those quantities may be of statistical interest in their own right. For instance, consider a study of children within families. In the classical approach, the number of kids in a family in a random effects model would be treated as fixed. But research has shown that family size is negatively correlated with income, and the number of kids should be an important variable — a *random* variable — in the analysis.

Now, back to MM. The analysis will compute a lot of quantities involving within-user variances and so on. If so, it will help to group together rows or columns of our input data frame, e.g. group by user and group by movie. How might we do this using **par tools**?

The R function **split()** comes to mind, but there is an issue of combining the outputs of **split(), ** which are R lists, from the various cluster nodes. Note in particular that some user, for example, may have data only at one of the nodes. But this is exactly the situation that the **partools** function **addlists()** is designed to handle!

That helps, but still we have to worry about overhead incurred by shipping so much data back and forth between the worker nodes and manager node in the cluster. Also, what about memory? Again, we should try (may need) to avoid holding the entire data set in memory at one time. The **partools** function **filesplit()** helps in that regard, but this continues to be a computational challenge. It may be that the only good solutions involve C.

**leave a comment**for the author, please follow the link and comment on their blog:

**Mad (Data) Scientist**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...