Parallelizing Linear Regression or Using Multiple Sources

June 21, 2018
By

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

My previous post was explaining how mathematically it was possible to parallelize computation to estimate the parameters of a linear regression. More speficially, we have a matrix \mathbf{X} which is n\times k matrix and \mathbf{y} a n-dimensional vector, and we want to compute \widehat{\mathbf{\beta}}=[\mathbf{X}^T\mathbf{X}]^{-1}\mathbf{X}^T\mathbf{y} by spliting the job. Instead of using the n observations, we’ve seen that it was to possible to compute “something” using the first n_1 rows, then the next n_2 rows, etc. Then, finally, we “aggregate” the m objects created to get our overall estimate.

Parallelizing on multiple cores

Let us see how it works from a computational point of view, to run each computation on a different core of the machine. Each core will see a slave, computing what we’ve seen in the previous post. Here, the data we use are

1
2
3
y = cars$dist
X = data.frame(1,cars$speed)
k = ncol(X)

On my laptop, I have three cores, so we will split it in m=3 chunks

1
2
3
4
library(parallel)
library(pbapply)
ncl = detectCores()-1
cl = makeCluster(ncl)

This is more or less what we will do: we have our dataset, and we split the jobs,

We can then create lists containing elements that will be sent to each core, as Ewen suggested,

1
2
3
4
5
chunk = function(x,n) split(x, cut(seq_along(x), n, labels = FALSE))
a_parcourir = chunk(seq_len(nrow(X)), ncl)
for(i in 1:length(a_parcourir)) a_parcourir[[i]] = rep(i, length(a_parcourir[[i]]))
Xlist = split(X, unlist(a_parcourir))
ylist = split(y, unlist(a_parcourir))

It is also possible to simplify the QR functions we will use

1
2
3
4
5
6
7
8
compute_qr = function(x){
  list(Q=qr.Q(qr(as.matrix(x))),R=qr.R(qr(as.matrix(x))))
}
get_Vlist = function(j){
  Q3 = QR1[[j]]$Q %*% Q2list[[j]]
  t(Q3) %*% ylist[[j]]
}
clusterExport(cl, c("compute_qr", "get_Vlist"), envir=environment())

Then, we can run our functions on each core. The first one is

1
  QR1 = parLapply(cl=cl,Xlist, compute_qr)

note that it is also possible to use

1
  QR1 = pblapply(Xlist, compute_qr, cl=cl)

which will include a progress bar (that can be nice when the database is rather large). Then use

1
2
3
4
5
6
7
  R1 = pblapply(QR1, function(x) x$R, cl=cl) %>% do.call("rbind", .)
  Q1 = qr.Q(qr(as.matrix(R1)))
  R2 = qr.R(qr(as.matrix(R1)))
  Q2list = split.data.frame(Q1, rep(1:ncl, each=k))
  clusterExport(cl, c("QR1", "Q2list", "ylist"), envir=environment())
  Vlist = pblapply(1:length(QR1), get_Vlist, cl=cl)
  sumV = Reduce('+', Vlist)

and finally the ouput is

1
2
3
4
solve(R2) %*% sumV
         [,1]
X1 -17.579095
X2   3.932409

which is what we were expecting…

Using multiple sources

In practice, it might also happen that various “servers” have the data, but we cannot get a copy. But it is possible to run some functions on their server, and get some output, that we can use afterwards.

Datasets are supposed to be available somewhere. We can send a request, and get a matrix. Then we we aggregate all of them, and send another request. That’s what we will do here. Provider j should run f_1(\mathbf{X}) on his part of the data, that function will return R^{(1)}_j. More precisely, to the first provider, send

1
2
3
function1 = function(subX){
return(qr.R(qr(as.matrix(subX))))}
R1 = function1(Xlist[[1]])

and actually, send that function to all providers, and aggregate the output

1
for(j in 2:m) R1 = rbind(R1,function1(Xlist[[j]]))

The create on your side the following objects

1
2
3
4
Q1 = qr.Q(qr(as.matrix(R1)))
R2 = qr.R(qr(as.matrix(R1)))
Q2list=list()
for(j in 1:m) Q2list[[j]] = Q1[(j-1)*k+1:k,]

Finally, contact one last time the providers, and send one of your objects

1
2
3
4
function2=function(subX,suby,Q){
Q1=qr.Q(qr(as.matrix(subX)))
Q2=Q
return(t(Q1%*%Q2) %*% suby)}

Provider j should then run f_2(\mathbf{X},\mathbf{y},Q_j^{(2)}) on his part of the data, using also Q_j^{(2)} as argument (that we obtained on own side) and that function will return (\mathbf{Q}^{(2)}_j\mathbf{Q}^{(1)}_j)^{T}_j\mathbf{y}_j. For instance, ask the first provider to run

1
sumV = function2(Xlist[[1]],ylist[[1]], Q2list[[1]])

and do the same with all providers

1
for(j in 2:m) sumV = sumV+ function2(Xlist[[j]],ylist[[j]], Q2list[[j]])
1
2
3
4
solve(R2) %*% sumV
         [,1]
X1 -17.579095
X2   3.932409

which is what we were expecting…

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)