Large scale optimization with optim() and RevoScaleR


by Derek McCrae Norton – Senior Sales Engineer

Optimization is something that I hear clients ask for on a fairly regular basis. A handful of optimization functions can solve a wide range of problems, and R has much of this functionality in the base product, such as nlm() and optim(). There are also many packages that address this area, as well as a task view devoted to it (Optimization and Mathematical Programming).
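As a quick illustration of the base-R interface (a toy example, not from the original post), optim() just needs a starting value and a function to minimize; here it recovers the minimum of a simple quadratic:

# Minimize f(a, b) = (a - 1)^2 + (b + 2)^2, whose minimum is at a = 1, b = -2
quadFun <- function(p) (p[1] - 1)^2 + (p[2] + 2)^2
optim(par = c(a = 0, b = 0), fn = quadFun)$par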

The reason I hear these requests is that clients wonder how to scale such problems up. Most of the time I am asked whether we have a ScaleR function that does optimization, i.e. optimization for "Big Data." Though I am sure this will come in the near future, that doesn't mean you can't do optimization on "Big Data" right now. As mentioned above, R has lots of functionality around optimization, and Revolution R Enterprise provides a framework for dealing with "Big Data." Combining the two means we can easily carry out optimization on "Big Data."

Below is a description of some basic optimization in R as well as some code to demonstrate, and then how to do the same sort of thing in Revolution R Enterprise (RRE).

Optimization in R

optim() is often the first function used in R when one tackles an optimization problem, so we will also use it as a first step. In general we are trying to minimize a function; here we will compute the maximum likelihood estimates for a normal distribution by minimizing the negative log-likelihood. This is of course silly, since there is a closed form for the mean and standard deviation, but it does clearly demonstrate how optimization can be scaled up.
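As a reminder of what optim() will actually be minimizing (a standard result, not spelled out in the original post), the negative log-likelihood of an i.i.d. normal sample and its closed-form minimizers are

$$-\ell(\mu, \sigma) = \frac{n}{2}\log\!\left(2\pi\sigma^{2}\right) + \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i - \mu)^{2}, \qquad \hat{\mu} = \bar{x}, \qquad \hat{\sigma}^{2} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2},$$

which is why theta in the next code block uses the divide-by-n standard deviation.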

Let's generate some data for use in this section and in the next "Big Data" section. These data are not that large, but they will be written out in 2,000-row "chunks" to demonstrate out-of-memory processing.

library(RevoScaleR)  # needed here for rxDataStep()
set.seed(123)
n <- 10000
mu <- 4
sd <- 4
x <- rnorm(n, mean = mu, sd = sd)
# Closed-form MLEs for comparison: sample mean and population (divide-by-n) SD
theta <- c(mu = mean(x), sigma = sqrt((n - 1) * var(x)/n))
# Write the data to an .xdf file in 2,000-row blocks for chunked processing
rxDataStep(inData = data.frame(x), outFile = "optim-work.xdf", overwrite = TRUE, 
    rowsPerRead = 2000, reportProgress = 0)
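If you want to confirm how the .xdf file was laid out on disk, rxGetInfo() reports the number of rows and blocks (a quick check, not in the original post; the exact printout varies by RevoScaleR version):

# With 10,000 rows written at rowsPerRead = 2000 we expect 5 blocks
rxGetInfo(data = "optim-work.xdf")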

[Figure: distribution of the simulated data]

Below is a very simple maximum likelihood estimation of the mean and standard deviation using the log-likelihood function.

# Negative log-likelihood of a normal sample (negated so optim() can minimize it)
logLikFun <- function(param, mleData) {
    mu <- param[1]
    sigma <- param[2]
    -sum(dnorm(mleData, mean = mu, sd = sigma, log = TRUE))
}
 
timing1 <- system.time(mle1 <- optim(par = c(mu = 0, sigma = 1), fn = logLikFun, 
    mleData = x))
rbind(mle = mle1$par, standard = theta)
##             mu sigma
## mle      3.990 3.993
## standard 3.991 3.994

As we can see above, the MLE is slightly different from the directly computed mean and standard deviation, but that is to be expected. We can also see that the process was pretty quick, at about 0.05 seconds.

“Big Data” Optimization in RRE

As we saw above, the log-likelihood function depends only on mu, sigma, and the data. More specifically, it is a sum of a transformation of the data. Because it is a sum, computing the sum over each part of the data and then adding up those partial sums is equivalent to computing the sum over all of the data at once. This is exactly what rxDataStep() in the RevoScaleR package allows us to do: step through the data, calculate a sum on each chunk, and update the running total as we go. We can reuse the same log-likelihood function we defined for small data; we just need to define how that function will be applied to individual chunks and then combined into a single return value.
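As a quick in-memory sanity check of that chunk-additivity claim (not in the original post), we can split x into 2,000-row pieces, evaluate logLikFun() on each, and compare the sum of the pieces to the value on the full vector:

testPar  <- c(mu = 0, sigma = 1)
chunks   <- split(x, rep(1:5, each = 2000))          # five 2,000-row chunks
chunkSum <- sum(vapply(chunks, function(xi) logLikFun(testPar, mleData = xi), numeric(1)))
all.equal(chunkSum, logLikFun(testPar, mleData = x)) # TRUE, up to floating-point tolerance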

library(RevoScaleR)
mleFun <- function(param, mleVar, data, llfun) {
    # Transform function applied to each chunk: evaluate the negative
    # log-likelihood on the chunk and fold it into the running total sumLL
    mleXform <- function(dataList) {
        .rxModify("sumLL", llfunx(paramx, mleData = dataList[[mleVarx]]))
        return(NULL)
    }
    # Step through the .xdf file chunk by chunk and return the accumulated sum
    rxDataStep(inData = data, transformFunc = mleXform, transformObjects = list(llfunx = llfun, 
        sumLL = 0, paramx = param, mleVarx = mleVar), returnTransformObjects = TRUE, 
        reportProgress = 0)$sumLL
}
timing2 <- system.time(mle2 <- optim(par = c(mu = 0, sigma = 1), fn = mleFun, 
    mleVar = "x", data = "optim-work.xdf", llfun = logLikFun))
all.equal(mle1$par, mle2$par)
## [1] TRUE
rbind(mle1 = mle1$par, mle2 = mle2$par, standard = theta)
##             mu sigma
## mle1     3.990 3.993
## mle2     3.990 3.993
## standard 3.991 3.994

Again, the MLE is slightly different from the directly computed mean and standard deviation, but it matches our in-memory calculation using the same log-likelihood. We can also see that the process was a bit slower, at about 5.6 seconds.
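For reference, the elapsed times quoted in this post can be read straight off the two system.time() results captured above (the exact numbers will of course vary by machine):

# Elapsed seconds for the in-memory fit vs. the chunked .xdf fit
c(in_memory = timing1[["elapsed"]], xdf_chunked = timing2[["elapsed"]])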

The data we are using are not that large, but they are being read from disk and iterated over, which means this will be slower than the equivalent in-memory computation; we should not be too surprised by that. What is of more interest is that, by using the big data framework in RRE, the same optimization can be carried out on data much larger than the available RAM, and as the data size grows the overhead we see above diminishes. As with anything involving "Big Data," this just means that you need to think a bit more about what you are doing (probably good advice all the time).
