A warning on the R save format

August 23, 2011

(This article was first published on CYBAEA Data and Analysis, and kindly contributed to R-bloggers)

The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data.

How to lose your data with save()

I am using Windows on my travel laptop and Linux on my workstation. To speed things up on the latter and make use of my many (well, four) cores, I use the ‘multicore’ package, which I do not have available on the Windows machine.

To illustrate the problem with the save file format, I created a file on the Linux machine simply as:

library("multicore")
a <- list(data = 1:10, fun = mclapply)
save(a, file = "a.RData")

What could be simpler? mclapply() is a function from the ‘multicore’ package, but it clearly has no impact on the stored data. (We will show a more realistic example below; work with me here.)
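Why does one stored function drag in a whole package? A closure in R carries a reference to its enclosing environment, and for a package function that environment is the package's namespace; save() serializes the reference, which load() can only resolve by calling loadNamespace(). A quick way to see this with a base function:

```r
# A closure remembers its enclosing environment.  For a function
# exported from a package, that environment is the package namespace,
# and save() records a reference to it in the file.
environmentName(environment(lapply))
# "base" -- lapply lives in namespace:base, just as mclapply lives in
# namespace:multicore
```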

But try to open the save file on a machine without the package installed, like my Windows laptop, and you get:

Error in loadNamespace(name) : there is no package called 'multicore'

There is no way of getting at your precious data without installing the missing package. If the package has been withdrawn or is no longer available, your data is effectively lost.

What can you do?

Some suggestions from the helpful people on R-help:

  1. (Uwe Ligges): You could try rewriting src/main/saveload.c and src/main/serialize.c from the R sources to extract only the parts you need. “This is probably not worth the effort.”
  2. (Prof. Brian Ripley): You could try installing the missing package; R CMD INSTALL --fake should be sufficient to let you load the data. He also noted that the proposal above would be very hard indeed.
  3. (Martin Morgan): Don't store package functions with your data in the first place.

That is three good answers from three of the heavyweights in the R community. Thank you all!
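To make option 2 concrete, here is a sketch of the recovery workflow; the tarball name and version are hypothetical, and you would need the package sources from a CRAN archive:

```shell
# Fake-install the missing package (no compilation, just enough for
# loadNamespace() to succeed), then strip the offending function and
# re-save.  The tarball name and version are hypothetical.
R CMD INSTALL --fake multicore_0.1-7.tar.gz
Rscript -e 'load("a.RData"); a$fun <- NULL; save(a, file = "a-clean.RData")'
```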

Martin’s comment is worth expanding. We can change the above example to:

library("multicore")
computeFunction <- function(...) {
    if (require("multicore")) mclapply(...)
    else lapply(...)
}
a <- list(data = 1:10, fun = computeFunction)
save(a, file = "a.RData")

Now everything works: no data is lost, and the file loads fine on the ‘multicore’-less machine.
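A quick, self-contained check of that fallback (computeFunction is redefined here so the snippet stands alone): require() quietly returns FALSE when ‘multicore’ is missing, so the lapply() branch runs and the result is identical either way.

```r
# computeFunction gives the same answer with or without 'multicore':
# require() quietly returns FALSE when the package is absent, and the
# plain lapply() branch takes over.
computeFunction <- function(...) {
    if (require("multicore", quietly = TRUE)) mclapply(...)
    else lapply(...)
}
computeFunction(1:3, function(x) x * 2)
# list(2, 4, 6)
```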

And for the more realistic example: I had been using caret::rfe, as Martin had guessed in the example he provided:

library("caret")
data(BloodBrain)

x <- scale(bbbDescr[,-nearZeroVar(bbbDescr)])
x <- x[, -findCorrelation(cor(x), .8)]
x <- as.data.frame(x)

set.seed(1)
lmProfile <- rfe(x, logBBB,
                 sizes = c(2:25, 30, 35, 40, 45, 50, 55, 60, 65),
                 rfeControl = rfeControl(functions = lmFuncs,
                   number = 5,
                   computeFunction=mclapply))
save(lmProfile, file = "lmProfile.RData")

It is slightly less obvious that this code contains a reference to an external namespace, but it is easy enough to see if you know what to look for.
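One way to spot these references before saving is to scan an object for closures whose enclosing environment is a package namespace. This helper is my own sketch (the name findNamespaceRefs is made up), shown here on the toy list from the first example, with lapply standing in for mclapply:

```r
# Sketch: list the elements of x that are package-namespace closures;
# any hit will drag that package into a save() file containing x.
findNamespaceRefs <- function(x) {
    isNsFun <- function(f) {
        is.function(f) && !is.primitive(f) && isNamespace(environment(f))
    }
    names(Filter(isNsFun, x))
}
a <- list(data = 1:10, fun = lapply)  # stand-in for fun = mclapply
findNamespaceRefs(a)
# "fun"
```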

For old files I will use the R CMD INSTALL --fake suggestion, but for new data I am going with the last approach and using a computeFunction like this:

### MCCompute: A computeFunction for caret::rfeControl and caret::trainControl 
### that does not leave a reference to the multicore package in the save file
MCCompute <- function(X, FUN, ...) {
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X)) 
        X <- as.list(X)
    if (require("multicore")) mclapply(X, FUN, ...)
    else lapply(X, FUN, ...)
}

I know that Max Kuhn is rewriting the caret package, which should soon make this a moot point for that specific case. But the indirection approach is generally useful and will remain relevant in other situations.

Recommendations

My recommendations:

  1. Save data in a data format, not with the save() function, which is really for objects (data and code). Suitable formats include CSV and its variants, HDF5, and CDF, among others.
  2. Avoid references to packages in your objects by using the one-level-of-indirection trick exemplified by the MCCompute function above.
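As a minimal sketch of recommendation 1, a CSV round-trip stores the data itself and nothing else, so the file loads anywhere without any package:

```r
# Store the data, not the object: the CSV file carries no code and no
# namespace references.
d <- data.frame(x = 1:10)
write.csv(d, "a.csv", row.names = FALSE)
d2 <- read.csv("a.csv")
identical(d2$x, 1:10)
# TRUE
```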

What is your approach? Suggestions in the comments below, please.
