The save() function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the file format which meant I could not recover my data.
How to lose your data with save()
I am using Windows on my travel laptop and Linux on my workstation. To speed things up on the latter and make use of my many (well, four) cores, I use the ‘multicore’ package, which I do not have available on the Windows machine.
To illustrate the problem with the save file format, I created a file on the Linux machine simply as:
library("multicore")
a <- list(data = 1:10, fun = mclapply)
save(a, file = "a.RData")
What could be simpler? The mclapply is a function from the ‘multicore’ package, but it clearly has no impact on the stored data. (We will show a more realistic example below – work with me here.)
But try to open the save file on a machine without the package installed, like my Windows laptop, and you get:
Error in loadNamespace(name) : there is no package called 'multicore'
There is no way of getting to your precious data without installing the missing package.
If the package has been withdrawn or is no longer available, then your data is basically lost.
What can you do?
Some suggestions from the helpful people on R-help:
- (Uwe Ligges): You could try to rewrite ./src/main/saveload.c and serialize.c to extract only the parts you need. “This is probably not worth the effort.”
- (Prof. Brian Ripley): You could try installing a fake package under the missing name; R CMD INSTALL fake should be sufficient to let you load the data. He also suggests that the proposal above would be very hard indeed.
- (Martin Morgan): Don’t store package functions with your code.
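Brian Ripley’s suggestion can be sketched from within R itself. The following is a hypothetical sketch, not his exact recipe: it builds an empty placeholder package named ‘multicore’ (a minimal DESCRIPTION plus an empty NAMESPACE), installs it into a throwaway library, and puts that library on the search path so that loadNamespace() stops complaining. All names and DESCRIPTION fields here are illustrative.

```r
## Hypothetical sketch: an empty placeholder package so loadNamespace()
## succeeds when the save file is loaded.
pkgdir <- file.path(tempdir(), "multicore")
dir.create(pkgdir, showWarnings = FALSE)
writeLines(c("Package: multicore",
             "Version: 0.0-1",
             "Title: Empty Placeholder",
             "Description: Placeholder so saved objects referencing the namespace load.",
             "Author: Placeholder",
             "Maintainer: Placeholder <nobody@example.com>",
             "License: GPL-2"),
           file.path(pkgdir, "DESCRIPTION"))
file.create(file.path(pkgdir, "NAMESPACE"))  # an empty namespace is enough
lib <- file.path(tempdir(), "lib")           # throwaway library location
dir.create(lib, showWarnings = FALSE)
install.packages(pkgdir, repos = NULL, type = "source", lib = lib)
.libPaths(c(lib, .libPaths()))
## load("a.RData")  # should now get past the loadNamespace() error
```

Note that this only lets load() complete; actually calling the stored function will of course still fail, since the placeholder package is empty.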
That is three good answers from three of the heavyweights in the R community. Thank you all!
Martin’s comment is worth expanding. We can change the above example to:
library("multicore")
computeFunction <- function(...) {
  if (require(multicore))
    mclapply(...)
  else
    lapply(...)
}
a <- list(data = 1:10, fun = computeFunction)
save(a, file = "a.RData")
Now everything works fine! No data is horribly lost: the file loads fine on the ‘multicore’-less machine.
And for the more realistic example: I had been using caret::rfe, as Martin knew from the example he provided:

library("caret")
data(BloodBrain)
x <- scale(bbbDescr[, -nearZeroVar(bbbDescr)])
x <- x[, -findCorrelation(cor(x), .8)]
x <- as.data.frame(x)
set.seed(1)
lmProfile <- rfe(x, logBBB,
                 sizes = c(2:25, 30, 35, 40, 45, 50, 55, 60, 65),
                 rfeControl = rfeControl(functions = lmFuncs,
                                         number = 5,
                                         computeFunction = mclapply))
save(lmProfile, file = "lmProfile.RData")
Slightly less obvious that there is a reference to the external namespace in this code, but easy enough to see if you know what to look for.
For old files I will use the R CMD INSTALL fake suggestion, but for new data I am going with the last approach, using a computeFunction like this:
### MCCompute: A computeFunction for caret::rfeControl and caret::trainControl
### that does not leave a reference to the multicore package in the save file
MCCompute <- function(X, FUN, ...) {
  FUN <- match.fun(FUN)
  if (!is.vector(X) || is.object(X))
    X <- as.list(X)
  if (require("multicore"))
    mclapply(X, FUN, ...)
  else
    lapply(X, FUN, ...)
}
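As a quick sanity check that the fallback path behaves like plain lapply() on a machine without ‘multicore’ (the definition is repeated here so the snippet stands alone):

```r
## Self-contained check: if 'multicore' is absent, require() returns FALSE
## and MCCompute silently falls back to lapply().
MCCompute <- function(X, FUN, ...) {
  FUN <- match.fun(FUN)
  if (!is.vector(X) || is.object(X))
    X <- as.list(X)
  if (require("multicore"))
    mclapply(X, FUN, ...)
  else
    lapply(X, FUN, ...)
}
res <- MCCompute(1:4, function(i) i^2)
stopifnot(identical(unlist(res), c(1, 4, 9, 16)))
```

Either branch returns a list of the same values, so code built on MCCompute runs unchanged with or without the package installed.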
I know that Max Kuhn is rewriting the caret package which should make this a moot point in the near future for that specific case. But the indirection approach is generally useful and will also be relevant in other situations.
Recommendations
My recommendations:
- Save data in a data format, not using the save() function, which is really for objects (data and code). Suitable formats include CSV and variants, HDF5, and CDF, as well as others.
- Avoid references to packages in your objects by using the one-level-of-indirection trick exemplified by the MCCompute function shown above.
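For the first recommendation, a minimal sketch of what the CSV route looks like in practice (the file name and columns are illustrative):

```r
## Minimal sketch: store plain data as CSV instead of an .RData object.
## CSV can be read anywhere, with no packages and no save-format surprises.
d <- data.frame(id = 1:10, value = rnorm(10))
f <- tempfile(fileext = ".csv")
write.csv(d, f, row.names = FALSE)
d2 <- read.csv(f)
stopifnot(identical(d$id, d2$id))
stopifnot(isTRUE(all.equal(d$value, d2$value, tolerance = 1e-6)))
```

The small tolerance on the numeric column reflects the one real cost of text formats: values go through a decimal round trip rather than being stored bit-for-bit.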
What is your approach? Suggestions in the comments below, please.