(This article was first published on **CYBAEA Data and Analysis**, and kindly contributed to R-bloggers)

The `save()` function in the R platform for statistical computing is very convenient and I suspect many of us use it a lot. But I was recently bitten by a “feature” of the format which meant I could not recover my data.
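For readers less familiar with it, the usual round trip is one call each way. This base-R sketch (the object and file name are made up for illustration) shows the convenience that makes `save()` so tempting:

```r
## Sketch of the usual save()/load() round trip, base R only
a <- list(data = 1:10, label = "example")
f <- tempfile(fileext = ".RData")
save(a, file = f)          # serialize the object to disk
rm(a)                      # pretend we are on a fresh machine
load(f)                    # restores 'a' into the workspace by name
str(a)
```

Note that `load()` restores the object under its original name: the file carries a full serialized object, not just the raw data, which is exactly what bites us below.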

## How to lose your data with `save()`

I am using Windows on my travel laptop and Linux on my workstation. To speed things up on the latter and make use of my many (well, four) cores, I use the ‘multicore’ package, which I do not have available on the Windows machine.

To illustrate the problem with the save file format, I created a file on the Linux machine simply as:

```r
library("multicore")
a <- list(data = 1:10, fun = mclapply)
save(a, file = "a.RData")
```

What could be simpler? `mclapply` is a function from the ‘multicore’ package, but it clearly has no impact on the stored data. (We will show a more realistic example below – work with me here.)

But try to open the save file on a machine without the package installed, like my Windows laptop, and you get:

```
Error in loadNamespace(name) : there is no package called 'multicore'
```

**There is no way of getting to your precious data** without installing the missing package.

If the package has been withdrawn or is no longer available then your data is basically lost.

## What can you do?

Some suggestions from the helpful people on R-help:

- (Uwe Ligges): You could try to rewrite `./src/main/saveload.R` and `serialize.R` to extract only the parts you need. “This is probably not worth the effort.”
- (Prof. Brian Ripley): You could try installing the missing package; `R CMD INSTALL --fake` should be sufficient to let you load the data. He also suggests that the rewrite proposal above would be very hard indeed.
- (Martin Morgan): Don't store package functions with your code.

That is three good answers from three of the heavy-weights in the R community. Thank you all!

Martin’s comment is worth expanding. We can change the above example to:

```r
library("multicore")
computeFunction <- function(...) {
    if (require(multicore)) mclapply(...)
    else lapply(...)
}
a <- list(data = 1:10, fun = computeFunction)
save(a, file = "a.RData")
```

Now everything works fine! No data is horribly lost: the file loads without error on the ‘multicore’-less machine.

And for the more realistic example: I had been using `caret::rfe`, as Martin knew from the example he provided:

```r
library("caret")
data(BloodBrain)
x <- scale(bbbDescr[, -nearZeroVar(bbbDescr)])
x <- x[, -findCorrelation(cor(x), .8)]
x <- as.data.frame(x)
set.seed(1)
lmProfile <- rfe(x, logBBB,
                 sizes = c(2:25, 30, 35, 40, 45, 50, 55, 60, 65),
                 rfeControl = rfeControl(functions = lmFuncs, number = 5,
                                         computeFunction = mclapply))
save(lmProfile, file = "lmProfile.RData")
```

It is slightly less obvious that this code contains a reference to an external namespace, but it is easy enough to see if you know what to look for.
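One way to check (a base-R sketch, not part of the original discussion) is to inspect the enclosing environment of any function you are about to store: a function that belongs to a package reports that package's namespace, while one you defined yourself reports the global environment.

```r
## Sketch: spot functions that belong to a package namespace
f <- stats::median
environmentName(environment(f))   # "stats"

## A function defined at top level reports the global environment
g <- function(x) x + 1
environmentName(environment(g))   # "R_GlobalEnv"
```

Anything that reports a package namespace will drag that package into the save file.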

For old files I will use the `R CMD INSTALL --fake` suggestion, but for new data I am going with the last approach, using a `computeFunction` like this:

```r
### MCCompute: A computeFunction for caret::rfeControl and caret::trainControl
### that does not leave a reference to the multicore package in the save file
MCCompute <- function(X, FUN, ...) {
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X)) X <- as.list(X)
    if (require("multicore")) mclapply(X, FUN, ...)
    else lapply(X, FUN, ...)
}
```

I know that Max Kuhn is rewriting the caret package which should make this a moot point in the near future for that specific case. But the indirection approach is generally useful and will also be relevant in other situations.

## Recommendations

My recommendations:

- **Save data in a data format, not using the `save()` function**, which is really for objects (data and code). Suitable formats include CSV and variants, HDF5, and CDF, as well as others.
- Avoid references to packages in your objects by using the one level of indirection exemplified by the `MCCompute` function shown above.
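As a sketch of the first recommendation, base R's `write.csv()`/`read.csv()` keep the data readable anywhere, at the cost of losing R-specific attributes (classes, factor levels) that you may need to restore on read; the data frame here is invented for illustration:

```r
## Sketch: store the data itself in a portable format instead of save()
d <- data.frame(x = 1:10, y = letters[1:10])
f <- tempfile(fileext = ".csv")
write.csv(d, file = f, row.names = FALSE)

## Reading back recovers the values; column types should be re-checked
## (e.g. older R versions turn strings into factors on read)
d2 <- read.csv(f)
all.equal(dim(d), dim(d2))   # TRUE
```

The same idea applies to HDF5 or CDF via the relevant packages: the file format then owes nothing to the R session that wrote it.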

What is your approach? Suggestions in the comments below, please.
