Memory Management in R, and SOAR


The more I’ve worked with my really large data set, the more of a burden it has become for my work computer.  Keep in mind I’ve got a quad core with 8 gigs of RAM.  With growing irritation at how slow my computer becomes at times while working with these data, I took to finding better ways of managing memory in R.

The best/easiest solution I’ve found so far is a package called SOAR.  Put simply, it lets you store specific R objects (data frames being the most important, for me) as RData files on your hard drive, and gives you the ability to analyze them in R without having them loaded into your RAM.  I stress the word analyze because every time I try to add variables to a data frame I’ve stored, the data frame comes back into RAM and once again slows me down.

An example might suffice:

> r = data.frame(a=rnorm(10,2,.5),b=rnorm(10,3,.5))
> r

          a        b
1  1.914092 3.074571
2  2.694049 3.479486
3  1.684653 3.491395
4  1.318480 3.816738
5  2.025016 3.107468
6  1.851811 3.708318
7  2.767788 2.636712
8  1.952930 3.164896
9  2.658366 3.973425
10 1.809752 2.599830
> library(SOAR)
> Sys.setenv(R_LOCAL_CACHE="testsession")
> ls()
[1] "r"
> Store(r)
> ls()
character(0)
> mean(r[,1])
[1] 2.067694
> r$c = rnorm(10,4,.5)
> ls()
[1] "r"

So, the first thing I did was make a data frame with some columns, which got stored in my workspace and thus loaded into RAM.  Then I loaded the SOAR library and set my local cache to "testsession".  The practical implication is that a directory gets created within the directory R is currently working out of (in my case, "/home/inkhorn/testsession"), and that any objects passed to the Store command get saved as RData files in that directory.

Sure enough, you can see my workspace before and after I store the r object.  Now you see the object, now you don’t!  But then, as I show, even though the object is no longer in the workspace, you can still analyze it (in my case, calculate the mean of one of its columns).  However, as soon as I try to make a new column in the data frame… voilà… it’s back in my workspace, and thus in RAM!

So, unless I’m missing something about how the package is used, it doesn’t function exactly as I would like, but it’s still an improvement.  Every time I’m done making new columns in the data frame, I just have to pass the object to the Store command again, and away to the hard disk it goes, out of my RAM.  It’s quite liberating not having a stupendously heavy workspace: a heavy workspace takes forever to save and load whenever I leave or enter R.  With the heavy stuff sitting on the hard disk, leaving and entering R go by a lot faster.
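To make that round trip concrete, here’s a minimal sketch continuing the session above (nothing new here beyond re-running Store after the modification):

> ls()        # r came back into the workspace when I added the new column
[1] "r"
> Store(r)    # done modifying: send it back out to the on-disk cache
> ls()
character(0)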

Another thing I noticed is that if I keep the GLMs I’ve generated in my workspace, they seem to take up a lot of RAM as well and slow things down.  So, with the main data frame written to disk and the GLMs kept out of memory, R is flying again!
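I haven’t shown my real models here, but as a rough sketch (the formula and family below are made up purely for illustration), the same Store mechanism seems to work for fitted model objects too:

> fit = glm(b ~ a, data=r, family=gaussian)   # throwaway model on the toy data frame
> Store(fit)                                  # the fitted model now sits on disk, not in the workspace
> ls()
character(0)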

