R Code Optimization

August 16, 2011

(This article was first published on iamdata, and kindly contributed to R-bloggers)

Handling Large Data with R

The following experiments were inspired by this excellent presentation by Ryan Rosario: http://statistics.org.il/wp-content/uploads/2010/04/Big_Memory%20V0.pdf. R provides many I/O functions for reading and writing data, such as ‘read.table’ and ‘write.table’ -> http://cran.r-project.org/doc/manuals/R-intro.html#Reading-data-from-files. With data sets growing larger by the day, new approaches are becoming available for faster I/O operations.

The presentation proposes several solutions in the form of R libraries. Here are some I/O benchmarking results.

Testing bigmemory package

Test Background & Motivation

R works in RAM, which can cause performance issues with large data. The bigmemory package creates a variable X <- big.matrix such that X is a pointer to a dataset stored in RAM or on the hard drive. Just as in the C world, we create a reference to the object rather than a copy. Because R objects (such as matrices) are held in memory behind a pointer reference, multiple or parallel R processes can access the same memory objects, which allows memory-efficient parallel analysis.
The bigmemory package mainly uses a binary file format, whereas the R utils package uses the classic ASCII format.
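
To make the binary-versus-ASCII distinction concrete, here is a minimal sketch in base R only. It does not use bigmemory: writeBin and readBin stand in for a binary format, and the small matrix size and temporary file names are assumptions chosen for illustration.

```r
# Illustration only: base-R binary I/O as a stand-in for a binary backing file.
set.seed(1)
m <- matrix(rnorm(200 * 50, mean = 1, sd = 10), nrow = 200, ncol = 50)

csv_path <- tempfile(fileext = ".csv")  # hypothetical temporary files
bin_path <- tempfile(fileext = ".bin")

# ASCII route: every double is formatted as text and parsed again on read
write.table(m, file = csv_path, sep = ",", row.names = FALSE, col.names = FALSE)
m_csv <- as.matrix(read.csv(csv_path, header = FALSE))

# Binary route: the 8 bytes of each double are written and read back directly
writeBin(as.vector(m), bin_path)
m_bin <- matrix(readBin(bin_path, what = "double", n = length(m)), nrow = nrow(m))

file.size(bin_path)                            # 8 bytes per value
file.size(csv_path) > file.size(bin_path)      # the text file is larger
identical(m, m_bin)                            # binary round trip is exact
isTRUE(all.equal(m, unname(m_csv)))            # text round trip is equal up to printed precision
```

The binary file needs no number parsing and stores exactly 8 bytes per double, which is the general reason a binary format can be both smaller and faster to load than a text file.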

Testing tools
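
The snippets in this post call a helper named timeit whose definition is not shown. The following is a hypothetical stand-in, assuming it simply wraps system.time and reports the elapsed wall-clock seconds; the real helper may differ.

```r
# Hypothetical helper: the original definition of timeit() is not shown in the post.
timeit <- function(expr) {
  # expr is evaluated lazily, so the work happens inside system.time()
  timing <- system.time(expr)
  cat(sprintf("Total elapsed time: %.2f sec\n", timing[["elapsed"]]))
  invisible(timing[["elapsed"]])
}

# Usage: assignments inside the braces are made in the calling environment
elapsed <- timeit({ s <- sum(as.numeric(1:1000)) })
```
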
Test Scenario

Reading and writing a large matrix using (write.table,read.table) vs (big.matrix,read.big.matrix).
i. Create a large matrix of random double values.

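# Draw one value per cell; rnorm(10000, ...) alone would be recycled to fill the matrix
x1 <- matrix(rnorm(10000 * 10000, mean = 1.0, sd = 10.0), nrow = 10000, ncol = 10000)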

ii. Write and read a large matrix using write.table and read.csv (a wrapper around read.table).

timeit({ write.table(x1, file = filepath, sep = ",", eol = "\n", dec = ".", col.names = FALSE) })
timeit({ foo <- read.csv(filepath) })

iii. Write and read a large matrix using bigmemory package

# as.big.matrix() converts the existing matrix x1 into a file-backed big.matrix
timeit({ as.big.matrix(x1, type = "double", separated = FALSE,
    backingfile = "BigMem.bin", descriptorfile = "BigMem.desc", shared = TRUE) })

timeit({ foo <- read.big.matrix(filepath, sep = ",", header = FALSE, col.names = NULL, row.names = NULL,
    has.row.names = FALSE, ignore.row.names = FALSE,
    type = "double", backingfile = "BigMem.bin",
    descriptorfile = "BigMem.desc", shared = TRUE) })

iv. Testing using my.read.lines

timeit({ foo <- my.read.lines(filepath) })

Test Results

Platform: Dell Precision Desktop with Intel Core 2 Duo Quad CPU @ 2.66GHz, 7.93GB RAM.

utils function | Total elapsed time (sec) | bigmemory function | Total elapsed time (sec) | File size on disk (.csv) | Computation time saved by bigmemory
write.table    | 369.79                   | big.matrix         | 1.51                     | 1.7GB                    | 99%
read.csv       | 313.03                   | read.big.matrix    | 141.50                   | 1.7GB                    | 55%

* my.read.lines(filepath) took 23.73 secs.
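
The "time saved" column can be reproduced from the elapsed times. The sketch below assumes the figure is the relative reduction (utils time minus bigmemory time, divided by utils time); the data frame and column names are made up for illustration.

```r
# Elapsed times (sec) copied from the results table
elapsed <- data.frame(
  operation = c("write", "read"),
  utils     = c(369.79, 313.03),
  bigmemory = c(1.51, 141.50)
)

# Relative reduction in elapsed time, as a percentage
elapsed$saved_pct <- 100 * (elapsed$utils - elapsed$bigmemory) / elapsed$utils
round(elapsed$saved_pct, 1)  # 99.6 54.8
```

Under that assumption the values agree with the 99% and 55% reported in the table, up to rounding.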

Test Discussion

The timing results show that bigmemory provides large gains in I/O speed: writing was about 99% faster and reading about 55% faster than with the utils functions. The values of the foo data frame are accurate.
The read.big.matrix function creates a .bin file of size 789MB. This backing file lets large objects (matrices etc.) be accessed in memory (RAM) through pointer objects that act as references; see the parameters ‘backingfile’ and ‘descriptorfile’. When a new R session is started, the user obtains the pointer from the descriptor file with attach.big.matrix("BigMem.desc"). This way several R processes can share memory objects via ‘call by reference’.
The descriptor stored in the .desc file is represented in R as an S4 object -> https://github.com/hadley/devtools/wiki/S4
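
As a small sketch of the attach workflow, the code below creates a tiny file-backed big.matrix and attaches it a second time through its descriptor file. It assumes the bigmemory package is installed; the file names demo.bin and demo.desc, the temporary directory, and the matrix size are chosen only for illustration. In practice the second handle would typically be created in a separate R session.

```r
library(bigmemory)

backing_dir <- tempdir()  # hypothetical location for the backing files

# Create a small file-backed matrix and write one value through the first handle
x <- big.matrix(nrow = 100, ncol = 10, type = "double", init = 0,
                backingfile = "demo.bin", backingpath = backing_dir,
                descriptorfile = "demo.desc")
x[1, 1] <- 42

# Attach the same data through the descriptor file; no data is copied
y <- attach.big.matrix(file.path(backing_dir, "demo.desc"))
y[1, 1]        # the value written through x is visible through y

# Writes through the second handle are visible through the first as well
y[2, 2] <- 7
x[2, 2]
```

Both x and y are references to the same backing file, so neither assignment copies the matrix.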

In summary, in these tests the bigmemory approach:
i. Was faster in computation.
ii. Took less space on the file system (a 789MB .bin file versus a 1.7GB .csv file).
iii. Allowed subsequent loading of the data via ‘call by reference’.
