Taking R to the Limit, Part II – Large Datasets in R

August 20, 2010

(This article was first published on Byte Mining » R, and kindly contributed to R-bloggers)


For Part I, Parallelism in R, click here.

Tuesday night I again had the opportunity to present on high performance computing in R at the Los Angeles R Users’ Group. This was the second part of a two-part series called “Taking R to the Limit: High Performance Computing in R.” Part II discussed ways to work with large datasets in R, and I also tied MapReduce into the talk. Unfortunately, there was too much material to fit: I had originally planned to cover Rhipe, using R on EC2, and sparse matrix libraries as well.


My edited slides are posted on SlideShare, and available for download here.

Topics included:

  • bigmemory, biganalytics and bigtabulate
  • ff
  • HadoopStreaming
  • brief mention of Rhipe


The corresponding demonstration code is here.


Since this talk discussed large datasets, I used some, well, large datasets. Some demonstrations used toy data, including trees and the famous iris dataset, both of which ship with R in the datasets package. These are datasets rather than packages, so to load them, just use the call data(iris) or data(trees).
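
As a quick check that the toy data is available, here is a minimal sketch of loading it. Both iris and trees live in the datasets package, which R attaches by default, so they are loaded with data() rather than attached as packages. The variable name iris_means is only for illustration and does not come from the demonstration code.

```r
# Load the two toy datasets from the 'datasets' package (attached by default).
data(iris)
data(trees)

# Inspect their shapes: iris has 150 rows and 5 columns, trees has 31 rows and 3 columns.
dim(iris)
dim(trees)

# Peek at the first few rows of each.
head(iris, 3)
head(trees, 3)

# Column means of the four numeric iris measurements (illustrative only).
iris_means <- colMeans(iris[, 1:4])
iris_means
```

Because R attaches the datasets package by default, iris and trees are usually usable even without the data() call; calling it just makes the dependency explicit.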

Large datasets:

  • On-Time Airline Performance data from the 2009 Data Expo. This Bash script will download all of the necessary data files and create a single dataset called airline.csv in the directory in which it is executed. I would just post the file here, but it is very large and I only have so much bandwidth!
  • The Twitter dataset appears to no longer be available. Instead, use anna.txt, which comes with HadoopStreaming: simply replace twitter.tsv with anna.txt in the demonstration code.


The video was created with Vara ScreenFlow and I am very happy with how easy it is to use and how painless editing was.
