# Taking R to the Limit, Part II – Large Datasets in R

**For Part I, Parallelism in R, click here.**

Tuesday night I again had the opportunity to present on high performance computing in R at the Los Angeles R Users’ Group. This was the second part of a two-part series called “Taking R to the Limit: High Performance Computing in R.” Part II discussed ways to work with large datasets in R, and I also tied MapReduce into the talk. Unfortunately, there was too much material to fit in: I had originally planned to also cover Rhipe, using R on EC2, and sparse matrix libraries.

Topics included:

- bigmemory, biganalytics and bigtabulate
- ff
- HadoopStreaming
- brief mention of Rhipe

Since this talk discussed large datasets, I used some, well, large datasets. Some demonstrations used toy data including `trees` and the famous `iris` dataset included in base R. These are datasets, not packages, so load them with `data(iris)` or `data(trees)`.
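
A minimal sketch of loading the toy data, assuming a standard R installation where the `datasets` package is attached at startup (the sanity checks are only illustrative and were not part of the talk):

```r
# iris and trees ship with the 'datasets' package, which R attaches by default.
data(iris)
data(trees)

# Quick sanity checks on what was loaded.
dim(iris)     # 150 rows, 5 columns
dim(trees)    # 31 rows, 3 columns
names(trees)  # "Girth" "Height" "Volume"
```

In a default session both objects are also available without calling `data()` at all; the explicit call just makes the dependency visible in a script.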

Large datasets:

- On-Time Airline Performance data from the 2009 Data Expo. This Bash script will download all of the necessary data files and create a nice dataset for you called `airline.csv` in the directory in which it is executed. I would just post it here, but it is very large and I only have so much bandwidth!
- The Twitter dataset appears to no longer be available. Instead, use `anna.txt`, which comes with `HadoopStreaming`. Simply replace `twitter.tsv` with `anna.txt`.

