**For Part I, Parallelism in R, click here.**

Tuesday night I again had the opportunity to present on high performance computing in R at the Los Angeles R Users’ Group. This was the second part of a two-part series called “Taking R to the Limit: High Performance Computing in R.” Part II discussed ways to work with large datasets in R, and I also tied MapReduce into the talk. Unfortunately, there was too much material to fit in: I had originally planned to also cover Rhipe, using R on EC2, and sparse matrix libraries.

**Slides**

My edited slides are posted on SlideShare, and available for download here.

Topics included:

- bigmemory, biganalytics and bigtabulate
- ff
- HadoopStreaming
- brief mention of Rhipe

**Code**

The corresponding demonstration code is here.

**Data**

Since this talk discussed large datasets, I used some, well, large datasets. Some demonstrations used toy data, including `trees` and the famous `iris` dataset included in base R. To load these, just use the call `data(iris)` or `data(trees)`.
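As a minimal sketch of loading the toy data: both datasets ship in the `datasets` package, which R attaches at startup, so no extra package needs to be installed. The variable names `iris_dim` and `trees_dim` are only illustrative and are not part of the talk's demonstration code.

```r
# iris and trees live in the datasets package, which is attached by default.
data(iris)
data(trees)

# Inspect the structure and the first few rows of each dataset.
str(iris)
head(trees)

# Record the dimensions (rows, columns) as a quick sanity check.
iris_dim <- dim(iris)
trees_dim <- dim(trees)
```

Because these datasets are lazy-loaded, referring to `iris` or `trees` directly also works without calling `data()` first.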

Large datasets:

- On-Time Airline Performance data from the 2009 Data Expo. This Bash script will download all of the necessary data files and create a nice dataset for you called `airline.csv` in the directory in which it is executed. I would just post it here, but it is very large and I only have so much bandwidth!
- The Twitter dataset appears to no longer be available. Instead, use `anna.txt`, which comes with `HadoopStreaming`. Simply replace `twitter.tsv` with `anna.txt`.

**Video**

The video was created with Vara ScreenFlow, and I am very happy with how easy it was to use and how painless the editing was.
