An R User Group Mini Conference


by Joseph Rickert

With a membership that spans the entire Bay Area and the ability to hold meetings from San Jose to San Francisco, the Bay Area useR Group is in the very fortunate position of being able to attract world-class R speakers on a regular basis. Meetings that feature multiple speakers often attract 80 to 100 or more attendees and generate the excitement and feel of a “mini-conference”. Our recent January meeting set a new benchmark for this kind of event. Not only did we have three excellent speakers, Ryan Hafen, Hadley Wickham and Nick Elprin, but through blind luck there turned out to be considerable synergy among the talks. Moreover, through the generosity of Mozilla, our host for the meeting, we have a video of the event to share.

Each speaker was allocated about 20 minutes to present. It is extremely challenging to give an informative, technical talk in this limited amount of time. However, as you will see, even though they express dismay at being pressured by the clock, all three speakers do a superb job. From the audience's point of view, though, I think 20 minutes is just about right: an intense twenty-minute talk holds the audience's attention and is enough time for a speaker to convey the main points and present sufficient detail for motivated people to follow up on their own.

In the video below, the first speaker, Ryan Hafen, presents an overview of the Tessera project supported by Purdue University, the Pacific Northwest National Laboratory, and Mozilla. Ryan first explains Bill Cleveland's Divide and Recombine strategy and describes the underlying architecture (a key-value store that supports the use of MapReduce or Spark to implement the divide and recombine steps for large data sets). He then goes on to explain the use of Tukey-style cognostics (metrics computed for each panel in an array of plots that indicate how important it is to look at a particular panel) and shows examples of interactive trellis visualizations built with the latest ggvis technology.
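To make the divide and recombine idea concrete, here is a minimal conceptual sketch of my own in base R (not Tessera's datadr API): partition a data set into subsets, apply an analytic method to each subset independently, and then recombine the per-subset results.

# Divide and recombine, conceptually, in base R (not the datadr API)
# Divide: partition the iris data by species
subsets <- split(iris, iris$Species)

# Apply: fit a simple linear model to each subset independently;
# on a cluster these fits would run in parallel via MapReduce or Spark
fits <- lapply(subsets, function(d) {
  coef(lm(Petal.Length ~ Sepal.Length, data = d))
})

# Recombine: row-bind the per-subset coefficients into a single result
do.call(rbind, fits)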

Hadley Wickham, the second speaker, presents “Pure, predictable, pipeable: creating fluent interfaces with R”. Hadley begins by making the case for his magrittr pipe function, %>%, and then goes on to argue that adhering to the three principles of writing “pure”, “predictable” and “pipeable” functions greatly facilitates the goal of solving complex problems by combining simple pieces. Always an engaging and entertaining speaker, Hadley is at the top of his game here. No one comes better prepared for a talk, and few speakers can match Hadley's passionate delivery and wry sense of humor. (Hadley's examples of inconsistent R expressions are very amusing.)
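To see why pipeable functions compose so well, here is a minimal example of my own (not one from the talk): %>% passes the result of each step as the first argument of the next, so a nested computation reads as a left-to-right sequence.

library(magrittr)

# Without the pipe, the computation reads inside-out:
round(mean(unlist(subset(mtcars, cyl == 4, select = mpg))), 1)

# With %>%, the same computation reads left to right:
mtcars %>%
  subset(cyl == 4, select = mpg) %>%
  unlist() %>%
  mean() %>%
  round(1)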

Last in the lineup, and not at all fazed about following Hadley, Nick Elprin presents a tutorial on parallelizing machine learning algorithms in R. He begins by emphasizing that he is targeting applications where the data will fit into the memory of a robust machine, and presents a strategy for parallelizing code that emphasizes understanding your code well enough to employ parallelism where it will do the most good, and taking care to parallelize tasks to match your resources. After spinning up a server on the Domino platform, which allows running R code from an IPython notebook (see Nick's recent post on R Notebooks for details), Nick runs through a series of thoughtful examples, each emphasizing a different aspect of his parallelization principles. The following example contrasts the execution time of a simple random forest model with a parallel version built around a foreach loop.

# random forests
library(randomForest)
library(foreach)
library(doParallel)

# load the wine quality data and separate response from predictors
wine <- read.csv("winequality-red.csv", sep = ";", header = TRUE)
head(wine)
y_dat <- wine$quality
x_dat <- wine[, 1:11]

# sequential fit: grow all 500 trees in a single call
num_trees <- 500
system.time({
  randomForest(y = y_dat, x = x_dat, ntree = num_trees)
})

# parallel fit: register a backend, grow a share of the trees on each
# core, and merge the forests with randomForest's combine()
numCores <- detectCores()
registerDoParallel(numCores)

trees_per_core <- floor(num_trees / numCores)
system.time({
  wine_model <- foreach(trees = rep(trees_per_core, numCores),
                        .combine = combine, .multicombine = TRUE,
                        .packages = "randomForest") %dopar% {
    randomForest(y = y_dat, x = x_dat, ntree = trees)
  }
})
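One caveat with merging forests this way: according to the randomForest documentation for combine(), the merged object does not retain components such as the out-of-bag error rate and confusion matrix, so those must be recomputed if you need them.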

The video is well worth watching all the way through. (Well, maybe you should skip the first 20 seconds.) Here are some additional resources for all three presentations.
