News from UseR!2015 – the RHadoop tutorial

July 1, 2015
By

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Andrie de Vries

Today is the first day of UseR!2015 conference in Aalborg in Northern Denmark.  But yesterday was a day packed with 16 tutorials on a range of interesting topics.  I submitted a proposal many months ago to run a session on using R in Hadoop and was very happy to selected to run a session in the morning.

When we first started planning the session, we set a Big Hairy Audacious Goal to run the session using a HortonWorks Hadoop cluster hosted in the Microsoft Azure cloud.

We trialled the session at the Birmingham R user group during May, and then again last week during a Microsoft internal webinar. In both cases, the cluster performed great.

Then I asked the organisers how many people expressed an interest, and I heard that 66 people said they might come and that the room will seat 144 people.

At this point I started having cold sweats!

Just as luck would have it, minutes before start time something went wrong. The cluster would not start and participants were unable to access the server for the first hour of the tutorial. Fortunately, we had a back-up plan. 

You can take a vicarious tour through the analysis of the NYC Taxi Cab data with the rmarkdown slides built with the RStudio presentation suite.

Although I did not cover all the material during the session, the full set of presentations are:

1. Using R with Hadoop

2. Analysing New York taxi data with Hadoop

3. Computing on distributed matrices

4. Using RHive to connect to the hive database

(These presentations, sample data, all the scripts and examples live in github at https://github.com/andrie/RHadoop-tutorial. Use the tag https://github.com/andrie/RHadoop-tutorial/tree/2015-06-30-UseR!2015 to find the state of the repository at the time of the tutorial.)

The central example throughout is to use mapreduce() in the package rmr2 to summarize new York taxi journeys, grouped by hour of the day. This is a dataset of ~200GB of uncompressed csv files.

You can read more about how this data came into the public domain at http://chriswhong.com/open-data/foil_nyc_taxi/. Here is a simple plot of the results, sampled from a 1 in a 1000 sample of data:

   Sample-of-taxi-journeys

The full code to do this is at available in a github repository at https://github.com/andrie/RHadoop-tutorial. Here is an extract:

library(rmr2)
library(rhdfs)
hdfs.init()

rmr.options(backend = "hadoop")

hdfs.ls("taxi")$file
homeFolder <- file.path("/user", Sys.getenv("USER"))
taxi.hdp <- file.path(homeFolder, "taxi")

headerInfo <- read.csv("data/dictionary_trip_data.csv", stringsAsFactors = FALSE)
colClasses <- as.character(as.vector(headerInfo[1, ]))

taxi.format <- make.input.format(format = "csv", sep = ",",
col.names = names(headerInfo),
colClasses = colClasses,
stringsAsFactors = FALSE
)

taxi.map <- function(k, v){
original <- v[[6]]
date <- as.Date(original, origin = "1970-01-01")
wkday <- weekdays(date)
hour <- format(as.POSIXct(original), "%H")
dat <- data.frame(date, hour)
z <- aggregate(date ~ hour, dat, FUN = length)
keyval(z[[1]], z[[2]])
}

taxi.reduce <- function(k, v){
data.frame(hour = k, trips = sum(v), row.names = k)
}

m <- mapreduce(taxi.hdp, input.format = taxi.format,
map = taxi.map,
reduce = taxi.reduce
)

dat <- values(from.dfs(m))

library("ggplot2")
p <- ggplot(dat, aes(x = hour, y = trips, group = 1)) +
geom_smooth(method = loess, span = 0.5,
col = "grey50", fill = "yellow") +
geom_line(col = "blue") +
expand_limits(y = 0) +
ggtitle("Sample of taxi trips in New York")

p

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)