Combining Hadoop, Spark, R, SparkR and Shiny…. and it works :-)

July 9, 2015
By

(This article was first published on Longhow Lam's Blog » R, and kindly contributed to R-bloggers)

A long time ago in 1991 I had my first programming course (Modula 2) at the Vrije University in Amsterdam. I spend months behind a terminal with a green monochrome display doing the programming exercises using VI. Do you remeber Shift ZZ, and :q!… 🙂 After my university period I did not use VI often… Recently, I got hooked up again!

monochromemonitor

A system very similar to this one where I did my first text editing in VI

I was excited to hear the announcement of sparkR, an R interface to spark and was eager to do some experiments with the software. Unfortunately none of the Hadoop sandboxes have spark 1.4 and sparkR pre-installed to play with. So I needed to undertake some steps myself. Luckily, all steps are beautifully described in great detail on different sites.

Spin up a vps

At argeweb I rented an Ubuntu VPS, 4 cores 8 GB. That is a very small environment for Hadoop, and of course a 1 node environment does not show the full potential of Hadoop / Spark. However, I am not trying to do performance or stress tests on very large data sets, just some functional tests. Moreover, I don’t want to spent more money 🙂,  though the VPS can nicely be used to install nginx and host my little website -> www.longhowlam.nl 

Install R, Rstudio and shiny

A very nice blog post by Dean Atalli, which I am not going to repeat here, describes how easy it is to setup R, RStudio and Shiny. I followed steps 6, 7 and 8 of his blog post and the result is a running Shiny server on my VPS environment. In addition to my local RStudio application on my laptop, I can now also use R on my iPhone through Rstudio server on my VPS. Can be quit handy in a crowded bar when I need to run some R commands….

iphonescreen
using R on my iPhone. You need good eyes!

Install Hadoop 2.7 and Spark

To run Hadoop, you need to install java first, configure SSH, fetch the hadoop tar.gz file, install it, set environment variables in the ~/.bashrc file, modify hadoop configuration files, format the hadoop file system and start it. All steps are described in full detail here. Then in addition to that download the the latest version of Spark, the pre-build for hadoop 2.6 or later worked fine for me. You can just extract the tgz file, set the SPARK_HOME variable and you are done!

In each of the above steps different configuration files needed to be edited. Luckily I can still remember my basic VI skills……

Creating a very simple Shiny App

The SparkR package is already available when Spark is installed, its location is inside the Spark directory. So, when attaching the SparkR library to your R session, specify its location using the lib.loc argument in the library function. Alternatively, add the location of the SparkR library to the Search Paths for packages in R, using the .libPaths function. See some example code below.

library(SparkR, lib.loc = "/usr/local/spark/R/lib")

## initialeze SparkR environment
sc = sparkR.init(sparkHome = '/usr/local/spark')
sqlContext = sparkRSQL.init(sc)

## convert the local R 'faithful' data frame to a Spark data frame
df = createDataFrame(sqlContext, faithful) 

## apply a filter waiting > 50 and show the first few records
df1 = filter(df, df$waiting > 50)
head(df1)

## aggregate and collect the results in a local R data sets
df2 = summarize(groupBy(df1, df1$waiting), count = n(df1$waiting))
df3.R = collect(df2)

Now create a new Shiny app, copy the R code above into the server.R file and instead of a hard coded value 50, let’s make this an input using a slider. That’s basically it, my first shiny app calling SparkR……

Cheers, Longhow

To leave a comment for the author, please follow the link and comment on their blog: Longhow Lam's Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)