Last week, I came across sparklyr. Authored by the folks at rstudio, it allows you to integrate your R workflow (and, more importantly, your dplyr workflow) with apache spark. In one of the examples on the sparklyr home page, the author shows how to set up rstudio and sparklyr on an Amazon Elastic Compute Cloud (ECC). In this post, I’d like to share how you can set up your own rstudio/apache cluster environment using Google dataproc. Why dataproc instead of ECC? Because google is kind (desperate?) enough to give you $300 worth of free playtime with their cloud services for 60 days!
The first part of this post covers setting up the apache cluster and rstudio. The second part shows you have to store data on hdfs and load it into hive, and the third part illustrates how you can use dplyr commands with sparklyr.
Setting up Rstudio and the apache cluster
Whereas setting up and using an apache cluster was once an arcane process unsuited for the uninitiated, these days vendors such as Amazon Web Services, Microsoft Azure and Google Cloud Computing make the entire process a lot easier.
The first thing you should do is sign up for an account. Then, create a new project.
Give your project a fancy name.
Go to the ‘API’ section in the console.
Enable the APIs with the red rectangle around them. If they are not in the list, use the ‘enable api’ button (marked in blue) to search for and activate them.
Go to the ‘networking’ pane under the settings menu.
Click on ‘firewall rules’ and then click on ‘add firewall rule’. We need to open up port 8787 (which rstudio server uses) in order to access rstudio from a browser.
Create the rule as shown in the picture and click ‘create’.
Go back to the menu, scroll all the way down, and select ‘Dataproc’
Click on ‘create cluster’ and customize settings to your taste. NB. If you’re on a test plan you cannot exceed 8 cores.
Install the Cloud SDK for your OS and follow the instructions to set it up properly.
Go back to the developers console and visit the ‘compute engine’ section.
Click on the arrow next to the master node ‘ssh’ option and select ‘view gcloud command’.
Copy-paste the command in a terminal, hit enter, and follow the instructions. (essentially, hit enter twice to make a new ssh key for your computer to login securely via ssh).
Congratulations, you are now logged in.
To access rstudio server, we will need a user/password combination. However, the current user ‘Jasper’, does not login using a password, but via ssh. The easiest thing to do is to add a new user. Execute
and fill out a new password for the user. You can simply press enter when prompted to fill out a name, office etc.
Next, go to the rstudio website and copy the link to the latest build.
Go back to the terminal and execute ‘wget
rstudio-server-link’, as in the picture below.
In the terminal, execute
Execute ‘ls -l’, and copy the name of your studio server distribution. Then execute ‘sudo gdebi
your-rstudio-distribution’ in the terminal. When prompted if you want to install the package, type ‘y’ and hit enter.
If all goes well you should see the following:
Let’s install some dependencies for the R packages we’ll be installing later. Copy-paste the following into the terminal and hit enter
Go back to the google developers console, and copy the public IP of your master node:
Open up a new tab in your web browser and visit ‘http://
your-public-ip:8787’. You should see the following page:
Sign in using the credentials you created for user ‘rstudio’. Congrats! You now have a fully functioning stack rstudio plus spark cluster running. Let’s have some fun with it.
Loading data using hive
Then, create a directory for the rstudio user in hdfs (hadoop’s file system), and create a subdirectory for the phoenix data.
The git repository contains a packrat library with all necessary packages, so you will not need to install any packages manually. Go back to the rstudio window in your browser and use the file navigation system at the bottom right-hand corner to locate and open the ‘spark-blog.Rproj’.
Once opened, the packrat library should kick in and start installing packages. If nothing happens, execute the following in the R Console.
Now, open ‘main.R’ and execute the following lines of code to download the daily phoenix datasets. This takes approximately 20 minutes.
Once downloaded, go back to the terminal and copy the downloaded data to hdfs.
We’ll create a SQL-like table of all the individual data files using apache hive. Do to this, execute
in a terminal, and copy-paste the following lines of code to create the hive table:
Next, load the phoenix data sets into the table:
Great! So now we’re all set to connect to the data via sparklyr.
Analyzing Phoenix data using sparklyr
The first thing you should do is load the necessary packages and connect to spark.
Next, we cache the phoenix dataset in memory to speed up queries.
As it turns out, the ‘datestamp’ variable in the phoenix dataset was not formatted as a real date. This is not a huge problem, as we can format it properly via R:
Let’s execute some straightforward queries on the data. First, how many events do we have over the entirety of the dataset?
As we can see, the data fluctuates a lot between August 2014 and January 2015.
It seems that the period between February 1st, 2015 and June 1st, 2016 is the most reliable in terms of data availability. Let’s filter for those dates.
Next, we’d like to know the average goldstein score per day. The goldstein score is an (imperfect) measure of tone ranging from -10 to 10; the worse the event, the lower it is.
On most days, the average goldstein score hovers between 0 and 1. January 2nd seems especially negative.
Let’s look at events in four countries: the USA, China, Russia and Great Britain.
Events that happen in the USA are reported on most. However, there is a significant drop in events reported after October 2015. At the same time, Russia, China and Great Britain seem to increase their share of all daily events. This could have a lot of causes, but given the sudden change I’d say that either the news sources on which the data set is based were changed, or someone tweaked the algorithm used to geolocate the events.
Finally, we can plot a map of all violent events in Syria using the
Here, we observe that most events have no specific location; they are simply located at the center of Syria. Other than that, we also observe that many events occur at hotbeds of violence such as Aleppo, Kobane and around Damascus.