Suppose you are an avid R user, and you would like to use SparkR in Cloudera Hadoop; unfortunately, as of the latest CDH version (5.7), SparkR is still not supported (and, according to a recent discussion in the Cloudera forums, we shouldn’t expect this to happen anytime soon). Is there anything you can do?
Well, indeed there is. In this post we will demonstrate how to use SparkR in a Cloudera Hadoop cluster. The assumptions are the following:
- A Cloudera Hadoop cluster, with R installed in all worker nodes
- A gateway node, through which you connect to the cluster to submit jobs, and in which you naturally have a user account (i.e. no superuser privileges needed)
Step 1: download Spark locally
The first step is to download Spark locally in your gateway home folder; this is very simple actually, and I have provided detailed instructions elsewhere. The best part? You don’t need to download a Spark version that matches the version in your CDH distribution; in our cluster, the CDH version is 5.6, which comes with Spark 1.5.0, while locally I have downloaded Spark 1.6.1, prebuilt for Hadoop 2.6. Here is my local home directory in my gateway:
[[email protected] ~]$ ll total 282640 drwxrwxr-x 2 ctsats ctsats 52 Mar 1 18:16 kaggle drwxr-xr-x. 3 ctsats ctsats 91 Apr 12 18:30 R drwxrwxr-x 2 ctsats ctsats 25 Feb 5 13:21 scripts drwxr-xr-x 12 ctsats ctsats 4096 Feb 27 07:02 spark-1.6.1-bin-hadoop2.6 -rw-rw-r-- 1 ctsats ctsats 289405702 Apr 11 12:55 spark-1.6.1-bin-hadoop2.6.tgz -rw-rw-r--. 1 ctsats ctsats 90 Feb 3 14:29 test.data
Step 2: run SparkR scripts locally from RStudio
Next, we will run a SparkR script locally from an RStudio session. Here is a simple example script, reading a CSV file from HDFS and printing its first elements (detailed explanations below):
Sys.setenv(HADOOP_CONF_DIR='/etc/hadoop/conf.cloudera.hdfs') Sys.setenv(YARN_CONF_DIR='/etc/hadoop/conf.cloudera.yarn') library(SparkR, lib.loc = "/home/ctsats/spark-1.6.1-bin-hadoop2.6/R/lib") library(magrittr) sc <- sparkR.init(sparkHome = "/home/ctsats/spark-1.6.1-bin-hadoop2.6", sparkPackages="com.databricks:spark-csv_2.10:1.4.0") sqlContext <- sparkRSQL.init(sc) df <- read.df(sqlContext, 'recsys/data/interactions_clean.csv', "com.databricks.spark.csv", header="true") df %>% head %>% print sparkR.stop()
Let’s see the script line by line:
- In lines 1 & 2, we set the required Hadoop environmental variables (you can obtain them from your cluster administrator)
- In line 4, we load the
SparkRpackage, providing its location in our local machine
- Spark context is initialized in lines 7 & 8. In line 7, we provide the sparkHome directory, while in line 8 we include any additional Spark packages with the sparkPackages argument (in our case, the
- In line 10, we initialize the
sqlContext, necessary because we are going to use Spark dataframes
- In line 12, we read the CSV file interactions_clean.csv (which includes a header) from the HDFS directory recsys/data/ using the
spark-csvpackage. The result is a Spark dataframe
- In line 13, we use the pipe operators from the R package
magrittr(loaded in line 5), to print out the first elements of our Spark dataframe (including its header).
- Finally, in line 15 we stop the Spark context.
We can run the script from RStudio, either the whole of it or line by line in an interactive manner. We can also save it (sparkr-test.R) and run it from the command line:
[[email protected] ~]$ Rscript 'R/sparkr_test.R' [...] user_id item_id interaction_type 1 7 1006839 1 2 9 944146 3 3 9 1053485 1 4 13 2444782 1 5 23 501722 1 6 23 1305844 1 [...]
So far so good. But you may be thinking “OK, this is nothing new; you have just downloaded again Spark locally, and are using this to run SparkR. We already knew this, so what is the fuss about?“
So, here comes the really nice stuff…
Step 3: submit SparkR applications to YARN
It turns out that you can indeed submit SparkR applications to YARN, despite the fact that SparkR is not available in your (Cloudera) Hadoop cluster. Here is how.
First, we need to make some changes to our script above; here is the new script, saved as sparkr-submit_test.R:
library(SparkR, lib.loc = "/home/ctsats/spark-1.6.1-bin-hadoop2.6/R/lib") library(magrittr) sc <- sparkR.init(appName = 'SparkR-submit-test', sparkEnvir = list(spark.driver.memory="2g", spark.executor.cores='2', # STRINGS here for spark-submit!!! spark.executor.instances='12')) sqlContext <- sparkRSQL.init(sc) df <- read.df(sqlContext, 'recsys/data/interactions_clean.csv', "com.databricks.spark.csv", header="true") df %>% head %>% print sparkR.stop()
The changes in our new script are:
- We don’t set the environmental variables
- We also don’t declare any additional Spark packages, like spark-csv
- We include further configuration settings in our Spark context (lines 4-7)
Now, let’s try to submit this script to YARN; to do so, we have to:
- Set the necessary environmental variables from the terminal
- Change our working directory to where our local Spark executables are, i.e. spark-1.6.1-bin-hadoop2.6/bin in our case
- Precede the
- Include the spark-csv package in the command line
Here is the procedure and the results:
[[email protected] ~]$ export HADOOP_CONF_DIR=/etc/hadoop/conf.cloudera.hdfs [[email protected] ~]$ export YARN_CONF_DIR=/etc/hadoop/conf.cloudera.yarn [[email protected] ~]$ cd spark-1.6.1-bin-hadoop2.6/bin/ [[email protected] bin]$ ./spark-submit --master yarn-client --driver-memory 2g --packages com.databricks:spark-csv_2.10:1.4.0 ~/R/sparkr-submit_test.R [...] 16/04/22 19:20:04 INFO spark.SparkContext: Running Spark version 1.6.1 [...] user_id item_id interaction_type 1 7 1006839 1 2 9 944146 3 3 9 1053485 1 4 13 2444782 1 5 23 501722 1 6 23 1305844 1
As we can see, not only the job is submitted to YARN and executed successfully, but it also used Spark 1.6.1 (highlighted line above), and not the existing Spark version in our Cloudera Hadoop cluster (1.5.0).
Notice also that neither
magrittr R packages are installed in our cluster workers, the only requirement for them being an R installation.
Finally, we notice that we can use a similar approach to run other Spark components (e.g. PySpark jobs) in a Spark version newer than the one shipped with our Cloudera Hadoop version, which might also come handy (for instance, I have found a significant performance boost in some MLlib algorithms of Spark 1.6.1 compared to Spark 1.5.0).
We have presented a simple and straightforward way for running SparkR applications in a Cloudera Hadoop cluster, both locally and in YARN. Apart from R being present to the worker nodes, no modification to the cluster software is required, neither superuser privileges in the local working client machine (gateway). The approach demonstrated can also be used to “bypass” the Spark version shipped with Cloudera Hadoop (which usually lags behind the latest one) with a newer one, if necessary.