Series of Apache Spark posts:
- Dec 01: What is Apache Spark
- Dec 02: Installing Apache Spark
- Dec 03: Getting around CLI and WEB UI in Apache Spark
- Dec 04: Spark Architecture – Local and cluster mode
- Dec 05: Setting up Spark Cluster
- Dec 06: Setting up IDE
- Dec 07: Starting Spark with R and Python
- Dec 08: Creating RDD files
- Dec 09: RDD Operations
- Dec 10: Working with data frames
Once Spark is installed, it can be extended widely, not only through languages but also through additional packages and systems. With R, for example, you can harness distributed and parallel computation and also extend what the R language itself can do.
A variety of extensions are available from the CRAN repository or from GitHub: Spark with flint, Spark with Avro, Spark with EMR, and many more. For data analysis and machine learning, you can take for example (among many others):
- sparktf (with TensorFlow),
- xgboost (compatible with Spark),
- geospark for working with geospatial data,
- Spark for R on Google Cloud.
A simple way to start is to install an extension, load it together with sparklyr, and connect:
library(sparkextension)
library(sparklyr)
sc <- spark_connect(master = "spark://192.168.0.184:7077")
By setting the connection to the master this way, I can have all the additional packages installed on the Spark master.
The rsparkling extension gives you even more capabilities and lets you use H2O in Spark with R.
Download and install the h2o and rsparkling packages from the H2O release repositories:
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-yates/5/R")
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.3/31/R")
Working with H2O then comes down to loading the libraries, connecting to Spark, and converting the data:
library(rsparkling)
library(sparklyr)
library(h2o)
sc <- spark_connect(master = "local", version = "2.3", config = list(sparklyr.connect.timeout = 120))
# getting data
iris_spark <- copy_to(sc, iris)
# converting the Spark data frame to an H2O frame
iris_spark_h2o <- as_h2o_frame(sc, iris_spark)
With Python, the extensibility is as rich as with R, and there are even more packages available. Python also has an extension called PySparkling, which comes with the H2O Python packages.
pip install h2o_pysparkling_2.2
pip install requests
pip install tabulate
pip install future
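Before starting a cluster, it can help to confirm that the installed packages are visible to the Python interpreter you are using. The following is a minimal sketch using only the standard library; the helper `missing_modules` is a hypothetical name, and the module names in `required` are an assumption about what the pip packages above expose for import.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Assumption: these are the import names provided by the pip packages above.
required = ["pysparkling", "h2o", "requests", "tabulate", "future"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported as missing, the most common cause is that pip installed into a different Python environment than the one running the check.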
And running a cluster:
from pysparkling import *
import h2o
hc = H2OContext.getOrCreate()
And loading data as an H2O frame, converting it to a Spark dataframe, and splitting it into training and testing sets:
import h2o
frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
sparkDF = hc.asSparkFrame(frame)
sparkDF = sparkDF.withColumn("CAPSULE", sparkDF.CAPSULE.cast("string"))
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])
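To get a feel for what a weighted random split such as `randomSplit([0.8, 0.2])` does, here is a plain-Python illustration of the idea. It is a sketch only and not Spark's implementation: the function `random_split` and its `seed` parameter are hypothetical, and it works on an ordinary Python sequence rather than a distributed dataframe. The point it shows is that the split sizes are approximate, because each row is assigned independently at random.

```python
import random
from itertools import accumulate


def random_split(rows, weights, seed=None):
    """Assign each row to one split with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative, normalized upper boundaries, e.g. [0.8, 0.2] -> [0.8, 1.0]
    bounds = [c / total for c in accumulate(weights)]
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, bound in enumerate(bounds):
            if draw < bound:
                splits[i].append(row)
                break
        else:
            # Only reached if floating-point rounding leaves the last bound below 1
            splits[-1].append(row)
    return splits


training, testing = random_split(range(1000), [0.8, 0.2], seed=42)
print(len(training), len(testing))
```

The two parts always add up to the full input, but the training part will only be close to 80% of the rows, not exactly 80%.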
And you can start working with anything relating to dataframes, machine learning and more.
from pysparkling.ml import H2OAutoML
automl = H2OAutoML(labelCol="CAPSULE", ignoredCols=["ID"])
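The estimator above is only constructed, not yet trained. Assuming `H2OAutoML` follows the usual Spark ML estimator pattern (a `fit` method that returns a model with a `transform` method), the next step could be sketched as below. The helper `fit_and_score` is a hypothetical name introduced for illustration, and the usage lines are left as comments because they need a running H2O context.

```python
def fit_and_score(estimator, training_df, testing_df):
    """Fit a Spark ML style estimator and score a held-out dataframe."""
    # Train on the training split; the estimator returns a fitted model
    model = estimator.fit(training_df)
    # Apply the fitted model to the testing split to obtain predictions
    predictions = model.transform(testing_df)
    return model, predictions


# Hypothetical usage with the objects defined above:
# model, predictions = fit_and_score(automl, trainingDF, testingDF)
# predictions.show(truncate=False)
```

Keeping the fit and transform steps in one small function makes it easy to swap in a different estimator later without changing the surrounding code.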
Tomorrow we will look at Spark SQL and how to get on board.
The complete set of code, documents, notebooks, and all of the materials will be available at the GitHub repository: https://github.com/tomaztk/Spark-for-data-engineers
Happy Spark Advent of 2021!