Advent of 2021, Day 24 – Data Visualisation with Spark

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers.]

Series of Apache Spark posts:

In previous posts, we have seen that Spark DataFrames (Datasets) interoperate with the classes and functions of whichever language you prefer (Scala, R, Python, Java).

Using Python

You can use any of the popular Python visualisation packages: Plotly, Dash, Seaborn, Matplotlib, Bokeh, Leather, Glam, and many others. Once the data is collected into a pandas DataFrame, any of these packages will work. Here is an example that plugs Matplotlib into PySpark:

from pyspark.sql import SparkSession
import matplotlib.pyplot as plt

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("sampleData.csv")
# down-sample (~80 %, without replacement) before collecting to the driver
sampled_data = df.select("x", "y").sample(False, 0.8).toPandas()

# and at the end lets use our beautiful matplotlib
plt.scatter(sampled_data.x,sampled_data.y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('relation of y and x')
plt.show()
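The key step above is down-sampling before `toPandas()`, since collecting a full Spark DataFrame to the driver can exhaust memory. The same pattern can be sketched with plain pandas (toy, hypothetical data; no Spark required):

```python
import pandas as pd

# toy stand-in for the collected Spark data (hypothetical values)
df = pd.DataFrame({"x": range(1000), "y": [v * 2 for v in range(1000)]})

# same idea as Spark's sample(False, 0.8): ~80 % of rows, no replacement
sampled = df.sample(frac=0.8, replace=False, random_state=42)
```

Only the sampled subset ever needs to fit in driver memory; the plotting package then sees an ordinary pandas DataFrame.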

Using R

With the help of sparklyr for the Spark connection, dplyr for wrangling, and ggplot2 for plotting:

library(sparklyr)
library(ggplot2)
library(dplyr)

#connect
sc <- spark_connect(master = "local")

# data wrangling
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

# plot delays
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)
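The dplyr chain above aggregates inside Spark and only `collect()`s the small summary table for plotting. For readers more comfortable in Python, roughly the same aggregate-then-filter pattern looks like this in pandas (toy, hypothetical data standing in for the flights table):

```python
import pandas as pd

# hypothetical miniature of nycflights13::flights
flights = pd.DataFrame({
    "tailnum": ["N1"] * 30 + ["N2"] * 10,
    "distance": [500] * 30 + [900] * 10,
    "arr_delay": [5.0] * 30 + [12.0] * 10,
})

# group_by + summarise
delay = (flights.groupby("tailnum")
         .agg(count=("tailnum", "size"),
              dist=("distance", "mean"),
              delay=("arr_delay", "mean"))
         .reset_index())

# filter(count > 20, dist < 2000, !is.na(delay))
delay = delay[(delay["count"] > 20) & (delay["dist"] < 2000) & delay["delay"].notna()]
```

In the real workflow the grouping and filtering run inside Spark and only the result is collected; this sketch just mirrors the logic locally.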

Using Scala

The easiest way to visualise results from Scala is to use a notebook: Databricks, Binder, or Zeppelin. Store the results in a DataFrame and you can visualise them quickly, with practically no plotting code.

// DeviceIoTData is a case class matching the JSON schema; it must be defined beforehand
val ds = spark.read.json("/databricks-datasets/iot/iot_devices.json").as[DeviceIoTData]
display(ds)

And now we can create graphs using the buttons at the bottom left of the result cell in Azure Databricks notebooks. Besides the graphs, you also get data profiling out of the box.
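Outside Databricks, where `display()` is not available, a similar quick look at the data can be had in Python; a minimal sketch, with a couple of hypothetical sample records standing in for iot_devices.json:

```python
import io
import pandas as pd

# hypothetical sample records standing in for iot_devices.json
raw = '[{"device_id": 1, "temp": 20}, {"device_id": 2, "temp": 25}]'
ds = pd.read_json(io.StringIO(raw))

# a quick tabular/statistical look, roughly what display(ds) offers in Databricks
summary = ds.describe()
print(summary)
```

For real Spark data the equivalent would be a sampled `toPandas()` followed by `describe()`, as in the Python section above.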

Tomorrow we will look into Spark Literature and where to go for next steps.

The complete set of code, documents, notebooks, and all other materials will be available at the GitHub repository: https://github.com/tomaztk/Spark-for-data-engineers

Happy Spark Advent of 2021! 🙂
