Advent of 2021, Day 10 – Working with data frames

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers].

Series of Apache Spark posts:

We have looked into datasets and seen that a dataset is a distributed collection of data. A dataset can be constructed from JVM objects and later manipulated with transformation operations (e.g. filter(), map(), …). The dataset API is available in Scala and in Java, but in both Python and R you can also access the columns or rows of datasets.

On the other hand, a dataframe is an organised dataset with named columns. It offers much better optimisations and computations and still resembles a typical table (as we know it from the database world). Dataframes can be constructed from arrays or matrices, from a variety of files, from SQL tables, and from datasets (RDDs). The dataframe API is available in all flavours (Java, Scala, R and Python), hence its popularity.

Dataframes with R

Start a session and get going:

# spark_home points at your local Spark installation
spark_path <- file.path(spark_home, "bin", "spark-class")

# Start cluster manager master node
system2(spark_path, "org.apache.spark.deploy.master.Master", wait = FALSE)

# Start worker node, find master URL at http://localhost:8080/
system2(spark_path, c("org.apache.spark.deploy.worker.Worker", "spark://"), wait = FALSE)

library(SparkR)
sparkR.session(appName = "R Dataframe Session", master = "spark://")

And start working with a dataframe by importing a short and simple JSON file (copy the lines below and store them in a people.json file):

{"name":"Michael", "age":29, "height":188}
{"name":"Andy", "age":30, "height":201}
{"name":"Justin", "age":19, "height":175}

df <- read.json("usr/library/main/resources/people.json")

And we can apply several transformations:

# select name and increment age by one
head(select(df, df$name, df$age + 1))

# filter people older than 21
head(where(df, df$age > 21))

# count people by age
head(count(groupBy(df, "age")))

And also by adding and combining additional packages (e.g. dbplot, tidyr, ggplot2):

library(dbplot)

# histogram of height, computed by Spark
dbplot_histogram(df, height)

# adding ggplot2
library(tidyr)
library(ggplot2)

df %>% 
    collect() %>% 
    gather(variable, value, age, height) %>% 
    ggplot(aes(name, value, fill = variable)) + 
    geom_col(position = "dodge")

There are many other functions that can be used with the Spark Dataframe API in R. Alternatively, we can do the same with Python.

Dataframes with Python

Start a session and get going:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Dataframe Session API") \
    .master("spark://") \
    .getOrCreate()

And we can start by importing data into a dataframe:

df = spark.read.json("examples/src/main/resources/people.json")

# or show the complete dataset
df.show()

And working with filters and subsetting the dataframe, as if it were a normal numpy/pandas dataframe:

# select name and increment age by one
df.select(df['name'], df['age'] + 1).show()

#filtering by age
df.filter(df['age'] > 21).show()

# grouping by age and displaying the count
df.groupBy("age").count().show()
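Since the post likens this to working with a normal numpy/pandas dataframe, here is the same select/filter/group-by pipeline sketched in plain pandas for comparison (the records are hard-coded from people.json above, so no Spark session is needed; pandas column names match the JSON keys):

```python
import pandas as pd

# the same three records as in people.json
df = pd.DataFrame({
    "name": ["Michael", "Andy", "Justin"],
    "age": [29, 30, 19],
    "height": [188, 201, 175],
})

# select name and increment age by one
print(df[["name"]].assign(age_plus_one=df["age"] + 1))

# filter people older than 21
print(df[df["age"] > 21])

# count people by age
print(df.groupby("age").size())
```

The Spark and pandas APIs differ in evaluation (Spark is lazy and distributed, pandas is eager and in-memory), but the shape of the operations maps over almost one to one.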

Tomorrow we will look at how to plug the R or Python dataframes into packages and get more out of the data.

Complete set of code, documents, notebooks, and all of the materials will be available at the Github repository:

Happy Spark Advent of 2021!
