Advent of 2020, Day 30 – Monitoring and troubleshooting of Apache Spark



Yesterday we looked into performance tuning for improving the day-to-day usage of Spark and Azure Databricks. Today we will explore monitoring (which we started on Day 15) and troubleshooting for the most common mistakes and errors a user will encounter in Azure Databricks.

1. Monitoring

Spark in Databricks is largely taken care of for you and can be monitored from the Spark UI. Since Databricks is an encapsulated platform, Azure manages many of the components on your behalf: the network, the JVM (Java Virtual Machine), the host operating system, and cluster components such as Mesos, YARN, or any other Spark cluster manager.

As we saw in the Day 15 post, you can monitor queries, tasks, jobs, and Spark logs through the Spark UI in Azure Databricks. Spark logs help you pinpoint the problem you are encountering. They also serve as a history log for understanding the behaviour of a job or task over time, and for future troubleshooting.
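One small, practical trick with the logs (a minimal sketch in PySpark, relying on the `spark` session object that Databricks notebooks predefine): temporarily raise the log level while reproducing a problem, so the driver log captures more detail.

```python
# A minimal sketch: raise the log level while reproducing an issue, then
# restore a quieter default so the history logs stay readable.
spark.sparkContext.setLogLevel("DEBUG")   # also accepts INFO, WARN, ERROR

# ... re-run the failing job or cell here ...

spark.sparkContext.setLogLevel("WARN")
```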

Spark UI is a good visual way to monitor what is happening on your cluster and offers a wealth of metrics for troubleshooting.

It also gives you detailed information on Spark tasks and a clear visual presentation of task runs, SQL runs, and all of their stages.

All of the tasks can also be visualized as a DAG (directed acyclic graph).

2. Troubleshooting

Before diving into Spark debugging, let me outline some common symptoms of problems in your Spark jobs and in the Spark engine itself, together with their likely causes. There are many issues one can encounter; I will tackle a couple of those that surface as a return message in Databricks notebooks or in the Spark UI in general.

2.1. Spark job not started

This issue can appear frequently, especially if you are a beginner, but it can also happen when Spark is running standalone (not in Azure Databricks).

Signs and symptoms:
– Spark jobs don't start
– the Spark UI does not show any nodes on the cluster (except the driver)
– the Spark UI reports vague or incorrect information

Potential solutions:
– the cluster has not started or is still starting up,
– this often happens with a poorly configured cluster (usually when running your own Spark deployment, and (almost) never with Azure Databricks); check the IP settings, the network, or the VNet,
– it can be a memory configuration issue, which should be fixed in the start-up scripts (a quick check is sketched below).
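A quick way to rule out a memory misconfiguration is to inspect what the running session actually came up with. This is only a sketch; in Databricks these values are set in the cluster configuration UI, and in a standalone deployment typically in spark-defaults.conf or on the spark-submit command line:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Print the effective memory settings the cluster actually started with;
# "<not set>" means the Spark built-in default is in effect.
for key in ("spark.driver.memory",
            "spark.executor.memory",
            "spark.executor.instances"):
    print(key, "=", spark.conf.get(key, "<not set>"))
```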

2.2. Error during execution of notebook

When working in notebooks on a cluster that is already running, it can happen that some part of the code, or a Spark job that was previously running fine, starts to fail.

Signs and symptoms:
– a job runs successfully on all clusters except one, where it fails
– code blocks in the notebook run normally in sequence, but one of them fails
– a Hive SQL table or an R/Python DataFrame that used to be created normally can no longer be created

Potential solutions:
– check that your data still exists at the expected location and is still in the same file format (see the sketch after this list)
– if you are running a SQL query, check that the query is valid and all the column names are correct
– go through the stack trace and try to figure out which component is failing
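For the first check, a quick sanity test of the data location and schema often narrows things down. The path below is a hypothetical placeholder; substitute your own:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "/mnt/data/sales.parquet"   # hypothetical path; use your own location

# Fails fast if the location no longer exists or the format has changed.
df = spark.read.parquet(path)

# Compare against the columns and types the failing query expects.
df.printSchema()
```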

2.3. Cluster unresponsive

When running notebook commands or using Spark apps (widgets, etc.), you can get a message that the cluster is unresponsive. This is a severe error and should be addressed immediately.

Signs and symptoms:
– the code block is not executed and fails with a long list of JVM responses
– you get an error message that the cluster is unresponsive
– a Spark job is running, but returns no result and no error message

Potential solutions:
– restart the cluster and attach the notebook to it
– check the dataset for any inconsistencies and check its size (limits on uploaded files or the distribution of the files over DBFS)
– check the compatibility of the installed libraries with the Spark version on your cluster (see the sketch after this list)
– change the cluster runtime from Standard, GPU, or ML to LTS; Long-Term Support Spark runtimes tend to have a greater span of compatibility
– if you are using a high-concurrency cluster, check who is on it and what they are doing; there may be a potential "deadlock" caused by tasks that consume too many resources
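For the library-compatibility check, a quick environment probe after reattaching the notebook can rule out a runtime mismatch. A sketch:

```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compare the cluster's Spark runtime against the library version your
# notebook imports; a mismatch here is a common source of odd failures.
print("Spark version:  ", spark.version)
print("PySpark version:", pyspark.__version__)
```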

2.4. Failure to load data

Loading data is probably the most important task in Azure Databricks, and there are many ways in which data can fail to appear in the notebook.

Signs and symptoms:
– data is stored in blob storage and cannot be accessed or loaded into Databricks
– data takes too long to load, and you stop the load process
– data should be at the location, but it is not

Potential solutions:
– if you are reading data from Azure Blob storage, check that Azure Databricks has all the credentials needed for access (see the sketch after this list)
– loading wide data files (1,000+ columns) might cause problems for Spark; load the schema first, create the DataFrame (for example with Scala), and then insert the data into it
– check that the persistent data (on DBFS) is at the correct location and in the expected format; it can also happen that different sample files are in use, and that these are missing from the standard DBFS path
– the DataFrame or Dataset may have been created in a different language than the one you are trying to read it from; languages sit on top of the Structured API and should be interchangeable, so check your code for inconsistencies
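For the credentials check, here is a sketch of reading directly from Azure Blob storage with an account key. All names (storage account, container, secret scope) are hypothetical placeholders, and the key is pulled from a Databricks secret scope rather than pasted into the notebook:

```python
storage_account = "mystorageaccount"   # hypothetical storage account name
container = "mycontainer"              # hypothetical container name

# Fetch the account key from a (hypothetical) Databricks secret scope.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-key"),
)

path = f"wasbs://{container}@{storage_account}.blob.core.windows.net/data.csv"
df = spark.read.option("header", "true").csv(path)
df.show(5)
```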

2.5. Unexpected Null in Results

Signs and symptoms:
– unexpected null values appear in Spark transformations
– scheduled jobs that used to work no longer run, or no longer produce correct results

Potential solutions:
– the format of the underlying data may have changed,
– use an accumulator to count the number of rows (records/observations) and to catch the rows where parsing or processing fails (see the sketch after this list),
– check that your transformations return a valid SQL query plan, and watch for implicit data type conversions (a "15" is a string, not a number); these can make Spark return a strange result or no result at all.
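A sketch of the accumulator approach, assuming a DataFrame `df` with a string column "amount" that should be numeric (both names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
bad_rows = spark.sparkContext.accumulator(0)

def parse_amount(value):
    """Parse a value to float; count the rows that fail instead of
    silently turning them into nulls."""
    try:
        return float(value)
    except (TypeError, ValueError):
        bad_rows.add(1)
        return None

# `df` is assumed to exist. Accumulators are only updated when an action
# runs, hence the count() call.
df.rdd.map(lambda row: parse_amount(row["amount"])).count()
print("rows that failed to parse:", bad_rows.value)
```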

2.6. Slow Aggregations

This is a fairly common problem and also one of the hardest to tackle. It usually happens because the workload is unevenly distributed across the cluster (data skew) or because of a hardware failure (one disk/VM is unresponsive).

Signs and symptoms:
– tasks triggered by a .groupBy() call are slow
– after the aggregation, jobs are still slow

Potential solutions:
– try changing the partitioning of the data so that each partition holds less data (see the sketch after this list)
– try changing the partition key on your dataset
– check that your SELECT statement actually benefits from the partitions (partition pruning)
– if you are using RDDs, convert to a DataFrame or Dataset to get the aggregations done faster
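A sketch of the first two suggestions, assuming a DataFrame `df` with a hypothetical aggregation key "customer_id":

```python
# Repartition by the aggregation key so each partition holds less data and
# the subsequent groupBy does not need to move rows between partitions.
df_repart = df.repartition(200, "customer_id")  # 200 is an illustrative value

agg = df_repart.groupBy("customer_id").count()
agg.explain()   # inspect the physical plan for the expected exchange/shuffle
```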

Tomorrow we will finish the series by looking into sources, documentation, and next learning steps, which should be a nice way to wrap it up.

The complete set of code and notebooks is available in the GitHub repository.

Happy Coding and Stay Healthy!
