Advent of 2023, Day 10 – Creating a Spark Job definition

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers.]

In this Microsoft Fabric series:

  1. Dec 01: What is Microsoft Fabric?
  2. Dec 02: Getting started with Microsoft Fabric
  3. Dec 03: What is lakehouse in Fabric?
  4. Dec 04: Delta lake and delta tables in Microsoft Fabric
  5. Dec 05: Getting data into lakehouse
  6. Dec 06: SQL Analytics endpoint
  7. Dec 07: SQL commands in SQL Analytics endpoint
  8. Dec 08: Using Lakehouse REST API
  9. Dec 09: Building custom environments

An Apache Spark job definition is a single computational action that is normally scheduled and triggered. In Microsoft Fabric (as in Synapse), you can submit batch or streaming jobs to Spark clusters.

By uploading a binary file, or libraries in any of the supported languages (Java/Scala, R, Python), you can run any kind of logic (transformation, cleaning, ingestion, ingress, …) against the data that is hosted and served in your lakehouse.

When creating a new Spark Job definition, you will get to the definition screen, where you upload the binary file(s).

My R script is just a toy example that reads a delta table and appends all of its records back to the same delta table. Important (!): the Spark context (or session) must be initialized in the code for the job definition to succeed (otherwise, the job fails). I am still not sure why the context must be set explicitly in the script.

sparkR.session(master = "", appName = "SparkR", sparkConfig = list())

# read the existing delta table
df_iris <- read.df("abfss://[email protected]/a574d1a3-xxxxxxxxx-7128f/Tables/iris_data",
         source = "delta")

# every run we append the whole delta table into itself
write.df(df_iris,
         source = "delta",
         path = "abfss://[email protected]/a574d1a3-xxxxxxxxx-7128f/Tables/iris_data",
         mode = "append")

Do not forget to assign a Lakehouse workspace to the job definition: go to Lakehouse Reference and add the preferred Lakehouse.

Once you upload the file, you can schedule the job:

You can always test the job by running it and checking the results:

You can also deep dive into each Job run to get some additional information.

To check whether the R code above ran successfully, I quickly opened a notebook and checked the number of rows: the original table had 150 rows, and we saw that additional rows had been appended to the delta table.
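As a sketch of that notebook check (the relative table path below is an assumption based on the example above; in a Fabric notebook the attached lakehouse exposes tables under `Tables/`), counting the rows with SparkR could look like this:

```r
library(SparkR)

# in a Fabric notebook a Spark session usually already exists;
# otherwise start one explicitly
sparkR.session(appName = "CheckRows")

# read the delta table back (hypothetical relative path to the lakehouse table)
df_check <- read.df("Tables/iris_data", source = "delta")

# the original iris table has 150 rows; each job run appends a full copy
# of the current table, so the count should roughly double per successful run
print(nrow(df_check))
```

This is only a verification sketch, not part of the job definition itself; it assumes the notebook is attached to the same lakehouse the job wrote to.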

Tomorrow we will look at exploring the data science part!

The complete set of code, documents, notebooks, and all of the materials will be available in the GitHub repository:

Happy Advent of 2023! 🙂
