Blog Archives

Using Spark from R for performance with arbitrary code – Part 5 – Exploring the invoke API from R with Java reflection and examining invokes with logs

November 23, 2019

Introduction: In the previous parts of this series, we have shown how to write functions as combinations of dplyr verbs and as SQL query generators that can be executed by Spark, and how to use the lower-level API to invoke methods on Java object references from R. In this fifth part, we will look in more detail at sparklyr’s invoke() API, investigate...


Using Spark from R for performance with arbitrary code – Part 4 – Using the lower-level invoke API to manipulate Spark’s Java objects from R

November 9, 2019

Introduction: In the previous parts of this series, we have shown how to write functions both as combinations of dplyr verbs and as SQL query generators that can be executed by Spark, how to execute them with DBI, and how to achieve lazy SQL statements that only get executed when needed. In this fourth part, we will look at how to write...


Using Spark from R for performance with arbitrary code – Part 3 – Using R to construct SQL queries and let Spark execute them

October 12, 2019

Introduction: In the previous part of this series, we looked at writing R functions that Spark can execute directly without serialization overhead, focusing on functions written as combinations of dplyr verbs, and investigated how the SQL is generated and how the Spark plans are created. In this third part, we will look at how to write R functions that generate...


Using Spark from R for performance with arbitrary code – Part 2 – Constructing functions by piping dplyr verbs

September 21, 2019

Introduction: In the first part of this series, we looked at how the sparklyr interface communicates with the Spark instance and what this means for performance with regard to arbitrarily defined R functions. We also examined how Apache Arrow can increase the performance of data transfers between the R session and the Spark instance. In this second part, we will look...


Using Spark from R for performance with arbitrary code – Part 1 – Spark SQL translation, custom functions, and Arrow

August 31, 2019

Introduction: Apache Spark is a popular open-source analytics engine for big data processing and, thanks to the sparklyr and SparkR packages, the power of Spark is also available to R users. This series of articles will attempt to provide practical insights into using the sparklyr interface to gain the benefits of Apache Spark while still retaining the ability to use R...



Using parallelization, multiple git repositories and setting permissions when automating R applications with Jenkins

August 10, 2019

Introduction: In the previous post, we focused on setting up declarative Jenkins pipelines with emphasis on parametrizing builds and using environment variables across pipeline stages. In this post, we look at various tips that can be useful when automating R application testing and continuous integration, with regard to orchestrating parallelization, combining sources from multiple git repositories and ensuring proper access right...


Using environment variables and parametrized builds for automating R applications with Jenkins

July 27, 2019

Introduction: Jenkins is a popular open-source tool that helps teams with automation and implementation of continuous integration and deployment pipelines, comparable to, for example, Atlassian’s Bamboo, GitLab CI or, to some extent, Travis. In this post, we share some practical lessons learned when integrating R applications via Jenkins for the purpose of continuous integration and regression testing on runner nodes configured...


How data.table’s fread can save you a lot of time and memory, and take input from shell commands

June 22, 2019

Introduction: Recently I was involved in a task that included reading and writing quite large amounts of data, totaling more than 1 TB worth of CSV files, without the standard big data infrastructure. After trying multiple approaches, the one that made this possible was using data.table’s reading and writing facilities, fread() and fwrite(). This motivated me to look at benchmarking data.table’s...


How to interactively examine any R code – 4 ways to not just read the code, but delve into it step-by-step

May 25, 2019

Introduction: As pointed out by a recent “read the R source” post on the R hub’s website, reading the actual code, not just the documentation, is a great way to learn more about programming and implementation details. But there is one more activity that gives even more hands-on experience and understanding of the code in practice. In this post, we provide...

