Articles by Jozef's Rblog

Optimizing partitioning for Apache Spark database loads via JDBC for performance

December 26, 2020 | Jozef's Rblog

Introduction Apache Spark is a popular open-source analytics engine for big data processing and thanks to the sparklyr and SparkR packages, the power of Spark is also available to R users. A very common task in working with Spark apart from using H...

A guide to retrieval and processing of data from relational database systems using Apache Spark and JDBC with R and sparklyr

August 15, 2020 | Jozef's Rblog

Introduction The {sparklyr} package lets us connect and use Apache Spark for high-performance, highly parallelized, and distributed computations. We can also use Spark’s capabilities to improve and streamline our data processing pipelines, as Spark supports reading and writing from many popular sources such as Parquet, Orc, etc. and most ...

A review of my experience with the Big Data Analysis with Scala and Spark course

July 25, 2020 | Jozef's Rblog

Introduction Apache Spark is an open-source distributed cluster-computing framework implemented in Scala that first came out in 2014 and has since become popular for many computing applications, including machine learning, thanks to, among other aspects, its user-friendly APIs. The popularity also gave rise to many online courses of varied quality. ...

Exploring and plotting positional ice hockey data on goals, penalties and more from R with the {nhlapi} package

July 4, 2020 | Jozef's Rblog

Introduction The National Hockey League (NHL) is considered to be the premier professional ice hockey league in the world, founded 102 years ago in 1917. Like many other sports, the data about teams, players, games, and more are a great resource to dive in and analyze using modern software tools. Thanks to ...

A review of my experience with the Functional Programming Principles in Scala course

June 13, 2020 | Jozef's Rblog

Introduction Functional programming is a programming paradigm where programs are constructed by applying and composing functions, and it is quite popular in data science applications because of some of its useful properties that can help for example...

Automating R package checks across platforms with GitHub Actions and Docker in a portable way

April 18, 2020 | Jozef's Rblog

Introduction Automating the execution, testing and deployment of R code is a powerful tool to ensure the reproducibility, quality and overall robustness of the code that we are building. A relatively recent feature in GitHub - GitHub actions - allows us to do just that without using additional tools such ...

Setting up R with Visual Studio Code quickly and easily with the languageserversetup package

March 21, 2020 | Jozef's Rblog

Introduction Over the past years, R has been gaining popularity, bringing to life new tools to work with it. Thanks to the amazing work by contributors implementing the Language Server Protocol for R and writing Visual Studio Code Extensions for R, the most popular development environment amongst developers across the ...

R is turning 20 years old next Saturday. Here is how much bigger, stronger and faster it got over the years

February 22, 2020 | Jozef's Rblog

Introduction It is almost the 29th of February 2020! A day that is very interesting for R, because it marks 20 years from the release of R v1.0.0, the first official public release of the R programming language. In this post, we will look back on the 20 years of R with a ...

Releasing and open-sourcing the Using Spark from R for performance with arbitrary code series

January 4, 2020 | Jozef's Rblog

Introduction Over the past months, we published and refined a series of posts on Using Spark from R for performance with arbitrary code. Since the posts have grown in size and scope, blog posts were no longer the best medium to share the content ...

4 great free tools that can make your R work more efficient, reproducible and robust

December 21, 2019 | Jozef's Rblog

Introduction It is Christmas time again! And just like last year, what better time than this to write about the great tools that are available to all interested in working with R. This post is meant as a praise to a few selected tools and packages that helped me to ...

Using Spark from R for performance with arbitrary code – Part 5 – Exploring the invoke API from R with Java reflection and examining invokes with logs

November 23, 2019 | Jozef's Rblog

Introduction In the previous parts of this series, we have shown how to write functions as both combinations of dplyr verbs and SQL query generators that can be executed by Spark, and how to use the lower-level API to invoke methods on Java object references from R. In this fifth part, ...

Using Spark from R for performance with arbitrary code – Part 4 – Using the lower-level invoke API to manipulate Spark’s Java objects from R

November 9, 2019 | Jozef's Rblog

Introduction In the previous parts of this series, we have shown how to write functions as both combinations of dplyr verbs and SQL query generators that can be executed by Spark, how to execute them with DBI and how to achieve lazy SQL statements that only get executed when needed. ...

Using Spark from R for performance with arbitrary code – Part 3 – Using R to construct SQL queries and let Spark execute them

October 12, 2019 | Jozef's Rblog

Introduction In the previous part of this series, we looked at writing R functions that can be executed directly by Spark without serialization overhead with a focus on writing functions as combinations of dplyr verbs and investigated how the SQL is generated and Spark plans created. In this third part, ...

Using Spark from R for performance with arbitrary code – Part 2 – Constructing functions by piping dplyr verbs

September 21, 2019 | Jozef's Rblog

Introduction In the first part of this series, we looked at how the sparklyr interface communicates with the Spark instance and what this means for performance with regards to arbitrarily defined R functions. We also examined how Apache Arrow can increase the performance of data transfers between the R session ...

Using Spark from R for performance with arbitrary code – Part 1 – Spark SQL translation, custom functions, and Arrow

August 31, 2019 | Jozef's Rblog

Introduction Apache Spark is a popular open-source analytics engine for big data processing and thanks to the sparklyr and SparkR packages, the power of Spark is also available to R users. This series of articles will attempt to provide practical insights into using the sparklyr interface to gain the benefits ...

Posts

August 10, 2019 | Jozef's Rblog

[This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers].

Using parallelization, multiple git repositories and setting permissions when automating R applications with Jenkins

August 10, 2019 | Jozef's Rblog

Introduction In the previous post, we focused on setting up declarative Jenkins pipelines with emphasis on parametrizing builds and using environment variables across pipeline stages. In this post, we look at various tips that can be useful when automating R application testing and continuous integration, with regards to orchestrating parallelization, ...

Using environment variables and parametrized builds for automating R applications with Jenkins

July 27, 2019 | Jozef's Rblog

Introduction Jenkins is a popular open-source tool that helps teams with automation and implementation of continuous integration and deployment pipelines, comparable to, for example, Atlassian’s Bamboo, GitLab CI or, to some extent, Travis. In this post, we share some practical lessons learned when integrating R applications via Jenkins for ...

How data.table’s fread can save you a lot of time and memory, and take input from shell commands

June 22, 2019 | Jozef's Rblog

Introduction Recently I was involved in a task that included reading and writing quite large amounts of data, totaling more than 1 TB worth of csvs without the standard big data infrastructure. After trying multiple approaches, the one that made this possible was using data.table’s reading and writing facilities ...

How to interactively examine any R code – 4 ways to not just read the code, but delve into it step-by-step

May 25, 2019 | Jozef's Rblog

Introduction As pointed out by a recent read the R source post on the R hub’s website, reading the actual code, not just the documentation is a great way to learn more about programming and implementation details. But there is one more activity to get even more hands-on experience ...
