Optimising your R code – a guided example


Optimising your R code is not always the priority. But when you run out of memory, or it just takes too long, you start to wonder if there are better ways to do things! In this blog post, I will show you how I optimise my R code and the process behind it. As an example, I will use the code from my last blog post about gas prices in Germany – if you haven’t read it yet, click here!

Optimising your code involves several steps: from debugging and profiling, through benchmarking, to rethinking the whole approach. You repeat these steps until you are satisfied with the result.

A short recap of the initial code

I had around 1700 .csv files, each containing the price changes of one day for all German gas stations from 2014 up to 2019. My plan was to analyse the price patterns, but first I had to prepare and load all the data. Normally, I would just load all the files, combine them into one data.table(), do some data preparation steps and then start my analysis. This was not possible here, since the raw data was around 20 GB in size and I wanted to rearrange it into half-hour prices. So, I settled for the solution of reading the files one after another, preparing and aggregating each one at a time, and combining them at the end. There were eight steps inside my loop:

# pseudo code
for (i.file in price_files) {
    # load data
    # remove wrong prices (e.g. -0.001, 0 or 8.888)
    # set TIME and DATE
    # build a time grid including opening times
    # add last day price and save next day price
    # NA handling - last observation carried forward (locf)
    # adding brand, motorway and post code info
    # calculating mean and quantiles
}
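In actual code, the skeleton of that loop looks roughly like the sketch below. It is only a sketch: price_files and prepare_prices() are placeholders for the real file list and the eight preparation steps listed above.

library(data.table)

# sketch of the loop skeleton; price_files and prepare_prices() are
# placeholders for the real file list and the eight steps listed above
result_list <- vector("list", length(price_files))

for (i.file in seq_along(price_files)) {
  price_dt <- fread(price_files[i.file])             # load one daily file
  result_list[[i.file]] <- prepare_prices(price_dt)  # steps 2-8, small aggregated output
}

price_agg_dt <- rbindlist(result_list)               # combine the aggregates at the end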

Let’s see where I can optimise my steps!

Identifying the bottlenecks

Since I want to optimise a loop, I can just concentrate on a few iterations instead of the whole loop. Also, I do not need to include every station for this test. Therefore, I subset my data to stations which are in the area around Frankfurt am Main.

stations_dt <- stations_dt[post_code >= "60000" & post_code < "66000",] 

Note: Yes, you can filter strings like this – they are sorted alphabetically. Also, they are characters and not numbers since there are cases like “01896”.
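A quick illustration of how the lexicographic comparison behaves (the second post code is just an example inside the selected range):

# character comparison is lexicographic, so leading zeros are preserved
"01896" >= "60000"                      # FALSE - stays outside the subset
"60311" >= "60000" & "60311" < "66000"  # TRUE  - falls inside the subset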

With this small test data, I will continue my optimisation process. To get an overview of bottlenecks I use the package profvis. Since RStudio version 1.0, it is quite easy to use. I just select the lines of code with my for-loop and click Profile > Profile Selected Line(s) to get a report with the time and memory consumption in a new tab.
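If you prefer to call profvis programmatically instead of using the menu, you can wrap the code in profvis() – a minimal sketch, assuming price_files and the loop body from above are available in the workspace:

library(profvis)

# profile only a few iterations instead of the whole loop;
# price_files and the loop body are assumed to exist in the workspace
profvis({
  for (i.file in head(price_files, 3)) {
    # load data, clean prices, build time grid, aggregate, ...
  }
})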

To get a nicer overview, I only keep the lines with a non-zero value. My main goal is to reduce the time consumption, so I focus on the largest chunk of time first. There is one major time consumer: stat_fun(). It takes around 75% of the total 30000ms. So, what is this function of mine?

a deeper look at stat_fun()

This function returns a data.table() with the mean, the 10% and 90% quantile and the number of observations. I use stat_fun on each group defined by this_by <- c("DATE", "TIME", "AUTOBAHN", "BRAND", "plz").

# get mean and quantiles 0.1 and 0.9
stat_fun <- function(x, na.rm = TRUE) {
  # x <- price_time_dt[, .SD, .SDcols = c("diesel", "e10")]
  x_quant1 <- x[, lapply(.SD, quantile, probs = c(0.1), na.rm = na.rm),
                .SDcols = names(x)]
  x_quant9 <- x[, lapply(.SD, quantile, probs = c(0.9), na.rm = na.rm),
                .SDcols = names(x)]
  x_mean <- x[, lapply(.SD, mean, na.rm = na.rm), .SDcols = names(x)]
  x_obs <- x[, lapply(.SD, function(y) sum(!is.na(y))), .SDcols = names(x)]

  setnames(x_quant1, paste0(names(x), "_Q10"))
  setnames(x_quant9, paste0(names(x), "_Q90"))
  setnames(x_mean, paste0(names(x), "_MEAN"))
  setnames(x_obs, paste0(names(x), "_OBS"))

  out <- data.table(x_quant1, x_quant9, x_mean, x_obs)
  return(out)
}

price_time_dt2 <- price_time_dt[, stat_fun(.SD), .SDcols = price_vars, by = this_by]

Why did I build it like this? My goal was to have one function, which returns all the values I need for my analysis later: the mean, the quantiles and the number of observations. Let’s check how we can optimise this!

First, I look into the function with profvis again. Profiling the call itself shows little more than that it takes 7900ms for one iteration, so I profile the individual lines inside stat_fun instead. To do so, I set the inputs manually and then select the lines within the function for profiling.

x <- price_time_dt[, .SD, .SDcols = c("diesel", "e10")]
na.rm <- TRUE

The result is quite surprising: hardly any time is used, even though I ran it on the whole data set instead of a single group.

How many times does this function get called? Well, once for each combination of the by key. For the first file, that is around 1500 times. I will now check whether mean() and quantile() take a lot of time when they are applied to each group.

  x_mean <- price_time_dt[, lapply(.SD, mean, na.rm = TRUE),
                          .SDcols = price_vars,
                          by = this_by]
  x_quant1 <- price_time_dt[, lapply(.SD, quantile,
                                     probs = 0.1,
                                     na.rm = TRUE),
                            .SDcols = price_vars,
                            by = this_by]
  x_quant9 <- price_time_dt[, lapply(.SD, quantile,
                                     probs = 0.9,
                                     na.rm = TRUE),
                            .SDcols = price_vars,
                            by = this_by]

The mean() function is fast and not a problem, but quantile() takes quite a long time. I could combine the calculation of the 10% and 90% quantiles into one call, but then I would have to reshape the resulting table into my desired output – if I cannot find any other tweaks, I will come back to this approach.

While reading the help pages for the quantile() function, I stumble upon the argument names:

names: logical; if true, the result has a names attribute. Set to FALSE for speedup with many probs.

I am not using many probs, but this still improves the time by around 700ms.
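A quick way to check the effect yourself is a small microbenchmark on a random vector (the exact numbers will of course vary):

library(microbenchmark)

x <- rnorm(1e4)

microbenchmark(
  "names = TRUE"  = quantile(x, probs = 0.1, na.rm = TRUE),
  "names = FALSE" = quantile(x, probs = 0.1, na.rm = TRUE, names = FALSE),
  times = 100L, unit = "us")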

This is not enough, so I will have to test the other approach!

implementing stat_fun() as part of the main code

The new approach first creates separate data.table()s for the mean, the quantiles and the number of observations and then merges them together by the grouping variables. This way, I can combine the two quantile calculations into one call, but I have to add a merging step.

# calculating the mean
x_mean <- price_time_dt[, lapply(.SD, mean, na.rm = TRUE),
                                .SDcols = price_vars,
                                by = this_by]
# calculating the quantiles
x_quant <- price_time_dt[, lapply(.SD, quantile, probs = c(0.1, 0.9),
                                  na.rm = TRUE, names = FALSE),
                         .SDcols = price_vars,
                         by = this_by]
x_quant[, QUANT := c("Q10", "Q90")]
eq_quant <- paste0(paste0(this_by, collapse = "+"), "~QUANT")
x_quant <- dcast(data = x_quant, formula = eq_quant, value.var = price_vars)
setkeyv(x_quant, NULL)

# number of observations
x_obs <- price_time_dt[, lapply(.SD, function(y) sum(!is.na(y))),
                          .SDcols = price_vars,
                          by = this_by]

setnames(x_mean, price_vars, paste0(price_vars, "_MEAN"))
setnames(x_obs, price_vars, paste0(price_vars, "_OBS"))

setkeyv(price_time_dt, this_by)
price_time_dt2 <- Reduce(merge, list(x_mean, x_quant, x_obs))

As it turns out, merging is fast as well – I would not have thought so! Instead of nearly 30s, it only needs 12s now! With the biggest issue out of the way, there seems to be some more potential in other lines as well. Next up are the time formatting and the handling of missing values with the last observation carried forward (locf).

last observations slowly carried forward

price_time_dt <- price_time_dt[order(station_uuid, DATE, TIME)]

price_time_dt[, c(price_vars) := lapply(.SD, zoo::na.locf, na.rm = FALSE),
              .SDcols = price_vars,
              by = station_uuid] 

Sometimes it is hard to reinvent the wheel – so I search the vast knowledge of the internet and come across an article that describes exactly my problem: a faster locf method with grouping.

price_time_dt <- price_time_dt[order(station_uuid, DATE, TIME)]
# TRUE whenever a new station starts
id_change <- price_time_dt[, c(TRUE, station_uuid[-1] != station_uuid[-.N])]

# index of the last non-NA price within each station, carried forward
price_time_dt[, c(price_vars) :=
  lapply(.SD, function(x) x[cummax(((!is.na(x)) | id_change) * .I)]),
  .SDcols = price_vars]

To understand this method, I made the following table with a toy example. The last column is the index that is used to replace missing values with the last observation. Because this method is vectorised and works without a by group, it is much faster.

 .I   id   id_change    x   !is.na(x)   (!is.na(x) | id_change) * .I   cummax(…)
  1    1        TRUE    4        TRUE                      1 * 1 = 1           1
  2    1       FALSE   NA       FALSE                      0 * 2 = 0           1
  3    1       FALSE   NA       FALSE                      0 * 3 = 0           1
  4    2        TRUE    9        TRUE                      1 * 4 = 4           4
  5    2       FALSE   NA       FALSE                      0 * 5 = 0           4
  6    2       FALSE    5        TRUE                      1 * 6 = 6           6
  7    2       FALSE   NA       FALSE                      0 * 7 = 0           6
  8    3        TRUE   NA       FALSE                      1 * 8 = 8           8
  9    3       FALSE   17        TRUE                      1 * 9 = 9           9
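For reference, here is a small self-contained sketch that reproduces the table above with a toy data.table (the columns id and x are made up for illustration):

library(data.table)

# toy data matching the table above
dt <- data.table(id = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                 x  = c(4, NA, NA, 9, NA, 5, NA, NA, 17))

# TRUE whenever a new id starts
id_change <- dt[, c(TRUE, id[-1] != id[-.N])]

# index of the last non-NA value within each id, carried forward
dt[, x_locf := x[cummax(((!is.na(x)) | id_change) * .I)]]
dt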

even formatting time is relative

The raw data contains a column with the timestamp of a price change. There are multiple options for transforming this string into date and time variables:

  • base R with as.POSIXct()
  • lubridate with ymd_hms()
  • data.table with as.IDate() and as.ITime()

There are two things I want from the time transformation. First, for my analysis of aggregation over the hours of the day, I want the date and the time in separate columns. Second, I want to round the time to a given number of minutes (e.g. to half hours or full hours).

Since I am a data.table person and think that fewer packages in a project are a good thing, my first choices for these tasks are as.IDate() and as.ITime(). But lubridate – the go-to package when it comes to dates and times – might be a good candidate as well, for example with its round_date() function for rounding to a specific minute. Time to do some benchmarking!

my benchmark setup

To test the different functions I will use one of the files, reduced to the date, DAY and TIME columns:

                  date        DAY     TIME
1: 2014-06-08 09:50:01 2014-06-08 09:50:01
2: 2014-06-08 09:50:01 2014-06-08 09:50:01
3: 2014-06-08 09:50:01 2014-06-08 09:50:01
4: 2014-06-08 09:50:01 2014-06-08 09:50:01
5: 2014-06-08 09:50:01 2014-06-08 09:50:01

splitting the format into date and time

microbenchmark(
  # lubridate
  "lubridate::ymd()" = DT[, DAY_ymd := ymd(DAY)],
  "lubridate::hms()" = DT[, TIME_hms := as.numeric(hms(TIME))],
  # data.table
  "data.table::as.IDate()" = DT[, DATE_IDate := as.IDate(date)],
  "data.table::as.ITime()" = DT[, DATE_ITime := as.ITime(date)],
  # settings
  times = 100L, unit = "ms")
Unit: milliseconds
                 expr        min        lq      mean    median        uq       max neval
     lubridate::ymd()   7.210735  10.18305  12.62154  11.96969  14.31220  27.62012   100
     lubridate::hms()   9.611753  10.26598  11.08295  10.60343  11.38001  16.48317   100
 data.table::as.IDate  74.908261  75.91852  78.98697  77.86806  80.62390  91.97302   100
 data.table::as.ITime 298.047305 306.15576 314.78720 309.89135 315.41061 466.35973   100

Well, lubridate is the fastest! But the output of hms() cannot simply be stored in a data.table column, because it is an object of the lubridate class Period. Therefore, I will use this little workaround:

price_dt[, c("DATE", "TIME") := tstrsplit(date, " ")]
price_dt[, DATE := lubridate::ymd(DATE)]
price_dt[, TIME := as.ITime(as.numeric(lubridate::hms(TIME)))]

With this, I can continue to use the as.ITime() function and its format. Also, using as.ITime() on a numeric is much faster than on the original date.
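You can see the difference between parsing a character string and converting a numeric value with a small benchmark (a rough sketch; the exact timings will vary):

library(data.table)
library(microbenchmark)

chr_times <- rep("09:50:01", 1e4)              # time as character
num_secs  <- rep(9 * 3600 + 50 * 60 + 1, 1e4)  # same time as seconds since midnight

microbenchmark(
  "as.ITime(character)" = as.ITime(chr_times),
  "as.ITime(numeric)"   = as.ITime(num_secs),
  times = 10L, unit = "ms")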

rounding to the minutes

For the second task, I will compare lubridate::round_date() and plyr::round_any().

minute <- 30
microbenchmark(
  # rounding lubridate
  "lubridate::round_date()" = 
    DT[, ROUND_TIME_lb := lubridate::round_date(DATE_ymd_hms, paste0(minute, " mins"))],
  # rounding plyr
  "plyr::round_any()" =
    DT[, ROUND_TIME_pl := as.ITime(plyr::round_any(as.numeric(DATE_ITime), minute * 60))],
  # settings
  times = 100L, unit = "ms")
Unit: milliseconds
                    expr       min        lq     mean    median        uq      max neval
 lubridate::round_date() 24.075546 25.582259 27.27308 26.944734 28.120435 39.62532   100
       plyr::round_any()  1.351571  1.533044  1.97845  1.727221  2.249484  4.37735   100

On the one hand, round_date() has an easy-to-understand input format; on the other hand, round_any() is much faster since it works directly on numeric values.
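The speed advantage comes from round_any() being plain arithmetic on the seconds; the same rounding could also be done in base R – a sketch, where ROUND_TIME_base is just a made-up column name:

# round seconds-since-midnight to the nearest half hour using base R only
minute <- 30
DT[, ROUND_TIME_base := as.ITime(round(as.numeric(DATE_ITime) / (minute * 60)) * (minute * 60))]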

With this setup, I only use lubridate and plyr in one part of my code and data.table for the rest – that makes my inner self.data.table happy, and it is still quite fast.

With all the changes I have made, I am curious how much the speed has improved.

fixing the bottlenecks

That is quite an improvement! The code now only takes 4330ms. There are probably even more enhancements possible, but I think that is enough of a speed boost.

to sum it all up

There are a lot of ways you can improve and optimise your code. Some changes are quite small, whilst others demand a change in the way you think about a problem. There are also a lot of packages out there which are optimised for a specific job (e.g. lubridate). So the question often is not “Can this be improved?” but rather “Which package is better at this job?”.

To be honest, it has taken me quite a while to finish this post. While finding and fixing the bottlenecks, I also found myself changing parts of the data preparation and adding new error handling for the data. So the initial code might not be exactly the same as the final one, which you can inspect yourself on our GitHub.

I still hope this post helps you improve your own code, or at least gives you some ideas on how to tackle your own optimisation problems.

About the author
Jakob Gepp

Numbers have always been my passion, and as a data scientist and statistician at STATWORX I can fulfil my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!

ABOUT US


STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI.
