Efficient R: Performant data.frame constructors

[This article was first published on shikokuchuo{net}, and kindly contributed to R-bloggers].

About as.data.frame

data.frame() and as.data.frame() are such ubiquitous functions that we rarely think twice about using them to create dataframes or to convert other objects to dataframes.

However, they are slow. Extremely slow.

This is somewhat surprising: the ‘data.frame’ object is the de facto standard for tabular data in R and these functions are used everywhere, yet the constructors are this inefficient.

However, this is the direct result of a large amount of error checking and validation code, which is perhaps understandable for something so widely used. You simply do not know what is going to be thrown at the function, so it needs to do its best or fail gracefully.

Below, we demonstrate the inefficiencies of as.data.frame() versus efficient ‘data.frame’ constructors from the ‘ichimoku’ package coded for performance.

For benchmarking, the ‘microbenchmark’ package will be used. It is usual to compare the median time over a large number of runs, and we will use 1,000 runs in the cases below.
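
As a rough illustration of why the median is the statistic usually compared, consider a set of timings containing a single slow run. The values here are made up for illustration and are not measurements; they only mimic the shape of the benchmark tables, where the maximum is far above the typical run.

```r
# Illustrative timings in microseconds: 999 typical runs and one slow outlier.
# These numbers are assumptions for the example, not benchmark results.
timings <- c(rep(16, 999), 6300)

mean(timings)    # pulled upwards by the single outlier
median(timings)  # unaffected by it
```

The mean moves with every outlier, whereas the median reflects what a typical call costs.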

Matrix conversion benchmarking

A 100×10 matrix of random data drawn from the normal distribution is created as the object ‘matrix’.

This will be converted into a dataframe using as.data.frame() and ichimoku::matrix_df().

library(ichimoku)
library(microbenchmark)

matrix <- matrix(rnorm(1000), ncol = 10, dimnames = list(1:100, letters[1:10]))

dim(matrix)
[1] 100  10
head(matrix)
           a          b          c          d          e          f
1  1.0470296 -0.6531076  0.8278910  0.9708001 -0.1014626  1.2253514
2 -0.1436921  0.5482620  1.3607562  0.8354925  0.7415475 -0.1541012
3 -0.2369179  1.0897400 -0.8158241  0.2736871 -0.1851880 -1.0202761
4 -0.1883866 -0.4844175 -0.3421133  0.8321749  0.5960344  0.4411143
5 -0.2062340  0.9212781 -0.3687319 -0.2210680 -0.9493628  0.2689948
6 -1.2267639  0.7466243 -0.1845343  1.3502588 -1.1756389 -1.2925598
           g          h           i          j
1  1.2828614  0.4024055 -0.04694549 -1.1447872
2 -0.6869244 -0.2542681  0.48761441 -0.6505677
3 -0.5479265 -0.5446966  0.09914298  0.8869836
4 -0.4678586  0.9396598 -0.89564969  1.0552123
5 -0.6126510  0.4527644 -1.43793557  0.5074292
6  1.0173340  0.2888818  0.09522833 -0.1836863
microbenchmark(as.data.frame(matrix), matrix_df(matrix), times = 1000)
Unit: microseconds
                  expr    min      lq     mean  median      uq
 as.data.frame(matrix) 31.148 32.6130 34.37202 33.2355 34.3305
     matrix_df(matrix) 14.508 15.6355 22.78279 16.0575 16.8175
      max neval
  410.675  1000
 6297.590  1000
identical(as.data.frame(matrix), matrix_df(matrix))
[1] TRUE
all.equal(as.data.frame(matrix), matrix_df(matrix))
[1] TRUE

As can be seen, the outputs are identical, but ichimoku::matrix_df(), which is designed to be a performant ‘data.frame’ constructor, is around twice as fast.
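
To give a feel for where the saving comes from, here is a minimal base R sketch of a matrix-to-dataframe constructor that sets the attributes directly. The function name fast_matrix_df() is hypothetical and this is not the ‘ichimoku’ source; it simply assumes a well-formed atomic matrix and skips the checks that as.data.frame() would perform.

```r
# Hypothetical helper (not the 'ichimoku' implementation): builds a data.frame
# from a matrix by setting attributes directly, with no input validation.
fast_matrix_df <- function(x) {
  dn <- dimnames(x)
  n_row <- nrow(x)
  n_col <- ncol(x)
  # One plain vector per column; as.vector() drops the element names that
  # x[, i] would otherwise inherit from the matrix row names.
  cols <- vector("list", n_col)
  for (i in seq_len(n_col)) cols[[i]] <- as.vector(x[, i])
  # Assumption: fall back to V1, V2, ... and 1..n when dimnames are absent.
  col_names <- if (is.null(dn[[2L]])) paste0("V", seq_len(n_col)) else dn[[2L]]
  row_names <- if (is.null(dn[[1L]])) seq_len(n_row) else dn[[1L]]
  structure(cols, names = col_names, class = "data.frame", row.names = row_names)
}

m <- matrix(rnorm(20), ncol = 4, dimnames = list(1:5, letters[1:4]))
df <- fast_matrix_df(m)
all.equal(df, as.data.frame(m))
```

Because nothing is validated, passing anything other than a matrix to a function like this would fail or produce a malformed object, which is the trade-off discussed later in this article.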

xts conversion benchmarking

The ‘xts’ format is a popular choice for large time series data as each observation is indexed by a unique valid timestamp.

As an example, we use the ichimoku() function from the ‘ichimoku’ package which creates ichimoku objects inheriting the ‘xts’ class. We run ichimoku() on the sample data contained within the package to create an ‘xts’ object ‘cloud’.

This will be converted into a dataframe using as.data.frame() and ichimoku::xts_df().

library(ichimoku)
library(microbenchmark)

cloud <- ichimoku(sample_ohlc_data)

xts::is.xts(cloud)
[1] TRUE
dim(cloud)
[1] 260  12
print(cloud[1:6], plot = FALSE)
            open  high   low close cd tenkan kijun senkouA senkouB
2020-01-02 123.0 123.1 122.5 122.7 -1     NA    NA      NA      NA
2020-01-03 122.7 122.8 122.6 122.8  1     NA    NA      NA      NA
2020-01-05 122.8 123.4 122.4 123.3  1     NA    NA      NA      NA
2020-01-06 123.3 124.3 123.3 124.1  1     NA    NA      NA      NA
2020-01-07 124.1 124.8 124.0 124.8  1     NA    NA      NA      NA
2020-01-08 124.8 125.4 124.5 125.3  1     NA    NA      NA      NA
           chikou cloudTop cloudBase
2020-01-02  122.9       NA        NA
2020-01-03  123.0       NA        NA
2020-01-05  123.9       NA        NA
2020-01-06  123.6       NA        NA
2020-01-07  122.5       NA        NA
2020-01-08  122.6       NA        NA
microbenchmark(as.data.frame(cloud), xts_df(cloud), times = 1000)
Unit: microseconds
                 expr     min       lq      mean   median       uq
 as.data.frame(cloud) 230.269 236.3060 252.35890 240.4095 246.1955
        xts_df(cloud)  33.862  36.9205  45.36266  38.6870  40.7065
      max neval
 6871.517  1000
 5703.421  1000

It can be seen that ichimoku::xts_df(), which is designed to be a performant ‘data.frame’ constructor, is over 6x as fast.

df1 <- as.data.frame(cloud)

is.data.frame(df1)
[1] TRUE
str(df1)
'data.frame':   260 obs. of  12 variables:
 $ open     : num  123 123 123 123 124 ...
 $ high     : num  123 123 123 124 125 ...
 $ low      : num  122 123 122 123 124 ...
 $ close    : num  123 123 123 124 125 ...
 $ cd       : num  -1 1 1 1 1 1 -1 0 -1 -1 ...
 $ tenkan   : num  NA NA NA NA NA ...
 $ kijun    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouA  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouB  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ chikou   : num  123 123 124 124 122 ...
 $ cloudTop : num  NA NA NA NA NA NA NA NA NA NA ...
 $ cloudBase: num  NA NA NA NA NA NA NA NA NA NA ...
df2 <- xts_df(cloud)

is.data.frame(df2)
[1] TRUE
str(df2)
'data.frame':   260 obs. of  13 variables:
 $ index    : POSIXct, format: "2020-01-02 00:00:00" ...
 $ open     : num  123 123 123 123 124 ...
 $ high     : num  123 123 123 124 125 ...
 $ low      : num  122 123 122 123 124 ...
 $ close    : num  123 123 123 124 125 ...
 $ cd       : num  -1 1 1 1 1 1 -1 0 -1 -1 ...
 $ tenkan   : num  NA NA NA NA NA ...
 $ kijun    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouA  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ senkouB  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ chikou   : num  123 123 124 124 122 ...
 $ cloudTop : num  NA NA NA NA NA NA NA NA NA NA ...
 $ cloudBase: num  NA NA NA NA NA NA NA NA NA NA ...

The outputs are slightly different as xts_df() preserves the date-time index of ‘xts’ objects as a new first column ‘index’ which is POSIXct in format. The default as.data.frame() constructor converts the index into the row names, which is not desirable as the dates are coerced to type ‘character’.

So it can be seen that in this case, not only is the performant constructor faster, it is also more fit for purpose.
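
The difference between the two treatments of the index can be reproduced in base R without ‘xts’. In this sketch the vectors idx and close are made-up stand-ins for a time index and one column of data; it shows only the type of the resulting index, not how xts_df() is implemented.

```r
# Illustrative stand-ins for a time index and one data column (assumptions).
idx <- as.POSIXct(c("2020-01-02", "2020-01-03", "2020-01-05"), tz = "UTC")
close <- c(122.7, 122.8, 123.3)

# Index kept as a regular first column: it remains POSIXct.
with_column <- structure(list(index = idx, close = close),
                         class = "data.frame",
                         row.names = seq_along(idx))

# Index assigned as row names: the timestamps are coerced to character.
with_rownames <- data.frame(close = close)
row.names(with_rownames) <- idx

class(with_column$index)         # "POSIXct" "POSIXt"
class(row.names(with_rownames))  # "character"
```

With the index as a POSIXct column, date-time operations such as diff() or range filtering work directly; with character row names the values would first have to be parsed back into date-times.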

When to use performant constructors

  1. Data which is not already a ‘data.frame’ object being plotted using ‘ggplot2’. For example, if you have time series data in the ‘xts’ format, calling a ‘ggplot2’ plot method automatically converts the data into a dataframe, as ggplot() only works with dataframes internally. Fortunately it does not use as.data.frame() but its own constructor ggplot2::fortify(). Benchmarked below, fortify() is faster than as.data.frame() (about 1.6x by median time), but the performant constructor ichimoku::xts_df() is still almost 4x as fast as fortify().

microbenchmark(as.data.frame(cloud), ggplot2::fortify(cloud), xts_df(cloud), times = 1000)
Unit: microseconds
                    expr     min       lq      mean   median       uq
    as.data.frame(cloud) 231.683 240.5755 260.94003 246.4725 253.3840
 ggplot2::fortify(cloud) 132.811 145.2860 170.82074 153.1985 162.7935
           xts_df(cloud)  34.382  38.1695  41.71392  40.5490  42.7240
      max neval
 5246.692  1000
 4828.824  1000
  381.869  1000

  2. In a context where performance is critical. This is usually in interactive environments such as a Shiny app, perhaps with real-time data, where slow code can reduce responsiveness or cause bottlenecks in execution.

  3. Within packages. It is usually safe to use performant constructors within functions or for internal unexported functions. If programming best practices are followed, the input and output types of functions are kept consistent, so the input to the constructor is controlled and its behaviour predictable. Setting appropriate unit tests can also catch any issues early.

When to question the use of performant constructors

  1. For user-facing functions. Having no validation or error-checking code means that a performant constructor may behave unpredictably on data that is not intended to be an input. Within a function, a constructor can only receive a specific, or at most finite, range of objects. Once that limit is removed, an error can be expected whenever the input is not what the constructor was written for. As long as this is made clear to the user, there are adequate instructions on proper usage, and the occasional error message is acceptable, then proceed to use the performant constructor.

  2. When the constructor needs to handle a range of input types. as.data.frame() is actually an S3 generic with a variety of methods for different object classes. If required to handle a variety of different types of input, it may be easier (even if less performant) to rely on as.data.frame() rather than write code which handles the different scenarios.
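
One way to combine the two approaches is to keep the unchecked constructor internal and expose a wrapper that performs a few cheap checks, falling back to as.data.frame() for anything else. The following is only a sketch under that assumption; the names .fast_df() and to_df() are hypothetical and do not come from any package.

```r
# Hypothetical internal constructor: no checks. Assumes a named list of
# equal-length atomic vectors.
.fast_df <- function(cols) {
  structure(cols, class = "data.frame", row.names = seq_along(cols[[1L]]))
}

# Hypothetical user-facing wrapper: cheap validation, then choose a path.
to_df <- function(x) {
  fast_ok <- is.list(x) && !is.data.frame(x) && length(x) > 0L &&
    !is.null(names(x)) &&
    all(vapply(x, is.atomic, logical(1))) &&
    length(unique(lengths(x))) == 1L
  if (fast_ok) return(.fast_df(x))
  # Everything else goes to the S3 generic, which dispatches on class.
  as.data.frame(x)
}

to_df(list(a = 1:3, b = c("x", "y", "z")))  # takes the fast path
to_df(matrix(1:6, ncol = 2))                # falls back to as.data.frame()
```

The checks cost something, so a wrapper like this sits between the two extremes: slower than the bare constructor, but it does not hand malformed input to it.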

What is a performant constructor

First of all, it is possible to directly use the functions matrix_df() and xts_df() which are exported from the ‘ichimoku’ package. Given the nature of the R ecosystem, this is indeed encouraged.

However, having seen the advantages of using a performant constructor above, we can now turn to the ‘what’ for the curious.

What lies behind those functions? Some variation of the below:

# structure() is used to set the 'class' and other attributes on an object
# vec1, vec2, vec3 are equal-length vectors; the list names become the column names
structure(list(a = vec1, b = vec2, c = vec3),
          class = "data.frame",
          row.names = seq_along(vec1))

  1. A data.frame is simply a list (where each element must be the same length).
  2. It has an attribute ‘class’ which equals ‘data.frame’.
  3. It must have row names, which are usually just an integer sequence.

Note:

  1. The vectors in the list (vec1, vec2, vec3, etc.) must be the same length, otherwise a corrupt data.frame warning will be generated.
  2. If row names are missing then the data will still be present but dim() will show a 0-row dataframe and its print method will not work.
  3. Row names are not limited to an integer sequence. They can be dates, for example. However, if dates are set as row names, they are first coerced to type ‘character’.
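
The notes above can be checked with a few lines of base R. This is a small sketch using made-up vectors; the length check in the first step is just one possible guard, not something the constructor does for you.

```r
# 1. Unequal lengths: a cheap check to run before building the object.
cols <- list(a = 1:3, b = 1:2)
same_length <- length(unique(lengths(cols))) == 1L
same_length          # FALSE, so this list should not be turned into a data.frame

# 2. Missing row names: the data is still there, but dim() reports 0 rows.
no_rn <- structure(list(a = 1:3, b = 4:6), class = "data.frame")
dim(no_rn)           # 0 2
length(no_rn$a)      # 3

# 3. Dates assigned as row names are coerced to character.
df <- structure(list(a = 1:3), class = "data.frame", row.names = 1:3)
row.names(df) <- as.Date(c("2020-01-02", "2020-01-03", "2020-01-05"))
class(row.names(df)) # "character"
```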

In conclusion, dataframes are not complicated structures but essentially lists with a couple of constraints. Indeed you can see that the underlying data type of a dataframe is just a list:

class(df1)
[1] "data.frame"
typeof(df1)
[1] "list"
class(df2)
[1] "data.frame"
typeof(df2)
[1] "list"
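
A practical consequence, sketched below with a small made-up dataframe, is that ordinary list tools work on dataframes column by column, and removing the class attribute leaves a plain list.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

is.list(df)                      # TRUE
length(df)                       # 2: one list element per column
names(attributes(df))            # the attributes that make it a data.frame
vapply(df, class, character(1))  # list iteration runs over the columns
class(unclass(df))               # "list"
```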

References

ichimoku R package site: https://shikokuchuo.net/ichimoku/

ichimoku CRAN page: https://CRAN.R-project.org/package=ichimoku
