Fast Class-Agnostic Data Manipulation in R

[This article was first published on R, Econometrics, High Performance, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In previous posts I introduced collapse, a powerful (C/C++ based) new framework for data transformation and statistical computing in R – providing advanced grouped, weighted, time series, panel data and recursive computations in R at superior execution speeds, greater flexibility and programmability.

collapse 1.4 released this week additionally introduces an enhanced attribute handling system which enables non-destructive manipulation of vector, matrix or data frame based objects in R. With this post I aim to briefly introduce this attribute handling system and demonstrate that:

  1. collapse non-destructively handles all major matrix (time series) and data frame based classes in R.

  2. Using collapse functions on these objects yields uniform handling at higher computation speeds.

Data Frame Based Objects

The three major data frame based classes in R are the base R data.frame, the data.table and the tibble, for which there also exists grouped (dplyr) and time based (tsibble, tibbletime) versions. Additional notable classes are the panel data frame (plm) and the spatial features data frame (sf).

For the former three collapse offer extremely fast and versatile converters qDF, qDT and qTBL that can be used to turn many R objects into data.frame’s, data.table’s or tibble’s, respectively:

library(collapse); library(data.table); library(tibble)
options(datatable.print.nrows = 10, 
        datatable.print.topn = 2)

identical(qDF(mtcars), mtcars)
## [1] TRUE

mtcarsDT <- qDT(mtcars, row.names.col = "car")
mtcarsDT
##               car  mpg cyl disp  hp drat    wt  qsec vs am gear carb
##  1:     Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
##  2: Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## ---                                                                 
## 31: Maserati Bora 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
## 32:    Volvo 142E 21.4   4  121 109 4.11 2.780 18.60  1  1    4    2

mtcarsTBL <- qTBL(mtcars, row.names.col = "car")
print(mtcarsTBL, n = 3)
## # A tibble: 32 x 12
##   car             mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##                      
## 1 Mazda RX4      21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2 Mazda RX4 Wag  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3 Datsun 710     22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## # ... with 29 more rows

These objects can then be manipulated using an advanced and attribute preserving set of (S3 generic) statistical and data manipulation functions. The following infographic summarizes the core collapse namespace:

More details are provided in the freshly released cheat sheet, and further in the documentation and vignettes.











The statistical functions internally handle grouped and / or weighted computations on vectors, matrices and data frames, and seek to keep the attributes of the object.

# Simple data frame: Grouped mean by cyl -> groups = row.names  
fmean(fselect(mtcars, mpg, disp, drat), g = mtcars$cyl)
##        mpg     disp     drat
## 4 26.66364 105.1364 4.070909
## 6 19.74286 183.3143 3.585714
## 8 15.10000 353.1000 3.229286

With fgroup_by, collapse also introduces a fast grouping mechanism that works together with grouped_df versions of all statistical and transformation functions:

# Using Pipe operators and grouped data frames
library(magrittr)
mtcars %>% fgroup_by(cyl) %>% 
  fselect(mpg, disp, drat, wt) %>% fmean  
##   cyl      mpg     disp     drat       wt
## 1   4 26.66364 105.1364 4.070909 2.285727
## 2   6 19.74286 183.3143 3.585714 3.117143
## 3   8 15.10000 353.1000 3.229286 3.999214

# This is still a data.table 
mtcarsDT %>% fgroup_by(cyl) %>% 
  fselect(mpg, disp, drat, wt) %>% fmean
##    cyl      mpg     disp     drat       wt
## 1:   4 26.66364 105.1364 4.070909 2.285727
## 2:   6 19.74286 183.3143 3.585714 3.117143
## 3:   8 15.10000 353.1000 3.229286 3.999214

# Same with tibble: here computing weighted group means -> also saves sum of weights in each group
mtcarsTBL %>% fgroup_by(cyl) %>% 
  fselect(mpg, disp, drat, wt) %>% fmean(wt)
## # A tibble: 3 x 5
##     cyl sum.wt   mpg  disp  drat
##        
## 1     4   25.1  25.9  110.  4.03
## 2     6   21.8  19.6  185.  3.57
## 3     8   56.0  14.8  362.  3.21

A specialty of the grouping mechanism is that it fully preserves the structure / attributes of the object, and thus permits the creation of a grouped version of any data frame like object.

# This created a grouped data.table
gmtcarsDT <- mtcarsDT %>% fgroup_by(cyl)
gmtcarsDT
##               car  mpg cyl disp  hp drat    wt  qsec vs am gear carb
##  1:     Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
##  2: Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## ---                                                                 
## 31: Maserati Bora 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
## 32:    Volvo 142E 21.4   4  121 109 4.11 2.780 18.60  1  1    4    2
## 
## Grouped by:  cyl  [3 | 11 (3.5)]
# The print shows: [N. groups | Avg. group size (SD around avg. group size)]

# Subsetting drops groups 
gmtcarsDT[1:2]
##              car mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1:     Mazda RX4  21   6  160 110  3.9 2.620 16.46  0  1    4    4
## 2: Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

# Any class-specific methods are independent of the attached groups
gmtcarsDT[, new := mean(mpg)]
gmtcarsDT[, lapply(.SD, mean), by = vs, .SDcols = -1L] # Again groups are dropped
##    vs      mpg      cyl     disp        hp     drat       wt     qsec        am     gear     carb
## 1:  0 16.61667 7.444444 307.1500 189.72222 3.392222 3.688556 16.69389 0.3333333 3.555556 3.611111
## 2:  1 24.55714 4.571429 132.4571  91.35714 3.859286 2.611286 19.33357 0.5000000 3.857143 1.785714
##         new
## 1: 20.09062
## 2: 20.09062

# Groups are always preserved in column-subsetting operations
gmtcarsDT[, 9:13] 
##     vs am gear carb      new
##  1:  0  1    4    4 20.09062
##  2:  0  1    4    4 20.09062
## ---                         
## 31:  0  1    5    8 20.09062
## 32:  1  1    4    2 20.09062
## 
## Grouped by:  cyl  [3 | 11 (3.5)]

The grouping is also dropped in aggregations, but preserved in transformations keeping data dimensions:

# Grouped medians 
fmedian(gmtcarsDT[, 9:13])
##    cyl vs am gear carb      new
## 1:   4  1  1    4  2.0 20.09062
## 2:   6  1  0    4  4.0 20.09062
## 3:   8  0  0    3  3.5 20.09062
# Note: unique grouping columns are stored in the attached grouping object 
# and added if keep.group_vars = TRUE (the default)

# Replacing data by grouped median (grouping columns are not selected and thus not present)
fmedian(gmtcarsDT[, 4:5], TRA = "replace")
##      disp    hp
##  1: 167.6 110.0
##  2: 167.6 110.0
## ---            
## 31: 350.5 192.5
## 32: 108.0  91.0
## 
## Grouped by:  cyl  [3 | 11 (3.5)]

# Weighted scaling and centering data (here also selecting grouping column)
mtcarsDT %>% fgroup_by(cyl) %>% 
  fselect(cyl, mpg, disp, drat, wt) %>% fscale(wt)
##     cyl    wt         mpg       disp      drat
##  1:   6 2.620  0.96916875 -0.6376553 0.7123846
##  2:   6 2.875  0.96916875 -0.6376553 0.7123846
## ---                                           
## 31:   8 3.570  0.07335466 -0.8685527 0.9844833
## 32:   4 2.780 -1.06076989  0.3997723 0.2400387
## 
## Grouped by:  cyl  [3 | 11 (3.5)]

As mentioned, this works for any data frame like object, even a suitable list:

# Here computing a weighted grouped standard deviation
as.list(mtcars) %>% fgroup_by(cyl, vs, am) %>% 
  fsd(wt) %>% str
## List of 11
##  $ cyl   : num [1:7] 4 4 4 6 6 8 8
##  $ vs    : num [1:7] 0 1 1 0 1 0 0
##  $ am    : num [1:7] 1 0 1 1 0 0 1
##  $ sum.wt: num [1:7] 2.14 8.8 14.2 8.27 13.55 ...
##  $ mpg   : num [1:7] 0 1.236 4.833 0.655 1.448 ...
##  $ disp  : num [1:7] 0 11.6 19.25 7.55 39.93 ...
##  $ hp    : num [1:7] 0 17.3 22.7 32.7 8.3 ...
##  $ drat  : num [1:7] 0 0.115 0.33 0.141 0.535 ...
##  $ qsec  : num [1:7] 0 1.474 0.825 0.676 0.74 ...
##  $ gear  : num [1:7] 0 0.477 0.32 0.503 0.519 ...
##  $ carb  : num [1:7] 0 0.477 0.511 1.007 1.558 ...
##  - attr(*, "row.names")= int [1:7] 1 2 3 4 5 6 7

The function fungroup can be used to undo any grouping operation.

identical(mtcarsDT,
          mtcarsDT %>% fgroup_by(cyl, vs, am) %>% fungroup)
## [1] TRUE

Apart from the grouping mechanism with fgroup_by, which is very fast and versatile, collapse also supports regular grouped tibbles created with dplyr:

library(dplyr)
# Same as summarize_all(sum) and considerably faster
mtcars %>% group_by(cyl) %>% fsum
## # A tibble: 3 x 11
##     cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
##             
## 1     4  293. 1157.   909  44.8  25.1  211.    10     8    45    17
## 2     6  138. 1283.   856  25.1  21.8  126.     4     3    27    24
## 3     8  211. 4943.  2929  45.2  56.0  235.     0     2    46    49

# Same as muatate_all(sum)
mtcars %>% group_by(cyl) %>% fsum(TRA = "replace_fill") %>% head(3)
## # A tibble: 3 x 11
## # Groups:   cyl [2]
##     cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
##             
## 1     6  138. 1283.   856  25.1  21.8  126.     4     3    27    24
## 2     6  138. 1283.   856  25.1  21.8  126.     4     3    27    24
## 3     4  293. 1157.   909  44.8  25.1  211.    10     8    45    17

One major goal of the package is to make R suitable for (large) panel data, thus collapse also supports panel-data.frames created with the plm package:

library(plm)
pwlddev <- pdata.frame(wlddev, index = c("iso3c", "year"))

# Centering (within-transforming) columns 9-12 using the within operator W()
head(W(pwlddev, cols = 9:12), 3)
##          iso3c year W.PCGDP  W.LIFEEX W.GINI W.ODA
## ABW-1960   ABW 1960      NA -6.547351     NA    NA
## ABW-1961   ABW 1961      NA -6.135351     NA    NA
## ABW-1962   ABW 1962      NA -5.765351     NA    NA

# Computing growth rates of columns 9-12 using the growth operator G()
head(G(pwlddev, cols = 9:12), 3)
##          iso3c year G1.PCGDP G1.LIFEEX G1.GINI G1.ODA
## ABW-1960   ABW 1960       NA        NA      NA     NA
## ABW-1961   ABW 1961       NA 0.6274558      NA     NA
## ABW-1962   ABW 1962       NA 0.5599782      NA     NA

Perhaps a note about operators is necessary here before proceeding: collapse offers a set of transformation operators for its vector-valued fast functions:

# Operators
.OPERATOR_FUN
##  [1] "STD"  "B"    "W"    "HDB"  "HDW"  "L"    "F"    "D"    "Dlog" "G"

# Corresponding (programmers) functions
setdiff(.FAST_FUN, .FAST_STAT_FUN)
## [1] "fscale"     "fbetween"   "fwithin"    "fHDbetween" "fHDwithin"  "flag"       "fdiff"     
## [8] "fgrowth"

These operators are principally just function shortcuts that exist for parsimony and in-formula use (e.g. to specify dynamic or fixed effects models using lm(), see the documentation). They however also have some useful extra features in the data.frame method, such as internal column-subsetting using the cols argument or stub-renaming transformed columns (adding a ‘W.’ or ‘Gn.’ prefix as shown above). They also permit grouping variables to be passed using formulas, including options to keep (default) or drop those variables in the output. We will see this feature when using time series below.

To round things off for data frames, I demonstrate the use of collapse with classes it was not directly built to support but can also handle very well. Through it’s built in capabilities for handling panel data, tsibble’s can seamlessly be utilized:

library(tsibble)
tsib <- as_tsibble(EuStockMarkets)

# Computing daily and annual growth rates on tsibble
head(G(tsib, c(1, 260), by = ~ key, t = ~ index), 3)
## # A tsibble: 3 x 4 [1s] 
## # Key:       key [1]
##   key   index               G1.value L260G1.value
##                             
## 1 DAX   1991-07-01 02:18:33   NA               NA
## 2 DAX   1991-07-02 12:00:00   -0.928           NA
## 3 DAX   1991-07-03 21:41:27   -0.441           NA

# Computing a compounded annual growth rate
head(G(tsib, 260, by = ~ key, t = ~ index, power = 1/260), 3)
## # A tsibble: 3 x 3 [1s] 
## # Key:       key [1]
##   key   index               L260G1.value
##                         
## 1 DAX   1991-07-01 02:18:33           NA
## 2 DAX   1991-07-02 12:00:00           NA
## 3 DAX   1991-07-03 21:41:27           NA

Similarly for tibbletime:

library(tibbletime); library(tsbox)
# Using the tsbox converter
tibtm <- ts_tibbletime(EuStockMarkets)

# Computing daily and annual growth rates on tibbletime
head(G(tibtm, c(1, 260), t = ~ time), 3)
## # A time tibble: 3 x 9
## # Index: time
##   time                G1.DAX L260G1.DAX G1.SMI L260G1.SMI G1.CAC L260G1.CAC G1.FTSE L260G1.FTSE
##                                                  
## 1 1991-07-01 02:18:27 NA             NA NA             NA  NA            NA  NA              NA
## 2 1991-07-02 12:01:32 -0.928         NA  0.620         NA  -1.26         NA   0.679          NA
## 3 1991-07-03 21:44:38 -0.441         NA -0.586         NA  -1.86         NA  -0.488          NA
# ...

Finally lets consider the simple features data frame:

library(sf)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
## Reading layer `nc' from data source `C:\Users\Sebastian Krantz\Documents\R\win-library\4.0\sf\shape\nc.shp' using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 14 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27

# Fast selecting columns (need to add 'geometry' column to not break the class)
plot(fselect(nc, AREA, geometry))

# Subsetting 
fsubset(nc, AREA > 0.23, NAME, AREA, geometry)
## Simple feature collection with 3 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
##       NAME  AREA                       geometry
## 1  Sampson 0.241 MULTIPOLYGON (((-78.11377 3...
## 2  Robeson 0.240 MULTIPOLYGON (((-78.86451 3...
## 3 Columbus 0.240 MULTIPOLYGON (((-78.65572 3...

# Standardizing numeric columns (by reference)
settransformv(nc, is.numeric, STD, apply = FALSE)
# Note: Here using using operator STD() instead of fscale() to stub-rename standardized columns.
# apply = FALSE uses STD.data.frame on all numeric columns instead of lapply(data, STD)
head(nc, 2)
## Simple feature collection with 2 features and 26 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -81.74107 ymin: 36.23436 xmax: -80.90344 ymax: 36.58965
## geographic CRS: NAD27
##    AREA PERIMETER CNTY_ CNTY_ID      NAME  FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74 BIR79 SID79
## 1 0.114     1.442  1825    1825      Ashe 37009  37009        5  1091     1      10  1364     0
## 2 0.061     1.231  1827    1827 Alleghany 37005  37005        3   487     0      10   542     3
##   NWBIR79                       geometry  STD.AREA STD.PERIMETER STD.CNTY_ STD.CNTY_ID STD.FIPSNO
## 1      19 MULTIPOLYGON (((-81.47276 3... -0.249186    -0.4788595 -1.511125   -1.511125  -1.568344
## 2      12 MULTIPOLYGON (((-81.23989 3... -1.326418    -0.9163351 -1.492349   -1.492349  -1.637282
##   STD.CRESS_ID  STD.BIR74  STD.SID74 STD.NWBIR74  STD.BIR79  STD.SID79 STD.NWBIR79
## 1    -1.568344 -0.5739411 -0.7286824  -0.7263602 -0.5521659 -0.8863574  -0.6750055
## 2    -1.637282 -0.7308990 -0.8571979  -0.7263602 -0.7108697 -0.5682866  -0.6785480




Matrix Based Objects

collapse also offers a converter qM to efficiently convert various objects to matrix:

m <- qM(mtcars)

Grouped and / or weighted computations and transformations work as with with data frames:

# Grouped means
fmean(m, g = mtcars$cyl)
##        mpg cyl     disp        hp     drat       wt     qsec        vs        am     gear     carb
## 4 26.66364   4 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
## 6 19.74286   6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
## 8 15.10000   8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000

# Grouped and weighted standardizing
head(fscale(m, g = mtcars$cyl, w = mtcars$wt), 3)
##                      mpg cyl        disp         hp       drat         wt       qsec         vs
## Mazda RX4      0.9691687 NaN -0.63765527 -0.5263758  0.7123846 -1.6085211 -1.0438559 -1.2509539
## Mazda RX4 Wag  0.9691687 NaN -0.63765527 -0.5263758  0.7123846 -0.8376064 -0.6921302 -1.2509539
## Datsun 710    -0.7333024 NaN -0.08822497  0.4896429 -0.5526066 -0.1688057 -0.4488514  0.2988833
##                     am        gear      carb
## Mazda RX4     1.250954  0.27612029  0.386125
## Mazda RX4 Wag 1.250954  0.27612029  0.386125
## Datsun 710    0.719370 -0.09429567 -1.133397

Various matrix-based time series classes such as xts / zoo and timeSeries are also easily handled:

# ts / mts
# Note: G() by default renames the columns, fgrowth() does not
plot(G(EuStockMarkets))

# xts
library(xts) 
ESM_xts <- ts_xts(EuStockMarkets) # using tsbox::ts_xts
head(G(ESM_xts), 3)
##                         G1.DAX     G1.SMI    G1.CAC    G1.FTSE
## 1991-07-01 02:18:27         NA         NA        NA         NA
## 1991-07-02 12:01:32 -0.9283193  0.6197485 -1.257897  0.6793256
## 1991-07-03 21:44:38 -0.4412412 -0.5863192 -1.856612 -0.4877652

plot(G(ESM_xts), legend.loc = "bottomleft")

# timeSeries
library(timeSeries) # using tsbox::ts_timeSeries
ESM_timeSeries <- ts_timeSeries(EuStockMarkets)
# Note: G() here also renames the columns but the names of the series are also stored in an attribute
head(G(ESM_timeSeries), 3)
## GMT
##                            DAX        SMI       CAC       FTSE
## 1991-06-30 23:18:27         NA         NA        NA         NA
## 1991-07-02 09:01:32 -0.9283193  0.6197485 -1.257897  0.6793256
## 1991-07-03 18:44:38 -0.4412412 -0.5863192 -1.856612 -0.4877652

plot(G(ESM_timeSeries), plot.type = "single", at = "pretty")
legend("bottomleft", colnames(G(qM(ESM_timeSeries))), lty = 1, col = 1:4)

Aggregating these objects yields a plain matrix with groups in the row-names:

# Aggregating by year: creates plain matrix with row-names (g is second argument)
EuStockMarkets %>% fmedian(round(time(.)))
##           DAX     SMI    CAC    FTSE
## 1991 1628.750 1678.10 1772.8 2443.60
## 1992 1649.550 1733.30 1863.5 2558.50
## 1993 1606.640 2061.70 1837.5 2773.40
## 1994 2089.770 2727.10 2148.0 3111.40
## 1995 2072.680 2591.60 1918.5 3091.70
## 1996 2291.820 3251.60 1946.2 3661.65
## 1997 2861.240 3906.55 2297.1 4075.35
## 1998 4278.725 6077.40 3002.7 5222.20
## 1999 5905.150 8102.70 4205.4 5884.50

# Same thing with the other objects
all_obj_equal(ESM_xts %>% fmedian(substr(time(.), 1L, 4L)),
              ESM_timeSeries %>% fmedian(substr(time(.), 1L, 4L)))
## [1] TRUE

Benchmarks

Extensive benchmarks and examples against native dplyr / tibble and plm are provided here and here, making it evident that collapse provides both greater versatility and massive performance improvements over the methods defined for these objects. Benchmarks against data.table were provided in a previous post, where collapse compared favorably on a 2-core machine (particularly for weighted and := type operations). In general collapse functions are extremely well optimized, with basic execution speeds below 30 microseconds, and efficiently scale to larger operations. Most importantly, they preserve the data structure and attributes (including column attributes) of the objects passed to them. They also efficiently skip missing values and avoid some of the undesirable behavior endemic of base R1.

Here I will add to the above resources just a small benchmark to prove that computations with collapse are also faster than any native methods and suggested programming principles for the various time series classes:

library(dplyr) # needed for tibbletime / tsibble comparison    
library(microbenchmark)

# Computing the first difference
microbenchmark(ts = diff(EuStockMarkets),
               collapse_ts = fdiff(EuStockMarkets),
               xts = diff(ESM_xts),
               collapse_xts = fdiff(ESM_xts),
               timeSeries = diff(ESM_timeSeries),
               collapse_timeSeries = fdiff(ESM_timeSeries),
               # taking difference function from tsibble
               dplyr_tibbletime = mutate_at(tibtm, 2:5, difference, order_by = tibtm$time),
               collapse_tibbletime_D = D(tibtm, t = ~ time),
               # collapse equivalent to the dplyr method (tfmv() abbreviates ftransformv())
               collapse_tibbletime_tfmv = tfmv(tibtm, 2:5, fdiff, t = time, apply = FALSE),
               # dplyr helpers provided by tsibble package
               dplyr_tsibble = mutate(group_by_key(tsib), value = difference(value, order_by = index)),
               collapse_tsibble_D = D(tsib, 1, 1, ~ key, ~ index),
               # Again we can do the same using collapse (tfm() abbreviates ftransform())
               collapse_tsibble_tfm = tfm(tsib, value = fdiff(value, 1, 1, key, index)))
## Unit: microseconds
##                      expr      min        lq        mean    median         uq        max neval cld
##                        ts 1344.993 1458.7855  1843.66603 1591.7675  1790.3480   9325.697   100 a  
##               collapse_ts   20.974   37.4850    50.27008   49.5340    58.4585    135.213   100 a  
##                       xts   84.788  131.4205   319.51851  147.7085   161.9885  15576.297   100 a  
##              collapse_xts   38.824   60.2440    77.00934   73.1845    85.9030    214.199   100 a  
##                timeSeries 1364.628 1630.3680  1907.73838 1775.3990  2051.8495   2887.227   100 a  
##       collapse_timeSeries   42.840   62.9220    86.59470   77.8705    91.0350    671.157   100 a  
##          dplyr_tibbletime 5835.143 6267.7805  7371.78980 6681.0065  7534.0105  37462.544   100  b 
##     collapse_tibbletime_D  430.630  479.9400   565.78952  536.1675   601.0960    923.288   100 a  
##  collapse_tibbletime_tfmv  412.780  464.9910   557.34657  511.6240   612.4760   1460.570   100 a  
##             dplyr_tsibble 7539.811 8328.7780 11490.09014 8791.9835 10098.8220 223112.537   100   c
##        collapse_tsibble_D  757.730  821.9900  1015.04996  909.2310   996.0265   6766.910   100 a  
##      collapse_tsibble_tfm  729.616  783.8350  1035.57745  862.5980   907.0000  13540.958   100 a

# Sequence of lagged/leaded and iterated differences (not supported by either of these methods)
head(fdiff(ESM_xts, -1:1, diff = 1:2)[, 1:6], 3) 
##                     FD1.DAX FD2.DAX     DAX D1.DAX D2.DAX FD1.SMI
## 1991-07-01 02:18:27   15.12    8.00 1628.75     NA     NA   -10.4
## 1991-07-02 12:01:32    7.12   21.65 1613.63 -15.12     NA     9.9
## 1991-07-03 21:44:38  -14.53  -17.41 1606.51  -7.12      8    -5.5
head(D(tibtm, -1:1, diff = 1:2, t = ~ time), 3)
## # A time tibble: 3 x 21
## # Index: time
##   time                FD1.DAX FD2.DAX   DAX D1.DAX D2.DAX FD1.SMI FD2.SMI   SMI D1.SMI D2.SMI FD1.CAC
##                                         
## 1 1991-07-01 02:18:27   15.1     8.00 1629.  NA     NA      -10.4   -20.3 1678.   NA     NA      22.3
## 2 1991-07-02 12:01:32    7.12   21.7  1614. -15.1   NA        9.9    15.4 1688.   10.4   NA      32.5
## 3 1991-07-03 21:44:38  -14.5   -17.4  1607.  -7.12   8.00    -5.5    -3   1679.   -9.9  -20.3     9.9
## # ... with 9 more variables: FD2.CAC , CAC , D1.CAC , D2.CAC , FD1.FTSE ,
## #   FD2.FTSE , FTSE , D1.FTSE , D2.FTSE 
head(D(tsib, -1:1, diff = 1:2, ~ key, ~ index), 3)
## # A tsibble: 3 x 7 [1s] 
## # Key:       key [1]
##   key   index               FD1.value FD2.value value D1.value D2.value
##                                    
## 1 DAX   1991-07-01 02:18:33     15.1       8.00 1629.    NA       NA   
## 2 DAX   1991-07-02 12:00:00      7.12     21.7  1614.   -15.1     NA   
## 3 DAX   1991-07-03 21:41:27    -14.5     -17.4  1607.    -7.12     8.00

microbenchmark(collapse_xts = fdiff(ESM_xts, -1:1, diff = 1:2),
               collapse_tibbletime = D(tibtm, -1:1, diff = 1:2, t = ~ time),
               collapse_tsibble = D(tsib, -1:1, diff = 1:2, ~ key, ~ index))
## Unit: microseconds
##                 expr     min      lq      mean    median        uq        max neval cld
##         collapse_xts  99.067 127.404 4328.5683  146.8155  177.1610 222804.179   100   a
##  collapse_tibbletime 504.707 561.827  613.9219  585.7020  614.7075   1100.003   100   a
##     collapse_tsibble 849.657 945.600 1060.1478 1011.1990 1083.4915   1729.659   100   a

Conclusion

This concludes this short demonstration. collapse is an advanced, fast and versatile data manipulation package. If you have followed until here I am convinced you will find it very useful, particularly if you are working in advanced statistics, econometrics, surveys, time series, panel data and the like, or if you care much about performance and non-destructive working in R. For more information about the package see the website, study the cheat sheet or call help("collapse-documentation") after install to bring up the built-in documentation.

Appendix: So how does this all actually work?

Statistical functions like fmean are S3 generic with user visible ‘default’, ‘matrix’ and ‘data.frame’ methods, and hidden ‘list’ and ‘grouped_df’ methods. Transformation functions like fwithin additionally have ‘pseries’ and ‘pdata.frame’ methods to support plm objects.

The ‘default’, ‘matrix’ and ‘data.frame’ methods handle object attributes intelligently. In the case of ‘data.frame’s’ only the ‘row.names’ attribute is modified accordingly, other attributes (including column attributes) are preserved. This also holds for data manipulation functions like fselect, fsubset, ftransform etc.. ‘default’ and ‘matrix’ methods preserve attributes as long as the data dimensions are kept.

In addition, the ‘default’ method checks if its argument is actually a matrix, and calls the matrix method if is.matrix(x) && !inherits(x, "matrix") is TRUE. This prevents classed matrix-based objects (such as xts time series) not inheriting from ‘matrix’ being handled by the default method.


  1. For example. mean(NA, na.rm = TRUE) gives NaN, sum(NA, na.rm = TRUE) gives 0 and max(NA, na.rm = TRUE) gives -Inf whereas all_identical(NA_real_, fmean(NA), fsum(NA), fmax(NA)). na.rm = TRUE is the default setting for all collapse functions. Setting na.rm = FALSE also checks for missing values and returns NA if found instead of just running through the entire computation and then returning a NA or NaN value which is unreliable and inefficient.↩︎

To leave a comment for the author, please follow the link and comment on their blog: R, Econometrics, High Performance.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)