Creating empty data frames with dfTemplate() and dfTemplateMatch()

R-bloggers on inSileco

3 years ago

[This article was first published on R-bloggers on inSileco, and kindly contributed to R-bloggers.]


Creating a data frame is fairly simple, but when you need to create a large empty data frame with columns of different classes, it takes several lines of code. A few days ago, I decided to write a function to simplify this operation, and I came to realize that such a function would also be very useful to ease the row binding of data frames whose column names only partially match. How so? This post is meant to answer this question!
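For reference, here is a minimal base R sketch of the manual approach, without any package. The column names (id, label, value) are made up for illustration; each column needs its own typed NA to get the desired class.

```r
# Base R only: an "empty" 3-row data frame with three different column classes.
n <- 3
df_base <- data.frame(
  id    = rep(NA_integer_, n),    # integer column
  label = rep(NA_character_, n),  # character column
  value = rep(NA_real_, n),       # numeric (double) column
  stringsAsFactors = FALSE
)
sapply(df_base, class)
```

Every additional column means one more line, which is the repetition a template function can remove.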

inSilecoMisc

First of all, the functions I am using in this post are available in inSilecoMisc, an R package in which we gathered the miscellaneous functions we wrote and deem worth sharing on GitHub. So the first step to reproduce the examples below is to install inSilecoMisc, which is straightforward with devtools:

library(devtools)
install_github("inSileco/inSilecoMisc")

Then, load it:

library(inSilecoMisc)

In this post, I’ll exemplify how to use dfTemplate() and dfTemplateMatch(), but if you are interested in other functions in the package, check out the tour vignette.

Generating empty data frames efficiently

Let’s start with dfTemplate(), which creates a data frame with a specific number of columns.

df1 <- dfTemplate(cols = 2)
df1
##   Var1 Var2
## 1   NA   NA
class(df1)
## [1] "data.frame"

By default, the data frame created has only 1 row and the columns are filled out with NA. This can readily be changed using arguments nrows and fill.

df2 <- dfTemplate(2, nrows = 4, fill = 0)
df2
##   Var1 Var2
## 1    0    0
## 2    0    0
## 3    0    0
## 4    0    0
df3 <- dfTemplate(cols = 2, nrows = 3, fill = "")
df3
##   Var1 Var2
## 1          
## 2          
## 3

Column classes are determined by fill:

class(df1[,1])
class(df2[,1])
class(df3[,1])
## [1] "logical"
## [1] "numeric"
## [1] "character"
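The three classes above are simply the classes of the fill values themselves in base R. The short check below is independent of inSilecoMisc, and the variable names are only illustrative.

```r
# Classes of the three fill values used above, checked in base R.
fills <- list(NA, 0, "")
fill_classes <- vapply(fills, class, character(1))
fill_classes  # "logical" "numeric" "character"

# A base R data frame column inherits the class of the value it is built from.
class(data.frame(x = NA)$x)  # "logical"
```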

And col_classes is used to change these classes:

df4 <- dfTemplate(cols = 2, col_classes = "character")
class(df4[, 1])
class(df4[, 2])
## [1] "character"
## [1] "character"

Arguments fill and col_classes can be vectors that specify the content and class of every column:

df5 <- dfTemplate(2, 5, col_classes = c("character", "numeric"), fill = c("", 5))
df5
class(df5[, 1])
class(df5[, 2])
##   Var1 Var2
## 1         5
## 2         5
## 3         5
## 4         5
## 5         5
## [1] "character"
## [1] "numeric"
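One detail worth keeping in mind in the df5 example: in base R, c("", 5) is a character vector, because c() coerces its elements to a common type. The second column ends up numeric in the output above, which suggests that col_classes converts it back. The base R sketch below only illustrates the coercion itself and makes no claim about the package internals.

```r
# c() coerces mixed elements to a common type: here everything becomes character.
fill <- c("", 5)
class(fill)          # "character"
fill                 # "" "5"

# Converting the second element back yields the number 5.
as.numeric(fill[2])  # 5

# A list, in contrast, keeps each element's own class.
fill_list <- list("", 5)
vapply(fill_list, class, character(1))  # "character" "numeric"
```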

Another useful feature of dfTemplate() is that the column names of the data frame to be created can be passed as the first argument (cols):

df5 <- dfTemplate(c("category", "value"))

So, now you are able to create custom data frames with a set of column names!

nms <- LETTERS[1:10]
df6 <- dfTemplate(nms, 10, fill = tolower(nms))
df6
##    A B C D E F G H I J
## 1  a b c d e f g h i j
## 2  a b c d e f g h i j
## 3  a b c d e f g h i j
## 4  a b c d e f g h i j
## 5  a b c d e f g h i j
## 6  a b c d e f g h i j
## 7  a b c d e f g h i j
## 8  a b c d e f g h i j
## 9  a b c d e f g h i j
## 10 a b c d e f g h i j

How to flexibly `rbind` a list of data frames

Sometimes we need to rbind data frames that do not have all the columns the final data frame must contain. In such cases, we first need to add the missing columns, because otherwise the default rbind() function won’t work. Another solution is to use a package with a function that does so. For instance, rbind.fill() from the plyr package performs such a flexible rbind. Also, the package data.table includes an rbind() method for data.table objects that handles this situation (see this answer on ). In this last section, I would like to show how to rbind data frames flexibly with dfTemplateMatch(), which is written in base R.
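To see the problem concretely, here is a small base R sketch with two toy data frames (a and b are made up for illustration). It shows that rbind() fails when column names differ, and then the manual workaround of adding the missing columns by hand.

```r
a <- data.frame(A = "a", B = "b", stringsAsFactors = FALSE)
b <- data.frame(B = "b", C = "c", stringsAsFactors = FALSE)

# rbind() refuses data frames whose column names differ.
res <- try(rbind(a, b), silent = TRUE)
inherits(res, "try-error")  # TRUE

# Manual fix: add every missing column, filled with NA, then bind.
all_cols <- union(names(a), names(b))
a[setdiff(all_cols, names(a))] <- NA
b[setdiff(all_cols, names(b))] <- NA
out <- rbind(a, b)
out
```

This works, but the two setdiff() lines have to be repeated for every data frame, which quickly becomes tedious for a long list.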

Let me first introduce dfTemplateMatch(). This function takes a data frame as its first argument (x), and the second argument (y) can be another data frame or a vector of character strings. Based on x and y, dfTemplateMatch() creates a data frame that has the same number of rows as x and adds columns for all names found in y that are not found in x. Before calling dfTemplateMatch(), I create two data frames:

df7 <- df6[1:5, 1:4]
df7
##   A B C D
## 1 a b c d
## 2 a b c d
## 3 a b c d
## 4 a b c d
## 5 a b c d
df8 <- df6[4:6]
df8
##    D E F
## 1  d e f
## 2  d e f
## 3  d e f
## 4  d e f
## 5  d e f
## 6  d e f
## 7  d e f
## 8  d e f
## 9  d e f
## 10 d e f

Now I use dfTemplateMatch() to create a third data frame based on the two others:

dfTemplateMatch(df7, df8)
##   A B C D  E  F
## 1 a b c d NA NA
## 2 a b c d NA NA
## 3 a b c d NA NA
## 4 a b c d NA NA
## 5 a b c d NA NA

As expected, the output has 5 rows, like df7, and the columns that are in df8 but not in df7 have been appended to df7. It is possible to use the arguments fill and col_classes to customize the added columns.

dfTemplateMatch(df7, df8, fill = 1, col_classes = "numeric")
##   A B C D E F
## 1 a b c d 1 1
## 2 a b c d 1 1
## 3 a b c d 1 1
## 4 a b c d 1 1
## 5 a b c d 1 1

There is also an argument yonly that lets the user keep only the columns named in y (when yonly = TRUE).

dfTemplateMatch(df7, df8, yonly = TRUE, fill = 1, col_classes = "numeric")
##   D E F
## 1 d 1 1
## 2 d 1 1
## 3 d 1 1
## 4 d 1 1
## 5 d 1 1
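The behavior shown in these three calls can be approximated in a few lines of base R. The helper below, match_template(), is hypothetical and written only for this illustration. It is not the source code of dfTemplateMatch() and it ignores col_classes.

```r
# Hypothetical helper mimicking the behavior described above (not the package code).
match_template <- function(x, y, yonly = FALSE, fill = NA) {
  # y may be a data frame (use its column names) or a character vector of names.
  y_names <- if (is.data.frame(y)) names(y) else as.character(y)
  # Append every column named in y that x does not have yet.
  for (nm in setdiff(y_names, names(x))) {
    x[[nm]] <- rep(fill, nrow(x))
  }
  # Optionally drop the columns of x that are not named in y.
  if (yonly) {
    x <- x[, names(x) %in% y_names, drop = FALSE]
  }
  x
}

x <- data.frame(A = c("a", "a"), D = c("d", "d"), stringsAsFactors = FALSE)
res_all <- match_template(x, c("D", "E", "F"))
res_y   <- match_template(x, c("D", "E", "F"), yonly = TRUE, fill = 1)
names(res_all)  # "A" "D" "E" "F"
names(res_y)    # "D" "E" "F"
```

As in the outputs above, the columns of x keep their original order and the new columns are appended at the end.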

Now let me show you how to rbind() a specific subset of columns from a list of data frames that may or may not have these columns. Let me start by creating a list of data frames.

lsdf <- apply(
  replicate(5, sample(nms, 5)),
  2,
  function(x) dfTemplate(x, nrows = 5, fill = tolower(x))
)
lsdf
## [[1]]
##   C F E H B
## 1 c f e h b
## 2 c f e h b
## 3 c f e h b
## 4 c f e h b
## 5 c f e h b
## 
## [[2]]
##   G F D C B
## 1 g f d c b
## 2 g f d c b
## 3 g f d c b
## 4 g f d c b
## 5 g f d c b
## 
## [[3]]
##   E B I J C
## 1 e b i j c
## 2 e b i j c
## 3 e b i j c
## 4 e b i j c
## 5 e b i j c
## 
## [[4]]
##   B C E D F
## 1 b c e d f
## 2 b c e d f
## 3 b c e d f
## 4 b c e d f
## 5 b c e d f
## 
## [[5]]
##   G H E B F
## 1 g h e b f
## 2 g h e b f
## 3 g h e b f
## 4 g h e b f
## 5 g h e b f

The goal here is to create a data frame that contains only the first five columns, i.e. A, B, C, D and E. The remaining columns must be discarded, and when a selected column is missing, it must be added (filled with NA). To do so, I simply need to call dfTemplateMatch():

lsdf2 <- lapply(lsdf, dfTemplateMatch, LETTERS[1:5], yonly = TRUE)
lsdf2
## [[1]]
##   C E B  A  D
## 1 c e b NA NA
## 2 c e b NA NA
## 3 c e b NA NA
## 4 c e b NA NA
## 5 c e b NA NA
## 
## [[2]]
##   D C B  A  E
## 1 d c b NA NA
## 2 d c b NA NA
## 3 d c b NA NA
## 4 d c b NA NA
## 5 d c b NA NA
## 
## [[3]]
##   E B C  A  D
## 1 e b c NA NA
## 2 e b c NA NA
## 3 e b c NA NA
## 4 e b c NA NA
## 5 e b c NA NA
## 
## [[4]]
##   B C E D  A
## 1 b c e d NA
## 2 b c e d NA
## 3 b c e d NA
## 4 b c e d NA
## 5 b c e d NA
## 
## [[5]]
##   E B  A  C  D
## 1 e b NA NA NA
## 2 e b NA NA NA
## 3 e b NA NA NA
## 4 e b NA NA NA
## 5 e b NA NA NA

And now I can seamlessly rbind() the list lsdf2!

do.call(rbind, lsdf2)
##       C    E B  A    D
## 1     c    e b NA <NA>
## 2     c    e b NA <NA>
## 3     c    e b NA <NA>
## 4     c    e b NA <NA>
## 5     c    e b NA <NA>
## 6     c <NA> b NA    d
## 7     c <NA> b NA    d
## 8     c <NA> b NA    d
## 9     c <NA> b NA    d
## 10    c <NA> b NA    d
## 11    c    e b NA <NA>
## 12    c    e b NA <NA>
## 13    c    e b NA <NA>
## 14    c    e b NA <NA>
## 15    c    e b NA <NA>
## 16    c    e b NA    d
## 17    c    e b NA    d
## 18    c    e b NA    d
## 19    c    e b NA    d
## 20    c    e b NA    d
## 21 <NA>    e b NA <NA>
## 22 <NA>    e b NA <NA>
## 23 <NA>    e b NA <NA>
## 24 <NA>    e b NA <NA>
## 25 <NA>    e b NA <NA>
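Note that the data frames in lsdf2 do not list their columns in the same order, yet the result is correct. This is because rbind() on data frames matches columns by name rather than by position, as the following base R sketch with two toy data frames (d1 and d2, made up for illustration) shows.

```r
# Same column names, different order.
d1 <- data.frame(C = "c", B = "b", stringsAsFactors = FALSE)
d2 <- data.frame(B = "B2", C = "C2", stringsAsFactors = FALSE)

# rbind() keeps the column order of the first data frame and matches by name.
out <- do.call(rbind, list(d1, d2))
out
```

The second row has C2 under column C and B2 under column B, so only the set of column names has to agree, not their order.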

Voilà! This is what I call a flexible `rbind`! I hope you’ll find this helpful!

Session info

sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] inSilecoMisc_0.4.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4      bookdown_0.18   digest_0.6.25   crayon_1.3.4   
##  [5] magrittr_1.5    evaluate_0.14   blogdown_0.18   rlang_0.4.5    
##  [9] stringi_1.4.6   rmarkdown_2.1   tools_4.0.0     stringr_1.4.0  
## [13] glue_1.4.0      xfun_0.12       yaml_2.2.1      compiler_4.0.0 
## [17] htmltools_0.4.0 knitr_1.28
