Which function rbinds dataframes together fastest?

First competitor: classic rbind in a for loop over a list of dataframes

Second competitor: do.call("rbind", )

Third competitor: rbind.fill() from the plyr package
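To make the three competitors concrete, here is a minimal sketch on a toy dataframe. The names toy, parts, looped, called and filled are made up for illustration, and the rbind.fill() call only runs if plyr is installed.

```r
# Toy data: one splitting column plus one numeric column (hypothetical names)
toy <- data.frame(g = rep(c("A", "B", "C"), each = 2), x = 1:6)
parts <- split(toy, toy$g)

# 1. Classic rbind in a for loop: the result grows by one piece per iteration
looped <- data.frame()
for (part in parts) {
  looped <- rbind(looped, part)
}

# 2. do.call: a single rbind call that receives all pieces as arguments
called <- do.call("rbind", parts)

# 3. rbind.fill from plyr (skipped if plyr is not installed)
if (requireNamespace("plyr", quietly = TRUE)) {
  filled <- plyr::rbind.fill(parts)
}

# The approaches label the rows differently, so compare the contents only
all(looped$x == called$x)
```

All three take the same list of sub-dataframes, so swapping one for another in existing code is usually a one-line change.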

The job:

– rbinding a list of dataframes with 4 columns each, one column is the splitting factor, the other 3 hold normally distributed random data

– the size parameter i is varied between 20,000; 50,000; 100,000; 200,000; 300,000; 400,000; 500,000 and 600,000; the original dataframe built in the code below has i*6 rows

– the number of levels for the splitting factor (hence the number of list elements after splitting) is varied between 6, 12 and 24 – the total number of rows for the original dataframe is held constant

The machine:

– A blazing fast late 2008 MacBook with a 2 GHz CPU and 4 GB of RAM running Mountain Lion

– 32-bit R using RGui.app for Mac OS X

The results:

rbind.fill is the fastest function for each number of sub-dataframes (no surprises here). The classic rbind in a for loop is massively influenced by the number of sub-dataframes: every iteration copies the result accumulated so far, so more pieces means more copying!

The code:

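    library(plyr)  # provides rbind.fill()

    # One row of timings is appended per data size
    time.df <- data.frame()

    for (i in c(20000, 50000, 100000, 200000, 300000, 400000, 500000, 600000)) {
      cat(i, "\n")  # progress output

      # 6 factor levels, each repeated i times, give i*6 rows
      df <- data.frame(a = rep(c("A", "B", "C", "D", "E", "F"), i),
                       b = sample(rnorm(i*6), i*6),
                       c = sample(rnorm(i*6), i*6),
                       d = sample(rnorm(i*6), i*6))
      split.df <- split(df, df$a)

      # Competitor 1: classic rbind in a for loop
      t1 <- Sys.time()
      df1 <- data.frame()
      for (subdf in split.df) {
        df1 <- rbind(df1, subdf)
      }
      t2 <- Sys.time()

      # Competitor 2: do.call
      t3 <- Sys.time()
      df2 <- do.call("rbind", split.df)
      t4 <- Sys.time()

      # Competitor 3: rbind.fill
      t5 <- Sys.time()
      df3 <- rbind.fill(split.df)
      t6 <- Sys.time()

      # Fix the units: difftime() otherwise picks them automatically,
      # which could mix seconds and minutes across rows
      new.row <- data.frame(n = i*6,
                            classic = difftime(t2, t1, units = "secs"),
                            docall = difftime(t4, t3, units = "secs"),
                            rbindfill = difftime(t6, t5, units = "secs"))
      time.df <- rbind(time.df, new.row)
    }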

Adapt the creation procedure of df for the different number of sub-dataframes…
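One way this adaptation could look is sketched below. make.df is a hypothetical helper (not part of the code above) that takes the total number of rows n and the number of levels k, so n stays fixed while k changes. It draws the numeric columns with rnorm() directly instead of sample(rnorm()), which gives the same kind of data.

```r
# Hypothetical helper: n rows in total, spread evenly over k factor levels
make.df <- function(n, k) {
  stopifnot(k <= length(LETTERS), n %% k == 0)
  data.frame(a = rep(LETTERS[1:k], n / k),
             b = rnorm(n),
             c = rnorm(n),
             d = rnorm(n))
}

# Same total size, different number of sub-dataframes after splitting
for (k in c(6, 12, 24)) {
  df <- make.df(n = 120000, k = k)
  split.df <- split(df, df$a)
  cat(k, "levels:", length(split.df), "pieces of",
      nrow(split.df[[1]]), "rows each\n")
}
```

Inside the benchmark loop, the df <- data.frame(...) call would then be replaced by something like make.df(i*6, k), with k set to 6, 12 or 24 for each run.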

