data.table or data.frame?

February 2, 2013

(This article was first published on - R, and kindly contributed to R-bloggers)

I spent a portion of today trying to convince a colleague that there are times when the data.table package is faster than traditional methods in R. It took a few of the tests below to prove the point.

Generate a data.frame with a character column and a numeric column to use as test data.

  df <- data.frame(letters = as.character(sample(letters[1:10], 1e+08, replace = TRUE)), 
      numbers = sample(1:100, 1e+08, replace = TRUE))
  head(df)

  ##   letters numbers
  ## 1       f      69
  ## 2       j      65
  ## 3       h      29
  ## 4       c      69
  ## 5       j      12
  ## 6       e      65

Aggregate using the base R function aggregate.

  start <- proc.time()
  aggregate(numbers ~ letters, data = df, FUN = sum)

  ##    letters   numbers
  ## 1        a 504884636
  ## 2        b 504587923
  ## 3        c 505357057
  ## 4        d 505106809
  ## 5        e 504788174
  ## 6        f 505219078
  ## 7        g 504796095
  ## 8        h 504693166
  ## 9        i 505079861
  ## 10       j 505044118

  aggregate_time <- proc.time() - start
  aggregate_time

  ##    user  system elapsed 
  ##  120.13   30.51  261.79
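
As a side note, the same kind of measurement can be taken with base R's system.time() instead of subtracting two proc.time() snapshots. The sketch below is not what I ran for the numbers in this post: it uses a much smaller, seeded data set (n and small_df are assumptions of mine) so that it finishes quickly on any machine.

```r
# Minimal sketch: time an aggregation with system.time() rather than
# subtracting proc.time() snapshots. The small n is an assumption chosen
# so the example runs quickly; it is not the size used in the post.
set.seed(42)                      # make the sample reproducible
n <- 1e5
small_df <- data.frame(
  letters = sample(letters[1:10], n, replace = TRUE),
  numbers = sample(1:100, n, replace = TRUE)
)

# system.time() evaluates the expression and returns a proc_time object
# with the same user/system/elapsed fields shown above.
timing <- system.time(
  result <- aggregate(numbers ~ letters, data = small_df, FUN = sum)
)

timing["elapsed"]   # wall-clock seconds for the aggregate() call
result              # one row per letter
```

Assigning inside system.time() keeps the aggregated result available afterwards, so the timing and the output come from a single run.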

Aggregate using ddply from the package plyr.

  require("plyr")

  ## Loading required package: plyr

  start <- proc.time()
  ddply(df, .(letters), summarize, sums = sum(numbers))

  ##    letters      sums
  ## 1        a 504884636
  ## 2        b 504587923
  ## 3        c 505357057
  ## 4        d 505106809
  ## 5        e 504788174
  ## 6        f 505219078
  ## 7        g 504796095
  ## 8        h 504693166
  ## 9        i 505079861
  ## 10       j 505044118

  ddply_time <- proc.time() - start
  ddply_time

  ##    user  system elapsed 
  ##   22.04   27.38  192.99

Aggregate using the data.table package.

  require("data.table")

  ## Loading required package: data.table

  start <- proc.time()
  dt <- data.table(df, key = "letters")
  dt[, list(sums = sum(numbers)), by = c("letters")]

  ##     letters      sums
  ##  1:       a 504884636
  ##  2:       b 504587923
  ##  3:       c 505357057
  ##  4:       d 505106809
  ##  5:       e 504788174
  ##  6:       f 505219078
  ##  7:       g 504796095
  ##  8:       h 504693166
  ##  9:       i 505079861
  ## 10:       j 505044118

  dt_time <- proc.time() - start
  dt_time

  ##    user  system elapsed 
  ##   7.102   7.017  55.957
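
The timing above lumps the conversion to a keyed data.table together with the grouped sum. If you want to see how much each step costs, one possible approach is to time them separately. This is only a sketch on a small, seeded data set (n, small_df, convert_time and group_time are names I made up for illustration), so the absolute numbers will not match the ones in this post.

```r
library(data.table)

# Small seeded data set; the size is an assumption so the sketch runs fast.
set.seed(42)
n <- 1e5
small_df <- data.frame(
  letters = sample(letters[1:10], n, replace = TRUE),
  numbers = sample(1:100, n, replace = TRUE)
)

# Step 1: convert to a data.table and set the key (this sorts by letters).
convert_time <- system.time({
  dt <- as.data.table(small_df)
  setkey(dt, letters)
})

# Step 2: the grouped sum on its own, with the conversion already done.
group_time <- system.time(
  sums <- dt[, list(sums = sum(numbers)), by = "letters"]
)

convert_time["elapsed"]
group_time["elapsed"]
sums
```

Whether the conversion cost should count against data.table depends on the workflow: if the data stays in a data.table for many queries, it is paid only once.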

Comparison of the system times (the second element of each proc.time() result).

  # how many times slower is aggregate
  aggregate_time[2]/ddply_time[2]

  ## sys.self 
  ##    1.114

  aggregate_time[2]/dt_time[2]

  ## sys.self 
  ##    4.347

  
  # how many times slower is ddply
  ddply_time[2]/aggregate_time[2]

  ## sys.self 
  ##   0.8975

  ddply_time[2]/dt_time[2]

  ## sys.self 
  ##    3.902

  
  # how many times slower is data.table
  dt_time[2]/aggregate_time[2]

  ## sys.self 
  ##     0.23

  dt_time[2]/ddply_time[2]

  ## sys.self 
  ##   0.2563

Based on 100 million observations (1e+08 rows), with the time to convert to a data.table included in the time elapsed.

  1. ddply requires ~0.8975x the system time of aggregate (about 10% less)
  2. aggregate requires ~4.347x the system time of data.table
  3. ddply requires ~3.902x the system time of data.table
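
System time is only one of the three fields proc.time() reports. As a rough sketch, the same ratios can be worked out for user and elapsed time as well, using the timings printed earlier in this post (the timings matrix below just retypes those numbers by hand; the object names are mine).

```r
# Timings copied from the proc.time() outputs reported above (seconds).
timings <- rbind(
  aggregate  = c(user = 120.13, system = 30.51, elapsed = 261.79),
  ddply      = c(user = 22.04,  system = 27.38, elapsed = 192.99),
  data.table = c(user = 7.102,  system = 7.017, elapsed = 55.957)
)

# Divide every row by the data.table row, column by column, to get
# "how many times slower than data.table" for each kind of time.
ratios <- sweep(timings, 2, timings["data.table", ], "/")
round(ratios, 2)
```

On elapsed (wall-clock) time, this puts aggregate at roughly 4.7x and ddply at roughly 3.4x the time of data.table, so the ranking is the same as for system time.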

Conclusion - data.table for the win.
