data.table or data.frame?

February 2, 2013
(This article was first published on - R, and kindly contributed to R-bloggers)

I spent a portion of today trying to convince a colleague that there are times when the data.table package is faster than traditional methods in R. It took a few of the tests below to prove the point.

Generate a data.frame with 100 million rows (1e+08), one column of characters and one of numbers, to aggregate in the tests below.

  df <- data.frame(letters = as.character(sample(letters[1:10], 1e+08, replace = TRUE)), 
      numbers = sample(1:100, 1e+08, replace = TRUE))
  head(df)

  ##   letters numbers
  ## 1       f      69
  ## 2       j      65
  ## 3       h      29
  ## 4       c      69
  ## 5       j      12
  ## 6       e      65
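
A frame with 1e+08 rows is large, so for trying the steps out a smaller, seeded version may be more practical. The sketch below is only a stand-in and not the data behind the timings in this post; the names `n` and `df_small`, the row count, and the seed are all illustrative assumptions.

```r
# Smaller, reproducible stand-in for the data above.
# `n`, `df_small` and the seed are illustrative choices, not from the original post.
set.seed(42)
n <- 1e5

df_small <- data.frame(
  letters = sample(letters[1:10], n, replace = TRUE),
  numbers = sample(1:100, n, replace = TRUE),
  stringsAsFactors = FALSE  # keep letters as character on R versions before 4.0 too
)

str(df_small)
```

Timings on a frame this small will be dominated by overhead, so it is useful for checking that the code runs, not for comparing speed.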

Aggregate using the base R function aggregate.

  start <- proc.time()
  aggregate(numbers ~ letters, data = df, FUN = sum)

  ##    letters   numbers
  ## 1        a 504884636
  ## 2        b 504587923
  ## 3        c 505357057
  ## 4        d 505106809
  ## 5        e 504788174
  ## 6        f 505219078
  ## 7        g 504796095
  ## 8        h 504693166
  ## 9        i 505079861
  ## 10       j 505044118

  aggregate_time <- proc.time() - start
  aggregate_time

  ##    user  system elapsed 
  ##  120.13   30.51  261.79
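
Base R's system.time() wraps the proc.time() bookkeeping used above into a single call. Here is a minimal sketch on a small made-up data frame (`toy`, its size, and the seed are assumptions, not the post's data). It stores the result rather than printing it, so console output is not part of the timed work.

```r
# Toy data; `toy` and its size are illustrative, not the post's 1e+08 rows.
set.seed(1)
toy <- data.frame(
  letters = sample(letters[1:10], 1e4, replace = TRUE),
  numbers = sample(1:100, 1e4, replace = TRUE)
)

# system.time() evaluates the expression and returns a proc_time object
# with user, system and elapsed entries.
toy_time <- system.time(
  toy_sums <- aggregate(numbers ~ letters, data = toy, FUN = sum)
)

toy_time[["elapsed"]]  # wall-clock seconds for the aggregate() call
toy_sums
```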

Aggregate using ddply from the package plyr.

  require("plyr")

  ## Loading required package: plyr

  start <- proc.time()
  ddply(df, .(letters), summarize, sums = sum(numbers))

  ##    letters      sums
  ## 1        a 504884636
  ## 2        b 504587923
  ## 3        c 505357057
  ## 4        d 505106809
  ## 5        e 504788174
  ## 6        f 505219078
  ## 7        g 504796095
  ## 8        h 504693166
  ## 9        i 505079861
  ## 10       j 505044118

  ddply_time <- proc.time() - start
  ddply_time

  ##    user  system elapsed 
  ##   22.04   27.38  192.99

Aggregate using the data.table package.

  require("data.table")

  ## Loading required package: data.table

  start <- proc.time()
  dt <- data.table(df, key = "letters")
  dt[, list(sums = sum(numbers)), by = c("letters")]

  ##     letters      sums
  ##  1:       a 504884636
  ##  2:       b 504587923
  ##  3:       c 505357057
  ##  4:       d 505106809
  ##  5:       e 504788174
  ##  6:       f 505219078
  ##  7:       g 504796095
  ##  8:       h 504693166
  ##  9:       i 505079861
  ## 10:       j 505044118

  dt_time <- proc.time() - start
  dt_time

  ##    user  system elapsed 
  ##   7.102   7.017  55.957
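
The timing above covers both building the keyed data.table and the grouped sum. To see how the total splits between the two, the steps can be timed separately. This is a hedged sketch on small made-up data (`toy`, `toy_dt`, `convert_time` and `sum_time` are illustrative names), using data.table's as.data.table() and setkey().

```r
library(data.table)

# Toy data; names and size are illustrative, not the post's 1e+08 rows.
set.seed(1)
toy <- data.frame(
  letters = sample(letters[1:10], 1e4, replace = TRUE),
  numbers = sample(1:100, 1e4, replace = TRUE)
)

# Step 1: conversion plus keying (setkey sorts the table by letters).
convert_time <- system.time({
  toy_dt <- as.data.table(toy)
  setkey(toy_dt, letters)
})

# Step 2: the grouped sum on its own.
sum_time <- system.time(
  toy_dt_sums <- toy_dt[, list(sums = sum(numbers)), by = "letters"]
)

convert_time[["elapsed"]]
sum_time[["elapsed"]]
toy_dt_sums
```

If the same table is aggregated many times, the conversion cost is paid once, which would make the comparison even more favourable to data.table than a single timed run suggests.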

Comparison of the system times (element 2 of each timing, labelled sys.self; elapsed time is element 3).

  # how many times slower is aggregate
  aggregate_time[2]/ddply_time[2]

  ## sys.self 
  ##    1.114

  aggregate_time[2]/dt_time[2]

  ## sys.self 
  ##    4.347

  
  # how many times slower is ddply
  ddply_time[2]/aggregate_time[2]

  ## sys.self 
  ##   0.8975

  ddply_time[2]/dt_time[2]

  ## sys.self 
  ##    3.902

  
  # how many times slower is data.table
  dt_time[2]/aggregate_time[2]

  ## sys.self 
  ##     0.23

  dt_time[2]/ddply_time[2]

  ## sys.self 
  ##   0.2563

Based on 100 million (1e+08) observations, with the time to convert to a data.table included in the timings:

  1. ddply requires ~0.8975x the system time of aggregate (about 10% less)
  2. aggregate requires ~4.347x the system time of data.table
  3. ddply requires ~3.902x the system time of data.table
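
The list above compares system CPU time only. Elapsed (wall-clock) time is often what a user actually feels, and the same ratios can be computed from the elapsed values already printed in the timing output. The vector name `elapsed` and the rounding are illustrative choices; the numbers are copied from this post.

```r
# Elapsed seconds copied from the three timing outputs earlier in the post.
elapsed <- c(aggregate = 261.79, ddply = 192.99, data.table = 55.957)

# How many times longer each method took than data.table, by wall-clock time.
ratio <- round(elapsed / elapsed[["data.table"]], 2)
ratio
```

On these figures aggregate took roughly 4.7 times as long as data.table and ddply roughly 3.4 times as long, which points the same way as the system-time comparison.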

Conclusion – data.table for the win.
