Testing Different Methods for Merging a Set of Files into a Dataframe

June 5, 2011

(This article was first published on Psychwire » R, and kindly contributed to R-bloggers)

I previously posted a method I used for merging a set of files into a dataframe. It wasn’t long before I had some very helpful comments from the R-bloggers community suggesting better methods to achieve my goal. In this post, I compare the different methods and see which is the most efficient (i.e., fastest).

The Methods

My original method is outlined in my earlier post. Two further methods were suggested in the comments there: one, by sayan, uses the do.call function together with lapply; the other, by dan, uses plyr’s ldply function. Check out the comments for the full discussion.

I will therefore compare three methods:

  • My original method
  • sayan’s lapply method
  • dan’s plyr method
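
To make the idea behind the lapply approach concrete, here is a minimal sketch on toy data. The temporary files, their contents, and the names tmp_dir, pieces and merged are made up for illustration; only base R is used.

```r
# Toy data (hypothetical): write two small tab-separated files
# to a temporary directory.
tmp_dir <- tempfile("merge_demo")
dir.create(tmp_dir)
write.table(data.frame(id = 1:2, score = c(10, 20)),
            file.path(tmp_dir, "a.txt"), sep = "\t", row.names = FALSE)
write.table(data.frame(id = 3:4, score = c(30, 40)),
            file.path(tmp_dir, "b.txt"), sep = "\t", row.names = FALSE)

file_list <- list.files(tmp_dir, full.names = TRUE)

# Read every file into a list of data frames, then stack them in one call.
pieces <- lapply(file_list, read.table, header = TRUE, sep = "\t")
merged <- do.call(rbind, pieces)

nrow(merged)
```

The point of this pattern is that rbind is called once on the full list of data frames, rather than once per file inside a loop.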

Testing

I ran each of the three methods 10 times (not hugely powerful, I know, but it still took a while). For testing purposes, I merged two 16 MB text files containing several thousand rows and several hundred columns. Having not done any real amount of timing in R before, I searched around a bit. In the end, I found two posts which I based my timings on (here and here). If I’ve done this incorrectly, let me know and I will run them again. Anyway, here are the results. The error bars are standard errors. The time taken is in seconds.
As you can see, my method is by far the slowest. Looks like I won’t be using it ever again!
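
For anyone wanting to produce the same kind of summary, here is a rough sketch of how a mean and standard error can be computed from repeated timings. The function slow_task is a hypothetical stand-in for one of the merge methods, and n_runs, timings, mean_time and se_time are names chosen for this example only.

```r
# Hypothetical task to time; stands in for one of the merge methods.
slow_task <- function() {
  x <- 0
  for (i in 1:1e5) x <- x + sqrt(i)
  x
}

n_runs <- 10  # same number of runs as in the tests above

# Element 3 of system.time() is the elapsed (wall-clock) time in seconds.
timings <- replicate(n_runs, system.time(slow_task())[3])

mean_time <- mean(timings)
# Standard error of the mean: sample standard deviation over sqrt(n).
se_time <- sd(timings) / sqrt(length(timings))

c(mean = mean_time, se = se_time)
```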

The R Code

For maximum transparency, below is the R code I used to get these numbers.
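# lapply method
# Assumes N (the number of runs, 10 in the tests above) and file_list
# (a character vector of file paths) have already been defined.
lap <- replicate(N, system.time(
  full_data <- do.call(
    "rbind",
    lapply(file_list, function(f) {
      read.table(f, header = TRUE, sep = "\t")
    })
  )
)[3])  # element 3 of system.time() is the elapsed time in seconds
lap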

# plyr method
library(plyr)  # provides ldply()
ply <- replicate(N, system.time(
  dataset <- ldply(file_list, read.table, header = TRUE, sep = "\t")
)[3])
ply

# original method
# Note: both if blocks run on the first pass through the loop, so the
# first file is read and appended twice.
orig <- replicate(N, system.time(
  for (file in file_list) {
    # if the merged dataset doesn't exist, create it
    if (!exists("dataset")) {
      dataset <- read.table(file, header = TRUE, sep = "\t")
    }
    # if the merged dataset does exist, append to it
    if (exists("dataset")) {
      temp_dataset <- read.table(file, header = TRUE, sep = "\t")
      dataset <- rbind(dataset, temp_dataset)
      rm(temp_dataset)
    }
  }
)[3])
orig
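
One thing worth noting about the loop in the original method: because it uses two separate if statements, the second condition is already true once the first block has created dataset, so the first file gets read and appended a second time. Below is a minimal sketch of one way to avoid that, using if/else with a NULL check in place of exists(). The function name merge_with_loop and the toy files are hypothetical and used only to show the idea; this is not the code that produced the timings above.

```r
# Sketch: loop-based merge in which each file is read exactly once.
# merge_with_loop is a hypothetical helper name used for illustration.
merge_with_loop <- function(file_list) {
  dataset <- NULL
  for (file in file_list) {
    current <- read.table(file, header = TRUE, sep = "\t")
    if (is.null(dataset)) {
      # first file: start the merged dataset
      dataset <- current
    } else {
      # later files: append to the merged dataset
      dataset <- rbind(dataset, current)
    }
  }
  dataset
}

# Toy data (hypothetical): two small tab-separated files.
demo_dir <- tempfile("loop_demo")
dir.create(demo_dir)
write.table(data.frame(id = 1:2, score = c(10, 20)),
            file.path(demo_dir, "a.txt"), sep = "\t", row.names = FALSE)
write.table(data.frame(id = 3:4, score = c(30, 40)),
            file.path(demo_dir, "b.txt"), sep = "\t", row.names = FALSE)

loop_result <- merge_with_loop(list.files(demo_dir, full.names = TRUE))
nrow(loop_result)
```

With two files of two rows each, the result has four rows, whereas the two-if version of the loop would produce six because the first file is included twice.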
