Testing Different Methods for Merging a set of Files into a Dataframe

Posted on June 5, 2011 by Hayward Godwin in R bloggers | 0 Comments

[This article was first published on Psychwire » R, and kindly contributed to R-bloggers.]


I previously posted a method I used for merging a set of files into a dataframe. It wasn’t long before I had some very helpful comments from the R-bloggers community suggesting better methods to achieve my goal. In this post, I compare the different methods and see which is the most efficient (i.e., fastest).

The Methods

My original method is outlined in my previous post. Two further methods were suggested in the comments there: one, by sayan, combines the do.call function with lapply; the other, by dan, uses plyr’s ldply function. Check out the comments for the full discussion.

I will therefore compare three methods:

My original method
sayan’s lapply method
dan’s plyr method
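To make the idea behind the lapply method concrete, here is a minimal sketch that uses small in-memory data frames instead of files. The data frames and their column names (id, value) are invented for illustration. lapply builds a list of data frames, and do.call hands every element of that list to rbind in a single call.

```r
# Hypothetical stand-ins for the contents of three files
pieces <- lapply(1:3, function(i) {
  data.frame(id = i, value = c(i * 10, i * 10 + 1))
})

# Equivalent to rbind(pieces[[1]], pieces[[2]], pieces[[3]])
merged <- do.call("rbind", pieces)

nrow(merged)  # 6 rows: 2 from each of the 3 pieces
```

The same pattern applies to files once the function passed to lapply reads a file rather than constructing a data frame.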

Testing

I ran each of the three methods 10 times (not hugely powerful, I know, but it still took a while). For testing purposes, I merged two 16 MB text files containing several thousand rows and several hundred columns. Not having done any real amount of timing in R before, I searched around a bit. In the end, I found two posts which I based my timings on (here and here). If I’ve done this incorrectly, let me know and I will run them again. Anyway, here are the results. The error bars are standard errors. The time taken is the elapsed time in seconds, as reported by system.time.
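For reference, the mean and standard error of a set of timings can be computed as in the sketch below. This is not the code used to produce the results in this post: the timing values are invented placeholders, and std_error is a hypothetical helper name.

```r
# Standard error of the mean: sample standard deviation divided by sqrt(n)
std_error <- function(x) sd(x) / sqrt(length(x))

# Placeholder elapsed times in seconds (not the measurements from this post)
times <- c(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)

mean(times)       # average elapsed time
std_error(times)  # half-height of a standard-error bar
```

Applied to one of the timing vectors produced by replicate, this gives the height of a bar and the size of its error bar.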

As you can see, my method is by far the slowest. Looks like I won’t be using it ever again!

The R Code

For maximum transparency, below is the R code I used to get these numbers.

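# lapply method
# N is the number of timing runs (10 in this post); file_list is a
# character vector holding the paths of the files to merge.
# system.time(...)[3] extracts the elapsed time in seconds.
lap <- replicate(N, system.time(
  full_data <- do.call(
    "rbind",
    lapply(file_list, function(f) read.table(f, header = TRUE, sep = "\t"))
  )
)[3])
lap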

# plyr method
library(plyr)  # provides ldply
ply <- replicate(N, system.time(
  dataset <- ldply(file_list, read.table, header = TRUE, sep = "\t")
)[3])
ply

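# original method
# Note: when dataset does not yet exist, the first `if` creates it, so the
# second `if` is then TRUE as well and the first file is read and appended
# a second time. Replacing the second `if` with `else` avoids this.
orig <- replicate(N, system.time(
  for (file in file_list) {
    # if the merged dataset doesn't exist, create it
    if (!exists("dataset")) {
      dataset <- read.table(file, header = TRUE, sep = "\t")
    }
    # if the merged dataset does exist, append to it
    if (exists("dataset")) {
      temp_dataset <- read.table(file, header = TRUE, sep = "\t")
      dataset <- rbind(dataset, temp_dataset)
      rm(temp_dataset)
    }
  }
)[3])
orig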
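The snippets above assume that N and file_list already exist. The sketch below shows one way they might be set up, assuming tab-delimited files with a header row. The temporary directory, the file names, the toy data and the merge_loop helper are all invented for illustration. It also shows a variant of the loop that uses if/else, so that each file is read exactly once, and checks that it agrees with the lapply approach.

```r
# Write two small tab-delimited files to a temporary directory
# (toy data standing in for the real 16 MB files)
data_dir <- file.path(tempdir(), "merge_demo")
dir.create(data_dir, showWarnings = FALSE)
for (i in 1:2) {
  toy <- data.frame(subject = i, trial = 1:3, rt = c(0.5, 0.6, 0.7))
  write.table(toy, file.path(data_dir, paste0("file", i, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

N <- 10  # number of timing runs, as in the post
file_list <- list.files(data_dir, pattern = "\\.txt$", full.names = TRUE)

# Loop-based merge using if/else, so each file is read exactly once
merge_loop <- function(files) {
  dataset <- NULL
  for (file in files) {
    temp_dataset <- read.table(file, header = TRUE, sep = "\t")
    if (is.null(dataset)) {
      dataset <- temp_dataset                   # first file: start the result
    } else {
      dataset <- rbind(dataset, temp_dataset)   # later files: append
    }
  }
  dataset
}

loop_data <- merge_loop(file_list)
lapply_data <- do.call("rbind", lapply(file_list, read.table,
                                       header = TRUE, sep = "\t"))
nrow(loop_data)  # 6: three rows from each of the two files
```

Keeping the accumulator inside a function also avoids relying on exists("dataset"), which would pick up any object of that name left over in the workspace.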
