Speed-reading files, revisited

December 29, 2009

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In a post earlier this month, it seemed as though compressing a data file before reading it into R could save you some time. With some feedback from readers and further experimentation, we might need to revisit that conclusion.

To recap, in our previous experiment it took 170 seconds to read a 182 MB text file into R. But if we compressed the file first, it only took 65 seconds. Apparently, the benefit of reducing the amount of disk access (by dealing with a smaller file) far outweighed the CPU time required to decompress the file for reading.

In that experiment, though, each file was only read once. If you simply repeat the read statement on the uncompressed file, you see a sudden decrease in the time required to read it:

> system.time(read.table("bigdata.txt", sep=","))
   user  system elapsed
165.042   1.316 165.807
> system.time(read.table("bigdata.txt", sep=","))
   user  system elapsed
 94.248   0.934  94.673

(This was on MacOS, using the R GUI. I also tried using R from the terminal on MacOS, and also from the R GUI in Windows, using both regular R and REvolution R. There were some slight variations in the timings, but in general I got similar results.)
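To try this kind of repeated measurement yourself, here is a minimal sketch. The original bigdata.txt is not available, so the code generates a small stand-in file; the data, the temporary file path and the helper name time_read are all assumptions made for illustration, and the timings you see will depend on your machine.

```r
# Build a small stand-in for bigdata.txt (hypothetical data, not the original file).
set.seed(1)
n <- 20000
df <- data.frame(id = seq_len(n), x = rnorm(n), y = runif(n))
path <- tempfile(fileext = ".txt")
write.table(df, path, sep = ",", row.names = FALSE, col.names = FALSE)

# Hypothetical helper: read the same file `reps` times in a row
# and return the elapsed seconds of each read.
time_read <- function(file, reps = 5) {
  vapply(seq_len(reps), function(i) {
    system.time(read.table(file, sep = ","))[["elapsed"]]
  }, numeric(1))
}

elapsed <- time_read(path, reps = 5)
print(round(elapsed, 3))
```

Comparing the first element of elapsed with the rest is a quick way to see whether the first read in a fresh session stands out.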

So what's going on here (other than my embarrassing failure as a statistician to replicate my measurements the first time round)? One possibility is that we're seeing the effects of disk cache: when you access data on a hard drive, most modern drives will temporarily store some of the data in high-speed memory. This makes it faster to access the file in subsequent attempts, for as long as the file data remains in the cache. But that doesn't explain why we don't see a similar speedup in repeated readings of the compressed file:

> system.time(read.table("bigdata-compressed.txt.gz", sep=","))
   user  system elapsed
 89.464   0.868  90.436
> system.time(read.table("bigdata-compressed.txt.gz", sep=","))
   user  system elapsed
 97.651   1.035  98.887



I'd expect the second reading to be faster if disk cache had an effect, so I don't think disk cache is the culprit here. More revealing is the fact that the first use of read.table in any R session takes longer than subsequent ones. Reading from the gzipped file is slower than reading from the uncompressed file if it's the first read of the session:



> system.time(read.table("bigdata-compressed.txt.gz", sep=","))
   user  system elapsed
150.429   1.304 152.447
> system.time(read.table("bigdata.txt", sep=","))
   user  system elapsed
 78.717   0.986  79.773

(This was using R from the terminal under MacOS; I got similar results using the R GUI on MacOS.) Frankly, I don't have a good explanation for this. Maybe the additional time is required by R to load libraries or to page in the R executable (but then why would it scale with the file size?). Note that the uncompressed file was fast even though it was read second and had not been read before in that session, which rules out disk cache as a significant factor. If anyone has a good explanation, I'd love to hear it.
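Since whichever file is read first in a session seems to pay a penalty, one way to keep that penalty from landing systematically on one file type is to interleave the reads in random order and compare medians. The sketch below does that on small generated files; the data, file paths and the helper name elapsed_read are assumptions for illustration, and this is not a claim about what causes the first-read effect.

```r
# Hypothetical stand-in data written as both plain text and gzip.
set.seed(42)
n <- 20000
df <- data.frame(id = seq_len(n), x = rnorm(n), y = runif(n))
txt <- tempfile(fileext = ".txt")
gz <- paste0(txt, ".gz")
write.table(df, txt, sep = ",", row.names = FALSE, col.names = FALSE)
con <- gzfile(gz, "w")
write.table(df, con, sep = ",", row.names = FALSE, col.names = FALSE)
close(con)

# Elapsed seconds for a single read.table call.
elapsed_read <- function(file) {
  system.time(read.table(file, sep = ","))[["elapsed"]]
}

# Five reads of each file, shuffled so neither type is always read first.
trials <- sample(rep(c(txt, gz), each = 5))
times <- vapply(trials, elapsed_read, numeric(1), USE.NAMES = FALSE)

# The median per file type is less sensitive to one slow first read than the mean.
result <- tapply(times, ifelse(trials == gz, "gzip", "plain"), median)
print(result)
```

With only five reads per file this is still a rough comparison; more repetitions and larger files would be needed before drawing conclusions.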



So what file type is the fastest for reading into R? Reader Peter M. Li took a much more systematic approach to answering that question than I did, running fifty trials for compressed and uncompressed files using both read.table and scan. (We can safely assume that this level of replication nullifies any first-read or caching effects.) He also tested Stata files (an open, binary data file format that R can both read and write). Peter also tested different file sizes for each file type, with files containing 1 thousand, 10 thousand, 100 thousand, 1 million and 10 million observations. His results are summarized in the graph below, with log(file size) on the X axis and log(time to read) on the Y axis:



[Figure: File read times]
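A benchmark in the spirit of Peter's can be sketched as follows. This is not his code: the sizes, the number of repetitions, the generated data and the helper name make_files are assumptions chosen so the example runs quickly, and Stata files are left out to avoid depending on an extra package.

```r
set.seed(2009)
sizes <- c(1000, 10000)  # far smaller than the sizes in the post, to keep this quick
reps <- 5                # the post describes fifty trials; five keeps the sketch fast

# Hypothetical helper: write n rows of random data as plain text and as gzip.
make_files <- function(n) {
  df <- data.frame(a = rnorm(n), b = runif(n), c = rnorm(n))
  txt <- tempfile(fileext = ".txt")
  gz <- paste0(txt, ".gz")
  write.table(df, txt, sep = ",", row.names = FALSE, col.names = FALSE)
  con <- gzfile(gz, "w")
  write.table(df, con, sep = ",", row.names = FALSE, col.names = FALSE)
  close(con)
  c(plain = txt, gzip = gz)
}

# The two readers being compared.
readers <- list(
  read.table = function(f) read.table(f, sep = ","),
  scan = function(f) scan(f, what = list(0, 0, 0), sep = ",", quiet = TRUE)
)

# For each size, time every file type / reader combination and keep the median.
results <- do.call(rbind, lapply(sizes, function(n) {
  files <- make_files(n)
  grid <- expand.grid(file = names(files), reader = names(readers),
                      stringsAsFactors = FALSE)
  grid$n <- n
  grid$median_sec <- mapply(function(file, reader) {
    median(replicate(reps,
      system.time(readers[[reader]](files[[file]]))[["elapsed"]]))
  }, grid$file, grid$reader, USE.NAMES = FALSE)
  grid
}))
print(results)
```

Plotting log(n) against log(median_sec) for each file and reader combination would give a small-scale version of the graph described above.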

So, what can we conclude from all of this? Let's see:

  • In general, compressing your text data doesn't speed up reading it into R. If anything, it's slower.
  • The only time compressing files might be beneficial is for large files read with read.table (but not scan).
  • There's a speed penalty the first time you use read.table in an R session. 
  • Reading data from Stata files has significant performance benefits compared to text-based files.
And, lastly but most importantly, you should always replicate your measurements.










