R vs Stata: Importing and Saving Datasets

January 6, 2014

(This article was first published on Daniel Marcelino » R, and kindly contributed to R-bloggers)

Today I got a license for the new Stata/MP 13 (dual core), so I decided to run some brief comparisons with R (RStudio). Many more tests will come in the following weeks, but today I focused only on the basics: processing text files, that is, reading and writing raw datasets. I have to confess that the results surprised me: R outperformed Stata in most of the tasks I ran.
Although the Stata version I'm using is a multicore one, not the basic and cheaper Intercooled version (Stata/IC), the results do not favor Stata. Interestingly, my past experience as a Stata and R user had led me to believe that Stata was much faster than R at trivial tasks, including loading data tables into memory. The evidence I got today contradicts that opinion. Of course, a multicore version of the package doesn't help much here, since parallel computation pays off only for repetitive tasks that take at least one or two seconds each to complete (there are quite a few posts on this topic, including one of my own).

The Results:

Any simple job starts by feeding the statistical package raw data. I tested how quickly each package reads a semicolon-delimited text file of about 450MB. As the following output shows, Stata took 134.37 seconds to read the raw data, while R imported the same file in only 102.49 seconds. In this simple but critical task, then, R outperformed Stata, loading a plain text file in 24% less time.

[Figures: Stata importing output; R importing output]
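For readers who want to try the import step themselves, here is a minimal R sketch. It is not the code behind the timings above: the toy data frame, the temporary file, and all object names are assumptions made for illustration, and a real test would point `read.table()` at the actual 450MB file.

```r
# Toy stand-in for the real dataset (hypothetical columns)
set.seed(1)
n <- 10000
toy <- data.frame(id = seq_len(n),
                  x  = rnorm(n),
                  g  = sample(letters, n, replace = TRUE))

# Write it as semicolon-delimited text to a temporary file
txt_path <- tempfile(fileext = ".txt")
write.table(toy, txt_path, sep = ";", row.names = FALSE, quote = FALSE)

# Time the import; the "elapsed" entry is wall-clock time in seconds
import_time <- system.time(
  dat <- read.table(txt_path, header = TRUE, sep = ";",
                    stringsAsFactors = FALSE)
)
print(import_time[["elapsed"]])
```

On a file this small the elapsed time will be close to zero; the point is only the pattern of wrapping the import call in `system.time()`.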

Reading raw text into memory can be tricky, since each program may use a different strategy for loading different sorts of data. So how about loading their native formats? I tested how quickly each package loads the same data after conversion to its own file format (.Rdata and .dta). R did quite well, loading its file in 19.23 seconds, while Stata needed 89.66 seconds. This means that R loaded the dataset in about 79% less time than Stata or, to put it differently, my Stata/MP 13 took about 4.7 times as long as R to load the "same data".
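The native-format load on the R side can be timed in the same spirit. This is a sketch under assumptions (a small made-up data frame and a temporary `.RData` file), not the original test script.

```r
# Hypothetical toy data saved in R's native format
set.seed(1)
toy <- data.frame(id = 1:10000, x = rnorm(10000))
rdata_path <- tempfile(fileext = ".RData")
save(toy, file = rdata_path)

# Remove the object so that load() really has to restore it
rm(toy)

# Time the load; "elapsed" is wall-clock time in seconds
load_time <- system.time(load(rdata_path))
print(load_time[["elapsed"]])

# The object is back under its original name
print(dim(toy))
```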

How about exporting data already in memory to disk? When writing the data back to disk as delimited text, Stata finally outperformed R. Stata took 67.25 seconds to write a 458MB file of raw text, while R needed 5.7 seconds more (72.93 seconds). Stata therefore exported the data about 8% faster than R did.
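A comparable R sketch for the text export, again with a made-up data frame and a temporary output path standing in for the real ones:

```r
# Hypothetical toy data already in memory
set.seed(1)
toy <- data.frame(id = 1:10000, x = rnorm(10000))
out_path <- tempfile(fileext = ".txt")

# Time writing the data frame as semicolon-delimited text
export_time <- system.time(
  write.table(toy, out_path, sep = ";", row.names = FALSE, quote = FALSE)
)
print(export_time[["elapsed"]])

# Size of the resulting file in bytes
print(file.size(out_path))
```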

Finally, when saving data from memory to disk in their native formats, R again outperformed Stata, this time by more than a minute. Stata took 118.35 seconds, while R took only 42.53 seconds. That is, R needed roughly 2/3 of a minute to do the job, while Stata needed roughly 2 minutes. To put it differently, the StataCorp package took 2.78 times as long as the free one. The difference is large in percentage terms as well: R needed 64% less time than Stata (1 - 42.53/118.35 ≈ 0.64). That is certainly not trivial.
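The ratios and percentages quoted in this post can be recomputed from the four pairs of timings. The sketch below uses only the seconds reported here; the column names and the definition of "percent faster" (time saved as a share of the slower package's time) are my own choices for illustration.

```r
# Timings in seconds, as reported in this post
timings <- data.frame(
  task  = c("import text", "load native", "export text", "save native"),
  stata = c(134.37, 89.66, 67.25, 118.35),
  r     = c(102.49, 19.23, 72.93, 42.53)
)

# Ratio of Stata time to R time (greater than 1 means R was quicker)
timings$ratio <- round(timings$stata / timings$r, 2)

# Time saved by the quicker package, as a percentage of the slower one's time
slower <- pmax(timings$stata, timings$r)
faster <- pmin(timings$stata, timings$r)
timings$pct_saved <- round(100 * (slower - faster) / slower, 1)

print(timings)
```

For the native save, for example, this gives a ratio of 2.78 and a time saving of about 64%.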

R and Stata are comparable in the tests I performed because both packages have to load all the data into memory at once before performing any analysis. Loading times can still differ, however, because R and Stata use distinct native formats, which also affect the size of the file on disk. In fact, one of the things I like most about R is how compactly it stores data. For instance, the example dataset I'm using for these tests takes 458MB of disk space as a raw text file. Stored in Stata format, the same file needs 1.16GB, which is 2.59 times the space for the same information. Stored in R format (Rdata), it needs only 54.3MB.
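The size difference on the R side comes largely from compression: `save()` compresses its output by default, so repetitive data shrinks considerably. Here is a small sketch with invented, deliberately repetitive data; the actual ratio depends entirely on the dataset, and the last line assumes 1GB = 1024MB when checking the 2.59 figure.

```r
# Hypothetical, deliberately repetitive toy data
set.seed(1)
n <- 10000
toy <- data.frame(id = seq_len(n),
                  g  = sample(c("a", "b", "c"), n, replace = TRUE))

txt_path   <- tempfile(fileext = ".txt")
rdata_path <- tempfile(fileext = ".RData")

# Same data, two on-disk representations
write.table(toy, txt_path, sep = ";", row.names = FALSE, quote = FALSE)
save(toy, file = rdata_path)  # compressed by default

# Compare the sizes in bytes
sizes <- file.size(c(txt_path, rdata_path))
names(sizes) <- c("text", "RData")
print(sizes)
print(round(sizes[["text"]] / sizes[["RData"]], 2))

# Check of the reported Stata-to-text ratio, assuming 1GB = 1024MB
print(round(1.16 * 1024 / 458, 2))
```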

All in all, R outperformed Stata in 3 out of 4 trivial tasks. Stata outperformed R only when writing data from memory to disk as delimited text, and the difference wasn't that big: 5.7 seconds is too small a margin for Stata to celebrate. Can anyone prove me wrong? Note, however, that the results reported here are based on a single run of each test; the ideal benchmark design would repeat the same task several times and average the timings. I leave that idea on the table.
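A repeated-run version of the import test might look like the sketch below. The number of repetitions, the toy data, and the file path are all assumptions; I have not run this design on the real dataset.

```r
# Hypothetical toy file to read repeatedly
set.seed(1)
toy <- data.frame(id = 1:10000, x = rnorm(10000))
txt_path <- tempfile(fileext = ".txt")
write.table(toy, txt_path, sep = ";", row.names = FALSE, quote = FALSE)

# Repeat the same import several times and keep the elapsed seconds
n_reps <- 5  # arbitrary choice for illustration
elapsed <- replicate(
  n_reps,
  system.time(read.table(txt_path, header = TRUE, sep = ";"))[["elapsed"]]
)

# Summarize the runs instead of trusting a single one
print(mean(elapsed))
print(median(elapsed))
print(sd(elapsed))
```

Reporting the median alongside the mean helps when one run is slowed down by disk caching or other activity on the machine.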
