It is time for RData files to become the standard for Data Transfer

[This article was first published on Econometrics by Simulation, and kindly contributed to R-bloggers.]


It is time for RData files to become the primary means of disseminating publicly available data online.

1. R is the most efficient statistical software at compressing data

I was recently attempting to download weather data from the US government and found myself stymied because the dataset I wanted was considered too large (over 5 GB).  The problem, I realized, was not that I wanted too much data but that the data transfer format was so poor.  The only tabular format available was CSV.  I was therefore forced to drop many variables and resubmit the data request.

Ultimately, I downloaded only part of the data, which came to a CSV file of 627.7 MB.  Importing the data into R with the read.csv command and immediately saving it as an “RData” file reduced the size to 55.3 MB (a 91% reduction).  As a matter of comparison, I imported the data into Stata 12 and saved it in Stata’s native format, which resulted in a size of 318.2 MB (a 49% reduction).  I then zipped the CSV file, the Stata file, and the R file.  The zipped R file shrank only trivially, to 54.3 MB, while the zipped Stata file shrank considerably, to 79.5 MB.  The zipped CSV file still performed the worst, at 120.4 MB.


Format                    Native Format   Read Time   Zipped
------------------------  --------------  ----------  --------
Comma Separated Values    627.7 MB        -           120.4 MB
R                         55.3 MB         1.12        54.3 MB
Stata                     318.2 MB        1.24        79.5 MB

I also timed how long it took to read this data into R versus Stata and found that the difference in read times was not substantial.  However, this should not be assumed to be the case for all systems, since I am running a solid state drive, which has much higher read speeds than traditional magnetic hard drives.
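The comparison above can be reproduced in outline with a few lines of R. This is only a sketch: the `weather` data frame is a small simulated stand-in (an assumption, not the actual government data), and the sizes and timings it produces will differ from the figures in the table.

```r
# Build a small illustrative data frame (hypothetical stand-in for the weather data).
set.seed(1)
weather <- data.frame(
  station = sample(c("A", "B", "C"), 10000, replace = TRUE),
  temp_c  = round(rnorm(10000, mean = 15, sd = 8), 1),
  precip  = round(rexp(10000, rate = 2), 2)
)

csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")

# Write the CSV, read it back as a downloaded file would be, then save as RData.
write.csv(weather, csv_path, row.names = FALSE)
weather_in <- read.csv(csv_path)
save(weather_in, file = rdata_path)  # save() compresses by default

# Compare on-disk sizes in bytes and compute the relative reduction.
sizes <- c(csv = file.size(csv_path), rdata = file.size(rdata_path))
reduction <- 1 - sizes[["rdata"]] / sizes[["csv"]]

# Time how long reloading the RData file takes (seconds, system dependent).
load_time <- system.time(load(rdata_path))[["elapsed"]]

print(sizes)
print(round(reduction, 2))
print(load_time)
```

The same pattern applies to a real download: replace the simulated data frame with the result of read.csv on the downloaded file.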

Looking at other software, I recently downloaded another dataset that was provided in three different formats: Borland Database Format (DBF) at 130 MB, Microsoft Access Database (MDB) at 110 MB, and SPSS/PASW (SAV) at 45 MB.  After importing the data into R and saving it as an RData file, the resulting file took up only 3.2 MB.

Format                             Size      Zipped
---------------------------------  --------  -------
Borland Database Format (DBF)      130 MB    4.5 MB
Microsoft Access Database (MDB)    110 MB    7.2 MB
SPSS/PASW (SAV)                    45 MB     4.8 MB
R (RData)                          3.2 MB    3.1 MB

This efficiency alone makes a strong case for saving and distributing data in RData files when possible.

2. R's code is open source

This may not sound like a big deal, but the open source nature of R makes it extremely easy to transfer data from R into any other program.  Quick-R gives sample code that can be used to save data to, or read data from, SPSS, SAS, or Stata.  In addition to providing an easy means of transferring data between statistical programs, R does not face issues relating to a lack of backwards compatibility.
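As a rough illustration of that kind of transfer, the sketch below uses the foreign package, which ships with standard R installations. The `patients` data frame and all file names are made up for the example; this is not the Quick-R code itself.

```r
library(foreign)  # recommended package bundled with standard R installations

# Hypothetical example data.
patients <- data.frame(
  id     = 1:3,
  height = c(1.70, 1.82, 1.65),
  weight = c(68, 90, 55)
)

out_dir <- tempdir()

# Stata: write a .dta file directly.
dta_path <- file.path(out_dir, "patients.dta")
write.dta(patients, dta_path)

# SPSS: write a plain-text data file plus a syntax file that reads it.
write.foreign(patients,
              datafile = file.path(out_dir, "patients_spss.txt"),
              codefile = file.path(out_dir, "patients.sps"),
              package  = "SPSS")

# SAS: same idea, with a .sas program as the code file.
write.foreign(patients,
              datafile = file.path(out_dir, "patients_sas.csv"),
              codefile = file.path(out_dir, "patients.sas"),
              package  = "SAS")

# Round trip: read the Stata file back into R.
patients_back <- read.dta(dta_path)
print(patients_back)
```

For SPSS and SAS, write.foreign does not produce a native binary file; it writes a text data file and a script that the target program runs to import it.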

Stata, for instance, only allows data sets to be saved in the formats of the previous three or four versions.  Thus if you are running Stata 12, 11, or 10, you can only save data sets so that they are compatible with Stata 8 or later (see the Stata help topic).  This practice seems unnecessarily harsh, since it in effect forces users to upgrade their version of Stata if only to access data saved by users who can afford later versions.

I suspect that this is not primarily an issue with Stata but one relating to proprietary software in general.  Proprietary software companies would like to encourage, whether gently or forcefully, the purchase of newer software, even if it is at the expense of current users.

This issue does not arise with R.  Thus, you can be assured that by saving data in R, anybody should be able to access your data.

3. R projects are easily bundled

Different types of data files allow for different levels of embedded descriptive information, such as Stata’s variable labels.  As far as I know, R has the most extensive options available for bundling information into a single file.  Not only can R save the data and descriptive labels in a single bundle, but functions that are specific to the data may be included in the same bundle as well.  For example, if you are working with health data, you may be interested in having not only the BMI and other health indexes for each individual but also the functions that calculate these indexes.  Including these functions in an RData file is simple.
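A minimal sketch of such a bundle follows. The `health` data frame, its `description` attribute, and the `bmi` function are hypothetical names chosen for illustration.

```r
# Hypothetical health data and a helper function that belongs with it.
health <- data.frame(
  id        = 1:3,
  height_m  = c(1.70, 1.82, 1.65),
  weight_kg = c(68, 90, 55)
)
attr(health, "description") <- "Illustrative height and weight records"

# BMI = weight in kilograms divided by height in metres squared.
bmi <- function(weight_kg, height_m) weight_kg / height_m^2

# Save the data and the function together in one file.
bundle_path <- tempfile(fileext = ".RData")
save(health, bmi, file = bundle_path)

# A recipient loads the bundle into a fresh environment and gets both objects.
bundle <- new.env()
loaded <- load(bundle_path, envir = bundle)  # load() returns the object names
print(loaded)

# Use the bundled function on the bundled data.
health_in <- bundle$health
health_in$bmi <- bundle$bmi(health_in$weight_kg, health_in$height_m)
print(health_in)
```

Loading into a separate environment, rather than the global workspace, keeps the bundled objects from silently overwriting anything the recipient already has.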

It may not be clear to some users that this is a large advantage, because nearly all statistical software packages, as far as I know, allow external script files.  R has this option as well.  The advantage of R is that it also allows complex or unique functions to be embedded within the RData file itself.

So R, Now What?

If you accept that R is an ideal candidate for use as a standard for sharing statistical data due to R’s superior data compression, R’s open source code, and R’s ability to easily bundle information into a single file, there is still a bit of a problem posed by the R workspace system.

As far as I know, there are no standards for transferring data between R users.  Thus, even though transfers are highly efficient, it is not clear how to organize your data within an R workspace.  This is in contrast to Stata data, which has a standard spreadsheet structure with added information in the form of variable labels and factor variables.

The easiest solution to this problem would be to include some kind of standard documentation, such as a readme function, in any RData file released.  This function would display a list of the objects in the RData file and describe their components.  Further refinements to such a standard might include establishing common names for simple data sets, such as naming the default data “mydata”.
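One possible shape for such a readme function is sketched below. No standard like this exists; the names `readme`, `object_notes`, and `bmi` are hypothetical, and the layout of the output is only one of many reasonable choices.

```r
# Hypothetical data set and helper to be released together.
mydata <- data.frame(
  id        = 1:3,
  weight_kg = c(68, 90, 55),
  height_m  = c(1.70, 1.82, 1.65)
)
bmi <- function(weight_kg, height_m) weight_kg / height_m^2

# Descriptions kept in a named character vector, one entry per object.
object_notes <- c(
  mydata = "One row per individual: id, weight_kg, height_m",
  bmi    = "Function: body mass index from weight (kg) and height (m)"
)

# readme() prints each documented object with its class and description.
# By default it looks the objects up in the environment it is called from.
readme <- function(notes = object_notes, envir = parent.frame()) {
  for (name in names(notes)) {
    obj_class <- class(get(name, envir = envir))[1]
    cat(sprintf("%-8s [%s] %s\n", name, obj_class, notes[[name]]))
  }
  invisible(names(notes))
}

# Release the data, the helper, the notes, and readme() in one file.
release_path <- tempfile(fileext = ".RData")
save(mydata, bmi, object_notes, readme, file = release_path)

# A recipient loads the file and calls readme() first.
recipient <- new.env()
load(release_path, envir = recipient)
documented <- recipient$readme(notes = recipient$object_notes, envir = recipient)
```

A recipient who loads the file straight into the global workspace could simply call readme() with no arguments.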
