Note to self – Remember to serialize R objects as RDS files when it makes sense.
Importing Stata data into R
While my instinctive preference for storing data is to use CSV, in the case of survey data, many/most measurements come with detailed variable and value labels.
Furthermore, as is the case in the European Social Survey, the missing values of survey data generally take several different values to code for different forms of nonresponse, depending on whether the respondents “did not know” what to answer, provided “no answer,” or “refused to answer” the question.
For these reasons, I tried to download the European Social Survey as a Stata dataset, only to realise later that the data had been produced with Stata 14, which means that it cannot be opened with older versions of Stata unless the data were saved with the saveold command and with the appropriate argument for my version of Stata.
Fortunately, I was able to read the data in R with haven. The package, which wraps around the ReadStat C library, can import SAS, SPSS and Stata files. Once imported, the data are available as a standard data frame, with value labels accessible through the package's helper functions.
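As a rough illustration of that workflow, here is a minimal sketch that writes a tiny labelled dataset to a temporary Stata file and reads it back with haven. The variable names (idno, happy) and labels are made up for the example and are not taken from the European Social Survey; the last lines show how haven represents Stata-style extended missing values as tagged NAs.

```r
library(haven)

# Toy data with one labelled variable (hypothetical names and labels)
toy <- data.frame(idno = 1:4)
toy$happy <- labelled(
  c(1, 2, 2, 1),
  labels = c("Unhappy" = 1, "Happy" = 2),
  label = "How happy are you"
)

# Write a temporary Stata file so the example runs without the real data
dta_path <- tempfile(fileext = ".dta")
write_dta(toy, dta_path)

# Import the Stata file; the result is a data frame (a tibble)
dat <- read_dta(dta_path)

attr(dat$happy, "label")   # variable label
attr(dat$happy, "labels")  # value labels, as a named vector
as_factor(dat$happy)       # labelled values converted to a factor

# Distinct kinds of missing values are represented as tagged NAs
x <- tagged_na("a", "b")
is.na(x)   # both are treated as missing by R
na_tag(x)  # but they keep their distinguishing tags
```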
Saving the data as an RDS file
Another issue that I then faced with the European Social Survey dataset was its size: while only 103.5 MB compressed, the uncompressed Stata DTA file for the complete (all variables, all waves) version of the cumulative dataset is extremely large, at 3.16 GB.
In comparison, the CSV file for the same dataset, which does not contain labels or detailed missing values, is 58.1 MB compressed and 559.7 MB uncompressed.
Here again, R offers a superior alternative to both the CSV and Stata formats: by saving the file as an RDS file, which creates a serialized version of the dataset and then saves it with gzip compression, I was able to bring the size of the dataset down to 51.6 MB.
Note that, when loaded into R, the RDS object still takes around 3 GB of (live) memory.
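The difference between size on disk and size in memory can be checked with base R alone. The sketch below uses a small simulated data frame (the column names cntry and trust are placeholders, not the actual survey variables), saves it as both CSV and RDS, and compares the file sizes with the in-memory size of the reloaded object.

```r
# Simulated data standing in for the survey (hypothetical columns)
set.seed(1)
toy <- data.frame(
  cntry = sample(c("FR", "DE", "GB"), 1e5, replace = TRUE),
  trust = sample(0:10, 1e5, replace = TRUE),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(toy, csv_path, row.names = FALSE)
# gzip is what saveRDS() uses by default; it is spelled out here for clarity
saveRDS(toy, rds_path, compress = "gzip")

file.size(csv_path)  # bytes on disk, uncompressed CSV
file.size(rds_path)  # bytes on disk, gzip-compressed RDS

# Reading the RDS file decompresses it: the object is full-size in memory
restored <- readRDS(rds_path)
format(object.size(restored), units = "MB")
identical(toy, restored)  # checks that the round trip preserved the data
```

saveRDS() also accepts compress = "bzip2" or compress = "xz", which may produce smaller files at the cost of slower writes.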
The full code used to convert the European Social Survey data from the DTA (Stata) to the RDS (R) format follows. The code requires the haven package, which is part of Hadley Wickham’s tidyverse package suite.
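A minimal sketch of such a conversion, not necessarily the exact script used for this note, could look like the following. The helper name dta_to_rds is hypothetical, and the demonstration runs on a small temporary file rather than on the real survey data.

```r
library(haven)

# Hypothetical helper: read a Stata file and save it as a
# gzip-compressed RDS file, returning the output path invisibly.
dta_to_rds <- function(dta_path, rds_path) {
  data <- read_dta(dta_path)
  saveRDS(data, rds_path, compress = "gzip")
  invisible(rds_path)
}

# Demonstration on a small temporary Stata file (made-up values)
dta_path <- tempfile(fileext = ".dta")
rds_path <- tempfile(fileext = ".rds")
write_dta(data.frame(idno = 1:3, agea = c(34, 51, 27)), dta_path)

dta_to_rds(dta_path, rds_path)

# Reload the converted data in a later session
ess <- readRDS(rds_path)
dim(ess)
```

With the real data, the two temporary paths would be replaced by the path of the downloaded DTA file and the desired path of the RDS file.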
Having discussed the issue on Twitter, it appears that the data mentioned in this note can be compressed quite efficiently in Stata. That operation, however, requires Stata 14 or above, if Stata keeps its commitment to backward compatibility. There is currently no other way to load the file in earlier versions of Stata.