How to read Stata DTA files into R

July 9, 2018
(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

The example file, hosted by the British Election Study, contains the 2017 face-to-face post-election survey responses along with explanatory notes. Read the Stata DTA file into R with these two lines:

library(haven)
df <- read_dta("http://www.britishelectionstudy.com/wp-content/uploads/2018/01/bes_f2f_2017_v1.2.dta")

The data set is now stored as a data frame named df with 357 variables. To check the properties of the data set, we type:

str(df)

This gives the following output:

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       2194 obs. of  357 variables:
$ finalserialno         : atomic  10115 10119 10125 10215 10216 ...
..- attr(*, "label")= chr "Final Serial Number"
..- attr(*, "format.stata")= chr "%12.0g"
$ serial                : atomic  000000399 000000398 000000400 000000347 ...
..- attr(*, "label")= chr "Respondent Serial Number"
..- attr(*, "format.stata")= chr "%9s"
$ a01                   : atomic  nhs brexit society immigration ...
..- attr(*, "label")= chr "A1: Most important issue"
..- attr(*, "format.stata")= chr "%240s"
$ a02                   :Class 'labelled'  atomic [1:2194] 1 0 -1 -1 1 -1 2 -1 2 2 ...
.. ..- attr(*, "label")= chr "Best party on most important issue"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:13] NA NA NA 0 1 2 3 4 5 6 ...
.. .. ..- attr(*, "names")= chr [1:13] "Not stated" "Refused" "Don`t know" "None/No party" ...

The above output shows that the variables are already set to the correct types. The first variable finalserialno is numeric, the third variable a01 is character (str reports both as atomic because they carry attributes), and the fourth variable a02 has a class of ‘labelled’, which can be converted to a factor or categorical variable (after we handle missing values).
Each variable has an associated label attribute to help with interpretation. For example, without having to look up the explanatory notes, we can see that variable a01 contains the responses to the question “A1: most important issue” and variable a02 contains the responses to “Best party on most important issue”.
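To see all the variable labels at once, the "label" attributes can be collected into a single named vector. The following is a minimal sketch in base R; toy is a made-up stand-in for df with only two columns, and var_labels is a name chosen here for illustration.

```r
# Toy stand-in for the imported data frame; the values are invented.
toy <- data.frame(a01 = c("nhs", "brexit"), a02 = c(1, 0),
                  stringsAsFactors = FALSE)
attr(toy$a01, "label") <- "A1: Most important issue"
attr(toy$a02, "label") <- "Best party on most important issue"

# Collect each column's "label" attribute into a named character vector.
# Columns without a label get NA instead of causing an error.
var_labels <- vapply(toy, function(x) {
    lab <- attr(x, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else lab
}, character(1))

print(var_labels)
```

Applying the same vapply call to df instead of toy should give a lookup from variable name to question text, assuming the labels were imported as shown in the str output.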

Missing values

Stata supports multiple types of missing values. read_dta automatically handles missing values in numeric and character variables. For categorical variables, missing values are typically encoded as negative numbers. Section 5.3 of the explanatory notes describes the encoding for this file: -1 (Don’t know), -2 (Refused) and -999 (Not stated). The loop below converts every negative value in a labelled variable, which covers all three of these codes, to NA.

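for (i in seq_along(df))
{
    # is.labelled() is provided by haven; only labelled variables are recoded
    if (is.labelled(df[[i]]))
        # which() skips values that are already NA
        df[[i]][which(df[[i]] < 0)] <- NA
}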
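The recoding step can be tried out on a plain numeric vector before running it on the whole data set. This is only a sketch: codes is a made-up vector that uses the three missing-value codes listed above, not data from the survey file.

```r
# Invented response codes, including the three negative missing-value codes
codes <- c(1, 0, -1, -1, 1, -2, 2, -999)

# Count how often each missing-value code occurs before recoding
missing_counts <- table(codes[codes < 0])
print(missing_counts)

# Replace every negative code with NA, as the loop does for each variable
codes[which(codes < 0)] <- NA
print(codes)
```

Tabulating the negative codes first is a cheap way to confirm that nothing other than the documented missing-value codes is about to be discarded.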

Encoding categorical variables

The categorical variables of class “labelled” are stored as numeric vectors. A single command converts them into factors whose levels are taken from the value labels:

df <- as_factor(df)

Note that we do this after converting the missing values to avoid spurious factor levels in the final dataset.
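To make the mapping from numeric codes to factor levels concrete, here is a rough base R analogue of the conversion. It is not haven's implementation, and the codes and the labels "Party A" and "Party B" are invented for illustration.

```r
# Invented codes (missing values already recoded to NA) and value labels
codes <- c(1, 2, NA, 0, 1)
value_labels <- c("None/No party" = 0, "Party A" = 1, "Party B" = 2)

# Map each numeric code to its label; NA stays NA
party <- factor(codes,
                levels = unname(value_labels),
                labels = names(value_labels))

print(table(party, useNA = "ifany"))
```

Because the negative codes were set to NA beforehand, they appear only in the NA count and not as observed categories.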

Find out more

You can find out about how to import and read Excel files into Displayr as well.
