Working With SPSS© Data in R

Posted on June 23, 2017 by Mauricio Vargas S. 帕夏 in R bloggers | 0 Comments

[This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers.]

Introduction

I needed to import SPSS© data for work. There are several options, and I have used both the foreign and haven R packages. I prefer haven because it integrates better with R’s tidyverse, and I started using it instead of foreign once I verified that it behaves well with factors and avoids the deprecated factor labels problem in newer R versions.

The Data

For this post I used the Diego Portales University National Survey. It is a publicly available survey, conducted nationwide since 2005, that asks people about their trust in institutions (e.g. government, police, firefighters, etc.) and about their opinions on same-sex marriage, restrictions on where people can smoke, and more.

Importing Data

#devtools::install_github("ropenscilabs/skimr")

# Exploratory Data Analysis tools
library(ggplot2)
library(dplyr)
library(sjlabelled)
library(skimr)
library(readr)

# Import foreign statistical formats
library(haven)

# Data
url = "http://encuesta.udp.cl/descargas/banco%20de%20datos/2015/Encuesta%20Nacional%20UDP%202015.sav"
sav = "2017-06-24_working_with_spss_data_in_r/udp_national_survey_2015.sav"

# mode = "wb" keeps the binary .sav file intact when downloading on Windows
if(!file.exists(sav)){download.file(url, sav, mode = "wb")}

survey = read_sav(sav)
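
To see what read_sav() hands back without downloading anything, here is a minimal sketch that writes a tiny .sav file and reads it again. The toy column, its values and its labels are invented for illustration; only the haven functions are real.

```r
library(haven)

# Toy stand-in for the survey: one labelled column (values and labels are made up)
toy <- tibble::tibble(
  sexo = labelled(
    c(1, 2, 2),
    labels = c(Hombre = 1, Mujer = 2),   # value labels: code behind each category
    label  = "Sexo Entrevistado"         # variable label: description of the column
  )
)

# Round trip through the SPSS format
tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)
toy_back <- read_sav(tmp)

# The column comes back as a labelled vector, not a factor
class(toy_back$sexo)
attr(toy_back$sexo, "labels")
attr(toy_back$sexo, "label", exact = TRUE)
```

The point is that value labels and the variable label travel as attributes of each column, which is what the later sections rely on.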

Exploring data

To explore the data, keep in mind that the survey is in Spanish: “fecha” means date, “edad” means age, and “sexo” means sex.

# How many surveys do I have by day?
daily = survey %>%
  mutate(Fecha = as.Date(Fecha, "%d-%m-%Y")) %>%
  rename(date = Fecha) %>% 
  group_by(date) %>%
  summarise(n = n())

ggplot(daily, aes(date, n)) +
  geom_line()
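
The date conversion above depends on the format string, so here is a small base R sketch of the same idea. It assumes the dates are stored as day-month-year text, and the three example dates are invented.

```r
# Invented interview dates in day-month-year text form
fecha <- c("03-10-2015", "03-10-2015", "15-11-2015")

# %d = day, %m = month, %Y = four-digit year
dates <- as.Date(fecha, format = "%d-%m-%Y")

# Interviews per day, the base R counterpart of group_by() + summarise(n = n())
per_day <- table(dates)
per_day
```

If the format string does not match the text, as.Date() returns NA for those entries, so it is worth checking sum(is.na(dates)) after converting.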

# How is the age distributed?
summary(survey$Edad_Entrevistado)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   32.00   48.00   47.92   61.00   89.00 

age = survey %>%
  mutate(Edad_Entrevistado = as.integer(Edad_Entrevistado)) %>% 
  rename(age = Edad_Entrevistado) %>% 
  group_by(age) %>%
  summarise(n = n())

ggplot(age, aes(age, n)) +
  geom_line()

# How is the sex distributed?
survey %>%
  rename(sex_id = Sexo_Entrevistado) %>% 
  group_by(sex_id) %>%
  summarise(n = n())

# A tibble: 2 x 2
     sex_id     n
  <dbl+lbl> <int>
1         1   651
2         2   651

Exploring labels

In the last tibble we have no idea what 1 and 2 mean. haven’s as_factor() turns the codes into their labels:

survey %>%
  select(Sexo_Entrevistado) %>% 
  rename(sex_id = Sexo_Entrevistado) %>% 
  distinct() %>% 
  mutate(sex = as_factor(sex_id))

# A tibble: 2 x 2
     sex_id    sex
  <dbl+lbl> <fctr>
1         2  Mujer
2         1 Hombre

The last column (in Spanish) shows us that in this survey “1 = Male” and “2 = Female”.
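
The same lookup can be tried on a hand-made labelled vector. This is a minimal sketch: the vector below is invented and only mirrors the two codes seen in the survey.

```r
library(haven)

# Hand-made labelled vector mirroring the survey's coding (1 = Hombre, 2 = Mujer)
sex_id <- labelled(c(2, 1, 2), labels = c(Hombre = 1, Mujer = 2))

# The named vector that maps each label to its code
attr(sex_id, "labels")

# Replace codes with their labels; levels follow the order of the codes
sex <- as_factor(sex_id)
levels(sex)

# Drop the labels and keep only the bare numeric codes
codes <- zap_labels(sex_id)
codes
```

as_factor() is the safer direction when the meaning matters, while zap_labels() is handy when only the numbers are needed.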

With that mapping, I could recode the values to English and count again:

survey %>%
  rename(sex = Sexo_Entrevistado) %>% 
  mutate(sex = as.integer(sex)) %>% 
  mutate(sex = recode(sex, `1` = "Male", `2` = "Female")) %>% 
  group_by(sex) %>%
  summarise(n = n())

# A tibble: 2 x 2
     sex     n
   <chr> <int>
1 Female   651
2   Male   651
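
An alternative is to translate the labels themselves rather than hard-coding which number means what. The sketch below uses an invented vector, and the translation table is my own assumption, not something stored in the survey file.

```r
library(haven)

# Invented responses using the survey's coding
sex <- labelled(c(1, 2, 2, 1), labels = c(Hombre = 1, Mujer = 2))

# Assumed Spanish-to-English lookup (not part of the data)
translation <- c(Hombre = "Male", Mujer = "Female")

# Go from code to Spanish label, then from Spanish label to English
sex_en <- unname(translation[as.character(as_factor(sex))])

table(sex_en)
```

This way the recoding keeps working if the numeric codes differ between survey years, as long as the labels stay the same.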

The column names are labelled as well. Here sjlabelled helps if I want to know, for example, what “P12” means. But instead of just translating labels, I’ll describe the complete dataset.
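
For a single column, the description can also be read without extra packages, because haven stores it in the column's "label" attribute. A minimal base R sketch follows. The data frame is invented; only the name P12 and its description come from the survey.

```r
# Invented two-column data frame; the "label" attribute mimics what haven stores
toy <- data.frame(P12 = c(1, 2, 1), Edad_Entrevistado = c(18, 32, 48))
attr(toy$P12, "label") <- "Apoyo a la democracia"

# exact = TRUE avoids partially matching a "labels" attribute by accident
attr(toy$P12, "label", exact = TRUE)

# Description of every column, NA where a column has none
var_labels <- vapply(
  toy,
  function(x) {
    lab <- attr(x, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else lab
  },
  character(1)
)
var_labels
```

sjlabelled's get_label() returns a similar named vector for a whole data frame, which is how it is used in the next section.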

Describing the dataset

valid_replies = survey %>% 
  mutate_if(is.labelled,as.numeric) %>% 
  skim() %>%
  filter(stat=="complete") %>% 
  mutate(description = get_label(survey)) %>% 
  select(var,description,everything()) %>% 
  select(-c(stat,level,type)) %>% 
  rename(pcent_valid = value) %>% 
  mutate(pcent_valid = paste0(100*round(pcent_valid / nrow(survey),2),'%'))

histograms = survey %>% 
  mutate_if(is.labelled,as.numeric) %>% 
  skim() %>%
  filter(stat=="hist") %>% 
  select(var,level) %>% 
  rename(histogram = level)

survey_description = valid_replies %>% 
  left_join(histograms, by = "var") %>% 
  write_csv("2017-06-24_working_with_spss_data_in_r/survey_description.csv")

survey_description

# A tibble: 203 x 4
                 var          description pcent_valid  histogram
               <chr>                <chr>       <chr>      <chr>
 1        PONDERADOR           Ponderador        100% ▂▇▇▅▅▃▁▁▁▁
 2             Folio                Folio        100% ▇▇▇▇▇▇▇▇▇▇
 3            Región               Región        100% ▁▁▂▁▂▁▁▁▇▁
 4            Comuna               Comuna        100% ▁▁▂▁▁▂▁▁▇▁
 5             Fecha     Fecha entrevista        100%       <NA>
 6  Sexo_Encuestador   Sexo Entrevistador         91% ▂▁▁▁▁▁▁▁▁▇
 7               GSE           GSE Visual        100% ▁▁▂▁▇▁▁▆▁▁
 8 Sexo_Entrevistado    Sexo Entrevistado        100% ▇▁▁▁▁▁▁▁▁▇
 9 Edad_Entrevistado    Edad Entrevistado        100% ▇▆▅▆▇▇▅▃▃▂
10       Hora_Inicio Hora Inicio Medición        100%       <NA>
# ... with 193 more rows

Exploring the last tibble reveals some interesting questions. For example, P12 refers to “Apoyo a la democracia”, that is, “Do you support democracy?”.
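
The skimr calls above reflect the development version I installed for this post, and as far as I know later releases return a different layout, so the filter on stat may not work unchanged. As a fallback, here is a rough base R sketch of the same description table, without the histograms. The toy data frame is invented and only borrows two column names from the survey.

```r
# Invented data with one missing reply
toy <- data.frame(Folio = c(1, 2, 3, 4), Sexo_Encuestador = c(1, NA, 2, 2))
attr(toy$Folio, "label") <- "Folio"
attr(toy$Sexo_Encuestador, "label") <- "Sexo Entrevistador"

description <- data.frame(
  var = names(toy),
  # Variable label of each column (every toy column has one)
  description = vapply(toy, function(x) attr(x, "label", exact = TRUE), character(1)),
  # Share of non-missing replies per column, as a percentage string
  pcent_valid = paste0(round(100 * colMeans(!is.na(toy))), "%"),
  row.names = NULL,
  stringsAsFactors = FALSE
)

description
```

It reproduces the var, description and pcent_valid columns, which is usually enough to decide which questions deserve a closer look.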
