Working With SPSS© Data in R

(This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

Introduction

I needed to import SPSS© data for work. There are several options; I have used both the foreign and haven R packages. I prefer haven because it integrates better with R’s tidyverse, and I switched to it from foreign once I verified that it behaves well with factors and avoids the deprecated factor labels problem in newer R versions.

The Data

For this post I used the Diego Portales University National Survey. It is a publicly available survey, conducted nationwide since 2005, that asks people about their trust in institutions (e.g. government, police, firefighters) and their opinion on same-sex marriage, restricting spaces to smoke, and more.

Importing Data

#devtools::install_github("ropenscilabs/skimr")

# Exploratory Data Analysis tools
library(ggplot2)
library(dplyr)
library(sjlabelled)
library(skimr)
library(readr)

# Import foreign statistical formats
library(haven)

# Data
url = "http://encuesta.udp.cl/descargas/banco%20de%20datos/2015/Encuesta%20Nacional%20UDP%202015.sav"
sav = "2017-06-24_working_with_spss_data_in_r/udp_national_survey_2015.sav"

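# mode = "wb" downloads in binary mode so the .sav file is not corrupted on Windows
if (!file.exists(sav)) {download.file(url, sav, mode = "wb")}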

survey = read_sav(sav)
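
To see what read_sav() returns without downloading the survey, here is a minimal sketch that writes a toy table to a temporary .sav file and reads it back. The toy column and its labels are made up for illustration; only the haven functions are real.

```r
library(haven)

# Toy data standing in for the survey (hypothetical values and labels)
toy <- tibble::tibble(
  id = 1:4,
  sexo = labelled(
    c(1, 2, 2, 1),
    labels = c(Hombre = 1, Mujer = 2),
    label = "Sexo Entrevistado"
  )
)

# Round trip through a temporary SPSS file
tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)
toy_back <- read_sav(tmp)

# The value labels and the variable label are stored as attributes
attr(toy_back$sexo, "labels")
attr(toy_back$sexo, "label")
```

The labelled columns keep the numeric codes as data and carry the SPSS labels as attributes, which is why the codes show up in summaries until they are converted.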

Exploring data

To explore the data, keep in mind that the survey is in Spanish. So, “fecha” means date, “edad” means age, and “sexo” means sex.

# How many surveys do I have by day?
daily = survey %>%
  mutate(Fecha = as.Date(Fecha, "%d-%m-%Y")) %>%
  rename(date = Fecha) %>% 
  group_by(date) %>%
  summarise(n = n())

ggplot(daily, aes(date, n)) +
  geom_line()

[Figure: line plot of the number of surveys per day]

# How is the age distributed?
summary(survey$Edad_Entrevistado)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   32.00   48.00   47.92   61.00   89.00 
age = survey %>%
  mutate(Edad_Entrevistado = as.integer(Edad_Entrevistado)) %>% 
  rename(age = Edad_Entrevistado) %>% 
  group_by(age) %>%
  summarise(n = n())

ggplot(age, aes(age, n)) +
  geom_line()

[Figure: line plot of the number of respondents by age]

# How is the sex distributed?
survey %>%
  rename(sex_id = Sexo_Entrevistado) %>% 
  group_by(sex_id) %>%
  summarise(n = n())
# A tibble: 2 x 2
     sex_id     n
   
1         1   651
2         2   651

Exploring labels

The last tibble gives no indication of what 1 and 2 mean. Converting the labelled column with as_factor() shows the labels:

survey %>%
  select(Sexo_Entrevistado) %>% 
  rename(sex_id = Sexo_Entrevistado) %>% 
  distinct() %>% 
  mutate(sex = as_factor(sex_id))
# A tibble: 2 x 2
     sex_id    sex
   
1         2  Mujer
2         1 Hombre

The last column (in Spanish) shows that in this survey 1 means “Hombre” (male) and 2 means “Mujer” (female).
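
As a small sketch of the options haven offers for a labelled vector, the toy example below (made-up data that mimics the survey's coding) shows the labels as a factor, the labels together with the codes, and the bare codes.

```r
library(haven)

# Hypothetical labelled vector using the same coding as the survey
x <- labelled(c(2, 1, 1, 2), labels = c(Hombre = 1, Mujer = 2))

# Factor built from the value labels
sex_factor <- as_factor(x)

# Factor whose levels show both code and label, e.g. "[1] Hombre"
sex_both <- as_factor(x, levels = "both")

# Plain numeric vector with the labels dropped
sex_codes <- zap_labels(x)

levels(sex_factor)
levels(sex_both)
sex_codes
```

I find levels = "both" handy while exploring, because it keeps the original SPSS code visible next to its label.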

To get English labels, I could recode the values:

survey %>%
  rename(sex = Sexo_Entrevistado) %>% 
  mutate(sex = as.integer(sex)) %>% 
  mutate(sex = recode(sex, `1` = "Male", `2` = "Female")) %>% 
  group_by(sex) %>%
  summarise(n = n())
# A tibble: 2 x 2
     sex     n
    
1 Female   651
2   Male   651
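
An alternative sketch, assuming I prefer to translate the existing labels rather than recode the numeric values: convert to a factor first and then rename its levels with a lookup vector. The toy vector and the translation table are hypothetical.

```r
library(haven)

# Hypothetical labelled vector using the same coding as the survey
sexo <- labelled(c(1, 2, 2, 1), labels = c(Hombre = 1, Mujer = 2))

# Lookup table from Spanish labels to English labels (my own translation)
translation <- c(Hombre = "Male", Mujer = "Female")

# Convert to a factor, then rename the levels through the lookup table
sex <- as_factor(sexo)
levels(sex) <- unname(translation[levels(sex)])

table(sex)
```

This way the mapping depends on the labels stored in the file instead of on me remembering which code is which.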

The columns are labelled as well. Here sjlabelled helps if I want to know, for example, what “P12” means. But instead of just looking up and translating individual labels, I’ll describe the complete dataset.
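
As a rough sketch of what looking up a column label involves, the variable label lives in the "label" attribute of each column, so it can be read with base R. The toy table below is made up: the P12 label comes from the survey, but the value labels and the other column's contents are assumptions for illustration.

```r
library(haven)

# Toy table; the value labels Si/No are hypothetical
toy <- tibble::tibble(
  P12 = labelled(
    c(1, 2, 1),
    labels = c(Si = 1, No = 2),
    label = "Apoyo a la democracia"
  ),
  Edad_Entrevistado = c(25, 40, 61)
)
attr(toy$Edad_Entrevistado, "label") <- "Edad Entrevistado"

# Collect the variable label of every column into a named character vector
var_labels <- vapply(toy, function(col) {
  lab <- attr(col, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else lab
}, character(1))

var_labels[["P12"]]
```

sjlabelled's get_label(), used in the next section, returns the same kind of named vector for a whole data frame.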

Describing the dataset

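# This relies on the long-format skim() output (columns var, type, stat, level, value)
# of the development version installed above; newer skimr releases return a different structure
valid_replies = survey %>% 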
  mutate_if(is.labelled,as.numeric) %>% 
  skim() %>%
  filter(stat=="complete") %>% 
  mutate(description = get_label(survey)) %>% 
  select(var,description,everything()) %>% 
  select(-c(stat,level,type)) %>% 
  rename(pcent_valid = value) %>% 
  mutate(pcent_valid = paste0(100*round(pcent_valid / nrow(survey),2),'%'))

histograms = survey %>% 
  mutate_if(is.labelled,as.numeric) %>% 
  skim() %>%
  filter(stat=="hist") %>% 
  select(var,level) %>% 
  rename(histogram = level)

survey_description = valid_replies %>% 
  left_join(histograms, by = "var") %>% 
  write_csv("2017-06-24_working_with_spss_data_in_r/survey_description.csv")

survey_description
# A tibble: 203 x 4
                 var          description pcent_valid  histogram
                                            
 1        PONDERADOR           Ponderador        100% ▂▇▇▅▅▃▁▁▁▁
 2             Folio                Folio        100% ▇▇▇▇▇▇▇▇▇▇
 3            Región               Región        100% ▁▁▂▁▂▁▁▁▇▁
 4            Comuna               Comuna        100% ▁▁▂▁▁▂▁▁▇▁
 5             Fecha     Fecha entrevista        100%       
 6  Sexo_Encuestador   Sexo Entrevistador         91% ▂▁▁▁▁▁▁▁▁▇
 7               GSE           GSE Visual        100% ▁▁▂▁▇▁▁▆▁▁
 8 Sexo_Entrevistado    Sexo Entrevistado        100% ▇▁▁▁▁▁▁▁▁▇
 9 Edad_Entrevistado    Edad Entrevistado        100% ▇▆▅▆▇▇▅▃▃▂
10       Hora_Inicio Hora Inicio Medición        100%       
# ... with 193 more rows

The last tibble contains some interesting questions. For example, P12 refers to “Apoyo a la democracia”, that is, “Do you support democracy?”.
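
The percentage of valid replies in the description table can also be computed without skimr. The following is only a sketch in base R on a made-up table (the column names are borrowed from the survey, the values are invented), not the code that produced the output above.

```r
# Toy table with some missing replies (hypothetical values)
toy <- data.frame(
  Folio = 1:5,
  Sexo_Encuestador = c(1, 2, NA, 2, 1),
  Edad_Entrevistado = c(18, 32, 48, NA, NA)
)

# Share of non-missing values per column, formatted as a percentage
pcent_valid <- paste0(round(100 * colMeans(!is.na(toy))), "%")

description <- data.frame(
  var = names(toy),
  pcent_valid = pcent_valid
)

description
```

Because it only uses is.na() and colMeans(), this does not depend on the output format of any particular skimr version.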
