Vignette: Simulating a minimal SPSS dataset from R
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
What this is about š
I will simulate a minimal labelled survey dataset that can be exported as a SPSS (.SAV) file (with full variable and value labels) in R. I will also attempt to fabricate āmeaningful patternsā to the dataset such that it can be more effectively used for creating demo examples.
Background
Simulating data is one of the most useful skills to have in R. For one, it is helpful when youāre debugging code, and you want to create a reprex (reproducible example) to ask for help more effectively (help others help you , as the saying goes.)^{1} However, regardless of whether youāre a researcher or a business analyst, the data associated with your code is likely to be either confidential so you cannot share it on Stack Overflow, or way too large or complex for you to upload anyway. Creating an example dataset from a few lines of code which you can safely share is an effective way to get around this problem.
Data simulation is slightly more tricky with survey datasets, which are characterised by (1) labels on both variable and values/codes, and (2) a large proportion of ordinal / categorical variables.
For instance, a Net Promoter Score (NPS) variable is usually accompanied with the variable label āOn a scale of 010, how likely are you to recommend X to a friend or family?ā (i.e.Ā the actual question asked in a survey), and is itself an instance of an ordinal variable. If you are trying to produce an example that hinges on an issue where labels are relevant, you would also need to simulate the labels as well.
There are also educational reasons for simulating data: it is useful to simulate data to demo an analysis or a function, because this makes it easy for the audience to reproduce the example. For this purpose, it would be especially beneficial if you can simulate a dataset where there you can introduce some arbitrary relationships between the variables, rather than them being completely random (sample()
all the way).
Personally, I have in the past found it a pain to simulate datasets which are suited for demoing survey related functions, especially when I was working on examples for the {surveytoolbox} package š¦. Hence, this is partly an attempt to simulate a labelled dataset that is minimally sufficient for demonstrating some of the {surveytoolbox} functions.
š· For more information specifically on manipulating labels in R, do check out a previous post Iāve written on working with SPSS labels in R.
Getting started
To run this example, weāll need to load {tidyverse}, {surveytoolbox}, and {haven}. Specifically, Iām using {tidyverse} for its data manipulation functions, {surveytoolbox} for functions to set up variable/value labels, and finally {haven} to export the data as a .SAV file.
Note that {surveytoolbox} is currently not available on CRAN yet, but you can install this by running devtools::install_github("martinctc/surveytoolbox")
. Youāll need {devtools} installed, if you havenāt got it already.
In addition to loading the packages, we will also set the seed^{2} with set.seed()
to make the simulated numbers reproducible:
library(tidyverse) library(surveytoolbox) # Install with devtools::install_github("martinctc/surveytoolbox") library(haven) set.seed(100) # Enable reproducibility  100 is arbitrary
Create individual vectors
For the purpose of clarity and ease of debugging, my approach will be to first set up each simulated variable as individual labelled vectors, and then bind them together into a data frame at the end. To adorn variable and value labels to a numeric vector, I will use set_varl()
and set_vall()
from {surveytoolbox} to do these tasks respectively.
I want to create a dataset with 1000 observations, so I will start with creating v_id
as an ID variable running from 1 to 1000, which can simply be generated with the seq()
function.^{3} I will then use set_varl()
from {surveytoolbox} to set a variable label for the v_id
vector. The second argument of set_varl()
takes in a character vector and assigns it as the variable label of the target variable – super straightforward.
## Record Identifier v_id < seq(1, 1000) %>% set_varl("Record Identifier")
The same goes for v_gender
, but this time I want to also (1) apply an arbitrary probability to the distribution, and (2) give each value in the vector a value label (āMaleā, āFemaleā, āOtherā).
To do (1), I pass a numeric vector to the prob
argument to represent the probabilities that 1, 2, and 3 will fall out for n = 1000.
To do (2), I run set_vall()
and pass the desired labels to the value_labels
argument. set_vall()
acccepts a named character vector to be assigned as value labels.
Finally, I run set_varl()
again to make sure that a variable label is present.
## Gender v_gender < sample(x = 1:3, size = 1000, replace = TRUE, prob = c(.48, .48, .04)) %>% # arbitrary probability set_vall(value_labels = c("Male" = 1, "Female" = 2, "Other" = 3)) %>% set_varl("Q1. Gender")
Now that weāve got our ID variable and a basic grouping variable (gender), letās also create some mock metric variables.
I want to create a 5point scale KPI variable (which could represent customer satisfaction or likelihood to recommend). One way to do this is to simply run sample()
again, and do the same thing we did for v_gender
:
## KPI  #1 simple sampling v_kpi < sample(x = 1:5, size = 1000, replace = TRUE) %>% set_vall(value_labels = c("Extremely dissatisfied" = 1, "Somewhat dissatisfied" = 2, "Neither" = 3, "Satisfied" = 4, "Extremely satisfied" = 5)) %>% set_varl("Q2. KPI")
Whilst the above approach is straightforward, the downside is that the numbers are likely to look completely random if we try to actually analyse the results – which is what sample()
is supposed to do – but clearly isnāt ideal.
I want to simulate numbers that are more realistic, i.e.Ā data which will form a discernible pattern when grouping and summarising by gender. What Iāll therefore do is to iterate through each number in v_gender
, and sample numbers based on the gender of the ārespondentā.
The values that are passed below to the prob
argument within sample()
are completely arbitrary, but are designed to generate results where a bigger KPI value is more likely if v_gender == 1
, followed by v_gender == 3
, then v_gender == 2
.
Note that Iāve used map2_dbl()
here (from the {purrr} package, part of {tidyverse}), which āloopsā through v_gender
and returns a numeric value for each iteration.
## KPI  #2 genderdependent sampling v_kpi < v_gender %>% map_dbl(function(x){ if(x == 1){ sample(1:5, size = 1, prob = c(10, 17, 17, 28, 28)) # Sum to 100 } else if(x == 2){ sample(1:5, size = 1, prob = c(11, 22, 28, 22, 17)) # Sum to 100 } else { sample(1:5, size = 1, prob = c(13, 20, 20, 27, 20)) # Sum to 100 } }) %>% set_vall(value_labels = c("Extremely dissatisfied" = 1, "Somewhat dissatisfied" = 2, "Neither" = 3, "Satisfied" = 4, "Extremely satisfied" = 5)) %>% set_varl("Q2. KPI")
To add a level of complexity, let me also simulate a mock NPS variable. One way to do this is to punch in random numbers like how it is done above with v_kpi
, but this will involve a lot more random punching than is desirable for a 11point scale NPS variable.
I will therefore instead write a custom function called skew_inputs()
that āexpandsā three arbitrary input numbers into 11 numbers, which will then serve as the probability anchors for my sample()
functions later on.
## Generate skew inputs for sample probability ## ## `value1`, `value2` and `value3` ## generate the skewed probabilities ## skew_inputs < function(value1, value2, value3){ all_n < c(rep(value1, 7), # 0  6 rep(value2, 2), # 7  8 rep(value3, 2)) # 9  10 return(sort(all_n)) } ## Outcome KPI  NPS v_nps < v_gender %>% map_dbl(function(x){ if(x == 1){ sample(0:10, size = 1, prob = skew_inputs(1, 1, 8)) } else if(x == 2){ sample(0:10, size = 1, prob = skew_inputs(2, 3, 5)) } else if(x == 3){ sample(0:10, size = 1, prob = skew_inputs(1, 3, 6)) } else { stop("Error  check x") } }) %>% set_varl("Q3. NPS")
Admittedly that the above procedure isnāt minimal, but note that this is a tradeoff to introduce some arbitrary patterns to the data. A āquick and dirtyā alternative simulation would simply be to run sample(x = 0:10, size = 1000, replace = TRUE)
for v_nps
.
There is one slight technicality: the socalled NPS question is strictly speaking a likelihood to recommend question which ranges from 0 to 10, and the Net Promoter Score itself is calculated on a recoded version of that question where Detractors (scoring 0 to 6) have to be coded as 100, Passives (scoring 7 to 8) as 0, and Promoters (scoring 9 to 10) as +100. The Net Promoter Score is simply calculated as a mean of those recoded values.
Fortunately, the {surveytoolbox} package comes shipped with a as_nps()
function that does this recoding for you, and also automatically applies the value labels. letās call this new variable v_nps2
:
## Outcome KPI  Recoded NPS (NPS2) v_nps2 < as_nps(v_nps) %>% set_varl("Q3X. Recoded NPS")
Combine vectors
Now that all the individual variables are set up, I can simply combine them all into a tibble in one swift movement^{4}:
#### Combine individual vectors #### combined_df < tibble(id = v_id, gender = v_gender, kpi = v_kpi, nps = v_nps, nps2 = v_nps2)
Results!
Letās run a few checks on our dataset to confirm that everything has worked out okay.
The classic {dplyr} glimpse()
:
combined_df %>% glimpse() ## Observations: 1,000 ## Variables: 5 ## $ id1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1... ## $ gender 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 2,... ## $ kpi 3, 2, 5, 2, 5, 5, 2, 2, 5, 3, 3, 4, 1, 5, 4, 4, 4, 1, 2,... ## $ nps 10, 5, 10, 8, 10, 9, 7, 5, 2, 5, 1, 4, 5, 9, 9, 10, 9, 3, 10... ## $ nps2 100, 100, 100, 0, 100, 100, 0, 100, 100, 100, 100, ...
Then head()
to see the first five rows:
combined_df %>% head() ## # A tibble: 6 x 5 ## id gender kpi nps nps2 #### 1 1 2 [Female] 3 [Neither] 10 100 [Promoter] ## 2 2 2 [Female] 2 [Somewhat dissatisfied] 5 100 [Detractor] ## 3 3 1 [Male] 5 [Extremely satisfied] 10 100 [Promoter] ## 4 4 2 [Female] 2 [Somewhat dissatisfied] 8 0 [Passive] ## 5 5 2 [Female] 5 [Extremely satisfied] 10 100 [Promoter] ## 6 6 1 [Male] 5 [Extremely satisfied] 9 100 [Promoter]
So it appears that the value labels have been properly attached, and the range of values are what weād expect. Now what about the āfake patternsā?
Looking at the topline result of the data, we seem to have succeeded in fabricating some sensible patterns in the data. It appears that this company X will need to work harder at winning over its female customers, who have rated them lower on two KPI metrics:
combined_df %>% group_by(gender) %>% summarise(n = n_distinct(id), kpi = mean(kpi), nps2 = mean(nps2)) ## # A tibble: 3 x 4 ## gender n kpi nps2 #### 1 1 [Male] 490 3.49 31.0 ## 2 2 [Female] 464 3.07 8.62 ## 3 3 [Other] 46 3.15 17.4
Check the labels š·š·š·
Finally Iād like to share a couple of functions that enable you to explore the labels in a labelled dataset. surveytoolbox::varl_tb()
accepts a labelled data frame, and returns a twocolumn data frame with the variable name and its corresponding variable label:
combined_df %>% varl_tb() ## # A tibble: 5 x 2 ## var var_label #### 1 id Record Identifier ## 2 gender Q1. Gender ## 3 kpi Q2. KPI ## 4 nps Q3. NPS ## 5 nps2 Q3X. Recoded NPS
surveytoolbox::data_dict()
takes this further, and shows also the value labels as a third column. This is what effectively whatās typically referred to as a code frame in a market research context:
combined_df %>% select(id) %>% data_dict() ## var label_var ## 1 gender Q1. Gender ## 2 kpi Q2. KPI ## 3 nps Q3. NPS ## 4 nps2 Q3X. Recoded NPS ## label_val ## 1 Male; Female; Other ## 2 Extremely dissatisfied; Somewhat dissatisfied; Neither; Satisfied; Extremely satisfied ## 3 ## 4 Detractor; Passive; Promoter; Missing value ## value ## 1 1; 2; 3 ## 2 1; 2; 3; 4; 5 ## 3 0; 1; 2; 3; 4; 5; 6; 7; 8; 9; 10 ## 4 100; 0; 100
I would also highly recommend the view_df()
function from {sjPlot}, which exports a similar overview of variables and labels in a nicely formatted HTML table. For huge labelled datasets, this offers a fantastic lightweight way to browse through your variables and labels.
combined_df %>% sjPlot::view_df()
Once weāve checked all the labels and weāre happy with everything, we can then export our dataset with haven::write_sav()
! If everythingās worked properly, all the labels should appear properly if you choose to open your example dataset in SPSS, or Q:
combined_df %>% haven::write_sav("Simulated Dataset.sav")
End notes
I hope youāve found this vignette useful!
If you ever get a chance to try out {surveytoolbox}, I would really appreciate if you can submit any issues/feedback on GitHub, or get in touch with me directly. Iām looking for collaborators to make the package more userfriendly and powerful, so if youāre interested, please donāt be shy and give me a shout! š

Check out this RStudio Community thread to learn more about reprex (the portmanteau reprex is coined by Romain Francois)ā©ļø

If youāre not familiar with this concept / approach, Iād recommend checking out this Stack Overflow thread.ā©ļø

For those who are more ambitious, I would recommend checking out the {uuid} package for generating proper GUIDs (Globally Unique Identifier). However, this then wouldnāt be minimal, so I would just stick with running a simple
seq()
sequence.ā©ļø 
I shouldnāt need to footnote this, but hereās a Rocky Flintstone tribute for any Belinkers out there. š¤£ā©ļø
Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job.
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.