[This article was first published on Variance Explained, and kindly contributed to R-bloggers].
Last week Stack Overflow released the full (anonymized) results of its 2016 Developer Survey at stackoverflow.com/research. To make analysis in R even easier, today I’m also releasing the stacksurveyr package, which contains:
The full survey results as a processed data frame (stack_survey)
A data frame with the survey’s schema, including the original text of each question (stack_schema)
A function that works easily with multiple-response questions (stack_multi)
This makes it easier than ever to explore this rich dataset and answer questions about the world’s developers.
Examples: Basic exploration
I’ll give a few examples of survey analyses using the dplyr package. For instance, you could discover the most common occupations of survey respondents:
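A minimal sketch of such a count, using a small made-up data frame in place of stack_survey so that it runs without the package installed. The occupation column name is an assumption about the package's data, and the rows are invented for illustration:

```r
library(dplyr)

# Toy stand-in for stack_survey: one row per respondent.
# The `occupation` column name is an assumption; the values are made up.
survey <- tibble(
  respondent_id = 1:6,
  occupation = c("Data scientist", "Full-stack web developer",
                 "Full-stack web developer", "Student",
                 "Full-stack web developer", "Student")
)

# Count respondents per occupation, most common first
occupation_counts <- survey %>%
  count(occupation, sort = TRUE)

print(occupation_counts)
```

With the package loaded, the same count() call should apply to stack_survey directly, provided the column is named as assumed here.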
We can also use group_by and summarize to find the highest paid (on average) occupations:
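As a rough sketch of that group_by and summarize pattern, again on invented data: the salary_midpoint column name is an assumption about the package's data, and missing salaries are dropped with na.rm = TRUE so that they do not turn a group's average into NA.

```r
library(dplyr)

# Toy stand-in for stack_survey; column names are assumptions, values are made up
survey <- tibble(
  occupation = c("Data scientist", "Data scientist",
                 "Student", "Student", "Designer"),
  salary_midpoint = c(90000, 110000, 10000, NA, 60000)
)

# Average salary per occupation, highest first.
# `respondents` counts every row in the group, including those with NA salary.
salary_by_occupation <- survey %>%
  group_by(occupation) %>%
  summarize(average_salary = mean(salary_midpoint, na.rm = TRUE),
            respondents = n()) %>%
  arrange(desc(average_salary))

print(salary_by_occupation)
```

Keeping the respondent count alongside the mean makes it easier to spot averages that rest on only a handful of answers.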
This can be visualized in a bar plot:
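One way to draw such a bar plot with ggplot2 is sketched here. The summary table is typed in by hand with made-up numbers, and the column names are hypothetical stand-ins for a per-occupation summary:

```r
library(ggplot2)

# Hand-typed summary table for illustration (values are made up)
salary_by_occupation <- data.frame(
  occupation = c("Data scientist", "Designer", "Student"),
  average_salary = c(100000, 60000, 10000)
)

# Horizontal bars, ordered by average salary so the ranking is easy to read
salary_plot <- ggplot(salary_by_occupation,
                      aes(x = reorder(occupation, average_salary),
                          y = average_salary)) +
  geom_col() +
  coord_flip() +
  labs(x = "Occupation", y = "Average salary")

print(salary_plot)
```

Flipping the coordinates keeps long occupation names readable, which matters once there are dozens of categories.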
Examples: Multi-response answers
Ten of the questions allow multiple responses, as noted in the stack_schema data frame:
In these cases, the responses are delimited by “; ” (a semicolon followed by a space). Often, these columns are easier to work with and analyze when they are “unnested” into one user-answer pair per row. The package provides the stack_multi function as a shortcut for that unnesting. For example, consider the tech_do column (“Which of the following languages or technologies have you done extensive development with in the last year?”):
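To make the idea of unnesting concrete, here is a hand-rolled sketch using tidyr's separate_rows on invented data. It is not the package's implementation of stack_multi, only an illustration of the same reshaping; the answer column name is an assumption.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for stack_survey with one multi-response column (values made up)
survey <- tibble(
  respondent_id = 1:3,
  tech_do = c("R; SQL", "Python; SQL; JavaScript", NA)
)

# Split each "; "-delimited string into one row per respondent-answer pair,
# dropping respondents who skipped the question
tech_long <- survey %>%
  filter(!is.na(tech_do)) %>%
  separate_rows(tech_do, sep = "; ") %>%
  rename(answer = tech_do)

print(tech_long)
```

The result has five rows: two for respondent 1 and three for respondent 2. In this long format, counting and joining become ordinary dplyr operations.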
Using this data, we could find the most common answers:
We can join this with the stack_survey dataset using the respondent_id column. For example, we could look at the most common development technologies used by data scientists:
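A sketch of that join on invented data, assuming the unnested answers carry a respondent_id column and an answer column (both names are assumptions about the package's output):

```r
library(dplyr)

# Toy stand-ins: one row per respondent, and one row per respondent-answer pair
survey <- tibble(
  respondent_id = 1:4,
  occupation = c("Data scientist", "Data scientist",
                 "Student", "Data scientist")
)
tech_long <- tibble(
  respondent_id = c(1L, 1L, 2L, 2L, 3L, 4L),
  answer = c("R", "SQL", "Python", "SQL", "Java", "R")
)

# Keep only data scientists, attach their technologies, then count
data_scientist_tech <- survey %>%
  filter(occupation == "Data scientist") %>%
  inner_join(tech_long, by = "respondent_id") %>%
  count(answer, sort = TRUE)

print(data_scientist_tech)
```

Filtering before the join keeps the intermediate table small, which helps when the answers table has several rows per respondent.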
Or we could find out the average age and salary of people using each technology, and compare them:
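The per-technology averages could be computed along these lines. The data is invented, and age_midpoint and salary_midpoint are assumed column names:

```r
library(dplyr)

# Toy stand-ins (values made up; column names are assumptions)
survey <- tibble(
  respondent_id = 1:4,
  age_midpoint = c(22, 27, 32, 37),
  salary_midpoint = c(40000, 60000, 80000, NA)
)
tech_long <- tibble(
  respondent_id = c(1L, 2L, 2L, 3L, 4L),
  answer = c("R", "R", "SQL", "SQL", "SQL")
)

# Attach each respondent's age and salary to every technology they listed,
# then average within each technology
tech_summary <- tech_long %>%
  inner_join(survey, by = "respondent_id") %>%
  group_by(answer) %>%
  summarize(average_age = mean(age_midpoint, na.rm = TRUE),
            average_salary = mean(salary_midpoint, na.rm = TRUE),
            respondents = n())

print(tech_summary)
```

Note that a respondent who lists several technologies contributes to each of their averages, so the groups overlap.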
If we want to be a bit more adventurous, we can use the (in-development) widyr package to find correlations among technologies, and the ggraph package to display them as a network of related technologies:
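To show what a correlation among technologies means here, the sketch below swaps in base R for widyr and ggraph. It builds a respondent-by-technology indicator matrix from invented data and passes it to cor(). This is a stand-in for the idea, not the widyr or ggraph code, and it stops short of drawing a network.

```r
# Toy respondent-answer pairs (values made up)
tech_long <- data.frame(
  respondent_id = c(1, 1, 2, 2, 3, 4),
  answer = c("R", "SQL", "R", "SQL", "Python", "Python")
)

# 1 if the respondent listed the technology, 0 otherwise
indicator <- (unclass(table(tech_long$respondent_id, tech_long$answer)) > 0) * 1

# Pairwise correlation between technology columns: positive when two
# technologies tend to be listed by the same respondents
tech_correlations <- cor(indicator)

print(round(tech_correlations, 2))
```

In this toy data R and SQL are always listed together, so their correlation is 1, while R and Python are never listed together, giving -1. Pairs above some correlation threshold are the kind of edges one would then draw in a network plot.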
Try the data out for yourself!