At Stack Overflow we’ve always been committed to sharing data: all content contributed to the site is CC-BY-SA licensed, and we release regular “data dumps” of our entire history of questions and answers.
I’m excited to announce a new resource specially aimed at data scientists, analysts and other researchers, which we’re calling the StackLite dataset.
What’s in the StackLite dataset?
For each Stack Overflow question asked since the beginning of the site, the dataset includes:
- Question ID
- Creation date
- Closed date, if applicable
- Deletion date, if applicable
- Owner user ID (except for deleted questions)
- Number of answers
- Score
- Tags
This is ideal for performing analyses such as:
- The increase or decrease in questions in each tag over time
- Correlations among tags on questions
- Which tags tend to get higher or lower scores
- Which tags tend to be asked on weekends vs weekdays
- Rates of question closure or deletion over time
- The speed at which questions are closed or deleted
Examples in R
The dataset is provided as csv.gz files, which means you can use almost any language or statistical tool to process it. But here I’ll share some examples of a simple analysis in R.
The question data and the question-tag pairings are stored in separate files. Once you’ve cloned or downloaded the dataset from GitHub, you can read both files into R.
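Here is a minimal sketch of that step in base R. The file names (`questions.csv.gz`, `question_tags.csv.gz`), the `stacklite` folder, and the `read_stacklite` helper are illustrative assumptions, so adjust them to match your local copy.

```r
# Hypothetical helper: `dir` is wherever you cloned or downloaded the dataset.
# The two file names are assumptions, not confirmed by the text above.
read_stacklite <- function(dir = "stacklite") {
  # read.csv() can read gzip-compressed files directly from a file path
  questions <- read.csv(file.path(dir, "questions.csv.gz"),
                        stringsAsFactors = FALSE)
  question_tags <- read.csv(file.path(dir, "question_tags.csv.gz"),
                            stringsAsFactors = FALSE)
  list(questions = questions, question_tags = question_tags)
}

# Usage (uncomment once the files are in place):
# stacklite <- read_stacklite("stacklite")
# questions <- stacklite$questions
# question_tags <- stacklite$question_tags
```

Base R keeps the sketch free of package dependencies. On files this large, a faster CSV reader from a package would likely be more comfortable.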
The questions file has one row for each question, and the question_tags file has one row for each question-tag pair.
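If you want to confirm that layout on your own copy, a quick check along these lines may help. It assumes both data frames have an `Id` column and that the tag file has a `Tag` column; those column names and the `check_stacklite_shape` helper are my assumptions for illustration.

```r
# Hypothetical sanity check; column names `Id` and `Tag` are assumed.
check_stacklite_shape <- function(questions, question_tags) {
  c(
    # TRUE when no question ID appears more than once
    one_row_per_question = anyDuplicated(questions$Id) == 0,
    # TRUE when no (question, tag) combination appears more than once
    one_row_per_pair = anyDuplicated(question_tags[, c("Id", "Tag")]) == 0
  )
}

# Usage:
# check_stacklite_shape(questions, question_tags)
```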
As one example, you could find the most popular tags by counting how often each tag appears among the question-tag pairs.
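One way to do this is sketched below in base R. It assumes the tag data frame has a `Tag` column; the column name and the `top_tags` helper are illustrative assumptions.

```r
# Hypothetical helper: count rows per tag and keep the `n` most frequent.
# Assumes `question_tags` has a `Tag` column (an assumed name).
top_tags <- function(question_tags, n = 10) {
  counts <- sort(table(question_tags$Tag), decreasing = TRUE)
  result <- data.frame(Tag = names(counts),
                       Questions = as.integer(counts),
                       stringsAsFactors = FALSE)
  head(result, n)
}

# Usage:
# top_tags(question_tags, n = 20)
```

Because each row is one question-tag pair, the row count per tag equals the number of questions carrying that tag.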
Or you could plot the number of questions asked per week.
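A rough sketch of the weekly count follows. It assumes a `CreationDate` column holding ISO 8601 timestamps such as `2008-07-31T21:26:37Z`; the column name, the timestamp format, and the `questions_per_week` helper are assumptions on my part.

```r
# Hypothetical helper: count questions per calendar week (weeks start Monday).
# Assumes `CreationDate` begins with a YYYY-MM-DD date (an assumed format).
questions_per_week <- function(questions) {
  created <- as.Date(substr(questions$CreationDate, 1, 10))
  # Round each date down to the Monday that starts its week
  # (%u is the weekday number, with Monday = 1 and Sunday = 7)
  week <- created - (as.integer(format(created, "%u")) - 1L)
  counts <- table(format(week))
  data.frame(Week = as.Date(names(counts)),
             Questions = as.integer(counts))
}

# Usage:
# weekly <- questions_per_week(questions)
# plot(weekly$Week, weekly$Questions, type = "l",
#      xlab = "Week", ylab = "Questions asked")
```

The first and last weeks in the data are likely to be partial, so you may want to drop them before plotting.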
Many of the most interesting issues you can examine involve tags, which describe the programming language or technology used in a question. For instance, you could compare the growth or decline of particular tags over time.
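One possible approach is to compute, for each year, the share of all questions that carry a given tag. The sketch below assumes the column names `Id`, `CreationDate`, and `Tag`, and the `tag_share_by_year` helper is hypothetical.

```r
# Hypothetical helper: yearly share of questions carrying each selected tag.
# Assumes columns `Id` and `CreationDate` in `questions`, and `Id` and `Tag`
# in `question_tags` (all assumed names).
tag_share_by_year <- function(questions, question_tags, tags) {
  year <- as.integer(substr(questions$CreationDate, 1, 4))
  total <- table(year)  # all questions asked in each year

  # Keep only the tags of interest, then attach each question's year
  tagged <- merge(question_tags[question_tags$Tag %in% tags, ],
                  data.frame(Id = questions$Id, Year = year),
                  by = "Id")

  counts <- as.data.frame(table(Year = tagged$Year, Tag = tagged$Tag),
                          stringsAsFactors = FALSE)
  counts$Year <- as.integer(counts$Year)
  # Divide each tag's yearly count by the total questions asked that year
  counts$Share <- counts$Freq / as.integer(total[as.character(counts$Year)])
  counts
}

# Usage:
# shares <- tag_share_by_year(questions, question_tags, c("r", "python"))
```

Using a share rather than a raw count keeps the comparison fair as overall question volume changes. Note that `merge()` can be slow on millions of rows, so a join from a data-manipulation package may be preferable in practice.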
How this compares to other Stack Overflow resources
Almost all of this data is already public within the Stack Exchange Data Dump. But the official data dump takes a lot of computational effort to download and process (the Posts fit in a 27 GB XML file), even if the question you want to ask is very simple. The StackLite dataset, in contrast, is designed to be easy to read in and start analyzing. (For example, I was really impressed with Joshua Kunst’s analysis of tags over time, and I want to make it straightforward for others to write posts like that.)
Similarly, this data can be examined within the Stack Exchange Data Explorer (SEDE), but it requires working with separate queries that each return at most 50,000 rows. The StackLite dataset offers analysts the chance to work with the data locally using their tool of choice.
I’m hoping other analysts find this dataset interesting, and use it to perform meaningful and open research. (Be sure to comment below if you do!)
I’m especially happy to have this dataset public and easily accessible, since it gives me the chance to blog more analyses of Stack Overflow questions and tags while keeping my work reproducible and extendable by others. Keep an eye out for such posts in the future!