At some point you may be looking for a “real world” dataset to practice analysis on or to give to students.
The value of such data is that it gives analysts a chance to develop skills that they need for their work but that are hard to master on “clean” datasets, especially inside a guided course.
I’ve found the dataset below which, apart from being actual, real-life data, has a few characteristics that make it a good set for learning data cleaning and then further analysis.
The data is a Salary Survey from AskAManager.org. It’s US-centric-ish but does allow for a range of country inputs. I find salary surveys inherently interesting, but here are some other notable aspects of this dataset:
- There are 17 variables, so it’s not too overwhelming.
- Six of the variables are free-form text entry, which always means lots of data cleaning to be done!
- All variables make intuitive sense, so you don’t need any domain expertise to figure out what they are. HOWEVER….
- You can apply some domain expertise to a subset of the data that you are familiar with, be it country, state, job title or sector knowledge.
- The dataset is “live” and constantly growing. In the time it’s taken me to write the first lines of this post, the responses grew from 11,588 to 11,603. This means that fixes you made to earlier analysis may not hold for all new entries.
- The downloaded dataset also includes a “timestamp” variable (column A), so even if the survey is no longer receiving updates, you can simulate a growing list by filtering the data over longer and longer timespans.
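To make the timestamp idea from the last bullet concrete, here’s a minimal Python sketch that builds “snapshots” of the data as of different cutoff dates. The sample rows, the `salary` column name and the timestamp format are all made up for illustration, so check them against your own export before reusing this.

```python
from datetime import datetime

# Hypothetical sample rows standing in for the real sheet.
# Only the existence of a timestamp column comes from the dataset itself.
rows = [
    {"timestamp": "4/24/2019 11:43:21", "salary": "55000"},
    {"timestamp": "4/25/2019 9:02:10", "salary": "72000"},
    {"timestamp": "4/29/2019 16:30:00", "salary": "61000"},
    {"timestamp": "5/3/2019 8:15:45", "salary": "98000"},
]

# Assumed format; adjust it to match what your download actually contains.
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M:%S"


def snapshot(rows, cutoff):
    """Return only the rows submitted on or before the cutoff."""
    return [
        row for row in rows
        if datetime.strptime(row["timestamp"], TIMESTAMP_FORMAT) <= cutoff
    ]


# Progressively later cutoffs simulate the survey "growing" over time.
cutoffs = [
    datetime(2019, 4, 25, 23, 59, 59),
    datetime(2019, 4, 30, 23, 59, 59),
    datetime(2019, 5, 31, 23, 59, 59),
]
snapshot_sizes = [len(snapshot(rows, cutoff)) for cutoff in cutoffs]
print(snapshot_sizes)
```

Re-running your cleaning steps on each snapshot in turn is a cheap way to see whether fixes written for the early responses still hold once later ones arrive.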
If you’re using R, you can read the sheet using the googlesheets4 package.
You can of course make a copy of the sheet directly in Google Sheets, or you can download it in multiple formats.
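If you go the download route, here’s a rough sketch of a first cleaning pass over one free-form column, using only Python’s standard library. The `country` column name, the sample values and the alias list are assumptions for illustration, not the sheet’s actual contents.

```python
import csv
import io
from collections import Counter

# Stand-in for a downloaded CSV export; column names and values are made up.
SAMPLE_CSV = """timestamp,country
4/24/2019 11:43:21,United States
4/24/2019 11:45:02, usa
4/24/2019 11:50:37,U.S.A.
4/24/2019 12:01:15,us
4/24/2019 12:03:48,Canada
"""

# Hypothetical alias list; a real one would grow as you inspect the data.
US_ALIASES = {"us", "usa", "united states", "united states of america"}


def normalize_country(raw):
    """Map common free-text spellings of the US onto one label."""
    key = raw.strip().lower().replace(".", "")
    if key in US_ALIASES:
        return "United States"
    return raw.strip()


def load_rows(handle):
    """Read a CSV file handle into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


rows = load_rows(io.StringIO(SAMPLE_CSV))
cleaned = [normalize_country(row["country"]) for row in rows]
counts = Counter(cleaned)
print(counts)
```

With a real download you would pass an opened file to `load_rows` instead of the in-memory sample, for example `open("salary_survey.csv", newline="", encoding="utf-8")`, where the filename is whatever you saved the export as.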