(Contributing blogger Joe Rickert has put together a fantastic list of data sources suitable for use with R. If you're looking for data to use in the Applications of R Contest — entries close October 31 — this is a great resource for you — Ed.)
Hardly a day goes by without someone or something reminding me that we are drowning in a sea of data (a bummer day ):, or that the new hero is the data scientist (a Yes! let’s go make some money kind of day!!). This morning I read “…Google grew from processing 100 terabytes of data a day with MapReduce in 2004 to processing 20 petabytes a day with MapReduce in 2008” (Lin and Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool, 2010, p. 1). Assuming linear growth, that would mean Google processed about 400 terabytes during the 15 minutes it took me to check my email. Even if Google is getting more than its fair share, data should be everywhere: more data than I could ever need, more than I could process, more than I could ever imagine.
So, how come every time I go to write a blog post or try some new stats I can never find any data? A few hours ago I Googled “free data sets” and got over 74,000,000 hits, but it looks as if it’s going to be another evening of me with iris. What’s wrong here? At the root, it’s a deep problem that gets at the essence of data. What are data anyway? My answer: data are structured information. Part of the structure includes meta-information describing the intention and the integrity with which the data were collected. When looking for a data set, even for some purpose that is not that important, we all want some evidence that the data were either collected with intentions similar to our own or that the data can be re-purposed. Moreover, we need to establish some comfort level that the data were not collected to deceive, and that they are reasonably representative, reasonably randomized, reasonably unbiased, etc. The more importance we place on our project, the more we tighten up on these requirements. This is not all philosophy. I think that focusing on intentions and integrity provides some practical guidance on where to search for data on the internet.
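For anyone who wants to check that back-of-the-envelope figure, here is a minimal sketch of the linear extrapolation. The two data points come from the Lin and Dyer quote; everything else is an assumption made for illustration: the year plugged in at the end (2011 is a guess, since the post does not state its year), the decimal convention of 1,000 terabytes per petabyte, and the function names.

```python
# Linear extrapolation of the daily MapReduce volume quoted above.
# Assumption: 1 PB = 1000 TB (decimal convention).
TB_PER_PB = 1000


def daily_volume_tb(year, y0=2004, v0_tb=100.0, y1=2008, v1_tb=20.0 * TB_PER_PB):
    """Daily volume in TB at `year`, on the straight line through the two quoted points."""
    slope = (v1_tb - v0_tb) / (y1 - y0)  # growth in TB/day per year
    return v0_tb + slope * (year - y0)


def volume_in_minutes_tb(year, minutes):
    """Volume in TB processed in `minutes` at the extrapolated daily rate for `year`."""
    return daily_volume_tb(year) * minutes / (24 * 60)


# Hypothetical year: not stated in the post, chosen only to illustrate the estimate.
assumed_year = 2011
print(round(volume_in_minutes_tb(assumed_year, 15)))
```

Under these assumptions the estimate comes out in the mid-300s of terabytes per 15 minutes, which is the same ballpark as the rounded figure above. A different assumed year moves the answer proportionally.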
Historic financial data are relatively easy to find because the intentions with which they were collected are clear, and Yahoo, Google, FRED, the St. Louis Fed, Oanda for currency data and others have made it their business to collect and maintain these data. Visit quantmod for R code to read in data from these sites and even more places to find financial data. The next places high on the data intentions and integrity scale are the world’s government agencies: federal, regional and municipal. Data.gov, the official website of the United States Government, the Census Bureau, the Department of Energy, the FBI, and other agencies have interesting data sets to offer. The National Institutes of Health even offers some data sets in R format. Here is a microarray data set.
Don’t just confine your search to the US. The UK is on a mission to open up the government. And don’t just look at the federal level. Look here for a fairly clean data set on London (UK) municipal waste management and here for the dirt on a thousand or so NYC taxi complaints. The Guardian is trying to make it easy to surf the data from the world’s governments.
The sweet spots for data sets on varied topics having thousands to a few million records are the professional data set aggregators such as infochimps, datamarket and datamob.org. Some of these sites offer data sets for sale as well as offering some free data. They all seem to do a good job of describing the data and get high intention and integrity scores. KDnuggets tracks data sets that are large enough to use for data mining projects.
For the very ambitious: check out the free data sets that are available for analysis in Amazon's cloud.
Finally, there are several bloggers and wiki makers out there trying to build, annotate and maintain their own lists. Some of my favorites are at blogspot and quora, the Januarist and Revolution. Stackexchange tracks data sources that have R interfaces. And, of course, there are those willing to make interesting data available for a song.
I'm maintaining a list of public data sources at inside-R.org. Please let me know what I have missed.