Where to find good data sets

[This article was first published on Maximize Productivity with Industrial Engineer and Operations Research Tools, and kindly contributed to R-bloggers.]

O’Reilly Media has been a big advocate of Open Data and believes that much of computing is headed in that direction.  I think they are definitely on to something.  Yet the future could be now: there are plenty of opportunities to find good data sources immediately.  One of my favorite blogs, O’Reilly Radar, has an article by Edd Dumbill on Where To Find Data.  Plenty of good data is available on the internet to download, explore, and mine for new information.  These places not only offer great sources of data, but many of them also offer an API for quick and seamless access.  Below is a link summary from the article.


An all-things graph database.  The website focuses on trends of certain cultural and interest topics.

Amazon Public Data Sets

Amazon is probably considered the cloud computing mecca next to Google.  Amazon Web Services offers a lot, including storage of public data sets, and the variety of public data hosted there is huge.

Windows Azure Data Marketplace

Surprisingly, Microsoft has an Open Data Protocol (OData) data source.  This data market offers quite a few points of interest data sets.

Yahoo Query Language

YQL is an interesting API that is very similar to SQL.  It is essentially a language for grabbing data from cloud services, which could be very handy for pulling data quickly and dynamically.  YQL can connect to a lot of data sources as well.


Infochimps

Infochimps is a data marketplace and warehouse.  They host, sell, and distribute data sets.  Some of their data comes at a cost, but a lot of it is free.  This is an interesting startup, and it will be very interesting to follow their growth.  There is also a new Infochimps R package that uses their API to gather and process Infochimps data.


DBpedia

DBpedia is a Wikipedia for data sets.  In fact, the data itself comes from Wikipedia.
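
To give a feel for what pulling data from DBpedia can look like, here is a minimal Python sketch that builds a request for its public SPARQL endpoint.  The endpoint address, the example query, and the helper names build_sparql_url and run_query are my own assumptions for illustration, not something taken from the article, so treat this as a starting point rather than a reference client.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia SPARQL endpoint is reachable at this address.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: up to ten resources typed as programming languages.
EXAMPLE_QUERY = """
SELECT ?language WHERE {
  ?language a <http://dbpedia.org/ontology/ProgrammingLanguage> .
} LIMIT 10
"""


def build_sparql_url(query, endpoint=DBPEDIA_ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON-formatted results."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urllib.parse.urlencode(params)


def run_query(query, timeout=30):
    """Send the query over HTTP and return the list of result bindings."""
    url = build_sparql_url(query)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    # SPARQL JSON results keep the rows under results -> bindings.
    return payload["results"]["bindings"]
```

Calling run_query(EXAMPLE_QUERY) would perform the actual network request; build_sparql_url on its own only assembles the URL, which makes it easy to inspect before sending anything.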

Some other sources not from the article include the World Bank open data and the U.S. Census data.
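
As a rough illustration of the World Bank open data, the Python sketch below builds a request for one indicator and flattens the response into a year-to-value mapping.  It assumes the version 2 API lives at api.worldbank.org/v2 and returns a two-element JSON list (metadata first, rows second); the helper names and the sample indicator code SP.POP.TOTL (total population) are illustrative choices of mine, not from the article.

```python
import json
import urllib.parse
import urllib.request

# Assumption: base address of the World Bank version 2 API.
WORLD_BANK_API = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, date_range=None):
    """Build the URL for one indicator in one country, e.g. ("BR", "SP.POP.TOTL")."""
    path = "/country/{}/indicator/{}".format(
        urllib.parse.quote(country), urllib.parse.quote(indicator)
    )
    params = {"format": "json", "per_page": 100}
    if date_range:
        # Hypothetical usage: a range such as "2000:2010".
        params["date"] = date_range
    return WORLD_BANK_API + path + "?" + urllib.parse.urlencode(params)


def extract_series(payload):
    """Turn a [metadata, rows] response into {year: value}, skipping missing values."""
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return {}
    return {
        row["date"]: row["value"]
        for row in payload[1]
        if row["value"] is not None
    }


def fetch_series(country, indicator, date_range=None, timeout=30):
    """Download one indicator series and return it as {year: value}."""
    url = build_indicator_url(country, indicator, date_range)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_series(json.load(response))
```

Something like fetch_series("BR", "SP.POP.TOTL") would perform the download; the parsing step is kept separate so it can be tried on a saved response without touching the network.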

