Where to find good data sets

December 16, 2010

(This article was first published on Maximize Productivity with Industrial Engineer and Operations Research Tools, and kindly contributed to R-bloggers)

O’Reilly Media has been a big advocate of Open Data and believes that is where a lot of computing is headed.  I think they are definitely on to something.  Yet the future could be now.  There are plenty of opportunities to find good data sources right away.  One of my favorite blogs, O’Reilly Radar, has an article by Edd Dumbill on Where To Find Data.  There is plenty of good data available on the internet to download, explore, and mine for new information.  These places not only offer great sources of data, but many of them also offer an API for quick and seamless access.  Below is a link summary from the article.


An all-things graph database.  The website focuses on trends in certain cultural and special-interest topics.

Amazon Public Data Sets

Amazon is probably considered the cloud computing mecca next to Google.  Amazon Web Services offers a lot, including storage of public data sets.  They host a huge variety of public data.

Windows Azure Data Marketplace

Surprisingly, Microsoft has a data source built on the Open Data Protocol.  This data market offers quite a few data sets of interest.

Yahoo Query Language

YQL is an interesting API that is very similar to SQL.  It is essentially a query language that lets you grab data from cloud services, which could be very handy for pulling data quickly and dynamically.  YQL can connect to a lot of data sources as well.
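To make the SQL-like flavor concrete, here is a minimal sketch of how a YQL statement could be packed into a request URL.  It only builds the URL and does not send anything.  The endpoint shown is the historical public YQL address and may no longer respond; the helper name `build_yql_url` and the sample query are my own assumptions for illustration, not something taken from the article.

```python
from urllib.parse import urlencode

# Assumption: the historical public YQL endpoint, shown for illustration only.
# It may no longer respond to requests.
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def build_yql_url(query: str, fmt: str = "json") -> str:
    """Return a GET URL that carries a YQL statement in the `q` parameter.

    `build_yql_url` is a hypothetical helper, not part of any Yahoo library.
    """
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": fmt})


# Illustrative query: the statement reads like SQL, with a "table" that
# stands for a web service rather than a database table.
example_query = 'select * from geo.places where text="Chicago, IL"'
example_url = build_yql_url(example_query)
print(example_url)
```

The point of the sketch is the shape of the call: a single SQL-style string goes in, and the service decides how to fetch and return the data.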


Infochimps is a data marketplace and warehouse.  They host, sell, and distribute data sets.  Some of their data comes at a cost, but a lot of it is free.  This is an interesting startup, and it will be very interesting to follow their growth.  There is also a new Infochimps R package that uses their API to gather and process Infochimps data.


DBpedia is a Wikipedia for data sets.  In fact, the data itself comes from Wikipedia.
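As a rough sketch of what pulling that Wikipedia-derived data could look like, the snippet below builds a query URL for DBpedia's public SPARQL endpoint.  SPARQL is not mentioned in the article, so treat the query language, the helper name `build_sparql_url`, and the sample resource as my assumptions.  Nothing is sent over the network; the code only constructs the URL.

```python
from urllib.parse import urlencode

# Assumption: DBpedia's public SPARQL endpoint.
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"


def build_sparql_url(query: str,
                     fmt: str = "application/sparql-results+json") -> str:
    """Return a GET URL carrying a SPARQL query (hypothetical helper)."""
    return DBPEDIA_SPARQL + "?" + urlencode({"query": query, "format": fmt})


# Illustrative query: ask for the English abstract of one resource.
# The resource chosen here is only an example.
sparql_query = """
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Operations_research>
      <http://dbpedia.org/ontology/abstract> ?abstract .
  FILTER (lang(?abstract) = "en")
}
"""

sparql_url = build_sparql_url(sparql_query)
print(sparql_url)
```

Fetching that URL with any HTTP client should return the matching rows, assuming the endpoint is available and the resource exists.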

Some other sources not from the article include the World Bank open data and the U.S. Census data.
