New Data Sources for R

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

by Joseph Rickert

Over the past few months, a number of new CRAN packages have appeared that make it easier for R users to gain access to curated data. Most of these provide interfaces to a RESTful API written by the data publishers, while a few simply wrap the data set inside the package. Some of the new packages are very simple, one-function wrappers around the API. Others offer more features, providing functions to control searches and the format of the returned data. Some of the packages require a user to first obtain login credentials, while others don't.
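To make the "one-function wrapper" idea concrete, here is a minimal sketch in base R of the kind of helper such a package might be built around. It only assembles the request URL. The base URL, endpoint and parameter names are invented for illustration and do not belong to any of the packages listed below.

```r
# Hypothetical helper: build the request URL for a REST-style data API.
# The base URL, endpoint and parameter names are made up for this sketch.
build_query_url <- function(base_url, endpoint, params = list()) {
  # Drop any trailing slash so the pieces join cleanly
  url <- paste0(sub("/+$", "", base_url), "/", endpoint)
  if (length(params) == 0) return(url)
  # URL-encode each value and join the name=value pairs with "&"
  values <- vapply(params,
                   function(x) utils::URLencode(as.character(x), reserved = TRUE),
                   character(1))
  paste0(url, "?", paste0(names(params), "=", values, collapse = "&"))
}

url <- build_query_url("https://api.example.org/v1/", "series",
                       list(country = "DE", start = 2000))
print(url)
```

A real wrapper would then fetch and parse the response, for example with a JSON parser, and return a data frame. That step is left out here because it depends entirely on the publisher's API.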

Here are 17 packages that connect to data sources of all sorts. The list is by no means complete: new packages in this class seem to be arriving daily at CRAN.

ameco V0.1: Contains the entire European Commission Annual macro-economic (AMECO) database. The vignette shows a nice example of data munging to get a plot of population data.

censusr V0.0.2: Provides an interface to the US Census Data API. The vignette shows exactly how to go about getting the key to use the API.
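Once you have a key for an API like this one, a common pattern is to keep it out of your scripts and read it from an environment variable. The sketch below uses only base R. The variable name CENSUS_API_KEY and the helper get_api_key() are assumptions made for illustration, not part of censusr.

```r
# Read an API key from an environment variable instead of hard-coding it.
# CENSUS_API_KEY is a hypothetical variable name chosen for this sketch.
get_api_key <- function(var = "CENSUS_API_KEY") {
  key <- Sys.getenv(var, unset = NA)
  if (is.na(key) || !nzchar(key)) {
    stop("No API key found; set ", var, " in your .Renviron file.", call. = FALSE)
  }
  key
}

# For demonstration only; normally the key would be set outside the script.
Sys.setenv(CENSUS_API_KEY = "not-a-real-key")
key <- get_api_key()
```

The returned string can then be passed to whichever argument the package uses for the key.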

ckanr V0.1.0: Provides an interface to the Comprehensive Knowledge Archive Network (CKAN), which bills itself as the “world’s leading open-source data portal platform”. The vignette walks you through installation and provides some examples.

dieZeit V0.1.0: Provides access to Die Zeit's online content. This includes archives going back to 1946! The vignette shows how to get API access with a limit of 10,000 accesses per day.

ecb V0.1: Provides an interface to the European Central Bank's Statistical Data Warehouse API. The following plot from the vignette shows “headline” and core Harmonized Index of Consumer Prices (HICP) inflation numbers.

[Figure: plot from the ecb package vignette]

gesis V0.1: Provides an interface to the GESIS Catalogue of more than 5,000 data sets maintained by the Leibniz Institute for the Social Sciences. The vignette shows you how to get started.

gtrendsR V1.3.0: Provides functions to perform and display Google Trends queries.

hdr V0.1: Provides an interface to the United Nations Development Programme Human Development Report API. The vignette provides an example of accessing and plotting some data.

inegiR V1.0.2: Provides functions to download and parse information from the official Mexican statistics agency, INEGI.

maddison V0.1: Contains the Maddison Project database which provides estimates of GDP per capita for all countries between AD 1 and 2010. The following plot from the vignette shows estimated GDP per capita going back to 1800. Look at the WW II years.

[Figure: plot from the maddison package vignette]

mldr.datasets V0.3.1: Provides tools for the manipulation and exploration of multi-label data sets, and contains a large collection of such data sets. The vignette, which is very well done, contains some theory, illustrative R code and some spectacular visualizations. Here is a pretty cool plot in one line of code:

plot(genbase, labelIndices = genbase$labels$index[1:11])

[Figure: mldr plot of the genbase data set]

pageviews V0.1.1: Provides an API client for Wikimedia traffic data. The following code, adapted from the vignette, plots views of the article R (programming language) for 2015.


library(pageviews)
library(ggplot2)

# Page views for the article over 2015
res <- article_pageviews(project = "en.wikipedia",
                         article = "R_(programming_language)",
                         start = "2015010100", end = "2015123124")

# Fiddle with the string to get it into proper format for forming a date.
# The regular expression comes from those kind folks at stackoverflow: http://bit.ly/1RKC9iQ
# Backslashes in the replacement pattern must be doubled inside an R string.
date <- gsub('^(.{7})(.*)$', '\\1-\\2', gsub('^(.{4})(.*)$', '\\1-\\2', res$timestamp))
res$date <- as.Date(substr(date, 1, 10))

p <- ggplot(res, aes(date, views)) + geom_line() +
  geom_point(colour = "red", size = 0.5) +
  xlab("2015") + ylab("Daily Views") +
  ggtitle("Wikipedia article: R (programming language)")
p

[Figure: daily Wikipedia page views for the article R (programming language), 2015]
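The timestamp handling in that example can be tried on its own without calling the API. The sketch below runs the same two-step regular expression on a few made-up sample timestamps in the YYYYMMDDHH form, and compares it with a simpler alternative (not from the vignette) that lets as.Date() parse the first eight characters directly.

```r
# Sample timestamps in YYYYMMDDHH form; these values are made up for illustration.
timestamps <- c("2015010100", "2015063000", "2015123100")

# Two-step regular expression approach: insert a hyphen after the year,
# then after the month. Backslashes are doubled inside R strings.
with_year   <- gsub("^(.{4})(.*)$", "\\1-\\2", timestamps)
with_month  <- gsub("^(.{7})(.*)$", "\\1-\\2", with_year)
dates_regex <- as.Date(substr(with_month, 1, 10))

# Alternative: parse the first eight characters with an explicit format.
dates_direct <- as.Date(substr(timestamps, 1, 8), format = "%Y%m%d")

# Both approaches should give the same dates.
print(dates_direct)
print(all(dates_regex == dates_direct))
```

The explicit format string is shorter and avoids the escaping pitfalls of the regular expression, at the cost of departing from the original recipe.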

pangaear V0.1.0: Provides tools for interacting with the PANGAEA database. This should be of interest to environmental scientists.

prism V0.0.7: Allows you to download and visualize climate data from Oregon State's PRISM Project. The vignette is maintained on GitHub.

rstatscn V1.0: Provides functions to query Chinese National Data.

SocialMediaLab V0.19.0: Provides tools to collect data from Instagram, Facebook, Twitter and YouTube, construct networks and plot them.

wordbankr V0.1: Contains functions for connecting to Wordbank, Stanford's database of children's vocabulary development that spans 14 languages. There is a vignette.

Of course, there are countless sources of data that are easily accessible from R. Our data page on MRAN lists quite a few, and there are even more on Mango Solutions' data set page. Please let us know about any others that you think we should track.
