The rest of the month is going to be super-hectic and it’s unlikely I’ll be able to do any more to help the push to CRAN 10K, so here’s a breakdown of new CRAN and GitHub packages & package updates that I felt were worth raising awareness of:
I mentioned this one last week but it wasn’t really a package announcement post.
epidata is now on CRAN and is a package to pull data from the Economic Policy Institute (U.S. gov economic data, mostly). Their “hidden” API is well thought out and the data has been nicely curated (and seems to update monthly). It makes it super easy to do things like the following:
library(epidata)
library(tidyverse)
library(lubridate) # for as_date(), used in the second plot
library(stringi)
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")

us_unemp <- get_unemployment("e")

glimpse(us_unemp)
## Observations: 456
## Variables: 7
## $ date            <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-0...
## $ all             <dbl> 0.061, 0.061, 0.060, 0.060, 0.059, 0.059, 0.05...
## $ less_than_hs    <dbl> 0.100, 0.100, 0.099, 0.099, 0.099, 0.099, 0.09...
## $ high_school     <dbl> 0.055, 0.055, 0.054, 0.054, 0.054, 0.053, 0.05...
## $ some_college    <dbl> 0.050, 0.050, 0.050, 0.049, 0.049, 0.049, 0.04...
## $ college         <dbl> 0.032, 0.031, 0.031, 0.030, 0.030, 0.029, 0.03...
## $ advanced_degree <dbl> 0.021, 0.020, 0.020, 0.020, 0.020, 0.020, 0.02...

# Reshape to long form and turn the column names into readable facet labels
us_unemp %>%
  gather(level, rate, -date) %>%
  mutate(level=stri_replace_all_fixed(level, "_", " ") %>%
           stri_trans_totitle() %>%
           stri_replace_all_regex(c("Hs$"), c("High School")),
         level=factor(level, levels=unique(level))) -> unemp_by_edu

# Take a single color from the palette; a length-10 vector is not a valid constant color
col <- ggthemes::tableau_color_pal()(10)[1]

ggplot(unemp_by_edu, aes(date, rate, group=level)) +
  geom_line(color=col) +
  scale_y_continuous(labels=scales::percent, limits=c(0, 0.2)) +
  facet_wrap(~level, scales="free") +
  labs(x=NULL, y="Unemployment rate",
       title=sprintf("U.S. Monthly Unemployment Rate by Education Level (%s)",
                     paste0(range(format(us_unemp$date, "%Y")), collapse=":")),
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_hrbrmstr(grid="XY")

# One horizontal segment per month, spanning the high school and college rates
us_unemp %>%
  select(date, high_school, college) %>%
  mutate(date_num=as.numeric(date)) %>%
  ggplot(aes(x=high_school, xend=college, y=date_num, yend=date_num)) +
  geom_segment(size=0.125, color=col) +
  scale_x_continuous(expand=c(0,0), label=scales::percent,
                     breaks=seq(0, 0.12, 0.02), limits=c(0, 0.125)) +
  scale_y_reverse(expand=c(0,100), label=function(x) format(as_date(x), "%Y")) +
  labs(x="Unemployment rate", y="Year ↓",
       title=sprintf("U.S. monthly unemployment rate gap (%s)",
                     paste0(range(format(us_unemp$date, "%Y")), collapse=":")),
       subtitle="Segment width shows the gap between those with a high school\ndegree and those with a college degree",
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_hrbrmstr(grid="X") +
  theme(panel.ontop=FALSE) +
  theme(panel.grid.major.x=element_line(size=0.2, color="#2b2b2b25")) +
  theme(axis.title.x=element_text(family="Arial", face="bold")) +
  theme(axis.title.y=element_text(family="Arial", face="bold", angle=0, hjust=1, margin=margin(r=-14)))
(right edge is high school, left edge is college…I’ll annotate it better next time)
Censys is a search engine by one of the cybersecurity research partners we publish data to at work (free for use by all). The API is moderately decent (it’s mostly a thin shim authentication layer to pass on Google BigQuery query strings to the back-end) and the R package to interface to it, censys, is now on CRAN.
The seminal square pie chart package waffle has been updated on CRAN to work better with recent ggplot2 2.x changes and has some additional parameters you may want to check out.
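For anyone who hasn’t tried it, here is a minimal sketch of a square pie chart, assuming the CRAN version of waffle is installed. The counts are made-up placeholder values, not real data, and only the long-standing `rows`, `title` and `xlab` arguments are used:

```r
library(waffle)

# Hypothetical counts, purely for illustration
parts <- c(A = 30, B = 25, C = 20, D = 5)

# 80 squares laid out in 8 rows gives a 10-column grid
p <- waffle(parts, rows = 8,
            title = "Illustrative counts",
            xlab = "1 square = 1 unit")

print(p)
```

Since `waffle()` returns a regular ggplot2 object, it can be tweaked further with the usual `+ theme(...)` additions.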
The viral package cdcfluview has had some updates on the GitHub version to add saner behaviour when specifying dates. It also had to be updated because the CDC hidden API switched to all-https URLs (there’s a major push in .gov-land to do that to get better scores on their cyber report cards). I’ll be adding some features before the next CRAN push to enable retrieval of additional mortality data.
If you work with Apache Drill (if you don’t, you should), the sergeant package (GitHub) will help you whip it into shape. I’ve mentioned it before on the blog but it has a nigh-complete dplyr interface now that works pretty well. It also has a direct REST API interface and RJDBC interface plus many helper utilities that help you avoid typing SQL strings to get cluster status info. Once I add the ability to create parquet files with it I’ll push it up to CRAN.
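As a rough sketch of what the dplyr interface looks like in practice, the snippet below assumes a Drill instance is running on localhost with its default REST port and that the `cp.employee.json` sample data that ships with Drill is available. The host name and the chosen columns are assumptions for illustration only:

```r
library(sergeant)
library(dplyr)

# Assumption: Drill is listening on localhost (default REST port 8047)
dc <- drill_connection("localhost")
ok <- drill_active(dc)

if (ok) {
  db  <- src_drill("localhost")
  emp <- tbl(db, "cp.`employee.json`")  # sample data bundled with Drill

  # The dplyr verbs are translated to Drill SQL and run on the Drill side
  emp %>%
    count(gender, marital_status) %>%
    collect() %>%
    print()
} else {
  message("No Drill server reachable on localhost; skipping the query.")
}
```

The `drill_active()` guard is only there so the script degrades gracefully when no server is up; with a live cluster you would go straight to `src_drill()`.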
The one thing I’d like to do with this package is support any user-defined functions (UDFs in Drill-speak) folks have written. So, if you have a UDF you’ve written or use and you want it wrapped in the package, just drop an issue and I’ll layer it in. I’ll be releasing some open source cybersecurity-related UDFs via the work github in a few weeks.
Drill (in non-standalone mode) relies on Apache ZooKeeper to keep everything in sync and it’s sometimes necessary to peek at what’s happening inside the ZooKeeper cluster, so sergeant has a sister package, zkcmd, that provides an R interface to ZooKeeper instances.
Some helpful folks tweaked ggalt for better ggplot2 2.x compatibility (#ty!) and I added a new geom_cartogram() (before you ask if it makes warped shapefiles: it doesn’t) that restores the old (and what I believe to be the correct/sane/proper) behaviour of geom_map(). I need to get this on CRAN soon as it has both fixes and many new geoms folks will want to play with in a non-GitHub context.
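A minimal sketch of `geom_cartogram()` in use, assuming the GitHub version of ggalt plus the maps package (needed by `map_data()`) are installed. Passing `x` and `y` in `aes()` reflects the older `geom_map()` calling style; the world map and styling choices here are just for illustration:

```r
library(ggplot2)
library(ggalt)  # provides geom_cartogram()

# Polygon outlines for the world; map_data() needs the 'maps' package
world <- map_data("world")

p <- ggplot() +
  geom_cartogram(data = world, map = world,
                 aes(x = long, y = lat, map_id = region),
                 color = "#2b2b2b", fill = NA) +
  coord_quickmap()

print(p)
```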
There have been some awesome packages released by others in the past month+ and you should add R Weekly to your RSS feeds if you aren’t following it already (there are other things you should have there for R updates as well, but that’s for another blog). I’m definitely looking forward to new packages, visualizations, services and utilities that will be coming this year to the R community.