
It is always fun to look back and reflect on the past year. Inspired by Christoph Safferling’s post on the top packages published in 2015, I decided to have my own go at the top R trends of 2015. In contrast to Safferling’s post, I’ll also try to (1) look at packages from previous years that hit the big league, (2) see which top R coders we have in the community, and then (3) round up with my own 2015 R experience. Everything in this post is based on the CRANberries reports. To harvest the information I’ve borrowed shamelessly from Safferling’s post, with some modifications. He used the number of downloads as a proxy for the package release date, while I decided to use the actual release date; if that wasn’t available, I scraped it off the CRAN servers. The script now also retrieves the package author(s) and description (see the code below for details).

    library(rvest)
    library(dplyr)
    # devtools::install_github("hadley/multidplyr")
    library(multidplyr)
    library(magrittr)
    library(lubridate)

    # Extract one field (e.g. "Author") together with its continuation lines
    getCranberriesElmnt <- function(txt, elmnt_name){
      desc <- grep(sprintf("^%s:", elmnt_name), txt)
      if (length(desc) == 1){
        txt <- txt[desc:length(txt)]
        end <- grep("^[A-Za-z/@]{2,}:", txt[-1])
        if (length(end) == 0){
          end <- length(txt)
        }else{
          end <- end[1]
        }

        desc <-
          txt[1:end] %>%
          gsub(sprintf("^%s: (.+)", elmnt_name), "\\1", .) %>%
          paste(collapse = " ") %>%
          gsub("[ ]{2,}", " ", .) %>%
          gsub(" , ", ", ", .)
      }else if (length(desc) == 0){
        desc <- paste("No", tolower(elmnt_name))
      }else{
        stop("Could not find ", elmnt_name, " in text: \n",
             paste(txt, collapse = "\n"))
      }
      return(desc)
    }

    convertCharset <- function(txt){
      if (grepl("Windows", Sys.info()["sysname"]))
        txt <- iconv(txt, from = "UTF-8", to = "cp1252")
      return(txt)
    }

    getAuthor <- function(txt, package){
      author <- getCranberriesElmnt(txt, "Author")
      if (grepl("No author|See AUTHORS file", author)){
        author <- getCranberriesElmnt(txt, "Maintainer")
      }

      if (grepl("(No m|M)aintainer|(No a|A)uthor|^See AUTHORS file", author) ||
          is.null(author) ||
          nchar(author) <= 2){
        cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html",
                                      package))
        author <- cran_txt %>%
          html_nodes("tr") %>%
          html_text %>%
          convertCharset %>%
          gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>%
          .[grep("^Author", .)] %>%
          gsub(".*\n", "", .)
        # If not found then the package has probably been
        # removed from the repository
        if (length(author) != 1)
          author <- "No author"
      }
      # Remove stuff such as:
      # [cre, auth]
      # (worked on the...)
      # <e-mail address>
      # "John Doe"
      author %<>%
        gsub("^Author: (.+)", "\\1", .) %>%
        gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>%
        gsub("\\([^)]+\\)", " ", .) %>%
        gsub("([ ]*<[^>]+>)", " ", .) %>%
        gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>%
        gsub("[ ]{2,}", " ", .) %>%
        gsub("(^[ '\"]+|[ '\"]+$)", "", .) %>%
        gsub(" , ", ", ", .)
      return(author)
    }

    getDate <- function(txt, package){
      date <- grep("^Date/Publication", txt)
      if (length(date) == 1){
        date <- txt[date] %>%
          gsub("Date/Publication: ([0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2}).*", "\\1", .)
      }else{
        cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html",
                                      package))
        date <- cran_txt %>%
          html_nodes("tr") %>%
          html_text %>%
          convertCharset %>%
          gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>%
          .[grep("^Published", .)] %>%
          gsub(".*\n", "", .)

        # The main page doesn't contain the original date if
        # new packages have been submitted, we therefore need
        # to check first entry in the archives
        if (cran_txt %>%
            html_nodes("tr") %>%
            html_text %>%
            gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>%
            grepl("^Old.{1,4}sources", .) %>%
            any){
          archive_txt <- read_html(sprintf("http://cran.r-project.org/src/contrib/Archive/%s/",
                                           package))
          pkg_date <- archive_txt %>%
            html_nodes("tr") %>%
            lapply(function(x) {
              nodes <- html_nodes(x, "td")
              if (length(nodes) == 5){
                return(nodes[3] %>%
                         html_text %>%
                         as.Date(format = "%d-%b-%Y"))
              }
            }) %>%
            .[sapply(., length) > 0] %>%
            .[!sapply(., is.na)] %>%
            head(1)

          if (length(pkg_date) == 1)
            date <- pkg_date[[1]]
        }
      }
      date <- tryCatch({
        as.Date(date)
      }, error = function(e){
        "Date missing"
      })
      return(date)
    }

    getNewPkgStats <- function(published_in){
      # The parallel is only for making cranlogs requests
      # we can therefore have more cores than actual cores
      # as this isn't processor intensive while there is
      # considerable wait for each http-request
      cl <- create_cluster(parallel::detectCores() * 4)
      parallel::clusterEvalQ(cl, {
        library(cranlogs)
      })
      set_default_cluster(cl)
      on.exit(stop_cluster())

      berries <- read_html(paste0("http://dirk.eddelbuettel.com/cranberries/",
                                  published_in, "/"))
      pkgs <-
        # Select the divs of the package class
        html_nodes(berries, ".package") %>%
        # Extract the text
        html_text %>%
        # Split the lines
        strsplit("[\n]+") %>%
        # Now clean the lines
        lapply(.,
               function(pkg_txt) {
                 pkg_txt[sapply(pkg_txt,
                                function(x) { nchar(gsub("^[ \t]+", "", x)) > 0},
                                USE.NAMES = FALSE)] %>%
                   gsub("^[ \t]+", "", .)
               })

      # Now we select the new packages
      new_packages <-
        pkgs %>%
        # The first line is key as it contains the text "New package"
        sapply(., function(x) x[1], USE.NAMES = FALSE) %>%
        grep("^New package", .) %>%
        pkgs[.] %>%
        # Now we extract the package name and the date that it was published
        # and merge everything into one table
        lapply(function(txt){
          txt <- convertCharset(txt)
          ret <- data.frame(
            name = gsub("^New package ([^ ]+) with initial .*", "\\1", txt[1]),
            stringsAsFactors = FALSE
          )

          ret$desc <- getCranberriesElmnt(txt, "Description")
          ret$author <- getAuthor(txt, ret$name)
          ret$date <- getDate(txt, ret$name)
          return(ret)
        }) %>%
        rbind_all %>%
        # Get the download data in parallel
        partition(name) %>%
        do({
          down <- cran_downloads(.$name[1],
                                 from = max(as.Date("2015-01-01"), .$date[1]),
                                 to = "2015-12-31")$count
          cbind(.[1,],
                data.frame(sum = sum(down),
                           avg = mean(down)))
        }) %>%
        collect %>%
        ungroup %>%
        arrange(desc(avg))

      return(new_packages)
    }

    pkg_list <- lapply(2010:2015, getNewPkgStats)

    pkgs <- rbind_all(pkg_list) %>%
      mutate(time = as.numeric(as.Date("2016-01-01") - date),
             year = format(date, "%Y"))
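
To make the field extraction a bit more concrete, here is a minimal base-R sketch of the same idea applied to a made-up record. The sample text and the helper name `extract_field` are assumptions for illustration only; real CRANberries entries may be laid out differently, and this is a simplified stand-in rather than the script used for the post.

```r
# Hypothetical record, loosely shaped like a DESCRIPTION-style entry
sample_txt <- c(
  "New package demoPkg with initial version 0.1.0",
  "Package: demoPkg",
  "Author: Jane Doe [aut, cre],",
  "  John Smith (contributed tests)",
  "Description: A toy package used",
  "  only for this example.",
  "License: GPL-3"
)

# Simplified stand-in for getCranberriesElmnt(): grab a field and its
# continuation lines, stopping at the next line that looks like "Name:"
extract_field <- function(txt, field) {
  start <- grep(sprintf("^%s:", field), txt)
  if (length(start) != 1) return(paste("No", tolower(field)))
  rest <- txt[start:length(txt)]
  nxt <- grep("^[A-Za-z/@]{2,}:", rest[-1])
  end <- if (length(nxt) == 0) length(rest) else nxt[1]
  out <- sub(sprintf("^%s: ", field), "", rest[1:end])
  out <- paste(trimws(out), collapse = " ")
  gsub("[ ]{2,}", " ", out)
}

author <- extract_field(sample_txt, "Author")
# Strip role tags in [] and remarks in (), similar to what getAuthor() does
author <- gsub("[ ]*\\[[^]]+\\][ ]*", "", author)
author <- gsub("[ ]*\\([^)]+\\)[ ]*", "", author)

description <- extract_field(sample_txt, "Description")
```

Continuation lines are recognised only by the absence of a leading `Name:` pattern, so a description line that happens to start with a word followed by a colon would cut the field short.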

The longer a package has been on CRAN, the more downloads it gets. We can illustrate this using simple linear regression; slightly surprisingly, the relationship is mostly linear:

    pkgs %<>%
      mutate(time_yrs = time/365.25)
    fit <- lm(avg ~ time_yrs, data = pkgs)

    # Test for non-linearity
    library(splines)
    anova(fit,
          update(fit, .~.-time_yrs+ns(time_yrs, 2)))

    Analysis of Variance Table

    Model 1: avg ~ time
    Model 2: avg ~ ns(time, 2)
      Res.Df       RSS Df Sum of Sq      F Pr(>F)
    1   7348 189661922
    2   7347 189656567  1    5355.1 0.2075 0.6488

The average number of daily downloads increases by about 5 for each year a package has been on CRAN. It can easily be argued that the average isn’t that interesting since the data is skewed; we can therefore also look at the upper quantiles using quantile regression:

    library(quantreg)
    library(htmlTable)
    lapply(c(.5, .75, .95, .99),
           function(tau){
             rq_fit <- rq(avg ~ time_yrs, data = pkgs, tau = tau)
             rq_sum <- summary(rq_fit)
             # Interval: slope estimate plus/minus the value in the second
             # coefficient column, printed with the upper bound first
             c(Estimate = txtRound(rq_sum$coefficients[2, 1], 1),
               `95 % CI` = txtRound(rq_sum$coefficients[2, 1] +
                                      c(1,-1) * rq_sum$coefficients[2, 2], 1) %>%
                 paste(collapse = " to "))
           }) %>%
      do.call(rbind, .) %>%
      htmlTable(rnames = c("Median",
                           "Upper quartile",
                           "Top 5%",
                           "Top 1%"))

| | Estimate | 95 % CI |
|:---|---:|:---|
| Median | 0.6 | 0.6 to 0.6 |
| Upper quartile | 1.2 | 1.2 to 1.1 |
| Top 5% | 9.7 | 11.9 to 7.6 |
| Top 1% | 182.5 | 228.2 to 136.9 |

The above table conveys a slightly more interesting picture. Most packages don’t get that much attention, while the top 1% truly reach the masses.

## Top downloaded packages

In order to investigate which packages R users have been using during 2015, I’ve looked at all new packages since the turn of the decade. Since each year of CRAN presence increases the download rates, I’ve split the table by package release year. The results are available for browsing below (yes, it is the brand new interactive htmlTable that allows you to collapse cells; note that it may not work if you are reading this on R-bloggers, where the link can be lost under certain circumstances).

| Name | Author | Total downloads | Average downloads/day | Description |
|:---|:---|---:|---:|:---|
| **Top 10 packages published in 2015** | | | | |
| xml2 | Hadley Wickham, Jeroen Ooms, RStudio, R Foundation | 348,222 | 1635 | Work with XML files … |
| rversions | Gabor Csardi | 386,996 | 1524 | Query the main R SVN… |
| git2r | Stefan Widgren | 411,709 | 1303 | Interface to the lib… |
| praise | Gabor Csardi, Sindre Sorhus | 96,187 | 673 | Build friendly R pac… |
| readxl | David Hoerl | 99,386 | 379 | Import excel files i… |
| readr | Hadley Wickham, Romain Francois, R Core Team, RStudio | 90,022 | 337 | Read flat/tabular te… |
| DiagrammeR | Richard Iannone | 84,259 | 236 | Create diagrams and … |
| visNetwork | Almende B.V. (vis.js library in htmlwidgets/lib, | 41,185 | 233 | Provides an R interf… |
| plotly | Carson Sievert, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, Pedro Despouy | 9,745 | 217 | Easily translate ggp… |
| DT | Yihui Xie, Joe Cheng, jQuery contributors, SpryMedia Limited, Brian Reavis, Leon Gersen, Bartek Szopka, RStudio Inc | 24,806 | 120 | Data objects in R ca… |
| **Top 10 packages published in 2014** | | | | |
| stringi | Marek Gagolewski and Bartek Tartanus ; IBM and other contributors ; Unicode, Inc. | 1,316,900 | 3608 | stringi allows for v… |
| magrittr | Stefan Milton Bache and Hadley Wickham | 1,245,662 | 3413 | Provides a mechanism… |
| mime | Yihui Xie | 1,038,591 | 2845 | This package guesses… |
| R6 | Winston Chang | 920,147 | 2521 | The R6 package allow… |
| dplyr | Hadley Wickham, Romain Francois | 778,311 | 2132 | A fast, consistent t… |
| manipulate | JJ Allaire, RStudio | 626,191 | 1716 | Interactive plotting… |
| htmltools | RStudio, Inc. | 619,171 | 1696 | Tools for HTML gener… |
| curl | Jeroen Ooms | 599,704 | 1643 | The curl() function … |
| lazyeval | Hadley Wickham, RStudio | 572,546 | 1569 | A disciplined approa… |
| rstudioapi | RStudio | 515,665 | 1413 | This package provide… |
| **Top 10 packages published in 2013** | | | | |
| jsonlite | Jeroen Ooms, Duncan Temple Lang | 906,421 | 2483 | This package is a fo… |
| BH | John W. Emerson, Michael J. Kane, Dirk Eddelbuettel, JJ Allaire, and Romain Francois | 691,280 | 1894 | Boost provides free … |
| highr | Yihui Xie and Yixuan Qiu | 641,052 | 1756 | This package provide… |
| assertthat | Hadley Wickham | 527,961 | 1446 | assertthat is an ext… |
| httpuv | RStudio, Inc. | 310,699 | 851 | httpuv provides low-… |
| NLP | Kurt Hornik | 270,682 | 742 | Basic classes and me… |
| TH.data | Torsten Hothorn | 242,060 | 663 | Contains data sets u… |
| NMF | Renaud Gaujoux, Cathal Seoighe | 228,807 | 627 | This package provide… |
| stringdist | Mark van der Loo | 123,138 | 337 | Implements the Hammi… |
| SnowballC | Milan Bouchet-Valat | 104,411 | 286 | An R interface to th… |
| **Top 10 packages published in 2012** | | | | |
| gtable | Hadley Wickham | 1,091,440 | 2990 | Tools to make it eas… |
| knitr | Yihui Xie | 792,876 | 2172 | This package provide… |
| httr | Hadley Wickham | 785,568 | 2152 | Provides useful tool… |
| markdown | JJ Allaire, Jeffrey Horner, Vicent Marti, and Natacha Porte | 636,888 | 1745 | Markdown is a plain-… |
| Matrix | Douglas Bates and Martin Maechler | 470,468 | 1289 | Classes and methods … |
| shiny | RStudio, Inc. | 427,995 | 1173 | Shiny makes it incre… |
| lattice | Deepayan Sarkar | 414,716 | 1136 | Lattice is a powerfu… |
| pkgmaker | Renaud Gaujoux | 225,796 | 619 | This package provide… |
| rngtools | Renaud Gaujoux | 225,125 | 617 | This package contain… |
| base64enc | Simon Urbanek | 223,120 | 611 | This package provide… |
| **Top 10 packages published in 2011** | | | | |
| scales | Hadley Wickham | 1,305,000 | 3575 | Scales map data to a… |
| devtools | Hadley Wickham | 738,724 | 2024 | Collection of packag… |
| RcppEigen | Douglas Bates, Romain Francois and Dirk Eddelbuettel | 634,224 | 1738 | R and Eigen integrat… |
| fpp | Rob J Hyndman | 583,505 | 1599 | All data sets requir… |
| nloptr | Jelmer Ypma | 583,230 | 1598 | nloptr is an R inter… |
| pbkrtest | Ulrich Halekoh Søren Højsgaard | 536,409 | 1470 | Test in linear mixed… |
| roxygen2 | Hadley Wickham, Peter Danenberg, Manuel Eugster | 478,765 | 1312 | A Doxygen-like in-so… |
| whisker | Edwin de Jonge | 413,068 | 1132 | logicless templating… |
| doParallel | Revolution Analytics | 299,717 | 821 | Provides a parallel … |
| abind | Tony Plate and Richard Heiberger | 255,151 | 699 | Combine multi-dimens… |
| **Top 10 packages published in 2010** | | | | |
| reshape2 | Hadley Wickham | 1,395,099 | 3822 | Reshape lets you fle… |
| labeling | Justin Talbot | 1,104,986 | 3027 | Provides a range of … |
| evaluate | Hadley Wickham | 862,082 | 2362 | Parsing and evaluati… |
| formatR | Yihui Xie | 640,386 | 1754 | This package provide… |
| minqa | Katharine M. Mullen, John C. Nash, Ravi Varadhan | 600,527 | 1645 | Derivative-free opti… |
| gridExtra | Baptiste Auguie | 581,140 | 1592 | misc. functions |
| memoise | Hadley Wickham | 552,383 | 1513 | Cache the results of… |
| RJSONIO | Duncan Temple Lang | 414,373 | 1135 | This is a package th… |
| RcppArmadillo | Romain Francois and Dirk Eddelbuettel | 410,368 | 1124 | R and Armadillo inte… |
| xlsx | Adrian A. Dragulescu | 401,991 | 1101 | Provide R functions … |

Just as Safferling et al. noted, there is a dominance of technical packages. This is hardly surprising since the majority of work is data munging. Among these technical packages there are quite a few that are used for developing other packages, e.g. roxygen2, pkgmaker, devtools, and more.

## R-star authors

Just for fun I decided to look at who has the most downloads. By splitting multi-author packages into one entry per author, and also splitting the downloads between them, we find that the top R coders in 2015 were:

    top_coders <- list(
      "2015" =
        pkgs %>%
        filter(format(date, "%Y") == 2015) %>%
        partition(author) %>%
        do({
          authors <- strsplit(.$author, "[ ]*([,;]| and )[ ]*")[[1]]
          authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
          if (length(authors) >= 1){
            # If multiple authors the statistic is split among
            # them but with an added 20% for the extra collaboration
            # effort that a multi-author environment calls for.
            # Note: as written, the 1.2 factor also applies to
            # single-author packages.
            .$sum <- round(.$sum/length(authors)*1.2)
            .$avg <- .$avg/length(authors)*1.2
            ret <- .
            ret$author <- authors[1]
            for (m in authors[-1]){
              tmp <- .
              tmp$author <- m
              ret <- rbind(ret, tmp)
            }
            return(ret)
          }else{
            return(.)
          }
        }) %>%
        collect() %>%
        group_by(author) %>%
        summarise(download_ave = round(sum(avg)),
                  no_packages = n(),
                  packages = paste(name, collapse = ", ")) %>%
        select(author, download_ave, no_packages, packages) %>%
        collect() %>%
        arrange(desc(download_ave)) %>%
        head(10),
      "all" =
        pkgs %>%
        partition(author) %>%
        do({
          authors <- strsplit(.$author, "[ ]*([,;]| and )[ ]*")[[1]]
          authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
          if (length(authors) >= 1){
            # Same splitting rule as for the 2015 list above
            .$sum <- round(.$sum/length(authors)*1.2)
            .$avg <- .$avg/length(authors)*1.2
            ret <- .
            ret$author <- authors[1]
            for (m in authors[-1]){
              tmp <- .
              tmp$author <- m
              ret <- rbind(ret, tmp)
            }
            return(ret)
          }else{
            return(.)
          }
        }) %>%
        collect() %>%
        group_by(author) %>%
        summarise(download_ave = round(sum(avg)),
                  no_packages = n(),
                  packages = paste(name, collapse = ", ")) %>%
        select(author, download_ave, no_packages, packages) %>%
        collect() %>%
        arrange(desc(download_ave)) %>%
        head(30))

    interactiveTable(
      do.call(rbind, top_coders) %>%
        mutate(download_ave = txtInt(download_ave)),
      align = "lrr",
      header = c("Coder", "Total ave. downloads per day",
                 "No. of packages", "Packages"),
      tspanner = c("Top coders 2015",
                   "Top coders 2010-2015"),
      n.tspanner = sapply(top_coders, nrow),
      minimized.columns = 4,
      rnames = FALSE,
      col.rgroup = c("white", "#F0F0FF"))

| Coder | Total ave. downloads per day | No. of packages | Packages |
|:---|---:|---:|:---|
| **Top coders 2015** | | | |
| Gabor Csardi | 2,312 | 11 | sankey, franc, rvers… |
| Stefan Widgren | 1,563 | 1 | git2r |
| RStudio | 781 | 16 | shinydashboard, with… |
| Hadley Wickham | 695 | 12 | withr, cellranger, c… |
| Jeroen Ooms | 541 | 10 | rjade, js, sodium, w… |
| Richard Cotton | 501 | 22 | assertive.base, asse… |
| R Foundation | 490 | 1 | xml2 |
| Sindre Sorhus | 409 | 2 | praise, clisymbols |
| Richard Iannone | 294 | 2 | DiagrammeR, stationa… |
| **Top coders 2010-2015** | | | |
| Hadley Wickham | 32,115 | 55 | swirl, lazyeval, ggp… |
| Yihui Xie | 9,739 | 18 | DT, Rd2roxygen, high… |
| RStudio | 9,123 | 25 | shinydashboard, lazy… |
| Jeroen Ooms | 4,221 | 25 | JJcorr, gdtools, bro… |
| Justin Talbot | 3,633 | 1 | labeling |
| Winston Chang | 3,531 | 17 | shinydashboard, font… |
| Gabor Csardi | 3,437 | 26 | praise, clisymbols, … |
| Romain Francois | 2,934 | 20 | int64, LSD, RcppExam… |
| Duncan Temple Lang | 2,854 | 6 | RMendeley, jsonlite,… |
| Adrian A. Dragulescu | 2,456 | 2 | xlsx, xlsxjars |
| JJ Allaire | 2,453 | 7 | manipulate, htmlwidg… |
| Simon Urbanek | 2,369 | 15 | png, fastmatch, jpeg… |
| Dirk Eddelbuettel | 2,094 | 33 | Rblpapi, RcppSMC, RA… |
| Stefan Milton Bache | 2,069 | 3 | import, blatr, magri… |
| Douglas Bates | 1,966 | 5 | PKPDmodels, RcppEige… |
| Renaud Gaujoux | 1,962 | 6 | NMF, doRNG, pkgmaker… |
| Jelmer Ypma | 1,933 | 2 | nloptr, SparseGrid |
| Rob J Hyndman | 1,933 | 3 | hts, fpp, demography |
| Baptiste Auguie | 1,924 | 2 | gridExtra, dielectri… |
| Ulrich Halekoh Søren Højsgaard | 1,764 | 1 | pbkrtest |
| Martin Maechler | 1,682 | 11 | DescTools, stabledis… |
| Mirai Solutions GmbH | 1,603 | 3 | XLConnect, XLConnect… |
| Stefan Widgren | 1,563 | 1 | git2r |
| Edwin de Jonge | 1,513 | 10 | tabplot, tabplotGTK,… |
| Kurt Hornik | 1,476 | 12 | movMF, ROI, qrmtools… |
| Deepayan Sarkar | 1,369 | 4 | qtbase, qtpaint, lat… |
| Tyler Rinker | 1,203 | 9 | cowsay, wakefield, q… |
| Yixuan Qiu | 1,131 | 12 | gdtools, svglite, hi… |
| Revolution Analytics | 1,011 | 4 | doParallel, doSMP, r… |
| Torsten Hothorn | 948 | 7 | MVA, HSAUR3, TH.data… |

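
As a rough illustration of how the per-author credit is computed, here is a small base-R sketch on made-up data. The packages, authors, numbers, and the helper name `split_credit` are all invented for the example; the only parts taken from the code above are the splitting pattern and the rule of dividing the average by the number of authors and multiplying by 1.2 (which that code applies to single-author packages as well).

```r
# Made-up packages; names and numbers are purely illustrative
toy <- data.frame(
  name   = c("pkgA", "pkgB"),
  author = c("Alice Smith, Bob Jones and Carol White", "Alice Smith"),
  avg    = c(300, 50),
  stringsAsFactors = FALSE
)

# One row per (package, author): the package average is divided among
# its authors and then multiplied by 1.2
split_credit <- function(row) {
  authors <- strsplit(row$author, "[ ]*([,;]| and )[ ]*")[[1]]
  data.frame(
    author = authors,
    name   = row$name,
    avg    = row$avg / length(authors) * 1.2,
    stringsAsFactors = FALSE
  )
}

per_author <- do.call(
  rbind,
  lapply(seq_len(nrow(toy)), function(i) split_credit(toy[i, ]))
)

# Sum the credited averages per author and rank them
totals <- aggregate(avg ~ author, data = per_author, FUN = sum)
totals <- totals[order(-totals$avg), ]
```

In this toy case each pkgA author is credited 300 / 3 * 1.2 = 120, and the single author of pkgB is credited 50 * 1.2 = 60, so the credited total across authors exceeds the raw download average.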
It is worth mentioning that two of the top coders are companies: RStudio and Revolution Analytics. While I like the fact that R is free and open-source, I doubt that the community would have grown as quickly as it has without these companies. It is also symptomatic of 2015 that companies are taking R into account; it will be interesting to see what the R Consortium brings to the community. I think the r-hub is incredibly interesting and will hopefully make my life as an R-package developer easier.

## My own 2015-R-experience

My own personal R experience has been dominated by magrittr and dplyr, as seen in the above code. Like most, I find that magrittr makes things a little easier to read, and unless I have some really large dataset the overhead is small. It does have some downsides related to debugging, but these are negligible. When I originally tried dplyr out I came from the plyr environment and was disappointed by the lack of parallelization; I also found the concepts a little odd when thinking the plyr way. I had been using sqldf a lot in my data munging and merging, so when I found left_join, inner_join, and the brilliant anti_join I was completely sold. Combined with RStudio, I find the dplyr workflow both intuitive and more productive than my previous one. When looking at those packages (including more than just the top 10 here) I did find some additional gems that I intend to look into when I have the time:
• DiagrammeR: An interesting new way of producing diagrams. I’ve used it for Gantt charts but it allows for much more.
• checkmate: A neat package for checking function arguments.
• covr: An excellent package for testing how much of a package’s code is tested.
• rex: A package for making regular expressions easier.
• openxlsx: I wish I didn’t have to, but I still get a lot of things in Excel format; perhaps this package solves the Excel-import inferno…
• R6: The successor to reference classes; after working with the Gmisc::Transition-class I appreciate the need for a better system.
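
Coming back to the joins mentioned earlier, here is a minimal sketch of how `left_join`, `inner_join`, and `anti_join` differ. The two tables and their column names are invented for illustration and are not part of the data used in this post.

```r
library(dplyr)

# Toy tables; contents are invented for illustration
pkgs_toy <- data.frame(
  name = c("pkgA", "pkgB", "pkgC"),
  year = c(2014, 2015, 2015),
  stringsAsFactors = FALSE
)
downloads_toy <- data.frame(
  name = c("pkgA", "pkgC"),
  avg  = c(120, 35),
  stringsAsFactors = FALSE
)

# Keep every package; avg is NA where no download data exists
all_pkgs <- left_join(pkgs_toy, downloads_toy, by = "name")

# Keep only packages present in both tables
matched <- inner_join(pkgs_toy, downloads_toy, by = "name")

# Keep only packages that lack download data
missing <- anti_join(pkgs_toy, downloads_toy, by = "name")
```

The `anti_join` call is the one that is awkward to express otherwise: it returns the rows of the first table that have no match in the second, which is handy for spotting records that failed to merge.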