A Kaggle Dataset of R Package History for rstudio::conf(2022)
It’s summer, and the long-awaited RStudio conference for 2022 is only days away. Next week, a large number of R aficionados will gather in Washington DC for the first time in person since the beginning of the pandemic. A pandemic, mind you, that is far from over. But Covid precautions are in place, and I trust the R community more than most to be responsible and thoughtful. With masks, social distancing, and outdoor events, I’m excited to meet new people and to see again many familiar faces from my first RStudio conference in 2020.
To create even more excitement, this time I’m giving a talk about the Kaggle and R communities, and all the good things that can happen when those worlds interact. In addition to this talk, which aims to introduce an R audience to the opportunities of Kaggle, I have also prepared a new Kaggle dataset for that audience to get started on the platform. This post is about that dataset: comprehensive data on all R packages currently on CRAN, and on their full release history.
Let’s get started with the packages we need, including two that I found instrumental for querying info from CRAN: the powerful tools package and the more specialised packageRank package. Together, the functions in those packages made my task much easier than expected.
libs <- c('dplyr', 'tibble',       # wrangling
          'tidyr', 'stringr',      # wrangling
          'readr',                 # read files
          'tools', 'packageRank',  # CRAN package info
          'ggplot2', 'ggthemes',   # plots
          'gt', 'lubridate')       # tables & time
invisible(lapply(libs, library, character.only = TRUE))
Complete list of CRAN packages
Initially, my thought was to scrape the package information directly from CRAN, the Comprehensive R Archive Network. It is the central repository for R packages. CRAN describes itself as “a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R.” If you’re installing an R package in the standard way, then it is provided by one of the CRAN mirrors. (The install.packages function takes a repos argument that you can set to any of the mirrors, or to the central “http://cran.r-project.org”.)
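For example, to install from an explicitly chosen repository (the cloud mirror URL here is just one common choice, not something prescribed by the post):

```r
# Install dplyr from an explicitly specified CRAN mirror
install.packages("dplyr", repos = "https://cloud.r-project.org")
```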
CRAN provides full lists of all available packages by name and by date of publication. The latter page in particular has a nice HTML table with all package names, titles, and dates. This would be easy to scrape. If you want an intro to web scraping with the rvest package, then check out a previous blog post of mine.
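As a minimal sketch (assuming the current URL of CRAN’s “by date” listing; I don’t run this below, since a gentler approach exists), that table could be read with rvest like this:

```r
library(rvest)

# Read the "packages by date" listing into a data frame
url <- "https://cran.r-project.org/web/packages/available_packages_by_date.html"
pkgs_by_date <- read_html(url) %>%
  html_element("table") %>%
  html_table()
```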
However, the R community has once again made my task much easier. As I was pondering a respectful and responsible scraping strategy, I came across this post on Scraping Responsibly with R by Steven Mortimer, who was working on scraping CRAN downloads. In it, he quoted a tweet by Maëlle Salmon recommending tools::CRAN_package_db as a gentler approach.
This tool is indeed very fast and powerful. It provides a lot of columns. For the sake of a simple dataset, I’m only selecting a subset of those features here. Feel free to explore the full range.
df <- tools::CRAN_package_db() %>%
  as_tibble() %>%
  janitor::clean_names() %>%
  select(package, version, depends, imports, license, needs_compilation,
         author, bug_reports, url, date_published = published,
         description, title) %>%
  mutate(needs_compilation = needs_compilation == "yes")
The columns I picked include names, versions, dates, and information about dependencies, authors, descriptions, and web links. Here are the first 50 rows:
df %>%
  head(50) %>%
  gt() %>%
  tab_header(
    title = md("**A full list of R packages on CRAN derived via tools::CRAN_package_db**")
  ) %>%
  opt_row_striping() %>%
  tab_options(container.height = px(600))
[Table: “A full list of R packages on CRAN derived via tools::CRAN_package_db” — the first 50 packages alphabetically (A3 through ACDm), with columns package, version, depends, imports, license, needs_compilation, author, bug_reports, url, date_published, description, and title.]
Now you could take this data, aggregate the date_published by month, and plot the growth of the R ecosystem for yourself. For instance like this:

df %>%
  mutate(date = floor_date(ymd(date_published), unit = "month")) %>%
  filter(!is.na(date)) %>%
  count(date) %>%
  arrange(date) %>%
  mutate(cumul = cumsum(n)) %>%
  ggplot(aes(date, cumul)) +
  geom_line(col = "blue") +
  theme_minimal() +
  labs(x = "Date", y = "Cumulative count",
       title = "Cumulative count of CRAN packages by date of latest version")
But this doesn’t really show you the true historical growth, does it? The x-axis range and my plot title already tell you what’s going on here. The date_published that CRAN_package_db gives us (and that the CRAN website lists) corresponds to the last published version of the package. See for instance the entry for dplyr:
df %>%
  filter(package == "dplyr") %>%
  select(package, version, date_published) %>%
  gt()
package  version  date_published
dplyr    1.0.9    2022-04-28
The cornerstone of the tidyverse was, of course, first published quite a while before April 2022. Version 1.0.9 is simply the most recent release at the time of writing.
Naturally, that means that in this table, packages with frequent updates will be weighted more heavily towards recent dates. Which is perfectly fine if you’re only interested in the most recent version of each package. But if you, like me, want to see how the R ecosystem grew over time, then you need the historical dates of the first published versions. This is where our next package comes in.
Package history
I found the packageRank package by googling “R package history”. Its documentation on GitHub is detailed, and it has performed very well for me. You give its packageHistory function a package name, and it does the rest. Let’s find out more about the release history of our favourite dplyr package:
df_hist_dplyr <- packageRank::packageHistory(package = "dplyr", check.package = TRUE) %>%
  as_tibble() %>%
  janitor::clean_names()

df_hist_dplyr %>%
  gt() %>%
  tab_header(
    title = md("**The release history of the dplyr package**")
  ) %>%
  opt_row_striping() %>%
  tab_options(container.height = px(400))
The release history of the dplyr package  

package  version  date  repository 
dplyr  0.1  2014-01-16  Archive 
dplyr  0.1.1  2014-01-29  Archive 
dplyr  0.1.2  2014-02-24  Archive 
dplyr  0.1.3  2014-03-15  Archive 
dplyr  0.2  2014-05-21  Archive 
dplyr  0.3  2014-10-04  Archive 
dplyr  0.3.0.1  2014-10-08  Archive 
dplyr  0.3.0.2  2014-10-11  Archive 
dplyr  0.4.0  2015-01-08  Archive 
dplyr  0.4.1  2015-01-14  Archive 
dplyr  0.4.2  2015-06-16  Archive 
dplyr  0.4.3  2015-09-01  Archive 
dplyr  0.5.0  2016-06-24  Archive 
dplyr  0.7.0  2017-06-09  Archive 
dplyr  0.7.1  2017-06-22  Archive 
dplyr  0.7.2  2017-07-20  Archive 
dplyr  0.7.3  2017-09-09  Archive 
dplyr  0.7.4  2017-09-28  Archive 
dplyr  0.7.5  2018-05-19  Archive 
dplyr  0.7.6  2018-06-29  Archive 
dplyr  0.7.7  2018-10-16  Archive 
dplyr  0.7.8  2018-11-10  Archive 
dplyr  0.8.0  2019-02-14  Archive 
dplyr  0.8.0.1  2019-02-15  Archive 
dplyr  0.8.1  2019-05-14  Archive 
dplyr  0.8.2  2019-06-29  Archive 
dplyr  0.8.3  2019-07-04  Archive 
dplyr  0.8.4  2020-01-31  Archive 
dplyr  0.8.5  2020-03-07  Archive 
dplyr  1.0.0  2020-05-29  Archive 
dplyr  1.0.1  2020-07-31  Archive 
dplyr  1.0.2  2020-08-18  Archive 
dplyr  1.0.3  2021-01-15  Archive 
dplyr  1.0.4  2021-02-02  Archive 
dplyr  1.0.5  2021-03-05  Archive 
dplyr  1.0.6  2021-05-05  Archive 
dplyr  1.0.7  2021-06-18  Archive 
dplyr  1.0.8  2022-02-08  Archive 
dplyr  1.0.9  2022-04-28  CRAN 
First released in January 2014. It’s been an interesting journey for the tidyverse since then. I first started using the tidy packages in 2017, and I would find it hard to go back to base R now.
With data like this for a single R package, you could for instance investigate the yearly frequency of releases over time:
df_hist_dplyr %>%
  mutate(year = floor_date(date, unit = "year")) %>%
  count(year) %>%
  ggplot(aes(year, n)) +
  geom_col(fill = "purple") +
  scale_x_date() +
  theme_hc() +
  labs(x = "", y = "",
       title = "Number of releases of 'dplyr' per year - in July 2022")
Interesting pattern there from 2014 to 2016. Since then, the number of releases has been pretty consistent. At the time of writing, we are still in the middle of 2022.
(As a side note, you’ll see that I’ve decided not to use an x-axis label here. I feel that a year axis is often self-explanatory; and I’ve used a descriptive title to prevent misinterpretation. Let me know if you disagree.)
To get the history for all the entries in our complete list of CRAN packages, we can then simply loop through the package names. The loop takes about an hour, but you don’t have to run it yourself. This is what I created the Kaggle dataset for. You can download “cran_package_history.csv” and start working with it immediately.
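For reference, such a loop could be sketched roughly like this. This is a minimal, hypothetical version of the collection step, not my exact pipeline: the vector of package names comes from CRAN_package_db, and in practice you would want error handling (e.g. purrr::possibly) and some patience, since it queries CRAN once per package.

```r
library(dplyr)
library(purrr)

# All current CRAN package names (column name as returned by CRAN_package_db)
pkgs <- tools::CRAN_package_db()$Package

# Query the full version history for each package and stack the results.
# Slow: roughly one request per package.
df_hist <- pkgs %>%
  map(~ packageRank::packageHistory(package = .x, check.package = FALSE)) %>%
  bind_rows() %>%
  as_tibble() %>%
  janitor::clean_names()

readr::write_csv(df_hist, "cran_package_history.csv")
```

Again: no need to run this yourself if you just want the data; that is what the Kaggle CSV is for.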
df_hist <- read_csv("../../static/files/cran_package_history.csv", col_types = cols())
Here are the first 50 rows:
df_hist %>%
  head(50) %>%
  gt() %>%
  tab_header(
    title = md("**The first rows of the cran_package_history.csv table**")
  ) %>%
  opt_row_striping() %>%
  tab_options(container.height = px(400))
The first rows of the cran_package_history.csv table  

package  version  date  repository 
A3  0.9.1  2013-02-07  Archive 
A3  0.9.2  2013-03-26  Archive 
A3  1.0.0  2015-08-16  CRAN 
AATtools  0.0.1  2020-06-14  CRAN 
ABACUS  1.0.0  2019-09-20  CRAN 
abbreviate  0.1  2021-12-14  CRAN 
abbyyR  0.1  2015-06-12  Archive 
abbyyR  0.2  2015-09-12  Archive 
abbyyR  0.2.1  2015-11-04  Archive 
abbyyR  0.2.2  2015-11-06  Archive 
abbyyR  0.2.3  2015-12-06  Archive 
abbyyR  0.3  2016-02-04  Archive 
abbyyR  0.4.0  2016-05-16  Archive 
abbyyR  0.5.0  2016-06-20  Archive 
abbyyR  0.5.1  2017-04-12  Archive 
abbyyR  0.5.3  2018-05-28  Archive 
abbyyR  0.5.4  2018-05-30  Archive 
abbyyR  0.5.5  2019-06-25  CRAN 
abc  1.0  2010-10-05  Archive 
abc  1.1  2010-10-11  Archive 
abc  1.2  2011-01-15  Archive 
abc  1.3  2011-05-10  Archive 
abc  1.4  2011-09-04  Archive 
abc  1.5  2012-08-08  Archive 
abc  1.6  2012-08-14  Archive 
abc  1.7  2013-06-06  Archive 
abc  1.8  2013-10-29  Archive 
abc  2.0  2014-07-11  Archive 
abc  2.1  2015-05-05  Archive 
abc  2.2.1  2022-05-19  CRAN 
abc.data  1.0  2015-05-05  CRAN 
ABC.RAP  0.9.0  2016-10-20  CRAN 
abcADM  1.0  2019-11-13  CRAN 
ABCanalysis  1.0  2015-02-13  Archive 
ABCanalysis  1.0.1  2015-04-20  Archive 
ABCanalysis  1.0.2  2015-06-15  Archive 
ABCanalysis  1.1.0  2015-09-28  Archive 
ABCanalysis  1.1.1  2016-06-15  Archive 
ABCanalysis  1.1.2  2016-08-23  Archive 
ABCanalysis  1.2.1  2017-03-13  CRAN 
abclass  0.1.0  2022-03-07  Archive 
abclass  0.2.0  2022-04-12  Archive 
abclass  0.3.0  2022-05-28  CRAN 
ABCoptim  0.13.10  2013-10-21  Archive 
ABCoptim  0.13.11  2013-11-06  Archive 
ABCoptim  0.14.0  2016-11-17  Archive 
ABCoptim  0.15.0  2017-11-06  CRAN 
ABCp2  1.0  2013-04-10  Archive 
ABCp2  1.1  2013-07-23  Archive 
ABCp2  1.2  2016-02-04  CRAN 
Now we can filter the initial release date for each package and visualise the number of CRAN packages created over time:
df_hist %>%
  group_by(package) %>%
  slice_min(order_by = date, n = 1) %>%
  ungroup() %>%
  mutate(month = floor_date(date, unit = "month")) %>%
  count(month) %>%
  arrange(month) %>%
  mutate(cumul = cumsum(n)) %>%
  ggplot(aes(month, cumul)) +
  geom_line(col = "purple") +
  theme_minimal() +
  labs(x = "Date", y = "Cumulative count",
       title = "Cumulative count of CRAN packages by date of first release")
Still impressive growth, and now we give proper emphasis to the early history of CRAN, reaching back all the way to before the year 2000. There are many more angles and visuals that this dataset will allow you to explore.
Notes and suggestions:
I will keep updating this dataset on a monthly basis. After the initial collection of all the version histories, from now on I only need to update the histories for those packages that have released a new version. This should simplify the process significantly.
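The incremental update could look something like the sketch below. This is an assumed approach, not my exact pipeline: compare the latest recorded version per package in the existing CSV against the current CRAN listing, and re-query only the packages that differ.

```r
library(dplyr)
library(readr)

# Latest recorded version per package in the existing dataset
old_hist <- read_csv("cran_package_history.csv", col_types = cols())
latest_known <- old_hist %>%
  group_by(package) %>%
  slice_max(order_by = date, n = 1) %>%
  ungroup() %>%
  select(package, version)

# Current versions on CRAN
current <- tools::CRAN_package_db() %>%
  as_tibble() %>%
  select(package = Package, version = Version) %>%
  distinct(package, .keep_all = TRUE)

# Packages that are new, or whose CRAN version differs from our records:
# only these need to go through packageHistory again
to_update <- current %>%
  anti_join(latest_known, by = c("package", "version"))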
Some ideas for EDA and analysis: how long did packages take from their first release to version 1.0? What type of packages were most frequent in different years? Who are the most productive authors? Can you predict the growth toward 2025?
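As a starting point for the first of those questions, here is one possible sketch, assuming a df_hist table as above. Note that not every package follows semantic versioning, so treating a leading "1." in the version string as "version 1.0 reached" is a simplification.

```r
library(dplyr)

time_to_v1 <- df_hist %>%
  group_by(package) %>%
  summarise(
    first_release = min(date),
    # first release whose version string starts with "1." (a simplification)
    first_v1 = if (any(grepl("^1\\.", version))) {
      min(date[grepl("^1\\.", version)])
    } else {
      as.Date(NA)
    },
    .groups = "drop"
  ) %>%
  filter(!is.na(first_v1)) %>%
  mutate(days_to_v1 = as.numeric(first_v1 - first_release)) %>%
  arrange(desc(days_to_v1))
```

For dplyr, this would measure the span from 2014-01-16 (version 0.1) to 2020-05-29 (version 1.0.0).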
All of this analysis can be done directly on the Kaggle platform! On the dataset page, on the top right, you will see a button called “New Notebook”. Click that to get an interactive editor in R or Python and start exploring immediately.
Have fun!