by Andrie de Vries
Every once in a while somebody asks me how many packages are on CRAN. (More than 8,000 in April 2016.) A year earlier, in April 2015, there were ~6,200 packages on CRAN.
This poses a second question: what is the historical growth of CRAN packages?
One source of information is Bob Muenchen's blog post R Now Contains 150 Times as Many Commands as SAS, which contains this graphic showing packages from 2002 through 2014. (Bob fitted a quadratic curve through the data, which fits quite well, except that the model overestimates the counts in the very early years.)
But where does this data come from? Bob's article references an earlier article by John Fox in the R Journal, Aspects of the Social Organization and Trajectory of the R Project. (This is a fascinating article, and I highly recommend you read it.) The analysis by John Fox contains this graphic showing data from 2001 through 2009. John fits an exponential growth curve through the data, which again fits very well:
I was particularly interested in finding the original source of the data. The original graphic contains a caption with references to the R source code on SVN, but there I could only find the release dates of historical R releases, not the package counts.
Next I put the search term “john fox 2009 cran package data” into my favourite search engine and came across the dataset CRANpackages in the package Ecdat. The Ecdat package contains data sets for econometrics, compiled by Spencer Graves.
I promptly installed the package and inspected the data:
> library(Ecdat)
> head(CRANpackages)
  Version       Date Packages            Source
1     1.3 2001-06-21      110          John Fox
2     1.4 2001-12-17      129          John Fox
3     1.5 2002-05-29      162          John Fox
4     1.6 2002-10-01      163 John Fox, updated
5     1.7 2003-05-27      219          John Fox
6     1.8 2003-11-16      273          John Fox
> tail(CRANpackages)
   Version       Date Packages         Source
24    2.15 2012-07-07     4000       John Fox
25    2.15 2012-11-01     4082 Spencer Graves
26    2.15 2012-12-14     4210 Spencer Graves
27    2.15 2013-10-28     4960 Spencer Graves
28    2.15 2013-11-08     5000 Spencer Graves
29     3.1 2014-04-13     5428 Spencer Graves
This data is exactly what I was after, but what is the origin?
> ?CRANpackages
Data casually collected on the number of packages on the Comprehensive R Archive Network (CRAN) at different dates.
So it seems this gets compiled and updated by hand, originally by John Fox, and more recently by Spencer Graves himself.
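To get a rough feel for the growth rate in this data, here is a minimal sketch that types in only the twelve rows printed above (not the full dataset) and fits a straight line on the log scale, similar in spirit to the exponential curve John Fox fitted. The names cran, Years, rate, annual_growth and doubling_time are my own choices for illustration, and with a large gap between 2003 and 2012 the numbers should be treated as indicative only.

```r
# The twelve rows shown by head() and tail() above, typed in by hand.
# This is NOT the full CRANpackages dataset.
cran <- data.frame(
  Date = as.Date(c("2001-06-21", "2001-12-17", "2002-05-29", "2002-10-01",
                   "2003-05-27", "2003-11-16", "2012-07-07", "2012-11-01",
                   "2012-12-14", "2013-10-28", "2013-11-08", "2014-04-13")),
  Packages = c(110, 129, 162, 163, 219, 273,
               4000, 4082, 4210, 4960, 5000, 5428)
)

# Time in years since the first observation.
cran$Years <- as.numeric(cran$Date - min(cran$Date)) / 365.25

# Exponential growth is a straight line on the log scale:
#   log(Packages) = a + b * Years
fit  <- lm(log(Packages) ~ Years, data = cran)
rate <- unname(coef(fit)["Years"])

annual_growth <- exp(rate) - 1   # implied growth per year
doubling_time <- log(2) / rate   # years needed for the count to double

print(round(c(annual_growth = annual_growth,
              doubling_time = doubling_time), 2))
```

With the Ecdat package installed, the same model could be fitted to the complete CRANpackages data frame instead of the hand-typed subset.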
Can we do better?
This set me thinking. Can we do better and automate this process by scraping CRAN?
However, you will have to scrape the dates from a list of package release dates for each historic release (you can find my code at the bottom of this blog).
I get the following result. Note that the rug marks indicate the release date and number of packages for each release. The data is plotted on a linear scale, not a log scale, but the rug marks give the illusion of a logarithmic scale.
I took a few shortcuts in the analysis:
- For each release, the actual data is a list of packages, as well as the publication date for each package. I took the date of the “release” as the very last package publication date. This means my estimate for the “release date” will be wrong. Specifically, in each case, the actual release would have occurred earlier.
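The first shortcut can be sketched as follows. This is not the scraping code itself: the data frame pkgs, its column names and all of its values are hypothetical toy data, made up purely to show how a stand-in release date and a package count can be derived from per-package publication dates.

```r
# Hypothetical toy data: one row per package per R release, with the date
# on which that package was published. All values are made up.
pkgs <- data.frame(
  Release   = c("2.0", "2.0", "2.0", "2.1", "2.1", "2.1", "2.1"),
  Package   = c("a", "b", "c", "a", "b", "c", "d"),
  Published = as.Date(c("2004-09-01", "2004-10-04", "2004-09-20",
                        "2005-03-15", "2005-04-18", "2005-02-02",
                        "2005-04-01"))
)

# One data frame per release.
by_release <- split(pkgs, pkgs$Release)

summary_by_release <- data.frame(
  Release  = names(by_release),
  # Number of packages listed for each release.
  Packages = unname(vapply(by_release, nrow, integer(1))),
  # The shortcut: the latest publication date stands in for the release date.
  # do.call(c, ...) keeps the Date class, which unlist() would drop.
  ReleaseDate = do.call(c, lapply(unname(by_release),
                                  function(d) max(d$Published))),
  row.names = NULL
)

print(summary_by_release)
```

Any error in the estimated date comes entirely from the max() step, so a different rule for picking the date would only need to change that one line.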
- I made no attempt to find the data prior to 2004.
The analysis could really benefit from fitting some curves through the data. Specifically, I would like to fit an exponential growth curve to see, for example, whether there are indications that the contribution rate is steady, accelerating or decelerating. Might an S-curve fit the data better?
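As a first, hedged look at the steady versus accelerating versus decelerating question, one simple check (a stand-in for a proper S-curve fit, not a replacement for it) is to add a quadratic term to a log-scale model. Under steady exponential growth the log of the package count is a straight line in time, so a clearly negative quadratic coefficient hints at a slowing rate and a positive one at an accelerating rate. The sketch uses only the twelve rows of CRANpackages printed earlier in this post, typed in by hand; the names cran, Years and curvature are my own.

```r
# Twelve rows of CRANpackages as printed by head() and tail() earlier,
# typed in by hand. This is NOT the full dataset.
cran <- data.frame(
  Date = as.Date(c("2001-06-21", "2001-12-17", "2002-05-29", "2002-10-01",
                   "2003-05-27", "2003-11-16", "2012-07-07", "2012-11-01",
                   "2012-12-14", "2013-10-28", "2013-11-08", "2014-04-13")),
  Packages = c(110, 129, 162, 163, 219, 273,
               4000, 4082, 4210, 4960, 5000, 5428)
)
cran$Years <- as.numeric(cran$Date - min(cran$Date)) / 365.25

# Steady exponential growth: log(Packages) is linear in Years.
fit_steady <- lm(log(Packages) ~ Years, data = cran)

# Allow the growth rate to drift over time via a quadratic term.
fit_drift <- lm(log(Packages) ~ Years + I(Years^2), data = cran)

# Sign of the quadratic coefficient:
#   negative -> growth rate slowing, positive -> growth rate increasing.
curvature <- unname(coef(fit_drift)["I(Years^2)"])
print(curvature)

# Formal comparison of the two nested models.
print(anova(fit_steady, fit_drift))
```

With so few points and nothing between 2003 and 2012, this says little about the shape of the curve in the gap; the full dataset, or scraped counts, would be needed before reading much into the result.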
The plot itself needs additional labels for the dot releases.
I hope to address these in a follow-up post.