
by Andrie de Vries

Every once in a while somebody asks me how many packages are on CRAN (more than 8,000 in April 2016). A year earlier, in April 2015, there were ~6,200 packages on CRAN.

This poses a second question: what is the historical growth of CRAN packages?

One source of information is Bob Muenchen's blog post R Now Contains 150 Times as Many Commands as SAS, which contains this graphic showing packages from 2002 through 2014. (Bob fitted a quadratic curve through the data. It fits quite well, except that the model overestimates the counts in the very early years.)

CRAN package data through 2014 by Bob Muenchen

But where does this data come from? Bob's article references an earlier article by John Fox in the R Journal, Aspects of the Social Organization and Trajectory of the R Project. (This is a fascinating article, and I highly recommend you read it.) The analysis by John Fox contains this graphic showing data from 2001 through 2009. John fits an exponential growth curve through the data, which again fits very well:

CRAN package data through 2009 by John Fox

I was particularly interested in finding the original source of the data. The original graphic contains a caption with references to the R source code on SVN, but there I could only find the release dates of historical R releases, not the package counts.

Next I put the search term “john fox 2009 cran package data” into my favourite search engine and came across the dataset CRANpackages in the package Ecdat. The Ecdat package contains data sets for econometrics, compiled by Spencer Graves.

I promptly installed the package and inspected the data:

> library(Ecdat)
> head(CRANpackages)
  Version       Date Packages            Source
1     1.3 2001-06-21      110         John Fox
2     1.4 2001-12-17      129         John Fox
3     1.5 2002-05-29      162         John Fox
4     1.6 2002-10-01      163 John Fox, updated
5     1.7 2003-05-27      219         John Fox
6     1.8 2003-11-16      273         John Fox

> tail(CRANpackages)
   Version       Date Packages         Source
24    2.15 2012-07-07     4000      John Fox
25    2.15 2012-11-01     4082 Spencer Graves
26    2.15 2012-12-14     4210 Spencer Graves
27    2.15 2013-10-28     4960 Spencer Graves
28    2.15 2013-11-08     5000 Spencer Graves
29     3.1 2014-04-13     5428 Spencer Graves



This data is exactly what I was after, but what is the origin?

> ?CRANpackages

Data casually collected on the number of packages on the Comprehensive R Archive Network (CRAN) at different dates.

So it seems this gets compiled and updated by hand, originally by John Fox, and more recently by Spencer Graves himself.

Can we do better?

This set me thinking. Can we do better and automate this process by scraping CRAN?

This is in fact possible, and you can find the source data at CRAN for older, archived releases (R-1.7 in 2004 through R-2.10 in 2010) as well as more recent releases.

However, you will have to scrape the dates from a list of package release dates for each historic release (you can find my code at the bottom of this blog post).

The results

I get the following result. Note that the rug marks indicate the release date and number of packages for each release. The data is plotted on a linear scale, not a log scale, but the rug marks give the illusion of a logarithmic scale.

CRAN package data through 2016 by Andrie de Vries

Caveat

I took a few shortcuts in the analysis:

• For each release, the actual data is a list of packages, as well as the publication date for each package. I took the date of the “release” as the very last package publication date. This means my estimate for the “release date” will be wrong. Specifically, in each case, the actual release would have occurred earlier.
• I made no attempt to find the data prior to 2004.
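
The first shortcut can be illustrated with a minimal sketch. This is not the code used for the plot above. It assumes a hypothetical data frame with one row per package per release, where the column names (release, package, published) and all the values are invented for illustration. For each release it counts the packages and takes the latest publication date as the estimated release date:

```r
# Hypothetical input: one row per package found in a given release's archive.
# Column names and values are made up purely for illustration.
pkgs <- data.frame(
  release   = c("2.9", "2.9", "2.9", "2.10", "2.10"),
  package   = c("pkgA", "pkgB", "pkgC", "pkgA", "pkgD"),
  published = as.Date(c("2009-03-01", "2009-05-20", "2009-04-11",
                        "2009-11-02", "2010-01-15")),
  stringsAsFactors = FALSE
)

# For each release, count distinct packages and use the most recent
# package publication date as a stand-in for the release date.
summarise_releases <- function(pkgs) {
  by_release <- split(pkgs, pkgs$release)
  out <- do.call(rbind, lapply(by_release, function(d) {
    data.frame(
      release  = d$release[1],
      date     = max(d$published),          # latest publication date
      packages = length(unique(d$package)), # number of distinct packages
      stringsAsFactors = FALSE
    )
  }))
  out <- out[order(out$date), ]             # sort chronologically
  rownames(out) <- NULL
  out
}

release_summary <- summarise_releases(pkgs)
print(release_summary)
```

Because max() picks the last package published into a release's archive, the estimated date can only be the same as or later than the true release date, which is the bias described in the first bullet.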

Further work

The analysis would really benefit from fitting some curves through the data. Specifically, I would like to fit an exponential growth curve to see, for example, whether there are indications that the contribution rate is steady, accelerating or decelerating. Might an S-curve fit the data better?
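
As a rough starting point, here is a minimal sketch of an exponential fit. It uses only the twelve rows of CRANpackages printed earlier in this post (typed in by hand, so there is a large gap between 2003 and 2012), not the scraped data, and it assumes that a simple linear regression of log(Packages) on time is an acceptable way to estimate the growth rate. The variable names are my own.

```r
# The head() and tail() rows of Ecdat::CRANpackages shown earlier,
# entered by hand so the example runs without installing Ecdat.
cran <- data.frame(
  Date = as.Date(c("2001-06-21", "2001-12-17", "2002-05-29", "2002-10-01",
                   "2003-05-27", "2003-11-16", "2012-07-07", "2012-11-01",
                   "2012-12-14", "2013-10-28", "2013-11-08", "2014-04-13")),
  Packages = c(110, 129, 162, 163, 219, 273,
               4000, 4082, 4210, 4960, 5000, 5428)
)

# Time in years since the first observation.
cran$years <- as.numeric(difftime(cran$Date, min(cran$Date),
                                  units = "days")) / 365.25

# Exponential growth is a straight line on the log scale:
# log(Packages) = a + b * years, so b is the continuous growth rate.
fit <- lm(log(Packages) ~ years, data = cran)
growth_rate   <- coef(fit)[["years"]]
doubling_time <- log(2) / growth_rate

cat(sprintf("Estimated growth rate: %.3f per year\n", growth_rate))
cat(sprintf("Estimated doubling time: %.2f years\n", doubling_time))
```

Plotting residuals(fit) against cran$years would be one way to look for acceleration or deceleration: a systematic curve in the residuals suggests that a single constant growth rate does not describe the whole period.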

The plot itself needs additional labels for the dot releases.

I hope to address these in a follow-up post.