[This article was first published on Topics in R, and kindly contributed to R-bloggers.]
Given the vast number of R packages available today, it makes sense (at least to me, as a trained economist) to ask a simple yet difficult question: How much value has been created by all those packages?
As all R stuff on CRAN is open-source (which is a blessing), there is no measurable GDP contribution in terms of market value that we can use to provide a quick answer. But all of us R users know the pleasant feeling, if not to say the excitement, of finding a package that provides exactly the functionality we have been seeking for so long. This saves us the time of developing the functionality ourselves. So, apparently, the time saved is one way to estimate the beneficial effect of package sharing on CRAN.
Here comes a simple (and not too serious) approach to estimating this effect.
(Side note: I am well aware of the extremely high concentration of capable statisticians and data scientists in the R community, so please be lenient with my approach. As you will see shortly, I am not aiming at delivering a scientific paper on the matter, although it might be worthwhile to do so. If there are already papers on the topic out there, I am sure they have figured out much better approaches; in this case, please simply leave a comment below.)
Without further ado, let’s get right into it:
Since recording began, the RStudio CRAN server has seen 1,121,724,508 package downloads as of today (afternoon [CET] of July 14th, 2018). This number has been generated by running through all the 12,781 R packages identified with the CRAN_package_db() function from the tools package, and adding up their download figures, which I have retrieved from the CRAN server logs via RStudio CRAN’s HTTP interface. This interface returns a JSON result which can easily be read using the fromJSON() function from the jsonlite package. To be a bit more precise: the whole operation was done with the buildIndex() function from my package packagefinder, as this integrates all this functionality.
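For readers who want to see what such a lookup might look like for a single package, here is a minimal sketch. It is not the code used by buildIndex(). The endpoint layout (cranlogs.r-pkg.org), the start date of the logs and the helper names cranlogs_total_url() and total_downloads() are my assumptions for illustration; only CRAN_package_db() and fromJSON() are taken from the description above.

```r
# Sketch only: the URL layout and the default start date are assumptions,
# not taken from the article. Both helper functions are hypothetical.

# Build the request URL for the total downloads of one package in a period.
cranlogs_total_url <- function(package, from, to) {
  sprintf("https://cranlogs.r-pkg.org/downloads/total/%s:%s/%s",
          from, to, package)
}

# Fetch the JSON result and return the download count.
# Requires internet access and the jsonlite package.
total_downloads <- function(package,
                            from = "2012-10-01",  # assumed start of the logs
                            to = "2018-07-14") {
  result <- jsonlite::fromJSON(cranlogs_total_url(package, from, to))
  sum(result$downloads)
}

# Usage (not run here, as it needs a network connection):
# packages <- tools::CRAN_package_db()$Package
# total_downloads(packages[1])
```

Looping such a call over all package names and summing the results would give a grand total comparable to the one quoted above, although the exact figure depends on the date range chosen.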
Let’s assume 30% of these downloads are ‘false positives’, i.e. cases in which the user realized the package is not really suitable for his/her purposes (and of course, in a more sophisticated approach we would need to account for package dependencies, as well; we neglect them here for the sake of simplicity). Removing the ‘false positives’ leaves us with 785,207,156 downloads.
Next, we assume that everyone who has downloaded a package would have developed the package’s functionality on his/her own if the package had not been available on CRAN. And let us further assume that this development effort would have taken one hour of work on average for each package. (You can play with the parameters of the model; one hour seems really, really low, at least to me, but let’s keep it conservative for now.)
But R users are not only extremely capable, almost ingenious programmers, they also have an incredible work ethic: Of course, everyone who works with R is an Elon Musk-style worker, that means he or she “puts in 80 to 90 hour work weeks, every week” (Musk in his own words). So, let’s be conservative and assume an agreeable 80-hour work week (there should be at least some work-life balance, after all; I mean, some people even have a family!).
Calculating our model with these parameters leads to the almost incredible amount of 188,235 work years saved (if you assume a year of 365/7 = 52.14 weeks; of course, our hard-working R user does not have any time for vacation or any other time off). If you assume a working life spans the ages of 18 to 70, the time saved by sharing packages on CRAN is equivalent to 3,620 working lives. A truly incredible number.

For all those who want to do the math themselves, here is the R code I used:

```r
library(packagefinder)

# Attention: The next statement takes quite some time to run
# (> 1 hour on my machine)
buildIndex("searchindex.rdata", TRUE)
load("searchindex.rdata")

# Total downloads across all packages in the index
package.count <- sum(searchindex$index$DOWNL_TOTAL)
# Remove the assumed 30% 'false positives'
package.count.corrected <- package.count * 0.7
# One hour of development effort saved per download
package.hours <- package.count.corrected * 1
# 80-hour work weeks, 365/7 weeks per year, working life from age 18 to 70
package.workweeks <- package.hours / 80
package.workyears <- package.workweeks / (365 / 7)
package.worklifes <- package.workyears / (70 - 18)
```
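If you want to play with the parameters without rebuilding the index, the same arithmetic can be wrapped in a small function. This is only a sketch: estimate_time_saved() and its argument names are hypothetical and not part of packagefinder, and the download total is simply the figure quoted above.

```r
# Hypothetical helper (not part of packagefinder): the model's assumptions
# become arguments so that they can be varied one at a time.
estimate_time_saved <- function(downloads,
                                false_positive_share = 0.3,
                                hours_per_package = 1,
                                hours_per_week = 80,
                                working_life_years = 70 - 18) {
  useful_downloads <- downloads * (1 - false_positive_share)
  hours <- useful_downloads * hours_per_package
  work_weeks <- hours / hours_per_week
  work_years <- work_weeks / (365 / 7)  # no vacation, as in the text
  work_lives <- work_years / working_life_years
  c(useful_downloads = useful_downloads,
    hours = hours,
    work_years = work_years,
    work_lives = work_lives)
}

# Baseline: the parameters used in the text
baseline <- estimate_time_saved(1121724508)
round(baseline)

# Variation: a 40-hour work week instead of the Musk-style 80 hours
regular_week <- estimate_time_saved(1121724508, hours_per_week = 40)
round(regular_week)
```

With the baseline parameters this reproduces the 188,235 work years and 3,620 working lives from above. Halving the weekly hours doubles both figures, which shows how strongly the result depends on the assumptions.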