Finally! Tracking CRAN packages downloads

June 11, 2013
By

(This article was first published on Nicebread » R, and kindly contributed to R-bloggers)

[Update June 12: Data.tables functions have been improved (thanks to a comment by Matthew Dowle); for a similar approach see also Tal Galili's post]

The guys from RStudio now provide CRAN download logs (see also this blog post). Great work!

I always asked myself, how many people actually download my packages. Now I finally can get an answer (… with some anxiety to get frustrated ;-)
Here are the complete, self-contained R scripts to analyze these log data:

Step 1: Download all log files in a subfolder (this steps takes a couple of minutes)

?View Code RSPLUS
## ======================================================================
## Step 1: Download all log files
## ======================================================================
 
# Here's an easy way to get all the URLs in R
start <- as.Date('2012-10-01')
today <- as.Date('2013-06-10')
 
all_days <- seq(start, today, by = 'day')
 
year <- as.POSIXlt(all_days)$year + 1900
urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz')
 
# only download the files you don't have:
missing_days <- setdiff(as.character(all_days), tools::file_path_sans_ext(dir("CRANlogs"), TRUE))
 
dir.create("CRANlogs")
for (i in 1:length(missing_days)) {
  print(paste0(i, "/", length(missing_days)))
  download.file(urls[i], paste0('CRANlogs/', missing_days[i], '.csv.gz'))
}

 
 

Step 2: Combine all daily files into one big data table (this steps also takes a couple of minutes…)

?View Code RSPLUS
## ======================================================================
## Step 2: Load single data files into one big data.table
## ======================================================================
 
file_list <- list.files("CRANlogs", full.names=TRUE)
 
logs <- list()
for (file in file_list) {
	print(paste("Reading", file, "..."))
	logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", as.is=TRUE)
}
 
# rbind together all files
library(data.table)
dat <- rbindlist(logs)
 
# add some keys and define variable types
dat[, date:=as.Date(date)]
dat[, package:=factor(package)]
dat[, country:=factor(country)]
dat[, weekday:=weekdays(date)]
dat[, week:=strftime(as.POSIXlt(date),format="%Y-%W")]
 
setkey(dat, package, date, week, country)
 
save(dat, file="CRANlogs/CRANlogs.RData")
 
# for later analyses: load the saved data.table
# load("CRANlogs/CRANlogs.RData")

 
 

Step 3: Analyze it!

?View Code RSPLUS
## ======================================================================
## Step 3: Analyze it!
## ======================================================================
 
library(ggplot2)
library(plyr)
 
str(dat)
 
# Overall downloads of packages
d1 <- dat[, length(week), by=package]
d1 <- d1[order(V1), ]
d1[package=="TripleR", ]
d1[package=="psych", ]
 
# plot 1: Compare downloads of selected packages on a weekly basis
agg1 <- dat[J(c("TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")]
 
ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x  = element_text(angle=90, size=8, vjust=0.5))
 
 
agg1 <- dat[J(c("psych", "TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")]
 
ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x  = element_text(angle=90, size=8, vjust=0.5))

 
 
Here are my two packages, TripleR and RSA. Actually, ~30 downloads per week (from this single mirror) is much more than I’ve expected!Bildschirmfoto 2013-06-11 um 14.11.30

 

To put things in perspective: package psych included in the plot:

Bildschirmfoto 2013-06-11 um 14.11.43

Some psychological sidenotes on social comparisons:

  • Downward comparisons enhance well-being, extreme upward comparisons are detrimental. Hence, do never include ggplot2 into your graphic!
  • Upward comparisons instigate your achievement motive, and give you drive to get better. Hence, select some packages, which are slightly above your own.
  • Of course, things are a bit more complicated than that …

All source code on this post is licensed under the FreeBSD license.

To leave a comment for the author, please follow the link and comment on his blog: Nicebread » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.