What makes an R package popular? The number of people who use a given R package is a common point of discussion, but it turns out to be tricky to get hard data to answer this question. You can look at the “I use this” number on crantastic.org, but that's a self-reported number (many more than 38 people have installed ggplot2, for example).
To try to answer this question, Drew Conway and John Myles White have collected data from 52 R users about the packages they have installed. They have provided this data as the basis for building a recommendation engine to predict which R packages are most likely to be installed, based on factors like the number of other packages the package author maintains, what packages it depends on, and so on. Expanded into a matrix of all the packages and users, that comes to about 100,000 rows of data. The dependency graph for R packages is quite complex, by the way, as illustrated by the snippet below:
Can you use all this data to predict which R packages are installed the most? That's the challenge of the data hacking competition, which will be launched on Kaggle on Sunday. If you haven't heard of it, Kaggle is a website that hosts many such prediction competitions, encouraging data hackers from around the world to compete to find the best system to predict things like World Cup winners, Eurovision Song Contest votes and grandmaster chess rankings. Many of the competitors use R to build their models (for example, the winner of the HIV progression competition used the caret package), and it would of course make sense to use R for this competition. (Hence the Yo Dawg reference.)
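To make the shape of the data concrete, here is a minimal sketch of how a users-by-packages matrix expands into one row per (user, package) pair, together with a naive baseline that scores each package by the share of users who have it installed. The user names, the install sets and the row layout are all made up for illustration and are not the competition's actual file format. The sketch is in Python, though the same few lines translate directly to R.

```python
from itertools import product

# Hypothetical toy data: which packages each (made-up) user has installed.
installed = {
    "user_a": {"ggplot2", "plyr"},
    "user_b": {"ggplot2"},
    "user_c": {"caret", "ggplot2"},
}
packages = sorted(set().union(*installed.values()))

# One row per (user, package) pair, with a 0/1 "installed" flag.
# With 3 users and 3 packages this gives 9 rows; the same expansion
# over many more users and packages is what produces a large row count.
rows = [
    (user, pkg, int(pkg in installed[user]))
    for user, pkg in product(sorted(installed), packages)
]


def install_rates(rows):
    """Fraction of users who have each package installed (naive baseline)."""
    totals, hits = {}, {}
    for _, pkg, flag in rows:
        totals[pkg] = totals.get(pkg, 0) + 1
        hits[pkg] = hits.get(pkg, 0) + flag
    return {pkg: hits[pkg] / totals[pkg] for pkg in totals}


rates = install_rates(rows)
# Print packages from most to least commonly installed.
for pkg, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{pkg}: {rate:.2f}")
```

A real entry would go beyond this per-package average and bring in the predictors mentioned above, such as a package's dependencies and how many other packages its author maintains.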
The competition opens on Sunday and runs for four months. Full details at the link below.