A competition to predict popular R packages

October 8, 2010

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

What makes an R package popular? How many people use a given R package is a common point of discussion, but it turns out to be kind of tricky to get hard data to answer this question. You can look at the "I use this" number on crantastic.org, but that's a self-reported number (many more than 38 people have installed ggplot2, for example).

To try to answer this question, Drew Conway and John Myles White have collected data from 52 R users about the packages they have installed. They have provided this data as the basis for building a recommendation engine to predict which R packages are most likely to be installed, based on factors like the number of other packages the package author maintains, which packages it depends on, and so on. When expanded into a matrix of all the packages and users, that turns out to be about 100,000 rows of data. The dependency graph for R packages is quite complex, by the way, as illustrated by the snippet below:

[Figure: screenshot of a portion of the R package dependency graph]
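
To make the "matrix of all the packages and users" idea concrete, here is a minimal sketch in Python of how per-user install lists might be expanded into one row per user and package. The user names, package lists and the `installed_by_user` structure are made up for illustration; the actual competition data may be laid out differently.

```python
from itertools import product

# Hypothetical toy data: the real data set covers 52 users and far more packages.
installed_by_user = {
    "user_1": {"ggplot2", "caret"},
    "user_2": {"ggplot2"},
    "user_3": set(),
}
all_packages = ["caret", "ggplot2", "plyr"]

# One row per (user, package) pair, with a 0/1 flag for "installed".
rows = [
    (user, package, int(package in installed_by_user[user]))
    for user, package in product(sorted(installed_by_user), all_packages)
]

for row in rows:
    print(row)
```

Three users and three packages give nine rows here. By the same arithmetic, 52 users and roughly 100,000 rows would imply on the order of 2,000 packages, though that is a back-of-the-envelope figure rather than a number from the organizers.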

Can you use all this data to predict which R packages are installed the most? That's the challenge of the data hacking competition, which will be launched on Kaggle on Sunday. If you haven't heard of it, Kaggle is a website that hosts many such prediction competitions, encouraging data hackers from around the world to compete to find the best system to predict things like World Cup winners, Eurovision Song Contest votes and grandmaster chess rankings. Many of the competitors use R to build their models (for example, the winner of the HIV progression competition used the caret package), and it would of course make sense to use R for this competition. (Hence the Yo Dawg reference.)

The competition opens on Sunday and runs for four months. Full details at the link below.

dataists: Using Data Tools to Find Data Tools, the Yo Dawg of Data Hacking
