A competition to predict popular R packages

Posted on October 8, 2010 by David Smith

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]


What makes an R package popular? The number of people who use a given R package is a common point of discussion, but it turns out that it's kind of tricky to get hard and fast data to answer this question. You can look at the “I use this” number on crantastic.org, but that number is self-reported and undercounts real usage (many more than 38 people have installed ggplot2, for example).

To try to answer this question, Drew Conway and John Myles White have collected data from 52 R users about the packages they have installed. They have provided this data as the basis for building a recommendation engine that predicts which R packages are most likely to be installed, based on factors like the number of other packages the package author maintains, what packages it depends on, and so on. When blown out as a matrix of all the packages and users, that turns out to be about 100,000 rows of data. The dependency graph for R packages is quite complex, by the way.

[Figure: snippet of the R package dependency graph]

Can you use all this data to predict which R packages are installed the most? That's the challenge of the data hacking competition, which will be launched on Kaggle on Sunday. If you haven't heard of it, Kaggle is a website that hosts many such prediction competitions, encouraging data hackers from around the world to compete to find the best system to predict things like World Cup winners, Eurovision Song Contest votes and grandmaster chess rankings. Many of the competitors use R to build their models (for example, the winner of the HIV progression competition used the caret package), and it would of course make sense to use R for this competition. (Hence the Yo Dawg reference.)
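To make the shape of the data concrete, here is a minimal sketch of how a users-by-packages matrix can be expanded into one row per (package, user) pair, along with a naive install-rate baseline. The user names, package sets and variable names are made up for illustration; this is not the competition's actual data or file format. Python is used here only for brevity, and the same idea carries over directly to R.

```python
from itertools import product

# Hypothetical toy data: the packages each user reports as installed.
installed = {
    "user_1": {"ggplot2", "caret"},
    "user_2": {"ggplot2"},
    "user_3": {"ggplot2", "plyr"},
}
users = sorted(installed)
packages = sorted(set().union(*installed.values()))

# One row per (package, user) pair, with a 0/1 "installed" flag.
rows = [
    (pkg, user, int(pkg in installed[user]))
    for pkg, user in product(packages, users)
]

# Naive baseline: the fraction of users who have each package installed.
install_rate = {
    pkg: sum(flag for p, _, flag in rows if p == pkg) / len(users)
    for pkg in packages
}

for pkg in packages:
    print(pkg, round(install_rate[pkg], 2))
```

If every user-package pair gets a row in this way, 52 users and about 100,000 rows would imply something on the order of 1,900 packages. A real entry would replace the baseline with a model that uses package-level features such as dependencies and maintainer activity.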

The competition opens on Sunday and runs for four months. Full details at the link below.

dataists: Using Data Tools to Find Data Tools, the Yo Dawg of Data Hacking
