# Using Data Tools to Find Data Tools, the Yo Dawg of Data Hacking

October 7, 2010
(This article was first published on dataists » R Explorations, and kindly contributed to R-bloggers)

by John Myles White and Drew Conway

Editors’ Note: One theme likely to recur on dataists.com is that data hackers love using their tools to analyze, visualize, and predict everything. Data hackers also love discovering and learning about new tools. So it should come as no surprise that Dataist contributors John Myles White and Drew Conway thought to develop a model that can predict which R packages a particular user would like. And in the spirit of friendly competition, they’re opening it up for others to participate!

### Introduction

A graphical visualization of packages’ “suggestion” relationships. Affectionately referred to as the R Flying Spaghetti Monster. More info below.

As part of the kickoff for dataists, we’re announcing a data hacking contest tailored to the statistical computing community. Contestants will build a recommendation engine for R packages. The contest is being administered in collaboration with Kaggle. If you’re interested in the details of the contest, please read on.

By sponsoring this contest, we’re hoping to encourage the data hacking community to use its skills to build a recommendation engine that will help R programmers find the best packages on CRAN, the standard repository for R libraries. Like many data-driven projects, the question has evolved with the availability of data. We started with the question, “which packages are best?” and replaced it with the empirical question, “which packages are used most often?” This is also quite a difficult question to answer, because the appropriate data set is neither readily available nor easily acquired. For that reason, we’ve settled on the more manageable question, “which packages are most often installed by normal R users?”

This last question could be answered in a variety of ways. Our current approach uses a convenience sample of installation data that we’ve collected from volunteers in the R community, who kindly agreed to send us a list of the packages they have on their systems. We’ve anonymized this data and compiled a set of metadata-based predictors that allow us to predict the installation probabilities quite well. We’re releasing all of our current work, including the data we have and all of the code we’ve used so far for our exploratory analyses. The contest itself will go live on Kaggle on Sunday and will end four months later, on February 10, 2011. The rules, prizes and official data sets are all described below.

### Rules and Prizes

To win the contest, you need to predict, for every pair (U, P), the probability that user U has package P installed on their system. We’ll assess your performance using ROC methods, evaluated against a held-out test data set. The winning team will receive three UseR! books of their choosing. In order to win the contest, you’ll have to provide your analysis code to us by creating a fork of our GitHub repository. You’ll also be required to provide a written description of your approach. We’re asking for so much openness from the winning team because we want this contest to serve as a stepping stone for the R community. We’re also hoping that enterprising data hackers will extend the lessons learned through this contest to other programming languages.
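To make the scoring idea concrete, here is a minimal sketch of one common ROC summary, the area under the curve (AUC), computed in base R from ranks. It assumes the contest metric is AUC, which the rules above do not state explicitly. The `auc` helper, the `Predicted` column and the toy values are all invented for illustration.

```r
# Rank-based AUC: the probability that a randomly chosen installed pair
# receives a higher score than a randomly chosen non-installed pair.
# NOTE: assuming AUC is the "ROC method" used for scoring; this helper is
# an illustration, not the official evaluation code.
auc <- function(installed, predicted) {
  stopifnot(length(installed) == length(predicted))
  r <- rank(predicted)              # average ranks handle tied scores
  n_pos <- sum(installed == 1)
  n_neg <- sum(installed == 0)
  (sum(r[installed == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: six (User, Package) pairs with made-up predictions.
toy <- data.frame(
  User      = c(1, 1, 1, 2, 2, 2),
  Package   = c("abind", "boot", "coda", "abind", "boot", "coda"),
  Installed = c(1, 0, 1, 0, 0, 1),
  Predicted = c(0.9, 0.2, 0.6, 0.65, 0.1, 0.7),
  stringsAsFactors = FALSE
)

toy_auc <- auc(toy$Installed, toy$Predicted)
print(toy_auc)
```

In this toy case eight of the nine installed/non-installed pairs are ordered correctly, so the score is 8/9. A model that ranks every installed pair above every non-installed pair would score 1, and random guessing would score about 0.5.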

### Getting Started

To get started, you can go to GitHub to download the primary data sets and code. The sections below describe the data sets that you can download and the baseline model you should try to beat.

### Data Sets

For this contest, there are really three data sets. At the start, you’ll want to download the heavily preprocessed data set that we’ll be providing to you through Kaggle. This data set is also available on GitHub, where it is labeled as training_data.csv. This file contains a matrix with roughly 100,000 rows and 16 columns, representing installation information for all existing R packages and 52 users. The test data set against which your performance will be evaluated contains approximately another 30,000 rows.

Each row of this matrix contains the following information:

1. Package: The name of the current R package.
2. DependencyCount: The number of other R packages that depend upon the current package.
3. SuggestionCount: The number of other R packages that suggest the current package.
4. ImportCount: The number of other R packages that import the current package.
5. ViewsIncluding: The number of task views on CRAN that include the current package.
6. CorePackage: A dummy variable indicating whether the current package is part of core R.
7. RecommendedPackage: A dummy variable indicating whether the current package is a recommended R package.
8. Maintainer: The name and e-mail address of the package’s maintainer.
9. PackagesMaintaining: The number of other R packages that are being maintained by the current package’s maintainer.
10. User: The numeric ID of the current user who may or may not have installed the current package.
11. Installed: A dummy variable indicating whether the current package was installed by the current user.

In addition to these central predictors, we are including logarithmic transforms of the non-binary predictors because we find that this improves the model’s fit to the full data set. For that reason, the last five columns of our data set are:

1. LogDependencyCount
2. LogSuggestionCount
3. LogImportCount
4. LogViewsIncluding
5. LogPackagesMaintaining
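As a rough sketch of how log-transformed columns like these might be built and used, the example below simulates a small data frame, adds `Log*` columns, and fits a logistic regression. Two things are assumptions rather than facts from this post: the transform is taken to be `log(1 + x)` (because the counts can be zero), and the model is taken to be a logistic regression. The simulated values stand in for `training_data.csv` and are not real contest data.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-in for training_data.csv (all values are made up).
n <- 200
toy <- data.frame(
  DependencyCount = rpois(n, 2),
  SuggestionCount = rpois(n, 1),
  ViewsIncluding  = rpois(n, 0.5)
)

# ASSUMPTION: log1p(x) = log(1 + x) is used because counts can be zero;
# the exact transform is not specified in the text.
count_columns <- c("DependencyCount", "SuggestionCount", "ViewsIncluding")
for (column in count_columns) {
  toy[[paste0("Log", column)]] <- log1p(toy[[column]])
}

# Simulate installations whose log-odds rise with LogDependencyCount.
toy$Installed <- rbinom(n, 1, plogis(-1 + 0.8 * toy$LogDependencyCount))

# ASSUMPTION: a logistic regression on the transformed predictors.
fit <- glm(Installed ~ LogDependencyCount + LogSuggestionCount + LogViewsIncluding,
           data = toy, family = binomial(link = "logit"))

predicted <- predict(fit, type = "response")  # fitted installation probabilities
print(summary(predicted))
```

With the real data, the same formula interface would accept the remaining columns (for example `CorePackage` or `LogPackagesMaintaining`) as additional terms.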

The Kaggle data set is really the minimal amount of data you should use to build your model. Most users will quickly want to move on to the raw metadata that we’re providing on GitHub. This second-level data set is contained in several normalized CSV files inside of the data directory:

1. core.csv: The R and base packages are listed here as core packages.
2. depends.csv: The full dependency graph for CRAN as of 8/28/2010. An edge between A and B indicates that A depends upon B. For example, ggplot2 depends upon plyr, but plyr does not depend upon ggplot2.
3. imports.csv: The full import graph for CRAN as of 8/28/2010. An edge between A and B indicates that A imports B.
4. installations.csv: A list of the packages installed on 52 users’ systems. Each row indicates whether or not user A has installed package B.
5. maintainers.csv: A list of the current maintainers for each package. We use this instead of the Author field because it is generally easier to parse.
6. packages.csv: A list of all of the packages contained in CRAN on 8/28/2010.
7. recommended.csv: A list of the packages recommended for installation by the R Core team.
8. suggests.csv: The full suggestion graph for CRAN as of 8/28/2010. An edge between A and B indicates that A suggests B.
9. views.csv: A list of all of the packages indicated in each of the task views on CRAN as of 8/28/2010.
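The count-style predictors described earlier (for example DependencyCount) are in-degrees in these graphs. The sketch below shows one way to derive such a count from an edge list in base R. The column names `Package` and `Dependency` are hypothetical, since the layout of depends.csv is not described here, and the toy edges are for illustration only.

```r
# Toy edge list standing in for depends.csv.
# ASSUMPTION: the column names Package and Dependency are hypothetical.
# Each row reads "Package depends upon Dependency".
depends <- data.frame(
  Package    = c("ggplot2", "ggplot2", "reshape", "lme4"),
  Dependency = c("plyr", "reshape", "plyr", "Matrix"),
  stringsAsFactors = FALSE
)

# In-degree: how many packages depend upon each package.
in_degree <- table(depends$Dependency)
dependency_counts <- data.frame(
  Package         = names(in_degree),
  DependencyCount = as.integer(in_degree),
  stringsAsFactors = FALSE
)

# Most depended-upon packages first.
dependency_counts <- dependency_counts[order(-dependency_counts$DependencyCount), ]
print(dependency_counts)
```

Packages that nothing depends upon (ggplot2 and lme4 in this toy graph) do not appear in the result; joining against the full package list and filling in zeros would cover them. The same pattern would apply to the import and suggestion graphs.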

To give you a taste of this richer data set, we’ve built a visualization of the suggestions graph found in suggests.csv:

In the graph, the package names are sized and colored by in-degree centrality (i.e., larger and darker nodes have higher centrality), which you can think of as a very rough proxy for importance. If you’re interested in producing similar visualizations of this data, you can use Gephi.

For those interested, we’re also providing the R scripts we used to generate the metadata predictors, in case you’d like to use them as examples of how to work with the raw data from CRAN. The relevant scripts are:

1. extract_graphs.R: Extracts the dependency, import and suggestion graphs from CRAN.
2. get_maintainers.R: Extracts the package maintainers from CRAN.
3. get_packages.R: Extracts the names of all of the packages on CRAN.
4. get_views.rb: Extracts the packages that are contained in each of the task views on CRAN. This program is written in Ruby, not R.

All of the other data sets described earlier were compiled by hand.

Please note that these data sets are normalized, so we are also providing preprocessing scripts that build one large data frame that contains all of the information we’ve used to build our predictive model. The lib/preprocess_data.R script performs the relevant merging and transformation operations, including logarithmic transformations that we’ve found improve predictive accuracy. The result of this merging is the training data set that we’re providing through Kaggle.
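The kind of merging described above can be sketched with base R's `merge()`. This is not the contents of lib/preprocess_data.R: the toy tables, their column layouts, and the use of `log1p` are assumptions made for illustration.

```r
# Toy stand-ins for the normalized tables (contents and layouts are assumed).
packages <- data.frame(Package = c("plyr", "ggplot2", "boot"),
                       stringsAsFactors = FALSE)
dependency_counts <- data.frame(Package = "plyr", DependencyCount = 2L,
                                stringsAsFactors = FALSE)
recommended <- data.frame(Package = "boot", stringsAsFactors = FALSE)
installations <- data.frame(
  Package   = c("plyr", "ggplot2", "boot", "plyr", "ggplot2", "boot"),
  User      = c(1, 1, 1, 2, 2, 2),
  Installed = c(1, 1, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Left join: keep every package, filling missing counts with zero.
metadata <- merge(packages, dependency_counts, by = "Package", all.x = TRUE)
metadata$DependencyCount[is.na(metadata$DependencyCount)] <- 0L

# Dummy variable: 1 if the package is on the recommended list, else 0.
metadata$RecommendedPackage <- as.integer(metadata$Package %in% recommended$Package)

# ASSUMPTION: log(1 + x) so that zero counts stay finite.
metadata$LogDependencyCount <- log1p(metadata$DependencyCount)

# One row per (User, Package) pair, with the package metadata attached.
training <- merge(installations, metadata, by = "Package")
print(training)
```

The result has one row for each user and package combination, which is the shape of the training data described earlier.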

For the truly dedicated, you should consider CRAN itself to be the raw data set for this contest. If you want to use predictors beyond those we’re giving you, you’ll want to download a copy of CRAN that you can work with locally. You can do this using the Perl script, fetch_cran.pl, that we’re providing. To be kind to the CRAN maintainers, this download script sleeps for ten seconds between each step in the spidering process. Obviously you can change this, but please be considerate about the amount of bandwidth you use if you do make changes.
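If you would rather script a small, targeted download yourself, the same politeness idea can be sketched in R. This is not a port of fetch_cran.pl: the `polite_fetch` function, its arguments, and the placeholder URLs are all hypothetical, and the only point is the pause between requests.

```r
# Fetch each URL in turn, pausing between requests to limit load on the server.
# `fetch` defaults to download.file(); pass a stub to dry-run without network access.
# NOTE: polite_fetch is a hypothetical helper, not part of the contest code.
polite_fetch <- function(urls, dest_dir, delay_seconds = 10,
                         fetch = function(url, destfile) {
                           download.file(url, destfile, quiet = TRUE)
                         }) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  destfiles <- file.path(dest_dir, basename(urls))
  for (i in seq_along(urls)) {
    fetch(urls[i], destfiles[i])
    if (i < length(urls)) Sys.sleep(delay_seconds)  # no pause after the last file
  }
  invisible(destfiles)
}

# Dry run with a stub fetcher and no delay; the URLs are placeholders.
fetched <- character(0)
stub <- function(url, destfile) fetched <<- c(fetched, url)
polite_fetch(c("https://example.org/a.tar.gz", "https://example.org/b.tar.gz"),
             dest_dir = tempdir(), delay_seconds = 0, fetch = stub)
print(fetched)
```

Keeping the delay as an argument with a conservative default makes it easy to stay considerate by default while still allowing a faster dry run against a stub.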

Please note: until you are familiar with the preprocessed data sets that we’re providing, we suggest that you do not download a copy of CRAN. For many users, working directly with a raw copy of CRAN will not be efficient.

### Closing Remarks

We think this contest can help focus data hackers on an unsolved problem: using our current data tools to help us find the best data tools. We hope you’ll consider participating and even extending this work to new contexts. Happy hacking!