R used by KDD 2009 cup winner of slow challenge

Posted on May 31, 2009 by Allan Engelhardt

[This article was first published on CYBAEA Data and Analysis, and kindly contributed to R-bloggers].

The results from the KDD Cup 2009 challenge (which we wrote about before) are in, and the winner of the slow challenge used the R statistical computing and analysis platform for their winning submission.

The write-up (username/password may be required) from Hugh Miller and team at the University of Melbourne includes these points:

Decision tree, stub, or Random Forest as base classifiers with Logistic loss or cross-entropy loss function
Models fit in an hour or so
Used the R statistical package
Most of models run on Windows laptop with Intel Core 2 Duo 2.66GHz processor, 2GB RAM, 120Gb hard drive.

Impressive hardware selection! Well done R. Weka was another popular tool among the top entrants. Key to all of them was clever data preparation and variable substitution. The fast track winners from IBM document this in some detail:

We normalized the numerical variables by range, keeping the sparsity. For the categorical variables, we coded them using at most 11 binary columns for each variable. For each categorical variable, we generated a binary feature for each of the ten most common values, encoding whether the instance had this value or not. The eleventh column encoded whether the instance had a value that was not among the top ten most common values. We removed constant attributes, as well as duplicate attributes.
We replaced the missing values by mean for numerical attributes, and coded them as a separate value for discrete attributes. We also added a separate column for each numeric attribute with missing values, indicating whether the value was missing or not. We also tried another approach for imputing missing values based on KNN.
On the large data set we discretized the 100 numerical variables that had the highest mutual information with the target into 10 bins, and added them as extra features.
We tried PCA on the large data set, but it did not seem to help.
Because we noticed that some of the most predictive attributes were not linearly correlated with the targets, we built shallow decision trees (2-4 levels deep) using single numerical attributes and used their predictions as extra features. We also built shallow decision trees using two features at a time and used their prediction as an extra feature in the hope of capturing some non-additive interactions among features.
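
To make three of the quoted preparation steps concrete, here is a minimal sketch in plain Python (standard library only): the top-ten-plus-other coding of categorical variables, mean imputation with a missing-value indicator column, and normalization by range. This is not IBM's code. The function names and the "<missing>" label are hypothetical, and reading "by range" as dividing by (max - min) without shifting, so that zeros stay zero and sparsity is kept, is our assumption.

```python
from collections import Counter

MISSING = "<missing>"  # hypothetical label: missing categories become their own value


def encode_top_k(values, k=10):
    """Return (top, rows): one binary column per top-k value plus an 'other' column."""
    labelled = [MISSING if v is None else v for v in values]
    top = [value for value, _ in Counter(labelled).most_common(k)]
    rows = []
    for v in labelled:
        row = [1 if v == t else 0 for t in top]
        row.append(0 if v in top else 1)  # last column: value is outside the top k
        rows.append(row)
    return top, rows


def impute_mean(values):
    """Replace None by the mean of the observed values.

    Also returns a 0/1 indicator column marking which entries were missing.
    Assumes at least one value is observed.
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    filled = [mean if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator


def scale_by_range(values):
    """Divide by (max - min) without shifting, so zero entries stay zero (assumption)."""
    span = max(values) - min(values)
    if span == 0:
        return list(values)  # constant column; the quoted pipeline removes these anyway
    return [v / span for v in values]


# Small usage example with made-up data
colours = ["red", "blue", "red", None, "green", "red", "blue"]
top, rows = encode_top_k(colours, k=2)
filled, was_missing = impute_mean([1.0, None, 3.0])
scaled = scale_by_range([0.0, 2.0, 4.0])
```

With k=10 the first function yields the "at most 11 binary columns" described in the quote. A variable with fewer than ten distinct values simply gets fewer columns, plus an "other" column that is always zero and would be dropped by the constant-attribute filter.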
