(This article was first published on **Modern Tool Making**, and kindly contributed to R-bloggers)

The Kaggle Don’t Overfit competition is over, and I took 11th place! Additionally, I tied with tks for contributing the most to the forum, so thanks to everyone who voted for me! I voted for tks, and I’m very happy to share the prize with him, as most of my code is based on his work.

The top finishers in this competition did a writeup of their methods in the forums, and they are definitely worth a read. In my last post, I promised a writeup of a method that would beat the benchmark by a significant margin, so here it is. This variable selection technique is based on tks’ code.

To start, we set things up in the same way as my previous posts: Load the data, split train vs. test, etc.

To improve over our previous attempt, which fits a glmnet model on all 200 variables, we want to select a subset of those variables that performs better than the entire thing. This whole process must be cross-validated to avoid overfitting, but fortunately the caret package in R has a great function called rfe that handles most of the gruntwork. The process is described in great detail here, but it can be thought of as 2 nested loops: the outer loop resamples your dataset, either using cross-validation or bootstrap sampling, and then feeds the resampled data to the inner loop, which fits a model, calculates variable importances based on that model, and then eliminates the least important variables and re-calculates the model’s fit. The results of the inner loop are collected by the outer loop, which then stores a re-sampled metric of each variable’s importance, as well as the number of variables that produced the best fit. If it sounds complicated, read the document at my previous link for more detailed information.
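To make the two loops concrete, here is a minimal base-R sketch of the idea. It is not caret's implementation: it uses plain `glm` instead of glmnet, simulated stand-in data rather than the competition data, and it re-ranks the variables after every fit, which is only one of the options rfe offers. The names `auc`, `sizes`, `n_boot` and `results` are made up for illustration.

```r
set.seed(42)

# Simulated stand-in data (an assumption, not the competition data):
# 200 rows, 10 predictors on a common scale, only the first 3 carry signal.
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("var_", 1:p)))
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.5 * x[, 2] + x[, 3]))

# Rank-based AUC (Mann-Whitney statistic) for 0/1 labels
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

sizes  <- c(10, 6, 4, 2)  # subset sizes to evaluate, largest (all variables) first
n_boot <- 10              # number of outer-loop bootstrap resamples
results <- matrix(NA_real_, n_boot, length(sizes),
                  dimnames = list(NULL, paste0("size_", sizes)))

for (b in seq_len(n_boot)) {            # outer loop: resample the data
  in_bag   <- sample(n, replace = TRUE)
  out_bag  <- setdiff(seq_len(n), in_bag)
  train_df <- data.frame(y = y[in_bag], x[in_bag, , drop = FALSE])
  test_df  <- data.frame(x[out_bag, , drop = FALSE])
  keep <- colnames(x)
  for (s in seq_along(sizes)) {         # inner loop: fit, score, rank, eliminate
    fit  <- glm(reformulate(keep, response = "y"),
                data = train_df, family = binomial)
    pred <- predict(fit, newdata = test_df, type = "response")
    results[b, s] <- auc(pred, y[out_bag])   # scored on the out-of-bag rows
    if (s < length(sizes)) {
      # Importance = absolute coefficient size; drop the intercept first
      imp  <- sort(abs(coef(fit)[-1]), decreasing = TRUE)
      keep <- names(imp)[seq_len(sizes[s + 1])]
    }
  }
}

mean_auc  <- colMeans(results)           # resampled performance per subset size
best_size <- sizes[which.max(mean_auc)]  # subset size with the best average AUC
```

Ranking by raw coefficient size only makes sense here because the simulated predictors share a scale; on real data you would want to standardize first.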

This section of code initializes an object to control the RFE function. We use the ‘caretFuncs’ object as our framework, and the ‘twoClassSummary’ function as our summary function, as we are doing a 2-class problem and want to use AUC to evaluate our predictive accuracy. Then we use a custom function to rank the variables from a fitted glmnet object, based on tks’ method. We’ve decided to rank the variables by their coefficients, and consider variables with larger coefficients more important. I’m still not 100% sure why this worked for this competition, but it definitely gives better results than glmnet on its own. Finally, we create our RFE control object, where we specify that we want to use 25 repeats of bootstrap sampling for our outer loop.
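A control object along these lines might look like the sketch below. Treat it as one plausible reading of the description rather than the exact code: `glmnetFuncs` is a made-up name, and ranking by the absolute value of the coefficient at the tuned lambda is an assumption about what "larger coefficients" means.

```r
library(caret)
library(glmnet)

# Work on a copy so the package-level caretFuncs object is left untouched
glmnetFuncs <- caretFuncs
glmnetFuncs$summary <- twoClassSummary   # reports ROC (AUC), Sens and Spec

# Hypothetical ranking helper: `object` is the train() fit handed over by rfe,
# so object$finalModel is the underlying glmnet model.
glmnetFuncs$rank <- function(object, x, y) {
  beta <- as.matrix(coef(object$finalModel, s = object$bestTune$lambda))
  beta <- beta[rownames(beta) != "(Intercept)", 1]
  # rfe expects a data frame with 'Overall' and 'var', most important first
  vimp <- data.frame(Overall = abs(beta), var = names(beta),
                     stringsAsFactors = FALSE)
  vimp[order(vimp$Overall, decreasing = TRUE), ]
}

# Outer loop: 25 bootstrap resamples
rfeCtrl <- rfeControl(functions = glmnetFuncs, method = "boot", number = 25)
```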

Next, we have to set up the training parameters and multicore parameters, all of which are explained in my previous posts. In both loops of my RFE function I am using 25 repeats of bootstrap sampling, which you could turn up to 50 or 100 for more stable estimates at the cost of longer runtimes.
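For readers who missed the earlier posts, the inner-loop settings could be sketched roughly as follows. The alpha and lambda values are placeholders I picked for illustration, not the ones used in the competition, and `trainCtrl` and `glmnetGrid` are made-up names.

```r
library(caret)

# Inner loop: 25 bootstrap resamples, scored by AUC.
# classProbs = TRUE is needed so twoClassSummary can compute ROC.
trainCtrl <- trainControl(
  method          = "boot",
  number          = 25,
  classProbs      = TRUE,
  summaryFunction = twoClassSummary
)

# Candidate glmnet tuning values (placeholders, adjust to your data)
glmnetGrid <- expand.grid(
  alpha  = c(0.5, 1),
  lambda = 10^seq(-3, 0, length.out = 10)
)

# Current versions of caret run resamples in parallel through whatever
# foreach backend is registered, for example:
# doParallel::registerDoParallel(cores = 2)
```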

Now we get to actually run the RFE function and recursively eliminate features!

The structure of the RFE function is very similar to caret’s train function. In fact, the inner loop of RFE is fitting a model using the train function, so we need to pass 2 sets of parameters to RFE: one for RFE and one for train. I have indented the parameters for the train function twice so you can see what’s going on. After running RFE, we can access the variables it selected using RFE$optVariables, which we use to construct a formula we will use to fit the final model. We can also plot the RFE object for a graphical representation of how well the various subsets of variables performed.
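Put together, a call of this shape could look like the sketch below. It runs on small simulated data with only 5 resamples per loop so it finishes quickly; the data, the `glmnetFuncs` helper, the subset sizes, the tuning grid and the use of `Target_Practice` as the response column name are all assumptions for illustration.

```r
library(caret)
library(glmnet)

set.seed(42)
# Simulated stand-in data (not the competition data): 10 predictors, 2 informative
n <- 150; p <- 10
x <- as.data.frame(matrix(rnorm(n * p), n, p))
names(x) <- paste0("var_", 1:p)
y <- factor(ifelse(x$var_1 - x$var_2 + rnorm(n) > 0, "Yes", "No"),
            levels = c("Yes", "No"))

# Hypothetical ranking setup: absolute glmnet coefficient at the tuned lambda
glmnetFuncs <- caretFuncs
glmnetFuncs$summary <- twoClassSummary
glmnetFuncs$rank <- function(object, x, y) {
  beta <- as.matrix(coef(object$finalModel, s = object$bestTune$lambda))
  beta <- beta[rownames(beta) != "(Intercept)", 1]
  vimp <- data.frame(Overall = abs(beta), var = names(beta),
                     stringsAsFactors = FALSE)
  vimp[order(vimp$Overall, decreasing = TRUE), ]
}

rfeFit <- rfe(
  x, y,
  sizes      = c(2, 4, 6),   # glmnet needs at least 2 predictors
  metric     = "ROC",
  maximize   = TRUE,
  rfeControl = rfeControl(functions = glmnetFuncs, method = "boot", number = 5),
    # Everything below is passed through to train(), the inner loop.
    # train() does not see rfe's metric argument, so it may warn that it is
    # falling back to ROC.
    method    = "glmnet",
    tuneGrid  = expand.grid(alpha = 1, lambda = c(0.001, 0.01, 0.1)),
    trControl = trainControl(method = "boot", number = 5, classProbs = TRUE,
                             summaryFunction = twoClassSummary)
)

selected <- rfeFit$optVariables   # names of the chosen variables
finalFormula <- reformulate(selected, response = "Target_Practice")

# plot(rfeFit, type = c("g", "o"))  # resampled ROC for each subset size
```

`rfeFit$results` holds the resampled ROC for every subset size, which is what the plot draws.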

Lastly, we fit our final model using caret’s train function. This step is a little unnecessary, because RFE also fit a final model on the full dataset, but I included it anyways to try out a longer sequence of lambdas (lambda is a parameter for the glmnet model).

Because we fit our model on Target_Practice, we can score the model ourselves using colAUC. This model gets .91 on Target_Practice, and also scored about .91 on Target_Leaderboard and Target_Evaluate. Unfortunately, .91 was only good for 80th place on the leaderboard, as Ockham released the set of variables used for the leaderboard dataset, and I was unable to fit a good model using this information. Many other competitors were able to use this information to their advantage, but this yielded no edge on the final Target_Evaluate, which is what really mattered.
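If you have not used colAUC before, here is a tiny sketch of what it measures, using made-up predictions and labels rather than anything from the competition. The AUC is the share of (positive, negative) pairs that the model ranks in the right order; the caTools comparison only runs if that package is installed.

```r
# Toy predictions and true 0/1 labels (made up for illustration)
pred  <- c(0.10, 0.40, 0.35, 0.80)
truth <- c(0, 0, 1, 1)

# AUC by hand: fraction of (positive, negative) pairs ranked correctly,
# with ties counted as half
pos <- pred[truth == 1]
neg <- pred[truth == 0]
auc_manual <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# Compare with caTools::colAUC when it is available. Note that colAUC
# reports max(AUC, 1 - AUC), so it never returns a value below 0.5.
if (requireNamespace("caTools", quietly = TRUE)) {
  auc_catools <- caTools::colAUC(pred, truth)
  print(c(manual = auc_manual, caTools = as.numeric(auc_catools)))
}
```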

Overall, I’m very happy with the result of this competition. If anyone has any code they’d like to share that scored higher than .96 on the leaderboard, I’d be happy to hear from you!
