Beating the benchmark for the Kaggle Otto Group Product Classification Challenge with a simple R script

March 19, 2015

(This article was first published on Numbr Crunch » R, and kindly contributed to R-bloggers)

Two days ago, Kaggle began a new competition called the Otto Group Product Classification Challenge. In this competition, participants are challenged to build a model that correctly classifies products into one of 9 product categories (fashion, electronics, etc.). The data consists of 200k products with 93 features each. The features have been obfuscated, so it’s hard to make any subjective inferences about them. See the competition page on Kaggle for more details.

So far there has already been a lot of activity with nearly 500 teams contributing in less than 2 days. After taking a quick look at the data and checking out the forums, I decided to take a stab at it myself. For my first pass, I used a random forest since it works well with very little tuning and can deal with lots of features. With the very simple script below, I was able to beat the benchmarks, and by a fair margin.

Random Forest Benchmark – 1.56040
Uniform Probability Benchmark – 2.19722
My Score – 0.59256
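
The post does not name the metric, but these scores look like multi-class logarithmic loss, where lower is better; treat that as an assumption. Under it, the uniform benchmark is simply -log(1/9) for 9 equally likely classes. A minimal sketch in base R, where multi_logloss is a hypothetical helper written for illustration and the class labels are toy values:

```r
# Multi-class log loss: mean of -log(probability given to the true class).
# `multi_logloss` is a hypothetical helper, not part of any package.
multi_logloss <- function(actual, probs, eps = 1e-15) {
  # clip probabilities away from 0 and 1 so log() stays finite
  probs <- pmin(pmax(probs, eps), 1 - eps)
  # renormalise each row so it still sums to 1 after clipping
  probs <- probs / rowSums(probs)
  # look up the probability assigned to each row's true class
  idx <- cbind(seq_along(actual), match(actual, colnames(probs)))
  -mean(log(probs[idx]))
}

# Toy check: predict 1/9 for every class, whatever the truth is
classes <- paste0("Class_", 1:9)
uniform <- matrix(1 / 9, nrow = 5, ncol = 9, dimnames = list(NULL, classes))
truth <- c("Class_1", "Class_3", "Class_9", "Class_2", "Class_2")
uniform_score <- multi_logloss(truth, uniform)
print(round(uniform_score, 5))  # log(9), i.e. 2.19722
```

The result matches the Uniform Probability Benchmark listed above, which supports the log-loss reading of the leaderboard numbers.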


# clear environment workspace
rm(list=ls())
# load data
train <- read.csv("~/Documents/RStudio/kaggle_otto/data/train.csv")
test <- read.csv("~/Documents/RStudio/kaggle_otto/data/test.csv")
sample_sub <- read.csv("~/Documents/RStudio/kaggle_otto/data/sampleSubmission.csv")
# remove id column so it doesn't get picked up by the random forest classifier
train2 <- train[,-1]

# install the randomForest package if it is missing, then load it
if (!requireNamespace("randomForest", quietly = TRUE)) install.packages("randomForest")
library(randomForest)
# set a seed so you get the same results every time you run the model below;
# the specific number does not matter
set.seed(12)
# create a random forest model using the target field as the response and all 93 features as inputs
fit <- randomForest(as.factor(target) ~ ., data=train2, importance=TRUE, ntree=100)

# create a dotchart of variable/feature importance as measured by a Random Forest
varImpPlot(fit)

# use the random forest model to predict class probabilities for the test set
pred <- predict(fit,test,type="prob")
submit <- data.frame(id = test$id, pred)
write.csv(submit, file = "firstsubmit.csv", row.names = FALSE)
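
The script loads sample_sub but never uses it. One low-cost use is to compare the submission's shape against it before uploading. A minimal sketch, assuming the first column is the id and the remaining columns are class probabilities; check_submission is a hypothetical helper, and the toy data frames stand in for the real files, whose column names the post does not show:

```r
# Hypothetical helper: stop with an error if the submission does not
# line up with the sample submission file.
check_submission <- function(submit, sample_sub) {
  stopifnot(
    identical(names(submit), names(sample_sub)),  # same columns, same order
    nrow(submit) == nrow(sample_sub),             # one row per test product
    !anyNA(submit)                                # no missing predictions
  )
  # each row of class probabilities should sum to (about) 1
  row_totals <- rowSums(submit[, -1, drop = FALSE])
  stopifnot(all(abs(row_totals - 1) < 1e-6))
  invisible(TRUE)
}

# Toy stand-ins for sample_sub and submit
toy_sample <- data.frame(id = 1:3, Class_1 = 1, Class_2 = 0)
toy_submit <- data.frame(id = 1:3,
                         Class_1 = c(0.7, 0.2, 0.5),
                         Class_2 = c(0.3, 0.8, 0.5))
ok <- check_submission(toy_submit, toy_sample)
```

With the real objects from the script, the call would be check_submission(submit, sample_sub).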

Note that the number of trees used in the model (the ntree argument in the randomForest() call) will significantly impact the time it takes to run. The upside is that increasing the number of trees is also one of the easiest ways to improve the model. I was able to increase my rank by 68 spots just by increasing the number of trees from 50 to 100. My rank at submission time was 251/490, not bad for an out-of-the-box model with no tuning, feature selection, feature engineering, or cross-validation. It just goes to show how powerful R and many of its packages are. If you use R, this is a solid starting point to get some ideas on what to do in this competition. Feel free to use the code above or fork my GitHub repo to keep up to date with any changes.
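
One way to see the effect of the tree count without resubmitting is to look at the out-of-bag (OOB) error that randomForest records after each tree. A minimal sketch, using the built-in iris data purely as a small stand-in for train2; note that OOB error is a misclassification rate, not the leaderboard score, so it is only a rough guide:

```r
library(randomForest)

set.seed(12)
# iris is a small built-in dataset, used here only as a stand-in for train2
toy_fit <- randomForest(Species ~ ., data = iris, ntree = 100)

# err.rate has one row per tree count; the "OOB" column holds the
# out-of-bag misclassification rate after that many trees
oob <- toy_fit$err.rate[, "OOB"]

# compare the error after 10, 50, and 100 trees
print(oob[c(10, 50, 100)])
```

Calling plot(toy_fit) draws the same error curves against the number of trees, which makes it easier to judge where adding more trees stops paying for the extra run time.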

There are still 60 days left, so plenty of time to take it more seriously and develop an improved solution. If anyone is interested in creating a team to work on this together, ping me on my blog and we can go from there. If you’re new to Kaggle, I think this is a great medium-difficulty competition to get started in that will actually contribute to your ranking. Good luck and happy Kaggling!

