Beating the benchmark for the Kaggle Otto Group Product Classification Challenge with a simple R script

[This article was first published on Numbr Crunch » R, and kindly contributed to R-bloggers].

Two days ago, Kaggle began a new competition called the Otto Group Product Classification Challenge. In this competition, participants are challenged to build a model that correctly classifies products into one of 9 product categories (fashion, electronics, etc.). The data consists of roughly 200k products with 93 features each. The features have been obfuscated, so it’s hard to make any subjective inferences about them. Check out the competition page for more details.

So far there has already been a lot of activity with nearly 500 teams contributing in less than 2 days. After taking a quick look at the data and checking out the forums, I decided to take a stab at it myself. For my first pass, I used a random forest since it works well with very little tuning and can deal with lots of features. With the very simple script below, I was able to beat the benchmarks, and by a fair margin.

Random Forest Benchmark – 1.56040
Uniform Probability Benchmark – 2.19722
My Score – 0.59256
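
The uniform benchmark is consistent with multi-class logarithmic loss: giving every product a probability of 1/9 for each of the 9 classes yields -log(1/9), which is about 2.19722. Here is a minimal sketch of that metric, assuming the leaderboard scores are multi-class log loss. The function name multi_log_loss, the clipping value eps, and the toy inputs are illustrative choices and are not part of the original script.

```r
# Multi-class log loss: the average of -log(p), where p is the probability
# a submission assigned to each product's true class. Lower is better.
# probs:  n x K matrix of predicted class probabilities
# actual: integer vector of length n with the true class index (1..K)
# eps:    clipping value (an assumption) that keeps log() finite when p is 0
multi_log_loss <- function(probs, actual, eps = 1e-15) {
  picked <- probs[cbind(seq_along(actual), actual)]
  picked <- pmin(pmax(picked, eps), 1 - eps)
  -mean(log(picked))
}

# Uniform guess over 9 classes for five made-up products
uniform_probs <- matrix(1 / 9, nrow = 5, ncol = 9)
true_class <- c(1, 3, 9, 2, 5)
uniform_score <- multi_log_loss(uniform_probs, true_class)
print(round(uniform_score, 5))  # 2.19722, i.e. log(9)
```

Under this reading, any model that puts more than 1/9 of its probability on the correct class on average will score below the uniform benchmark.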

# clear environment workspace
rm(list = ls())
# load data
train <- read.csv("~/Documents/RStudio/kaggle_otto/data/train.csv")
test <- read.csv("~/Documents/RStudio/kaggle_otto/data/test.csv")
sample_sub <- read.csv("~/Documents/RStudio/kaggle_otto/data/sampleSubmission.csv")
# remove id column so it doesn't get picked up by the random forest classifier
train2 <- train[,-1]

# install the randomForest package (only needed once), then load it
# install.packages("randomForest")
library(randomForest)
# set a seed so you get the same results every time you run the model below;
# the number itself does not matter (123 is an arbitrary choice)
set.seed(123)
# create a random forest model using the target field as the response and all 93 features as inputs
fit <- randomForest(as.factor(target) ~ ., data=train2, importance=TRUE, ntree=100)

# create a dotchart of variable/feature importance as measured by a Random Forest
varImpPlot(fit)

# use the random forest model to create a prediction
pred <- predict(fit,test,type="prob")
submit <- data.frame(id = test$id, pred)
write.csv(submit, file = "firstsubmit.csv", row.names = FALSE)
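
Before uploading, it can help to confirm that the submission has the same layout as the sample file. The sketch below is only an illustration: check_submission is a hypothetical helper, and the small sample_sub, pred, and submit objects are toy stand-ins (3 classes instead of 9) for the objects created in the script.

```r
# Toy stand-ins for the objects in the script (hypothetical values).
sample_sub <- data.frame(id = 1:4, Class_1 = 0, Class_2 = 0, Class_3 = 0)
pred <- matrix(c(0.70, 0.20, 0.10,
                 0.10, 0.80, 0.10,
                 0.30, 0.30, 0.40,
                 0.50, 0.25, 0.25),
               ncol = 3, byrow = TRUE,
               dimnames = list(NULL, c("Class_1", "Class_2", "Class_3")))
submit <- data.frame(id = 1:4, pred)

# Hypothetical helper: TRUE only if the submission has the same columns and
# ids as the sample file and every row of probabilities sums to 1.
check_submission <- function(submit, sample_sub, tol = 1e-8) {
  same_cols <- identical(names(submit), names(sample_sub))
  same_ids <- same_cols &&
    nrow(submit) == nrow(sample_sub) &&
    all(submit$id == sample_sub$id)
  prob_cols <- setdiff(names(submit), "id")
  row_sums <- rowSums(submit[, prob_cols, drop = FALSE])
  sums_ok <- all(abs(row_sums - 1) < tol)
  same_cols && same_ids && sums_ok
}

ok <- check_submission(submit, sample_sub)
print(ok)  # TRUE for the toy data above
```

With the real objects, the same call would be check_submission(submit, sample_sub), assuming the sample file uses an id column followed by one column per class.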

Note that the number of trees used in the model (the ntree argument in the randomForest call) will significantly impact the time it takes to run. The upside is that increasing the number of trees is also one of the easiest ways to improve the model. I was able to increase my rank by 68 spots just by increasing the number of trees from 50 to 100. My rank at submission time was 251/490, not bad for an out-of-the-box model with no tuning, feature selection, feature engineering, or cross-validation. It just goes to show how powerful R and many of its packages are. If you use R, this is a solid starting point to get some ideas on what to do in this competition. Feel free to use the code above or fork my GitHub repo to keep up to date with any changes.
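
One cheap way to see the effect of the tree count without a second submission is the out-of-bag (OOB) error that randomForest records after each tree. The sketch below uses the built-in iris data as a stand-in, which is an assumption for illustration only; fit_demo and the seed are likewise illustrative. Note that OOB error is a misclassification rate, not the log loss used on the leaderboard, so it only gives a rough indication.

```r
library(randomForest)

# iris is a stand-in here (an assumption for illustration); with the
# competition data you would use train2 and the target column instead.
set.seed(42)
elapsed <- system.time(
  fit_demo <- randomForest(Species ~ ., data = iris, ntree = 100)
)["elapsed"]

# err.rate has one row per tree; the "OOB" column is the out-of-bag
# misclassification rate of the forest built from the first i trees.
oob <- fit_demo$err.rate[, "OOB"]
oob_at <- oob[c(50, 100)]
names(oob_at) <- c("50 trees", "100 trees")
print(oob_at)
print(elapsed)  # wall-clock seconds for the 100-tree fit
```

Because a single 100-tree fit already contains the 50-tree forest as its first 50 trees, one run is enough to compare both sizes.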

There are still 60 days left, so plenty of time to take it more seriously and develop an improved solution. If anyone is interested in creating a team to work on this together, ping me on my blog and we can go from there. If you’re new to Kaggle, I think this is a great medium-difficulty competition to get started in that will actually contribute to your ranking. Good luck and happy Kaggling!
