Two days ago, Kaggle launched a new competition called the Otto Group Product Classification Challenge. Participants are challenged to build a model that correctly classifies products into one of 9 product categories (fashion, electronics, etc.). The data consists of 200k products with 93 features each. The features have been obfuscated, so it’s hard to make any subjective inferences about them. Check out the link above for more details about the competition.
There has already been a lot of activity, with nearly 500 teams taking part in less than 2 days. After taking a quick look at the data and checking out the forums, I decided to take a stab at it myself. For my first pass, I used a random forest, since it works well with very little tuning and can deal with lots of features. With the very simple script below, I was able to beat both benchmarks by a fair margin.
Random Forest Benchmark – 1.56040
Uniform Probability Benchmark – 2.19722
My Score – 0.59256
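To put those numbers in context, here is a minimal sketch of how a score like this can be computed, assuming the leaderboard metric is multi-class log loss (lower is better). The function name multi_logloss and the made-up labels are my own for illustration and are not part of the competition files. Under that assumption, predicting a uniform 1/9 for every class scores ln(9), which matches the uniform probability benchmark.

```r
# Multi-class log loss: the mean negative log of the probability
# assigned to the true class. `probs` is an n x k matrix of predicted
# probabilities and `actual` holds the true class index (1..k) per row.
# (Hypothetical helper, written for illustration only.)
multi_logloss <- function(probs, actual, eps = 1e-15) {
  # clip to avoid log(0), then renormalise each row so it sums to 1
  probs <- pmin(pmax(probs, eps), 1 - eps)
  probs <- probs / rowSums(probs)
  # pick out the predicted probability of the true class for each row
  p_true <- probs[cbind(seq_along(actual), actual)]
  -mean(log(p_true))
}

n_classes <- 9
actual <- rep(1:n_classes, times = 10)  # made-up labels, 90 rows
uniform <- matrix(1 / n_classes, nrow = length(actual), ncol = n_classes)

uniform_score <- multi_logloss(uniform, actual)
print(round(uniform_score, 5))  # 2.19722, i.e. log(9)
```

The same function could be pointed at predictions for a held-out slice of the training data to estimate a score before submitting.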
# clear environment workspace
rm(list=ls())
# load data
train <- read.csv("~/Documents/RStudio/kaggle_otto/data/train.csv")
test <- read.csv("~/Documents/RStudio/kaggle_otto/data/test.csv")
sample_sub <- read.csv("~/Documents/RStudio/kaggle_otto/data/sampleSubmission.csv")
# remove id column so it doesn't get picked up by the random forest classifier
train2 <- train[,-1]

# install randomForest package
install.packages('randomForest')
library(randomForest)
# set a unique seed number so you get the same results everytime you run the below model,
# the number does not matter
set.seed(12)
# create a random forest model using the target field as the response and all 93 features as inputs
fit <- randomForest(as.factor(target) ~ ., data=train2, importance=TRUE, ntree=100)
# create a dotchart of variable/feature importance as measured by a Random Forest
varImpPlot(fit)
# use the random forest model to create a prediction
pred <- predict(fit,test,type="prob")
submit <- data.frame(id = test$id, pred)
write.csv(submit, file = "firstsubmit.csv", row.names = FALSE)
Note that the number of trees used in the model (the ntree argument, line 17 of the script) will significantly impact the time it takes to run. The upside is that increasing the number of trees is also one of the easiest ways to improve the model. I was able to improve my rank by 68 spots just by increasing the number of trees from 50 to 100. My rank at submission time was 251/490, which is not bad for an out-of-the-box model with no tuning, feature selection, feature engineering, or cross-validation. It just goes to show how powerful R and many of its packages are. If you use R, this is a solid starting point to get some ideas on what to do in this competition. Feel free to use the code above or fork my GitHub repo to keep up to date with any changes.
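One way to see the trade-off between run time and accuracy without making a submission is to look at the out-of-bag (OOB) error as trees are added. The sketch below is only an illustration: it uses R's built-in iris data as a stand-in for the competition data, and the names fit_demo and checkpoints are my own. It assumes the randomForest package is already installed.

```r
library(randomForest)  # assumes the package is already installed

set.seed(12)
# iris stands in for the competition data here; to try this on the real
# data, swap in train2 and as.factor(target) as in the script above
timing <- system.time(
  fit_demo <- randomForest(Species ~ ., data = iris, ntree = 200)
)

# err.rate has one row per tree; the "OOB" column is the out-of-bag
# error rate of the forest made up of the first i trees
oob <- fit_demo$err.rate[, "OOB"]
checkpoints <- c(10, 50, 100, 200)
print(data.frame(ntree = checkpoints, oob_error = oob[checkpoints]))

# elapsed seconds for the whole 200-tree fit
print(timing["elapsed"])
```

Note that OOB error is a misclassification rate, not the score shown on the leaderboard, so treat it as a rough guide to when extra trees stop helping rather than as a predicted score.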
There are still 60 days left, so there is plenty of time to take it more seriously and develop an improved solution. If anyone is interested in forming a team to work on this together, ping me on my blog and we can go from there. If you’re new to Kaggle, I think this is a great medium-difficulty competition to start with, and one that will actually count toward your ranking. Good luck and happy Kaggling!