When I first heard about hackathons (receive a data set, predict the response as well as possible, win money, all within 24 hours), I had two thoughts:
1. Wow, that sounds great. Like a huge game for intelligent people.
2. My skills are not good enough to participate.
That was one or two years ago. Since then I have finished my bachelor's degree in statistics and gained a little experience with some machine learning techniques (boosting and neural networks), so I felt confident enough to try it out. Then I read about the EMI Music Data Science Hackathon and decided to take part. The cool thing about it was that it was hosted on Kaggle, so you did not have to be in London to participate.
The week before
The next step was to find a team. As a statistics student it is easy to find other statisticians, so I started asking people around me if they were interested. To my surprise, the enthusiasm for taking part in such a competition was huge. My first plan, to spend the 24 hours of data hacking in my kitchen (which can handle up to 5 people), was soon discarded as the team grew to 11 people.
So we had to find a better place. The answer was the computer room of the statistics department. I asked the supervisor of my bachelor thesis whether we could use it for the weekend (and even stay there overnight from Saturday to Sunday). I was amazed at how uncomplicated it was to get permission.
24 hours before the first submissions could be made, the data sets were made available. Our team met to discuss how we would organize everything, have a look at the data and think about possible models. The response value was a user's rating of a particular song. The data (shared as CSV files) was stored in three tables: one with demographic information about the users, one with the user, artist, track and rating information, and one with information about how much some of the users liked the artists. As a first step we merged the data and did some descriptive analysis. We also derived some new features from the variables, which turned out to be very useful later on. All of our team members used R, so it was great that we could share code.
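The merging step can be sketched roughly as follows. This is not our actual code (we worked in R); it is a minimal Python illustration in which the miniature tables, all column names (`user`, `artist`, `track`, `rating`, `age`, `like`) and the derived feature `has_opinion` are made up for the example.

```python
# Hypothetical miniature versions of the three tables. All column names
# and values are invented for illustration, not taken from the real data.
users = [
    {"user": 1, "age": 23},
    {"user": 2, "age": 41},
]
ratings = [
    {"user": 1, "artist": "A", "track": 10, "rating": 70},
    {"user": 2, "artist": "A", "track": 10, "rating": 35},
    {"user": 2, "artist": "B", "track": 11, "rating": 90},
]
artist_opinions = [
    {"user": 1, "artist": "A", "like": 80},
    {"user": 2, "artist": "B", "like": 95},
]


def merge_tables(ratings, users, artist_opinions):
    """Left-join user and artist-opinion information onto the rating rows."""
    users_by_id = {u["user"]: u for u in users}
    opinions_by_key = {(o["user"], o["artist"]): o for o in artist_opinions}

    merged = []
    for r in ratings:
        row = dict(r)
        # Demographic columns, if the user is known.
        row.update(users_by_id.get(r["user"], {}))
        # Only some users rated some artists, so this lookup may fail.
        opinion = opinions_by_key.get((r["user"], r["artist"]))
        row["like"] = opinion["like"] if opinion else None
        # A simple derived feature: did the user say anything about the artist?
        row["has_opinion"] = opinion is not None
        merged.append(row)
    return merged


merged = merge_tables(ratings, users, artist_opinions)
```

Every rating row is kept even when no artist opinion exists for it, which is why the missing value is made explicit and turned into a feature of its own.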
We met very early in the morning to start modeling the response. We got very excited when we could upload our first submissions, but we were soon disappointed. At first we got very bad results, but that was due to the wrong order of cases in the submission file. I tried a boosting model but got bad results as well. One of our team members tried a very simple linear model with manual variable selection, and it was surprisingly good. Compared to the other teams we still had a rather high RMSE, but at least it performed better than the benchmarks. This was our best model for quite a while, which was very disappointing. Eleven statisticians could not find a better model than a very weak linear model? Why even study then? But then we had a success: when we combined a linear model with boosting results, we moved up some positions on the leaderboard. We also tried other methods like random forests, mixtures of regression models, generalized additive models (GAMs) and simple linear models.
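To illustrate why combining two models can help, here is a small Python sketch (again, not our R code). The numbers are invented and deliberately chosen so that the errors of the two hypothetical models partly cancel; with real predictions the gain is usually much smaller and not guaranteed.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two prediction vectors (weight_a for pred_a)."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(pred_a, pred_b)]


# Made-up example values, not results from the competition.
actual = [10.0, 20.0, 30.0]
linear_pred = [12.0, 18.0, 33.0]  # stands in for a linear model
boost_pred = [8.0, 23.0, 28.0]    # stands in for a boosting model

blended = blend(linear_pred, boost_pred)

rmse_linear = rmse(actual, linear_pred)
rmse_boost = rmse(actual, boost_pred)
rmse_blended = rmse(actual, blended)
```

In this toy case the blended RMSE is lower than that of either model alone. The weight can also be set by hand, for example `blend(linear_pred, boost_pred, weight_a=0.7)`.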
Most of us did not leave the university but kept programming overnight.
No sleep means more time for data hacking! But of course there were moments when everyone was tired and we worked really ineffectively. 1:00 pm London time came closer and we were very busy getting better results. In the end we climbed up the leaderboard by about 40 positions. Our final result was 37th place out of 138 teams in total. It was not enough to be among the top 25%, but it put us among the best 27%. Our best submission was the mean of the predictions of a simple generalized additive model, fitted with the package mgcv, and a boosted GAM, fitted with the package mboost. Our submissions over time can be seen here:
- It is useful to look at the data and build new features
- Ensemble learning can be pretty useful. We did not even use sophisticated approaches but simply took the mean of predictions or chose weights manually, and it resulted in smaller RMSEs
- Even simple methods like linear models can be useful
- It is necessary to have some form of cross-validation to test models without having to waste a submission
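As a rough illustration of the last point, the following Python sketch estimates an RMSE by k-fold cross-validation on the training data alone. It is not what we ran during the hackathon. The helper names and the trivial "predict the mean" model are assumptions made for the example, and any model with a fit and a predict step could be plugged in instead.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_rmse(features, targets, fit, predict, k=5, seed=0):
    """Estimate the out-of-sample RMSE using k-fold cross-validation."""
    squared_errors = []
    for held_out in k_fold_indices(len(targets), k, seed):
        held = set(held_out)
        train = [i for i in range(len(targets)) if i not in held]
        # Fit only on the training folds ...
        model = fit([features[i] for i in train], [targets[i] for i in train])
        # ... and score only on the held-out fold.
        for i in held_out:
            error = predict(model, features[i]) - targets[i]
            squared_errors.append(error ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical baseline model: always predict the mean training target.
def fit_mean(train_features, train_targets):
    return sum(train_targets) / len(train_targets)


def predict_mean(model, feature_row):
    return model


# Made-up toy data.
features = [[i] for i in range(10)]
targets = [2.0 * i for i in range(10)]

cv_score = cross_validated_rmse(features, targets, fit_mean, predict_mean, k=5)
```

Comparing such cross-validated scores between candidate models gives a first ranking without spending any submissions on the leaderboard.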