The KDD Cup is an annual competition to build the best predictive model from a large data set. This year's contest tasked entrants with predicting the likelihood of a student dropping out of one of XuetangX's massive open online courses, based on the student's prior activities. The competition closed on July 12, and yesterday the winning teams were announced. The winner was team “Intercontinental Ensemble” and the runner-up was “FEG&[email protected]”.
I couldn't find any details on what techniques were used — more will be revealed, I expect, at the KDD Conference in Sydney. But if you want to get a sense of what it's like to work with these data, take a look at this Data Until I Die blog post from a competitor who got close to the top of the leaderboard. He or she used a Gradient Boosting Model from the H2O R package, and found (among other things) that students who had completed prior courses were more likely to complete the next one.
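To give a flavor of that kind of gradient-boosting approach, here's a minimal sketch using scikit-learn's `GradientBoostingClassifier` as a stand-in for H2O's GBM. The feature names and the synthetic data are invented for illustration; the actual competition features were engineered from XuetangX activity logs.

```python
# Hedged sketch: a gradient-boosting dropout model on synthetic data.
# scikit-learn stands in for H2O's GBM; feature names are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins for per-student activity features
prior_courses_completed = rng.integers(0, 5, n)
log_events = rng.poisson(30, n)
days_active = rng.integers(0, 30, n)

X = np.column_stack([prior_courses_completed, log_events, days_active])

# Simulate the reported pattern: completing prior courses lowers dropout risk
logit = 1.0 - 0.8 * prior_courses_completed - 0.02 * days_active
p_dropout = 1 / (1 + np.exp(-logit))
y = rng.binomial(1, p_dropout)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   max_depth=3, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

With real data you would of course tune the number of trees, depth, and learning rate via cross-validation rather than using the defaults above.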
If you'd like to play around with the data yourself, it's no longer available at the KDD Cup site, but it is available in an experiment in Azure ML Studio. If you haven't used Azure ML Studio before, it's free to get started and all you need is a modern web browser (I used Chrome on a Mac). The screenshot below just shows the data munging steps, but later on in the flow a Python node is used to fit a predictive model. (This step-by-step tutorial on analyzing the KDD 2015 data walks you through the steps.) It's easy to add an R node as well, which gives you an R instance with 50 GB of RAM and 8 cores to analyze the data.
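If you're curious what such a Python node contains, here's a minimal sketch of the entry-point function that Azure ML Studio's Execute Python Script module expects: a function named `azureml_main` that receives up to two pandas DataFrames from the input ports and returns a tuple of DataFrames. The `event_count` column and the labelling logic are invented for illustration.

```python
# Hedged sketch of an Azure ML Studio "Execute Python Script" node.
# The entry-point signature is the module's convention; the feature
# logic and column names are hypothetical.
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # dataframe1 arrives from the module's first input port
    df = dataframe1.copy()
    # Hypothetical derived feature: flag students with very low activity
    df["low_activity"] = (df["event_count"] < 10).astype(int)
    return df,  # Studio reads the first element of the returned tuple

# Standalone check with a toy frame (in Studio, the real data is wired in)
sample = pd.DataFrame({"student_id": [1, 2], "event_count": [5, 42]})
out = azureml_main(sample)[0]
print(out)
```

The returned DataFrame flows out of the node's output port, so downstream modules in the experiment can consume the derived feature directly.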
For more details on using Azure ML Studio to analyze the KDD Cup data, check out the blog post below.