Many modern data analysis problems in both industry and academia involve building a model that can predict the future based on historical variables. The 2009 KDD Cup was an international data mining competition devoted to this type of problem, where contestants attempted to predict the behaviour of mobile phone customers using an extensive database of historical information. The University of Melbourne team managed to win one part of this challenge, using R almost exclusively. In this talk I’ll give some background to the area and the specific problem, and discuss how we went about building our models. The talk will be fairly accessible, and deal with many of the practical issues encountered in this type of work.
“The analysis and modelling work was performed almost entirely in the free open source program R. We say ‘almost’, because the original data chunks were too large to be read into R with our limited hardware, so it was first read into SAS and exported in batches of 200 variables, each of which could then be read into and then deleted from R.”
I’m curious how this part was done. If you are only batching the variables 200 at a time, how do you account for the interactions among all of the variables? Is there a learning or gradient descent optimization being performed?
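For anyone wondering the same thing: the quoted passage does not say how the batches were used, so what follows is only a sketch of one common pattern, not the team's actual method. It assumes each batch is scored on its own against the target (here with a plain absolute Pearson correlation), only the variables that pass a threshold are kept, and the batch is then discarded. The data, the names `load_batch` and `screen_in_batches`, the batch size of 5 (standing in for 200), and the threshold are all made up for illustration. It is written in Python with the standard library only.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def load_batch(columns, names):
    """Hypothetical stand-in for reading one exported batch file from disk."""
    return {name: columns[name] for name in names}


def screen_in_batches(columns, target, batch_size, threshold):
    """Score variables one batch at a time and keep those above the threshold."""
    kept = {}
    names = list(columns)
    for start in range(0, len(names), batch_size):
        batch = load_batch(columns, names[start:start + batch_size])
        for name, values in batch.items():
            if abs(pearson(values, target)) >= threshold:
                kept[name] = values
        del batch  # only one batch is held in memory at any time
    return kept


# Synthetic data (an assumption for the demo): 20 noise variables,
# with the target driven by just two of them.
random.seed(0)
n_rows = 500
columns = {f"v{i}": [random.gauss(0, 1) for _ in range(n_rows)] for i in range(20)}
target = [
    2 * columns["v3"][r] + 2 * columns["v17"][r] + random.gauss(0, 0.5)
    for r in range(n_rows)
]

kept = screen_in_batches(columns, target, batch_size=5, threshold=0.3)
print(sorted(kept))
```

Note the limitation this makes visible: screening one variable at a time cannot detect an interaction between variables, whether they sit in the same batch or in different ones. Interactions can only be examined afterwards, among the survivors, once the reduced set is small enough to hold in memory together. In R the same loop would be a `read.csv()` per batch followed by `rm()`.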
Thanks for this interesting presentation. Unfortunately the audio was terrible, and listening was a real pain in the a…
Yes, the audio makes it unwatchable. But the talk could be very interesting — could you post a less compressed version?
There are a couple of presentations available as PDF as well.
http://jmlr.csail.mit.edu/proceedings/papers/v7/
This one is specific to Uni of Melb.
http://jmlr.csail.mit.edu/proceedings/papers/v7/miller09/miller09.pdf
http://www.kddcup-orange.com/Slides/Unimelb_slides.pdf
TL;DR