[This article was first published on Modern Tool Making, and kindly contributed to R-bloggers.]
Kaggle is a site for participating in predictive analytics competitions. It is also a great resource for learning how to build powerful predictive models, and the Overfitting competition provides a good introduction to the common tools used by a predictive analyst.
To start, you will need to download R for your platform. If you don’t live near Pittsburgh, PA, and the download is slow, go to the main site for R, click CRAN and select a mirror closer to you. After you install R, you will need to install a few packages so you can follow my code. (If you are running R on Linux, make sure you start it with ‘sudo R’ to install packages.) Enter the following line of code at the R command prompt to install the R packages caret, reshape2, plyr, and caTools. caret is a package for building predictive models, reshape2 and plyr are brilliant tools for data management, and caTools will help you score your model using a ROC curve.
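A minimal sketch of that install step, assuming a CRAN mirror is already configured in your R session. The check against `installed.packages()` and the `interactive()` guard are additions for safety rather than part of the original instructions, and the variable names `pkgs` and `missing_pkgs` are illustrative.

```r
# Packages named in the paragraph above
pkgs <- c("caret", "reshape2", "plyr", "caTools")

# Work out which of them are not installed yet
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Only install from an interactive prompt, and only what is missing
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

If you prefer the one-line form, `install.packages(c("caret", "reshape2", "plyr", "caTools"))` does the same thing without the checks.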
After you install R, download the Kaggle competition data, and create a new R script in the R GUI. (On Linux, use your favorite text editor; I’m assuming you already know something about writing code.) The first thing your script needs to do is change your working directory to the directory where you saved the competition’s data. If you save your R script and the data to the same directory, you should be able to skip this step, but setting the directory explicitly is usually good practice. You will need a bit of an understanding of how your operating system handles directories. For example, Windows doesn’t seem to handle the ‘~’ character in the same way as Mac and Unix systems, where ‘~/’ represents your home directory. You might want to create a directory called ‘~/Kaggle/Overfitting’ and store your project there.
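As a sketch of the working-directory step, assuming you used the ‘~/Kaggle/Overfitting’ directory suggested above (the variable name `project_dir` is illustrative). `path.expand()` shows what ‘~’ resolves to on your system, which is a quick way to see how your platform treats it.

```r
# Resolve '~' to a full path so you can see where R thinks "home" is
project_dir <- path.expand("~/Kaggle/Overfitting")
print(project_dir)

# Only change directory if it actually exists, to avoid an error from setwd()
if (dir.exists(project_dir)) {
  setwd(project_dir)
} else {
  message("Directory not found, staying in: ", getwd())
}
```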
After changing the directory, I like to load the packages I am going to use for my analysis. This helps force me to think things through ahead of time. In this case, we are going to fit our predictive model using the packages caret and glmnet. glmnet should have been installed along with the caret package as I instructed; if R reports that it is missing, install it the same way. ‘e1071’ and ‘ipred’ will be used for feature selection, and ‘caTools’ will be used to score your model.
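One way to load these packages is sketched below. The `requireNamespace()` check is an addition, not part of the original workflow: it reports a missing package with a message instead of stopping the script, which a plain `library(caret)` call would do.

```r
# Packages named in the paragraph above
pkgs <- c("caret", "glmnet", "e1071", "ipred", "caTools")

for (p in pkgs) {
  if (requireNamespace(p, quietly = TRUE)) {
    # character.only = TRUE lets library() take the name from a variable
    library(p, character.only = TRUE)
  } else {
    message("Package not installed: ", p)
  }
}
```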
Next we need to read the data into R, which we do using the read.csv command. We then choose a target for our analysis, which in this case is ‘Target_Practice’, and null out the other target variables because they are redundant. (I will explain Target_Leaderboard and Target_Evaluate later.) We store our Target as a factor in R, which signals that it is a discrete rather than a continuous variable. This also means that we are solving a classification rather than a regression problem, and helps us avoid predictive techniques that might be inappropriate for the dataset at hand. We recode the variable from ‘1’ and ‘0’ to ‘X1’ and ‘X0’ because caret uses the class levels as column names when it computes class probabilities, and ‘0’ and ‘1’ are not valid R variable names.
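The steps above can be sketched on a small stand-in file. Everything about the toy data is an assumption made so the example runs on its own: the columns `case_id`, `train`, `var_1` and `var_2`, the values, and the temporary CSV path. Only the three `Target_*` names come from the text; with the real data you would pass the path of the downloaded file to `read.csv()` instead.

```r
# Toy stand-in for the competition file (layout and values are hypothetical)
set.seed(1)
toy <- data.frame(
  case_id            = 1:6,
  train              = c(1, 1, 1, 1, 0, 0),
  Target_Practice    = c(1, 0, 1, 0, 1, 0),
  Target_Leaderboard = c(0, 1, 1, 0, 0, 1),
  Target_Evaluate    = c(1, 1, 0, 0, 1, 0),
  var_1              = runif(6),
  var_2              = runif(6)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the data into R
Data <- read.csv(csv_path, header = TRUE)

# Recode 1/0 to 'X1'/'X0' and store the result as a factor
Data$Target <- factor(ifelse(Data$Target_Practice == 1, "X1", "X0"),
                      levels = c("X0", "X1"))

# Null out the original target columns, which are now redundant
Data$Target_Practice    <- NULL
Data$Target_Leaderboard <- NULL
Data$Target_Evaluate    <- NULL

str(Data)
```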
We need to split our dataset into a training and a testing set, which is something you should try to do when developing any model. In this case, splitting is easy because the contest’s organizers have helpfully provided the split for us. Oftentimes, you will be required to make this split yourself, and you can employ R’s sample command to randomly assign observations to the training and test sets, but that is a topic for another day.
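Both kinds of split can be sketched on toy data. The indicator column name `train` (1 for training rows, 0 for test rows) is an assumption about how the organizers’ split is encoded, and the 75/25 ratio in the random split is an arbitrary choice for illustration.

```r
set.seed(42)

# Toy data with a hypothetical 'train' indicator column
Data <- data.frame(
  train  = c(1, 1, 1, 0, 0, 0, 1, 0),
  Target = factor(c("X1", "X0", "X1", "X0", "X1", "X0", "X0", "X1")),
  var_1  = rnorm(8)
)

# Split supplied with the data: subset rows on the indicator
trainset <- Data[Data$train == 1, ]
testset  <- Data[Data$train == 0, ]

# Making the split yourself: sample() draws random row indices
train_idx    <- sample(nrow(Data), size = floor(0.75 * nrow(Data)))
random_train <- Data[train_idx, ]
random_test  <- Data[-train_idx, ]
```

Setting the seed before calling `sample()` makes the random split reproducible from run to run.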
Finally, we define the formula we will use for our analysis. R has a very powerful formula interface, and all of the caret functions can employ it. In this case, we define a variable ‘xnames’ that contains the names of the independent variables in this model. We are going to use these to try to predict the dependent variable, in this case ‘Target’. To review what we have done, try the following commands:
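A sketch of the formula step, followed by a few commands of the kind that are useful for review at this point. The toy data frame, the column names `case_id`, `train`, `var_1` and `var_2`, and the name `model_formula` are assumptions for illustration. `reformulate()` from base R is used here as one convenient way to build a formula from a character vector.

```r
# Toy data (hypothetical columns) so the example runs on its own
Data <- data.frame(
  case_id = 1:4,
  train   = c(1, 1, 0, 0),
  Target  = factor(c("X1", "X0", "X1", "X0")),
  var_1   = c(0.2, 0.7, 0.4, 0.9),
  var_2   = c(0.5, 0.1, 0.8, 0.3)
)

# Independent variables: every column that is not the target or bookkeeping
xnames <- setdiff(names(Data), c("Target", "case_id", "train"))

# Build a formula with Target on the left and the xnames terms on the right
model_formula <- reformulate(xnames, response = "Target")

# Review what we have so far
dim(Data)
summary(Data$Target)
head(xnames)
print(model_formula)
```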