
Blog 2: Data preparation and research question

About 2 weeks ago, right around New Year, I was browsing Kaggle just for fun. It reminded me how much fun it is to play around with random data, very often with a cool purpose too. One of my New Year goals is to have a little more fun with data again, so one thing quickly led to another and I have been diving into a challenge for the past 2 weeks. In this series I will discuss 4 distinct steps in my project:

• Challenge goals and initial discussion
• Data setup for modelling
• Data modelling
• Refinements and final insights

## The challenge: ACEA Water Levels

The goal of this challenge is to predict water levels in a collection of different water bodies in Italy. Specifically, we have to build a time series model that accurately predicts tomorrow's water level based on today's data. A short note on the Kaggle rules: this is an analytics challenge, which means that creating a compelling story and notebook is a very important part. My notebook is publicly available on Kaggle, but I will work through some code excerpts and interesting highlights in these blogs.

For today I want to address some of the data setup steps I used to get my data ready for machine learning models.

### The how-to’s on setting up your data

Data setup for modelling sounds a bit mystical, but it is the crucial step where we move on from our initial insights and start to shape our data so that it can answer our main research question. Here we add features to our data, transform our variables and put in the business logic. This step can make or break your findings.

In today's world there are roughly 2 ways to go about setting up your data optimally for your machine learning efforts. There is the traditional approach, where we build our feature set and then clean up our dataset before we run the predictive model, for example by removing variables that are highly correlated with each other. And there is the more modern approach, where we simply load all our features into the model and evaluate and trim our data afterwards.

In the traditional approach we use our own reasoning and our knowledge of statistics and the dataset to optimize the model outcome. In the modern approach we make use of the advancements in computing power and speed and let the model figure out for itself which features might be important. In this blog I will work through both options, discuss them and compare the final outcome.

In the traditional approach for data setup we define the following steps:
– Data cleaning
– Statistical preprocessing

In our sample dataset for the aquifer Auser in the Kaggle dataset referenced before, we find a total of 5 outcome variables. Each of these variables can and needs to be predicted in the final notebook. For illustration purposes we will focus on the LT2 variable; according to the data definition this is the only water level measurement in the southern part of the aquifer. As predictors, today we will focus on the rainfall and temperature variables in the Auser dataset.

In this particular case I considered the following. Looking back at our initial findings, one of the main concerns when predicting water levels is the time lag associated with both the temperature and the rainfall variables in the dataset. Hence a logical first step is to incorporate lags of these variables in our dataset. In the future I can imagine there are other parameters to consider here, such as averaging rainfall over multiple measuring points or using a running average over multiple days.
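As a minimal sketch of what such lagged features look like, the base R snippet below builds a one-day lag, a two-day lag and a three-day running average for a made-up rainfall series. The helper `lag_vec()` and the toy data frame are hypothetical and only there for illustration; they are not part of my notebook.

```r
# Toy data: five days of (made-up) rainfall in mm
toy <- data.frame(day = 1:5, rainfall = c(0, 12, 3, 0, 7))

# Hypothetical helper: shift a vector k >= 1 positions back in time, padding with NA
lag_vec <- function(x, k) c(rep(NA, k), head(x, -k))

toy$rainfall_1 <- lag_vec(toy$rainfall, 1)   # yesterday's rainfall
toy$rainfall_2 <- lag_vec(toy$rainfall, 2)   # rainfall two days ago

# Three-day running average of today and the two previous days
toy$rainfall_avg3 <- as.numeric(stats::filter(toy$rainfall, rep(1/3, 3), sides = 1))

print(toy)
```

The first rows of every lagged column are `NA` because there is no earlier day to look back to, which is one reason the data cleaning step has to come after the lags are created.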

### Data cleaning

In this part of the process we normally deal with missing values. This can be done in multiple ways, which are described in depth elsewhere. In a business environment this is often a time-intensive step, because we want to retain as much data as possible, for example by using imputation to fill in the missing values. This is where your business logic comes in, and it's an important area to focus on when you're presented with real-life 'dirty' datasets. For now we simply remove all rows that are incomplete.

Another important step here is handling outliers. We find that some measurements in the LT2 variable are unrealistic: from one day to the next the water level moves from the average to 0 and back to the average. These are faulty measurements and need to be removed from the data.
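The two cleaning rules above can be sketched in a few lines of base R. The toy data frame and its values are made up for illustration; only the logic (drop incomplete rows, drop readings that are exactly 0) mirrors what I do with the real data.

```r
# Toy data: water level with one missing value and one faulty zero reading
toy <- data.frame(
  outcome  = c(-12.1, -12.0, 0, -11.9, NA, -11.8),
  rainfall = c(0, 5, 2, NA, 1, 0)
)

# Keep only rows without any missing values
clean <- toy[complete.cases(toy), ]

# Drop the faulty measurements where the level is recorded as exactly 0
clean <- clean[clean$outcome != 0, ]

print(clean)
```

Removing the missing values first means the comparison with 0 never has to deal with `NA`.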

### Statistical preprocessing

Now for the most important preprocessing step in the traditional approach: we clean up our variables by removing highly correlated variables (since they have a similar effect on the outcome measure of choice) and near-zero-variance variables (variables that are, for example, nearly always 0).

These steps help us gain an understanding of how the data works, and therefore help us better understand how the model arrives at its final outcome. After finalizing all these steps we find that 18 variables remain. Using those in an initial test run of the model, we find an Rsquared of 15%. This is not the best result, but for a first try I've had a lot worse!

The next step in the traditional approach is to start tinkering. Initially we work backwards: we look at our excluded variables and try different cutoff values for the different steps. This helps us expand our model, as we add features and attempt to improve the model performance through understanding its inner workings. This step can take days to weeks, depending on how often we rerun our model.
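To make the idea concrete, here is a simplified base R sketch of both filters on made-up data. It is not what caret does internally: the 95% rule and the "drop the later column of a correlated pair" rule are assumptions I chose to keep the example short, and the column names are hypothetical.

```r
set.seed(1)
n <- 100
toy <- data.frame(
  rain_a   = rnorm(n),
  temp_a   = rnorm(n),
  constant = c(1, rep(0, n - 1))                 # nearly always 0
)
toy$rain_b <- toy$rain_a + rnorm(n, sd = 0.01)   # almost a copy of rain_a

# 1. Near-zero variance: drop columns where the most common value
#    covers more than 95% of the rows (the cutoff is an assumption)
share_most_common <- sapply(toy, function(x) max(table(x)) / length(x))
toy <- toy[, share_most_common <= 0.95, drop = FALSE]

# 2. High correlation: for every pair above the cutoff, drop the later column
cutoff <- 0.9
cors <- abs(cor(toy))
cors[lower.tri(cors, diag = TRUE)] <- 0          # look at each pair only once
drop <- names(which(apply(cors, 2, max) > cutoff))
toy <- toy[, setdiff(names(toy), drop), drop = FALSE]

names(toy)
```

In this toy example the `constant` column is removed by the first filter and `rain_b` by the second, because it carries almost the same information as `rain_a`.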

## The modern approach

In what I call a modern approach to data setup we skip the statistical preprocessing. Instead we simply load the data into the model and try to optimize the predictive power. Afterwards we evaluate our model and, if possible, rerun a simplified version. When running the model without any statistical preprocessing, but with a similar feature set and data cleaning, we find an Rsquared of 33.7%. This gives us some idea of the potential in the data.
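For readers who want to see what sits behind an Rsquared figure, the sketch below computes it in two common ways on made-up numbers. As far as I know, caret reports the squared correlation between observed and predicted values by default for regression models, but treat that as an assumption and check the documentation for your version. The values below are invented and have nothing to do with the 15% and 33.7% reported here.

```r
# Made-up observed and predicted water levels
observed  <- c(-12.1, -12.0, -11.7, -11.9, -11.5, -11.8)
predicted <- c(-12.0, -11.9, -11.8, -11.8, -11.7, -11.9)

# R-squared as the squared correlation between observations and predictions
rsq_cor <- cor(observed, predicted)^2

# Alternative definition: 1 - residual sum of squares / total sum of squares
rsq_ss <- 1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)

round(c(correlation = rsq_cor, sum_of_squares = rsq_ss), 3)
```

The two definitions only coincide when the predictions are perfectly calibrated; otherwise the sum-of-squares version is the lower of the two.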

## Code

Below you find my function for preprocessing and running my model both in the traditional and modern approach. I find that writing this in a function helps keep my code clean, and this function can be applied to all datasets in the Kaggle ACEA challenge as needed.

library(caret)   # train(), trainControl(), nearZeroVar(), findCorrelation(), findLinearCombos()
library(Hmisc)   # assumption: Lag() below is Hmisc::Lag(); the post does not state which package provides it

create_data_model <- function(data, features, outcome, lag, preprocess = NULL){

  data_model <- data[, c(outcome, features)]
  names(data_model)[which(names(data_model) == outcome)] <- 'outcome'

  # Add lagged copies (1 up to `lag` days back) of every feature
  for(i in seq_along(features)){
    for(j in seq_len(lag)){
      data_model$temp <- Lag(data_model[, features[i]], +j)
      names(data_model)[which(names(data_model) == 'temp')] <- paste(features[i], j, sep = '_')
    }
  }

  # Data cleaning
  data_model <- data_model[complete.cases(data_model), ]       # Remove all rows with missing values
  data_model <- data_model[which(data_model$outcome != 0), ]   # Remove all outlier measurements

  # Statistical preprocessing
  if(!is.null(preprocess)){
    temp <- data_model[, -1]                                   # all predictors, without the outcome column
    nzv <- nearZeroVar(temp)                                   # excluding variables with very low frequencies (indices refer to temp)
    if(length(nzv) > 0) temp <- temp[, -nzv]
    i <- findCorrelation(cor(temp))                            # excluding variables that are highly correlated with others
    if(length(i) > 0) temp <- temp[, -i]
    i <- findLinearCombos(temp)                                # excluding variables that are a linear combination of others
    if(!is.null(i$remove)) temp <- temp[, -i$remove]
    data_model <- data_model[, c('outcome', names(temp))]
  }

  # Modelling:
  fitControl <- trainControl(## 10-fold CV
    method = "repeatedcv",
    number = 10,
    ## repeated three times
    repeats = 3,
    verboseIter = TRUE)

gbmGrid <-  expand.grid(interaction.depth = c(1,2,4,8),
n.trees = 1:2000,
shrinkage = c(0.01,0.001),
n.minobsinnode = c(2,5))

  # Load a previously fitted model if one was saved, otherwise train and save it.
  # maindir and modeldir are expected to be defined in the calling environment.
  model_file <- file.path(maindir, modeldir,
    paste0('outcome =', outcome, 'lag = ', lag, 'preprocess = ', preprocess, '.RData'))
  if(file.exists(model_file)){
    load(model_file)
  } else {
    train1 <- train(outcome ~ ., data = data_model, method = 'gbm', trControl = fitControl, tuneGrid = gbmGrid)
    save(train1, file = model_file)
  }

train1
}

tr_preproc <- create_data_model(data = data_auser,
                                features = grep('Rainfall|Temperature', names(data_auser), value = TRUE),
                                outcome = 'Depth_to_Groundwater_LT2',
                                lag = 15,
                                preprocess = 'yes')

tr_nopreproc <- create_data_model(data = data_auser,
                                  features = grep('Rainfall|Temperature', names(data_auser), value = TRUE),
                                  outcome = 'Depth_to_Groundwater_LT2',
                                  lag = 15,
                                  preprocess = NULL)
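To get a feel for the size of the tuning job defined above, the sketch below counts the candidates in the grid and compares it with a coarser grid. The coarser grid is a hypothetical alternative, not something I ran. As far as I understand, caret can derive the results for smaller `n.trees` values from the fit with the largest value, so the number of actual gbm fits is likely much smaller than the row count suggests; treat that as an assumption.

```r
# The tuning grid as used in the function above
gbmGrid <- expand.grid(interaction.depth = c(1, 2, 4, 8),
                       n.trees = 1:2000,
                       shrinkage = c(0.01, 0.001),
                       n.minobsinnode = c(2, 5))

n_candidates <- nrow(gbmGrid)   # 4 * 2000 * 2 * 2 parameter combinations
n_resamples  <- 10 * 3          # 10 folds, repeated 3 times

# A coarser grid (hypothetical alternative) that evaluates every 100th tree count
coarseGrid <- expand.grid(interaction.depth = c(1, 2, 4, 8),
                          n.trees = seq(100, 2000, by = 100),
                          shrinkage = c(0.01, 0.001),
                          n.minobsinnode = c(2, 5))

c(full = n_candidates, coarse = nrow(coarseGrid), resamples = n_resamples)
```

Every candidate is evaluated on each of the resamples, so the number of lagged features going into the model has a direct impact on the run time.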

## Discussion

So how do you set up your data for modelling? We saw that the traditional approach leaves you with more understanding of the problem, but it also requires a larger time investment and often demands more business logic to deal with the problem. The modern approach skips these steps and moves straight to optimizing predictive power, using machine learning techniques to optimize the data setup.

Comparing the 2, we found that simply loading all information into the model got us to an Rsquared of 33.7%, not a bad score. Our first try at using some statistical concepts to preprocess the data only got us to 15%. It did take my laptop 6 hours to run the model in the 'modern approach', while the traditional model was done after 20 minutes.

At heart I am a traditionalist. Having studied as an economist, I strive to understand a problem, think it through and find a solution. Ultimately, to use the model we often need the deep understanding that you develop through the extra time investment and hard work. How else are we going to convince other people to use a model for their business problem?

When do I prefer the modern approach? When for a particular business problem the outcome is more important than the process, or when the important variables in the dataset that drive the outcome can't be influenced themselves. This Italian water level issue is actually a perfect example of a situation that tilts quite heavily towards the modern approach. We find that temperature and rainfall influence water levels, and since we can't actually influence temperature or rainfall, all we really care about is predicting the water level. It would be great to know that the temperature on a mountain is highly predictive of the water level, but once we know this we can't influence it in any way.

In short, the mechanism of why something happens is irrelevant if we can’t influence or change the mechanism in the future.

I will be back soon with a more in-depth look at the modelling process I've gone through for this challenge. Further improvements can definitely be made, such as transformations of the data and making use of lags on the outcome variable.