- Baseline model – usually, this means we assume there are no predictors (i.e., independent variables). Thus, we have to make an educated guess (not a random one), based on the value of the dependent value alone. This model is important because it will allow us to determine how good, or how bad, are the other ones. For example, imagine a fancy model with 97% of accuracy – is it necessarily good and worth implementing? No, it depends; if the baseline accuracy is 60%, it’s probably a good model, but if the baseline is 96.7% it doesn’t seem to add much to what we already know, and therefore its implementation will depend on how much we value this 0.3% edge. It turns out that, in real life, there are many instances where the models, no matter how simple or complex, barely beat the baseline. A simple example: try to predict whether some index of the stock market is going up or down tomorrow, based on the movements of the last N days; you may even add other variables, representing the volatility index, commodities, and so on. Even if you build a neural network with lots of neurons, I’m not expecting you to do much better than simply consider that the direction of tomorrow’s movement will be the same as today’s (in fact, the accuracy of your model can even be worse, due to overfitting!). Predicting stock market movements is a really tough problem;
- A model from inferential statistics – this will be a (generalised) linear model. In the case of a continuous outcome (Part 4a), we will fit a multiple linear regression; for the binary outcome (Part 4b), the model will be a multiple logistic regression;
- Two models from machine learning – we will first build a decision tree (regression tree for the continuous outcome, and classification tree for the binary case); these models usually offer high interpretability and decent accuracy; then, we will build random forests, a very popular method, where there is often a gain in accuracy, at the expense of interpretability.
Let’s now start working on the models for the continuous outcome (i.e., the amount of rain).
Train and test sets
Here is one of the golden rules of machine learning and modelling in general: models are built using training data, and evaluated on testing data. The reason is overfitting: most models’ accuracy can be artificially increased to a point where they “learn” every single detail of the data used to build them; unfortunately, it usually means they lose the capability to generalise. That’s why we need unseen data (i.e., the testing set): if we overfit the training data, the performance on the testing data will be poor. In real life, simple models often beat complex ones, because they can generalise much better. We will do a random 70:30 split in our data set (70% will be for training models, 30% to evaluate them). For reproducibility, we will need to set the seed of the random number generator (it means every time I run the code, I’ll get the same train and test sets. In case you’re following along with the tutorial, you’ll get the same sets, too). Here’s the code:
> # For reproducibility; 123 has no particular meaning > set.seed(123) > # randomly pick 70% of the number of observations (365) > index <- sample(1:nrow(weather),size = 0.7*nrow(weather)) > # subset weather to include only the elements in the index > train <- weather[index,] > # subset weather to include all but the elements in the index > test <- weather [-index,] > nrow(train)  255 > nrow(test)  110
Here is a plot showing which points belong to which set (train or test).
library(ggplot2) # Create a dataframe with train and test indicator... group <- rep(NA,365) group <- ifelse(seq(1,365) %in% index,"Train","Test") df <- data.frame(date=weather$date,rain=weather$rain,group) # ...and plot it ggplot(df,aes(x = date,y = rain, color = group)) + geom_point() + scale_color_discrete(name="") + theme(legend.position="top")
> # Baseline model - predict the mean of the training data > best.guess <- mean(train$rain) > # Evaluate RMSE and MAE on the testing data > RMSE.baseline <- sqrt(mean((best.guess-test$rain)^2)) > RMSE.baseline  13.38996 > MAE.baseline <- mean(abs(best.guess-test$rain)) > MAE.baseline  8.219123
Multiple linear regression
We will now fit a (multiple) linear regression, which is probably the best known statistical model. In simple terms, the dependent variable is assumed to be a linear function of several independent variables (predictors), where each of them has a weight (regression coefficient) that is expected to be statistically significant in the final model. The results are usually highly interpretable and, provided some conditions are met, have good accuracy. Before showing the results, here are some important notes:
- Linear models do not require variables to have a Gaussian distribution (only the errors / residuals must be normally distributed); they do require, however, a linear relation between the dependent and independent variables. As we saw in Part 3b, the distribution of the amount of rain is right-skewed, and the relation with some other variables is highly non-linear. For this reason of linearity, and also to help fixing the problem with residuals having non-constant variance across the range of predictions (called heteroscedasticity), we will do the usual log transformation to the dependent variable. Since we have zeros (days without rain), we can't do a simple ln(x) transformation, but we can do ln(x+1), where x is the rain amount. Why do we choose to apply a logarithmic function? Simply because the regression coefficients can still be interpreted, although in a different way when compared with a pure linear regression. This model we will fit is often called log-linear;
- What I'm showing below is the final model. I started with all the variables as potential predictors and then eliminated from the model, one by one, those that were not statistically significant (p < 0.05). We need to do it one by one because of multicollinearity (i.e., correlation between independent variables). Some of the variables in our data are highly correlated (for instance, the minimum, average, and maximum temperature on a given day), which means that sometimes when we eliminate a non-significant variable from the model, another one that was previously non-significant becomes statistically significant. This iterative process of backward elimination stops when all the variables in the model are significant (in the case of factors, here we consider that at least one level must be significant);
- Our dependent variable has lots of zeros and can only take positive values; if you're an expert statistician, perhaps you would like to fit very specific models that can deal better with count data, such as negative binomial, zero-inflated and hurdle models. There are several packages to do it in R. For simplicity, we'll stay with the linear regression model in this tutorial.
> # Create a multiple (log)linear regression model using the training data > lin.reg <- lm(log(rain+1) ~ season + h.temp + ave.temp + ave.wind + gust.wind + dir.wind + as.numeric(gust.wind.hour), data = train) > # Inspect the model > summary(lin.reg) Call: lm(formula = log(rain + 1) ~ season + h.temp + ave.temp + ave.wind + gust.wind + dir.wind + as.numeric(gust.wind.hour), data = train) Residuals: Min 1Q Median 3Q Max -1.93482 -0.48296 -0.05619 0.39849 2.21069 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.028020 0.445699 0.063 0.949926 seasonSummer 0.370488 0.151962 2.438 0.015523 * seasonAutumn 0.696291 0.153209 4.545 8.87e-06 *** seasonWinter 0.297473 0.159602 1.864 0.063612 . h.temp -0.197772 0.045925 -4.306 2.45e-05 *** ave.temp 0.165975 0.054907 3.023 0.002786 ** ave.wind -0.104736 0.041431 -2.528 0.012140 * gust.wind 0.062181 0.008612 7.220 7.44e-12 *** dir.windENE 0.251267 0.340192 0.739 0.460898 dir.windESE 0.303855 0.796890 0.381 0.703330 dir.windN 0.405041 0.333188 1.216 0.225357 dir.windNE 0.358754 0.339188 1.058 0.291303 dir.windNNE 1.803748 0.508091 3.550 0.000467 *** dir.windNNW 0.390606 0.309850 1.261 0.208715 dir.windNW 0.346885 0.290911 1.192 0.234324 dir.windS 1.060690 0.334293 3.173 0.001714 ** dir.windSE 1.210005 0.326271 3.709 0.000261 *** dir.windSSE 1.488857 0.331211 4.495 1.10e-05 *** dir.windSSW 1.292246 0.349473 3.698 0.000272 *** dir.windSW 0.852924 0.384445 2.219 0.027488 * dir.windW 0.680786 0.467912 1.455 0.147042 dir.windWNW 0.627473 0.342009 1.835 0.067841 . dir.windWSW 1.279782 0.799807 1.600 0.110940 as.numeric(gust.wind.hour) -0.026972 0.011341 -2.378 0.018204 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.741 on 231 degrees of freedom Multiple R-squared: 0.6637, Adjusted R-squared: 0.6302 F-statistic: 19.82 on 23 and 231 DF, p-value: < 2.2e-16
> # What is the multiplicative effect of the wind variable? > exp(lin.reg$coefficients["gust.wind"]) gust.wind 1.064155
> # Apply the model to the testing data (i.e., make predictions) ... > # (Don't forget to exponentiate the results to revert the log transformation) > test.pred.lin <- exp(predict(lin.reg,test))-1 > # ...and evaluate the accuracy > RMSE.lin.reg <- sqrt(mean((test.pred.lin-test$rain)^2)) > RMSE.lin.reg  10.26189 > MAE.lin.reg <- mean(abs(test.pred.lin-test$rain)) > MAE.lin.reg  4.713084
Here are the main conclusions about the model we have just built:
- The R-squared is 0.66, which means that 66% of the variance in our dependent variable can be explained by the set of predictors in the model; at the same time, the adjusted R-squared is not far from that number, meaning that the original R-squared has not been artificially increased by adding variables to the model. Note that the R-squared can only increase or stay the same by adding variables, whereas the adjusted R-squared can even decrease if the variable added doesn't help the model more than what is expected by chance;
- All the variables are statistically significant (p < 0.05), as expected from the way the model was built, and the most significant predictor is the wind gust (p = 7.44e-12). The advantage of doing a log transformation is that, if the regression coefficient is small (i.e. -0.1 to 0.1), a unit increase in the independent variable yields an increase of approximately coeff*100% in the dependent variable. To be clear, the coefficient of the wind gust is 0.062181. It means that a unit increase in the gust wind (i.e., increasing the wind by 1 km/h), increases the predicted amount of rain by approximately 6.22%. You can always exponentiate to get the exact value (as I did), and the result is 6.42%. By the same token, for each degree (ºC) the daily high temperature increases, the predicted rain increases by exp(-0.197772) = 0.82 (i.e., it decreases by 18%);
- Both the RMSE and MAE have decreased significantly when compared with the baseline model, which means that this linear model, despite all the linearity issues and the fact that it predicts negative values of rain in some days, is still much better, overall, than our best guess.
We will see later, when we compare the fitted vs actual values for all models, that this model has an interesting characteristic: it predicts reasonably well daily rain amounts between 0 and 25 mm, but the predicting capability degrades significantly in the 25 to 70mm range.
A decision tree (also known as regression tree for continuous outcome variables) is a simple and popular machine learning algorithm, with a few interesting advantages over linear models: they make no assumptions about the relation between the outcome and predictors (i.e., they allow for linear and non-linear relations); the interpretability of a decision tree could not be higher - at the end of the process, a set of rules, in natural language, relating the outcome to the explanatory variables, can be easily derived from the tree. Furthermore, a decision tree is the basis of a very powerful method that we will also use in this tutorial, called random forest.
How do we grow trees, then? In very simple terms, we start with a root node, which contains all the training data, and split it into two new nodes based on the most important variable (i.e., the variable that better separates the outcome into two groups). For each new node, a split is attempted (always based on the variable that better splits that particular node, which is not necessarily the same variable we started with), until any of the stop crtiteria is met (for instance, in the R implementation, one of the defaults is that the node must have at least 20 observations, otherwise a new split is not attempted). The disadvantage of a decision tree is that they tend to overfit; before a stop criterion is reached, the algorithm keeps splitting nodes over and over, learning all the details of the training data. Big trees are often built that perform very well in the training data, but can't generalise well and therefore do very poorly on the test set.
What to do, then? This is where the extremely important concept of cross-validation comes into play (just as understanding the difference between the training and testing data, it is also of paramount importance to understand what cross-validation is). When we grow a tree in R (or any other software, for that matter), it internally performs a 10-fold cross-validation. What this means is that 10 cycles are run, and in each of them 90% of the training data is used for building the model, and the remaining 10%, even though they're part of the training data, are only used to predict and evaluate the results. In another words, these 10% are momentarily a test set. It should be obvious that, after the 10 cycles, 10 different chunks of data are used to test the model, which means every single observation is used, at some point, not only to train but also to test the model. Since we are testing at the same time we're growing a tree, we have a error measurement, that we use to find the optimal number of splits. We then prune the original tree, and keep only the optimal number of splits. In sum, when growing a tree, this is what we do:
- Grow a full tree, usually with the default settings;
- Examine the cross-validation error (x-error), and find the optimal number of splits. This corresponds, in R, to a value of cp (complexity parameter);
- Prune the tree using the complexity parameter above.
What usually happens is that either the pruned tree performs better on the test set or it performs about the same as the full-grown tree. Even in the latter case, it is useful to prune the tree, because less splits means less decision rules and higher interpretability, for the same level of performance. Now, let's do all of this in practice.
> # Needed to grow a tree > library(rpart) > # To draw a pretty tree (fancyRpartPlot function) > library(rattle) > # rpart function applied to a numeric variable => regression tree > rt <- rpart(rain ~ month + season + l.temp + h.temp + ave.temp + ave.wind + gust.wind + dir.wind + dir.wind.8 + as.numeric(h.temp.hour)+ as.numeric(l.temp.hour)+ as.numeric(gust.wind.hour), data=train) > # Full-grown tree with 8 splits using 6 different variables > # (Not running the line below - do it to see the tree) > # fancyRpartPlot(rt) > # As always, predict and evaluate on the test set > test.pred.rtree <- predict(rt,test) > RMSE.rtree <- sqrt(mean((test.pred.rtree-test$rain)^2)) > RMSE.rtree  11.37786 > MAE.rtree <- mean(abs(test.pred.rtree-test$rain)) > MAE.rtree  6.140937
> # Check cross-validation results (xerror column) > # It corresponds to 2 splits and cp = 0.088147 > printcp(rt) Regression tree: rpart(formula = rain ~ month + season + l.temp + h.temp + ave.temp + ave.wind + gust.wind + dir.wind + dir.wind.8 + as.numeric(h.temp.hour) + as.numeric(l.temp.hour) + as.numeric(gust.wind.hour), data = train) Variables actually used in tree construction:  as.numeric(h.temp.hour) ave.temp dir.wind  gust.wind h.temp month Root node error: 32619/255 = 127.92 n= 255 CP nsplit rel error xerror xstd 1 0.358496 0 1.00000 1.00815 0.22993 2 0.102497 1 0.64150 0.72131 0.19653 3 0.088147 2 0.53901 0.71921 0.19592 4 0.045689 3 0.45086 0.75017 0.19972 5 0.028308 4 0.40517 0.79740 0.20348 6 0.024935 5 0.37686 0.80947 0.19980 7 0.011454 6 0.35193 0.79036 0.19880 8 0.010600 7 0.34047 0.75514 0.17788 9 0.010000 8 0.32987 0.75514 0.17788 > # Get the optimal CP programmatically... > min.xerror <- rt$cptable[which.min(rt$cptable[,"xerror"]),"CP"] > min.xerror  0.08814737 > # ...and use it to prune the tree > rt.pruned <- prune(rt,cp = min.xerror) > # Plot the pruned tree > fancyRpartPlot(rt.pruned)
> # Evaluate the new pruned tree on the test set > test.pred.rtree.p <- predict(rt.pruned,test) > RMSE.rtree.pruned <- sqrt(mean((test.pred.rtree.p-test$rain)^2)) > RMSE.rtree.pruned  11.51547 > MAE.rtree.pruned <- mean(abs(test.pred.rtree.p-test$rain)) > MAE.rtree.pruned  6.138062
- If the daily maximum wind speed exceeds 52 km/h (4% of the days), predict a very wet day (37 mm);
- If the daily maximum wind is between 36 and 52 km/h (23% of the days), predict a wet day (10mm);
- If the daily maximum wind stays below 36 km/h (73% of the days), predict a dry day (1.8 mm);
The accuracy of this extremely simple model is only a bit worse than the much more complicated linear regression. Next, instead of growing only one tree, we will grow the whole forest, a method that is very powerful and, more often than not, yields in very good results.
What if, instead of growing a single tree, we grow many (ranging from a few hundred to a few thousand), and introduce some sources of randomness, so that each tree is most likely different from the others? What we get is a random forest (Leo Breiman, 2001), a popular algorithm that pretty much every data scientist in the world knows. In fact, when it comes to problems where the focus is not so much in understanding the data, but in make predictions, random forests are often used immediately after preparing the data set, skipping entirely the EDA stage. This is basically how the algorithm works:
- To grow each tree, a random sample of the training data, with the size of the training data itself, is taken, with replacement. This means that some observations might appear several times in the sample, and others are left out (the proportion of discarded observations converges to 36.8%, for very large sample sizes). In another words, each tree will be grown with its own version of the training data;
- The variables that are used to grow each tree are also randomly sampled; the sample size is 1/3 and the square root of the number of independent variables, for regression and classification problems, respectively;
- Each tree is then fully grown, without any pruning, using its own training set and variable set;
- In classification problems, each tree casts a vote for the class it predicts and the final prediction is the class that has the majority of votes. In regression problems, the predicted value is a weighted average of the value predicted by each tree.
- They do not overfit. Even though each component of the forest (i.e. each tree) is probably overfitting a lot, since they are all different, the mistakes they make are in all directions; when the errors are averaged, they kind of cancel each other. Although each classifier is weak (recall the training set and variables are randomly sampled), when put together they become a strong classifier (this is the concept of ensemble learning);
- There is no need for cross-validation. Remember those 36 to 37% of observations that are left out when sampling from the training set? Well, these data (called out-of-bag) are used not only to estimate the error, but also to measure the importance of each variable. All of this is happening at the same time the model is being built;
- We can grow as many tree as we want (the limit is the computational power). What usually happens, however, is that the estimated error can no longer be improved after a certain number of trees is grown. Typical number for error convergence is between 100 and 2000 trees, depending on the complexity of the problem.
- Although we usually improve accuracy, it comes at a cost: interpretability. A random forest is akin to a black box model, where it is hard to understand what's going on inside; anyway, we still have an estimate for variable importance, which is more than some other models can offer.
> # Needed to run the algorithm > library(randomForest) > # Convert some factor variables to numeric (train and test sets) > train$h.temp.hour <- as.numeric(train$h.temp.hour) > train$l.temp.hour <- as.numeric(train$l.temp.hour) > train$gust.wind.hour <- as.numeric(train$gust.wind.hour) > test$h.temp.hour <- as.numeric(test$h.temp.hour) > test$l.temp.hour <- as.numeric(test$l.temp.hour) > test$gust.wind.hour <- as.numeric(test$gust.wind.hour) > # For reproducibility; 123 has no particular meaning > # Run this immediately before creating the random forest > set.seed(123) > # Create a random forest with 1000 trees > rf <- randomForest(rain ~ month + season + l.temp + h.temp + ave.temp + ave.wind + gust.wind + dir.wind + dir.wind.8 + h.temp.hour + l.temp.hour + gust.wind.hour, data = train, importance = TRUE, ntree=1000) > # How many trees are needed to reach the minimum error estimate? > # This is a simple problem; it appears that about 100 trees would be enough. > which.min(rf$mse)  93 > # Plot rf to see the estimated error as a function of the number of trees > # (not running it here) > # plot(rf) > # Using the importance() function to calculate the importance of each variable > imp <- as.data.frame(sort(importance(rf)[,1],decreasing = TRUE),optional = T) > names(imp) <- "% Inc MSE" > imp % Inc MSE gust.wind 21.241930 dir.wind 12.185181 ave.wind 11.864757 h.temp 9.499869 month 9.039023 ave.temp 8.420760 dir.wind.8 7.635521 l.temp 5.789758 season 5.262452 gust.wind.hour 4.725368 h.temp.hour 3.961188 l.temp.hour 1.979231
> # As usual, predict and evaluate on the test set > test.pred.forest <- predict(rf,test) > RMSE.forest <- sqrt(mean((test.pred.forest-test$rain)^2)) > RMSE.forest  10.20872 > MAE.forest <- mean(abs(test.pred.forest-test$rain)) > MAE.forest  5.065036
We can see the accuracy improved when compared to the decision tree model, and is just about equal to the performance of the linear regression model. The gust wind speed was, once again, considered the most important predictor; it is estimated that, in the absence of that variable, the error would increase by 21.2%.
Putting it all together
We have just built and evaluated the accuracy of five different models: baseline, linear regression, fully-grown decision tree, pruned decision tree, and random forest. Let's create a data frame with the RMSE and MAE for each of these methods.
> # Create a data frame with the error metrics for each method > accuracy <- data.frame(Method = c("Baseline","Linear Regression","Full tree","Pruned tree","Random forest"), RMSE = c(RMSE.baseline,RMSE.lin.reg,RMSE.rtree,RMSE.rtree.pruned,RMSE.forest), MAE = c(MAE.baseline,MAE.lin.reg,MAE.rtree,MAE.rtree.pruned,MAE.forest)) > # Round the values and print the table > accuracy$RMSE <- round(accuracy$RMSE,2) > accuracy$MAE <- round(accuracy$MAE,2) > accuracy Method RMSE MAE 1 Baseline 13.39 8.22 2 Linear Regression 10.26 4.71 3 Full tree 11.38 6.14 4 Pruned tree 11.52 6.14 5 Random forest 10.21 5.07
# Create a data frame with the predictions for each method all.predictions <- data.frame(actual = test$rain, baseline = best.guess, linear.regression = test.pred.lin, full.tree = test.pred.rtree, pruned.tree = test.pred.rtree.p, random.forest = test.pred.forest) > # First six observations > head(all.predictions) actual baseline linear.regression full.tree pruned.tree random.forest 2 64.8 5.315294 7.2638856 13.8058824 13.8058824 11.424860 4 20.1 5.315294 20.3352957 37.2090909 37.2090909 24.821385 5 9.4 5.315294 13.4020265 13.8058824 13.8058824 21.777313 6 38.9 5.315294 30.8604212 37.2090909 37.2090909 28.978410 7 2.0 5.315294 8.0923795 13.8058824 13.8058824 11.085088 10 0.0 5.315294 0.7544513 0.8281046 0.8281046 1.999272 > # Needed to melt the columns with the gather() function > # tidyr is an alternative to the reshape2 package (see the end of Part3a) > library(tidyr) > # Gather the prediction variables (columns) into a single row (i.e., wide to long) > # Recall the ggplot2 prefers the long data format > all.predictions <- gather(all.predictions,key = model,value = predictions,2:6) > head(all.predictions) actual model predictions 1 64.8 baseline 5.315294 2 20.1 baseline 5.315294 3 9.4 baseline 5.315294 4 38.9 baseline 5.315294 5 2.0 baseline 5.315294 6 0.0 baseline 5.315294 > tail (all.predictions) actual model predictions 545 2.3 random.forest 12.788688 546 0.0 random.forest 0.370295 547 4.3 random.forest 2.159183 548 0.3 random.forest 6.752708 549 0.0 random.forest 3.310518 550 0.0 random.forest 2.126990 # Predicted vs. actual for each model ggplot(data = all.predictions,aes(x = actual, y = predictions)) + geom_point(colour = "blue") + geom_abline(intercept = 0, slope = 1, colour = "red") + geom_vline(xintercept = 23, colour = "green", linetype = "dashed") + facet_wrap(~ model,ncol = 2) + coord_cartesian(xlim = c(0,70),ylim = c(0,70)) + ggtitle("Predicted vs. Actual, by model")
The graph shows that none of the models can predict accurately values over 25 mm of daily rain. In the range from 0 to approximately 22 mm (vertical dashed line for reference), the random forest seems to be the method that better approximates the diagonal line (i.e., smaller prediction error), followed closely by the linear regression. Even though these models can not be considered more than fair, they still do a much better job when compared with the baseline prediction.
In many cases, we build models not only to predict, but also to gain insight from the data. When, in Part 3b, we performed EDA, we saw that the maximum wind speed seemed to be an important predictor. Well, the models confirmed it three times: this variable was the most statistically significant in the linear model, was the only one used to predict in the regression tree, and the one that reduced the estimated error the most in the random forest.
In Part 4b, we will continue building models, this time considering the rain as a binary outcome. This means all models will assign probabilities to the occurrence of rain, for each day in the test set.