Model calibration in the context of this post, is about finding optimal hyperparameters for Statistical/Machine learning (ML) models. Optimal in the sense that they minimize a given criterion such as model’s accuracy on unseen data, model’s precision, Root Mean Squared Error (RMSE), etc. What are ML models’ hyperparameters? Let’s take the example of a linear model:
y = beta_0 + beta_1 x_1 + beta_2 x_2
y is a car’s fuel consumption in Miles/(US) gallon.
x_1 is its horsepower, and
x_2 its number of cylinders. Knowing the values of
x_2, we would like to estimate the average value of
y for many different cars.
beta_2 are unknown model parameters, typically estimated by minimizing the distance between the observed car’s consumption
y, and the model
beta_0 + beta_1 x_1 + beta_2 x_2. With such a model, we can obtain for example that:
estimated fuel cons. = 0.1 + 0.4 x horsepower + 0.7 x no. of cylinders
Sometimes, when designing our linear model, we will want the unknown coefficients
beta_2 to be bounded (
beta_2 could otherwise exhibit a high variance). Or, we could want to consider a different polynomial degree
beta_2 are model parameters, the polynomial degree
d on explanatory variables and the bound
s put on parameters
beta_2 are model hyperparameters.
Hyperparameters are those parameters that you can tune, in order to increase/decrease the model’s performance.
d is a degree of freedom. It controls model’s flexibility. The higher
d, the more flexible our model – meaning than it could almost fit “anything”.
s is a regularization parameter that stabilizes model estimates. Increasing
d might lead to overfitting, and a lower
d, to underfitting. Overfitting or underfitting are about: too much flexibility or not enough. We’ll use the
mtcars dataset to illustrate these concepts. This dataset is available from
R console, as:
According to its description,
mtcars is extracted from 1974 Motor Trend US magazine, and comprises fuel consumption (in Miles/(US) gallon) and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). We’ll use 5 explanatory variables among 10 here.
mpg: Miles/(US) gallon # this is y cyl: Number of cylinders # this is x_1 disp: Displacement (cu.in.) # this is x_2 hp: Gross horsepower # this is x_3 wt: Weight (1000 lbs) # this is x_4 carb: Number of carburetors # this is x_5
Below, are the correlations between the target variable (to be explained),
mpg, and explanatory variables
x_5. We use R package
corrplot to plot these correlations.
All the explanatory variables are negatively correlated to the fuel consumption (in Miles/(US) gallon). A marginal increase in any of them leads, on average, to a decrease in fuel consumption. Now, in order to illustrate the concepts of overfitting and underfitting, we fit a linear model and a smoothing spline to
mpg (consumption) and
On the left: model fitting on 23 cars, for a linear model and a spline. The linear model fits all the points parsimoniously, but the spline tries to memorize the patterns. On the right: errors obtained by each model on the 9 remaining cars, as a function of the spline’s degrees of freedom. That’s overfitting, illustrated. In other situations, a linear model can also fit (very) poorly, because it’s not flexible enough.
So, how do we find a good compromise between overfitting or underfitting? One way to achieve it is to use a hold-out sample, as we did on the previous example – with 23 cars out of 32 in the training procedure, and 9 for testing. Another way is to use cross-validation. The idea of cross-validation is to divide the whole dataset into k parts (usually called folds); each part being successively included into a training set or a testing set.
On this graph, we have k=5.
crossval is a – work in progress –
R package, for doing just that. WHY did I implement it? Because
R models are contributed by many different people. So, you’re not using a unified interface when training them. For example, in order to obtain predictions for 2 different models, you can have 2 different specifications of function
fitting_obj_model* are the trained models
newdata are the unseen data on which we would like to test the trained model. The position of arguments in function calls do also matter a lot. Idea: use a common cross-validation interface for many different models. Hence,
crossval. There is still room for improvement. If you find cases that are not covered by
crossval, you can contribute them here. Currently, the package can be installed from Github as (in R console):
Here is an example of use of
crossval applied to glmnet (with my old school
R syntax yeah, I like it!):
require(glmnet) require(Matrix) # load the dataset data("mtcars") df <- mtcars[, c(1, 2, 3, 4, 6, 11)] summary(df) # create response and explanatory variables X <- as.matrix(df[, -1]) y <- df$mpg # grid of model hyperparameters tuning_grid <- expand.grid(alpha = c(0, 0.5, 1), lambda = c(0.01, 0.1, 1)) n_params <- nrow(tuning_grid) # list of cross-validation results # - 5-fold cross-validation (`k`) # - repeated 3 times (`repeats`) # - cross-validation on 80% of the data (`p`) # - validation on the remaining 20% cv_results <- lapply(1:n_params, function(i) crossval::crossval( x = X, y = y, k = 5, repeats = 3, p = 0.8, fit_func = glmnet::glmnet, predict_func = predict.glmnet, packages = c("glmnet", "Matrix"), fit_params = list(alpha = tuning_grid[i, "alpha"], lambda = tuning_grid[i, "lambda"]) )) names(cv_results) <- paste0("params_set", 1:n_params) print(cv_results)
Note: I am currently looking for a gig. You can hire me on Malt or send me an email: thierry dot moudiki at pm dot me. I can do descriptive statistics, data preparation, feature engineering, model calibration, training and validation, and model outputs’ interpretation. I am fluent in Python, R, SQL, Microsoft Excel, Visual Basic (among others) and French. My résumé? Here!