# Bayesian Optimization of Machine Learning Models

June 7, 2016
By

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Max Kuhn: Director, Nonclinical Statistics, Pfizer

Many predictive and machine learning models have structural or tuning parameters that cannot be directly estimated from the data. For example, when using K-nearest neighbor model, there is no analytical estimator for K (the number of neighbors). Typically, resampling is used to get good performance estimates of the model for a given set of values for K and the one associated with the best results is used. This is basically a grid search procedure. However, there are other approaches that can be used. I’ll demonstrate how Bayesian optimization and Gaussian process models can be used as an alternative.

To demonstrate, I’ll use the regression simulation system of Sapp et al. (2014) where the predictors (i.e. x’s) are independent Gaussian random variables with mean zero and a variance of 9. The prediction equation is:

x_1 + sin(x_2) + log(abs(x_3)) + x_4^2 + x_5*x_6 +
I(x_7*x_8*x_9 < 0) + I(x_10 > 0) + x_11*I(x_11 > 0) +
sqrt(abs(x_12)) + cos(x_13) + 2*x_14 + abs(x_15) +
I(x_16 < -1) + x_17*I(x_17 < -1) - 2 * x_18 - x_19*x_20


The random error here is also Gaussian with mean zero and a variance of 9. This simulation is available in the caret package via a function called SLC14_1. First, we’ll simulate a training set of 250 data points and also a larger set that we will use to elucidate the true parameter surface:

> library(caret)
> set.seed(7210)
> train_dat <- SLC14_1(250)
> large_dat <- SLC14_1(10000)

We will use a radial basis function support vector machine to model these data. For a fixed epsilon, the model will be tuned over the cost value and the radial basis kernel parameter, commonly denotes as sigma. Since we are simulating the data, we can figure out a good approximation to the relationship between these parameters and the root mean squared error (RMSE) or the model. Given our specific training set and the larger simulated sample, here is the RMSE surface for a wide range of values: There is a wide range of parameter values that are associated with very low RMSE values in the northwest.

A simple way to get an initial assessment is to use random search where a set of random tuning parameter values are generated across a “wide range”. For a RBF SVM, caret’s train function defines wide as cost values between 2^c(-5, 10) and sigma values inside the range produced by the sigest function in the kernlab package. This code will do 20 random sub-models in this range:

> rand_ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+                           search = "random")
>
> set.seed(308)
> rand_search <- train(y ~ ., data = train_dat,
+                      ## Create 20 random parameter values
+                      tuneLength = 20,
+                      metric = "RMSE",
+                      preProc = c("center", "scale"),
+                      trControl = rand_ctrl)
> rand_search
Support Vector Machines with Radial Basis Function Kernel

250 samples
20 predictor

Pre-processing: centered (20), scaled (20)
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 226, 224, 224, 225, 226, 224, ...
Resampling results across tuning parameters:

sigma       C             RMSE      Rsquared
0.01161955   42.75789360  10.50838  0.7299837
0.01357777   67.97672171  10.71276  0.7212605
0.01392676  828.08072944  10.75235  0.7195869
0.01394119    0.18386619  18.56921  0.2109284
0.01538656    0.05224914  19.33310  0.1890599
0.01711920  228.59215128  11.09522  0.7047713
0.01790202    0.78835920  16.78597  0.3217203
0.01936110    0.91401289  16.45485  0.3492278
0.02023763    0.07658831  19.03987  0.2081059
0.02690269    0.04128731  19.33974  0.2126950
0.02780880    0.64865483  16.52497  0.3545042
0.02920113  974.08943821  12.22906  0.6508754
0.02963586    1.19350198  15.46690  0.4407725
0.03370625   31.45179445  12.60653  0.6314384
0.03561750    0.04970422  19.23564  0.2306298
0.03752561    0.06592800  19.07130  0.2375616
0.03783570  398.44599747  12.92958  0.6143790
0.04534046    3.91017571  13.56612  0.5798001
0.05171719  296.65916049  13.88865  0.5622445
0.06482201   47.31716568  14.66904  0.5192667

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were sigma = 0.01161955 and C
= 42.75789. 
> ggplot(rand_search) + scale_x_log10() + scale_y_log10()  > getTrainPerf(rand_search)
  TrainRMSE TrainRsquared    method
1  10.50838     0.7299837 svmRadial

There are other approaches that we can take, including a more comprehensive grid search or using a nonlinear optimizer to find better values of cost and sigma. Another approach is to use Bayesian optimization to find good values for these parameters. This is an optimization scheme that uses Bayesian models based on Gaussian processes to predict good tuning parameters.

Gaussian Process (GP) regression is used to facilitate the Bayesian analysis. If creates a regression model to formalize the relationship between the outcome (RMSE, in this application) and the SVM tuning parameters. The standard assumption regarding normality of the residuals is used and, being a Bayesian model, the regression parameters also gain a prior distribution that is multivariate normal. The GP regression model uses a kernel basis expansion (much like the SVM model does) in order to allow the model to be nonlinear in the SVM tuning parameters. To do this, a radial basis function kernel is used for the covariance function of the multivariate normal prior and maximum likelihood is used to estimate the kernel parameters of the GP.

In the end, the GP regression model can take the current set of resampled RMSE values and make predictions over the entire space of potential cost and sigma parameters. The Bayesian machinery allows of this prediction to have a distribution; for a given set of tuning parameters, we can obtain the estimated mean RMSE values as well as an estimate of the corresponding prediction variance. For example, if we were to use our data from the random search to build a GP model, the predicted mean RMSE would look like: The darker regions indicate smaller RMSE values given the current resampling results. The predicted standard deviation of the RMSE is: The prediction noise becomes larger (e.g. darker) as we move away from the current set of observed values.

(The GPfit package was used to create these models.)

To find good parameters to test, there are several approaches. This paper (pdf) outlines several but we will use the confidence bound approach. For any combination of cost and sigma, we can compute the lower confidence bound of the predicted RMSE. Since this takes the uncertainty of prediction into account it has the potential to produce better directions to take the optimization. Here is a plot of the confidence bound using a single standard deviation of the predicted mean: Darker values indicate better conditions to explore. Since we know the true RMSE surface, we can see that the best region (the northwest) is estimated to be an interesting location to take the optimization. The optimizer would pick a good location based on this model and evaluate this as the next parameter value. This most recent configuration is added to the GP’s training set and the process continues for a pre-specified number of iterations.

Yachen Yan created an R package for Bayesian optimization. He also made a modification so that we can use our initial random search as the substrate to the first GP used. To search a much wider parameter space, our code looks like:

> ## Define the resampling method
> ctrl <- trainControl(method = "repeatedcv", repeats = 5)
>
> ## Use this function to optimize the model. The two parameters are
> ## evaluated on the log scale given their range and scope.
> svm_fit_bayes <- function(logC, logSigma) {
+   ## Use the same model code but for a single (C, sigma) pair.
+   txt <- capture.output(
+     mod <- train(y ~ ., data = train_dat,
+                  preProc = c("center", "scale"),
+                  metric = "RMSE",
+                  trControl = ctrl,
+                  tuneGrid = data.frame(C = exp(logC), sigma = exp(logSigma)))
+   )
+   ## The function wants to _maximize_ the outcome so we return
+   ## the negative of the resampled RMSE value. Pred can be used
+   ## to return predicted values but we'll avoid that and use zero
+   list(Score = -getTrainPerf(mod)[, "TrainRMSE"], Pred = 0)
+ }
>
> ## Define the bounds of the search.
> lower_bounds <- c(logC = -5, logSigma = -9)
> upper_bounds <- c(logC = 20, logSigma = -0.75)
> bounds <- list(logC = c(lower_bounds, upper_bounds),
+                logSigma = c(lower_bounds, upper_bounds))
>
> ## Create a grid of values as the input into the BO code
> initial_grid <- rand_search$results[, c("C", "sigma", "RMSE")] > initial_grid$C <- log(initial_grid$C) > initial_grid$sigma <- log(initial_grid$sigma) > initial_grid$RMSE <- -initial_grid$RMSE > names(initial_grid) <- c("logC", "logSigma", "Value") > > ## Run the optimization with the initial grid and do > ## 30 iterations. We will choose new parameter values > ## using the upper confidence bound using 1 std. dev. > > library(rBayesianOptimization) > > set.seed(8606) > ba_search <- BayesianOptimization(svm_fit_bayes, + bounds = bounds, + init_grid_dt = initial_grid, + init_points = 0, + n_iter = 30, + acq = "ucb", + kappa = 1, + eps = 0.0, + verbose = TRUE) 20 points in hyperparameter space were pre-sampled elapsed = 1.53 Round = 21 logC = 5.4014 logSigma = -5.8974 Value = -10.8148 elapsed = 1.54 Round = 22 logC = 4.9757 logSigma = -5.0449 Value = -9.7936 elapsed = 1.42 Round = 23 logC = 5.7551 logSigma = -5.0244 Value = -9.8128 elapsed = 1.30 Round = 24 logC = 5.2754 logSigma = -4.9678 Value = -9.7530 elapsed = 1.39 Round = 25 logC = 5.3009 logSigma = -5.0921 Value = -9.5516 elapsed = 1.48 Round = 26 logC = 5.3240 logSigma = -5.2313 Value = -9.6571 elapsed = 1.39 Round = 27 logC = 5.3750 logSigma = -5.1152 Value = -9.6619 elapsed = 1.44 Round = 28 logC = 5.2356 logSigma = -5.0969 Value = -9.4167 elapsed = 1.38 Round = 29 logC = 11.8347 logSigma = -5.1074 Value = -9.6351 elapsed = 1.42 Round = 30 logC = 15.7494 logSigma = -5.1232 Value = -9.4243 elapsed = 25.24 Round = 31 logC = 14.6657 logSigma = -7.9164 Value = -8.8410 elapsed = 32.60 Round = 32 logC = 18.3793 logSigma = -8.1083 Value = -8.7139 elapsed = 1.86 Round = 33 logC = 20.0000 logSigma = -5.6297 Value = -9.0580 elapsed = 0.97 Round = 34 logC = 20.0000 logSigma = -1.5768 Value = -19.2183 elapsed = 5.92 Round = 35 logC = 17.3827 logSigma = -6.6880 Value = -9.0224 elapsed = 18.01 Round = 36 logC = 20.0000 logSigma = -7.6071 Value = -8.5728 elapsed = 114.49 Round = 37 logC = 16.0079 logSigma = -9.0000 Value = -8.7058 elapsed = 89.31 Round = 38 logC = 12.8319 logSigma = -9.0000 Value = -8.6799 elapsed = 99.29 Round = 39 logC = 20.0000 logSigma = -9.0000 Value = -8.5596 elapsed = 106.88 Round = 40 logC = 14.1190 logSigma = -9.0000 Value = -8.5150 elapsed = 4.84 Round = 41 logC = 13.4694 logSigma = -6.5271 Value = -8.9728 elapsed = 108.37 Round = 42 logC = 19.0216 logSigma = -9.0000 Value = -8.7461 elapsed = 52.43 Round = 43 logC = 13.5273 logSigma = -8.5130 Value = -8.8728 elapsed = 39.69 Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956 elapsed = 5.99 Round = 45 logC = 20.0000 logSigma = -6.7208 Value = -8.9455 elapsed = 113.01 Round = 46 logC = 14.9611 logSigma = -9.0000 Value = -8.7576 elapsed = 27.45 Round = 47 logC = 19.6181 logSigma = -7.9872 Value = -8.6186 elapsed = 116.00 Round = 48 logC = 17.3060 logSigma = -9.0000 Value = -8.6820 elapsed = 2.26 Round = 49 logC = 14.2698 logSigma = -5.8297 Value = -9.1837 elapsed = 64.50 Round = 50 logC = 20.0000 logSigma = -8.6438 Value = -8.6914 Best Parameters Found: Round = 44 logC = 20.0000 logSigma = -8.3288 Value = -8.4956  Animate the search! The final settings were found at iteration 44 with a cost setting of 485,165,195 and sigma=0.0002043. I would have never thought to evaluate a cost parameter so large and the algorithm wants to make it even larger. Does it really work? We can fit a model based on the new configuration and compare it to random search in terms of the resampled RMSE and the RMSE on the test set: > set.seed(308) > final_search <- train(y ~ ., data = train_dat, + method = "svmRadial", + tuneGrid = data.frame(C = exp(ba_search$Best_Par["logC"]),
+                                             sigma = exp(ba_search$Best_Par["logSigma"])), + metric = "RMSE", + preProc = c("center", "scale"), + trControl = ctrl) > compare_models(final_search, rand_search)  One Sample t-test data: x t = -9.0833, df = 49, p-value = 4.431e-12 alternative hypothesis: true mean is not equal to 0 95 percent confidence interval: -2.34640 -1.49626 sample estimates: mean of x -1.92133  > postResample(predict(rand_search, large_dat), large_dat$y)
      RMSE   Rsquared
10.1112280  0.7648765 
> postResample(predict(final_search, large_dat), large_dat\$y)
     RMSE  Rsquared
8.2668843 0.8343405 

Much Better!

Thanks to Yachen Yan for making the rBayesianOptimization package.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.