Bayesian hyperparameters optimization
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
Introduction
Machine learning models are called by this name because of their ability to learn the best parameter values that are as closest as possible to the right values of the optimum objective function (or loss function). However, since all models require some assumptions (like linearity in linear regression models), parameters (like the cost C in svm models), and settings (like the number of layers in deep learning models) to be prespecified before training the model (in most cases are set by default ), the name of machine learning is not fully justified. Theses prespecified parameters are called hyperparameters, and should be defined in such a way that the corresponding model reaches its best performance (conditionally on the data at hand).
The search of the best hyperparameters called tuning, which simply is training the model with each combination of hyperparameter values many times, using some techniques like crossvalidation to make sure that the resulted loss functions highly likely depends on this specific combination values (not by random chances), and then pick the best one that gives the optimum value for the objective function. This means that if our model requires long computational time (due to fewer hardware resources, or large dataset) we cannot try a large number of combinations, and hence we will be likely far away from the best result.
The main problem, however, is how do we define the space of these hyperparameters to choose from. Many methods are available:

grid search: Using this method the modeler provides the values for each parameter to evaluate and then pick the combination that gives the best result. However, this method based entirely on the experience of the modeler with the model in question and the data at hand. So with not enough experience, which is often the case, the choice in most cases would be arbitrary.

Random search: with this method, we choose randomly the values for each hyperparameter, and it turns out that this method is sometimes more accurate than the previous one. However, this method also suffers from the arbitrary choice of values.

Bayesian hyperparameters: This method uses Bayesian optimization to guide a little bit the search strategy to get the best hyperparameter values with minimum cost (the cost is the number of models to train). We will briefly discuss this method, but if you want more detail you can check the following great article.
we will focus on the best one which is Bayesian hyperparameters, but we first start by briefly introducing the others.
To well understand these methods we will make use of small dataset with a small number of predictors, and we will use two models, the machine learning model Random forest and the deep learning model feedforward neural network.
Bayesian optimization
The main idea behind this method is very simple, at the first iteration we pick a point at random, then at each iteration, and based on Bayes rule, we make a tradeoff between choosing the point that has the highest uncertainty (known as active learning) or choosing the point within the region that has already the best result (optimum objective function) until the current iteration. In more detail, let’s say we are dealing with maximization problem such as maximizing the accuracy rate for a classification problem, then, at each iteration, the Bayes optimization method should decide between focusing the search on the region that contains the best point (that resulted in the maximum objective function) until the current iteration (called exploitation), or inspecting another region with the highest uncertainty (called exploration). The question, however, is how can this method decide between these two options. Well, this method uses what is called acquisition function, which is a function that helps to decide what region should we inspect in the next iteration.
Acquisition functions
Since this problem is always partially random, there exist many forms for this function, and we will discuss briefly the most common ones.
Upper confidence bound
Using this function, the following located point will be that that has the highest upper confidence bound, and, assuming the Gaussian process, this bound will be computed as follows:
\[UCB(x)=\mu(x)+\kappa\sigma(x)\]
Where \(\mu\) and \(\sigma\) are the mean and the standard deviation defined by the Gaussian process, and \(\kappa\) is an exploration parameter, the larger the values the more the exploration.
Probability of improvement PI
this acquisition function chooses the next point that has the highest probability of improvement over the current maximum objective function \(f_{max}\) obtained from the points evaluated previously.
Assuming the Gaussian process, the new point then will be the one that has the highest following probability:
\[PI(x)=\Phi\left(\frac{\mu(x)f{max}\varepsilon}{\sigma(x)}\right)\]
Where \(\varepsilon\) plays the role for trading off between exploration and exploitation, such that larger values result in more exploration than exploitation
Expected improvement EI
This acquisition function, unlike the previous one, tries to quantify how much improvement we get with the new point, the new point that has the maximum expected value will be chosen.
Again, by assuming the Gaussian process, this function can be computed as follows:
\[EI(x) = \left(\mu(x)f_{max}\right)\Phi\left(\frac{\mu(x)f{max}\varepsilon}{\sigma(x)}\right)+\sigma(x)\phi\left(\frac{\mu(x)f{max}\varepsilon}{\sigma(x)}\right)\]
Data preparation
Let’s first call the packages needed along this article.
library(readr) library(tidymodels) library(themis) library(plot3D)
For our illustration, we will use data downloaded from the link described in the below script. It is already split between a training set and a testing set. The target variable is a highly imbalanced binary variable, and the data has a very large number of missing values. To make our analysis simpler, however, we will reduce the size of this data by first, removing the variables with a large number of missing values, then removing completely the remaining ones, and lastly correcting the imbalance distribution using downsampling method . For more detail about this data check my previous article.
train < read_csv("https://archive.ics.uci.edu/ml/machinelearningdatabases/00421/aps_failure_training_set.csv", skip = 20) test < read_csv("https://archive.ics.uci.edu/ml/machinelearningdatabases/00421/aps_failure_test_set.csv", skip = 20)
This data has 171 variables and a total of 76000 instances, 60000 in the training set, and 16000 in the testing sets. Notice that the missing values in the data are represented by lower case na values and not NA and hence are not recognized by R as missing values, That is why the read_csv function converted all the variables that have these values to the character type.
map_chr(train, typeof) %>% tibble() %>% table() . character double 170 1
To fix this problem, therefore, we replace na by NA then converting back the corresponding variable to the numeric type.
train[1] < train[1] %>% modify(~replace(., .=="na", NA)) %>% modify(., as.double) test[1] < test[1] %>% modify(~replace(., .=="na", NA)) %>% modify(., as.double)
Then we keep only the predictors that have less than 600 missing values, and thereafter, we keep only rows without missing values.
names < modify(train[1], is.na) %>% colSums() %>% tibble(names = colnames(train[1]), missing_values=.) %>% filter(missing_values < 600) %>% select(1) train1 < train[c("class",names$names)] %>% .[complete.cases(.),] test1 < test[c("class",names$names)] %>% .[complete.cases(.),] dim(train1) [1] 58888 11 dim(test1) [1] 15728 11
As we see the data now has been reduced to 11 variables. The last thing to check is the distribution of the target variable.
prop.table(table(train1$class)) neg pos 0.98376579 0.01623421
As we see, the data is highly imbalanced, so we correct this problem by downsamplig using the themis package. But first, we create a recipe that defines the formula of our model, normalizes the predictors, and down sample the data. Then we execute the recipe to the data using prep function, and lastly we retrieve the transformed data by juice function.
train2 < recipe(class~., data=train1) %>% step_normalize(all_predictors()) %>% step_downsample(class, seed = 111) %>% prep() %>% juice()
For the testing set, however, we can transform it by the same recipe, but at the end we make use of the bake function instead of juice that uses the data defined in the recipe.
test2 < recipe(class~., data=train1) %>% step_normalize(all_predictors()) %>% themis::step_downsample(class, seed = 111) %>% prep() %>% bake(test1)
It should be noted here that the step_downsample is not needed for the testing set, that is why, the bake does not execute this step by default applied to any new data. We can check this by the dimension that still the same.
dim(test1); dim(test2) [1] 15728 11 [1] 15728 11
Note: The testing set is not needed, since our purpose is the hyperparameter tuning methods, which requires only the training set. But it is included only to show how it is processed with the work flow of tidymodels package.
Random forest model
Since we are dealing with a classification problem, our objective function will be the area under the ROC curve roc_area. And for the model, we will use the most popular one, Random forest model with the two hyperparameters to tune:
 mtry: The number of sampled predictors at each step.
 min_n: The minimum number of instances in the node to split further.
The true distribution of the hyperparameters
With this small data, this model, in the sense of computational time, is not expensive, so we will try a wide range of values for the above hyperparameters, then we can use this as the true distribution to compare the performance of the hyperparameter tuning methods discussed above.
Using tidymodels package we define the model and we leave the mtry
and min_n
for tuning. To speed up, however, computation process we restrict the number of trees to 100 instead of the default fo 500 trees.
model_tune < rand_forest(mtry = tune(), min_n = tune(), trees = 100L) %>% set_engine("ranger", seed=222) %>% set_mode("classification")
To avoid the pitfalls of random samples, we use the crossvalidation technique.
set.seed(1234) folds < vfold_cv(train2, v=5, strata = class)
Then we use the following work flow:
tune_wf < workflow() %>% add_model(model_tune) %>% add_formula(class~.)
Now we are ready to train the model using a wide range of combination values, and then we assume that these combinations are the true distribution of our model. Note that we have 100 combinations, so if the model takes hours to train, then it will require days for tuning. To prevent the model to rerun when rendering this document we save the results in csv file then we load it again. If you want, however, to run this model then uncomment the script
#tuned < tune_wf %>% # tune_grid(resamples = folds, # grid =expand.grid(mtry=1:10, min_n=1:10), # metrics=metric_set(roc_auc))
To extract the results we use the collect_metrics function.
#df < tuned %>% collect_metrics() #write_csv(df, "C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df.csv") df < read_csv("C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df.csv") df < df %>% arrange(mean) %>% tibble(rank=seq_len(nrow(df)), .) df # A tibble: 100 x 8 rank mtry min_n .metric .estimator mean n std_err1 1 1 2 roc_auc binary 0.984 5 0.00198 2 2 3 6 roc_auc binary 0.984 5 0.00204 3 3 3 7 roc_auc binary 0.983 5 0.00269 4 4 3 8 roc_auc binary 0.983 5 0.00304 5 5 2 10 roc_auc binary 0.983 5 0.00200 6 6 3 4 roc_auc binary 0.983 5 0.00246 7 7 4 4 roc_auc binary 0.983 5 0.00224 8 8 2 1 roc_auc binary 0.983 5 0.00170 9 9 3 5 roc_auc binary 0.983 5 0.00271 10 10 4 5 roc_auc binary 0.983 5 0.00246 # ... with 90 more rows
We reached the maximum value for the objective function 0.9837565 with the following hyperparameter values mtry = 1
, and min_n = 2
.
Note: we will ignore the overfitting problem, since we are only making comparison using the same training set.
We can plot this distribution as follows:
scatter3D(x=df$mtry, y=df$min_n, z=df$mean, phi = 0, bty = "g", type = "h", ticktype = "detailed", pch = 19, cex = 0.5, main="the true distribution", xlab="mtry", ylab="min_n", zlab="roc_auc")
random search
Now suppose that we are allowed to try only 10 combinations. For random search strategy, we use the grid_random function from dials package (embedded in tidymodels package).
#tuned_random < tune_wf %>% # tune_grid(resamples = folds, # grid = grid_random(mtry(range = c(1,10)), min_n(range = c(1,10)), #size = 10), # metrics=metric_set(roc_auc)) #df_r < tuned_random %>% collect_metrics() #write_csv(df_r, "C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_r.csv") df_r < read_csv("C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_r.csv") df_r %>% arrange(mean) %>% head(1) %>% inner_join(df, by="mean") %>% select(rank, mtry.x, min_n.x, mean) # A tibble: 1 x 4 rank mtry.x min_n.x mean1 13 3 10 0.983
The maximum value obtained by this method is 0.9826900, which corresponds to the 13th rank compared to the maximum value of the the true distribution, and the associated hyperparameters values are mtry=3
, min_n=10
bayesian optimization UCB
For the Bayesian optimization method, we make use of the tune_bayes function from tune package that provides all the acquisition functions discussed above. So we start by UCB function, and we set the argument kappa equals to 2, which controls the tradeoff between exploitation and exploration such that larger values lead to more exploration. Notice that, by default, this function explores 10 combinations, if you want another number change the argument iter.
#tuned_UCB < tune_wf %>% # tune_bayes(resamples = folds, # param_info=parameters(mtry(range = c(1,10)), min_n(range = c(1,10))), # metrics=metric_set(roc_auc), # objective=conf_bound(kappa = 2)) #df_UCB < tuned_UCB %>% collect_metrics() #write_csv(df_UCB,"C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_UCB.csv") df_UCB < read_csv("C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_UCB.csv") df_UCB %>% arrange(mean) %>% head(1) %>% inner_join(df, by="mean") %>% select(rank, mtry.x, min_n.x, mean) # A tibble: 1 x 4 rank mtry.x min_n.x mean1 2 3 6 0.984
With this acquisition function, we obtained better result than the random search 0.9837529 , which occupies the second position.
bayesian optimization PI
This time we will explore the acquisition functions PI discussed above. Note that larger values for the argument tradeoff lead to more exploration than exploitation.
#tuned_PI < tune_wf %>% # tune_bayes(resamples = folds, # param_info=parameters(mtry(range = c(1,10)), min_n(range = c(1,10))), # metrics=metric_set(roc_auc), # objective=prob_improve(trade_off = 0.01)) #df_PI < tuned_PI %>% collect_metrics() #write_csv(df_PI, "C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_PI.csv") df_PI < read_csv("C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_PI.csv") df_PI %>% arrange(mean) %>% head(1) %>% inner_join(df, by="mean") %>% select(rank, mtry.x, min_n.x, mean) # A tibble: 1 x 4 rank mtry.x min_n.x mean1 11 2 9 0.983
As we see, with this acquisition function, we obtained the 11th position which is more worst than the previous one, but still better than random search.
bayesian optimization EI
Now we try another acquisition function, the expected improvement function.
#tuned_EI < tune_wf %>% # tune_bayes(resamples = folds, # param_info=parameters(mtry(range = c(1,10)), # min_n(range = c(1,10))), # metrics=metric_set(roc_auc), # objective=exp_improve(trade_off = 0.01)) #df_EI < tuned_EI %>% collect_metrics() #write_csv(df_EI, "C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_EI.csv") df_EI < read_csv("C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_EI.csv") df_EI %>% arrange(mean) %>% head(1) %>% inner_join(df, by="mean") %>% select(rank, mtry.x, min_n.x, mean) # A tibble: 1 x 4 rank mtry.x min_n.x mean1 2 3 6 0.984
Here also we get the same result as UCB method.
Contrast the results
df_rf < tibble(names=c("true", "random", "UCB", "PI", "EI"), mtry=c(df[df$mean==max(df$mean),1, drop=TRUE], df_r[df_r$mean==max(df_r$mean),1, drop=TRUE], df_UCB[df_UCB$mean==max(df_UCB$mean),1, drop=TRUE], df_PI[df_PI$mean==max(df_PI$mean),1, drop=TRUE], df_EI[df_EI$mean==max(df_EI$mean),1, drop=TRUE]), min_n=c(df[df$mean==max(df$mean),2, drop=TRUE], df_r[df_r$mean==max(df_r$mean),2, drop=TRUE], df_UCB[df_UCB$mean==max(df_UCB$mean),2, drop=TRUE], df_PI[df_PI$mean==max(df_PI$mean),2, drop=TRUE], df_EI[df_EI$mean==max(df_EI$mean),2, drop=TRUE]), roc_auc=c(max(df$mean), max(df_r$mean),max(df_UCB$mean), max(df_PI$mean),max(df_EI$mean)), std_err=c(df[df$mean==max(df$mean),ncol(df), drop=TRUE], df_r[df_r$mean==max(df_r$mean),ncol(df_r), drop=TRUE], df_UCB[df_UCB$mean==max(df_UCB$mean),ncol(df_UCB), drop=TRUE], df_PI[df_PI$mean==max(df_PI$mean),ncol(df_PI), drop=TRUE], df_EI[df_EI$mean==max(df_EI$mean),ncol(df_EI), drop=TRUE])) df_rf %>% arrange(roc_auc) # A tibble: 5 x 5 names mtry min_n roc_auc std_err1 true 1 1 0.984 0.00198 2 UCB 3 6 0.984 0.00204 3 EI 3 6 0.984 0.00204 4 PI 2 9 0.983 0.00290 5 random 3 10 0.983 0.00275
As we see, the Bayesian optimization method performs better than random search whatever acquisition function used. However, since the acquisition functions have their own hyperparameters (to tradeoff between exploration and exploitation), their performance thus may differ strongly from each other. Moreover, the difference can be larger if we use a larger dataset.
deep learning model
As we did with the random forest model we explore a wide range of hyperparameter values assuming that these values represent the true distribution, then we apply all the above tuning methods. For the model architecture, we use a deep learning model with a single layer, and for tuning, we use the two hyperparameters, the number of nodes, and the number of epochs. The last thing we have to do is to save what we need in csv file and load it again since each time we rerun the model we get a different result, and besides, it takes a lot of time.
model_tune1 < mlp(hidden_units = tune(), epochs = tune()) %>% set_engine("keras") %>% set_mode("classification") tune_wf1 < workflow() %>% add_model(model_tune1) %>% add_formula(class~.) #tuned_deepl < tune_wf1 %>% # tune_grid(resamples = folds, # grid = expand.grid(hidden_units=5:15, epochs=10:20), # metrics=metric_set(roc_auc)) #df1 < tuned_deepl %>% collect_metrics() #write_csv(df1, "C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df1.csv") df1 < read_csv("C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df1.csv") df1 %>% arrange(mean) # A tibble: 121 x 7 hidden_units epochs .metric .estimator mean n std_err1 12 18 roc_auc binary 0.982 5 0.00255 2 11 20 roc_auc binary 0.982 5 0.00265 3 9 14 roc_auc binary 0.982 5 0.00274 4 8 16 roc_auc binary 0.982 5 0.00207 5 14 18 roc_auc binary 0.982 5 0.00241 6 14 20 roc_auc binary 0.982 5 0.00244 7 12 17 roc_auc binary 0.982 5 0.00265 8 10 16 roc_auc binary 0.982 5 0.00230 9 10 19 roc_auc binary 0.981 5 0.00231 10 12 20 roc_auc binary 0.981 5 0.00227 # ... with 111 more rows
With 12 units in the hidden layer and 18 epochs we reached the maximum value 0.9819806 for the area under the ROC curve.
scatter3D(x=df1$hidden_units, y=df1$epochs, z=df1$mean, phi = 0, bty = "g", type = "h", ticktype = "detailed", pch = 19, cex = 0.5, main="the true distribution", xlab="units", ylab="epochs", zlab="roc_auc")
Random search
#tuned_random1 < tune_wf1 %>% # tune_grid(resamples = folds, # grid = grid_random(hidden_units(range = c(5,15)), epochs(range = #c(10,20)), size = 10), # metrics=metric_set(roc_auc)) #df_r1 < tuned_random1 %>% collect_metrics() #write_csv(df_r1, "C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_#r1.csv") df_r1 < read_csv("C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_r1.csv") df_r1 %>% arrange(mean) # A tibble: 10 x 7 hidden_units epochs .metric .estimator mean n std_err1 14 11 roc_auc binary 0.982 5 0.00230 2 14 17 roc_auc binary 0.981 5 0.00240 3 8 15 roc_auc binary 0.981 5 0.00266 4 11 19 roc_auc binary 0.981 5 0.00268 5 14 16 roc_auc binary 0.981 5 0.00221 6 10 12 roc_auc binary 0.980 5 0.00259 7 7 13 roc_auc binary 0.980 5 0.00229 8 5 16 roc_auc binary 0.980 5 0.00241 9 9 10 roc_auc binary 0.980 5 0.00244 10 5 13 roc_auc binary 0.979 5 0.00270
Since deep learning models strongly depend on an internal random process (like when initializing the weights), and besides, the small difference between results, due to the small size of the data, it is harder to contrast between different methods. To alleviate this problem, however, we use this time the standard errors to highlight the significance differences between the methods.
Bayesian optimization UCB
#tuned_UCB1 < tune_wf1 %>% # tune_bayes(resamples = folds, # param_info=parameters(hidden_units(range = c(5L,15L)), # epochs(range = c(10L,20L))), # metrics=metric_set(roc_auc), # objective=conf_bound(kappa = 2)) #df_UCB1 < tuned_UCB1 %>% collect_metrics() #write_csv(df_UCB1, "C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_UCB1.csv") df_UCB1 < read_csv("C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_UCB1.csv") df_UCB1 %>% arrange(mean) # A tibble: 15 x 8 hidden_units epochs .iter .metric .estimator mean n std_err1 12 19 8 roc_auc binary 0.981 5 0.00237 2 12 17 0 roc_auc binary 0.981 5 0.00250 3 9 19 10 roc_auc binary 0.981 5 0.00203 4 12 18 2 roc_auc binary 0.981 5 0.00236 5 11 17 3 roc_auc binary 0.981 5 0.00248 6 13 19 9 roc_auc binary 0.981 5 0.00246 7 13 18 6 roc_auc binary 0.981 5 0.00260 8 13 17 1 roc_auc binary 0.981 5 0.00254 9 10 19 0 roc_auc binary 0.981 5 0.00230 10 11 19 7 roc_auc binary 0.981 5 0.00221 11 13 15 0 roc_auc binary 0.981 5 0.00257 12 11 18 5 roc_auc binary 0.981 5 0.00269 13 12 16 4 roc_auc binary 0.981 5 0.00213 14 5 11 0 roc_auc binary 0.979 5 0.00233 15 9 12 0 roc_auc binary 0.979 5 0.00288
Bayesian optimization PI
#tuned_PI1 < tune_wf1 %>% # tune_bayes(resamples = folds, # param_info=parameters(hidden_units(range = c(5L,15L)), # epochs(range = c(10L,20L))), # metrics=metric_set(roc_auc), # objective=prob_improve(trade_off = 0.01)) #df_PI1 < tuned_PI1 %>% collect_metrics() #write_csv(df_PI1, "C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_PI1.csv") df_PI1 < read_csv("C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_PI1.csv") df_PI1 %>% arrange(mean) # A tibble: 15 x 8 hidden_units epochs .iter .metric .estimator mean n std_err1 7 20 6 roc_auc binary 0.981 5 0.00236 2 10 10 5 roc_auc binary 0.981 5 0.00237 3 9 17 0 roc_auc binary 0.981 5 0.00216 4 15 16 8 roc_auc binary 0.981 5 0.00213 5 14 20 0 roc_auc binary 0.981 5 0.00251 6 11 20 4 roc_auc binary 0.981 5 0.00282 7 15 10 3 roc_auc binary 0.981 5 0.00272 8 6 10 0 roc_auc binary 0.981 5 0.00286 9 13 20 10 roc_auc binary 0.981 5 0.00253 10 12 14 0 roc_auc binary 0.981 5 0.00268 11 8 12 0 roc_auc binary 0.980 5 0.00220 12 13 10 7 roc_auc binary 0.980 5 0.00273 13 8 10 1 roc_auc binary 0.980 5 0.00305 14 5 20 2 roc_auc binary 0.980 5 0.00227 15 5 12 9 roc_auc binary 0.979 5 0.00234
Bayesian optimization EI
#tuned_EI1 < tune_wf1 %>% # tune_bayes(resamples = folds, # param_info=parameters(hidden_units(range = c(5L,15L)), # epochs(range = c(10L,20L))), # metrics=metric_set(roc_auc), # objective=exp_improve(trade_off = 0.01)) #df_EI1 < tuned_EI1 %>% collect_metrics() #write_csv(df_EI1, "C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_EI1.csv") df_EI1 < read_csv("C:/Users/dell/Documents/newblog/content/bayes/hyper_bayes/df_EI1.csv") df_EI1 %>% arrange(mean) # A tibble: 15 x 8 hidden_units epochs .iter .metric .estimator mean n std_err1 10 20 0 roc_auc binary 0.981 5 0.00229 2 15 18 1 roc_auc binary 0.981 5 0.00263 3 7 16 0 roc_auc binary 0.981 5 0.00286 4 12 20 8 roc_auc binary 0.981 5 0.00276 5 8 19 6 roc_auc binary 0.981 5 0.00231 6 14 14 0 roc_auc binary 0.981 5 0.00229 7 13 14 7 roc_auc binary 0.981 5 0.00253 8 5 20 10 roc_auc binary 0.981 5 0.00205 9 6 13 5 roc_auc binary 0.981 5 0.00244 10 12 10 0 roc_auc binary 0.980 5 0.00266 11 11 14 4 roc_auc binary 0.980 5 0.00191 12 10 10 9 roc_auc binary 0.980 5 0.00284 13 5 12 0 roc_auc binary 0.980 5 0.00274 14 5 10 2 roc_auc binary 0.980 5 0.00261 15 9 14 3 roc_auc binary 0.979 5 0.00243
Contrast the results
df_deep < tibble(names=c("true", "random", "UCB", "PI", "EI"), hidden_units=c(df1[df1$mean==max(df1$mean),1, drop=TRUE], df_r1[df_r1$mean==max(df_r1$mean),1, drop=TRUE], df_UCB1[df_UCB1$mean==max(df_UCB1$mean),1, drop=TRUE], df_PI1[df_PI1$mean==max(df_PI1$mean),1, drop=TRUE], df_EI1[df_EI1$mean==max(df_EI1$mean),1, drop=TRUE]), epochs=c(df1[df1$mean==max(df1$mean),2, drop=TRUE], df_r1[df_r1$mean==max(df_r1$mean),2, drop=TRUE], df_UCB1[df_UCB1$mean==max(df_UCB1$mean),2, drop=TRUE], df_PI1[df_PI1$mean==max(df_PI1$mean),2, drop=TRUE], df_EI1[df_EI1$mean==max(df_EI1$mean),2, drop=TRUE]), roc_auc=c(max(df1$mean), max(df_r1$mean), max(df_UCB1$mean), max(df_PI1$mean),max(df_EI1$mean)), std_err=c(df1[df1$mean==max(df1$mean),ncol(df1), drop=TRUE], df_r1[df_r1$mean==max(df_r1$mean),ncol(df_r1), drop=TRUE], df_UCB1[df_UCB1$mean==max(df_UCB1$mean),ncol(df_UCB1), drop=TRUE], df_PI1[df_PI1$mean==max(df_PI1$mean),ncol(df_PI1), drop=TRUE], df_EI1[df_EI1$mean==max(df_EI1$mean),ncol(df_EI1), drop=TRUE])) df_deep # A tibble: 5 x 5 names hidden_units epochs roc_auc std_err1 true 12 18 0.982 0.00255 2 random 14 11 0.982 0.00230 3 UCB 12 19 0.981 0.00237 4 PI 7 20 0.981 0.00236 5 EI 10 20 0.981 0.00229 ggplot(df_deep[1, ]%>% arrange(roc_auc), aes(x=names, y=roc_auc))+ geom_point(size=2, color= "red")+ geom_hline(yintercept = df_deep[df_deep$names=="true", 4, drop=TRUE], color= "blue", lty="dashed")+ geom_errorbar(aes(ymin=roc_auc1.92*std_err, ymax=roc_auc+1.92*std_err))
As we see, we do not obtain any significance difference between these methods, which is essentially due to the small size of the data and the small range used for the hyperparameter values. However, with more complex and large data set the difference will be obvious.
Conclusion
The main purpose of this article is more to show how implementing these methods in practice rather than highlighting the difference of performance between each other. However, the Bayesian optimization is known as the most efficient method, but with the downside that it also requires defining some hyperparameters like the acquisition function, sometimes arbitrarily. In other words, we cannot waste, say the limited budget, that should be concentrated to search for the optimum, to instead try different acquisition functions.
Session info
sessionInfo() R version 4.0.1 (20200606) Platform: x86_64w64mingw32/x64 (64bit) Running under: Windows 10 x64 (build 18362) Matrix products: default locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] plot3D_1.3 themis_0.1.1 yardstick_0.0.6 workflows_0.1.1 [5] tune_0.1.0 tibble_3.0.1 rsample_0.0.7 recipes_0.1.12 [9] purrr_0.3.4 parsnip_0.1.1 infer_0.5.2 ggplot2_3.3.1 [13] dplyr_1.0.0 dials_0.0.7 scales_1.1.1 broom_0.5.6 [17] tidymodels_0.1.0 readr_1.3.1 loaded via a namespace (and not attached): [1] mlr_2.17.1 backports_1.1.7 fastmatch_1.10 [4] tidytext_0.2.4 plyr_1.8.6 igraph_1.2.5 [7] splines_4.0.1 crosstalk_1.1.0.1 listenv_0.8.0 [10] SnowballC_0.7.0 rstantools_2.0.0 inline_0.3.15 [13] digest_0.6.25 foreach_1.5.0 htmltools_0.4.0 [16] rsconnect_0.8.16 fansi_0.4.1 checkmate_2.0.0 [19] magrittr_1.5 BBmisc_1.11 unbalanced_2.0 [22] doParallel_1.0.15 globals_0.12.5 gower_0.2.1 [25] RcppParallel_5.0.1 matrixStats_0.56.0 xts_0.120 [28] hardhat_0.1.3 prettyunits_1.1.1 colorspace_1.41 [31] xfun_0.14 callr_3.4.3 crayon_1.3.4 [34] lme4_1.123 survival_3.112 zoo_1.88 [37] iterators_1.0.12 glue_1.4.1 gtable_0.3.0 [40] ipred_0.99 pkgbuild_1.0.8 rstan_2.19.3 [43] miniUI_0.1.1.1 Rcpp_1.0.4.6 xtable_1.84 [46] GPfit_1.08 stats4_4.0.1 lava_1.6.7 [49] StanHeaders_2.21.05 prodlim_2019.11.13 DT_0.13 [52] htmlwidgets_1.5.1 threejs_0.3.3 FNN_1.1.3 [55] ellipsis_0.3.1 farver_2.0.3 ParamHelpers_1.14 [58] pkgconfig_2.0.3 loo_2.2.0 nnet_7.314 [61] utf8_1.1.4 labeling_0.3 tidyselect_1.1.0 [64] rlang_0.4.6 DiceDesign_1.81 reshape2_1.4.4 [67] later_1.1.0.1 munsell_0.5.0 tools_4.0.1 [70] cli_2.0.2 generics_0.0.2 ggridges_0.5.2 [73] evaluate_0.14 stringr_1.4.0 fastmap_1.0.1 [76] yaml_2.2.1 processx_3.4.2 knitr_1.28 [79] RANN_2.6.1 future_1.17.0 nlme_3.1148 [82] mime_0.9 rstanarm_2.19.3 rstudioapi_0.11 [85] tokenizers_0.2.1 compiler_4.0.1 bayesplot_1.7.2 [88] shinythemes_1.1.2 curl_4.3 tidyposterior_0.0.3 [91] lhs_1.0.2 statmod_1.4.34 stringi_1.4.6 [94] ps_1.3.3 blogdown_0.19 lattice_0.2041 [97] Matrix_1.218 nloptr_1.2.2.1 markdown_1.1 [100] shinyjs_1.1 vctrs_0.3.1 pillar_1.4.4 [103] lifecycle_0.2.0 furrr_0.1.0 data.table_1.12.8 [106] httpuv_1.5.4 R6_2.4.1 bookdown_0.19 [109] promises_1.1.1 gridExtra_2.3 janeaustenr_0.1.5 [112] codetools_0.216 boot_1.325 colourpicker_1.0 [115] MASS_7.351.6 gtools_3.8.2 assertthat_0.2.1 [118] ROSE_0.03 withr_2.2.0 shinystan_2.5.0 [121] parallel_4.0.1 hms_0.5.3 grid_4.0.1 [124] rpart_4.115 timeDate_3043.102 tidyr_1.1.0 [127] class_7.317 minqa_1.2.4 misc3d_0.84 [130] rmarkdown_2.2 parallelMap_1.5.0 pROC_1.16.2 [133] tidypredict_0.4.5 shiny_1.4.0.2 lubridate_1.7.9 [136] base64enc_0.13 dygraphs_1.1.1.6
Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job.
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.