[This article was first published on Machine learning with R, tidyverse, and mlr, and kindly contributed to R-bloggers].

## What is the support vector machine (SVM) algorithm?

Imagine you would like to predict whether your boss will be in a good mood or not (a very important machine learning application). Over a couple of weeks, you record the number of hours you spend playing games at your desk and how much money you make the company each day. You also record your boss’ mood the next day as good or bad (they’re very binary). You decide to use the SVM algorithm to build a classifier that will help you decide whether you need to avoid your boss on a particular day! The SVM algorithm will learn a linear hyperplane that separates the days your boss is in a good mood from the days they are in a bad mood. The SVM algorithm is also able to add an extra dimension to the data to find the best hyperplane.

## SVMs for linearly-separable data

Take a look at the data shown in figure 1. The plots show the data you recorded on the mood of your boss, based on how hard you’re working and how much money you’re making the company.

The SVM algorithm finds an optimal linear hyperplane that separates the classes. A hyperplane is a surface that exists in as many dimensions as there are variables in a dataset. For a two-dimensional feature space, such as in the example in figure 1, a hyperplane is simply a straight line. For a three-dimensional feature space, a hyperplane is a surface. It’s hard to picture hyperplanes with four or more dimensions, but the principle is the same: they are surfaces that cut through the feature space.

For problems where the classes are fully separable, there may be many different hyperplanes which do just as good a job at separating the classes in the training data. To find an optimal hyperplane (which will, hopefully, generalize better to unseen data), the algorithm finds the hyperplane which maximizes the margin around itself. The margin is a distance around the hyperplane which touches the fewest training cases. The cases in the data that touch the margin are called support vectors because they support the position of the hyperplane (hence the name of the algorithm).

The support vectors are the most important cases in the training set because they define the boundary between the classes. Not only this, the hyperplane that the algorithm learns is entirely dependent on the position of the support vectors, and none of the other cases in the training set. Take a look at figure 2. If we move the position of one of the support vectors, then we move the position of the hyperplane. If, however, we were to move a non-support vector case, there is no influence on the hyperplane at all!
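To make the "only the support vectors matter" point concrete, here's a minimal sketch using the e1071 package (the SVM implementation that mlr calls under the hood). The boss-mood numbers here are simulated purely for illustration:

```r
library(e1071)

# Simulated boss-mood data (hypothetical): good-mood days have less gaming
# and more money made, bad-mood days the opposite
set.seed(42)
goodDays <- data.frame(gaming = rnorm(20, 1), money = rnorm(20, 3), mood = "good")
badDays  <- data.frame(gaming = rnorm(20, 3), money = rnorm(20, 1), mood = "bad")
bossData <- rbind(goodDays, badDays)
bossData$mood <- factor(bossData$mood)

# Fit a linear SVM; the learned hyperplane depends only on the support vectors
mod <- svm(mood ~ ., data = bossData, kernel = "linear")
mod$index  # row numbers of the support vectors
```

Moving any row of `bossData` not listed in `mod$index` (without crossing the margin) and refitting would leave the hyperplane unchanged.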

SVMs are extremely popular right now. That’s mainly for three reasons:

• They are good at finding ways of separating non-linearly separable classes
• They tend to perform well for a wide variety of tasks
• We now have the computational power to apply them to larger, more complex datasets

This last point is important because it highlights a potential downside of SVMs: they tend to be more computationally expensive to train than many other classification algorithms. For this reason, if you have a very large dataset and computational power is limited, it may be economical for you to try cheaper algorithms first and see how they perform.

TIP: Usually, we favor predictive performance over speed. But a computationally cheap algorithm that performs well enough for your problem may be preferable to one that is very expensive. Therefore, I usually try cheaper algorithms before trying the expensive ones.

## SVMs for non-linearly separable data

Great! So far the SVM algorithm seems quite simple, and for linearly separable classes like in our boss mood example, it is. But I mentioned that one of the strengths of the SVM algorithm is that it can learn decision boundaries between classes that are not linearly separable. As I’ve told you that the algorithm learns linear hyperplanes, this seems like a contradiction. Well here’s what makes the SVM algorithm so powerful: it can add an extra dimension to your data, to find a linear way to separate non-linear data.

Take a look a look at the example in figure 3. The classes are not linearly separable using the two predictor variables. The SVM algorithm adds an extra dimension to the data, such that a linear hyperplane can separate the classes in this new, higher dimensional space. We can visualize this as a sort of deformation or stretching of the feature space. The extra dimension is called a kernel.

Why is it called a kernel?

The word kernel may confuse you (it certainly confuses me). It has nothing to do with the kernels in computing (the bit of your operating system that directly interfaces with the computer hardware), or kernels in corn or fruit.

The truth is that the reason they are called kernels is a little murky. In 1904, a German mathematician called David Hilbert published Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen (Principles of a general theory of linear integral equations). In this book, Hilbert uses the word kern to mean the core of an integral equation. In 1909, an American mathematician called Maxime Bôcher published An introduction to the study of integral equations, in which he translates Hilbert’s use of the word kern to kernel.

The mathematics of kernel functions evolved from the work in these publications, and took the name kernel with them. The extremely confusing thing is that multiple, seemingly unrelated concepts in mathematics have the word kernel in them!

So how does the algorithm find this new kernel? It uses a mathematical transformation of the data called a kernel function. There are many kernel functions to choose from, each of which applies a different transformation to the data and is suitable for finding linear decision boundaries for different situations. Figure 4 shows examples of situations where some common kernel functions can separate non-linearly separable data. These include:

• linear kernel (equivalent to no kernel)
• polynomial kernel
• sigmoid kernel

Now, the type of kernel function for a given problem isn’t learned from the data; we have to specify it. Because of this, the choice of kernel function is a categorical hyperparameter (a hyperparameter which takes discrete, not continuous, values). Therefore, the best approach to choosing the best-performing kernel is hyperparameter tuning.

## Hyperparameters of the SVM algorithm

This is where SVMs become fun/difficult/painful depending on your problem, computational budget and sense of humor. We need to tune quite a lot of hyperparameters when building an SVM. This, coupled with the fact that training a single model can be moderately expensive, can make training an optimally performing SVM take quite a long time. You’ll see this in the worked example later in the article.

So the SVM algorithm has quite a few hyperparameters to tune, but the most important ones to consider are:

• the kernel (shown in figure 4)
• the degree hyperparameter, which controls how “bendy” the decision boundary will be for the polynomial kernel (shown in figure 4)
• the cost or C hyperparameter, which controls how “hard” or “soft” the margin is (shown in figure 5)
• the gamma hyperparameter, which controls how much influence individual cases have on the position of the decision boundary (shown in figure 5)

The effects of the kernel function and degree hyperparameters are shown in figure 4. Note the difference in the shape of the decision boundary between the 2nd and 3rd degree polynomials.

NOTE: The higher the degree of polynomial, the more bendy and complex a decision boundary can be learned, but this has the potential to over-fit the training set.

In the illustrations I’ve shown you so far, the classes have been fully separable.

This is so I could clearly show you how the positioning of the hyperplane is chosen to maximize the margin. But what about situations where the classes are not completely separable? How can the algorithm find a hyperplane when there is no margin that won’t have cases inside it? This is where the cost (also called C) hyperparameter comes in.

The cost hyperparameter assigns a cost or penalty to having cases inside the margin or, put another way, tells the algorithm how bad it is to have cases inside the margin. A low cost tells the algorithm that it’s acceptable to have more cases inside the margin and will result in wider margins less influenced by local differences near the class boundary. A high cost imposes a harsher penalty on having cases inside the margin and will result in narrower margins more influenced by local differences near the class boundary. The effect of cost is illustrated for a linear kernel in the top part of figure 5.

IMPORTANT: Cases inside the margin are also support vectors, as moving them would change the position of the hyperplane.

The gamma hyperparameter controls the influence each case has on the position of the hyperplane, and is used by all the kernel functions except the linear kernel. Think of each case in the training set jumping up and down shouting “me! me! classify me correctly!” The larger gamma is, the more attention-seeking each case is, and the more granular the decision boundary will be (potentially leading to overfitting). The smaller gamma is, the less attention-seeking each case will be, and the less granular the decision boundary will be (potentially leading to underfitting). The effect of gamma is illustrated for a Gaussian radial basis kernel in the bottom part of figure 5.
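As a rough sketch of how these two hyperparameters are set in practice, here are two radial-kernel fits with contrasting cost and gamma values. This uses e1071 directly and the built-in iris data cut down to two classes (both are stand-ins chosen for illustration, not the boss-mood data):

```r
library(e1071)

# Two-class subset of the built-in iris data
twoClass <- droplevels(subset(iris, Species != "virginica"))

# Low cost and low gamma: soft margin, smooth decision boundary
softModel <- svm(Species ~ ., data = twoClass, kernel = "radial",
                 cost = 0.1, gamma = 0.1)

# High cost and high gamma: hard margin, granular decision boundary
hardModel <- svm(Species ~ ., data = twoClass, kernel = "radial",
                 cost = 100, gamma = 10)

# Compare how many cases end up as support vectors under each setting
c(soft = softModel$tot.nSV, hard = hardModel$tot.nSV)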

So the SVM algorithm has multiple hyperparameters to tune! I’ll show you how we can tune these simultaneously using mlr in a little bit.

## What if I have more than two classes?

So far, I’ve only shown you examples of two-class classification problems. This is because the SVM algorithm is inherently geared towards separating two classes. But can we use it for multiclass problems (where we’re trying to predict more than two classes)? Absolutely! When there are more than two classes, instead of creating a single SVM, we make multiple models, and let them fight it out to predict the most likely class for new data. There are two ways of doing this:

• one versus all
• one versus one

In the one versus all (also called one versus rest) approach, we create as many SVM models as there are classes. Each SVM model describes a hyperplane that best separates one class from all the other classes. Hence the name, one versus all. When we classify new, unseen cases, the models play a game of winner takes all. Put simply, the model that puts the new case on the “correct” side of its hyperplane (the side with the class it separates from all the others) wins. The case is then assigned to the class that model was trying to separate from the others. This is illustrated in the left-side plot of figure 6.

In the one versus one approach, we create an SVM model for every pair of classes. Each SVM model describes a hyperplane that best separates one class from one other class, ignoring data from the other classes. Hence the name, one versus one. When we classify new, unseen cases, the models each cast a vote. For example, if one model separates classes A and B, and the new data falls on the B side of the decision boundary, that model will vote for B. This continues for all the models, and the majority class vote wins. This is illustrated in the right-side plot of figure 6.

Which do we choose? Well, in practice, there is usually little difference in the performance of the two methods. Despite training more models (when there are more than three classes), one versus one is sometimes less computationally expensive than one versus all. This is because, although we’re training more models, the training sets are smaller (because of the ignored cases). The implementation of the SVM algorithm called by mlr uses the one versus one approach.
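The model counts are easy to check: for k classes, one versus all trains k models, while one versus one trains one model per pair of classes. A quick sanity check in R, with a hypothetical four-class problem:

```r
k <- 4  # a hypothetical four-class problem

oneVsAll <- k              # one model per class
oneVsOne <- choose(k, 2)   # k * (k - 1) / 2 pairwise models

c(oneVsAll = oneVsAll, oneVsOne = oneVsOne)  # 4 versus 6
```

So one versus one only trains more models once there are more than three classes, which is why the comparison above is qualified that way.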

There is, however, a problem with these approaches. There will often be regions of the feature space in which none of the models give a clear winning class. Can you see the triangular space between the hyperplanes in the left-side plot of figure 6? If a new case appeared inside this triangle, none of the three models would clearly win outright. This is a sort of classification no-man’s land. Though not as obvious in figure 6, this occurs with the one versus one approach also.

If there is no outright winner when predicting a new case, a technique called Platt Scaling is used (named after computer scientist John Platt). Platt scaling takes the distances of the cases from each hyperplane, and converts them into probabilities using the logistic function. The logistic function maps a continuous variable to probabilities that range between 0 and 1. Using Platt scaling to make predictions proceeds like this:

1. For every hyperplane (whether we use one versus all or one versus one):
   1. measure the distance of each case from the hyperplane
   2. use the logistic function to convert these distances into probabilities
2. Classify new data as belonging to the class of the hyperplane which has the highest probability
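The steps above can be sketched in a few lines of R. The signed distances here are made-up numbers for one hypothetical new case measured against three one-versus-all hyperplanes (positive means the case lies on the side of the class that hyperplane separates out):

```r
# Hypothetical signed distances of one new case from three one-vs-all hyperplanes
distances <- c(classA = 1.8, classB = -0.4, classC = -2.1)

# The logistic function maps each distance to a probability between 0 and 1
probs <- 1 / (1 + exp(-distances))
probs

# The case is assigned to the class with the highest probability
names(which.max(probs))  # "classA"
```

The largest positive distance always wins under this mapping, because the logistic function is monotonically increasing.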

If this seems confusing, take a look at figure 7. We’re using the one versus all approach in the figure, and have generated three separate hyperplanes (one to separate each class from the rest). The dashed arrows in the figure indicate distance in either direction, away from the hyperplanes. Platt scaling takes these distances, and converts them into probabilities using the logistic function (the class each hyperplane separates from the rest has positive distance).

When we classify new, unseen data, the distance of the new data is converted into a probability using each of the three “s”-shaped curves, and the one that gives the highest probability is what the case is classified as. Handily, all of this is taken care of for us in the implementation of SVM called by mlr. If we supply a three-class classification task, we will get a one versus one SVM model with Platt scaling, without having to change our code.

## Building our first SVM model

In this section I’ll teach you how to build an SVM model and tune multiple hyperparameters simultaneously. Imagine you’re sick and tired of receiving so many spam emails (maybe you don’t need to imagine!). It’s difficult for you to be productive because you get so many emails requesting your bank details for a mysterious Ugandan inheritance, and trying to sell you Viagra.

You decide to perform a feature extraction on the emails you receive over a few months, which you manually class as spam or not spam. These features include things like the number of exclamation marks and the frequency of certain words. With this data you decide to make an SVM that will classify new emails as spam or not spam, that you can use as a spam filter.

Let’s learn how to train an SVM model and tune multiple hyperparameters simultaneously. We’ll start by loading the mlr and tidyverse packages.

install.packages("mlr", dependencies = TRUE)
install.packages("tidyverse")

library(mlr)
library(tidyverse)


Let’s define our task and learner. In the mlr package, the task consists of the data and what we want to do with it. In this case, the data is spamTib and we want to classify the data with the type variable as the target variable. The learner is simply the name of the algorithm we plan to use, along with any additional arguments the algorithm accepts. In this example, we’re using the SVM algorithm for classification, so our learner is "classif.svm". Now, let’s load the data, which is built into the kernlab package, convert it into a tibble (with as_tibble()) and explore it a little.

NOTE: The kernlab package should have been installed along with mlr as a suggested package. If you get an error when trying to load the data, you may need to install it with install.packages("kernlab").

We have a tibble containing 4,601 emails and 58 variables extracted from them. Our goal is to train a model which can use the information in these variables to predict whether a new email is spam or not.

NOTE: Except for the type factor, which denotes whether an email is spam or not, all of the variables are continuous, because the SVM algorithm cannot handle categorical predictors.

 data(spam, package = "kernlab")

spamTib <- as_tibble(spam)

spamTib

# A tibble: 4,601 x 58
make address   all num3d   our  over remove internet order  mail

TIP: We have a lot of features in this dataset! I’m not going to discuss the meaning of each one, but you’ll find a description of what they mean by running ?kernlab::spam.

## Tuning our hyperparameters

Let’s define our task and learner. This time, we supply "classif.svm" as the argument to makeLearner() to specify we’re going to use SVM.

Listing 2 Creating the task and learner

spamTask <- makeClassifTask(data = spamTib, target = "type")

svm <- makeLearner("classif.svm")

Now, before we train our model, we need to tune our hyperparameters. To find out which hyperparameters are available for tuning for an algorithm, we simply pass the name of the algorithm in quotes to getParamSet(). For example, code listing 3 shows how to print the hyperparameters for the SVM algorithm. I’ve removed some rows and columns of the output to make it fit, but the most important columns are there:

• the row name is the name of the hyperparameter
• Type is whether the hyperparameter takes numeric, integer, discrete or logical values
• Def is the default value (the value that will be used if you don’t tune it)
• Constr defines the constraints for the hyperparameter, either a set of specific values or a range of acceptable values
• Req defines whether the hyperparameter is required by the learner
• Tunable is logical and defines whether that hyperparameter can be tuned (some algorithms have options that cannot be tuned but can be set by the user)

Listing 3 Printing available SVM hyperparameters

getParamSet("classif.svm")

Type     Def              Constr  Req  Tunable
cost           numeric       1            0 to Inf    Y     TRUE
degree         integer       3            1 to Inf    Y     TRUE
gamma          numeric       -            0 to Inf    Y     TRUE
scale    logicalvector    TRUE                   -    -     TRUE

TIP: The SVM algorithm is sensitive to variables being on different scales, and so it’s usually a good idea to scale the predictors first. Notice the scale hyperparameter in code listing 3? This tells us the algorithm will scale the data for us by default.

## Extracting the possible values for a hyperparameter

While the getParamSet() function is useful, I don’t find it particularly simple to extract information from. If you call str(getParamSet("classif.svm")), you’ll see that it has a reasonably complex structure. To extract information about a particular hyperparameter, you need to call getParamSet("classif.svm")$pars$[HYPERPAR] (where [HYPERPAR] is replaced by the hyperparameter you’re interested in). To extract the possible values for that hyperparameter, you append $values to the call. For example, to extract the possible kernel functions:

getParamSet("classif.svm")$pars$kernel$values

$linear
[1] "linear"

$polynomial
[1] "polynomial"

$radial
[1] "radial"

$sigmoid
[1] "sigmoid"

The most important hyperparameters for us to tune are:

• the kernel
• the cost
• the degree
• gamma

Let’s define the hyperparameters we want to tune in code listing 4. We’re going to start by defining a vector of kernel functions we wish to tune.

TIP: Notice I omit the linear kernel? This is because the linear kernel is the same as the polynomial kernel with degree = 1, so we’ll just make sure we include 1 as a possible value for the degree hyperparameter. Including the linear kernel and the 1st degree polynomial kernel is simply a waste of computing time.

Next, we use the makeParamSet() function to define the hyperparameter space we wish to tune over. To the makeParamSet() function, we supply the information needed to define each hyperparameter we wish to tune, separated by commas. Let’s break this down line by line:

• the kernel hyperparameter takes discrete values (the name of the kernel function), so we use the makeDiscreteParam() function to define its values as the vector of kernels we created
• the degree hyperparameter takes integer values (whole numbers), so we use the makeIntegerParam() function and define the lower and upper values we wish to tune over
• the cost and gamma hyperparameters take numeric values (any number between zero and infinity), so we use the makeNumericParam() function to define the lower and upper values we wish to tune over

For each of these functions, the first argument is the name of the hyperparameter given by getParamSet("classif.svm"), in quotes.

Listing 4 Defining the hyperparameter space for tuning

kernels <- c("polynomial", "radial", "sigmoid")

svmParamSpace <- makeParamSet(
  makeDiscreteParam("kernel", values = kernels),
  makeIntegerParam("degree", lower = 1, upper = 3),
  makeNumericParam("cost", lower = 0.1, upper = 10),
  makeNumericParam("gamma", lower = 0.1, upper = 10))

When searching for the best-performing combination of hyperparameters, one approach is to train models using every combination of hyperparameters, exhaustively. We then evaluate the performance of each of these models and choose the best-performing one. This is called a _grid search_: it tries every combination of the hyperparameter space you define, and finds the best-performing combination. We used the grid search procedure to try every value of k that we defined during tuning.

Grid search is great because, provided you specify a sensible hyperparameter space to search over, it will always find the best-performing hyperparameters. But look at the hyperparameter space we defined for our SVM. Let’s say we wanted to try the values for the cost and gamma hyperparameters in steps of 0.1; that’s 100 values of each. We have three kernel functions we’re trying, and three values of the degree hyperparameter. To perform a grid search over this parameter space would require training a model 90,000 times! If, in such a situation, you have the time, patience, and computational budget for such a grid search, then good for you. I, for one, have better things I could be doing with my computer!
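Taking the ranges actually coded in listing 4 (and 0.1-sized steps for the two continuous hyperparameters), the grid size is just the product of the number of values per hyperparameter:

```r
# Exhaustive grid size for the space in listing 4
nKernels <- 3                           # polynomial, radial, sigmoid
nDegrees <- 3                           # degree 1 to 3
nCost    <- length(seq(0.1, 10, 0.1))   # 100 values in steps of 0.1
nGamma   <- length(seq(0.1, 10, 0.1))   # 100 values in steps of 0.1

nKernels * nDegrees * nCost * nGamma    # 90,000 models to train
```

And that is before multiplying by the number of cross-validation folds used to evaluate each combination.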

Instead, we can employ a technique called random search. Rather than trying every possible combination of parameters, random search proceeds as follows:

1. Randomly select a combination of hyperparameter values
2. Use cross-validation to train and evaluate a model using those hyperparameter values
3. Record the performance metric of the model (usually the mean misclassification error)
4. Repeat (iterate) steps 1 to 3 as many times as your computational budget allows
5. Select the combination of hyperparameter values that gave you the best-performing model

Unlike grid search, random search isn’t guaranteed to find the best set of hyperparameter values. However, with enough iterations, it can usually find a good combination that performs well. By using random search, we could run, say, 500 combinations of hyperparameter values, instead of all 90,000 combinations.
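The steps above can be sketched as a bare-bones loop. Here `evaluate()` is a stand-in for "cross-validate a model with these hyperparameter values and return its error"; I've used a simple made-up function so the sketch runs instantly:

```r
set.seed(123)

# Stand-in for cross-validated error; pretend the true optimum is
# cost = 5, gamma = 1 (purely illustrative)
evaluate <- function(cost, gamma) (cost - 5)^2 + (gamma - 1)^2

best <- list(cost = NA, gamma = NA, error = Inf)

for (i in 1:500) {
  cost  <- runif(1, min = 0.1, max = 10)   # step 1: sample a combination
  gamma <- runif(1, min = 0.1, max = 10)
  error <- evaluate(cost, gamma)           # steps 2-3: evaluate and record
  if (error < best$error) {                # step 5: keep the best so far
    best <- list(cost = cost, gamma = gamma, error = error)
  }
}

best  # the best combination found in 500 random draws
```

With 500 draws, the loop reliably lands close to the optimum without ever enumerating the full grid, which is exactly the trade-off mlr's random search makes for us.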

Let’s define our random search using the makeTuneControlRandom() function. We tell the function how many iterations of the random search procedure we want to use, with the maxit argument. You should try to set this as high as your computational budget allows, but in this example we’ll stick to 20 to prevent the example from taking too long. Next, I describe our cross-validation procedure. I usually prefer k-fold cross-validation to hold-out cross-validation (as the former gives more stable estimates of model performance), unless the training procedure is computationally expensive. Well, this is computationally expensive, so I’m compromising by using hold-out cross-validation instead.

Listing 5 Defining the random search

randSearch <- makeTuneControlRandom(maxit = 20)

cvForTuning <- makeResampleDesc("Holdout", split = 2/3)

TIP: There’s something else we can do to speed up this process. R, as a language, doesn’t make that much use of multi-threading (using multiple CPUs simultaneously to accomplish a task). However, one of the benefits of the mlr package is that it allows multi-threading to be used with its functions. This helps you use multiple cores/CPUs on your computer to accomplish tasks such as hyperparameter tuning and cross-validation, much more quickly.  If you don’t know how many cores your computer has, you can find out in R by running parallel::detectCores(). If your computer only has one core, the 90s called and they want their computer back.

To run an mlr process in parallel, we place its code between the parallelStartSocket() and parallelStop() functions from the parallelMap package. To start our hyperparameter tuning process, we call the tuneParams() function and supply as arguments:

• the first argument is the name of the learner
• task = the name of our task
• resampling = our cross-validation procedure (defined in code listing 5)
• par.set = our hyperparameter space (defined in code listing 4)
• control = the search procedure (random search, defined in code listing 5)

This code, between the parallelStartSocket() and parallelStop() functions, is shown in code listing 6.

WARNING: The computer I’m writing this on has four cores, and the code below takes nearly a minute to run on it. It is of the utmost importance that you go make a cup of tea while it runs. Milk and no sugar, please.

Listing 6 Performing hyperparameter tuning

library(parallelMap)
library(parallel)

parallelStartSocket(cpus = detectCores())

tunedSvmPars <- tuneParams("classif.svm", task = spamTask,
                           resampling = cvForTuning,
                           par.set = svmParamSpace,
                           control = randSearch)

parallelStop()

TIP: The degree hyperparameter only applies to the polynomial kernel function, and the gamma hyperparameter doesn’t apply to the linear kernel. Does this create errors when the random search selects combinations that don’t make sense? Nope. If the random search selects the sigmoid kernel, for example, it simply ignores the value of the degree hyperparameter.

Welcome back after our interlude! You can print the best-performing hyperparameter values and the performance of the model built with them by calling tunedSvmPars, or extract just the named values themselves (so you can train a new model using them) by calling tunedSvmPars$x. Looking at code listing 7, we can see that the 1st degree polynomial kernel function (equivalent to the linear kernel function), with a cost of 5.8 and gamma of 1.56, gave the best-performing model.

Listing 7 Extracting the winning hyperparameter values from tuning

tunedSvmPars

Tune result:
Op. pars: kernel=polynomial; degree=1; cost=5.82; gamma=1.56
mmce.test.mean=0.0645372

tunedSvmPars$x

$kernel
[1] "polynomial"

$degree
[1] 1

$cost
[1] 5.816232

$gamma
[1] 1.561584

IMPORTANT: Are your values different from mine? They probably are! This is the nature of random search: it may find different winning combinations of hyperparameter values each time it’s run. To reduce this variance, we can increase the number of iterations it makes.

## Training the model with the tuned hyperparameters

Now we’ve tuned our hyperparameters, let’s build our model using the best-performing combination. We use the setHyperPars() function to combine a learner with a set of pre-defined hyperparameter values. The first argument is the learner we want to use, and the par.vals argument is the object containing our tuned hyperparameter values. We then train a model using our tunedSvm learner with the train() function.

Listing 8 Training the model with tuned hyperparameters

tunedSvm <- setHyperPars(makeLearner("classif.svm"),
                         par.vals = tunedSvmPars$x)

tunedSvmModel <- train(tunedSvm, spamTask)

TIP: As we already defined our learner in code listing 2, we could simply have run setHyperPars(svm, par.vals = tunedSvmPars$x) to achieve the same result.

That’s it for this article! You can read more out on liveBook here and see this slide deck.


