What is the support vector machine (SVM) algorithm?
you would like to predict whether your boss will be in a good mood or
not (a very important machine learning application). Over a couple of
weeks, you record the number of hours you spend playing games at your
desk and how much money you make the company each day. You also
record your boss’ mood the next day as good or bad (they’re very
binary). You decide to use the SVM algorithm to build a classifier
that will help you decide whether you need to avoid your boss on a
particular day! The SVM algorithm will learn a linear hyperplane that
separates the days your boss is in a good mood form the days they are
in a bad mood. The SVM algorithm is also able add an extra dimension
to the data to find the best hyperplane.
for linearly-separable data
a look at the data shown in figure 1.
The plots show the data you recorded on the mood of your boss, based
on how hard you’re working and how much money you’re making the
SVM algorithm finds an optimal linear hyperplane that separates the
classes. A hyperplane is a surface that exists in as many dimensions
as there are variables in a dataset. For a two-dimensional feature
space, such as in the example in figure 1,
a hyperplane is simply a straight line. For a three-dimensional
feature space, a hyperplane is a surface. It’s hard to picture
hyperplanes with four or more dimensions, but the principle is the
same: they are surfaces that cut through the feature space.
problems where the classes are fully separable, there may be many
different hyperplanes which do just as good a job at separating the
classes in the training data. To find an optimal hyperplane (which
will, hopefully, generalize better to unseen data), the algorithm
finds the hyperplane which maximizes the margin around itself.
The margin is a distance around the hyperplane which touches the
fewest training cases. The cases in the data that touch the margin
are called support vectors because they support the position
of the hyperplane (hence the name of the algorithm).
support vectors are the most important cases in the training set
because they define the boundary between the classes. Not only this,
the hyperplane that the algorithm learns is entirely dependent on the
position of the support vectors, and none of the other cases in the
training set. Take a look at figure 2.
If we move the position of one of the support vectors, then we move
the position of the hyperplane. If, however, we were to move a
non-support vector case, there is no influence on the hyperplane at
are extremely popular right now. That’s mainly for three reasons:
are good at finding ways of separating non-linearly separable classes
tend to perform well for a wide variety of tasks
now have the computational power to apply them to larger, more
last point is important because it highlights a potential downside of
SVMs: they tend to be more computationally expensive to train than
many other classification algorithms. For this reason, if you have a
very large dataset and computational power is limited, it may be
economical for you to try cheaper algorithms first and see how they
Usually, we favor predictive performance over speed. But a
computationally cheap algorithm that performs well enough for your
problem may be preferable to you than one which is very expensive.
Therefore, I usually try cheaper algorithms before trying the
for non-linearly separable data
So far the SVM algorithm seems quite simple, and for linearly
separable classes like in our boss mood example, it is. But I
mentioned that one of the strengths of the SVM algorithm is that it
can learn decision boundaries between classes that are not
linearly separable. As I’ve told you that the algorithm learns
linear hyperplanes, this seems like a contradiction. Well here’s
what makes the SVM algorithm so powerful: it can add an extra
dimension to your data, to find a linear way to separate non-linear
a look a look at the example in figure 3.
The classes are not linearly separable using the two predictor
variables. The SVM algorithm adds an extra dimension to the data,
such that a linear hyperplane can separate the classes in this new,
higher dimensional space. We can visualize this as a sort of
deformation or stretching of the feature space. The extra dimension
is called a kernel.
Why is it called a kernel?
The word kernel may confuse you (it certainly confuses me). It has nothing to do with the kernels in computing (the bit of your operating system that directly interfaces with the computer hardware), or kernels in corn or fruit.
The truth is that reason they are called kernels is a little murky. In 1904, a German mathematician called David Hilbert published Grundzüge einer allgemeinen theorie der linearen integralgleichungen (Principles of a general theory of linear integral equations). In this book, Hilbert uses the word kern to mean the core of an integral equation. In 1909, an American mathematician called Maxime Bôcher published An introduction to the study of integral equations in which he translates Hilberts use of the word kern, to kernel.
The mathematics of kernel functions evolved from the work in these publications, and took the name kernel with them. The extremely confusing thing is that multiple, seemingly unrelated concepts in mathematics have the word kernel in them!
how does the algorithm find this new kernel? It uses a mathematical
transformation of the data called a kernel function. There are
many kernel functions to choose from, each of which applies a
different transformation to the data and is suitable for finding
linear decision boundaries for different situations. Figure 4
shows examples of situations where some common kernel functions can
separate non-linearly separable data. These include:
kernel (equivalent to no kernel)
radial basis kernel
the type of kernel function for a given problem isn’t learned from
the data, we have to specify it. Because of this, the choice of
kernel function is a categorical hyperparameter (a
hyperparameter which takes discrete, not continuous values).
Therefore, the best approach for choosing the best performing kernel
is with hyperparameter tuning.
of the SVM algorithm
is where SVMs become fun/difficult/painful depending on your problem,
computational budget and sense of humor. We need to tune quite a lot
of hyperparameters when building an SVM. This, coupled with the fact
that training a single model can be moderately expensive, can make
training an optimally performing SVM take quite a long time. You’ll
see this in the worked example later in the article.
the SVM algorithm has quite a few hyperparameters to tune, but the
most important ones to consider are:
kernel (shown in figure 4)
degree hyperparameter, which controls how “bendy”
the decision boundary will be for the polynomial kernel (shown in
cost or C hyperparameter, which controls how “hard”
or “soft” the margin is (shown in figure 5)
gamma hyperparameter, which controls how much influence
individual cases have on the position of the decision boundary
(shown in figure 5)
effect of the kernel function, and degree hyperparameters are shown
in figure 4. Note the difference in the
shape of the decision boundary between the 2nd and 3rd
The higher the degree of polynomial, the more bendy and complex a
decision boundary can be learned, but this has the potential to
over-fit the training set.
the illustrations I’ve shown you so far, the classes have been
is so I could clearly show you how the positioning of the hyperplane
is chosen to maximize the margin. But what about situations where the
classes are not completely separable? How can the algorithm find a
hyperplane when there is no margin that won’t have cases inside
it? This is where the cost (also called C) hyperparameter comes in.
cost hyperparameter assigns a cost or penalty to having cases inside
the margin or, put another way, tells the algorithm how bad it is to
have cases inside the margin. A low cost tells the algorithm that
it’s acceptable to have more cases inside the margin and will
result in wider margins less influenced by local differences near the
class boundary. A high cost imposes a harsher penalty on having cases
inside the boundary and will result in narrower margins more
influenced by local differences near the class boundary. The effect
of cost is illustrated for a linear kernel in the top part of figure
IMPORTANT: Cases inside the margin are also support vectors, as moving them would change the position of the hyperplane.
gamma hyperparameter controls the influence each case has on the
position of the hyperplane, and is used by all the kernel functions
except the linear kernel. Think of each case in the training set
jumping up and down shouting “me! me! classify me correctly!”
The larger gamma is, the more attention-seeking each case is, and the
more granular the decision boundary will be (potentially leading to
overfitting). The smaller gamma is, the less attention-seeking each
case will be, and the less granular the decision boundary will be
(potentially leading to underfitting). The effect of gamma is
illustrated for a Gaussian radial basis kernel in the bottom part of
the SVM algorithm has multiple hyperparameters to tune! I’ll show
you how we can tune these simultaneously using mlr in a little bit.
if I have more than two classes?
far, I’ve only shown you examples of two-class classification
problems. This is because the SVM algorithm is inherently geared
towards separating two classes. But can we use it for multiclass
problems (where we’re trying to predict more than two classes)?
Absolutely! When there are more than two classes, instead of creating
a single SVM, we make multiple models, and let them fight it out to
predict the most likely class for new data. There are two ways of
the one versus all (also called one versus rest) approach, we create
as many SVM models as there are classes. Each SVM model describes a
hyperplane that best separates one class from all the other
classes. Hence the name, one versus all. When we classify new,
unseen cases, the models play a game of winner takes all. Put
simply, the model that puts the new case on the “correct”
side of its hyperplane (the side with the class it separates from all
the others) wins. The case is then assigned to the class that model
was trying to separate from the others. This is illustrated in the
left-side plot of figure 6.
the one versus one approach, we create an SVM model for every pair of
classes. Each SVM model describes a hyperplane that best separates
one class from one other class, ignoring data from the other
classes. Hence the name, one versus one. When we classify new, unseen
cases, the models each cast a vote. For example, if one model
separates classes A and B, and the new data falls on the B side of
the decision boundary, that model will vote for B. This continues for
all the models, and the majority class vote wins. This is illustrated
in the right-side plot of figure 6.
do we choose? Well in practice, there is usually little difference in
the performance of the two methods. Despite training more models (for
more than three classes), the one versus one is sometimes less
computationally expensive that one versus all. This is because,
although we’re training more models, the training sets are smaller
(because of the ignored cases). The implementation of the SVM
algorithm called by mlr uses the one versus one approach.
is, however, a problem with these approaches. There will often be
regions of the feature space in which none of the models give a clear
winning class. Can you see the triangular space between the
hyperplanes in the left-side plot of 6?
If a new case appeared inside this triangle, none of the three models
would clearly win outright. This is a sort of classification no-man’s
land. Though not as obvious in figure 6,
this occurs with the one versus one approach also.
there is no outright winner when predicting a new case, a technique
called Platt Scaling is used (named after computer scientist
John Platt). Platt scaling takes the distances of the cases from each
hyperplane, and converts them into probabilities using the logistic
function. The logistic function maps a continuous variable to
probabilities that range between 0 and 1. Using Platt scaling to make
predictions proceeds like this:
- for every hyperplane (whether we use one versus all or one versus one):
- measure the distance of each case from the hyperplane
- use the logistic function to convert these distances into probabilities
- classify new data as belonging to the class of the hyperplane which has the highest probability
this seems confusing, take a look at figure 7.
We’re using the one versus all approach in the figure, and have
generated three separate hyperplanes (one to separate each class from
the rest). The dashed arrows in the figure indicate distance in
either direction, away from the hyperplanes. Platt scaling takes
these distances, and converts them into probabilities using the
logistic function (the class each hyperplane separates from the rest
has positive distance).
we classify new, unseen data, the distance of the new data is
converted into a probability using each of the three “s”-shaped
curves, and the one that gives the highest probability is what the
case is classified as. Handily, all of this is taken care of for us
in the implementation of SVM called by mlr. If we supply a
three-class classification task, we will get a one versus one SVM
model with Platt scaling, without having to change our code.
Building our first SVM model
this section I’ll teach you how to build an SVM model and tune
multiple hyperparameters simultaneously. Imagine you’re sick and
tired of receiving so many spam emails (maybe you don’t need to
imagine!). It’s difficult for you to be productive because you get
so many emails requesting your bank details for a mysterious Ugandan
inheritance, and trying to sell you Viagra.
decide to perform a feature extraction on the emails you receive over
a few months, which you manually class as spam or not spam. These
features include things like the number of exclamation marks and the
frequency of certain words. With this data you decide to make an SVM
that will classify new emails as spam or not spam, that you can use
as a spam filter.
learn how to train an SVM model and tune multiple hyperparameters
simultaneously. We’ll start by loading the mlr and tidyverse
install.packages("mlr", dependencies = TRUE) install.packages("tidyverse") library(mlr) library(tidyverse)
and exploring the spam dataset
define our task and learner. In the mlr package the task consists of
the data, and what we want do to with it. In this case, the data is
`spamTib` and we want to classify the data with the `type` variable
as the target variable. The learner is simply the name of the
algorithm we plan to use, along with any additional arguments the
algorithm accepts. In this example, we’re using the SVM algorithm for
classification, so our learner is `”classif.svm”`. Now,
let’s load the data, which is built into the kernlab package,
convert it into a tibble (with as_tibble())
and explore it a little.
The kernlab package should have been installed along with mlr as
a suggested package. If you get an error when trying to load the
data, you may need to install it with install.packages(“kernlab”).
have a tibble containing 4601 emails and 58 variables extracted from
emails. Our goal is to train a model which can use the information in
these variables to predict whether a new email is spam or not.
Except the factor type,
which denotes whether an email is spam or not, all of the variables
are continuous as the SVM algorithm cannot handle categorical
1. Loading and exploring the spam dataset
data(spam, package = "kernlab") spamTib <- as_tibble(spam) spamTib # A tibble: 4,601 x 58 make address all num3d our over remove internet order mail
We have a lot of features in this dataset! I’m not going to
discuss the meaning of each one, but you’ll find a description of
what they mean by running ?kernlab::spam.
define our task and learner. This time, we supply “classif.svm”
as the argument to makeLearner()
to specify we’re going to use SVM.
2 Creating the task and learnerspamTask <- makeClassifTask(data = spamTib, target = "type") svm <- makeLearner("classif.svm")
before we train our model, we need to tune our hyperparameters. To
find out which hyperparameters are available for tuning for an
algorithm, we simply pass the name of the algorithm in quotes to
example, code listing 3 shows how to
print the hyperparameters for the SVM algorithm. I’ve removed some
rows and columns of the output to make it fit, but the most important
columns are there:
row name is the name of the hyperparameter
is whether the hyperparameter takes numeric, integer, discrete or
is the default value (the value that will be used if you don’t
defines the constraints for the hyperparameter, either a set of
specific values or a range of acceptable values
defines whether the hyperparameter is required by the learner
is logical and defines whether that hyperparameter can be tuned
(some algorithms have options that cannot be tuned but can be set by
3 Printing available SVM hyperparametersgetParamSet("classif.svm") Type Def Constr Req Tunable cost numeric 1 0 to Inf Y TRUE kernel discrete radial [lin,poly,rad,sig] - TRUE degree integer 3 1 to Inf Y TRUE gamma numeric - 0 to Inf Y TRUE scale logicalvector TRUE - - TRUE
TIP: The SVM algorithm is sensitive to variables being on different scales, and so it’s usually a good idea to scale the predictors first. Notice the scale hyperparameter in code listing 3? This tells us the algorithm will scale the data for us by default.
the possible values for a hyperparameter
the getParamSet() function
is useful, I don’t find it particularly simple to extract
information from. If you call str(getParamSet(“classif.svm”)),
you’ll see that it has a reasonably complex structure.
To extract information about a particular hyperparameter, you
need to call getParamSet(“classif.svm”)$pars$[HYPERPAR]
(where [HYPERPAR] is replaced by the hyperparameter you’re
interested in). To extract the possible values for that
hyperparameter, you append $values
to the call. For example, to extract the possible kernel functions:getParamSet("classif.svm")$pars$kernel$values $linear  "linear" $polynomial  "polynomial" $radial  "radial" $sigmoid  "sigmoid"
most important hyperparameters for us to tune are:
Let’s define the hyperparameters we want to tune in code listing 4. We’re going to start by defining a vector of kernel functions we wish to tune.
TIP: Notice I omit the linear kernel? This is because the linear kernel is the same as the polynomial kernel with degree = 1, so we’ll just make sure we include 1 as a possible value for the degree hyperparameter. Including the linear kernel and the 1st degree polynomial kernel is simply a waste of computing time.
we use the makeParamSet()
function to define the hyperparameter space we wish to tune over. To
function, we supply the information needed to define each
hyperparameter we wish to tune, separated by columns. Let’s break
this down line by line:
kernel hyperparameter takes discrete values (the name of the kernel
function), so we use the makeDiscreteParam()
function to define its values as the vector of kernels we created
degree hyperparameter takes integer values (whole numbers), so we
use the makeIntegerParam()
function and define its lower and upper values we wish to tune over
cost and gamma hyperparameters take numeric values (any number
between zero and infinity), so we use the makeNumericParam()
function to define their lower and upper values we wish to tune over
each of these functions, the first argument is the name of the
hyperparameter given by getParamSet(“classif.svm”),
4 Defining the hyperparameter space for tuningkernels <- c("polynomial", "radial", "sigmoid") svmParamSpace <- makeParamSet( makeDiscreteParam("kernel", values = kernels), makeIntegerParam("degree", lower = 1, upper = 3), makeNumericParam("cost", lower = 0.1, upper = 10), makeNumericParam("gamma", lower = 0.1, 10))
searching for the best-performing combination of hyperparameters, one
approach is to train models using every combination of
hyperparameters, exhaustively. We then evaluate the performance of
each of these models and choose the best-performing one. This is
called a _grid search_. We used the grid search procedure to try
every value of k that we defined, during tuning. This is what
the grid search method does: tries every combination of the
hyperparameter space you define, and finds the best performing
search is great because, provided you specify a sensible
hyperparameter space to search over, it will always find the best
performing hyperparameters. But look at the hyperparameter space we
defined for our SVM. Let’s say we wanted to try the values for the
cost and gamma hyperparameters in steps of 0.1, that’s 100 values
of each. We have three kernel functions we’re trying, and four
values of the degree hyperparameter. To perform a grid search over
this parameter space would require training a model 120,000 times!
If, in such a situation, you have the time, patience, and
computational budget for such a grid search, then good for you. I,
for one, have better things I could be doing with my computer!
we can employ a technique called random search. Rather than
trying every possible combination of parameters, random search
proceeds as follows:
select a combination of hyperparameter values
cross-validation to train and evaluate a model using those
the performance metric of the model (usually mean misclassification
error for classification tasks)
(iterate) steps 1 to 3 as many times as your computational budget
the combination of hyperparameter values that gave you the best
grid search, random search isn’t guaranteed to find the best set of
hyperparameter values. However, with enough iterations, it can
usually find a good combination that performs well. By using random
search, we can run 500 combinations of hyperparameter values, instead
of all 120,000 combinations.
define our random search using the makeTuneControlRandom()
function. We tell the function how many iterations of the random
search procedure we want to use, with the maxit
argument. You should try to set this as high as your computational
budget allows, but in this example we’ll stick to 20 to prevent the
example from taking too long. Next, I describe our cross-validation
procedure. I usually prefer k-fold cross-validation to hold-out
cross-validation (as the former gives more stable estimates of model
performance), unless the training procedure is computationally
expensive. Well, this is computationally expensive, so I’m
compromising by using hold-out cross-validation instead.
5. Defining the random searchrandSearch <- makeTuneControlRandom(maxit = 20) cvForTuning <- makeResampleDesc("Holdout", split = 2/3)
TIP: There’s something else we can do to speed up this process. R, as a language, doesn’t make that much use of multi-threading (using multiple CPUs simultaneously to accomplish a task). However, one of the benefits of the mlr package is that it allows multi-threading to be used with its functions. This helps you use multiple cores/CPUs on your computer to accomplish tasks such as hyperparameter tuning and cross-validation, much more quickly. If you don’t know how many cores your computer has, you can find out in R by running parallel::detectCores(). If your computer only has one core, the 90s called and they want their computer back.
run an mlr process in parallel, we place its code between the
from the parallelMap package. To start our hyperparameter tuning
process, we call the tuneParams()
function and supply as arguments:
- the first argument is the name of the learner
- task = the name of our task
- resampling = our cross validation procedure (defined in code listing 5)
- par.set = our hyperparameter space (defined in code listing 4)
- control = the search procedure (random search, defined in code listing 5)
This code, between the parallelStartSocket() and parallelStop() functions is shown in code listing 6.
The computer I’m writing this on has four cores and the code
below takes nearly a minute to run on it. It is of the utmost
importance that you go make a cup of tea while it runs. Milk and no
6. Performing hyperparameter tuninglibrary(parallelMap) library(paralell) parallelStartSocket(cpus = detectCores()) tunedSvmPars <- tuneParams("classif.svm", task = spamTask, resampling = cvForTuning, par.set = svmParamSpace, control = randSearch) parallelStop()
The degree hyperparameter only applies to the polynomial kernel
function, and the gamma hyperparameter doesn’t apply to the linear
kernel. Does this create errors when the random search selects
combinations that don’t make sense? Nope. If the random search
selects the sigmoid kernel, for example, it simply ignores the value
of the degree hyperparameter.
back after our interlude! You can print the best-performing
hyperparameter values and the performance of the model built with
them by calling tunedSvm,
or extract just the named values themselves (so you can train a new
model using them) by calling tunedSvm$x.
Looking at code listing 7, we can see
that the 1st degree polynomial kernel function (equivalent to the
linear kernel function), with a cost of 5.8 and gamma of 1.56, gave
the best performing model.
7 Extracting the winning hyperparameter values from tuningtunedSvmPars Tune result: Op. pars: kernel=polynomial; degree=1; cost=5.82; gamma=1.56 mmce.test.mean=0.0645372 tunedSvmPars$x $kernel  "polynomial" $degree  1 $cost  5.816232 $gamma  1.561584
Are your values different to mine? They probably are! This is the
nature of the random search. It may find different winning
combinations of hyperparameter values each time it’s run. To reduce
this variance, we should commit to increasing the number of
iterations it makes.
the model with the tuned hyperparameters
we’ve tuned our hyperparameters, let’s build our model using the
best performing combination. We use the setHyperPars()
function to combine a learner with a set of pre-defined
hyperparameter values. The first argument is the learner we want to
use, and the par.vals
argument is the object containing our tuned hyperparameter values. We
then train a model using our tunedSvm
learner with the train()
8 Training the model with tuned hyperparameterstunedSvm <- setHyperPars(makeLearner("classif.svm"), par.vals = tunedSvmPars$x) tunedSvmModel <- train(tunedSvm, spamTask)
As we already defined our learner in code listing 2,
we could simply have run setHyperPars(svm,
par.vals = tunedSvmPars$x) to achieve the same result.
Get more blogs at http://www.r-bloggers.com !