The 5th Tribe, Support Vector Machines and caret

by Joseph Rickert

In his new book, The Master Algorithm, Pedro Domingos takes on the heroic task of explaining machine learning to a wide audience and classifies machine learning practitioners into 5 tribes*, each with its own fundamental approach to learning problems. To the 5th tribe, the analogizers, Pedro ascribes the Support Vector Machine (SVM) as its master algorithm. Although the SVM has been a competitive and popular algorithm since its discovery in the 1990s, this might be the breakout moment for SVMs into pop culture. (What algorithm has a cooler name?) More people than ever will want to give them a try and face the challenge of tuning them. Fortunately, there is plenty of help out there and some good tools for the task.

In A Practical Guide to Support Vector Classification, which is aimed at beginners, Hsu et al. suggest the following methodology:

  • Transform data to the format of an SVM package
  • Conduct simple scaling on the data
  • Consider the RBF kernel K(x, y) = exp(−γ||x − y||²) (see the short sketch after this list)
  • Use cross-validation to find the best parameters C and γ
  • Use the best parameters C and γ to train on the whole training set
  • Test
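
To make step 3 concrete, the RBF kernel value for a pair of feature vectors is easy to compute directly. Here is a minimal sketch in R; the vectors and the gamma value below are made up purely for illustration and are not part of the analysis that follows:

# RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)
# x, y, and gamma are illustrative values only.
rbf_kernel <- function(x, y, gamma) {
  exp(-gamma * sum((x - y)^2))
}

x <- c(1.0, 2.0, 3.0)
y <- c(1.5, 2.5, 2.0)
rbf_kernel(x, y, gamma = 0.015)  # near 1 for nearby points, near 0 for distant ones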

In this post, we present a variation of this methodology using R and the caret package. First, we set up for the analysis by loading the segmentation data set included with caret and using caret's createDataPartition() function to produce training and test data sets.

# Training SVM Models
library(caret)
library(dplyr)         # Used by caret
library(kernlab)       # support vector machine 
library(pROC)	       # plot the ROC curves
 
### Get the Data
# Load the data and construct indices to divide it into training and test data sets.
data(segmentationData)  	# Load the segmentation data set
trainIndex <- createDataPartition(segmentationData$Case,p=.5,list=FALSE)
trainData <- segmentationData[trainIndex,]
testData  <- segmentationData[-trainIndex,]
trainX <- trainData[,4:61]       # Pull out the variables for training
sapply(trainX,summary)           # Look at a summary of the training data

Next, we carry out a two-pass training and tuning process. In the first pass, shown in the code block below, we arbitrarily pick some tuning parameters and use the default caret settings for others. In the trainControl() function we specify 5 repetitions of 10-fold cross-validation. In the train() function, which actually does the work, we specify the radial kernel with the method parameter and ROC as the metric for assessing performance. The tuneLength parameter is set to pick 9 arbitrary values of C, the "cost" parameter of the radial kernel, which controls the complexity of the boundary between support vectors. The radial kernel also requires setting a smoothing parameter, sigma. In this first pass, we let train() use its default method of calculating an analytically derived estimate for sigma. Also note that we instruct train() to center and scale the data before running the analysis with the preProc parameter.

## SUPPORT VECTOR MACHINE MODEL
# First pass
set.seed(1492)
# Setup for cross validation
ctrl <- trainControl(method="repeatedcv",   # 10-fold cross-validation
                     repeats=5,             # do 5 repetitions of cv
                     summaryFunction=twoClassSummary,	# Use AUC to pick the best model
                     classProbs=TRUE)
 
 
#Train and Tune the SVM
svm.tune <- train(x=trainX,
                  y= trainData$Class,
                  method = "svmRadial",   # Radial kernel
                  tuneLength = 9,         # 9 values of the cost parameter C
                  preProc = c("center","scale"),  # Center and scale data
                  metric="ROC",
                  trControl=ctrl)
 
svm.tune
# Support Vector Machines with Radial Basis Function Kernel 
# 
# 1010 samples
# 58 predictor
# 2 classes: 'PS', 'WS' 
# 
# Pre-processing: centered, scaled 
# Resampling: Cross-Validated (10 fold, repeated 5 times) 
# Summary of sample sizes: 908, 909, 909, 909, 909, 909, ... 
# Resampling results across tuning parameters:
#   
#   C      ROC        Sens       Spec       ROC SD      Sens SD     Spec SD   
#  0.25  0.8695054  0.8540559  0.6690476  0.03720951  0.04389913  0.07282584
#  0.50  0.8724147  0.8592774  0.6912857  0.03618794  0.04234003  0.07830249
#  1.00  0.8746137  0.8718648  0.6968254  0.03418700  0.04204607  0.06918850
#  2.00  0.8709825  0.8755478  0.6969048  0.03345607  0.03927223  0.06714838
#  4.00  0.8609396  0.8795478  0.6702063  0.03437846  0.04189803  0.06597494
#  8.00  0.8456799  0.8703357  0.6310635  0.03610988  0.04105803  0.07540066
# 16.00  0.8293339  0.8666667  0.5943492  0.03717344  0.04773906  0.08006023
# 32.00  0.8220839  0.8636131  0.5759365  0.03622665  0.04531028  0.07587914
# 64.00  0.8123889  0.8605315  0.5541746  0.03795353  0.04494173  0.07140262
# 
# Tuning parameter 'sigma' was held constant at a value of 0.01521335
# ROC was used to select the optimal model using  the largest value.
# The final values used for the model were sigma = 0.01521335 and C = 1. 

The results show that the best model resulted from setting sigma = 0.01521335 and C = 1.
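
Where does that sigma value come from? train() delegates the analytic estimate to kernlab's sigest() function (kernlab is already loaded above), which returns the 0.1, 0.5 and 0.9 quantiles of plausible sigma values based on the distribution of distances between data points. Here is a minimal sketch of calling it directly; exactly how caret collapses the three quantiles into the single value it reports is an assumption on my part, so consult caret's svmRadial code for the precise rule:

# Analytic sigma estimate via kernlab's sigest()
sigmaEst <- sigest(as.matrix(trainX), scaled = TRUE)
sigmaEst                    # three quantile-based candidate values for sigma
mean(sigmaEst[c(1, 3)])     # averaging the outer quantiles is one plausible rule (assumption)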

In the second pass, having seen the parameter values selected in the first pass, we use train()'s tuneGrid parameter to do some sensitivity analysis around the values C = 1 and sigma = 0.015 that produced the model with the best ROC value. Note that R's expand.grid() function is used to build a data frame containing all the combinations of C and sigma we want to examine.

# Second pass
# Look at the results of svm.tune and refine the parameter space
 
set.seed(1492)
# Use the expand.grid to specify the search space	
grid <- expand.grid(sigma = c(.01, .015, 0.2),
                    C = c(0.75, 0.9, 1, 1.1, 1.25)
)
 
#Train and Tune the SVM
svm.tune <- train(x=trainX,
                    y= trainData$Class,
                    method = "svmRadial",
                    preProc = c("center","scale"),
                    metric="ROC",
                    tuneGrid = grid,
                    trControl=ctrl)
 
svm.tune
# Support Vector Machines with Radial Basis Function Kernel 
# 
# 1010 samples
# 58 predictor
# 2 classes: 'PS', 'WS' 
# 
# Pre-processing: centered, scaled 
# Resampling: Cross-Validated (10 fold, repeated 5 times) 
# Summary of sample sizes: 909, 909, 908, 910, 909, 909, ... 
# Resampling results across tuning parameters:
#   
#   sigma  C     ROC        Sens       Spec       ROC SD      Sens SD     Spec SD   
# 0.010  0.75  0.8727381  0.8614685  0.6817619  0.04096223  0.04183900  0.07664910
# 0.010  0.90  0.8742107  0.8633193  0.6878889  0.04066995  0.04037202  0.07817537
# 0.010  1.00  0.8748389  0.8630023  0.6873016  0.04079094  0.04061032  0.08189960
# 0.010  1.10  0.8747998  0.8642378  0.6884444  0.04076756  0.04004827  0.07892234
# 0.010  1.25  0.8749384  0.8657762  0.6923492  0.04083294  0.03911751  0.08070616
# 0.015  0.75  0.8726557  0.8660793  0.6923333  0.04171842  0.04324822  0.08203598
# 0.015  0.90  0.8727037  0.8688531  0.6945714  0.04164810  0.04082448  0.08379649
# 0.015  1.00  0.8727079  0.8713147  0.6906667  0.04184851  0.04273855  0.08174494
# 0.015  1.10  0.8724013  0.8719301  0.6895556  0.04197524  0.04108930  0.08377854
# 0.015  1.25  0.8721560  0.8722331  0.6900952  0.04207802  0.04096501  0.08639355
# 0.200  0.75  0.8193497  0.8863263  0.4478413  0.04695531  0.04072159  0.08061632
# 0.200  0.90  0.8195377  0.8903263  0.4405397  0.04688797  0.04091728  0.07844983
# 0.200  1.00  0.8193307  0.8915478  0.4361111  0.04719399  0.04004779  0.07815045
# 0.200  1.10  0.8195696  0.8958508  0.4333651  0.04694670  0.04026003  0.08252021
# 0.200  1.25  0.8198250  0.8983077  0.4271905  0.04705685  0.03900879  0.07945602
# 
# ROC was used to select the optimal model using  the largest value.
# The final values used for the model were sigma = 0.01 and C = 1.25. 

This was quite a bit of calculation for an improvement of 0.0003247 in the ROC score, but it shows off some of what caret can do.

To finish up, we build a model with a different kernel. The linear kernel is the simplest way to go: there is only the C parameter to set, and train() hardwires in a value of C = 1. The resulting ROC value of 0.87 is not too shabby.

#Linear Kernel
set.seed(1492)                     
 
#Train and Tune the SVM
svm.tune2 <- train(x=trainX,
                    y= trainData$Class,
                    method = "svmLinear",
                    preProc = c("center","scale"),
                    metric="ROC",
                    trControl=ctrl)	
 
 
svm.tune2
 
# > svm.tune2
# Support Vector Machines with Linear Kernel 
# 
# 1010 samples
# 58 predictor
# 2 classes: 'PS', 'WS' 
# 
# Pre-processing: centered, scaled 
# Resampling: Cross-Validated (10 fold, repeated 5 times) 
# Summary of sample sizes: 909, 909, 908, 910, 909, 909, ... 
# Resampling results
# 
# ROC        Sens       Spec       ROC SD      Sens SD     Spec SD  
# 0.8654818  0.8774312  0.6327302  0.03975929  0.04210438  0.0824907
# 
# Tuning parameter 'C' was held constant at a value of 1

Because I took the trouble to set the seed for the pseudorandom number generator to the same value before each of the resampling operations above, we can use caret's resamples() function to compare the results generated by the radial kernel and linear kernel models. The following block of code and results shows just the first five lines of the comparison table but includes the full summary of the comparison.

rValues <- resamples(list(svm=svm.tune,svm.tune2))
rValues$values
# Resample   svm~ROC  svm~Sens  svm~Spec Model2~ROC Model2~Sens Model2~Spec
# 1  Fold01.Rep1 0.8995726 0.8153846 0.7777778  0.8914530   0.8461538   0.6666667
# 2  Fold01.Rep2 0.9452991 0.9076923 0.7500000  0.9346154   0.9384615   0.6666667
# 3  Fold01.Rep3 0.8153846 0.8153846 0.6111111  0.8115385   0.8153846   0.5555556
# 4  Fold01.Rep4 0.9115385 0.9076923 0.6944444  0.8918803   0.9076923   0.4722222
# 5  Fold01.Rep5 0.9017094 0.8615385 0.6944444  0.8645299   0.8923077   0.6666667
summary(rValues)
# Call:
#   summary.resamples(object = rValues)
# 
# Models: svm, Model2 
# Number of resamples: 50 
# 
# ROC 
# Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
# svm    0.7910  0.8503 0.8785 0.8749  0.9013 0.9620    0
# Model2 0.7789  0.8399 0.8609 0.8655  0.8918 0.9457    0
# 
# Sens 
# Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
# svm    0.7538  0.8346 0.8769 0.8658  0.8935 0.9385    0
# Model2 0.7692  0.8462 0.8923 0.8774  0.9077 0.9538    0
# 
# Spec 
# Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
# svm    0.4286  0.6312 0.6944 0.6923  0.7500 0.8333    0
# Model2 0.4722  0.5556 0.6389 0.6327  0.6667 0.8333    0
 
bwplot(rValues, metric="ROC", ylab=c("linear kernel", "radial kernel"))    # boxplot of resampled ROC values

The boxplot below appears to give the radial kernel the edge as the better model, but of course we are far from done. We will leave the last step of Hsu et al.'s methodology, testing the models on held-out test data, for another day.

[Figure: boxplots comparing resampled ROC for the radial and linear kernel models]
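
Although we are deferring a full evaluation, here is a minimal sketch of what that last step might look like: scoring the held-out test set with the tuned radial kernel model and computing the test-set ROC with the pROC package loaded at the top of this post. The variable names follow the code above.

# Score the held-out test set with the tuned radial kernel model
testX <- testData[,4:61]        # same predictor columns used for training
svm.probs <- predict(svm.tune, newdata = testX, type = "prob")  # class probabilities

# Test-set ROC curve and AUC
svm.roc <- roc(response = testData$Class, predictor = svm.probs$PS)
auc(svm.roc)
plot(svm.roc)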

To go further, start with the resources on the caret website. Have a look at the list of 206 models that can be used with the train() function. (This is an astounding number of models to put into a single operational framework.) Then, for some very readable background material on SVMs, I recommend section 13.4 of Applied Predictive Modeling and sections 9.3 and 9.4 of Practical Data Science with R by Nina Zumel and John Mount. You will be hard pressed to find an introduction to kernel methods and SVMs that is as clear and useful as this last reference. Finally, I'm not quite finished with it, but Pedro's book is a pretty good read.

* The 5 tribes and their master algorithms:

  1. symbolists - inverse deduction
  2. connectionists - backpropagation
  3. evolutionists - genetic programming
  4. Bayesians - Bayesian inference
  5. analogizers - the support vector machine
