Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Joseph Rickert

If there is anything that experienced machine learning practitioners are likely to agree on, it would be the importance of careful and thoughtful feature engineering. The judicious selection of which predictor variables to include in a model often has a more beneficial effect on overall classifier performance than the choice of the classification algorithm itself. This is one reason why classification algorithms that automatically include feature selection such as glmnet, gbm or random forests top the list of “go to” algorithms for many practitioners.

There are occasions, however, when you find yourself for one reason or another committed to classifier that doesn’t automatically narrow down that list of predictor variables that some sort of automated feature selection might seem like a good idea. If you are an R user then the caret package offers a whole lot machinery that might be helpful. Caret offers both filter methods and wrapper methods that include recursive feature estimation, genetic algorithms (GAs) and simulated annealing. In this post, we will have a look at a small experiment with caret’s GA option. But first, a little background.

Performing feature selection with GAs requires conceptualizing the process of feature selection as an optimization problem and then mapping it to the genetic framework of random variation and natural selection. Individuals from a given generation of a population mate to produce offspring who inherit genes (chromosomes) from both parents. Random mutation alters a small part of child’s genetic material. The children of this new generation who are genetically most fit produce the next generation. In the feature selection context, individuals become solutions to a prediction problem. Chromosomes (sequences of genes) are modeled as vectors of 1’s and 0’s with a 1 indicating the presence of a feature and a 0 its absence. The simulated genetic algorithm then does the following: it selects two individuals, randomly chooses a split point for their chromosomes, maps the front of one chromosome to the back of the other (and vice versa) and then randomly mutates the resulting chromosomes according to some predetermined probability.

In their book Applied Predictive Modeling, Kuhn and Johnson provide the following pseudo code for caret's GA:

1. Define stopping criteria, population size, P, for each generation, and mutation probability, pm
2. Randomly generate an initial population of chromosomes
3. repeat:
4. |    for each chromosome do
5. |        Tune and train a model and compute each chromosome's fitness
6. |    end
7. |   for each reproduction 1 … P/2 do
8. |        Select 2 chromosomes based on fitness
9. |        Crossover: randomly select a locus and exchange genes on either side of locus
10. |                        (head of one chromosome applied to tail of the other and vice versa)
11. |                        to produce 2 child chromosomes with mixed genes
12. |        Mutate the child chromosomes with probability pm
13. |   end
14. until stopping criterion are met

If 10 fold cross validation is selected in the GA control procedure, then the entire genetic algorithm (steps 2 through 13) is run 10 times.

Now, just for fun, I'll conduct the following experiment to see if GA feature selection will improve on the performance of the support vector machine model featured in a previous post. As before, I use the segmentationData data set that is included in the caret package and described in the paper by Hill et al. This data set has 2,019 rows and 58 possible feature variables. The structure of the experiment is as follows:

1. Divide the data into a training and test data sets.
2. Run the GA feature selection algorithm on the training data set to produce a subset of the training set with the selected features.
3. Train the SVM algorithm on this subset.
4. Assess the performance of the SVM model using the subset of the test data that contains the selected features.
5. Train the SVM model on the entire training data set.
6. Assess the performance of this second SVM model using the test data set.
7. Compare the performance of the two SVM models.

(Note that I have cut corners here by training the SVM on the same data that was used in the GA. A methodological improvement would be to divide the original data into three sets.)

The is first block of code sets up for the experiment and divides the data into training and test data sets.

library(caret)
library(doParallel) # parallel processing
library(dplyr) # Used by caret
library(pROC) # plot the ROC curve

### Use the segmentationData from caret
# Load the data and construct indices to divided it into training and test data sets.
set.seed(10)
data(segmentationData) # Load the segmentation data set
dim(segmentationData)
#
trainIndex <- createDataPartition(segmentationData$Case,p=.5,list=FALSE) trainData <- segmentationData[trainIndex,-c(1,2)] testData <- segmentationData[-trainIndex,-c(1,2)] # trainX <-trainData[,-1] # Create training feature data frame testX <- testData[,-1] # Create test feature data frame y=trainData$Class # Target variable for training


Next run the GA. Note the first line of code registers the parallel workers to set up for having the caret run the GA in parallel using the 4 cores on my Windows laptop.

registerDoParallel(4) # Registrer a parallel backend for train
getDoParWorkers() # check that there are 4 workers

ga_ctrl <- gafsControl(functions = rfGA, # Assess fitness with RF
method = "cv",    # 10 fold cross validation
genParallel=TRUE, # Use parallel programming
allowParallel = TRUE)
##
set.seed(10)
lev <- c("PS","WS")     # Set the levels

system.time(rf_ga3 <- gafs(x = trainX, y = y,
iters = 100, # 100 generations of algorithm
popSize = 20, # population size for each generation
levels = lev,
gafsControl = ga_ctrl))

The parameter settings of the gafsControl() function indicate that the internally implemented random forests model and 10 fold cross validation are to be used to assess performance of the "chromosomes" in each generation. The parameters for the gafs() function itself specify 100 generations of populations consisting of 20 individuals. Ideally, it would be nice to let the algorithm run for more  iterations with larger populations and perhaps repeated 10 fold CV. However, even these modest parameter settings generate a tremendous number of calculations. (The algorithm took 4 hours to complete on a small Azure VM.)

The results below, and the plot that follows them, show that the best performance was achieved on the at iteration 9 with a subset of 44 variables.

rf_ga3
1010 samples
58 predictors
2 classes: 'PS', 'WS'

Maximum generations: 100
Population per generation: 20
Crossover probability: 0.8
Mutation probability: 0.1
Elitism: 0

Internal performance values: Accuracy, Kappa
Subset selection driven to maximize internal Accuracy

External performance values: Accuracy, Kappa
Best iteration chose by maximizing external Accuracy
External resampling method: Cross-Validated (10 fold)

During resampling:
* the top 5 selected variables (out of a possible 58):
DiffIntenDensityCh4 (100%), EntropyIntenCh1 (100%), EqEllipseProlateVolCh1 (100%), EqSphereAreaCh1 (100%), FiberWidthCh1 (100%)
* on average, 39.3 variables were selected (min = 25, max = 52)

In the final search using the entire training set:
* 44 features selected at iteration 9 including:
AvgIntenCh1, AvgIntenCh4, ConvexHullAreaRatioCh1, ConvexHullPerimRatioCh1, EntropyIntenCh3 ...
* external performance at this iteration is

Accuracy Kappa
0.8406 0.6513

plot(rf_ga3) # Plot mean fitness (AUC) by generation

The plot also shows the average internal accuracy estimates as well as the average external estimates calculated from the 10 out of sample predictions.

This next section of code trains an SVM using only the selected features setting up a grid to search the parameter space; again, using the parallel computing feature built into caret.

final <- rf_ga3$ga$final # Get features selected by GA
trainX2 <- trainX[,final] # training data: selected features
testX2 <- testX[,final] # test data: selected features
## SUPPORT VECTOR MACHINE MODEL

#Note the default method of picking the best model is accuracy and Cohen's Kappa
# Set up training control
ctrl <- trainControl(method="repeatedcv", # 10fold cross validation
repeats=5, # do 5 repititions of cv
summaryFunction=twoClassSummary, # Use AUC to pick the best model
classProbs=TRUE)

#Use the expand.grid to specify the search space
#Note that the default search grid selects 3 values of each tuning parameter

grid <- expand.grid(interaction.depth = seq(1,4,by=2), #tree depths from 1 to 4
n.trees=seq(10,100,by=10), # let iterations go from 10 to 100
shrinkage=c(0.01,0.1), # Try 2 values fornlearning rate
n.minobsinnode = 20)
#
# Set up for parallel processing
set.seed(1951)
registerDoParallel(4,cores=4)

#Train and Tune the SVM
svm.tune <- train(x=trainX2,
y= trainData$Class, method = "svmRadial", tuneLength = 9, # 9 values of the cost function preProc = c("center","scale"), metric="ROC", trControl=ctrl) # same as for gbm above ) Finally, assess the performance of the model using the test data set. #Make predictions on the test data with the SVM Model svm.pred <- predict(svm.tune,testX2) confusionMatrix(svm.pred,testData$Class)

Confusion Matrix and Statistics

Reference
Prediction PS  WS
PS 560 106
WS 84 259

Accuracy : 0.8117
95% CI : (0.7862, 0.8354)
No Information Rate : 0.6383
P-Value [Acc > NIR] : <2e-16

Kappa : 0.5868
Mcnemar's Test P-Value : 0.1276

Sensitivity : 0.8696
Specificity : 0.7096
Pos Pred Value : 0.8408
Neg Pred Value : 0.7551
Prevalence : 0.6383
Detection Rate : 0.5550
Detection Prevalence : 0.6601
Balanced Accuracy : 0.7896
#
'Positive' Class : PS

svm.probs <- predict(svm.tune,testX2,type="prob") # Gen probs for ROC

svm.ROC <- roc(predictor=svm.probs$PS, response=testData$Class,
levels=rev(levels(testData\$Class)))
svm.ROC
#Area under the curve: 0.8881
plot(svm.ROC,main="ROC for SVM built with GA selected features")

Voilà - a decent model, but at significant computational cost and, as it turns out, with no performance improvement! If the GA feature selection procedure is just omitted, and the SVM is fit using the training and tuning procedure described above with the same seed settings it will produce a model with an AUC of 0.8888.

In the "glass half empty" interpretation of the experiment I did a lot of work for nothing. On the other hand, looking at a "glass half full", using automatic GA feature selection, I was able to build a model that achieved the same performance as the full model but with 24% fewer features. My guess though is that I just got lucky this time. I think the take away is that this kind of automated feature selection might be worth while if there is a compelling reason to reduce the number of features, perhaps at the expense of a decrease in performance. However, unless you know that you are dealing with a classifier that is easily confused by irrelevant predictor variables there is no reason to expect that GA, or any other "wrapper style" feature selection method, will improve performance.