Feature Selection with caret’s Genetic Algorithm Option

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Joseph Rickert

If there is anything that experienced machine learning practitioners are likely to agree on, it would be the importance of careful and thoughtful feature engineering. The judicious selection of which predictor variables to include in a model often has a more beneficial effect on overall classifier performance than the choice of the classification algorithm itself. This is one reason why classification algorithms that automatically include feature selection such as glmnet, gbm or random forests top the list of “go to” algorithms for many practitioners.

There are occasions, however, when you find yourself for one reason or another committed to classifier that doesn’t automatically narrow down that list of predictor variables that some sort of automated feature selection might seem like a good idea. If you are an R user then the caret package offers a whole lot machinery that might be helpful. Caret offers both filter methods and wrapper methods that include recursive feature estimation, genetic algorithms (GAs) and simulated annealing. In this post, we will have a look at a small experiment with caret’s GA option. But first, a little background.      

Performing feature selection with GAs requires conceptualizing the process of feature selection as an optimization problem and then mapping it to the genetic framework of random variation and natural selection. Individuals from a given generation of a population mate to produce offspring who inherit genes (chromosomes) from both parents. Random mutation alters a small part of child’s genetic material. The children of this new generation who are genetically most fit produce the next generation. In the feature selection context, individuals become solutions to a prediction problem. Chromosomes (sequences of genes) are modeled as vectors of 1’s and 0’s with a 1 indicating the presence of a feature and a 0 its absence. The simulated genetic algorithm then does the following: it selects two individuals, randomly chooses a split point for their chromosomes, maps the front of one chromosome to the back of the other (and vice versa) and then randomly mutates the resulting chromosomes according to some predetermined probability.

In their book Applied Predictive Modeling, Kuhn and Johnson provide the following pseudo code for caret's GA:

  1. Define stopping criteria, population size, P, for each generation, and mutation probability, pm
  2. Randomly generate an initial population of chromosomes
  3. repeat:
  4. |    for each chromosome do
  5. |        Tune and train a model and compute each chromosome's fitness
  6. |    end
  7. |   for each reproduction 1 … P/2 do
  8. |        Select 2 chromosomes based on fitness
  9. |        Crossover: randomly select a locus and exchange genes on either side of locus
  10. |                        (head of one chromosome applied to tail of the other and vice versa)
  11. |                        to produce 2 child chromosomes with mixed genes
  12. |        Mutate the child chromosomes with probability pm
  13. |   end
  14. until stopping criterion are met

If 10 fold cross validation is selected in the GA control procedure, then the entire genetic algorithm (steps 2 through 13) is run 10 times.

Now, just for fun, I'll conduct the following experiment to see if GA feature selection will improve on the performance of the support vector machine model featured in a previous post. As before, I use the segmentationData data set that is included in the caret package and described in the paper by Hill et al. This data set has 2,019 rows and 58 possible feature variables. The structure of the experiment is as follows:

  1. Divide the data into a training and test data sets.
  2. Run the GA feature selection algorithm on the training data set to produce a subset of the training set with the selected features.
  3. Train the SVM algorithm on this subset.
  4. Assess the performance of the SVM model using the subset of the test data that contains the selected features.
  5. Train the SVM model on the entire training data set.
  6. Assess the performance of this second SVM model using the test data set.
  7. Compare the performance of the two SVM models.

(Note that I have cut corners here by training the SVM on the same data that was used in the GA. A methodological improvement would be to divide the original data into three sets.)

The is first block of code sets up for the experiment and divides the data into training and test data sets.

library(doParallel) # parallel processing
library(dplyr) # Used by caret
library(pROC) # plot the ROC curve
### Use the segmentationData from caret
# Load the data and construct indices to divided it into training and test data sets.
data(segmentationData) # Load the segmentation data set
trainIndex <- createDataPartition(segmentationData$Case,p=.5,list=FALSE)
trainData <- segmentationData[trainIndex,-c(1,2)]
testData <- segmentationData[-trainIndex,-c(1,2)]
trainX <-trainData[,-1] # Create training feature data frame
testX <- testData[,-1] # Create test feature data frame 
y=trainData$Class # Target variable for training

Next run the GA. Note the first line of code registers the parallel workers to set up for having the caret run the GA in parallel using the 4 cores on my Windows laptop. 

registerDoParallel(4) # Registrer a parallel backend for train
getDoParWorkers() # check that there are 4 workers
ga_ctrl <- gafsControl(functions = rfGA, # Assess fitness with RF
                       method = "cv",    # 10 fold cross validation
                       genParallel=TRUE, # Use parallel programming
                       allowParallel = TRUE)
lev <- c("PS","WS")     # Set the levels
system.time(rf_ga3 <- gafs(x = trainX, y = y,
                           iters = 100, # 100 generations of algorithm
                           popSize = 20, # population size for each generation
                           levels = lev,
                           gafsControl = ga_ctrl))

The parameter settings of the gafsControl() function indicate that the internally implemented random forests model and 10 fold cross validation are to be used to assess performance of the "chromosomes" in each generation. The parameters for the gafs() function itself specify 100 generations of populations consisting of 20 individuals. Ideally, it would be nice to let the algorithm run for more  iterations with larger populations and perhaps repeated 10 fold CV. However, even these modest parameter settings generate a tremendous number of calculations. (The algorithm took 4 hours to complete on a small Azure VM.)

The results below, and the plot that follows them, show that the best performance was achieved on the at iteration 9 with a subset of 44 variables.

1010 samples
58 predictors
2 classes: 'PS', 'WS' 
Maximum generations: 100 
Population per generation: 20 
Crossover probability: 0.8 
Mutation probability: 0.1 
Elitism: 0 
Internal performance values: Accuracy, Kappa
Subset selection driven to maximize internal Accuracy 
External performance values: Accuracy, Kappa
Best iteration chose by maximizing external Accuracy 
External resampling method: Cross-Validated (10 fold) 
 During resampling:
 * the top 5 selected variables (out of a possible 58):
 DiffIntenDensityCh4 (100%), EntropyIntenCh1 (100%), EqEllipseProlateVolCh1 (100%), EqSphereAreaCh1 (100%), FiberWidthCh1 (100%)
 * on average, 39.3 variables were selected (min = 25, max = 52)
 In the final search using the entire training set:
 * 44 features selected at iteration 9 including:
AvgIntenCh1, AvgIntenCh4, ConvexHullAreaRatioCh1, ConvexHullPerimRatioCh1, EntropyIntenCh3 ... 
* external performance at this iteration is
 Accuracy Kappa 
 0.8406 0.6513
plot(rf_ga3) # Plot mean fitness (AUC) by generation



The plot also shows the average internal accuracy estimates as well as the average external estimates calculated from the 10 out of sample predictions.

This next section of code trains an SVM using only the selected features setting up a grid to search the parameter space; again, using the parallel computing feature built into caret.

final <- rf_ga3$ga$final # Get features selected by GA
trainX2 <- trainX[,final] # training data: selected features
testX2 <- testX[,final] # test data: selected features
#Note the default method of picking the best model is accuracy and Cohen's Kappa 
# Set up training control
ctrl <- trainControl(method="repeatedcv", # 10fold cross validation
repeats=5, # do 5 repititions of cv
summaryFunction=twoClassSummary, # Use AUC to pick the best model
#Use the expand.grid to specify the search space 
#Note that the default search grid selects 3 values of each tuning parameter
grid <- expand.grid(interaction.depth = seq(1,4,by=2), #tree depths from 1 to 4
n.trees=seq(10,100,by=10), # let iterations go from 10 to 100
shrinkage=c(0.01,0.1), # Try 2 values fornlearning rate 
n.minobsinnode = 20)
# Set up for parallel processing
#Train and Tune the SVM
svm.tune <- train(x=trainX2,
                  y= trainData$Class,
                  method = "svmRadial",
                  tuneLength = 9, # 9 values of the cost function
                  preProc = c("center","scale"),
                  trControl=ctrl) # same as for gbm above

Finally, assess the performance of the model using the test data set.

#Make predictions on the test data with the SVM Model
svm.pred <- predict(svm.tune,testX2)
Confusion Matrix and Statistics
Prediction PS  WS
        PS 560 106
        WS 84 259
Accuracy : 0.8117 
95% CI : (0.7862, 0.8354)
No Information Rate : 0.6383 
P-Value [Acc > NIR] : <2e-16 
Kappa : 0.5868 
Mcnemar's Test P-Value : 0.1276 
Sensitivity : 0.8696 
Specificity : 0.7096 
Pos Pred Value : 0.8408 
Neg Pred Value : 0.7551 
Prevalence : 0.6383 
Detection Rate : 0.5550 
Detection Prevalence : 0.6601 
Balanced Accuracy : 0.7896 
'Positive' Class : PS
svm.probs <- predict(svm.tune,testX2,type="prob") # Gen probs for ROC
svm.ROC <- roc(predictor=svm.probs$PS,
#Area under the curve: 0.8881
plot(svm.ROC,main="ROC for SVM built with GA selected features")


Voilà - a decent model, but at significant computational cost and, as it turns out, with no performance improvement! If the GA feature selection procedure is just omitted, and the SVM is fit using the training and tuning procedure described above with the same seed settings it will produce a model with an AUC of 0.8888.

In the "glass half empty" interpretation of the experiment I did a lot of work for nothing. On the other hand, looking at a "glass half full", using automatic GA feature selection, I was able to build a model that achieved the same performance as the full model but with 24% fewer features. My guess though is that I just got lucky this time. I think the take away is that this kind of automated feature selection might be worth while if there is a compelling reason to reduce the number of features, perhaps at the expense of a decrease in performance. However, unless you know that you are dealing with a classifier that is easily confused by irrelevant predictor variables there is no reason to expect that GA, or any other "wrapper style" feature selection method, will improve performance. 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)