by Joseph Rickert
If there is anything that experienced machine learning practitioners are likely to agree on, it would be the importance of careful and thoughtful feature engineering. The judicious selection of which predictor variables to include in a model often has a more beneficial effect on overall classifier performance than the choice of the classification algorithm itself. This is one reason why classification algorithms that automatically include feature selection such as glmnet, gbm or random forests top the list of “go to” algorithms for many practitioners.
There are occasions, however, when you find yourself for one reason or another committed to classifier that doesn’t automatically narrow down that list of predictor variables that some sort of automated feature selection might seem like a good idea. If you are an R user then the caret package offers a whole lot machinery that might be helpful. Caret offers both filter methods and wrapper methods that include recursive feature estimation, genetic algorithms (GAs) and simulated annealing. In this post, we will have a look at a small experiment with caret’s GA option. But first, a little background.
Performing feature selection with GAs requires conceptualizing the process of feature selection as an optimization problem and then mapping it to the genetic framework of random variation and natural selection. Individuals from a given generation of a population mate to produce offspring who inherit genes (chromosomes) from both parents. Random mutation alters a small part of child’s genetic material. The children of this new generation who are genetically most fit produce the next generation. In the feature selection context, individuals become solutions to a prediction problem. Chromosomes (sequences of genes) are modeled as vectors of 1’s and 0’s with a 1 indicating the presence of a feature and a 0 its absence. The simulated genetic algorithm then does the following: it selects two individuals, randomly chooses a split point for their chromosomes, maps the front of one chromosome to the back of the other (and vice versa) and then randomly mutates the resulting chromosomes according to some predetermined probability.
In their book Applied Predictive Modeling, Kuhn and Johnson provide the following pseudo code for caret's GA:
- Define stopping criteria, population size, P, for each generation, and mutation probability, pm
- Randomly generate an initial population of chromosomes
- | for each chromosome do
- | Tune and train a model and compute each chromosome's fitness
- | end
- | for each reproduction 1 … P/2 do
- | Select 2 chromosomes based on fitness
- | Crossover: randomly select a locus and exchange genes on either side of locus
- | (head of one chromosome applied to tail of the other and vice versa)
- | to produce 2 child chromosomes with mixed genes
- | Mutate the child chromosomes with probability pm
- | end
- until stopping criterion are met
If 10 fold cross validation is selected in the GA control procedure, then the entire genetic algorithm (steps 2 through 13) is run 10 times.
Now, just for fun, I'll conduct the following experiment to see if GA feature selection will improve on the performance of the support vector machine model featured in a previous post. As before, I use the segmentationData data set that is included in the caret package and described in the paper by Hill et al. This data set has 2,019 rows and 58 possible feature variables. The structure of the experiment is as follows:
- Divide the data into a training and test data sets.
- Run the GA feature selection algorithm on the training data set to produce a subset of the training set with the selected features.
- Train the SVM algorithm on this subset.
- Assess the performance of the SVM model using the subset of the test data that contains the selected features.
- Train the SVM model on the entire training data set.
- Assess the performance of this second SVM model using the test data set.
- Compare the performance of the two SVM models.
(Note that I have cut corners here by training the SVM on the same data that was used in the GA. A methodological improvement would be to divide the original data into three sets.)
The is first block of code sets up for the experiment and divides the data into training and test data sets.
library(caret) library(doParallel) # parallel processing library(dplyr) # Used by caret library(pROC) # plot the ROC curve ### Use the segmentationData from caret # Load the data and construct indices to divided it into training and test data sets. set.seed(10) data(segmentationData) # Load the segmentation data set dim(segmentationData) head(segmentationData,2) # trainIndex <- createDataPartition(segmentationData$Case,p=.5,list=FALSE) trainData <- segmentationData[trainIndex,-c(1,2)] testData <- segmentationData[-trainIndex,-c(1,2)] # trainX <-trainData[,-1] # Create training feature data frame testX <- testData[,-1] # Create test feature data frame y=trainData$Class # Target variable for training
Next run the GA. Note the first line of code registers the parallel workers to set up for having the caret run the GA in parallel using the 4 cores on my Windows laptop.
registerDoParallel(4) # Registrer a parallel backend for train getDoParWorkers() # check that there are 4 workers ga_ctrl <- gafsControl(functions = rfGA, # Assess fitness with RF method = "cv", # 10 fold cross validation genParallel=TRUE, # Use parallel programming allowParallel = TRUE) ## set.seed(10) lev <- c("PS","WS") # Set the levels system.time(rf_ga3 <- gafs(x = trainX, y = y, iters = 100, # 100 generations of algorithm popSize = 20, # population size for each generation levels = lev, gafsControl = ga_ctrl))
The parameter settings of the gafsControl() function indicate that the internally implemented random forests model and 10 fold cross validation are to be used to assess performance of the "chromosomes" in each generation. The parameters for the gafs() function itself specify 100 generations of populations consisting of 20 individuals. Ideally, it would be nice to let the algorithm run for more iterations with larger populations and perhaps repeated 10 fold CV. However, even these modest parameter settings generate a tremendous number of calculations. (The algorithm took 4 hours to complete on a small Azure VM.)
The results below, and the plot that follows them, show that the best performance was achieved on the at iteration 9 with a subset of 44 variables.
rf_ga3 1010 samples 58 predictors 2 classes: 'PS', 'WS' Maximum generations: 100 Population per generation: 20 Crossover probability: 0.8 Mutation probability: 0.1 Elitism: 0 Internal performance values: Accuracy, Kappa Subset selection driven to maximize internal Accuracy External performance values: Accuracy, Kappa Best iteration chose by maximizing external Accuracy External resampling method: Cross-Validated (10 fold) During resampling: * the top 5 selected variables (out of a possible 58): DiffIntenDensityCh4 (100%), EntropyIntenCh1 (100%), EqEllipseProlateVolCh1 (100%), EqSphereAreaCh1 (100%), FiberWidthCh1 (100%) * on average, 39.3 variables were selected (min = 25, max = 52) In the final search using the entire training set: * 44 features selected at iteration 9 including: AvgIntenCh1, AvgIntenCh4, ConvexHullAreaRatioCh1, ConvexHullPerimRatioCh1, EntropyIntenCh3 ... * external performance at this iteration is Accuracy Kappa 0.8406 0.6513 plot(rf_ga3) # Plot mean fitness (AUC) by generation
The plot also shows the average internal accuracy estimates as well as the average external estimates calculated from the 10 out of sample predictions.
This next section of code trains an SVM using only the selected features setting up a grid to search the parameter space; again, using the parallel computing feature built into caret.
final <- rf_ga3$ga$final # Get features selected by GA trainX2 <- trainX[,final] # training data: selected features testX2 <- testX[,final] # test data: selected features ## SUPPORT VECTOR MACHINE MODEL #Note the default method of picking the best model is accuracy and Cohen's Kappa # Set up training control ctrl <- trainControl(method="repeatedcv", # 10fold cross validation repeats=5, # do 5 repititions of cv summaryFunction=twoClassSummary, # Use AUC to pick the best model classProbs=TRUE) #Use the expand.grid to specify the search space #Note that the default search grid selects 3 values of each tuning parameter grid <- expand.grid(interaction.depth = seq(1,4,by=2), #tree depths from 1 to 4 n.trees=seq(10,100,by=10), # let iterations go from 10 to 100 shrinkage=c(0.01,0.1), # Try 2 values fornlearning rate n.minobsinnode = 20) # # Set up for parallel processing set.seed(1951) registerDoParallel(4,cores=4) #Train and Tune the SVM svm.tune <- train(x=trainX2, y= trainData$Class, method = "svmRadial", tuneLength = 9, # 9 values of the cost function preProc = c("center","scale"), metric="ROC", trControl=ctrl) # same as for gbm above )
Finally, assess the performance of the model using the test data set.
#Make predictions on the test data with the SVM Model svm.pred <- predict(svm.tune,testX2) confusionMatrix(svm.pred,testData$Class) Confusion Matrix and Statistics Reference Prediction PS WS PS 560 106 WS 84 259 Accuracy : 0.8117 95% CI : (0.7862, 0.8354) No Information Rate : 0.6383 P-Value [Acc > NIR] : <2e-16 Kappa : 0.5868 Mcnemar's Test P-Value : 0.1276 Sensitivity : 0.8696 Specificity : 0.7096 Pos Pred Value : 0.8408 Neg Pred Value : 0.7551 Prevalence : 0.6383 Detection Rate : 0.5550 Detection Prevalence : 0.6601 Balanced Accuracy : 0.7896 # 'Positive' Class : PS svm.probs <- predict(svm.tune,testX2,type="prob") # Gen probs for ROC svm.ROC <- roc(predictor=svm.probs$PS, response=testData$Class, levels=rev(levels(testData$Class))) svm.ROC #Area under the curve: 0.8881 plot(svm.ROC,main="ROC for SVM built with GA selected features")
Voilà - a decent model, but at significant computational cost and, as it turns out, with no performance improvement! If the GA feature selection procedure is just omitted, and the SVM is fit using the training and tuning procedure described above with the same seed settings it will produce a model with an AUC of 0.8888.
In the "glass half empty" interpretation of the experiment I did a lot of work for nothing. On the other hand, looking at a "glass half full", using automatic GA feature selection, I was able to build a model that achieved the same performance as the full model but with 24% fewer features. My guess though is that I just got lucky this time. I think the take away is that this kind of automated feature selection might be worth while if there is a compelling reason to reduce the number of features, perhaps at the expense of a decrease in performance. However, unless you know that you are dealing with a classifier that is easily confused by irrelevant predictor variables there is no reason to expect that GA, or any other "wrapper style" feature selection method, will improve performance.