Feature selection is a process of extracting valuable features that have significant influence on dependent variable. This is still an active field of research and machine wandering. In this post I compare few feature selection algorithms: traditional GLM with regularization, computationally demanding Boruta and entropy based filter from FSelectorRcpp (free of Java/Weka) package. Check out the comparison on Venn Diagram carried out on data from the RTCGA factory of R data packages.
Data used for this comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and present genes’ expressions (RNASeq) from human sequenced genome. Datasets with RNASeq are available via RTCGA.rnaseq data package and originally were provided by The Cancer Genome Atlas. It’s a great set of over 20 thousand of features (1 gene expression = 1 continuous feature) that might have influence on various aspects of human survival. Let’s use data for Breast Cancer (Breast invasive carcinoma / BRCA) where we will try to find valuable genes that have impact on dependent variable denoting whether a sample of the collected readings came from tumor or normal, healthy tissue.
The dependent variable,
bcr_patient_barcode, is the TCGA barcode from which we receive information whether a sample of the collected readings came from tumor or normal, healthy tissue (14th character in the code).
Check another RTCGA use case: TCGA and The Curse of BigData.
Logistic Regression, a model from generalized linear models (GLM) family, a first attempt model for class prediction, can be extended with regularization net to provide prediction and variables selection at the same time. We can assume that not valuable features will appear with equal to zero coefficient in the final model with best regularization parameter. Broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features with the extra help of cross-validation and parallel computing.
coef extracts coefficients for fitted model. Argument
s specifies for which regularization parameter we would like to extract them -
lamba.min is the parameter for which miss-classification error is minimal. You may also try to use
Discussion about standardization for LASSO can be found here. I normally don’t do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization is problematic and is still a rapid field of research.
The second feature selection algorithm, Boruta, is based on ranger - a fast implementation of random forests. At start this algorithm creates duplicated variables for each attribute in the model’s formula. Duplicates then have randomly ordered values. Algorithm takes into consideration the Z-statistic of each variable in each tree and checks whether the boxplot of Z-statistics for a variable is higher than the highest boxplot for randomly ordered duplicate. Boruta is more sophisticated and its 9 page explanation can be found in Journal of Statistical Software. Below I present code that extracts valuable and tentative attributes
plot() overloaded function for
Boruta class object which enables plotting boxplots of Z-statistics for each variable. It’s a good way to visualize the selection process. In this case there are over 20 thousand of attributes so I am using
getSelectedAttributes function to extract valuable features. You could also extract
tentative variables with
withTentative = TRUE.
In case you run
Boruta code yourself and it takes too long or you get such an error
I have archived my model on GitHub, and you can load it to R global environment with archivist package
The history of Boruta decisions of rejecting and accepting features can be seen on such a graph
We have realized that there are many sophisticated and computationally absorbent feature selection algorithms. In many cases the time is a huge blocker and we can’t wait for Boruta to finish its computing. For such cases easier algorithms are demanded. One of them is entropy based filter implemented in FSelector package and its faster Rcpp (free of Java/Weka) reimplementation in FSelectorRcpp package (by zzawadz) that also has support for sparse matrixes. If you would like to contribute to this package, please check our issues list on GitHub.
Information gain is a simple, linear algorithm that computes the entropy of dependent and explanatory variables, and the conditional entropy of dependent variable with respect to each explanatory variable separately. This simple statistic enables to calculate the belief of the distribution of dependent variable when we only know the distribution of explanatory variable. Below code shows how easy it is to use
Information gain algorithm returns scores for attributes and we will cut top 1 % features. For that we will use simple
Finally we can visualize differences in decisions of algorithms. One way is a Venn Diagram that (after Wikipedia)
shows all possible logical relations between a finite collection of different sets.
The effect of this code can be seen on the welcome graph of this post. We can see that only 1 variable was chosen with all 3 algorithms. VennDiagram package allows to find out the names of intersections.
Do you have any comments about the diagram or the unbelievable result that only 1 variable overlaps all 3 algorithms? Maybe some algorithms can be improved? Is this the case of random nature of
Boruta? Feel free to leave your comments on the Disqus panel below.