Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms

[This article was first published on http://r-addict.com, and kindly contributed to R-bloggers.]

Feature selection is the process of extracting valuable features that have a significant influence on the dependent variable. It is still an active field of research in machine learning. In this post I compare a few feature selection algorithms: a traditional GLM with regularization, the computationally demanding Boruta, and the entropy-based filter from the FSelectorRcpp (free of Java/Weka) package. Check out the comparison on a Venn diagram carried out on data from the RTCGA factory of R data packages.

I would like to thank Magda Sobiczewska and pbiecek for the inspiration for this comparison. I had a chance to use Boruta and FSelectorRcpp in action. GLMnet is here only to improve the Venn diagram.

RTCGA data

Data used for this comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and present gene expressions (RNASeq) from the sequenced human genome. Datasets with RNASeq are available via the RTCGA.rnaseq data package and were originally provided by The Cancer Genome Atlas. It's a great set of over 20 thousand features (1 gene expression = 1 continuous feature) that might have influence on various aspects of human survival. Let's use data for Breast Cancer (Breast invasive carcinoma / BRCA), where we will try to find valuable genes that have an impact on the dependent variable denoting whether a sample of the collected readings came from tumor or from normal, healthy tissue.

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.rnaseq")
library(RTCGA.rnaseq)
BRCA.rnaseq$bcr_patient_barcode <- 
   substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)

The dependent variable, bcr_patient_barcode, is the TCGA barcode from which we extract the information whether a sample of the collected readings came from tumor or from normal, healthy tissue (the 14th character of the barcode).
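As a quick sanity check of the class balance, one can tabulate the recoded variable (a minimal sketch, assuming the substr() recoding above has already been run; per the TCGA barcode convention, sample-type codes starting with 0 denote tumor and those starting with 1 denote normal tissue):

# class balance of the recoded dependent variable
# "0" = tumor sample, "1" = normal tissue (TCGA sample-type convention)
table(BRCA.rnaseq$bcr_patient_barcode)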

Check another RTCGA use case: TCGA and The Curse of BigData.

GLMnet

Logistic regression, a model from the generalized linear models (GLM) family and a natural first attempt at class prediction, can be extended with regularization (the elastic net) to provide prediction and variable selection at the same time. We can assume that features that are not valuable will end up with coefficients equal to zero in the final model fitted with the best regularization parameter. A broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features, with the extra help of cross-validation and parallel computing.

library(doMC)
registerDoMC(cores=6)
library(glmnet)
# fit the model
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
          y = factor(BRCA.rnaseq[, 1]),
          family = "binomial", 
          type.measure = "class", 
          parallel = TRUE) -> cvfit
# extract feature names that have
# a non-zero coefficient; the first
# name is the intercept, hence [-1]
names(which(
   coef(cvfit, s = "lambda.min")[, 1] != 0)
   )[-1] -> glmnet.features

The coef function extracts coefficients of the fitted model. The argument s specifies the regularization parameter for which we would like to extract them - lambda.min is the parameter for which the cross-validated misclassification error is minimal. You may also try lambda.1se.
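For comparison, the same extraction could be repeated for the more conservative lambda.1se (a sketch; glmnet.features.1se is an illustrative name that is not used later in the post):

# features still non-zero at the more regularized lambda.1se
names(which(
   coef(cvfit, s = "lambda.1se")[, 1] != 0)
   )[-1] -> glmnet.features.1se
length(glmnet.features.1se)  # typically fewer than at lambda.min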

plot(cvfit)

[Figure: cross-validation curve from plot(cvfit)]

A discussion about standardization for LASSO can be found here. I normally don't do this, since I work with streaming data, for which checking assumptions, running model diagnostics and standardizing are problematic and still an active field of research.
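Note that cv.glmnet standardizes the columns of x internally by default (standardize = TRUE), so no explicit standardization was needed above; a minimal sketch of switching it off, if you prefer to fit on the raw scale (cvfit.raw is an illustrative name):

# refit without internal standardization; otherwise the same settings as above
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
          y = factor(BRCA.rnaseq[, 1]),
          family = "binomial",
          type.measure = "class",
          standardize = FALSE,
          parallel = TRUE) -> cvfit.raw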

Boruta

The second feature selection algorithm, Boruta, is based on ranger - a fast implementation of random forests. At the start, the algorithm creates a duplicate (shadow) variable for each attribute in the model's formula and randomly shuffles the duplicates' values. It then considers the Z-statistic of each variable in each tree and checks whether the boxplot of Z-statistics for a variable lies higher than the highest boxplot among the randomly shuffled duplicates. Boruta is more sophisticated than that, and its full 9-page explanation can be found in the Journal of Statistical Software. Below I present code that extracts the valuable and tentative attributes.

library(Boruta)
# fit features selection algorithm
system.time({
Boruta(as.factor(bcr_patient_barcode) ~.,
       data = BRCA.rnaseq) -> Boruta.model
}) # 20425 seconds (~5.7 hours)
# archive result
library(archivist)
saveToRepo(Boruta.model, 
           repoDir = "Museum")

# clean backticks that the formula interface
# adds around non-syntactic gene names
gsub(x = getSelectedAttributes(Boruta.model),
     pattern = "`",
     replacement = "",
     fixed = TRUE) -> Boruta.features

There exists an overloaded plot() function for objects of class Boruta which draws the boxplots of Z-statistics for each variable. It's a good way to visualize the selection process. In this case there are over 20 thousand attributes, so instead I am using the getSelectedAttributes function to extract the valuable features. You could also extract the tentative variables with withTentative = TRUE, as sketched below.
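A one-liner for also pulling the tentative attributes, with the same backtick clean-up as above (Boruta.features.tentative is just an illustrative name):

# confirmed + tentative attributes
gsub(x = getSelectedAttributes(Boruta.model, withTentative = TRUE),
     pattern = "`",
     replacement = "",
     fixed = TRUE) -> Boruta.features.tentative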

In case you run the Boruta code yourself and it takes too long, or you get an error such as

Error: protect(): protection stack overflow

I have archived my model on GitHub, and you can load it into the R global environment with the archivist package:

library(archivist)
aread("MarcinKosinski/Museum/d3732742b989ed49666b0472ba52d705") -> Boruta.model

The history of Boruta's decisions about rejecting and accepting features can be visualized with

plotImpHistory(Boruta.model)

[Figure: importance history from plotImpHistory(Boruta.model)]

FSelectorRcpp::information_gain

We have seen that there are many sophisticated and computationally demanding feature selection algorithms. In many cases time is a huge blocker and we can't wait for Boruta to finish its computations. For such cases simpler algorithms are needed. One of them is the entropy-based filter implemented in the FSelector package, and its faster Rcpp (free of Java/Weka) reimplementation in the FSelectorRcpp package (by zzawadz), which also supports sparse matrices. If you would like to contribute to this package, please check our issues list on GitHub.

Information gain is a simple and fast algorithm that computes the entropy of the dependent and explanatory variables, and the conditional entropy of the dependent variable given each explanatory variable separately. The statistic measures how much knowing the distribution of an explanatory variable reduces our uncertainty about the distribution of the dependent variable. The code below shows how easy it is to use (a toy illustration of the statistic itself follows after the usage example):

devtools::install_github('mi2-warsaw/FSelectorRcpp')
library(FSelectorRcpp)
information_gain(y = BRCA.rnaseq[, 1],
  x = BRCA.rnaseq[, -1]) -> FSelectorRcpp.weights
library(FSelector)
cutoff.k.percent(FSelectorRcpp.weights, 
   k = 0.01) -> FSelectorRcpp.features

The information gain algorithm returns a score for each attribute, and we will keep the top 1% of features. For that we use the simple cutoff.k.percent function from FSelector.
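For intuition, information gain is the drop in the entropy of the dependent variable once an explanatory variable is known: IG(Y, X) = H(Y) - H(Y | X). Below is a toy, hand-rolled sketch of that computation; the entropy() helper and the variables are made up for illustration and are not part of FSelectorRcpp (which, additionally, discretizes continuous features internally before computing entropies).

# hypothetical helper: Shannon entropy (in bits) of a discrete variable
entropy <- function(v) {
  p <- table(v) / length(v)
  p <- p[p > 0]                 # drop empty levels to avoid 0 * log(0)
  -sum(p * log2(p))
}
y <- factor(c("tumor", "tumor", "normal", "normal"))   # toy dependent variable
x <- factor(c("high",  "high",  "low",    "low"))      # toy explanatory variable
H_y  <- entropy(y)                                     # H(Y) = 1 bit
H_yx <- sum(sapply(split(y, x), entropy) *
            as.numeric(table(x)) / length(x))          # H(Y | X) = 0 here
H_y - H_yx                                             # information gain = 1 bit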

Venn Diagram

Finally we can visualize the differences between the algorithms' decisions. One way is a Venn diagram, which (after Wikipedia)

shows all possible logical relations between a finite collection of different sets.

named.list.for.venn <-
   list(` Boruta` = Boruta.features,
        `GLMnet     ` = glmnet.features,
        FSelectorRcpp = FSelectorRcpp.features)
library(VennDiagram)
venn.diagram(
    x = named.list.for.venn,
    filename = "venn.png",
    imagetype = "png",
    # col = "transparent",
    col = "black",
    lty = "dotted",
    lwd = 4,
    fill = c("cornflowerblue", "green", 
             "yellow"),
    alpha = 0.50,
    label.col = c("orange", "white",  
                  "orange", "white", 
      "darkblue", "white", "darkgreen"),
    cex = 2.5,
    fontfamily = "serif",
    fontface = "bold",
    cat.col = c("darkblue", "darkgreen", 
                "orange"),
    cat.cex = 2.5,
    cat.fontfamily = "serif"
    )

The effect of this code can be seen in the welcome graph of this post. We can see that only 1 variable was chosen by all 3 algorithms. The VennDiagram package allows us to find out the names of the intersections.

get.venn.partitions(named.list.for.venn) -> 
   partitions
# keep the rows of the partition table where
# all three algorithms selected the feature
apply(partitions[, c(1, 2, 3)],
      MARGIN = 1,
      function(configuration) {
         all(configuration == rep(TRUE, 3))
      }) -> all.algorithms
partitions[all.algorithms, ]
   Boruta GLMnet      FSelectorRcpp                           ..set..   ..values.. ..count..
1    TRUE        TRUE          TRUE  Boruta∩GLMnet     ∩FSelectorRcpp ABCA10|10349         1
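Alternatively, a quick base-R cross-check of the overlaps (a sketch; it assumes the three *.features vectors and named.list.for.venn defined above are still in the workspace):

# how many features each pair of methods agrees on
length(intersect(Boruta.features, glmnet.features))
length(intersect(Boruta.features, FSelectorRcpp.features))
length(intersect(glmnet.features, FSelectorRcpp.features))
# features selected by all three methods
Reduce(intersect, named.list.for.venn)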

Do you have any comments about the diagram, or about the surprising result that only 1 variable overlaps across all 3 algorithms? Maybe some of the algorithms could be improved? Is this due to the random nature of cv.glmnet or Boruta? Feel free to leave your comments in the Disqus panel below.
