# Introduction to Feature selection for bioinformaticians using R, correlation matrix filters, PCA & backward selection

This article was first published on **Computational Proteomics & Bioinformatics**, and kindly contributed to R-bloggers.


**There are two main approaches to feature selection:**

**Filter approaches**: you select the features first, then run classification, clustering, or other algorithms on that subset.

**Embedded or wrapper approaches**: a classification algorithm is applied to the raw dataset in order to identify the most relevant features.

**Correlation Matrix**

R example: removing features with a correlation above 0.70.

    library(corrplot)  # corrplot: visualizes a correlation matrix

    # Read the data file with read.table().
    datMy <- read.table("data.csv", header = TRUE)

    # Scale all features (from column 2 on, because column 1 is the outcome to predict).
    datMy.scale <- scale(datMy[2:ncol(datMy)], center = TRUE, scale = TRUE)

    # Compute the correlation matrix.
    corMatMy <- cor(datMy.scale)

    # Visualize the matrix, clustering features by correlation.
    corrplot(corMatMy, order = "hclust")

Resulting Output:

Highly correlated matrix for 400 features.

After inspecting the matrix, we set the correlation threshold at 0.70.

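    library(caret)  # findCorrelation() is provided by caret

    # Apply the correlation filter at 0.70: flag the variables whose pairwise correlation exceeds 0.70.
    highlyCor <- findCorrelation(corMatMy, 0.70)

    # Remove the flagged variables, then recompute and plot the correlation matrix.
    datMyFiltered.scale <- datMy.scale[, -highlyCor]
    corMatMy <- cor(datMyFiltered.scale)
    corrplot(corMatMy, order = "hclust")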

Resulting Output:

Correlation matrix after filtering.
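
To make the idea behind the filter concrete, here is a minimal base-R sketch on synthetic data. It is not caret's `findCorrelation` implementation; `drop_correlated` is a hypothetical helper written for this illustration, and the variables `x1`, `x2`, `x3` are made up. While any pair of features is correlated above the cutoff, it drops the member of that pair with the larger mean absolute correlation to the remaining features.

```r
# Synthetic data: x2 is almost a copy of x1, x3 is independent noise.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # strongly correlated with x1
x3 <- rnorm(n)                  # unrelated to both
dat <- scale(cbind(x1 = x1, x2 = x2, x3 = x3))

corMat <- abs(cor(dat))
diag(corMat) <- 0               # ignore self-correlation

# Hypothetical helper (not part of caret): while some pair exceeds the
# cutoff, drop the member of the worst pair that has the larger mean
# absolute correlation with the other features.
drop_correlated <- function(corMat, cutoff = 0.70) {
  dropped <- character(0)
  while (ncol(corMat) > 1 && max(corMat) > cutoff) {
    pair    <- which(corMat == max(corMat), arr.ind = TRUE)[1, ]
    means   <- colMeans(corMat[, pair, drop = FALSE])
    victim  <- colnames(corMat)[pair[which.max(means)]]
    dropped <- c(dropped, victim)
    keep    <- colnames(corMat) != victim
    corMat  <- corMat[keep, keep, drop = FALSE]
  }
  dropped
}

toDrop   <- drop_correlated(corMat, cutoff = 0.70)

# Select the surviving columns by name rather than by negative index.
filtered <- dat[, setdiff(colnames(dat), toDrop), drop = FALSE]
```

Selecting columns by name also sidesteps a pitfall of negative indexing: if no feature is flagged, an expression such as `m[, -integer(0)]` returns a matrix with zero columns rather than the full matrix.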

**Using PCA**

**PCA in R**

R Example: PCA function using FactoMineR for 400 features & 5 PCs

    library(FactoMineR)  # provides the PCA() function

    # Read the data file with read.table().
    datMy <- read.table("data.csv", header = TRUE)

    # scale.unit = TRUE scales all features; ncp is the number of dimensions kept in the results (5 by default).
    pca <- PCA(datMy, scale.unit = TRUE, ncp = 5, graph = TRUE)

    # List the variables most strongly linked to each PC; very useful when you have many variables.
    dimdesc(pca)
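
For readers who want to see what linking variables to a PC looks like without installing FactoMineR, here is a small sketch that swaps in base R's `prcomp`. The synthetic features `f1` to `f4` are assumptions for illustration, and ranking by absolute loading is a simpler stand-in for what `dimdesc` reports, not a reproduction of it.

```r
# Synthetic data: f1 and f2 share a latent signal, f3 and f4 are noise.
set.seed(1)
n <- 100
latent <- rnorm(n)
dat <- data.frame(
  f1 = latent + rnorm(n, sd = 0.2),
  f2 = latent + rnorm(n, sd = 0.2),
  f3 = rnorm(n),
  f4 = rnorm(n)
)

# PCA on centered, unit-variance features (the analogue of scale.unit = TRUE).
pca <- prcomp(dat, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
varExplained <- pca$sdev^2 / sum(pca$sdev^2)

# Rank features by the absolute value of their loading on PC1.
loadingsPC1 <- sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE)
topFeatures <- names(loadingsPC1)[1:2]
```

Here `varExplained` helps decide how many components to keep, and `topFeatures` names the two features that contribute most to the first component.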

**Wrapper Approaches with Backwards Selection**

**Using Built-in Backward Selection**

In the caret package, backward selection is performed by the `rfe` function. Its main arguments are:

- x, a matrix or data frame of predictor variables
- y, a vector (numeric or factor) of outcomes
- sizes, an integer vector for the specific subset sizes that should be tested (which need not include ncol(x))
- rfeControl, a list of options that can be used to specify the model and the methods for prediction, ranking etc.

For a specific model, a set of functions must be specified in rfeControl$functions. There are a number of pre-defined sets of functions for several models, including: linear regression (in the object lmFuncs), random forests (rfFuncs), naive Bayes (nbFuncs), bagged trees (treebagFuncs) and functions that can be used with caret's train function (caretFuncs).
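
As an illustration of how these arguments fit together, the sketch below runs `rfe` with the pre-defined `lmFuncs` set on synthetic regression data. The data, the variable names `x1` to `x6`, and the choice of 5-fold cross-validation are assumptions made for this example only.

```r
library(caret)

# Synthetic regression data: only x1 and x2 drive the outcome.
set.seed(123)
n <- 200
x <- data.frame(matrix(rnorm(n * 6), ncol = 6))
colnames(x) <- paste0("x", 1:6)
y <- 3 * x$x1 - 2 * x$x2 + rnorm(n, sd = 0.5)

# Use the linear-regression function set with 5-fold cross-validation.
ctrl <- rfeControl(functions = lmFuncs, method = "cv", number = 5)

# Test subsets of 1 to 4 features.
lmProfile <- rfe(x = x, y = y, sizes = 1:4, rfeControl = ctrl)

lmProfile$results                 # resampled performance per subset size
selected <- predictors(lmProfile) # names of the features that were retained
```

Printing `lmProfile` gives a summary table, and `selected` can then be used to subset both the training and the test data.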

R example: Selecting features using backward selection and the caret package

    library(caret)  # load the caret library

    # Load the feature matrix and the class labels.
    data_features <- as.matrix(read.table("data-features.csv", sep = "\t", header = TRUE))
    data_class <- as.matrix(read.table("data.csv", header = TRUE))

    # Scale the features.
    data_features <- scale(data_features, center = TRUE, scale = TRUE)

    # Split the dataset into training (3/4) and test sets.
    inTrain <- createDataPartition(data_class, p = 3/4, list = FALSE)
    trainDescr <- data_features[inTrain, ]   # training descriptors
    testDescr  <- data_features[-inTrain, ]  # test descriptors
    trainClass <- data_class[inTrain]
    testClass  <- data_class[-inTrain]

    # Optional: correlation filter to remove redundant features before the backward selection.
    descrCorr <- cor(trainDescr)
    highCorr <- findCorrelation(descrCorr, 0.70)
    trainDescr <- trainDescr[, -highCorr]
    testDescr  <- testDescr[, -highCorr]

    # rfe() performs the backward selection: sizes gives the feature-subset sizes to test,
    # and method = "svmRadial" makes caretFuncs fit a support vector machine with a radial kernel.
    svmProfile <- rfe(x = trainDescr, y = trainClass,
                      sizes = c(1:5),
                      rfeControl = rfeControl(functions = caretFuncs, number = 2),
                      method = "svmRadial", fit = FALSE)

Finally, I would like to recommend an excellent review of feature selection in bioinformatics.
