# Introduction to Feature selection for bioinformaticians using R, correlation matrix filters, PCA & backward selection

[This article was first published on **Computational Proteomics & Bioinformatics**, and kindly contributed to R-bloggers].

**There are two main approaches to feature selection:**

**Filter approaches**: you select the features first, then you use this subset to run classification or clustering algorithms.

**Embedded or Wrapper approaches**: a classification algorithm is applied to the raw dataset in order to identify the most relevant features.

**Correlation Matrix**

R Example: Removing features with more than 0.70 of Correlation.

`library(caret)`

#caret: the library that provides the findCorrelation function used below.

library(corrplot)

#corrplot: the library used to visualize the correlation matrix.

datMy <- read.table("data.csv", header = TRUE)

#read the tab-delimited file using the read.table function.

datMy.scale<- scale(datMy[2:ncol(datMy)],center=TRUE,scale=TRUE);

#scale all the features (from column 2, because column 1 is the outcome to predict)

corMatMy <- cor(datMy.scale)

#compute the correlation matrix

corrplot(corMatMy, order = "hclust")

#visualize the matrix, clustering features by correlation index.

Resulting Output:

Highly correlated matrix for 400 features.

After inspecting the matrix, we set the correlation threshold at 0.70.

`highlyCor <- findCorrelation(corMatMy, 0.70)`

#Apply the correlation filter at 0.70,

#then remove all the variables flagged as correlated above 0.70.

datMyFiltered.scale <- datMy.scale[,-highlyCor]

corMatMy <- cor(datMyFiltered.scale)

corrplot(corMatMy, order = "hclust")

Resulting Output:

Correlation matrix after filter.
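To make the filtering step more concrete, here is a minimal base-R sketch of a greedy correlation filter on synthetic data. It is not caret's `findCorrelation` implementation, only an illustration of the same idea: when a pair of features exceeds the threshold, drop the one with the larger mean absolute correlation. The data frame `toy`, its columns `f1` to `f5`, and the other variable names are made up for this example.

```r
# Synthetic data (assumption for illustration): f2 is nearly a copy of f1
set.seed(42)
n  <- 200
f1 <- rnorm(n)
toy <- data.frame(
  f1 = f1,
  f2 = f1 + rnorm(n, sd = 0.1),  # strongly correlated with f1
  f3 = rnorm(n),
  f4 = rnorm(n),
  f5 = rnorm(n)
)

# Absolute correlation matrix; zero the diagonal to ignore self-correlation
corMat <- abs(cor(toy))
diag(corMat) <- 0

threshold <- 0.70
dropped <- character(0)

# Greedy filter: while some pair exceeds the threshold, drop the member of
# that pair with the larger mean absolute correlation to the other features
while (max(corMat) > threshold) {
  idx   <- which(corMat == max(corMat), arr.ind = TRUE)[1, ]
  pair  <- colnames(corMat)[idx]
  worst <- pair[which.max(rowMeans(corMat[pair, , drop = FALSE]))]
  dropped <- c(dropped, worst)
  keep   <- setdiff(colnames(corMat), worst)
  corMat <- corMat[keep, keep, drop = FALSE]
}

toyFiltered <- toy[, setdiff(names(toy), dropped), drop = FALSE]
print(dropped)
print(names(toyFiltered))
```

Only one of `f1` and `f2` should survive the filter, since the remaining features were generated independently.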

**Using PCA**

**PCA in R**

R Example: PCA function using FactoMineR for 400 features & 5 PCs

require(FactoMineR)

# PCA with function PCA

datMy <- read.table("data.csv", header = TRUE)

#read the tab-delimited file using the read.table function.

pca <- PCA(datMy, scale.unit=TRUE, ncp=5, graph=TRUE)

#scale.unit=TRUE scales all the features; ncp is the number of dimensions kept in the results (5 by default)

dimdesc(pca)

#This line lists, for each PC, the variables most strongly linked to it. It is very useful when you have many variables.
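For readers without FactoMineR installed, the same kind of inspection can be sketched with `prcomp` from base R. This swaps in a different function from the one used above, and the synthetic data frame `toy` (columns `a` to `d`) is an assumption made for illustration: `a` and `b` share a latent signal, while `c` and `d` are noise.

```r
# Synthetic data (assumption for illustration): a and b share a latent factor
set.seed(1)
n <- 100
latent <- rnorm(n)
toy <- data.frame(
  a = latent + rnorm(n, sd = 0.2),
  b = latent + rnorm(n, sd = 0.2),
  c = rnorm(n),
  d = rnorm(n)
)

# PCA on centered, unit-variance features (comparable to scale.unit=TRUE)
pcaToy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
varExplained <- pcaToy$sdev^2 / sum(pcaToy$sdev^2)
print(round(varExplained, 3))

# Variables ranked by absolute loading on the first component,
# a rough base-R counterpart to reading the dimdesc output for PC1
loadingsPC1 <- sort(abs(pcaToy$rotation[, "PC1"]), decreasing = TRUE)
print(names(loadingsPC1)[1:2])
```

The two variables with the largest loadings on PC1 should be the correlated pair, which is the kind of redundancy PCA is meant to expose.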

**Wrapper Approaches with Backwards Selection**

**Using Built-in Backward Selection**

The rfe function in caret takes, among others, the following arguments:

- x, a matrix or data frame of predictor variables
- y, a vector (numeric or factor) of outcomes
- sizes, an integer vector for the specific subset sizes that should be tested (which need not include ncol(x))
- rfeControl, a list of options that can be used to specify the model and the methods for prediction, ranking etc.

For a specific model, a set of functions must be specified in rfeControl$functions. There are a number of pre-defined sets of functions for several models, including: linear regression (in the object lmFuncs), random forests (rfFuncs), naive Bayes (nbFuncs), bagged trees (treebagFuncs) and functions that can be used with caret’s train function (caretFuncs).

R example: Selecting features using backward selection and the caret package

`library(caret);`

#load caret library

data_features<-as.matrix(read.table("data-features.csv",sep="\t", header=TRUE));

#load data features

data_class<-as.matrix(read.table('data.csv', header=TRUE));

#load data classes

data_features<- scale(data_features, center=TRUE, scale=TRUE);

#scale data features

inTrain <- createDataPartition(data_class, p = 3/4, list = FALSE);

#Divide the dataset into train and test sets

#Create the Training Dataset for Descriptors

trainDescr <- data_features[inTrain,];

# Create the Testing dataset for Descriptors

testDescr <- data_features[-inTrain,];

trainClass <- data_class[inTrain];

testClass <- data_class[-inTrain];

descrCorr <- cor(trainDescr);

highCorr <- findCorrelation(descrCorr, 0.70);

trainDescr <- trainDescr[, -highCorr];

testDescr <- testDescr[, -highCorr];

# The lines above apply a correlation matrix analysis to remove the redundant features before the backward selection

svmProfile <- rfe(x=trainDescr, y = trainClass, sizes = c(1:5), rfeControl= rfeControl(functions = caretFuncs,number = 2),method = "svmRadial",fit = FALSE);

#caret function: rfe performs the backward selection, sizes lists the feature subset sizes to test, and method = "svmRadial" selects a support vector machine with a radial kernel as the model.
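To show what backward selection does conceptually, here is a simplified base-R sketch. It is not caret's `rfe` algorithm: it uses a linear model instead of an SVM, ranks features by absolute t-statistic, and scores each subset on a single hold-out split rather than by resampling. The synthetic data and all names (`x`, `y`, `f1` to `f6`, `results`, `subsets`) are assumptions made for illustration.

```r
# Synthetic data (assumption for illustration): only f1 and f2 carry signal
set.seed(7)
n <- 150
x <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(x) <- paste0("f", 1:6)
y <- 3 * x$f1 - 2 * x$f2 + rnorm(n, sd = 0.5)

# Single 75/25 hold-out split (caret would use resampling instead)
trainIdx <- sample(n, size = round(0.75 * n))
train <- data.frame(y = y[trainIdx], x[trainIdx, ])
test  <- data.frame(y = y[-trainIdx], x[-trainIdx, ])

features <- names(x)
results  <- data.frame(size = integer(0), rmse = numeric(0))
subsets  <- list()

# Backward elimination: fit, score on the hold-out set, drop the weakest feature
while (length(features) >= 1) {
  fit  <- lm(reformulate(features, response = "y"), data = train)
  pred <- predict(fit, newdata = test)
  results <- rbind(results,
                   data.frame(size = length(features),
                              rmse = sqrt(mean((test$y - pred)^2))))
  subsets[[as.character(length(features))]] <- features
  # Rank the current features by |t| and remove the least important one
  tvals    <- abs(summary(fit)$coefficients[features, "t value"])
  features <- features[-which.min(tvals)]
}

# Keep the subset size with the lowest hold-out error
best <- subsets[[as.character(results$size[which.min(results$rmse)])]]
print(results)
print(best)
```

The `results` table plays a role similar to the performance profile that `rfe` reports for each value in `sizes`, although here it comes from one split only.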

Finally, I would like to recommend an excellent review about Feature Selection in Bioinformatics.
