Consensus clustering in R

[This article was first published on R – intobioinformatics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The logic behind the Monti consensus clustering algorithm is that in the face of resampling the ideal clusters should be stable, thus any pair of samples should either always or never cluster together. We can use this principle to infer the optimal number of clusters (K). This works by examining cluster stability from K=2 to K=10 during resampling. To do this for every K, we calculate the consensus rates for all sample pairs, which is the fraction of times a pair of samples cluster together. This gives us a consensus matrix for every K, which is symmetrical and in the range 0 to 1. A matrix entirely of 1s and 0s will represent perfect stability for a given K. We can simply compare stability of these matrices from K=2 to K=10 to determine the optimal K.

This is a very nice concept and the consensus matrix gives us a helpful visualisation tool. However, a major problem with this approach is that on random data the consensus matrix converges towards a matrix of perfect stability simply by chance. A second problem, is one general to cluster analysis, and is that we do not test the null hypothesis K=1. We developed M3C to correct for the internal bias of consensus clustering by using a Monte Carlo simulation, M3C can also test the null hypothesis K=1.

We have recently had the M3C paper published (https://www.nature.com/articles/s41598-020-58766-1), and this blog post is to demonstrate a new objective function included in the M3C algorithm (https://bioconductor.org/packages/devel/bioc/html/M3C.html) which did not get the chance the get into the paper. I’m going to do this without math notation because this is pretty informal.

Essentially we are in fact dealing with probabilities in the consensus matrix, where each element is the probability of samples i and j clustering together, given they were resampled together. Seeing the consensus matrix in this way, naturally leads to entropy as a very appropiate function to describe the stability (or uncertainty) of each consensus matrix. We simply can calculate entropy (binary information entropy) of each matrix and minimise it to determine the best K. Entropy, in this context, is the degree of surprisal, and we want the minimum surprise, if these are indeed good clusters.

Let’s do some analyses with the upgraded version of M3C on some glioblastoma cancer data (GBM).

## load package and test data
library(M3C)
dim(mydata) # we have 1740 features and 50 samples (gbm data)

## test code
test <- M3C(mydata)

Below is the CDF plot for every consensus matrix from 2 … 10, and it describes the distribution of numerical values in each matrix. A flat line would indicate all 1s and 0s. We can see a problem here because as K increases the line is converging towards a matrix of perfect stability. This also happens on random data with one cluster.

I think the CDF is a nice visualisation tool, but entropy does not require a CDF to work. Some metrics like the PAC score and delta k are calculated from the CDF, but I think it is more parsimonious to work directly with the consensus matrix probabilities. None of these methods correct for the convergence at chance problem of this method, for that we need to do Monte Carlo.

CDF

This next plot is entropy for every consensus matrix from 2 … 10.

Again, we can see a problem here because, although there is an elbow at K=4, the most certain or stable consensus matrix is for K=10. This is because even on random data, the algorithm generates consensus matrices that become more stable as K increases (see our paper for more details on this).

S

Remember, information entropy is describing uncertainty in the system, for the best clustering we want to minimise uncertainty (or surprisal) as much as possible despite perturbation. However, the random expectation must be taken into account.

The next plot is the RCSI, which corrects entropy for the chance expectation using a reference distribution derived using a Monte Carlo simulation and calculates 95% confidence intervals. We can see the bias has been corrected, and now we are just looking for a maximum value instead of elbows or last values before the floor, or other subjective approaches. This is nice.

RCSI

Four clusters is well supported by the histological data in which there are four subtypes for GBM, so I am satisfied with the decision made here. M3C also calculates p values for each K to test the null hypothesis K=1, see our vignette for more details.

Here are some general pointers on M3C and cluster analysis on Omic data.

  • When deciding K using M3C it is best to examine entropy, the RCSI, and the p values.
  • For more accurate p values, on some datasets, it is good to increase the Monte Carlo simulation number, the Monte Carlo simulation gives us the null distribution. Default is 25, this could be increased to 100.
  • Inner replications are set to 100, this is usually fine, however, on some datasets it is better to increase this to 250.
  • It is preferable to run SigClust on all pairs of clusters to check the significance.
  • It is preferable to run another method to support the K decision, we have found CLEST works well, a classic algorithm with some similarities to consensus clustering. This is implemented in the RSKC package.
  • The silhouette width does not work well on high dimensional cancer data in our experience and that of Șenbabaoğlu described in the 2014 paper. A possible alternative is to run t-SNE first before the silhouette width on the data with reduced dimensionality. Estimating K on complex Omic data is a hard problem to do well because of the high dimensionality and noise.

For demonstration of M3C on real TCGA data, see the supplementary information of this paper, in Table 1. The p value method was used in select K when running M3C here.

https://doi.org/10.1093/bioinformatics/btz704

For demonstration of M3C on simulated data and more TCGA data, this is the main paper that includes the method, but we used the PAC score instead of entropy as the objective function to minimise to find the best K. We still include the PAC score in the Bioconductor software because it is pretty good.

https://doi.org/10.1038/s41598-020-58766-1

Thanks for reading.

 

 

 

To leave a comment for the author, please follow the link and comment on their blog: R – intobioinformatics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)