# why clusterProfiler fails

August 6, 2014
By

(This article was first published on YGC » R, and kindly contributed to R-bloggers)

Recently, there are some comments said that sometimes clusterProfiler failed in KEGG enrichment analysis.

kaji331 compared cluserProfiler with GeneAnswers and found that clusterProfiler gives larger p values. The result forces me to test it.

? RSPLUS
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25  require(GeneAnswers) data('humanGeneInput') y <- geneAnswersBuilder(humanGeneInput, 'org.Hs.eg.db', categoryType='KEGG', testType='hyperG', pvalueT=0.1, geneExpressionProfile=humanExpr, verbose=FALSE) yy <- y@enrichmentInfo   require(clusterProfiler) x <- enrichKEGG(humanGeneInput$GeneID, pvalueCutoff=0.2, qvalueCutoff=0.2, minGSSize=1) xx <- summary(x) id <- sub("hsa", "", xx$ID) idx <- id %in% rownames(yy)   p.clusterProfiler <- xx$pvalue[idx] p.GeneAnswers <- yy[id[idx],]$"p value" > cor(p.clusterProfiler, p.GeneAnswers) [1] 0.9996165 > p.clusterProfiler - p.GeneAnswers [1] 1.029789e-04 -3.588252e-05 -4.623010e-05 1.079117e-04 -1.075746e-04 [6] -1.077398e-04 -3.774637e-04 -2.849278e-04 -4.197993e-04 7.588155e-04 [11] -3.702141e-04 2.314721e-03 -5.695641e-04 -5.940830e-04 -4.923697e-04 [16] -5.560738e-04 -5.884079e-04 2.011138e-03

Here, I used the dataset, humanGeneInput, provided by GeneAnswers. There are 19 pathways have p values below 0.1 by GeneAnswers and 18 pathways have p values below 0.1 by clusterProfiler. 18 of them are the same and p values are highly correlated with very small differences.

Enrichment analyses
are tested by geometric model, I believe all packages do it in the same way. The difference should be attributed to annotation file and I found background ratio are different.

? RSPLUS
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19  > xx[1,] ID Description GeneRatio BgRatio pvalue p.adjust hsa04110 hsa04110 Cell cycle 6/51 128/5894 0.00075424 0.02191169 qvalue geneID Count hsa04110 0.01683117 894/595/4175/6502/4173/9134 6 > yy[1,] genes in Category percent in the observed List percent in the genome 04110 6 0.1176471 0.02112436 fold of overrepresents odds ratio p value 04110 5.56926 6.441808 0.0006512612     > allEG = unique(unlist(as.list(org.Hs.egPATH2EG))) > length(get("04110", org.Hs.egPATH2EG)) [1] 124 > length(allEG) [1] 5870 > length(get("04110", org.Hs.egPATH2EG))/length(allEG) [1] 0.02112436

In the above example, the BgRatio of clusterProfiler is 128/5894, slightly larger than 0.02112436 calculated by GeneAnswers. This is why clusterProfiler gives a p value larger than GeneAnswers (In this particular pathway of this particular example, clusterProfiler not always produce larger p values).

The annotation file used by clusterProfiler is KEGG.db, while GeneAnswers uses org.Hs.eg.db. This is the reason of the difference and it's not a big deal.

It's not a good idea to give the background ratio in decimal format as GeneAnswers does. We have no idea the number of total genes in background with ratio like 0.02112436. Some software packages may use the gene number of the whole genome as background. If the total gene number is setting to 18000, the background ratio turn out to be 124/18000=0.006888889. With 0.006888889, it's hard to detect the background gene number was set to a larger number. In this scenario, all the p values will become very small, and many false positive were reported.

clusterProfiler and GeneAnswers estimate the total genes in the right way, but there are many packages may setting it to all the genes, no matter they have annotation or not. For instance, some software may setting the background to 20,000 when testing human genes.

I still remember in 2007, I used princeton's web server, http://go.princeton.edu/cgi-bin/GOTermFinder, to test GO enrichment. I found the background gene number was set to a very large number and all the p values were extremely small. When writing this blog post, I submit a job to this web server and found the gene number of human is setting to 45758. After I found this, I never use this web server again. The result is extremely "significant"! The web sever has an option, "Enter the number of products estimated for your organism". This is ridiculous. It's hard to estimate for ordinary users and it is impossible to estimate it correctly since we don't know the annotation file in used. The default number is too large, 45758 for human, result in many false positive.

Input a gene list as background is useful, for instance we may use all genes that are detectable in microarray experiment or all proteins identified by MS as background. This is an effective way to eliminate bias of using the whole genome. Use the whole genome is not mean to use the total number of gene in the genome as background. By default, KEGG enrichment analyses of clusterProfiler and GeneAnswers are using the whole genome, that is about 5900 genes for human, which are all genes have pathway annotation in the whole genome. Others without annotation are abandoned. This itself is a bias actually, but it's the realistic. If we include all genes no matter they have annotation or not, and setting 20000 or even more as background of human genes, definitely we will get significant result! But it is man-made significant.

As I replied to kaji331, if p values obtained by different software are different, trust the one produce larger p values. Many user will choose a software give a more "significant" result, but trust me, you are making type I error. Conservatively, always choose the one give you larger p-values if you know nothing about the calculation. Software that never "fail" is untrustable.