I get a lot of requests in the core about running a “pathway analysis.” Someone ran a handful of gene expression arrays, or better yet, ran an RNA-seq experiment (with replicates!). These, and many other kinds of high-throughput assays (GWAS, ChIP-seq, etc.) result in a list of genes and some associated p-value, fold change, or other statistic.
Here’s some R code to download public data from a study on susceptibility to colorectal cancer. I realize that this is an oversimplified design – that’s not the focus here. You can read more about proper design and contrast matrices in the limma vignette. You can read more about the study and the samples in the paper or on the GEO page.
A thoughtful and thorough analysis doesn’t end with a list of genes and P-values. You spent way too much money on an experiment just to end up with a list of genes and p-values at some arbitrary cutoff. I often see investigators going down the list and cherry-picking genes that they think are important or might play a role, and trying to create a story involving those cherry-picked genes. I believe it’s better to be more agnostic, taking a data-driven approach to framing results in a functional context.
How do you do this? This review recently published in PLoS Comp Bio is an excellent place to start:
Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. PLoS Computational Biology, 8(2), e1002375. doi:10.1371/journal.pcbi.1002375
I wrote an evaluation of this paper at Faculty of 1000 – here’s an expanded version. The paper gives an overview of the available methods for pathway analysis, and categorizes these methods into three functional classes: over-representation analysis, functional class scoring, and topology-based methods. The paper discusses the advantages and limitations of each approach, as well as future development directions.
Over-representation analysis is what you would typically do with something like gene ontology annotations. The Gene Ontology (GO) project is an effort to describe gene products from all organisms using a consistent language (ontologies are formal representations of knowledge domains; GO does this with cell biology). A biological process is typically made up of a group of genes, as opposed to an individual gene alone. The idea of over-representation analysis is that if a biological process is abnormal in a given study, the co-functioning genes are more likely to be selected as a relevant group by the high-throughput screening technologies. For example, if 10% of the your genes selected by a microarray experiment are kinases, as opposed to 1% of the genes in the whole human genome (this is the gene population background) that are kinases, then you have significant over-representation in your genes for kinases. There are many, many tools for doing over-representation analysis using gene ontology terms, but they all rely on the same idea, typically using a hypergeometric test. You can create directed acyclic graphs like the one below, color coding nodes based on the statistical significance of the term overrepresentation in your gene list. See the paper above for a discussion of limitations of over-representation analysis, and also check out this paper on the use and misuse of gene ontology annotations.
Functional Class Scoring. One of the biggest limitations of over-representation analysis is that you have to arbitrarily decide what you call significant and what you don’t. If you use an FDR threshold of 0.05 and fold change cutoff of 2, you’ll lose all those genes with a fold change of 1.95 or FDR 0.051, which are arguably just as important as the genes just under the arbitrary cutoff. Further, over-representation analyses only use the number of genes and ignore how strongly those genes are associated with whatever you’re studying. Pathway analysis methods that the above review classifies as “Functional Class Scoring” methods use all the genes you have as well as their association statistics (e.g. fold change), and compute a running enrichment score for gene groupings (based on some functional knowledge like gene ontology or KEGG pathways). Gene Set Enrichment Analysis is one of the more popular approaches in this category. The plot below shows the kind of results you get from GSEA. From the GSEA documentation: “The primary result of the gene set enrichment analysis is the enrichment score (ES), which reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes. GSEA calculates the ES by walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not. The magnitude of the increment depends on the correlation of the gene with the phenotype. The top portion of the plot shows the running ES for the gene set as the analysis walks down the ranked list. The score at the peak of the plot (the score furthest from 0.0) is the ES for the gene set. Gene sets with a distinct peak at the beginning (such as the one shown here) or end of the ranked list are generally the most interesting.” So this shows that you have significant enrichment in the cancer phenotype (labeled “LL”) for genes involved in the KEGG Spliceosome pathway.
Topology based methods. The above methods don’t take into account how genes interact with each other (e.g. activation, inhibition, phosphorylation, direct/indirect interaction, ubiquitination, etc). Pathway topology methods this extra information in computing pathway-level statistics. I’ve recently started using the bioconductor SPIA package (Signaling Pathway Impact Analysis), which integrates lists of differentially expressed genes, their fold changes, and pathway topology, to identify pathways associated with condition you’re studying. The code below runs SPIA on the analysis we did above. Make sure you successfully loaded all the R packages in the code above.
You can read more about the results SPIA is giving you in their paper and vignette. Briefly, each pathway is represented by a point, and the x and y axes are showing increasing evidence for the involvement of this pathway in your disease (the total “perturbation” in the pathway based on your gene expression data).
It’s worth noting that all of the above mentioned methods have limitations. We don’t fully understand biology, and our understanding of molecular networks and signaling pathways is still very low-resolution. We also don’t have information about how different isoforms have different effects – which is something we’ll get from RNA-seq experiments. Annotations are often incorrect and inaccurate, and we don’t have very much cell-type specific or dynamic information about these pathways. Finally, the analysis of large gene lists with the current enrichment tools is still more of an exploratory data-mining procedure rather than a pure statistical solution and analytical endpoint.
Despite the limitations, the kinds of methods discussed here and reviewed elsewhere (see below) are still very useful for extracting biological meaning and framing results from high-throughput experiments in a functional context. The bioinformatics core here at UVA can do any of the kinds of analyses discussed above. Please fill out a consultation request form and I’ll be happy to talk to you about what kinds of insight you may be able to glean from your existing data or in an experiment you’re planning.
Other useful literature on the subject: