# Downstream Bioinformatics Analysis of Omics Data with edgeR

**R – Myscape**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## Summary

A common task when working with transcriptomic data is the identification of differentially expressed (DE) genes or tags between groups. In this tutorial participants will learn how to perform biostatistical analysis with edgeR, including pairwise and analysis of variance (ANOVA) like comparisons to identify significantly DE genes.

### Questions

- What data we will be working with?
- What types of comparisons can we make?
- How do I prepare expression data for pairwise comparisons?
- How do I prepare expression data for ANOVA like comparisons?
- What informative plots can I make to explore expression data and analysis results?

### Objectives

- Become comfortable working with quantified transcriptomic data.
- Learn what comparisons are possible based on experimental design.
- Be able to perform pairwise comparisons with edgeR.
- Be able to perform ANOVA like comparisons with edgeR.
- Discover different types of plots for exploring expression data and analysis results.

### Related

The analysis described in this tutorial is downstream of my previous tutorial on Bioinformatics Analysis of Omics Data with the Shell & R, in which I provide a walkthrough of a simple bioinformatics workflow for quantifying transcriptomic data.

## The Data

In this workshop we will be using data from a study that investigates the effects of ultraviolet radiation (UVR) on the larvae of the red flour beetle, titled “Digital gene expression profiling in larvae of *Tribolium castaneum* at different periods post UV-B exposure”.

UVR is common to many environments and it varies widely in its intensity and composition, such as differing ratios of UV-A and UV-B radiation. The different forms of UVR have distinct, and frequently harmful effects on organisms and biological systems.

For example, the following diagram depicts the effects of different forms of UVR on skin.

There are three primary methods for organisms to defend against harmful levels of UVR:

- avoidance
- photoprotection
- repair

Since the red flour beetle (*Tribolium castaneum*) spends much of its life cycle in infested grains, the larvae does not typically experience high levels of UVR. Furthermore, the larvae of the red flour beetle is light in pigmentation and does not appear to employ photoprotective pigments (e.g., melanin).

So, how do red flour beetle larvae respond to UVR exposure?

## Study Design & Bioinformatics Workflow

In their study, the authors investigated the defense strategy against UV-B radiation in red flour beetle larvae. The following graphical abstract illustrates how the transcriptomic (expression) data was generated, as well as the bioinformatics workflow of the authors.

The following steps were performed by the authors in their bioinformatics analysis, which is described in the “2.4. Bioinformatic analysis” section of the paper:

- Reads mapping to the reference genome – TopHat
- Quantification of gene expression level and differential expression analysis – RSEM & DESeq2
- Gene ontology and Kyoto encyclopedia of genes and genomes pathway enrichment analysis of differentially expressed genes – GOseq & KOBAS
- Validation by qRT-PCR

This tutorial is part of a series on bioinformatics and biostatistics analysis, and covers step 6 of the following workflow:

- data collection – SRA toolkit
- quality control – fastqc
- genomic data format conversion – gffread
- transcriptomic data alignment – hisat2
- quantify transcript alignments – featureCounts
- differential expression analysis – edgeR
- Pairwise comparisons
- ANOVA comparisons

Note that steps 1-5 of the above workflow are described in my previous post on Bioinformatics Analysis of Omics Data with the Shell & R.

In this tutorial we will be performing differential expression (DE) analysis using edgeR, rather than DESeq2. The DESeq2 and edgeR packages perform similarly and have the same underlying hypothesis that most genes are not DE, but use different normalization methods that may change the number of detected DE genes.

Further note that we will begin with the simpler pairwise comparisons, then move on to analysis of variance (ANOVA). However, it is typical to begin a statistical analysis with the ANOVA. After identifying any significant main effect associated with an explanatory variable, then pairwise comparisons are used to identify the groups that are different from one another.

## Biostatistics Analysis

Pairwise comparisons are a good place to start with differential expression analysis of transcriptomic data sets. Exact tests are the classic edgeR approach to making pairwise comparisons between groups of samples in a data set.

We can use generalized linear models (GLMs) to detect differentially expressed genes associated with an explanatory variable (factor). These are another classic edgeR method for analyzing RNA sequencing expression data that parallels the ANOVA method.

In contrast to exact tests, GLMs allow for more general comparisons. Make sure to check out the edgeR user guide for detailed information about each of the commands we will be using.

### Experimental Design

The experimental design of the expression data we are using in this workshop is as follows.

sample | treatment | hours |

SRR8288561 | cntrl | 4h |

SRR8288562 | cntrl | 4h |

SRR8288563 | cntrl | 4h |

SRR8288564 | treat | 4h |

SRR8288557 | treat | 4h |

SRR8288560 | treat | 4h |

SRR8288558 | cntrl | 24h |

SRR8288567 | cntrl | 24h |

SRR8288568 | cntrl | 24h |

SRR8288559 | treat | 24h |

SRR8288565 | treat | 24h |

SRR8288566 | treat | 24h |

Notice that there are two factors of **treatment** and **hours** for each sample, and within each of these factors are two levels. The **treatment** factor has the levels of **cntrl** and **treat**, and the **hours** factors has the levels of **4h** and **24h**.

So, we are able to group our data using the different levels of each factor to specify subsets of samples (groups).

### General Setup

With our transcriptomic sequence (expression) data aligned and quantified (counted), we can setup to perform some biostatistical analysis of the data.

The comma separated variable (CSV) file named TriboliumCounts.csv with the quantified transcript sequence data can be found -> here <-

To make the next steps more simple, set your working directory using the setwd command in R as follows. The TriboliumCounts.csv that we will be using should be accessible from, or located in this directory.

setwd("/YOUR/FILE/PATH")

### Packages

Make sure to install the BiocManager package, if you do not already have it installed. Then you can use **BiocManager **to install the **edgeR** package.

if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("edgeR")

We will also be using the ggplot2, **ghibli**, and ggVennDiagram packages, which may be installed as follows.

install.packages("ggplot2") install.packages("ghibli") install.packages("ggVennDiagram")

Remember to load the **edgeR**, **ggplot2**, **ghibli**, and **ggVennDiagram **libraries before proceeding.

library(edgeR) library(ggplot2) library(ghibli) library(ggVennDiagram)

### Plotting Palettes

Before moving on with the analysis, let’s also setup a couple of vectors with fun color palettes for plotting later.

First, the **par** command allows us to change the graphical parameters and create a single plot with all of the palettes displayed across 9 rows and 3 columns.

par(mfrow=c(9,3))

Next, we can view the available color palettes of the loaded **ghibli** library with a **for** loop, and the **names** and **print** commands like so.

for(i in names(ghibli_palettes)) print(ghibli_palette(i))

The **dev.off** command is needed to close the plot and return the display to the default graphical parameters.

dev.off()

The PonyoMedium palette has a nice range of cool (e.g., blues) and warm (e.g., reds) colors for plotting, so let’s retrieve the vector of colors associated with this palette using the ghibli_palette command.

ghibli_colors <- ghibli_palette("PonyoMedium", type = "discrete")

Note that we are using the **“discrete”** palette type here since we will only need a small range of colors for plotting.

We can view the selected range of palette colors by simply entering in the name of the vector as follows.

ghibli_colors

Next, let’s specify a vector of colors in a low, medium, and high range for use with our volcano plots later.

ghibli_subset <- c(ghibli_colors[3], ghibli_colors[6], ghibli_colors[4])

### Count Data

Let’s use the **read.csv** command to import the file named TriboliumCounts.csv with the quantified (counted) transcript sequence data.

tribolium_counts <- read.csv("TriboliumCounts.csv", row.names="X")

Notice that we specified **row.names=”X”** in the **read.csv** command since the CSV file of quantified transcript sequence data has just a pair of empty quotes **“”** in the first cell, and R removes certain special characters (e.g., quotes) when the data is imported.

## Pairwise Comparisons

In edgeR the exact test for the negative binomial distribution strongly parallels the Fisher’s exact test, and is applicable to experiments with a single factor. The Fisher’s exact test is used to determine if there is a significant association between two categorical variables.

Note that the exact test in edgeR is based on the qCML methods for experiments with a single factor. By knowing the conditional distribution for the sum of counts in a group, we can compute exact p-values by summing over all sums of counts that have a probability less than the probability under the null hypothesis of the observed sum of counts.

### Pairwise Design

To perform pairwise analysis we need to specify the factors (treatment and hours) and levels associated with each of the samples using the **factor** command. Recall that the treatment factor has the levels of cntrl and treat, and the hours factor has the levels of 4h and 24h.

group <- factor(c(rep("cntrl_4h",3), rep("treat_4h",3), rep("cntrl_24h",3), rep("treat_24h",3)))

Note that the vector of factors named **group **needs to be arranged in the same order as the samples in the columns of the **tribolium_counts **data frame.

Now that we have the factors associated with each sample specified, we can use the **DGEList** command created a object with the table of counts.

list <- DGEList(counts=tribolium_counts,group=group)

Note that in the table of counts rows are features and columns are samples, including a group indicator for each column.

### Normalization

After normalization of raw counts we will perform gene wise exact tests, which are pairwise tests of DE for each gene. Exact tests are used to identify differences in the means between two groups of gene counts.

First, let’s use the **barplot **command of base R to quickly plot and view the sizes of the sequencing libraries as follows.

barplot(list$samples$lib.size*1e-6, names=1:12, ylab="Library size (millions)")

Note that the default normalization method of edgeR is the Trimmed Mean of *M*-values (TMM), and TMM normalization factors do not take into account library sizes.

We need to filter the table of raw gene counts using the **filterByExpr **command before we proceed with normalization.

keep <- filterByExpr(list)

Let’s use the **table** command to view the number of filtered genes.

table(keep)

Now we should subset our data frame to remove genes that the previous filtering identified as not expressed in either experimental condition.

list <- list[keep, , keep.lib.sizes=FALSE]

We’ll use the **calcNormFactors** command next to calculate scaling factors to convert raw library sizes into effective by finding a set of scaling factors for the library sizes that minimizes the log-fold changes between the samples for most genes.

list <- calcNormFactors(list)

Next, we’ll compute counts per million (CPM) using normalized library sizes with the **cpm** command.

normList <- cpm(list, normalized.lib.sizes=TRUE)

### Data Exploration

With the filtered count data we can start to explore the data using some common edgeR plots, such as multi-dimensional scaling (MDS) and heatmaps.

#### MDS Plot

With the filtered count data we can also create a multidimensional scaling (MDS) plot. The distances between samples in the MDS plot approximate the expression differences.

Make sure to check out the documentation for the plotMDS command to learn more about the different options. By default the expression differences are calculated as the the average of the largest (leading) absolute log-fold changes between each pair of samples.

First, we need to create a vector with symbol (shape) numbers to specify shapes based on the hours factor of each sample in the MDS plot. Remember that there are two levels to the hours factor, so we need two different shapes.

points <- c(0,1,15,16)

Note that the square symbol is specified using the numbers 0 and 15, where 0 is for open squares and 15 is for filled in squares. The circle symbol is specified using the numbers 1 and 16, where 1 is for open circles and 16 is for filled in circles.

We also need a vector of colors for coloring samples based on factor in the MDS plot. The **rep** command is used to twice replicate the vector of color values (codes) created in the nested **c** command. This results in a vector of four color codes, two teal and two yellow.

colors <- rep(c(ghibli_colors[3], ghibli_colors[6]), 2)

Note that the teal color is obtained from element 3 of the **ghibli_colors **vector we defined earlier. The yellow color is obtained from element 6 of the **ghibli_colors **vector.

Next, the **par** command is needed again here to adjust the graphical parameters and enlarge the plot display

par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)

Now we can create a detailed MDS plot of the samples as follows.

plotMDS(list, col=colors[group], pch=points[group])

Let’s add a legend to the MDS plot describing the points. The **inset** argument of the **legend** command is used to move the plot legend outside of the plotting area.

legend("topright", inset=c(-0.4,0), legend=levels(group), pch=points, col=colors)

Note that the **inset** argument places the legend in different positions depending on the display, such as the RStudio plot window in comparison with the **jpeg** command output.

Notice that the point shapes (symbols) are either square for 24h or circle for 4h samples. The control samples have open circles and squares, while treatment samples have filled in circles and squares. Also, notice that the points are colored teal for 24h and yellow for 4h samples.

The **dev.off** command is needed again here to close the plot and return the display to the default graphical parameters.

dev.off()

### Heatmap

As a form of unsupervised clustering, let’s create a heatmap of individual RNA sequencing (transcriptomic) samples using moderated log counts per million (CPM).

First, we need to calculate the log2 CPM of the gene count data by setting the **log** argument of the **cpm** command to **TRUE** like so.

logcpm <- cpm(list, log=TRUE)

Note that by default this produces a matrix of log2 CPM that has undefined values avoided and poorly defined log fold changes (logFC) for low counts shrunk towards zero.

Now we can draw a heatmap clustering the individual samples with the **heatmap **command of base R.

heatmap(logcpm)

Note that by default the **heatmap** command performs hierarchical clustering of the individual samples using the calculated log2 CPM.

### Model Fitting

Once negative binomial models are fitted and dispersion estimates are obtained, we can proceed with testing for the differential expression of genes.

We need to first estimate common dispersion and tagwise dispersions to produce a matrix of pseudo-counts using the **estimateDisp **command as follows.

list <- estimateDisp(list)

#### BCV Plot

Next, we will create a biological coefficient of variation (BCV) plot to explore our data before performing statistical tests.

Then, let’s plot dispersion estimates and biological coefficient of variation with the **plotBCV **command.

plotBCV(list)

Note that the BCV is the square root of the dispersion parameter under the negative binomial model and is equivalent to estimating the dispersions of the negative binomial model.

### Pairwise Contrasts

A contrast is a linear combination of means for a group. So, each of the following pairwise comparisons is a contrast that we will be performing to determine the differential expression of genes.

- treat_4h vs cntrl_4h
- treat_24h vs cntrl_24h
- treat_4h vs treat_24h
- cntrl_4h vs cntrl_24h

Note that the authors were focused on the comparison of differentially expressed genes at different times post UV-B exposure (treat) using the following contrasts.

- treat_4h vs cntrl_4h
- treat_24h vs cntrl_24h

Also keep in mind that the authors considered genes with adjusted P-values < 0.05 to be DE. We will consider genes with FDR adjusted P-values < 0.05 to be significantly DE in this tutorial.

#### treat_4h vs cntrl_4h

The first contrast we will make is with the groups of **treat_4h **and **cntrl_4h**, which are the subsets of samples associated with either the **treat** or **cntrl** levels of the **treatment** factor and the **4h** level of the **hours** factor.

Let’s use the **exactTest** command to perform the treat_4h vs cntrl_4h contrast as follows.

tested_4h <- exactTest(list, pair=c("cntrl_4h", "treat_4h"))

Using the **summary** command with the **decideTests **command we can view the total number of DE genes at a FDR adjusted P-value of 0.05 like so.

summary(decideTests(tested_4h))

##### MD Plot

The first plot we’ll make is of log-fold change against log-counts per million using the **plotMD** command.

plotMD(tested_4h)

Now let’s add blue lines to indicate 2-fold changes in expression like so.

abline(h=c(-1, 1), col="blue")

Notice that the legend indicates significantly DE genes are highlighted in red for positive change in expression (up), and blue for negative (down). So, in this contrast there are no significantly DE genes.

##### Volcano Plot

Another informative type of plot we can make is called a volcano plot, which is a type of scatterplot that can be used to display the association between statistical significance (P-value) and magnitude of gene expression (fold change).

First, let’s create a results table of DE genes and statistics using the **topTags** command.

resultsTbl_4h <- topTags(tested_4h, n=nrow(tested_4h$table), adjust.method="fdr")$table

Note that the default sorting method of **topTags** is by P-value, so we are using the **adjust.method** argument to sort by the FDR adjusted P-values.

Now, we need to identify the direction of expression (up or down) for significantly DE genes. Remember that we are considering genes with a FDR adjusted P-values < 0.05 to be significantly DE.

Let’s start by adding a column named **topDE** to our data frame of exact test results and fill each entry with the string **“NA”**.

resultsTbl_4h$topDE <- "NA"

We will identify genes that are significantly positively expressed (up) as those with a log FC > 1 and with a FDR adjusted P-values < 0.05 as follows.

resultsTbl_4h$topDE[resultsTbl_4h$logFC > 1 & resultsTbl_4h$FDR < 0.05] <- "UP"

Next, we will identify genes that are significantly negatively expressed (down) as those with a log FC < -1 and with a FDR adjusted P-values < 0.05 like so.

resultsTbl_4h$topDE[resultsTbl_4h$logFC < -1 & resultsTbl_4h$FDR < 0.05] <- "DOWN"

Now we can create our volcano plot using the **ggplot** command with the **logFC**, **FDR**, and **topDE** columns of the exact test results data frame. Note that the **FDR** is transformed using the negative **–** results of the **log10** command.

ggplot(data=resultsTbl_4h, aes(x=logFC, y=-log10(FDR), color=topDE)) + geom_point() + theme_minimal() + scale_colour_discrete(type=ghibli_subset, breaks = c("Up", "Down"))

The specific colors for each of the directions of expression are set using the **scale_colour_discrete** command and the subset of **ghibli** palette colors that we previously stored in the **ghibli_subset** vector.

Notice that there is only one color of points in the above plot since we did not find any significantly DE genes.

#### treat_24h vs cntrl_24h

The second contrast we will make is with the groups of **treat_24h **and **cntrl_24h**, which are the subsets of samples associated with either the **treat** or **cntrl** levels of the **treatment** factor and the **24h** level of the **hours** factor.

Let’s use the **exactTest** command to perform the treat_24h vs cntrl_24h contrast as follows.

tested_24h <- exactTest(list, pair=c("cntrl_24h", "treat_24h"))

Using the **summary** command with the **decideTests **command we can view the total number of DE genes at a FDR adjusted P-value of 0.05 like so.

summary(decideTests(tested_24h))

##### MD Plot

The first plot we’ll make is of log-fold change against log-counts per million using the **plotMD** command.

plotMD(tested_24h)

Now let’s add blue lines to indicate 2-fold changes in expression like so.

abline(h=c(-1, 1), col="blue")

Notice that significantly DE genes are highlighted in red for positive change in expression (up), and blue for negative (down).

##### Volcano Plot

We will make a scatterplot to display the association between statistical significance (P-value) and magnitude of gene expression (fold change).

First, let’s create a results table of DE genes and statistics using the **topTags** command.

resultsTbl_24h <- topTags(tested_24h, n=nrow(tested_24h$table), adjust.method="fdr")$table

Now, we need to identify the direction of expression (up or down) for significantly DE genes.

Let’s start by adding a column named **topDE** to our data frame of exact test results and fill each entry with the string **“NA”**.

resultsTbl_24h$topDE <- "NA"

We will identify genes that are significantly positively expressed (up) as those with a log FC > 1 and with a FDR adjusted P-values < 0.05 as follows.

resultsTbl_24h$topDE[resultsTbl_24h$logFC > 1 & resultsTbl_24h$FDR < 0.05] <- "Up"

Next, we will identify genes that are significantly negatively expressed (down) as those with a log FC < -1 and with a FDR adjusted P-values < 0.05 like so.

resultsTbl_24h$topDE[resultsTbl_24h$logFC < -1 & resultsTbl_24h$FDR < 0.05] <- "Down"

Now we can create our volcano plot using the **ggplot** command with the **logFC**, **FDR**, and **topDE** columns of the exact test results data frame.

ggplot(data=resultsTbl_24h, aes(x=logFC, y=-log10(FDR), color = topDE)) + geom_point() + theme_minimal() + scale_colour_discrete(type = ghibli_subset, breaks = c("Up", "Down"))

Notice that the significantly DE genes are colored by the **topDE** column in the **aes** command. The specific colors for each of the directions of expression are set using the **scale_colour_discrete** command and the subset of **ghibli** palette colors that we previously stored in the **ghibli_subset** vector.

We also set the **breaks** argument of the **scale_colour_discrete** command to specify that we only want the tags of “Up” and “Down” to be included in the legend.

##### Filtering

Let’s create a table filtered to contain the significantly DE genes for plotting later.

Recall that we will consider genes with FDR adjusted P-values < 0.05 to be DE. This means that we need to filter the list of DE genes by first retrieving all genes that meet the criteria of **FDR **< 0.05 like so.

resultsTbl_24h.keep <- resultsTbl_24h$FDR < 0.05

Then we can subset the data frame to remove all genes that do not have a FDR adjusted P-value < 0.05.

resultsTbl_24h_filtered <- resultsTbl_24h[resultsTbl_24h.keep,]

#### treat_4h vs treat_24h

The third contrast we will make is with the groups of **treat_4h **and **treat_24h**, which are the subsets of samples associated with either the **4h **or **24h **levels of the **hours **factor and the **treat **level of the **treatment **factor.

Let’s use the **exactTest** command to perform the treat_4h vs treat_24h contrast as follows.

tested_treat <- exactTest(list, pair=c("treat_24h", "treat_4h"))

Using the **summary** command with the **decideTests **command we can view the total number of DE genes at a FDR adjusted P-value of 0.05 like so.

summary(decideTests(tested_treat))

##### MD Plot

The first plot we’ll make is of log-fold change against log-counts per million using the **plotMD** command.

plotMD(tested_treat)

Now let’s add blue lines to indicate 2-fold changes in expression like so.

abline(h=c(-1, 1), col="blue")

##### Volcano Plot

We will make a scatterplot to display the association between statistical significance (P-value) and magnitude of gene expression (fold change).

First, let’s create a results table of DE genes and statistics using the **topTags** command.

resultsTbl_treat <- topTags(tested_treat, n=nrow(tested_treat$table), adjust.method="fdr")$table

Now, we need to identify the direction of expression (up or down) for significantly DE genes.

Let’s start by adding a column named **topDE** to our data frame of exact test results and fill each entry with the string **“NA”**.

resultsTbl_treat$topDE <- "NA"

We will identify genes that are significantly positively expressed (up) as those with a log FC > 1 and with a FDR adjusted P-values < 0.05 as follows.

resultsTbl_treat$topDE[resultsTbl_treat$logFC > 1 & resultsTbl_treat$FDR < 0.05] <- "Up"

Next, we will identify genes that are significantly negatively expressed (down) as those with a log FC < -1 and with a FDR adjusted P-values < 0.05 like so.

resultsTbl_treat$topDE[resultsTbl_treat$logFC < -1 & resultsTbl_treat$FDR < 0.05] <- "Down"

Now we can create our volcano plot using the **ggplot** command with the **logFC**, **FDR**, and **topDE** columns of the exact test results data frame.

ggplot(data=resultsTbl_treat, aes(x=logFC, y=-log10(FDR), color = topDE)) + geom_point() + theme_minimal() + scale_colour_discrete(type = ghibli_subset, breaks = c("Up", "Down"))

##### Filtering

Let’s create a table filtered to contain the significantly DE genes for plotting later.

We need to filter the list of DE genes by first retrieving all genes that meet the criteria of **FDR **< 0.05 like so.

resultsTbl_treat.keep <- resultsTbl_treat$FDR < 0.05

Then we can subset the data frame to remove all genes that do not have a FDR adjusted P-value < 0.05.

resultsTbl_treat_filtered <- resultsTbl_treat[resultsTbl_treat.keep,]

#### cntrl_4h vs cntrl_24h

The last pairwise contrast we will make is with the groups of **cntrl_4h **and **cntrl_24h**, which are the subsets of samples associated with either the **4h **or **24h **levels of the **hours **factor and the **cntrl **level of the **treatment **factor.

Let’s use the **exactTest** command to perform the cntrl_4h vs cntrl_24h contrast as follows.

tested_cntrl <- exactTest(list, pair=c("cntrl_24h", "cntrl_4h"))

**summary** command with the **decideTests **command we can view the total number of DE genes at a FDR adjusted P-value of 0.05 like so.

summary(decideTests(tested_cntrl))

##### MD Plot

**plotMD** command.

plotMD(tested_cntrl)

Now let’s add blue lines to indicate 2-fold changes in expression like so.

abline(h=c(-1, 1), col="blue")

##### Volcano Plot

We will again make a scatterplot to display the association between statistical significance (P-value) and magnitude of gene expression (fold change).

First, let’s create a results table of DE genes and statistics using the **topTags** command.

resultsTbl_ncntrl <- topTags(tested_cntrl, n=nrow(tested_cntrl$table), adjust.method="fdr")$table

Now, we need to identify the direction of expression (up or down) for significantly DE genes.

**topDE** to our data frame of exact test results and fill each entry with the string **“NA”**.

resultsTbl_ncntrl$topDE <- "NA"

resultsTbl_ncntrl$topDE[resultsTbl_ncntrl$logFC > 1 & resultsTbl_ncntrl$FDR < 0.05] <- "Up"

resultsTbl_ncntrl$topDE[resultsTbl_ncntrl$logFC < -1 & resultsTbl_ncntrl$FDR < 0.05] <- "Down"

Now we can create our volcano plot using the **ggplot** command with the **logFC**, **FDR**, and **topDE** columns of the exact test results data frame.

ggplot(data=resultsTbl_ncntrl, aes(x=logFC, y=-log10(FDR), color = topDE)) + geom_point() + theme_minimal() + scale_colour_discrete(type = ghibli_subset, breaks = c("Up", "Down"))

##### Filtering

Let’s create a table filtered to contain the significantly DE genes for plotting later.

We need to filter the list of DE genes by first retrieving all genes that meet the criteria of **FDR **< 0.05 like so.

resultsTbl_cntrl.keep <- resultsTbl_ncntrl$FDR < 0.05

Then we can subset the data frame to remove all genes that do not have a FDR adjusted P-value < 0.05.

resultsTbl_cntrl_filtered <- resultsTbl_ncntrl[resultsTbl_cntrl.keep,]

### Pairwise Results Exploration

We can start to explore the results of our DE analysis by comparing the sets of significantly DE genes for each of the pairwise contrasts.

First, let’s retrieve set of significantly DE gene names for the 24h groups contrast of treat_24h vs cntrl_24h as follows.

geneSet_24h <- rownames(resultsTbl_24h_filtered)

The set of significantly DE gene names for the treat groups contrast of treat_4h vs treat_24h.

geneSet_treat <- rownames(resultsTbl_treat_filtered)

And the set of significantly DE gene names for the treat groups contrast of treat_4h vs treat_24h.

geneSet_cntrl <- rownames(resultsTbl_cntrl_filtered)

Next, we need to create a combined list of the sets of significantly DE gene names like so.

list_venn <- list(h24 = geneSet_24h, treat = geneSet_treat, cntrl = geneSet_cntrl)

Now we can create a venn diagram comparing the sets of significantly DE genes for each of the pairwise contrasts using the **ggVennDiagram** command with the **scale_color_brewer** command.

ggVennDiagram(list_venn, label_alpha=0.25, category.names = c("24h","treat","cntrl")) + scale_color_brewer(palette = "Paired")

Recall that no significantly DE genes were found for the 4h groups contrast of treat_4h vs cntrl_4h, so it is not included in the above plot.

Note that we are using the **category.names** argument of the **ggVennDiagram** command to set the names of the groups on the plot. We also use the **label_alpha** argument to make the labels on the plot slightly transparent.

Further note that we set the **palette** argument of the **scale_color_brewer** command to** “Paired”**, which is one of the qualitative palettes that can be used to represent nominal or categorical data.

## ANOVA Comparisons

Next we will use generalized linear models (GLMs) to detect differentially expressed genes associated with the factors of our experimental design. GLMs are an extension of classical linear models to non-normally distributed response data, and are used to specify probability distributions according to their mean-variance relationship.

Note that GLMs may be used for general experiments with multiple factors, and parallels the ANOVA method. Recall that exact tests are only applicable to experiments with a single factor.

The comma separated variable (CSV) file named groupingFactors_tribolium.csv with the grouping factors can be found -> here <-

### ANOVA Design

First, we will import a CSV file named **groupingFactors_tribolium.csv** that describes the design of our experiment as follows.

glm_targets <- read.csv(file="groupingFactors_tribolium.csv", row.names="sample")

Let’s use the **factor** command with the **paste** command to create an object with the groupings for our samples.

glm_group <- factor(paste(glm_targets$treatment, glm_targets$hours, sep="."))

Again, we need to create a DGEList object as follows.

glm_list <- DGEList(counts=tribolium_counts, group=glm_group)

Here we should add the sample names from the row names of the **glm_targets** data frame with the design of our experiment.

colnames(glm_list) <- rownames(glm_targets)

Now we can setup the design matrix used for conducting contrasts with the generalized linear models using the **model.matrix** command.

glm_design <- model.matrix(~ 0 + glm_group)

We also need to add the group names to the columns of our design matrix using the **levels **of our object with the groupings for our sample named using the **colnames** and **levels** commands.

colnames(glm_design) <- levels(glm_group)

### Normalization

Remember that we need to filter the table of raw gene counts using the **filterByExpr **command before we proceed with normalization.

glm_keep <- filterByExpr(glm_list)

Let’s again use the **table** command to view the number of filtered genes.

table(glm_keep)

Now we should subset our data frame to remove genes that the previous filtering identified as not expressed in either experimental condition.

glm_list <- glm_list[glm_keep, , keep.lib.sizes=FALSE]

We’ll again use the **calcNormFactors** command next to calculate scaling factors to convert raw library sizes into effective.

glm_list <- calcNormFactors(glm_list)

Next, we’ll compute counts per million (CPM) using normalized library sizes with the **cpm** command.

norm_glm_list <- cpm(glm_list, normalized.lib.sizes=TRUE)

### Model Fitting

Once negative binomial models are fitted and dispersion estimates are obtained, we can proceed with testing for the differential expression of genes.

We need to first estimate common dispersion and tagwise dispersions to produce a matrix of pseudo-counts using the **estimateDisp **command as follows.

glm_list <- estimateDisp(glm_list, glm_design, robust=TRUE)

Now we need to fit the quasi-likelihood (QL) negative binomial generalized log-linear model to count data using the **glmQLFit** command.

glm_fit <- glmQLFit(glm_list, glm_design, robust=TRUE)

We can use the **plotQLDisp** command to plot the gene wise QL dispersions against the gene abundance in log2 CPM.

plotQLDisp(glm_fit)

### ANOVA Contrasts

Recall that a contrast is a linear combination of means for a group. We will be performing each of the following contrasts to determine the differential expression of genes between the levels of each factor, and for any interaction effect.

- treatment
- hours
- interaction

Recall that the ANOVA let’s us identify any significant main effect associated with an explanatory variable, which are categorical factors like treatment and hours.

#### treatment

The first ANOVA like contrast we will make is between the **treat** and **cntrl** levels of the **treatment **factor.

Let’s use the **makeContrasts** command to setup the treatment contrast matrix as follows.

con.treatment <- makeContrasts(set.treatment = (treat.4h + treat.24h)/2 - (cntrl.4h + cntrl.24h)/2, levels=glm_design)

Now we’ll use the **glmTreat** command to conduct gene wise statistical tests for the treatment contrast.

anov.treatment <- glmTreat(glm_fit, contrast=con.treatment)

**summary** command with the **decideTests **command we can view the total number of DE genes at a FDR adjusted P-value of 0.05 like so.

summary(decideTests(anov.treatment))

##### MD Plot

**plotMD** command.

plotMD(anov.treatment)

Now let’s add blue lines to indicate 2-fold changes in expression like so.

abline(h=c(-1, 1), col="blue")

##### Volcano Plot

We will make a scatterplot to display the association between statistical significance (P-value) and magnitude of gene expression (fold change).

First, let’s create a results table of DE genes and statistics using the **topTags** command.

tagsTbl_treatment <- topTags(anov.treatment, n=nrow(anov.treatment$table), adjust.method="fdr")$table

Now, we need to identify the direction of expression (up or down) for significantly DE genes.

**topDE** to our data frame of exact test results and fill each entry with the string **“NA”**.

tagsTbl_treatment$topDE <- "NA"

tagsTbl_treatment$topDE[tagsTbl_treatment$logFC > 1 & tagsTbl_treatment$FDR < 0.05] <- "UP"

tagsTbl_treatment$topDE[tagsTbl_treatment$logFC < -1 & tagsTbl_treatment$FDR < 0.05] <- "DOWN"

**ggplot** command with the **logFC**, **FDR**, and **topDE** columns of the exact test results data frame.

ggplot(data=tagsTbl_treatment, aes(x=logFC, y=-log10(FDR), color = topDE)) + geom_point() + theme_minimal() + scale_colour_discrete(type = ghibli_subset, breaks = c("Up", "Down"))

##### Filtering

Let’s create a table filtered to contain the significantly DE genes for plotting later.

We need to filter the list of DE genes by first retrieving all genes that meet the criteria of **FDR **< 0.05 like so.

tagsTbl_treatment.glm_keep <- tagsTbl_treatment$FDR < 0.05

tagsTbl_treatment_filtered <- tagsTbl_treatment[tagsTbl_treatment.glm_keep,]

#### hours

The first ANOVA like contrast we will make is between the **4h **and **24h **levels of the **hours **factor.

Let’s use the **makeContrasts** command to setup the treatment contrast matrix as follows.

con.hours <- makeContrasts(set.hours = (cntrl.24h + treat.24h)/2 - (cntrl.4h + treat.4h)/2, levels=glm_design)

Now we’ll use the **glmTreat** command to conduct gene wise statistical tests for the hours contrast.

anov.hours <- glmTreat(glm_fit, contrast=con.hours)

**summary** command with the **decideTests **command we can view the total number of DE genes at a FDR adjusted P-value of 0.05 like so.

summary(decideTests(anov.hours))

##### MD Plot

**plotMD** command.

plotMD(anov.hours)

Now let’s add blue lines to indicate 2-fold changes in expression like so.

abline(h=c(-1, 1), col="blue")

##### Volcano Plot

Again we will make a scatterplot to display the association between statistical significance (P-value) and magnitude of gene expression (fold change).

First, let’s create a results table of DE genes and statistics using the **topTags** command.

tagsTbl_hours <- topTags(anov.hours, n=nrow(anov.hours$table), adjust.method="fdr")$table

Now, we need to identify the direction of expression (up or down) for significantly DE genes.

**topDE** to our data frame of exact test results and fill each entry with the string **“NA”**.

tagsTbl_hours$topDE <- "NA"

tagsTbl_hours$topDE[tagsTbl_hours$logFC > 1 & tagsTbl_hours$FDR < 0.05] <- "UP"

tagsTbl_hours$topDE[tagsTbl_hours$logFC < -1 & tagsTbl_hours$FDR < 0.05] <- "DOWN"

**ggplot** command with the **logFC**, **FDR**, and **topDE** columns of the exact test results data frame.

ggplot(data=tagsTbl_hours, aes(x=logFC, y=-log10(FDR), color = topDE)) + geom_point() + theme_minimal() + scale_colour_discrete(type = ghibli_subset, breaks = c("Up", "Down"))

##### Filtering

Let’s again create a table filtered to contain the significantly DE genes for plotting later.

**FDR **< 0.05 like so.

tagsTbl_hours.glm_keep <- tagsTbl_hours$FDR < 0.05

tagsTbl_hours_filtered <- tagsTbl_hours[tagsTbl_hours.glm_keep,]

#### interaction

The first ANOVA like contrast we will make is between the **treatment** and **hours **factors to identify any interaction effect. This means that we need to consider the groupings based on the levels within each factor.

Let’s use the **makeContrasts** command to setup the treatment contrast matrix as follows.

con.interaction <- makeContrasts(set.interaction = ((treat.4h + treat.24h)/2 - (cntrl.4h + cntrl.24h)/2) - ((cntrl.24h + treat.24h)/2 - (cntrl.4h + treat.4h)/2), levels=glm_design)

Now we’ll use the **glmTreat** command to conduct gene wise statistical tests for the interaction contrast.

anov.interaction <- glmTreat(glm_fit, contrast=con.interaction)

**summary** command with the **decideTests **command we can view the total number of DE genes at a FDR adjusted P-value of 0.05 like so.

summary(decideTests(anov.interaction))

##### MD Plot

**plotMD** command.

plotMD(anov.interaction)

Now let’s add blue lines to indicate 2-fold changes in expression like so.

abline(h=c(-1, 1), col="blue")

##### Volcano Plot

Again we will make a scatterplot to display the association between statistical significance (P-value) and magnitude of gene expression (fold change).

First, let’s create a results table of DE genes and statistics using the **topTags** command.

tagsTbl_inter <- topTags(anov.interaction, n=nrow(anov.interaction$table), adjust.method="fdr")$table

Now, we need to identify the direction of expression (up or down) for significantly DE genes.

**topDE** to our data frame of exact test results and fill each entry with the string **“NA”**.

tagsTbl_inter$topDE <- "NA"

tagsTbl_inter$topDE[tagsTbl_inter$logFC > 1 & tagsTbl_inter$FDR < 0.05] <- "UP"

tagsTbl_inter$topDE[tagsTbl_inter$logFC < -1 & tagsTbl_inter$FDR < 0.05] <- "DOWN"

**ggplot** command with the **logFC**, **FDR**, and **topDE** columns of the exact test results data frame.

ggplot(data=tagsTbl_inter, aes(x=logFC, y=-log10(FDR), color = topDE)) + geom_point() + theme_minimal() + scale_colour_discrete(type = ghibli_subset, breaks = c("Up", "Down"))

##### Filtering

Let’s again create a table filtered to contain the significantly DE genes for plotting later.

**FDR **< 0.05 like so.

tagsTbl_inter.glm_keep <- tagsTbl_inter$FDR < 0.05

tagsTbl_inter_filtered <- tagsTbl_inter[tagsTbl_inter.glm_keep,]

### ANOVA Results Exploration

We can start to explore the results of our DE analysis by comparing the sets of significantly DE genes for each of the pairwise contrasts.

First, let’s retrieve set of significantly DE gene names for the hours contrast of as follows.

geneSet_hours <- rownames(tagsTbl_hours_filtered)

The set of significantly DE gene names for the interaction contrast.

geneSet_interaction <- rownames(tagsTbl_inter_filtered)

Next, we need to create a combined list of the sets of significantly DE gene names like so.

glm_list_venn <- list(hours = geneSet_hours, interaction = geneSet_interaction)

Now we can create a venn diagram comparing the sets of significantly DE genes for each of the pairwise contrasts using the **ggVennDiagram** command with the **scale_color_brewer** command.

ggVennDiagram(glm_list_venn, label_alpha=0.25, category.names = c("hours","interaction")) + scale_color_brewer(palette = "Paired")

Recall that no significantly DE genes were found for the treatment contrast, so it is not included in the above plot.

## Saving Plots

We can save plots as PNG files by placing the code for the plot in between the **png** command and **dev.off** command calls, for example.

png("glm_tribolium_venn.png") ggVennDiagram(glm_list_venn, label_alpha=0.25, category.names = c("hours","interaction")) + scale_color_brewer(palette = "Paired") dev.off()

## Saving Tables

We can write table of data to a comma separated variable (CSV) file for downstream analysis using the write.table command, for example.

write.table(normList, file="Tribolium_normalizedCounts.csv", sep=",", row.names=TRUE)

## Key Points & Tips

- Pairwise comparisons are a good starting point for biostatistical analysis.
- ANOVA comparisons are used for general experiments with multiple factors.
- Make sure to install any necessary software in advance.
- Use different types of plots to explore expression data and analysis results.

## Code & Data

A R script with the code for all of the above analyses is provided -> here <-

The CSV file named TriboliumCounts.csv with the gene count data can be found -> here <-

The CSV file named groupingFactors_tribolium.csv with the grouping factors for GLM analysis can be found -> here <-

The CSV file named Tribolium_normalizedCounts.csv with the example output normalized gene count results can be found -> here <-

**leave a comment**for the author, please follow the link and comment on their blog:

**R – Myscape**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.