An R alternative to pairs for -omics QC
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Category
Tags
Introduction
The Problem: I've got a couple of problems with the commonly used “pairs” plot in R for quality control in -omics data. (1) It's not that space-efficient since it only uses half the space for datapoints and (2) the scatterplot isn't very informative. When I look at those scatter plots it's hard to tell anything about the spread of the data or any normalization problems. This is particularily true for proteomics data where the high dynamic range can obscure lower abundance points.
The Solution: A panel of MA plots (Minus-Average). The MA plot shows fold-change vs average intensity for a pair of samples. It lets you see difference between sample groups as fold-change which I think is a useful unit for comparison and visualizes normalization problems. Rather than plot each against each we will only compare samples between groups to save space.
This goes along with a previous post of mine that attempts to convince biologists of the value of putting tables of data into tidy format. This method takes advantage of pivoting data to succintly generate a panel of MA plots
suppressPackageStartupMessages(library(tidyverse)) library(GGally) library(DEFormats)
Set up the data
I'll start with simulated data that will resemble a gene expression study. A proteomics dataset would be similar. The dataset will have 8 samples, half of them treated, half control. 7 of the samples will approximately the same but Sample 4 will have a 3-fold increase compared to the rest to illustrate how MA-plots help identify problems with normalization.
counts <- simulateRnaSeqData(n = 5000, m = 8)
counts[, 4] <- counts[, 4] *  3
targets <- data.frame(sample = colnames(counts), group = c(rep("control", 4), rep("treated", 4)))
The ggpairs function from the GGally package does a decent job of the pairs plot.  
ggpairs(data.frame(counts))
The pairs plot tells us something about the data. The correlation is nice to have and if any sample was wildly different from the others it would show up in the scatter plot. Still I don't think it conveys the information very efficiently.
MA plot panel
Typically I would start by pivoting all the count data to a single columns and joining in the metadata. But I need to associate control and treated data for each gene for each sample so the usual method won't work. It took me a while to fall on the solution: we have to pivot the control and treated samples separately. So for each gene we will have a control sample name, a treated sample name and control and treated count data. Those can be used to calculate Fold-change and intensity.
control_samples <- targets$sample[targets$group == "control"]
treated_samples <- targets$sample[targets$group == "treated"]
data.frame(counts) %>%
  rownames_to_column("gene") %>%
  pivot_longer(all_of(control_samples), names_to = "control_sample", values_to = "control_count") %>%
  pivot_longer(all_of(treated_samples), names_to = "treated_sample", values_to = "treated_count") %>%
  mutate(FC = treated_count / control_count) %>%
  mutate(Intensity = (treated_count + control_count) / 2)
## # A tibble: 80,000 x 7 
##    gene  control_sample control_count treated_sample treated_count    FC Intensity
##    <chr> <chr>                  <dbl> <chr>                  <dbl> <dbl>     <dbl>
##  1 gene1 sample1                  103 sample5                   71 0.689      87  
##  2 gene1 sample1                  103 sample6                   79 0.767      91  
##  3 gene1 sample1                  103 sample7                   76 0.738      89.5
##  4 gene1 sample1                  103 sample8                  118 1.15      110. 
##  5 gene1 sample2                   82 sample5                   71 0.866      76.5
##  6 gene1 sample2                   82 sample6                   79 0.963      80.5
##  7 gene1 sample2                   82 sample7                   76 0.927      79  
##  8 gene1 sample2                   82 sample8                  118 1.44      100  
##  9 gene1 sample3                   89 sample5                   71 0.798      80  
## 10 gene1 sample3                   89 sample6                   79 0.888      84  
## # ... with 79,990 more rows
All that's left to do now is plot. Facet_grid will let us split the samples up.
data.frame(counts) %>%
  rownames_to_column("gene") %>%
  pivot_longer(all_of(control_samples), names_to = "control_sample", values_to = "control_count") %>%
  pivot_longer(all_of(treated_samples), names_to = "treated_sample", values_to = "treated_count") %>%
  mutate(FC = treated_count / control_count) %>%
  mutate(Intensity = (treated_count + control_count) / 2) %>%
  ggplot(aes(x = Intensity, y = FC)) +
  geom_point(alpha = 0.5, na.rm = TRUE) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log2", breaks = 2^seq(-4, 4, 2)) +
  geom_hline(yintercept = 1) +
  labs(x = "Intensity", y = "Fold Change, treated vs control") +
  facet_grid(rows = vars(treated_sample), cols = vars(control_sample))
The change in abundance in sample 4 shows up much more clearly now. This isn't a common way to plot the data so it might require some explaining to your colleagues but worth the effort in my opinion.
Related Post
- How I selected my starting word for Wordle using simulations and R
- Ditch p-values. Use Bootstrap confidence intervals instead
- Introduction for Decision Tree
- Correlation vs PPS in Python
- Earthquake Analysis (4/4): Cluster Analysis
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
 

