Learning And Exploring The Workflow of RNA-Seq Analysis – A Note To Myself

[This article was first published on r on Everyday Is A School Day, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Learned RNA-seq workflow using C. difficile data from a published study 🧬. Processed raw reads through fastp → kallisto → DESeq2 pipeline. Results matched the original paper’s findings, with clear differential expression between mucus and control conditions 📊.

Motivations:

To be honest, I don’t really know what RNA-seq was until I learnt more about it and its potential! On our previous learning processes, we’ve looked at Learning Antimicrobial Resistance (AMR) genes with Bioconductor, Phylogenetic Analysis, Building DNA sequence Alignment, Assemblying DNA Sequence, Learning BLAST and MLST, Exploring Long Sequence ONT Workflow. Notice that all these are DNA related. This article, we’ll explore the world of RNA! In my simplistic view, RNA-seq enables us to explore and discover transcriptomes of certain conditions, whether it can tell us a bit more of the gene expression based on certain conditions. The potential is huge! Imagine if you can tell the difference between colonization vs a true infection in a clinical setting, I wonder if differential expression analysis of transcriptome can provide us a bit more information! Differentiating infection vs contamination (?more accurate HAI definition 🤷‍♂️). Let’s dive into the shallow pool of the unknown world of RNA-Seq and at least learn the basics! Let’s go!

Game Plan:

Let’s look at Clostridioides difficile and see if we can find any raw cDNA sequences on ncbi. After some searching, I found this Clostridioides difficile-mucus interactions encompass shifts in gene expression, metabolism, and biofilm formation that might be potentially a good one to look at and see if we can somewhat reproduce what’s found.

Disclaimer:

I am not a biostatistician, neither do I work in a lab. I’m attempting to understand the bioinformatics workflow of an RNA-seq. If you find any information displayed here is wrong, please let me know so I can learn. Please verify the information presented here as well. This is a documentation of my learning process and a note for my future self to reproduce the workflow I’ve explored

Objectives:

What is RNA-Seq?

RNA-Seq, or RNA sequencing, is a powerful technique used to analyze the transcriptome of a cell or organism. The transcriptome refers to the complete set of RNA molecules, including messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), and non-coding RNAs, that are present in a cell at a specific time. RNA-Seq allows researchers to study gene expression patterns, identify differentially expressed genes, and discover novel transcripts. read more on wiki. It’s usually challenging to sequence from RNA itself due to its instability and susceptibility to degradation. Therefore, RNA is typically converted into complementary DNA (cDNA) using a process called reverse transcription before sequencing. This cDNA is then used as the input for sequencing platforms. However, for short RNA sequence, nowadays with ONT, it does provide flowcells that can directly sequence RNA without the need for conversion to cDNA.

Let’s Look At Existing Data

Let’s dive into the bioproject. To download the raw sequence, we need to go through several clicks. And also understand what some of these abbreviations means.

Sequence Read Archive (SRA):
SRR – Individual sequencing runs (raw data files).
SRX – Experiments (groups related runs).
SRS – Samples (biological specimens).
SRP – Projects/Studies (collections of related experiments).

Gene Expression Omnibus (GEO):
GSM – Individual samples with expression data.
GSE – Series/datasets (collections of related samples).
GPL – Platforms (microarray or sequencing platform descriptions).
GDS – Curated datasets (processed GSE data).

Assembly Database:
ASM – Genome assemblies.
GCA – GenBank assemblies.
GCF – RefSeq assemblies.

Core Sequence Databases:
NC_ – RefSeq chromosomes/complete genomes.
NM_ – RefSeq mRNA sequences.
NP_ – RefSeq protein sequences.
XM_/XP_ – Model/predicted RefSeq sequences.
AC_ – GenBank finished genomic sequences.

BioProject & BioSample:
PRJNA – BioProject accessions (research projects).
SAMN – BioSample accessions (biological specimens).

Alright, with that out of the way, we’re interested in SRR, which should be under SRA. We can see on the bioproject page that there is SRA experiments. Click on that and it’ll bring you to a page of SRXs. Since they are very BIG files, in terms of gigs, we can’t just download from the website, we need sra-tools

## Install follow this link instruction 
## https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit

## download SRRs example
fasterq-dump --progress --threads 4 SRR27792604

Now, repeat the above for the other SRRs and you’ll get all the raw sequences in fastq format. With our practice, we’ll only download SRR27792597,SRR27792603,SRR27792598, andSRR27792604. Two mucus and two controls. It will take sometime, which is why i like the --progress parameter. Alright, after we’ve downloaded the sequences, let’s move on to our official workflow.

files <- c("kallisto_SRR27792597/abundance.h5",
           "kallisto_SRR27792603/abundance.h5", 
           "kallisto_SRR27792598/abundance.h5",
           "kallisto_SRR27792604/abundance.h5")

The Workflow

TL;DR Quality control (fastp) -> Create transcriptome reference (kallisto) -> Assemble Raw Sequence (kallisto) -> Analyse (PCA & DESeq2)

QC of raw read

# install fastp

# run QC
fastp -i SRR27792597_1.fastq -h SRR27792597_1.html --verbose --stdout --thread 8 > /dev/null

If you’re like me who likes to know what’s going on when running, make sure to include --verbose and I found --thread is very helpful, otherwise it took sometime. Repeat for all fastq files and inspect.

What is Considered Acceptable?

I’ll be honest, I’m not sure. 🤔 According to Claude, Read Quality Metrics such as Quality Scores (Phred scores): Q30+: >80% is good, >90% is excellent, Q20+: Should be >95%. Total Reads: Depends on our goals: Differential expression: 10-30M reads per sample, whereas Novel transcript discovery: 50-100M+ reads. Contamination and Artifacts: Adapter Content: <5%: Good, 5-20%: Moderate (should trim), >20%: High contamination. Duplication Rate: <30%: Low (genomic DNA-like), 30-70%: Normal for RNA-seq, >70%: Potentially problematic. Sequence Composition: GC Content: Should match expected organism: Human/Mouse: ~42%, E. coli: ~50%, Yeast: ~38%; Per-base Quality: Should stay >Q28 across read length. Red Flags to Watch For Poor quality data: <10M reads for differential expression <80% Q20 bases, Severe 3’ quality drop (below Q20), Unusual GC content spikes, 20% adapter contamination after trimming.

Our QC report is not included, but a quick glance, it looks pretty good!

Getting Reference of Transcriptome

On the article’s supplement, we found that they used FN545816.1 as reference genome. We can download them from here. Click download and include gff as well.

# Install gffread if you don't have it
brew install gffread  # on macOS

# install kallisto
# go here https://pachterlab.github.io/kallisto/download.html for info

# extract transcriptome sequences
gffread -w transcriptome.fa -g GCA_000027105.1_ASM2710v1_genomic.fna genomic.gff
kallisto index -i transcriptome.idx transcriptome.fa

You should have transcriptome.idx and transcriptome.fa in your current working directory. What this does is that it extracts the transcriptome sequences from the genome based on the gff file, in our case it’s a specific Clostridioides difficile R20291 assembly with accession FN545816.1. The gff file contains information about the locations of genes and other features on the genome. The -w flag specifies the output file for the transcriptome sequences, and the -g flag specifies the input genome sequence file. The resulting transcriptome.fa file contains the nucleotide sequences of all transcripts annotated in the gff file.

Now that we have the reference. Let’s assemble our raw sequence to something readable.

Assemble

# assemble, make sure to include all the other samples
kallisto quant -i transcriptome.idx \
 -o kallisto_SRR27792604 \
 --threads 8 \
 SRR27792604_1.fastq SRR27792604_2.fastq

We’d have to go through all the raw sequences we downloaded and assemble them. We’ll then see folders such as kallisto_SRR27792604. And inside the folder, we see abundance.h5 and abundance.tsv. The reason i prefer kallisto over STAR is mainly because of the speed. When I first tried STAR it was very slow and no progress bar. The output of these files are actually quite small. Let’s take a look!

library(tximport)
library(DESeq2)

# Create file paths for all samples
files <- c("kallisto_SRR27792597/abundance.h5",
           "kallisto_SRR27792603/abundance.h5", 
           "kallisto_SRR27792598/abundance.h5",
           "kallisto_SRR27792604/abundance.h5")

# Name them
names(files) <- c("mucus_rep1", "control_rep1", "mucus_rep2", "control_rep2")

# Import data
txi <- tximport(files, type = "kallisto", txOut = TRUE)

coldata <- data.frame(
  condition = c("mucus", "control", "mucus", "control"),  # or however they're grouped
  row.names = names(files)
)

dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ condition)

Just as a note, there are another way we can create from matrix and count as well using tsv.

PCA

# Perform variance stabilizing transformation
vsd <- vst(dds, blind = FALSE)

# Plot PCA
plotPCA(vsd, intgroup = "condition")

Wow, it looks like the treatment and control are separated quite well! Let’s dive straight into differential expression analysis. If we look at figure 2B on the article, it looks very similar, even though we haven’t include the full sequences.

DESeq2

library(tidyverse)

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get results
res <- results(dds, alpha = 0.01, tidy = T)

# View summary
summary(res)

##      row               baseMean       log2FoldChange         lfcSE        
##  Length:3570        Min.   :      0   Min.   :-8.42507   Min.   :0.06850  
##  Class :character   1st Qu.:    326   1st Qu.:-0.51037   1st Qu.:0.08493  
##  Mode  :character   Median :   1313   Median : 0.00074   Median :0.10664  
##                     Mean   :  12154   Mean   :-0.01934   Mean   :0.16100  
##                     3rd Qu.:   4999   3rd Qu.: 0.53025   3rd Qu.:0.16896  
##                     Max.   :3272163   Max.   : 3.75721   Max.   :4.98958  
##                                       NA's   :1          NA's   :1        
##       stat               pvalue               padj          
##  Min.   :-80.75478   Min.   :0.0000000   Min.   :0.0000000  
##  1st Qu.: -4.39763   1st Qu.:0.0000000   1st Qu.:0.0000000  
##  Median :  0.00566   Median :0.0001034   Median :0.0002068  
##  Mean   : -0.71728   Mean   :0.1229994   Mean   :0.1366817  
##  3rd Qu.:  3.44035   3rd Qu.:0.0970157   3rd Qu.:0.1293422  
##  Max.   : 28.59200   Max.   :0.9978324   Max.   :0.9978324  
##  NA's   :1           NA's   :1           NA's   :1

# positive
res |> 
  filter(log2FoldChange > 0) |>
  arrange(padj, desc(log2FoldChange)) |>
  head(10)

##                   row  baseMean log2FoldChange      lfcSE     stat
## 1  gene-CDR20291_1626  2872.357       2.999390 0.10490313 28.59200
## 2  gene-CDR20291_2014 32895.724       2.043527 0.07250627 28.18414
## 3  gene-CDR20291_0509 54339.793       1.925179 0.07046733 27.32016
## 4  gene-CDR20291_2174 13209.067       1.957986 0.07539631 25.96926
## 5  gene-CDR20291_2495 24664.096       2.321013 0.08978192 25.85168
## 6  gene-CDR20291_0508  8811.393       2.021692 0.07967500 25.37423
## 7  gene-CDR20291_2738  9037.752       1.859512 0.07781080 23.89787
## 8  gene-CDR20291_1446  4410.271       1.982683 0.08581234 23.10487
## 9  gene-CDR20291_2871  3862.148       1.912250 0.08437678 22.66322
## 10 gene-CDR20291_2017  5272.620       1.784124 0.08208607 21.73479
##           pvalue          padj
## 1  8.448830e-180 1.076924e-177
## 2  9.148740e-175 1.053286e-172
## 3  2.444054e-164 2.643282e-162
## 4  1.102039e-148 9.832939e-147
## 5  2.329704e-147 2.027979e-145
## 6  4.856404e-142 3.939206e-140
## 7  3.223217e-126 2.170502e-124
## 8  4.136645e-118 2.636372e-116
## 9  1.033340e-113 6.250829e-112
## 10 9.621927e-105 5.283178e-103

# negative
res |> 
  filter(log2FoldChange < 0) |>
  arrange(padj, log2FoldChange) |>
  head(10)

##                   row   baseMean log2FoldChange      lfcSE      stat pvalue
## 1  gene-CDR20291_3145 195320.938      -8.425071 0.10432907 -80.75478      0
## 2  gene-CDR20291_2142  14441.266      -5.528640 0.08551642 -64.65004      0
## 3  gene-CDR20291_1557  21279.449      -4.484919 0.11271527 -39.78981      0
## 4  gene-CDR20291_0332  11550.169      -4.080126 0.08702073 -46.88683      0
## 5  gene-CDR20291_0331   4196.984      -3.907433 0.09507966 -41.09641      0
## 6  gene-CDR20291_2206   6931.606      -3.821015 0.08700146 -43.91897      0
## 7  gene-CDR20291_1078  28192.247      -3.531056 0.08204064 -43.04033      0
## 8  gene-CDR20291_2237   6355.265      -3.491135 0.08384686 -41.63704      0
## 9  gene-CDR20291_0330   6786.174      -3.481683 0.08283612 -42.03097      0
## 10 gene-CDR20291_2238   6034.666      -3.430734 0.08338778 -41.14193      0
##    padj
## 1     0
## 2     0
## 3     0
## 4     0
## 5     0
## 6     0
## 7     0
## 8     0
## 9     0
## 10    0

Interesting thing about the pvalue, we need to look at adjust pval instead because we are doing multiple comparison. Adjust pvalue is via Benjamini-Hochberg method. This adjust the multiplicity of comparison.

Volcano plot

library(ggplot2)
library(ggrepel)

# Define specific genes to label
genes_to_label <- c("gene-CDR20291_1626", "gene-CDR20291_0508", "gene-CDR20291_1446", 
                    "gene-CDR20291_2495", "gene-CDR20291_0455", "gene-CDR20291_0876",
                    "gene-CDR20291_0877", "gene-CDR20291_1275", "gene-CDR20291_0875",
                    "gene-CDR20291_3145")

# Create labeled volcano plot
res |>
  filter(!is.na(padj)) |>
  mutate(label = ifelse(row %in% genes_to_label, str_extract(row, "\\d+$"), "")) |>
  mutate(color = case_when(
    log2FoldChange < -1 & padj <= 0.01 ~ "negative",
    log2FoldChange > 1 & padj <= 0.01 ~ "positive",
    TRUE ~ "neutral"
  )) |>
  ggplot(aes(x = log2FoldChange, y = -log10(padj), color = color)) +
  geom_point(alpha = 0.2) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
  geom_text_repel(aes(label = label), 
                  size = 5, 
                  # box.padding = 0.3,
                  max.overlaps = 20) +
  labs(title = "Volcano Plot - Mucus vs Control",
       x = "Log2 Fold Change", 
       y = "-Log10 Adjusted P-value") +
  scale_color_manual(values = c("positive" = "red", "negative" = "blue", "neutral" = "grey")) +
  theme_bw() +
  theme(legend.position = "none")

If we look at figure 2A, it again looks very similar! To interpret the volcano plot, we basically look at the the top right and top left of the plot. These are the ones that are significantly differentially expressed. The top right are the upregulated genes, whereas the top left are the downregulated genes. We can see that gene-CDR20291_1626 is the most upregulated gene, whereas gene-CDR20291_3145 is the most downregulated gene. 🙌

Looking at NCBI, looks like 1626 is a putative sodium/phosphate cotransporter [Clostridioides difficile R20291]. And 3145 is probable protease. What this means is that Cdiff when exposed to mucus when compared to control, the putative sodium/phosphate cotransporter expressed gene was found more (in mucus group), whereas the probable protease expressed gene was found less (in mucus group).

Opportunities for improvement

  • I should probably rewrite the above to a script and with function so that in the future we can easily reproduce any DE analysis
  • include esearch or entrez to get accession metadata for more accurate label
  • include heatmap
  • pathway analysis. Yes, we can DE analyze these genes, but what do they actually mean? Do they use nutrients differently etc.

Lessons Learnt

  • learnt the basics of sra-tools fasterq-dump,fastp, kallisto,DESeq2.
  • took me a while to find the right reference isolate. Found in on their supplement material and got the right one. For future reference, don’t assume they use the popular refseq, look through their procedure and get that specifically.
  • learnt from raw RNA-seq QC
  • learnt to interpret volcano plot

If you like this article:

To leave a comment for the author, please follow the link and comment on their blog: r on Everyday Is A School Day.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)