Learning And Exploring The Workflow of RNA-Seq Analysis – A Note To Myself
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Learned RNA-seq workflow using C. difficile data from a published study 🧬. Processed raw reads through fastp → kallisto → DESeq2 pipeline. Results matched the original paper’s findings, with clear differential expression between mucus and control conditions 📊.
Motivations:
To be honest, I don’t really know what RNA-seq was until I learnt more about it and its potential! On our previous learning processes, we’ve looked at Learning Antimicrobial Resistance (AMR) genes with Bioconductor, Phylogenetic Analysis, Building DNA sequence Alignment, Assemblying DNA Sequence, Learning BLAST and MLST, Exploring Long Sequence ONT Workflow. Notice that all these are DNA related. This article, we’ll explore the world of RNA! In my simplistic view, RNA-seq enables us to explore and discover transcriptomes of certain conditions, whether it can tell us a bit more of the gene expression based on certain conditions. The potential is huge! Imagine if you can tell the difference between colonization vs a true infection in a clinical setting, I wonder if differential expression analysis of transcriptome can provide us a bit more information! Differentiating infection vs contamination (?more accurate HAI definition 🤷♂️). Let’s dive into the shallow pool of the unknown world of RNA-Seq and at least learn the basics! Let’s go!
Game Plan:
Let’s look at Clostridioides difficile
and see if we can find any raw cDNA sequences on ncbi. After some searching, I found this
Clostridioides difficile-mucus interactions encompass shifts in
gene expression, metabolism, and biofilm formation that might be potentially a good one to look at and see if we can somewhat reproduce what’s found.
Disclaimer:
I am not a biostatistician, neither do I work in a lab. I’m attempting to understand the bioinformatics workflow of an RNA-seq. If you find any information displayed here is wrong, please let me know so I can learn. Please verify the information presented here as well. This is a documentation of my learning process and a note for my future self to reproduce the workflow I’ve explored
Objectives:
What is RNA-Seq?
RNA-Seq, or RNA sequencing, is a powerful technique used to analyze the transcriptome of a cell or organism. The transcriptome refers to the complete set of RNA molecules, including messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), and non-coding RNAs, that are present in a cell at a specific time. RNA-Seq allows researchers to study gene expression patterns, identify differentially expressed genes, and discover novel transcripts. read more on wiki. It’s usually challenging to sequence from RNA itself due to its instability and susceptibility to degradation. Therefore, RNA is typically converted into complementary DNA (cDNA) using a process called reverse transcription before sequencing. This cDNA is then used as the input for sequencing platforms. However, for short RNA sequence, nowadays with ONT, it does provide flowcells that can directly sequence RNA without the need for conversion to cDNA.
Let’s Look At Existing Data
Let’s dive into the bioproject. To download the raw sequence, we need to go through several clicks. And also understand what some of these abbreviations means.
Sequence Read Archive (SRA
):
SRR
– Individual sequencing runs (raw data files).
SRX
– Experiments (groups related runs).
SRS
– Samples (biological specimens).
SRP
– Projects/Studies (collections of related experiments).
Gene Expression Omnibus (GEO
):
GSM
– Individual samples with expression data.
GSE
– Series/datasets (collections of related samples).
GPL
– Platforms (microarray or sequencing platform descriptions).
GDS
– Curated datasets (processed GSE data).
Assembly Database:
ASM
– Genome assemblies.
GCA
– GenBank assemblies.
GCF
– RefSeq assemblies.
Core Sequence Databases:
NC_
– RefSeq chromosomes/complete genomes.
NM_
– RefSeq mRNA sequences.
NP_
– RefSeq protein sequences.
XM_/XP_
– Model/predicted RefSeq sequences.
AC_
– GenBank finished genomic sequences.
BioProject & BioSample:
PRJNA
– BioProject accessions (research projects).
SAMN
– BioSample accessions (biological specimens).
Alright, with that out of the way, we’re interested in SRR
, which should be under SRA
. We can see on the bioproject page that there is SRA experiments
. Click on that and it’ll bring you to a page of SRX
s. Since they are very BIG files, in terms of gigs, we can’t just download from the website, we need sra-tools
## Install follow this link instruction ## https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit ## download SRRs example fasterq-dump --progress --threads 4 SRR27792604
Now, repeat the above for the other SRRs
and you’ll get all the raw sequences in fastq
format. With our practice, we’ll only download SRR27792597
,SRR27792603
,SRR27792598
, andSRR27792604
. Two mucus and two controls. It will take sometime, which is why i like the --progress
parameter. Alright, after we’ve downloaded the sequences, let’s move on to our official workflow.
files <- c("kallisto_SRR27792597/abundance.h5", "kallisto_SRR27792603/abundance.h5", "kallisto_SRR27792598/abundance.h5", "kallisto_SRR27792604/abundance.h5")
The Workflow
TL;DR Quality control (fastp) -> Create transcriptome reference (kallisto) -> Assemble Raw Sequence (kallisto) -> Analyse (PCA & DESeq2)
QC of raw read
# install fastp # run QC fastp -i SRR27792597_1.fastq -h SRR27792597_1.html --verbose --stdout --thread 8 > /dev/null
If you’re like me who likes to know what’s going on when running, make sure to include --verbose
and I found --thread
is very helpful, otherwise it took sometime. Repeat for all fastq
files and inspect.
What is Considered Acceptable?
I’ll be honest, I’m not sure. 🤔 According to Claude, Read Quality Metrics such as Quality Scores
(Phred scores):
Q30+: >80% is good, >90% is excellent, Q20+: Should be >95%. Total Reads
: Depends on our goals: Differential expression: 10-30M reads per sample, whereas Novel transcript discovery: 50-100M+ reads. Contamination and Artifacts
: Adapter Content: <5%: Good, 5-20%: Moderate (should trim), >20%: High contamination. Duplication Rate
: <30%: Low (genomic DNA-like), 30-70%: Normal for RNA-seq, >70%: Potentially problematic. Sequence Composition:
GC Content
: Should match expected organism: Human/Mouse: ~42%, E. coli: ~50%, Yeast: ~38%; Per-base Quality: Should stay >Q28 across read length. Red Flags to Watch For Poor quality data: <10M reads for differential expression
<80% Q20 bases, Severe 3’ quality drop (below Q20), Unusual GC content spikes, 20% adapter contamination after trimming.
Our QC report is not included, but a quick glance, it looks pretty good!
Getting Reference of Transcriptome
On the article’s supplement, we found that they used FN545816.1
as reference genome. We can download them from
here. Click download
and include gff
as well.
# Install gffread if you don't have it brew install gffread # on macOS # install kallisto # go here https://pachterlab.github.io/kallisto/download.html for info # extract transcriptome sequences gffread -w transcriptome.fa -g GCA_000027105.1_ASM2710v1_genomic.fna genomic.gff kallisto index -i transcriptome.idx transcriptome.fa
You should have transcriptome.idx
and transcriptome.fa
in your current working directory. What this does is that it extracts the transcriptome sequences from the genome based on the gff
file, in our case it’s a specific
Clostridioides difficile R20291
assembly with accession FN545816.1
. The gff
file contains information about the locations of genes and other features on the genome. The -w
flag specifies the output file for the transcriptome sequences, and the -g
flag specifies the input genome sequence file. The resulting transcriptome.fa
file contains the nucleotide sequences of all transcripts annotated in the gff
file.
Now that we have the reference. Let’s assemble our raw sequence to something readable.
Assemble
# assemble, make sure to include all the other samples kallisto quant -i transcriptome.idx \ -o kallisto_SRR27792604 \ --threads 8 \ SRR27792604_1.fastq SRR27792604_2.fastq
We’d have to go through all the raw sequences we downloaded and assemble them. We’ll then see folders such as kallisto_SRR27792604
. And inside the folder, we see abundance.h5
and abundance.tsv
. The reason i prefer kallisto
over STAR
is mainly because of the speed. When I first tried STAR it was very slow and no progress bar. The output of these files are actually quite small. Let’s take a look!
library(tximport) library(DESeq2) # Create file paths for all samples files <- c("kallisto_SRR27792597/abundance.h5", "kallisto_SRR27792603/abundance.h5", "kallisto_SRR27792598/abundance.h5", "kallisto_SRR27792604/abundance.h5") # Name them names(files) <- c("mucus_rep1", "control_rep1", "mucus_rep2", "control_rep2") # Import data txi <- tximport(files, type = "kallisto", txOut = TRUE) coldata <- data.frame( condition = c("mucus", "control", "mucus", "control"), # or however they're grouped row.names = names(files) ) dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ condition)
Just as a note, there are another way we can create from matrix and count as well using tsv
.
PCA
# Perform variance stabilizing transformation vsd <- vst(dds, blind = FALSE) # Plot PCA plotPCA(vsd, intgroup = "condition")

Wow, it looks like the treatment and control are separated quite well! Let’s dive straight into differential expression analysis. If we look at figure 2B on the article, it looks very similar, even though we haven’t include the full sequences.
DESeq2
library(tidyverse) # Run DESeq2 analysis dds <- DESeq(dds) # Get results res <- results(dds, alpha = 0.01, tidy = T) # View summary summary(res) ## row baseMean log2FoldChange lfcSE ## Length:3570 Min. : 0 Min. :-8.42507 Min. :0.06850 ## Class :character 1st Qu.: 326 1st Qu.:-0.51037 1st Qu.:0.08493 ## Mode :character Median : 1313 Median : 0.00074 Median :0.10664 ## Mean : 12154 Mean :-0.01934 Mean :0.16100 ## 3rd Qu.: 4999 3rd Qu.: 0.53025 3rd Qu.:0.16896 ## Max. :3272163 Max. : 3.75721 Max. :4.98958 ## NA's :1 NA's :1 ## stat pvalue padj ## Min. :-80.75478 Min. :0.0000000 Min. :0.0000000 ## 1st Qu.: -4.39763 1st Qu.:0.0000000 1st Qu.:0.0000000 ## Median : 0.00566 Median :0.0001034 Median :0.0002068 ## Mean : -0.71728 Mean :0.1229994 Mean :0.1366817 ## 3rd Qu.: 3.44035 3rd Qu.:0.0970157 3rd Qu.:0.1293422 ## Max. : 28.59200 Max. :0.9978324 Max. :0.9978324 ## NA's :1 NA's :1 NA's :1 # positive res |> filter(log2FoldChange > 0) |> arrange(padj, desc(log2FoldChange)) |> head(10) ## row baseMean log2FoldChange lfcSE stat ## 1 gene-CDR20291_1626 2872.357 2.999390 0.10490313 28.59200 ## 2 gene-CDR20291_2014 32895.724 2.043527 0.07250627 28.18414 ## 3 gene-CDR20291_0509 54339.793 1.925179 0.07046733 27.32016 ## 4 gene-CDR20291_2174 13209.067 1.957986 0.07539631 25.96926 ## 5 gene-CDR20291_2495 24664.096 2.321013 0.08978192 25.85168 ## 6 gene-CDR20291_0508 8811.393 2.021692 0.07967500 25.37423 ## 7 gene-CDR20291_2738 9037.752 1.859512 0.07781080 23.89787 ## 8 gene-CDR20291_1446 4410.271 1.982683 0.08581234 23.10487 ## 9 gene-CDR20291_2871 3862.148 1.912250 0.08437678 22.66322 ## 10 gene-CDR20291_2017 5272.620 1.784124 0.08208607 21.73479 ## pvalue padj ## 1 8.448830e-180 1.076924e-177 ## 2 9.148740e-175 1.053286e-172 ## 3 2.444054e-164 2.643282e-162 ## 4 1.102039e-148 9.832939e-147 ## 5 2.329704e-147 2.027979e-145 ## 6 4.856404e-142 3.939206e-140 ## 7 3.223217e-126 2.170502e-124 ## 8 4.136645e-118 2.636372e-116 ## 9 1.033340e-113 6.250829e-112 ## 10 9.621927e-105 5.283178e-103 # negative res |> filter(log2FoldChange < 0) |> arrange(padj, log2FoldChange) |> head(10) ## row baseMean log2FoldChange lfcSE stat pvalue ## 1 gene-CDR20291_3145 195320.938 -8.425071 0.10432907 -80.75478 0 ## 2 gene-CDR20291_2142 14441.266 -5.528640 0.08551642 -64.65004 0 ## 3 gene-CDR20291_1557 21279.449 -4.484919 0.11271527 -39.78981 0 ## 4 gene-CDR20291_0332 11550.169 -4.080126 0.08702073 -46.88683 0 ## 5 gene-CDR20291_0331 4196.984 -3.907433 0.09507966 -41.09641 0 ## 6 gene-CDR20291_2206 6931.606 -3.821015 0.08700146 -43.91897 0 ## 7 gene-CDR20291_1078 28192.247 -3.531056 0.08204064 -43.04033 0 ## 8 gene-CDR20291_2237 6355.265 -3.491135 0.08384686 -41.63704 0 ## 9 gene-CDR20291_0330 6786.174 -3.481683 0.08283612 -42.03097 0 ## 10 gene-CDR20291_2238 6034.666 -3.430734 0.08338778 -41.14193 0 ## padj ## 1 0 ## 2 0 ## 3 0 ## 4 0 ## 5 0 ## 6 0 ## 7 0 ## 8 0 ## 9 0 ## 10 0
Interesting thing about the pvalue, we need to look at adjust pval instead because we are doing multiple comparison. Adjust pvalue is via Benjamini-Hochberg
method. This adjust the multiplicity of comparison.
Volcano plot
library(ggplot2) library(ggrepel) # Define specific genes to label genes_to_label <- c("gene-CDR20291_1626", "gene-CDR20291_0508", "gene-CDR20291_1446", "gene-CDR20291_2495", "gene-CDR20291_0455", "gene-CDR20291_0876", "gene-CDR20291_0877", "gene-CDR20291_1275", "gene-CDR20291_0875", "gene-CDR20291_3145") # Create labeled volcano plot res |> filter(!is.na(padj)) |> mutate(label = ifelse(row %in% genes_to_label, str_extract(row, "\\d+$"), "")) |> mutate(color = case_when( log2FoldChange < -1 & padj <= 0.01 ~ "negative", log2FoldChange > 1 & padj <= 0.01 ~ "positive", TRUE ~ "neutral" )) |> ggplot(aes(x = log2FoldChange, y = -log10(padj), color = color)) + geom_point(alpha = 0.2) + geom_vline(xintercept = c(-1, 1), linetype = "dashed") + geom_hline(yintercept = -log10(0.01), linetype = "dashed") + geom_text_repel(aes(label = label), size = 5, # box.padding = 0.3, max.overlaps = 20) + labs(title = "Volcano Plot - Mucus vs Control", x = "Log2 Fold Change", y = "-Log10 Adjusted P-value") + scale_color_manual(values = c("positive" = "red", "negative" = "blue", "neutral" = "grey")) + theme_bw() + theme(legend.position = "none")

If we look at figure 2A, it again looks very similar! To interpret the volcano plot, we basically look at the the top right and top left of the plot. These are the ones that are significantly differentially expressed. The top right are the upregulated genes, whereas the top left are the downregulated genes. We can see that gene-CDR20291_1626
is the most upregulated gene, whereas gene-CDR20291_3145
is the most downregulated gene. 🙌
Looking at NCBI, looks like 1626
is a
putative sodium/phosphate cotransporter [Clostridioides difficile R20291]. And 3145
is
probable protease. What this means is that Cdiff when exposed to mucus when compared to control, the putative sodium/phosphate cotransporter expressed gene was found more (in mucus group), whereas the probable protease expressed gene was found less (in mucus group).
Opportunities for improvement
- I should probably rewrite the above to a script and with function so that in the future we can easily reproduce any DE analysis
- include
esearch
orentrez
to get accession metadata for more accurate label - include
heatmap
- pathway analysis. Yes, we can DE analyze these genes, but what do they actually mean? Do they use nutrients differently etc.
Lessons Learnt
- learnt the basics of
sra-tools fasterq-dump
,fastp
,kallisto
,DESeq2
. - took me a while to find the right
reference
isolate. Found in on their supplement material and got the right one. For future reference, don’t assume they use the popular refseq, look through their procedure and get that specifically. - learnt from raw RNA-seq QC
- learnt to interpret volcano plot
If you like this article:
- please feel free to send me a comment or visit my other blogs
- please feel free to follow me on BlueSky, twitter, GitHub or Mastodon
- if you would like collaborate please feel free to contact me
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.