Simulating genes and counts for DESeq2 analysis

January 31, 2017

(This article was first published on Let's talk about science with R, and kindly contributed to R-bloggers)

Sometimes it is helpful to simulate gene expression data to test code or to see how your results look with simulated values from a particular probability distribution. Here I am going to show you how to simulate RNAseq expression data counts from a uniform distribution with a mininum = 0 and maximum = 1200.

# Get all human gene symbols from biomaRt
mart <- useMart(biomart="ensembl", dataset = "hsapiens_gene_ensembl")
my_results <- getBM(attributes = c("hgnc_symbol"), mart=mart)

# Simulate 100 gene names to be used for our cnts matrix
my_genes <- with(my_results, sample(hgnc_symbol, size=100, replace=FALSE))

# Simulate a cnts matrix
cnts = matrix(runif(600, min=0, max=1200), ncol=6)
cnts = apply(cnts, c(1,2), as.integer)


Now, say we run DESeq2 to look for differentially expressed genes between our two simulated groups.

# Running DESEQ2 based on
grp.idx <- rep(c("KO", "WT"), each=3)
coldat=DataFrame(grp=factor(grp.idx, levels=c("WT", "KO")))

# Add the column names and gene names
colnames(cnts) <- paste(grp.idx, 1:6, sep="_")
rownames(cnts) <- my_genes

# Run DESeq2 analysis on the simulated counts
dds <- DESeqDataSetFromMatrix(cnts, colData=coldat, design = ~ grp)
dds <- DESeq(dds)
deseq2.res <- results(dds)

# -0.48840808 0.32122109 -0.55584857 0.00184246 -0.15371042 0.11555792 

Now let’s see how many simulated genes had a log2 fold change greater than 1 by chance.

# Load the fold changes from DESeq2 analysis and order in decreasing order
geneList = sort(exp.fc, decreasing = TRUE) # log FC is shown

gene <- geneList[abs(geneList) >= 1]

# C1orf216

Now it’s your turn!  What other probability distributions could we simulate data from to perform a mock RNA seq experiment to determine how many genes could be different by chance? You can even use a bootstrap approach to calculate the p-value after running 1000 permutations of the code. Of course, to circumvent these problems we use adjusted p values but it is always nice to go back to basics and stress the importance of applying statistical methods when looking at differentially expressed genes. I encourage you all to leave your answers in the comment section below to inspire others.

Happy R programming!

To leave a comment for the author, please follow the link and comment on their blog: Let's talk about science with R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)