Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In every statistical analysis, the first thing one should do is try and visualise the data before any modeling. In microarray studies, a common visualisation is a heatmap of gene expression data.

In this post I simulate some gene expression data and visualise it using the heatmaply package in R by Tal Galili. This package extends the plotly engine to heatmaps, allowing you to inspect certain values of the data matrix by hovering the mouse over a cell. You can also zoom into a region of the heatmap by drawing a rectangle over an area of your choice

The following function simulates data from a multivariate normal distribution, and allows exposure dependent correlations between the data. You will need the mvrnorm function from the MASS library to run this function:

simulateExprData <- function(n, n0, p, rho0, rho1){
# n: total number of subjects
# n0: number of subjects with exposure 0
# n1: number of subjects with exposure 1
# p: number of genes
# rho0: rho between Z_i and Z_j for subjects with exposure 0
# rho1: rho between Z_i and Z_j for subjects with exposure 1

# Simulate gene expression values according to exposure 0 or 1,
# according to a centered multivariate normal distribution with
# covariance between Z_i and Z_j being rho^|i-j|
n1 <- n - n0
times <- 1:p
H <- abs(outer(times, times, "-"))
V0 <- rho0^H
V1 <- rho1^H

# rows are people, columns are genes
genes0 <- MASS::mvrnorm(n = n0, mu = rep(0,p), Sigma = V0)
genes1 <- MASS::mvrnorm(n = n1, mu = rep(0,p), Sigma = V1)
genes <- rbind(genes0,genes1)
return(genes)
}

genes <- simulateExprData(n = 100, n0 = 50, p = 100,
rho0 = 0.01, rho1 = 0.95)

Next we need to install and load the heatmaply package:

install.packages('devtools')
devtools::install_github('talgalili/heatmaply')
library(heatmaply)


The syntax is extremely simple. Lets plot a few different interactive heatmaps of the data. We first plot the genes and specify how many groups we want for the rows (subjects) and columns (genes) of the data. In this case we specify 2 groups for both the rows and columns:

heatmaply(genes, k_row = 2, k_col = 2)

We see that the clustering algorithm is not able to properly cluster the subjects because we would have expected to see two equal size groups in the dendrogram on the y-axis. Let’s instead plot the correlation matrix of the genes:

heatmaply(cor(genes), k_row = 2, k_col = 2)

We now see two distinct groups in the dendrogram, as we would expect because that’s the way we generated the data