Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In every statistical analysis, the first thing one should do is try and visualise the data before any modeling. In microarray studies, a common visualisation is a heatmap of gene expression data.

In this post I simulate some gene expression data and visualise it using the `heatmaply` package in `R` by Tal Galili. This package extends the plotly engine to heatmaps, allowing you to inspect certain values of the data matrix by hovering the mouse over a cell. You can also zoom into a region of the heatmap by drawing a rectangle over an area of your choice

The following function simulates data from a multivariate normal distribution, and allows exposure dependent correlations between the data. You will need the `mvrnorm` function from the MASS library to run this function:

```simulateExprData <- function(n, n0, p, rho0, rho1){
# n: total number of subjects
# n0: number of subjects with exposure 0
# n1: number of subjects with exposure 1
# p: number of genes
# rho0: rho between Z_i and Z_j for subjects with exposure 0
# rho1: rho between Z_i and Z_j for subjects with exposure 1

# Simulate gene expression values according to exposure 0 or 1,
# according to a centered multivariate normal distribution with
# covariance between Z_i and Z_j being rho^|i-j|
n1 <- n - n0
times <- 1:p
H <- abs(outer(times, times, "-"))
V0 <- rho0^H
V1 <- rho1^H

# rows are people, columns are genes
genes0 <- MASS::mvrnorm(n = n0, mu = rep(0,p), Sigma = V0)
genes1 <- MASS::mvrnorm(n = n1, mu = rep(0,p), Sigma = V1)
genes <- rbind(genes0,genes1)
return(genes)
}

genes <- simulateExprData(n = 100, n0 = 50, p = 100,
rho0 = 0.01, rho1 = 0.95)```

Next we need to install and load the `heatmaply` package:

```install.packages('devtools')
devtools::install_github('talgalili/heatmaply')
library(heatmaply)
```

The syntax is extremely simple. Lets plot a few different interactive heatmaps of the data. We first plot the `genes` and specify how many groups we want for the rows (subjects) and columns (genes) of the data. In this case we specify 2 groups for both the rows and columns:

`heatmaply(genes, k_row = 2, k_col = 2)`

We see that the clustering algorithm is not able to properly cluster the subjects because we would have expected to see two equal size groups in the dendrogram on the y-axis. Let’s instead plot the correlation matrix of the genes:

`heatmaply(cor(genes), k_row = 2, k_col = 2)`

We now see two distinct groups in the dendrogram, as we would expect because that’s the way we generated the data