biomaRt and GenomeGraphs: a worked example

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As promised a few posts ago, another demonstration of the excellent biomaRt package, this time in conjunction with GenomeGraphs.

Here’s what we’re going to do:

  1. Grab some public microarray data
  2. Normalise and get a list of the most differentially-expressed probesets
  3. Use biomaRt to fetch the genes associated with those probesets
  4. Plot the data using GenomeGraphs

If you want to follow along on your own machine, it will need to be quite powerful. We’ll be processing exon arrays, which requires a 64-bit machine with at least 4 GB (and preferably, at least 8-12 GB) of RAM. As usual, I’m assuming some variant of Linux and that you’re comfortable at the command line.

1. The data
First, grab your raw data. In this example, we’ll use GEO series GSE12236, titled “Whole Genome Exon Arrays Identify Differential Expression of Alternatively Spliced, Cancer-related Genes in Lung Cancer.” It consists of n = 40 samples divided into 20 x 2 pairs. Each sample in a pair is either normal or adenocarcinoma lung tissue from the same individual. Create a directory for your project and:

wget ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE12236/GSE12236_RAW.tar
tar xvf GSE12236_RAW.tar

You should now see 40 files, ending with .CEL.gz, in your working directory. No need to unzip them.

Still a little more preparation before the main event. In the same directory as the CEL files, open a text editor and create a file named covdesc. A glance at the first few lines will tell you what this is all about:

	disease.state
GSM307537.CEL.gz	adenocarcinoma
GSM307538.CEL.gz	normal
GSM307539.CEL.gz	adenocarcinoma
GSM307540.CEL.gz	normal
GSM307541.CEL.gz	adenocarcinoma
GSM307542.CEL.gz	normal
GSM307543.CEL.gz	adenocarcinoma
GSM307544.CEL.gz	normal
GSM307545.CEL.gz	adenocarcinoma
...

Two columns: the first, with no header, a list of the CEL file names. You can quickly create this using “ls -1 *.gz > covdesc”. The second, headed “disease.state” (you can call it what you like) describes each file according to whether the sampled tissue is normal or adenocarcinoma. You get this information by looking at the samples (GSM) at the link to the GEO series.

Penultimate preparation step: fetch the file exon.pmcdf_1.1.tar.gz from the XMap website and install it using “R CMD INSTALL exon.pmcdf_1.1.tar.gz”. This is a customised CDF (chip descriptor file), used to analyse exon arrays.

And finally, we need a file that maps the exon probesets to human chromosomes. You can get these data from Affymetrix, after creating an account at their site and logging in. Scroll down to “Current NetAffx Annotation Files”, locate the latest probeset annotation CSV file, download it and then do the following:

# unzip to a large CSV file
unzip HuEx-1_0-st-v2.na30.hg19.probeset.csv.zip
# make it smaller with some grep and awk magic
grep -v "^#" HuEx-1_0-st-v2.na30.hg19.probeset.csv | awk '{FS=","; print $1,$2,$3,$4,$5,$6,$7}' > HuEx.small.csv

That will create a much smaller file (~ 62 MB or so) named HuEx.small.csv, with only the 7 columns that you need. This useful tip brought to you by Mark Robinson, as part of this aroma-project tutorial.

2. RMA normalisation
We’ll use the Bioconductor simpleaffy package to read and normalise the CEL data. Open R in your working directory and start with:

library(simpleaffy)
library(GenomeGraphs)  # used later, loads biomaRt
cel.data <- read.affy()  # and wait...
cel.data@cdfName <- "exon.pmcdf"
rma.data <- rma(cel.data)  # and wait even longer

Too easy. Next.

3. Probeset analysis
Next step – identify the most differentially-expressed probesets. There are multiple approaches to this problem. The simplest is to perform a pairwise-comparison (normal versus carcinoma) and calculate the fold-change and a simple significance statistic, such as a p-value. This simple approach is frowned upon by experienced microarray statisticians – at the very least, you should perform a multiple testing correction, since there are many more variables (probesets) than samples. However, we’ll stick with this simple, oft-used (and abused) approach for the purposes of this example.
Everything you need is in the simpleaffy package:

# first, pairwise comparison - prepare for a very long wait
pc.rma <- pairwise.comparison(rma.data, group = "disease.state", members = c("normal", "adenocarcinoma"), a.order = c(1:20), b.order = c(1:20))
# a quick peek at the top 10 largest fold-changes
sort(abs(fc(pc.rma)),decreasing=TRUE)[1:10]
2748560  2692898  3246417  3246412  2692900  2692894  3882388  3246418
6.512661 6.511786 6.375859 6.366003 6.251415 6.201049 6.184535 6.161279
2748556  3246411
6.159890 6.078726

# Now, filter the results
pf.rma <- pairwise.filter(pc.rma, min.present.no = "all", present.by.group = "F", fc = 1, tt = 0.001)

The pairwise.comparison method bears a little explanation. Our samples are paired and the pairings are reflected in the covdesc file: adenocarcinomas (1,3,5…) = normals (2,4,6…). This means that the lists normals(1:20) and adenocarcinomas(1:20) are in the correct order and we can tell pairwise.comparison to use paired replicates in the t-test.
The pairwise.filter method can apply various criteria to the results of the pairwise comparison. Here, we specify that we want probesets present on all microarrays, with fold change greater than 2 (remember, log2(2) = 1) and t-test p-value less than 0.001.
Let’s look at the 10 most up- and down-regulated probesets:

# most up
sort(fc(pf.rma),decreasing=TRUE)[1:10]
3751793  3751816  3751795  3751819  3751791  3751804  3751814  3142383
5.757751 5.354772 5.150401 5.031244 4.837986 4.819550 4.787895 4.721356
3751797  3142384
4.459196 4.441283
# most down
sort(fc(pf.rma),decreasing=F)[1:10]
2692898   3246417   3246412   2692900   2692894   3246418   3910438   3882403
-6.511786 -6.375859 -6.366003 -6.251415 -6.201049 -6.161279 -6.040208 -5.950049
2880431   3126018
-5.871162 -5.842652

4. Fetching the genes
Finally, we get to biomaRt. Let’s stay with the “top ten” probesets, which were up-regulated:

mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# top 10 up-regulated probesets as a data frame
top10.up <- as.data.frame(sort(fc(pf.rma),decreasing=TRUE)[1:10])
colnames(top10.up) <- "fc"
# fetch HGNC symbols for those probesets
hgnc <- getBM(attributes = c("affy_huex_1_0_st_v2", "hgnc_symbol"), filters = "affy_huex_1_0_st_v2", values=c(rownames(top10.up)), mart = mart)
# match probeset IDs in hgnc with top10.up
m <- match(rownames(top10.up), hgnc$affy_huex_1_0_st_v2)
# and add a new column with HGNC to top10.up
top10.up$hgnc <- hgnc[m, "hgnc_symbol"]
top10.up
             fc    hgnc
2692898 5.757751   MUC13
3246417 5.354772    MSMB
3246412 5.150401    MSMB
2692900 5.031244   MUC13
2692894 4.837986   MUC13
3246418 4.819550    MSMB
3910438 4.787895 CYP24A1
3882403 4.721356   PLUNC
2880431 4.459196  SPINK1
3126018 4.441283    FGL1

That should be reasonably self-explanatory, from the previous post on how to use biomaRt.

5. Visualisation
Finally, we move to the GenomeGraphs package for visualisation. What we want is to plot the RMA values in genomic context. As an example, we’ll take the gene MUC13, from the top of the top ten list. Let’s break down the process into a series of tasks:

  1. Fetch the exon probeset IDs that map to gene MUC13
  2. Find the chromosome, strand and transcript coordinates for gene MUC13
  3. Find the chromosomal coordinates for the exon probesets
  4. Extract the RMA values corresponding to the MUC13 probesets
  5. Plot the chromosome, RMA values and transcripts in an attractive, informative way

Here’s the R to do that.

# read the Affymetrix data file with probeset chromosomal mappings
huex <- read.csv("HuEx.small.csv", header = T)
# fetch MUC13 probesets and chromosome
muc13.probes <- getBM(attributes = c("hgnc_symbol", "chromosome_name", "affy_huex_1_0_st_v2"), filters = "hgnc_symbol", values="MUC13", mart = mart)
# make Gene and Transcript models for MUC13
muc13.gene <- makeGene("hgnc_symbol", id = "MUC13", biomart = mart)
muc13.transcripts <- makeTranscript("hgnc_symbol", id = "MUC13", biomart = mart)
# make an axis 5'-3' for (+), 3'-5' for (-)
muc13.strand <- as.character(muc13.gene@ens$strand[1])
if(muc13.strand ==  "1") {chrom.axis <- makeGenomeAxis(add53 = TRUE)} else
  if(muc13.strand == "-1") {chrom.axis <- makeGenomeAxis(add35 = TRUE)}
# find the coordinates of the exon probesets for MUC13
muc13.huex <- huex[huex$probeset_id %in% muc13.probes$affy_huex_1_0_st_v2,]
muc13.huex <- muc13.huex[sort.list(muc13.huex$start),]
# extract the RMA values for those probesets and create an exon object
muc13.expr <- exprs(rma.data)[featureNames(rma.data) %in% muc13.huex$probeset_id,]
# set some parameters for plotting
muc13.cols <- rep("grey", ncol(muc13.expr))
muc13.cols[seq(1, 40, by = 2)] <- "red"
muc13.exons <- makeExonArray(intensity = muc13.expr, probeStart = muc13.huex$start, probeEnd = muc13.huex$stop, probeId = as.character(muc13.huex$probeset_id), nProbes = muc13.huex$probe_count, dp = DisplayPars(color = muc13.cols))

# We're ready:  create the object to plot and plot it
muc13 <- list(makeTitle("MUC13"), makeIdeogram(chromosome = muc13.probes$chromosome_name[1]), "RMA value" = muc13.exons, chrom.axis, "gene" = muc13.gene, "transcripts" = muc13.transcripts)
png(filename = "muc13.png", width = 800, height = 600)
gdPlot(muc13, minBase = min(muc13.gene@ens$exon_chrom_start), maxBase = max((muc13.gene@ens$exon_chrom_end)))
dev.off()

muc13

Normal (grey) and adenocarcinoma (red) MUC13 exon expression

There’s a lot going on here, so let’s go through each section – it’s not as complicated as it first appears.
First, we read the modified probeset annotation file into a data frame (line 2). Next, we use a biomaRt query to fetch the chromosome and exon probeset names which correspond to our gene of interest, MUC13 (line 4).
In the next section, we start building the objects required to make a GenomeGraphs plot. Lines 6-7 create gene and transcripts, using the makeGene() and makeTranscript() functions, respectively. In lines 9-11, we identify the strand of the gene and use that information to make an axis running either 5′ – 3′ (strand +) or 3′ – 5′ (strand -).
The next block of code is concerned with creating an exon array object. We fetch the subset of the probe annotation data containing probesets for MUC13 and sort probesets by start coordinate (lines 13-14). The sort step is important, as probeset position will be indicated on the final plot. Line 16 extracts a matrix of RMA values for MUC13 probesets (which are already ordered correctly). Lines 18-19 define some graphical parameters: we first set all line colours to “grey”, then those for the adenocarcinoma samples (the odd-numbered lines in the covdesc file) to “red”. Line 20 uses the makeExonArray() function which takes all of these data: the RMA matrix, probeset starts/ends, probeset IDs, probes per probeset and the graphical parameters, to build an exon array object.

Line 23 generates the object to be plotted. You simply decide which tracks you want in the plot and the order in which they should appear, create the appropriate objects, then add them to a list. If you use the syntax “gene” = muc13.gene, the word “gene” appears as a label to the left of the gene track. The last steps (lines 24-26) are to create a PNG file and write to it using gdPlot() which takes the object to be plotted and the minimum/maximum chromosome coordinates to display as arguments. If all goes to plan, you should finish with a plot like the one shown above-left; click it to see a larger version.

So what does the plot tell us? As with all microarrays, the data are quite noisy but it’s clear that in general, expression of most exons is higher in adenocarcinomal tissue. There is an open-access publication associated with this dataset: “Whole genome exon arrays identify differential expression of alternatively spliced, cancer-related genes in lung cancer”. Interestingly, the authors do not identify MUC13 as a significantly up-regulated gene, in either the main text or the supplementary data. This may be due to differences in statistical procedure: remember, the approach taken in this post was the simplest (and worst) way to go. However, they do highlight a related gene, MUC4. MUC13 has been identified in several other cancer studies including gastric, colon and ovarian cancers, so it seems that over-expression of mucins is a quite widespread cancer phenotype.

That’s the basics of biomaRt and GenomeGraphs. Once you get the hang of it, it’s relatively easy to write more useful code which, for example, can batch-process many genes of interest and create plots with the appropriate titles and file names for each gene.


Filed under: bioinformatics, programming, R, statistics Tagged: bioconductor, microarray, tutorial

To leave a comment for the author, please follow the link and comment on their blog: What You're Doing Is Rather Desperate » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)