# biomaRt and GenomeGraphs: a worked example

June 6, 2010
By

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

As promised a few posts ago, another demonstration of the excellent biomaRt package, this time in conjunction with GenomeGraphs.

Here’s what we’re going to do:

1. Grab some public microarray data
2. Normalise and get a list of the most differentially-expressed probesets
3. Use biomaRt to fetch the genes associated with those probesets
4. Plot the data using GenomeGraphs

If you want to follow along on your own machine, it will need to be quite powerful. We’ll be processing exon arrays, which requires a 64-bit machine with at least 4 GB (and preferably, at least 8-12 GB) of RAM. As usual, I’m assuming some variant of Linux and that you’re comfortable at the command line.

1. The data
First, grab your raw data. In this example, we’ll use GEO series GSE12236, titled “Whole Genome Exon Arrays Identify Differential Expression of Alternatively Spliced, Cancer-related Genes in Lung Cancer.” It consists of n = 40 samples divided into 20 x 2 pairs. Each sample in a pair is either normal or adenocarcinoma lung tissue from the same individual. Create a directory for your project and:

wget ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE12236/GSE12236_RAW.tar
tar xvf GSE12236_RAW.tar


You should now see 40 files, ending with .CEL.gz, in your working directory. No need to unzip them.

Still a little more preparation before the main event. In the same directory as the CEL files, open a text editor and create a file named covdesc. A glance at the first few lines will tell you what this is all about:

	disease.state
GSM307538.CEL.gz	normal
GSM307540.CEL.gz	normal
GSM307542.CEL.gz	normal
GSM307544.CEL.gz	normal
...


Two columns: the first, with no header, a list of the CEL file names. You can quickly create this using “ls -1 *.gz > covdesc”. The second, headed “disease.state” (you can call it what you like) describes each file according to whether the sampled tissue is normal or adenocarcinoma. You get this information by looking at the samples (GSM) at the link to the GEO series.

Penultimate preparation step: fetch the file exon.pmcdf_1.1.tar.gz from the XMap website and install it using “R CMD INSTALL exon.pmcdf_1.1.tar.gz”. This is a customised CDF (chip descriptor file), used to analyse exon arrays.

And finally, we need a file that maps the exon probesets to human chromosomes. You can get these data from Affymetrix, after creating an account at their site and logging in. Scroll down to “Current NetAffx Annotation Files”, locate the latest probeset annotation CSV file, download it and then do the following:

# unzip to a large CSV file
unzip HuEx-1_0-st-v2.na30.hg19.probeset.csv.zip
# make it smaller with some grep and awk magic
grep -v "^#" HuEx-1_0-st-v2.na30.hg19.probeset.csv | awk '{FS=","; print $1,$2,$3,$4,$5,$6,$7}' > HuEx.small.csv  That will create a much smaller file (~ 62 MB or so) named HuEx.small.csv, with only the 7 columns that you need. This useful tip brought to you by Mark Robinson, as part of this aroma-project tutorial. 2. RMA normalisation We’ll use the Bioconductor simpleaffy package to read and normalise the CEL data. Open R in your working directory and start with: library(simpleaffy) library(GenomeGraphs) # used later, loads biomaRt cel.data <- read.affy() # and wait... cel.data@cdfName <- "exon.pmcdf" rma.data <- rma(cel.data) # and wait even longer  Too easy. Next. 3. Probeset analysis Next step – identify the most differentially-expressed probesets. There are multiple approaches to this problem. The simplest is to perform a pairwise-comparison (normal versus carcinoma) and calculate the fold-change and a simple significance statistic, such as a p-value. This simple approach is frowned upon by experienced microarray statisticians – at the very least, you should perform a multiple testing correction, since there are many more variables (probesets) than samples. However, we’ll stick with this simple, oft-used (and abused) approach for the purposes of this example. Everything you need is in the simpleaffy package: # first, pairwise comparison - prepare for a very long wait pc.rma <- pairwise.comparison(rma.data, group = "disease.state", members = c("normal", "adenocarcinoma"), a.order = c(1:20), b.order = c(1:20)) # a quick peek at the top 10 largest fold-changes sort(abs(fc(pc.rma)),decreasing=TRUE)[1:10] 2748560 2692898 3246417 3246412 2692900 2692894 3882388 3246418 6.512661 6.511786 6.375859 6.366003 6.251415 6.201049 6.184535 6.161279 2748556 3246411 6.159890 6.078726 # Now, filter the results pf.rma <- pairwise.filter(pc.rma, min.present.no = "all", present.by.group = "F", fc = 1, tt = 0.001)  The pairwise.comparison method bears a little explanation. Our samples are paired and the pairings are reflected in the covdesc file: adenocarcinomas (1,3,5…) = normals (2,4,6…). This means that the lists normals(1:20) and adenocarcinomas(1:20) are in the correct order and we can tell pairwise.comparison to use paired replicates in the t-test. The pairwise.filter method can apply various criteria to the results of the pairwise comparison. Here, we specify that we want probesets present on all microarrays, with fold change greater than 2 (remember, log2(2) = 1) and t-test p-value less than 0.001. Let’s look at the 10 most up- and down-regulated probesets: # most up sort(fc(pf.rma),decreasing=TRUE)[1:10] 3751793 3751816 3751795 3751819 3751791 3751804 3751814 3142383 5.757751 5.354772 5.150401 5.031244 4.837986 4.819550 4.787895 4.721356 3751797 3142384 4.459196 4.441283 # most down sort(fc(pf.rma),decreasing=F)[1:10] 2692898 3246417 3246412 2692900 2692894 3246418 3910438 3882403 -6.511786 -6.375859 -6.366003 -6.251415 -6.201049 -6.161279 -6.040208 -5.950049 2880431 3126018 -5.871162 -5.842652  4. Fetching the genes Finally, we get to biomaRt. Let’s stay with the “top ten” probesets, which were up-regulated: mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl") # top 10 up-regulated probesets as a data frame top10.up <- as.data.frame(sort(fc(pf.rma),decreasing=TRUE)[1:10]) colnames(top10.up) <- "fc" # fetch HGNC symbols for those probesets hgnc <- getBM(attributes = c("affy_huex_1_0_st_v2", "hgnc_symbol"), filters = "affy_huex_1_0_st_v2", values=c(rownames(top10.up)), mart = mart) # match probeset IDs in hgnc with top10.up m <- match(rownames(top10.up), hgnc$affy_huex_1_0_st_v2)
# and add a new column with HGNC to top10.up
top10.up$hgnc <- hgnc[m, "hgnc_symbol"] top10.up fc hgnc 2692898 5.757751 MUC13 3246417 5.354772 MSMB 3246412 5.150401 MSMB 2692900 5.031244 MUC13 2692894 4.837986 MUC13 3246418 4.819550 MSMB 3910438 4.787895 CYP24A1 3882403 4.721356 PLUNC 2880431 4.459196 SPINK1 3126018 4.441283 FGL1  That should be reasonably self-explanatory, from the previous post on how to use biomaRt. 5. Visualisation Finally, we move to the GenomeGraphs package for visualisation. What we want is to plot the RMA values in genomic context. As an example, we’ll take the gene MUC13, from the top of the top ten list. Let’s break down the process into a series of tasks: 1. Fetch the exon probeset IDs that map to gene MUC13 2. Find the chromosome, strand and transcript coordinates for gene MUC13 3. Find the chromosomal coordinates for the exon probesets 4. Extract the RMA values corresponding to the MUC13 probesets 5. Plot the chromosome, RMA values and transcripts in an attractive, informative way Here’s the R to do that. # read the Affymetrix data file with probeset chromosomal mappings huex <- read.csv("HuEx.small.csv", header = T) # fetch MUC13 probesets and chromosome muc13.probes <- getBM(attributes = c("hgnc_symbol", "chromosome_name", "affy_huex_1_0_st_v2"), filters = "hgnc_symbol", values="MUC13", mart = mart) # make Gene and Transcript models for MUC13 muc13.gene <- makeGene("hgnc_symbol", id = "MUC13", biomart = mart) muc13.transcripts <- makeTranscript("hgnc_symbol", id = "MUC13", biomart = mart) # make an axis 5'-3' for (+), 3'-5' for (-) muc13.strand <- as.character(muc13.gene@ens$strand[1])
if(muc13.strand ==  "1") {chrom.axis <- makeGenomeAxis(add53 = TRUE)} else
if(muc13.strand == "-1") {chrom.axis <- makeGenomeAxis(add35 = TRUE)}
# find the coordinates of the exon probesets for MUC13
muc13.huex <- huex[huex$probeset_id %in% muc13.probes$affy_huex_1_0_st_v2,]
muc13.huex <- muc13.huex[sort.list(muc13.huex$start),] # extract the RMA values for those probesets and create an exon object muc13.expr <- exprs(rma.data)[featureNames(rma.data) %in% muc13.huex$probeset_id,]
# set some parameters for plotting
muc13.cols <- rep("grey", ncol(muc13.expr))
muc13.cols[seq(1, 40, by = 2)] <- "red"
muc13.exons <- makeExonArray(intensity = muc13.expr, probeStart = muc13.huex$start, probeEnd = muc13.huex$stop, probeId = as.character(muc13.huex$probeset_id), nProbes = muc13.huex$probe_count, dp = DisplayPars(color = muc13.cols))

# We're ready:  create the object to plot and plot it
muc13 <- list(makeTitle("MUC13"), makeIdeogram(chromosome = muc13.probes$chromosome_name[1]), "RMA value" = muc13.exons, chrom.axis, "gene" = muc13.gene, "transcripts" = muc13.transcripts) png(filename = "muc13.png", width = 800, height = 600) gdPlot(muc13, minBase = min(muc13.gene@ens$exon_chrom_start), maxBase = max((muc13.gene@ens\$exon_chrom_end)))
dev.off()


Normal (grey) and adenocarcinoma (red) MUC13 exon expression

There’s a lot going on here, so let’s go through each section – it’s not as complicated as it first appears.
First, we read the modified probeset annotation file into a data frame (line 2). Next, we use a biomaRt query to fetch the chromosome and exon probeset names which correspond to our gene of interest, MUC13 (line 4).
In the next section, we start building the objects required to make a GenomeGraphs plot. Lines 6-7 create gene and transcripts, using the makeGene() and makeTranscript() functions, respectively. In lines 9-11, we identify the strand of the gene and use that information to make an axis running either 5′ – 3′ (strand +) or 3′ – 5′ (strand -).
The next block of code is concerned with creating an exon array object. We fetch the subset of the probe annotation data containing probesets for MUC13 and sort probesets by start coordinate (lines 13-14). The sort step is important, as probeset position will be indicated on the final plot. Line 16 extracts a matrix of RMA values for MUC13 probesets (which are already ordered correctly). Lines 18-19 define some graphical parameters: we first set all line colours to “grey”, then those for the adenocarcinoma samples (the odd-numbered lines in the covdesc file) to “red”. Line 20 uses the makeExonArray() function which takes all of these data: the RMA matrix, probeset starts/ends, probeset IDs, probes per probeset and the graphical parameters, to build an exon array object.

Line 23 generates the object to be plotted. You simply decide which tracks you want in the plot and the order in which they should appear, create the appropriate objects, then add them to a list. If you use the syntax “gene” = muc13.gene, the word “gene” appears as a label to the left of the gene track. The last steps (lines 24-26) are to create a PNG file and write to it using gdPlot() which takes the object to be plotted and the minimum/maximum chromosome coordinates to display as arguments. If all goes to plan, you should finish with a plot like the one shown above-left; click it to see a larger version.

So what does the plot tell us? As with all microarrays, the data are quite noisy but it’s clear that in general, expression of most exons is higher in adenocarcinomal tissue. There is an open-access publication associated with this dataset: “Whole genome exon arrays identify differential expression of alternatively spliced, cancer-related genes in lung cancer”. Interestingly, the authors do not identify MUC13 as a significantly up-regulated gene, in either the main text or the supplementary data. This may be due to differences in statistical procedure: remember, the approach taken in this post was the simplest (and worst) way to go. However, they do highlight a related gene, MUC4. MUC13 has been identified in several other cancer studies including gastric, colon and ovarian cancers, so it seems that over-expression of mucins is a quite widespread cancer phenotype.

That’s the basics of biomaRt and GenomeGraphs. Once you get the hang of it, it’s relatively easy to write more useful code which, for example, can batch-process many genes of interest and create plots with the appropriate titles and file names for each gene.

Filed under: bioinformatics, programming, R, statistics Tagged: bioconductor, microarray, tutorial

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...