BioMart (and biomaRt)

March 26, 2010

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

I’ve been vaguely aware of BioMart for a few years. Inexplicably, I’ve only recently started to use it. It’s one of the most useful applications I’ve ever used.

The concept is simple. You start with a set of identifiers that describe biological objects, such as genes. The type of identifier you are searching on (for example, HGNC symbol) is called a filter, and the identifiers themselves are its values. You then ask for other information about those objects, such as their Ensembl gene IDs. These are called attributes.
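
To make the vocabulary concrete, here is a toy sketch in plain R that mimics the shape of a query against a tiny local table. It is only an analogy: toy_mart and toy_query are hypothetical names invented for this illustration, not part of biomaRt, and the real service does far more than subset a data frame. The three rows are taken from the sample results in Example 1.

```r
# A tiny local table standing in for a BioMart dataset (illustration only).
toy_mart <- data.frame(
  ensembl_gene_id = c("ENSG00000082397", "ENSG00000168461", "ENSG00000176014"),
  hgnc_symbol     = c("EPB41L3", "RAB31", "TUBB6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where the filter column matches one of the
# values, and return only the requested attribute columns.
toy_query <- function(attributes, filters, values, table) {
  keep <- table[[filters]] %in% values
  table[keep, attributes, drop = FALSE]
}

res <- toy_query(
  attributes = c("ensembl_gene_id", "hgnc_symbol"),  # what we want back
  filters    = "hgnc_symbol",                        # what we search on
  values     = c("RAB31", "TUBB6"),                  # the identifiers we have
  table      = toy_mart
)
print(res)
```

The argument names deliberately mirror those of getBM() in the examples that follow.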

You can use BioMart as a web application called MartView. However, R users should check out the biomaRt package, part of the Bioconductor suite. Here are a couple of examples.

Example 1: fetch Ensembl gene identifiers given HGNC symbols
Let’s start with a simple example. You have a CSV file in which one of the fields is an HGNC symbol (with the column header “hgnc”) and you want to obtain Ensembl gene IDs.

library(biomaRt)
# define biomart object
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
# read in the file
genes <- read.csv("myfile.csv")
# query biomart
results <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), filters = "hgnc_symbol", values = genes$hgnc, mart = mart)
# sample results
  ensembl_gene_id hgnc_symbol
1 ENSG00000082397     EPB41L3
2 ENSG00000168461       RAB31
3 ENSG00000176014       TUBB6
4 ENSG00000154734     ADAMTS1
5 ENSG00000197766         CFD
6 ENSG00000156284       CLDN8

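You do need to know in advance that “ensembl_gene_id” and “hgnc_symbol” are valid attributes, and that “hgnc_symbol” is also a valid filter. You can get a list of all attributes for the current biomart object using “listAttributes(mart)”, and a list of all filters using “listFilters(mart)”.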
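
Those lists are long, so it helps to search them. Below is a minimal sketch of one way to do that. find_fields is a hypothetical helper written for this post, not a biomaRt function, and it assumes the table has "name" and "description" columns, as the data frames returned by listAttributes() and listFilters() do. So that it runs without a network connection, the example uses a small hand-made table whose descriptions are illustrative rather than copied from Ensembl.

```r
# Hypothetical helper: return rows whose name or description matches a pattern.
find_fields <- function(tbl, pattern) {
  hit <- grepl(pattern, tbl$name, ignore.case = TRUE) |
         grepl(pattern, tbl$description, ignore.case = TRUE)
  tbl[hit, c("name", "description"), drop = FALSE]
}

# Toy stand-in for listAttributes(mart); descriptions are made up for illustration.
attrs <- data.frame(
  name        = c("ensembl_gene_id", "hgnc_symbol", "affy_huex_1_0_st_v2"),
  description = c("Ensembl gene ID", "HGNC symbol", "Affymetrix exon array probeset"),
  stringsAsFactors = FALSE
)

hgnc_hits <- find_fields(attrs, "hgnc")
print(hgnc_hits)

# With a live connection (needs network access), the same helper could be used as:
# find_fields(listAttributes(mart), "hgnc")
# find_fields(listFilters(mart), "affy")
```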

Example 2: fetch genes for microarray probesets
In this example, I assume that you have normalised some microarray samples using, for example, RMA in the affy package and used a method such as exprs() to generate a matrix of RMA values, where rows = probeset IDs and columns = sample names. We’d like to get the gene names for those probesets.

library(simpleaffy)
library(biomaRt)
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
# assume that we are using the human exon array from Affymetrix
# read in .CEL files and RMA normalise
data <- read.affy()
data@cdfName <- "exon.pmcdf"
data.rma <- rma(data)
data.ex <- as.data.frame(exprs(data.rma))
# The attribute for exon array probesets is named "affy_huex_1_0_st_v2"
affy <- "affy_huex_1_0_st_v2"
# Next line would take a very long time for all exon probesets!
# We would probably select a subset of data.ex first
genes <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", affy), filters = affy, values = rownames(data.ex), mart = mart)
# Now match the array data probesets with the genes data frame
m <- match(rownames(data.ex), genes$affy_huex_1_0_st_v2)
# And append e.g. the HGNC symbol to the array data frame
data.ex$hgnc <- genes[m, "hgnc_symbol"]
# sample result
            Con1     Con2   Treat1   Treat2   hgnc
2315603 7.164521 7.107470 7.827158 7.307056 TTLL10
2315610 6.135751 6.259306 6.691880 6.532974 TTLL10
2315614 3.017279 4.602484 5.058326 5.349798 TTLL10
2315647 5.740181 5.373581 5.885912 5.756925   <NA>
2315691 6.389818 5.562760 6.853058 6.430730 SCNN1D
2315713 5.494848 6.243931 6.550043 6.336244 SCNN1D
2315720 6.422661 6.213908 6.447777 6.591330 SCNN1D
2315736 5.882034 6.250097 6.292414 6.311813   <NA>
2315741 5.314087 5.471424 5.762590 5.896435  PUSL1
2315768 2.278067 1.652001 2.430359 2.310668   <NA>
2315787 2.308838 1.912613 2.660703 2.377608 TAS1R3
2315793 4.339545 4.505362 4.974307 4.959468 TAS1R3
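
One thing worth knowing about the match() step: it returns only the first matching row for each probeset, and NA where there is no match at all. The toy sketch below shows both behaviours without needing array data or a network connection. The data frames are small stand-ins invented for this illustration, and "SYMBOL_X" is a placeholder rather than a real gene symbol.

```r
# Toy expression data; probeset IDs are the row names.
data.ex <- data.frame(
  Con1   = c(7.2, 6.1, 5.7, 6.4),
  Treat1 = c(7.8, 6.7, 5.9, 6.9),
  row.names = c("2315603", "2315610", "2315647", "2315691")
)

# Toy stand-in for the getBM() result: one probeset appears twice and
# one probeset from data.ex (2315647) is absent.
genes <- data.frame(
  affy_huex_1_0_st_v2 = c("2315603", "2315610", "2315691", "2315691"),
  hgnc_symbol         = c("TTLL10", "TTLL10", "SCNN1D", "SYMBOL_X"),
  stringsAsFactors = FALSE
)

# Same two lines as in the example above.
m <- match(rownames(data.ex), genes$affy_huex_1_0_st_v2)
data.ex$hgnc <- genes[m, "hgnc_symbol"]
print(data.ex)

# Probesets with more than one row in genes; match() kept only the first.
ids <- genes$affy_huex_1_0_st_v2
dup_ids <- unique(ids[duplicated(ids)])
print(dup_ids)
```

If a probeset maps to several genes, only one symbol survives in the final table, so it may be worth checking for duplicates like this before relying on the annotation.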

Summary
That’s your basic usage of biomaRt. In the next post: how to combine biomaRt with GenomeGraphs, to generate attractive plots of features and quantitative data in genomic context.


