I’ve been vaguely aware of BioMart for a few years. Inexplicably, I’ve only recently started to use it. It’s one of the most useful applications I’ve ever used.
The concept is simple. You have a set of identifiers that describe a biological object, such as a gene. These are called filters. They have values – for example, HGNC symbols. You want to retrieve other identifiers – attributes – for your objects.
You can use BioMart as a web application called MartView. However, R users should check out the biomaRt package, part of the Bioconductor suite. Here’s a couple of examples.
Example 1: fetch Ensembl gene identifiers given HGNC symbols
Let’s start with a simple example. You have a CSV file in which one of the fields is a HGNC symbol (with the column header “hgnc”) and you want to obtain Ensembl gene IDs.
library(biomaRt)
# define biomart object
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
# read in the file
genes <- read.csv("myfile.csv")
# query biomart
results <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), filters = "hgnc_symbol", values = genes$hgnc), mart = mart)
# sample results
ensembl_gene_id hgnc_symbol
1 ENSG00000082397 EPB41L3
2 ENSG00000168461 RAB31
3 ENSG00000176014 TUBB6
4 ENSG00000154734 ADAMTS1
5 ENSG00000197766 CFD
6 ENSG00000156284 CLDN8
You do need to know in advance that “ensembl_gene_id” and “hgnc_symbol” are valid attributes. You can get a list of all attributes for the current biomart object using “listAttributes(mart)”.
Example 2: fetch genes for microarray probesets
In this example, I assume that you have normalised some microarray samples using, for example, RMA in the affy package and used a method such as exprs() to generate a matrix of RMA values, where rows = probeset IDs and columns = sample names. We’d like to get the gene names for those probesets.
library(simpleaffy)
library(biomaRt)
mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
# assume that we are using the human exon array from Affymetrix
# read in .CEL files and RMA normalise
data <- read.affy()
data@cdfName <- "exon.pmcdf"
data.rma <- rma(data)
data.ex <- as.data.frame(exprs(data.rma))
# The attribute for exon array probesets is named "affy_huex_1_0_st_v2"
affy <- "affy_huex_1_0_st_v2"
# Next line would take a very long time for all exon probesets!
# We would probably select a subset of data.ex first
genes <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", affy), filters = affy, values=c(rownames(data.ex)), mart = mart)
# Now match the array data probesets with the genes data frame
m <- match(rownames(data.ex), genes$affy_huex_1_0_st_v2)
# And append e.g. the HGNC symbol to the array data frame
data.ex$hgnc <- genes[m, "hgnc_symbol"]
# sample result
Con1 Con2 Treat1 Treat2 hgnc
2315603 7.164521 7.107470 7.827158 7.307056 TTLL10
2315610 6.135751 6.259306 6.691880 6.532974 TTLL10
2315614 3.017279 4.602484 5.058326 5.349798 TTLL10
2315647 5.740181 5.373581 5.885912 5.756925 <NA>
2315691 6.389818 5.562760 6.853058 6.430730 SCNN1D
2315713 5.494848 6.243931 6.550043 6.336244 SCNN1D
2315720 6.422661 6.213908 6.447777 6.591330 SCNN1D
2315736 5.882034 6.250097 6.292414 6.311813 <NA>
2315741 5.314087 5.471424 5.762590 5.896435 PUSL1
2315768 2.278067 1.652001 2.430359 2.310668 <NA>
2315787 2.308838 1.912613 2.660703 2.377608 TAS1R3
2315793 4.339545 4.505362 4.974307 4.959468 TAS1R3
Summary
That’s your basic usage of biomaRt. In the next post: how to combine biomaRt with GenomeGraphs, to generate attractive plots of features and quantitative data in genomic context.
Filed under: programming, R, research diary, statistics Tagged: biomart, data integration, ensembl
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Zero Inflated Models and Generalized Linear Mixed Models with R.
Zuur, Saveliev, Ieno (2012).