Some basics of biomaRt

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

One of the commonest bioinformatics questions, at Biostars and elsewhere, takes the form: “I have a list of identifiers (X); I want to relate them to a second set of identifiers (Y)”. HGNC gene symbols to Ensembl Gene IDs, for example.

When this occurs I have been known to tweet “the answer is BioMart” (there are often other solutions too) and I’ve written a couple of blog posts about the R package biomaRt in the past. However, I’ve realised that we need to take a step back and ask some basic questions that new users might have. How do I find what marts and datasets are available? How do I know what attributes and filters to use? How do I specify different genome build versions?

1. What marts and datasets are available?
The function listMarts() returns a data frame with mart names and descriptions.

library(biomaRt)

marts <- listMarts()
head(marts)

              biomart                             version
1             ensembl        ENSEMBL GENES 79 (SANGER UK)
2                 snp    ENSEMBL VARIATION 79 (SANGER UK)
3          regulation   ENSEMBL REGULATION 79 (SANGER UK)
4                vega                VEGA 59  (SANGER UK)
5       fungi_mart_26           ENSEMBL FUNGI 26 (EBI UK)
6 fungi_variations_26 ENSEMBL FUNGI VARIATION 26 (EBI UK)

The first 4 of those are what you would see if you visited Ensembl BioMart on the Web and clicked the “choose database” drop-down box.

Given a mart object, listDatasets() returns the available datasets. We get a mart object using useMart().

datasets <- listDatasets(useMart("ensembl"))
head(datasets)

                         dataset                                description version
1         oanatinus_gene_ensembl     Ornithorhynchus anatinus genes (OANA5)   OANA5
2        cporcellus_gene_ensembl            Cavia porcellus genes (cavPor3) cavPor3
3        gaculeatus_gene_ensembl     Gasterosteus aculeatus genes (BROADS1) BROADS1
4         lafricana_gene_ensembl         Loxodonta africana genes (loxAfr3) loxAfr3
5 itridecemlineatus_gene_ensembl Ictidomys tridecemlineatus genes (spetri2) spetri2
6        choffmanni_gene_ensembl        Choloepus hoffmanni genes (choHof1) choHof1

2. How to specify a different genome build version?
Assuming that you want human genes in the most recent Ensembl build, you supply mart and dataset information from the previous step to useMart() like this:

mart.hs <- useMart("ensembl", "hsapiens_gene_ensembl")

What if you want an older genome build? First, visit the Ensembl Archives page for more information. Next, use the URL information there to supply a host = argument. For example, to see what marts are available for Ensembl version 72 (genome build GRCh37.p11):

listMarts(host = "jun2013.archive.ensembl.org")

               biomart               version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 72
2     ENSEMBL_MART_SNP  Ensembl Variation 72
3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 72
4    ENSEMBL_MART_VEGA               Vega 52
5                pride        PRIDE (EBI UK)

Available datasets can be found as before with the addition of the host = argument to useMart(). To create the version 72 mart object:

mart72.hs <- useMart("ENSEMBL_MART_ENSEMBL", "hsapiens_gene_ensembl", host = "jun2013.archive.ensembl.org")

3. What filters and attributes can I use?
Let’s review the BioMart terminology.
Attributes are the identifiers that you want to retrieve. For example HGNC gene ID, chromosome name, Ensembl transcript ID.
Filters are the identifiers that you supply in a query. Some but not all of the filter names may be the same as the attribute names.
Values are the filter identifiers themselves. For example the values of the filter “HGNC symbol” could be 3 genes “TP53″, “SRY” and “KIAA1199″.

But how do you know what attributes and filters are available? If you guessed listAttributes() and listFilters(), you’d be right.

attributes <- listAttributes(mart.hs)
head(attributes)

                   name           description
1       ensembl_gene_id       Ensembl Gene ID
2 ensembl_transcript_id Ensembl Transcript ID
3    ensembl_peptide_id    Ensembl Protein ID
4       ensembl_exon_id       Ensembl Exon ID
5           description           Description
6       chromosome_name       Chromosome Name

filters <- listFilters(mart.hs)
head(filters)

             name     description
1 chromosome_name Chromosome name
2           start Gene Start (bp)
3             end   Gene End (bp)
4      band_start      Band Start
5        band_end        Band End
6    marker_start    Marker Start

You can search for specific attributes by running grep() on the name. For example, if you’re looking for Affymetrix microarray probeset IDs:

attributes[grep("affy", attributes$name),]

                   name                  description
102        affy_hc_g110        Affy HC G110 probeset
103       affy_hg_focus       Affy HG FOCUS probeset
104 affy_hg_u133_plus_2 Affy HG U133-PLUS-2 probeset
105     affy_hg_u133a_2     Affy HG U133A_2 probeset
106       affy_hg_u133a       Affy HG U133A probeset
107       affy_hg_u133b       Affy HG U133B probeset
108      affy_hg_u95av2      Affy HG U95AV2 probeset
109        affy_hg_u95b        Affy HG U95B probeset
110        affy_hg_u95c        Affy HG U95C probeset
111        affy_hg_u95d        Affy HG U95D probeset
112        affy_hg_u95e        Affy HG U95E probeset
113        affy_hg_u95a        Affy HG U95A probeset
114       affy_hugenefl      Affy HuGene FL probeset
115      affy_primeview               Affy primeview
116       affy_u133_x3p       Affy U133 X3P probeset

4. A worked example
As an example, let’s find SNPs on the Y chromosome which have starts that are located within exons. First, we create two mart objects for the genes and SNP datasets. We query the first mart for exons on chromosome Y and the second for SNPs on the same chromosome. Finally we’ll employ the incredibly-useful findOverlaps() function from the GenomicRanges package to get the exonic SNPs.

There are various ways to run BioMart queries; I find the simplest is to use getBM().

library(biomaRt)
library(GenomicRanges)

mart.hs     <- useMart("ensembl", "hsapiens_gene_ensembl")
mart.hs_snp <- useMart("snp", "hsapiens_snp")

exons <- getBM(attributes = c("ensembl_gene_id", "ensembl_exon_id", "chromosome_name", "exon_chrom_start", "exon_chrom_end", "strand"), filters = "chromosome_name", values = "Y", mart = mart.hs)
snps  <- getBM(attributes = c("refsnp_id", "chr_name", "chrom_start"), filters = "chr_name", values = "Y", mart = mart.hs_snp)

gr.exons <- GRanges(seqnames = "chrY", ranges = IRanges(start = exons$exon_chrom_start, end = exons$exon_chrom_end, names = exons$ensembl_exon_id))
gr.snps  <- GRanges(seqnames = "chrY", ranges = IRanges(start = snps$chrom_start, end = snps$chrom_start, names = snps$refsnp_id))
ov <- findOverlaps(gr.snps, gr.exons, type = "within")

# ov is a matrix with 2 columns
# col 1 = row numbers for query hits (snps); col 2 = row numbers for subject hits (exons)
exons_snps <- cbind(exons[ov@subjectHits, ], snps[ov@queryHits, ])

head(exons_snps)

     ensembl_gene_id ensembl_exon_id chromosome_name exon_chrom_start exon_chrom_end strand refsnp_id chr_name chrom_start
1967 ENSG00000231341 ENSE00001732601               Y          5207215        5208069     -1    rs4913        Y     5207285
1792 ENSG00000226092 ENSE00001743398               Y         21927521       21927819     -1    rs3802        Y    21927601
588  ENSG00000224657 ENSE00001783563               Y         22658581       22658878      1    rs3802        Y    22658799
2231 ENSG00000215540 ENSE00001693911               Y         23537308       23537606     -1    rs3802        Y    23537388
74   ENSG00000215507 ENSE00001733694               Y         26133010       26133308      1    rs3802        Y    26133228
275  ENSG00000215414 ENSE00001542224               Y         13286638       13287378      1   rs15434        Y    13287334

Filed under: bioinformatics, programming, R, statistics Tagged: biomart, how to, tutorial

To leave a comment for the author, please follow the link and comment on their blog: What You're Doing Is Rather Desperate » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)