The biomaRt package

October 2, 2014

(This article was first published on log Fold Change » r, and kindly contributed to R-bloggers)

The biomaRt package allows to query the gigantic ensembl data base directly from R. Since it happens only once in a while, I find myself reading the biomaRt vignette every time I’m using it. Here are the crucial points for the most common application.

ensembl <- useMart( "ensembl", dataset="hsapiens_gene_ensembl" )
f <- listFilters( ensembl )
a <- listAttributes( ensembl )

The data frames f and a hold the available filters and attributes, respectively. To retrieve information, we need to specify by what key we are searching (filters), what keys do we want to retrieve from the database (attributes), what are the values of our search keys (values) and from which mart we want to retrieve the information.

g <- c( "CCL2", "DCN", "LIF", "PLAU", "IL6" )
d <- getBM( 
        attributes= c( "entrezgene", "description", "uniprot_genename" ), 
        filters= "hgnc_symbol", 
        values= g, 
        mart= ensembl )

This returns the following:

  entrezgene                                                      description uniprot_genename
1       6347    chemokine (C-C motif) ligand 2 [Source:HGNC Symbol;Acc:10618]             CCL2
2       1634                            decorin [Source:HGNC Symbol;Acc:2705]              DCN
3       3569 interleukin 6 (interferon, beta 2) [Source:HGNC Symbol;Acc:6018]              IL6
4       3976         leukemia inhibitory factor [Source:HGNC Symbol;Acc:6596]              LIF
5       5328   plasminogen activator, urokinase [Source:HGNC Symbol;Acc:9052]             PLAU

OK, one problem that presents itself is that if there are multiple matching values for a given key, you are getting multiple lines for that key. It may be useful to aggregate this information.

g <- c( "BTC" )
d <- getBM( attributes= c( "uniprot_genename", "description", "ensembl_gene_id" ), filters= "hgnc_symbol", values= g, mart= ensembl )

will return two lines:

  uniprot_genename                                description ensembl_gene_id
1              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000261530
2              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000174808

We can use aggregate to collapse the ensembl IDs:

ff <- function( x ) { x <- x[ ! x ) & x != "" ] ; paste( unique( x ), collapse= ";" ) }
d <- aggregate( d, by= list( d$uniprot_genename ), FUN=ff )[,-1]

The ff function ignores empty strings and NAs. Also, we remove the first column as it is added by the aggregate function. The result:

  Group.1 uniprot_genename                                description                 ensembl_gene_id
1     BTC              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000261530;ENSG00000174808

To leave a comment for the author, please follow the link and comment on their blog: log Fold Change » r. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)