The biomaRt package

[This article was first published on log Fold Change » r, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The biomaRt package allows to query the gigantic ensembl data base directly from R. Since it happens only once in a while, I find myself reading the biomaRt vignette every time I’m using it. Here are the crucial points for the most common application.

ensembl <- useMart( "ensembl", dataset="hsapiens_gene_ensembl" )
f <- listFilters( ensembl )
a <- listAttributes( ensembl )

The data frames f and a hold the available filters and attributes, respectively. To retrieve information, we need to specify by what key we are searching (filters), what keys do we want to retrieve from the database (attributes), what are the values of our search keys (values) and from which mart we want to retrieve the information.

g <- c( "CCL2", "DCN", "LIF", "PLAU", "IL6" )
d <- getBM( 
        attributes= c( "entrezgene", "description", "uniprot_genename" ), 
        filters= "hgnc_symbol", 
        values= g, 
        mart= ensembl )

This returns the following:

  entrezgene                                                      description uniprot_genename
1       6347    chemokine (C-C motif) ligand 2 [Source:HGNC Symbol;Acc:10618]             CCL2
2       1634                            decorin [Source:HGNC Symbol;Acc:2705]              DCN
3       3569 interleukin 6 (interferon, beta 2) [Source:HGNC Symbol;Acc:6018]              IL6
4       3976         leukemia inhibitory factor [Source:HGNC Symbol;Acc:6596]              LIF
5       5328   plasminogen activator, urokinase [Source:HGNC Symbol;Acc:9052]             PLAU

OK, one problem that presents itself is that if there are multiple matching values for a given key, you are getting multiple lines for that key. It may be useful to aggregate this information.

g <- c( "BTC" )
d <- getBM( attributes= c( "uniprot_genename", "description", "ensembl_gene_id" ), filters= "hgnc_symbol", values= g, mart= ensembl )

will return two lines:

  uniprot_genename                                description ensembl_gene_id
1              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000261530
2              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000174808

We can use aggregate to collapse the ensembl IDs:

ff <- function( x ) { x <- x[ !is.na( x ) & x != "" ] ; paste( unique( x ), collapse= ";" ) }
d <- aggregate( d, by= list( d$uniprot_genename ), FUN=ff )[,-1]

The ff function ignores empty strings and NAs. Also, we remove the first column as it is added by the aggregate function. The result:

  Group.1 uniprot_genename                                description                 ensembl_gene_id
1     BTC              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000261530;ENSG00000174808

To leave a comment for the author, please follow the link and comment on their blog: log Fold Change » r.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)