Using R and Bioconductor for sequence analysis

[This article was first published on Digithead's Lab Notebook, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Here’s another quick R vignette, in case I pick this up later and need to remind myself where I got stuck. I was trying to use R for a bit of basic sequence analysis, with mixed results.

First, install the BSgenome package, which is part of Bioconductor. Get GeneR while you’re at it.

> source("")
> biocLite("BSgenome")
> biocLite("GeneR")

Follow the instructions in the document How to forge a BSgenome data package. You’ll need to get fasta files from somewhere such as NCBI’s Entrez Genome. Another nice data source is Regulatory Sequence Analysis Tools.

I created a BSgenome package for our favorite model organism Halobacterium salinarum NRC-1, which I named halo for short. Now, I can ask what sequences make up the halo genome and find out how long they are.

> library(BSgenome.halo.NCBI.1)
> seqnames(halo)
[1] "chr"     "pNRC200" "pNRC100"
> seqlengths(halo)
    chr pNRC200 pNRC100 
2014239  365425  191346
> length(halo$chr)
[1] 2014239

There are a few things I wanted to do next. First, I wanted to load a list of genes with their coordinates. That should allow me to quickly get the sequence for each gene, or get sequence of upstream regions for regulatory motif finding. Second, if I’m going to find any new protein coding regions, I’d like to have a function that could take a stretch of DNA and find ORFs (open reading frames). As far as I can tell, all there is to ORF finding is searching each reading frame for long stretches that start with a methionine (AUG) and end with a stop codon (UAG, UGA, and UAA ). Maybe there’s more to it than that.

This is where I left off. GeneR seems to use an entirely different way of encoding sequence based on buffers. I have to admit to being a little disappointed. I hope it’s just my cluelessness and there’s really a reasonable way to do this kind of thing in R and Bioconductor.

Related stuff from Blue Collar Bioinformatics

To leave a comment for the author, please follow the link and comment on their blog: Digithead's Lab Notebook. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)