Using R and Bioconductor for sequence analysis

August 26, 2009

(This article was first published on Digithead's Lab Notebook, and kindly contributed to R-bloggers)

Here’s another quick R vignette, in case I pick this up later and need to remind myself where I got stuck. I was trying to use R for a bit of basic sequence analysis, with mixed results.

First, install the BSgenome package, which is part of Bioconductor. Get GeneR while you’re at it.

> source("")
> biocLite("BSgenome")
> biocLite("GeneR")

Follow the instructions in the document How to forge a BSgenome data package. You’ll need to get fasta files from somewhere such as NCBI’s Entrez Genome. Another nice data source is Regulatory Sequence Analysis Tools.

I created a BSgenome package for our favorite model organism Halobacterium salinarum NRC-1, which I named halo for short. Now, I can ask what sequences make up the halo genome and find out how long they are.

> library(BSgenome.halo.NCBI.1)
> seqnames(halo)
[1] "chr"     "pNRC200" "pNRC100"
> seqlengths(halo)
    chr pNRC200 pNRC100 
2014239  365425  191346
> length(halo$chr)
[1] 2014239

There are a few things I wanted to do next. First, I wanted to load a list of genes with their coordinates. That should allow me to quickly get the sequence for each gene, or get sequence of upstream regions for regulatory motif finding. Second, if I’m going to find any new protein coding regions, I’d like to have a function that could take a stretch of DNA and find ORFs (open reading frames). As far as I can tell, all there is to ORF finding is searching each reading frame for long stretches that start with a methionine (AUG) and end with a stop codon (UAG, UGA, and UAA ). Maybe there’s more to it than that.

This is where I left off. GeneR seems to use an entirely different way of encoding sequence based on buffers. I have to admit to being a little disappointed. I hope it’s just my cluelessness and there’s really a reasonable way to do this kind of thing in R and Bioconductor.

Related stuff from Blue Collar Bioinformatics

To leave a comment for the author, please follow the link and comment on their blog: Digithead's Lab Notebook. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Dommino data lab

Quantide: statistical consulting and training



CRC R books series

Six Sigma Online Training

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)