Using ENCODE methylation data (RRBS) in R

March 1, 2013

(This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers)

ENCODE project has generated reduced-representation bilsulfite sequencing data for multiple cell lines. The data is organized in an extended bed format with additional columns denoting % methylation and coverage per base. Luckily, this sort of generic % methylation information can be read in by R package methylKit, which is a package for analyzing basepair resolution 5mC and 5hmC data.

The code snippets below show how to read RRBS bed file produced by ENCODE. But, let’s first download the files.



Unfortunately, methylKit currently can not read them directly because the track definition line causes a problem. It should be deleted from each bed file. Ideally, methylKit should be able to skip over lines (this issue should be fixed in later versions)
For now, we have to use some unix tools to remove the first lines from the bed files. You run the code below in your terminal. This set of commands will delete the first line from every *.gz file in the directory so be careful.

for files in *.gz
gzip -dc "$files" | tail +2 | gzip -c > "$files".tmp
if [ "$?" -eq 0 ]; then
mv "$files".tmp "$files"

Now we can read the files using methylKit. The pipeline argument defines which columns in the file are corresponding to chr,start,end,strand, percent methylation and coverage:
You can also read multiple files at a time:

Since we have read the files and now they are methylKit objects, we can use all of the methylKit functionality on these objects. For example, the code below plots the distribution of methylation % for covered bases.

getMethylationStats(obj[[1]], plot = TRUE)

You can check the methylKit vignette and the website for the rest of the functionality and details.

To leave a comment for the author, please follow the link and comment on their blog: Recipes, scripts and genomics. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training


CRC R books series

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)