How to read BSMAP methylation ratio files into R via methylKit

October 5, 2012

(This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers)

BSMAP is an aligner for bisulfite sequencing reads. It outputs aligned reads as well as methylation ratios per base (via script). The methylation ratios can be read into R via methylKit package and regular methylKit analysis can be performed using the BSMAP data.

First,  you need to separate BSMAP methylation ratio file based on methylation context. Cs in CpG, CHH and CHG context should be separated into different files. This can be achieved with a perl or awk one-liner. If you want to analyze Cs from all contexts then of course you don’t need to separate the file. For example, to get all Cs in the CpG context you will need to do something like following:

awk ‘($3==”-” && $4~/^.{1}CG/ ) || ($3==”+” &&  $4~/^.{2}CG/)’ BSMAPexample.txt > CpG.txt

Here is how methylation ratio file from BSMAP looks like (file includes only Cs in CpG context):

chr  pos     strand context ratio total_C methy_C CI_lower CI_upper
chr1 3121589 + CGCGT 0.000 56 0 0.000 0.064
chr1 3121597 + ATCGG 0.000 56 0 0.000 0.064
chr1 3121599 + GTCGT 0.000 56 0 0.000 0.064
chr1 3121605 + CTCGG 0.000 56 0 0.000 0.064
chr1 3121606 + TGCGC 0.000 56 0 0.000 0.064
chr1 3121607 + GGCGC 0.000 56 0 0.000 0.064
chr1 3121611 + CTCGA 0.000 56 0 0.000 0.064
chr1 3121614 + TACGC 0.000 56 0 0.000 0.064
chr1 3121631 + CTCGT 0.000 56 0 0.000 0.064

You can read the methylation ratio file by using “pipeline” argument in read() function. You need to provide a list of column numbers corresponding to chr,start,end,strand,coverage and ratio of methylation. Actually, you can read any generic methylation ratio or percentage file using this method. The file needs to have the location information (chr,start,end and strand), coverage information and methylation percentage or ratio information.

Thanks to Maxime Caron for sharing the BSMAP methylation ratio file.

Reference for methylKit:
Altuna Akalin, Matthias Kormaksson, Sheng Li, Francine E. Garrett-Bakelman, Maria E. Figueroa, Ari Melnick, Christopher E. Mason.(2012)methylKit: A comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biology , 13:R87.

PS: For the newer releases of BSMAP (~v2.73), you may need to coerce the CT counts to an integer. BSMAP ratio extractor may output effective CT counts as float but they should be coerced to integer before they are read in by methylKit. methylKit (>= v0.5.6) will automatically coerce those values to nearest integer. However, you can also use awk to coerce float to integer values. The example below coerces the 6th column of the the file to integer

awk ‘{OFS=”\t”;print $1,$2,$3,$4,$5,int($6);}’ sample.BSMAP.txt

To leave a comment for the author, please follow the link and comment on their blog: Recipes, scripts and genomics. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)