methylKit v0.9.6

June 2, 2016
By

(This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers)

We released a new version of methylKit, which is a package for DNA methylation analysis with bisulfite-seq data. This version comes with many changes summarized below. you can also have a look at the release notes. The vignette is now converted to HTML and also lives on rpubs.

Working with large and/or many files

This turns out to be more and more of a problem. When we first designed methylKit it was tested on RRBS data with couple of million CpGs covered. These days many people have either many RRBS or similar reduced representation sequencing results or a fair amount whole-genome bisulfite sequencing. Due to their size, it may be difficult to work with these kind of samples in-memory. We developed a flatfile database based on “tabix”(http://www.htslib.org/doc/tabix.html). This means large objects are written out as tabix files, they are compressed and indexed. We can work with them chunk-by-chunk in a parallelized manner. However, that comes with a cost. It will take time to initially create tabix files but you will be able to work with large data sets without so much of the memory bottleneck.

C/C++ based parsing of Bismark SAM/BAM files

We used to parse sorted Bismark SAM files with a perl script. Now, that script is converted to C/C++ using samtools libraries and Rcpp in the background. This made parsing faster also got rid of a system dependency. processBismarkAln() function is the one where this functionality is implemented. It can parse sorted BAM/SAM files.

More differential methylation tests and dealing with covariates

We added more differential methylation tests. Main improvement is that we can adjust for overdispersion within the logistic regression model. In addition, we can also add covariates such as age and sex into the model and take them into account when testing for differential methylation. We also added a beta-binomial distribution based test by wrapping the functions in the bioconductor DSS package.

Methylation profile segmentation 

Regulatory regions can be discovered by analyzing the methylation patterns, segments with low methylation could indicate a regulatory region. We used a segmentation strategy based on change-point analysis, where change points in a signal across the genome is extracted and the genome is segmented to regions between two change points These methods are typically used in CNV (copy-number variation) detection but could have applications in methylation context as well.  In the context of methylation, we first extract segments between change points of the signal (methylation profiles) then cluster those segments into groups based on the mean methylation values of the segments. In this case, a segment is a region of the genome where CpGs have similar methylation values. We cluster those segments into segment groups using mixture modeling. This functionality is also mentioned here in this blog before, now it is a part of the current version.

Reading processed Bismark files 

Bismark aligner scripts can produce per base methylation files. These files, namely “cytosine report”  and “coverage” files can be read in R using methylKit methRead() function. See Bismark manual for details on the files.

Function name changes 

The following function names are changed for package wide naming consistency. The old names will continue to work for a while.
    – read() to methRead()
    – read.bismark to processBismarkAln()
    – adjust.methylC() to adjustMethylC()
    – get.methylDiff() to getMethylDiff()
    – annotate.WithFeature() to annotateWithFeature()
    – annotate.WithFeature.Flank() to annotateWithFeatureFlank()
    – annotate.WithGenicParts() to annotateWithGenicParts()
    – read.bed() to readBed()
    – read.feature.flank() to readFeatureFlank()
    – read.transcript.features() to readTranscriptFeatures()

To leave a comment for the author, please follow the link and comment on their blog: Recipes, scripts and genomics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)