It Takes 2 Lines of R Code to Discover Interesting Biology

October 23, 2012

(This article was first published on mintgene » R, and kindly contributed to R-bloggers)

The following biological phenomenon demonstrates just how elegant R code can be.

In vertebrate genomes, a methyl group (-CH3) can be added to nucleotides. Such process of methylation is commonly associated with gene suppression. Most of the cytosines in the system are methylated (apart from structures called CpG islands). If a spontaneous event of deamination (removal of amine group) occurs on cytosine, different outcomes happen depending on the methylation state. If cytosine was not methylated, it is then converted to uracil nucleotide, which is not a constituent member of DNA sequences and is quickly repaired by the system. However, if the cytosine is methylated, conversion to another nucleotide occurs – thymine. Since thymine is natural part of each DNA sequence, system cannot recognise this change. Thus in vertebrate genomes, we see much lower proportion of CpG dinucleotides than expected.

It takes 2 lines of R code to discover this phenomenon.

## input is the control sample in ChIP-seq experiment (sequences of un-chipped DNA)
inpt <- readAligned(dirPath="/YourPathToInputFiles/", pattern="Input.bam", type="BAM")

## divide sequence data into dinucleotides and calculate their frequency
ntab <- table(unlist(lapply(sread(inpt)[sample(1:length(inpt), 10000)], function(i) strsplit(gsub("(..)", "\\1_", i), split="_")[[1]])))

You may have also some unexpected nucleotides (>16 permutations) that contain “N”. These are the bases that could not be assigned by sequencing machine.
Finally, let’s visualise CpG dinucleotide depletion.

barplot(ntab[!grepl("N", names(ntab))], ylab="Frequency")

## readAligned is part of ShortRead package, to download visit
## table calculates string (dinucleotide) frequency
## we iterate over 10,000 randomly chosen sequences with lapply function
## sread can access sequence reads from the (binary) sequence alignment map
## strsplit splits the strings (sequences) based on pattern defined by split parameter
## gsub unlike sub replaces the pattern over all matches
## \\1 is a backreference to the grouping regular expression (..); it returns all two-character elements

You may wonder why CpG and not GpC? Cytosine in GpC is rarely methylated (

Edit: You can find the dataset used in this example at
Original research is presented at


To leave a comment for the author, please follow the link and comment on their blog: mintgene » R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training




CRC R books series

Six Sigma Online Training

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)