The following biological phenomenon demonstrates just how elegant R code can be.
In vertebrate genomes, a methyl group (-CH3) can be added to nucleotides. Such process of methylation is commonly associated with gene suppression. Most of the cytosines in the system are methylated (apart from structures called CpG islands). If a spontaneous event of deamination (removal of amine group) occurs on cytosine, different outcomes happen depending on the methylation state. If cytosine was not methylated, it is then converted to uracil nucleotide, which is not a constituent member of DNA sequences and is quickly repaired by the system. However, if the cytosine is methylated, conversion to another nucleotide occurs – thymine. Since thymine is natural part of each DNA sequence, system cannot recognise this change. Thus in vertebrate genomes, we see much lower proportion of CpG dinucleotides than expected.
It takes 2 lines of R code to discover this phenomenon.
## input is the control sample in ChIP-seq experiment (sequences of un-chipped DNA) inpt <- readAligned(dirPath="/YourPathToInputFiles/", pattern="Input.bam", type="BAM") ## divide sequence data into dinucleotides and calculate their frequency ntab <- table(unlist(lapply(sread(inpt)[sample(1:length(inpt), 10000)], function(i) strsplit(gsub("(..)", "\\1_", i), split="_")[])))
You may have also some unexpected nucleotides (>16 permutations) that contain “N”. These are the bases that could not be assigned by sequencing machine.
Finally, let’s visualise CpG dinucleotide depletion.
barplot(ntab[!grepl("N", names(ntab))], ylab="Frequency")
## readAligned is part of ShortRead package, to download visit bioconductor.org ## table calculates string (dinucleotide) frequency ## we iterate over 10,000 randomly chosen sequences with lapply function ## sread can access sequence reads from the (binary) sequence alignment map ## strsplit splits the strings (sequences) based on pattern defined by split parameter ## gsub unlike sub replaces the pattern over all matches ## \\1 is a backreference to the grouping regular expression (..); it returns all two-character elements
You may wonder why CpG and not GpC? Cytosine in GpC is rarely methylated (http://www.ncbi.nlm.nih.gov/pubmed/6254144).