discrimination between CpG islands and random sequences using Markov chains

[This article was first published on mintgene » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Major part of modern research is trying to find patterns in the given dataset using learning methods. One of the methods that can use a priori information for such purpose is Markov chains, in which the probability of symbol occurrence depends only on previous symbol (appropriate for the dinucleotide modeling).

The question I am addressing here is – given the random sequence from mouse genome, can you predict that this is a CpG island? In the next few lines of code you will find many useful approaches to MC modeling, splitting of odd and even elements of R vector using “substring” function, querying the UCSC database solely via R, and informative visualization with transparent histograms.

Note that I got sequences from BSgenome.Mmusculus.UCSC.mm9 library, which contains complete mouse genome. Since the download of the same might be tedious for some (~750mb), I uploaded the cpg and random regions to pastebin.

CpG Islands – http://pastebin.com/372cfs6U
Random – http://pastebin.com/wZyKFFkx

# load libraries
library(compiler)
library(rtracklayer)
library(GenomicFeatures)

# ignore following if you're taking sequences from pastebin
library(BSgenome.Mmusculus.UCSC.mm9)

# models derived by ML estimator (a_st = c_st / sum(c_st')) from CpG and non-CpG islands
# in each row, nucleotide has a probability of being followed by respective base in the column,
# thus rows should sum up to 1

# notice the drop of G being followed by C in negative model (due to C being target of spontaneous deamination upon methylation),
# same is preserved in CpG-islands rich regions due to the lack of methylation supression on promoter regions

positive.model <- t(matrix(
    data  = c(0.180, 0.274, 0.426, 0.120, 0.171, 0.368, 0.274, 0.188, 0.161, 0.339, 0.375, 0.125, 0.079, 0.355, 0.384, 0.182),
    nrow  = 4,
    ncol  = 4,
    dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))
))

negative.model <- t(matrix(
    data  = c(0.300, 0.205, 0.285, 0.210, 0.322, 0.298, 0.078, 0.302, 0.248, 0.246, 0.298, 0.208, 0.177, 0.239, 0.292, 0.292),
    nrow  = 4,
    ncol  = 4,
    dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))
))

# log-likelihood table derived from log-odds ratio of two models
# used for discrimination of sequences as CpG or non-CpG islands

likelihood.values <- round(log2(positive.model/negative.model), 3)

# score (in bits) of the sequence normalized by its length
bits.score <- function(sequence = DNA_region, model = likelihood.values) {

    # sequence length
    seq.length <- nchar(sequence)

    # if sequence is odd, trim the last element
    if (seq.length %% 2 != 0) {
        sequence   <- paste(strsplit(sequence, split = "")[[1]][-seq.length], collapse = "")
        seq.length <- seq.length - 1
    }

    # extract dinucleotides
    idx   <- seq(1, seq.length)
    odds  <- idx[(idx %% 2) == 1]
    evens <- idx[(idx %% 2) == 0]

    tmp.split <- substring(sequence, odds, evens)

    # return normalized sum of the log-likelihood scores
    return(sum(unlist(lapply(tmp.split, function(tmp.dinucleotide) {
        lookup <- strsplit(tmp.dinucleotide, split = "")[[1]]
        return(model[lookup[1], lookup[2]])
    }))) / seq.length)
}

bits.score <- cmpfun(bits.score)

# setting up the discrimination model
model.score <- function(sequences = DNA_region) {
    all.dinucleotides <- do.call("c", lapply(sequences, function(sequence) {

        # sequence length
        seq.length <- nchar(sequence)

        # if sequence is odd, trim the last element
        if (seq.length %% 2 != 0) {
            sequence   <- paste(strsplit(sequence, split = "")[[1]][-seq.length], collapse = "")
            seq.length <- seq.length - 1
        }

        # extract dinucleotides
        idx   <- seq(1, seq.length)
        odds  <- idx[(idx %% 2) == 1]
        evens <- idx[(idx %% 2) == 0]

        tmp.split <- substring(sequence, odds, evens)
    }))

    all.counts <- t(matrix(table(all.dinucleotides), ncol = 4, nrow = 4, dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))))
    all.counts <- t(apply(all.counts, 1, function(tmp.row) { tmp.row / sum(tmp.row) } ))

return(all.counts)
}

model.score <- cmpfun(model.score)

# following code extracts sequences from annotated and random parts of the mm9 genome
# ignore it if you're using pastebin text
# querying the UCSC database for CpG islands track
session               <- browserSession()
genome(session)       <- "mm9"
query                 <- ucscTableQuery(session, "CpG Islands")
cpg.islands           <- getTable(query)
cpg.islands           <- subset(cpg.islands, cpg.islands$chrom == "chr1", select = c("chrom", "chromStart", "chromEnd", "length"))
colnames(cpg.islands) <- c("chr", "region.start", "region.end", "region.width")

# generate random positions via "sample" function, but not exceeding largest width of CpG island
random.regions            <- data.frame(chr = "chr1", region.start = sample(seq(0, seqlengths(Mmusculus)[["chr1"]] - max(cpg.islands$region.width)), nrow(cpg.islands)))
random.regions$region.end <- random.regions$region.start + cpg.islands$region.width

markov.chain.seq <- lapply(c("cpg.islands", "random.regions"),  FUN = function(tmp.sample) {
    tmp.dataset <- get(tmp.sample)

    if (nrow(tmp.dataset) > 0) {
        tmp.sview        <- as.character(DNAStringSet(Views(unmasked(Mmusculus[["chr1"]]), start = tmp.dataset$region.start, end = tmp.dataset$region.end)))
    }

    return(tmp.sview)
})

# only keep sequences from non-repetitive parts of genome
markov.chain.seq        <- lapply(markov.chain.seq, function(tmp.sample) { do.call("c", lapply(tmp.sample, function(tmp.seq) { tmp.seq[!grepl("N", tmp.seq)] })) })
names(markov.chain.seq) <- c("cpg.islands", "random.regions")

# you can substitute "lapply" with "mclapply" for parallel processing with multicore package at this point
# output of the discrimination plot
png("~/MarkovChain.png", 1000, 600)
    hist(unlist(lapply(markov.chain.seq[[1]], bits.score)), breaks = 100, xlim = c(-0.5, 0.5), ylim = c(0, 50), col = "#78B90050", main = "Markov Chain model discriminates CpG from non-CpG sequences", xlab = "bits score")
    hist(unlist(lapply(markov.chain.seq[[2]], bits.score)), breaks = 100, col = "#444D6150", add = TRUE)
legend("topright", c("CpG-rich sequences", "random sequences"), col = c("#78B900", "#444D61"), pch = 15, cex = 2, bty = "n")
dev.off()

Similar to the HMM example from previous post, the Markov chain is also based on Durbin (1998.) book.

Cheers, mintgene.


To leave a comment for the author, please follow the link and comment on their blog: mintgene » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)