discrimination between CpG islands and random sequences using Markov chains

February 8, 2012
By

(This article was first published on mintgene » R, and kindly contributed to R-bloggers)

Major part of modern research is trying to find patterns in the given dataset using learning methods. One of the methods that can use a priori information for such purpose is Markov chains, in which the probability of symbol occurrence depends only on previous symbol (appropriate for the dinucleotide modeling).

The question I am addressing here is – given the random sequence from mouse genome, can you predict that this is a CpG island? In the next few lines of code you will find many useful approaches to MC modeling, splitting of odd and even elements of R vector using “substring” function, querying the UCSC database solely via R, and informative visualization with transparent histograms.

Note that I got sequences from BSgenome.Mmusculus.UCSC.mm9 library, which contains complete mouse genome. Since the download of the same might be tedious for some (~750mb), I uploaded the cpg and random regions to pastebin.

CpG Islands – http://pastebin.com/372cfs6U
Random - http://pastebin.com/wZyKFFkx

# load libraries
library(compiler)
library(rtracklayer)
library(GenomicFeatures)

# ignore following if you're taking sequences from pastebin
library(BSgenome.Mmusculus.UCSC.mm9)

# models derived by ML estimator (a_st = c_st / sum(c_st')) from CpG and non-CpG islands
# in each row, nucleotide has a probability of being followed by respective base in the column,
# thus rows should sum up to 1

# notice the drop of G being followed by C in negative model (due to C being target of spontaneous deamination upon methylation),
# same is preserved in CpG-islands rich regions due to the lack of methylation supression on promoter regions

positive.model <- t(matrix(
    data  = c(0.180, 0.274, 0.426, 0.120, 0.171, 0.368, 0.274, 0.188, 0.161, 0.339, 0.375, 0.125, 0.079, 0.355, 0.384, 0.182),
    nrow  = 4,
    ncol  = 4,
    dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))
))

negative.model <- t(matrix(
    data  = c(0.300, 0.205, 0.285, 0.210, 0.322, 0.298, 0.078, 0.302, 0.248, 0.246, 0.298, 0.208, 0.177, 0.239, 0.292, 0.292),
    nrow  = 4,
    ncol  = 4,
    dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))
))

# log-likelihood table derived from log-odds ratio of two models
# used for discrimination of sequences as CpG or non-CpG islands

likelihood.values <- round(log2(positive.model/negative.model), 3)

# score (in bits) of the sequence normalized by its length
bits.score <- function(sequence = DNA_region, model = likelihood.values) {

    # sequence length
    seq.length <- nchar(sequence)

    # if sequence is odd, trim the last element
    if (seq.length %% 2 != 0) {
        sequence   <- paste(strsplit(sequence, split = "")[[1]][-seq.length], collapse = "")
        seq.length <- seq.length - 1
    }

    # extract dinucleotides
    idx   <- seq(1, seq.length)
    odds  <- idx[(idx %% 2) == 1]
    evens <- idx[(idx %% 2) == 0]

    tmp.split <- substring(sequence, odds, evens)

    # return normalized sum of the log-likelihood scores
    return(sum(unlist(lapply(tmp.split, function(tmp.dinucleotide) {
        lookup <- strsplit(tmp.dinucleotide, split = "")[[1]]
        return(model[lookup[1], lookup[2]])
    }))) / seq.length)
}

bits.score <- cmpfun(bits.score)

# setting up the discrimination model
model.score <- function(sequences = DNA_region) {
    all.dinucleotides <- do.call("c", lapply(sequences, function(sequence) {

        # sequence length
        seq.length <- nchar(sequence)

        # if sequence is odd, trim the last element
        if (seq.length %% 2 != 0) {
            sequence   <- paste(strsplit(sequence, split = "")[[1]][-seq.length], collapse = "")
            seq.length <- seq.length - 1
        }

        # extract dinucleotides
        idx   <- seq(1, seq.length)
        odds  <- idx[(idx %% 2) == 1]
        evens <- idx[(idx %% 2) == 0]

        tmp.split <- substring(sequence, odds, evens)
    }))

    all.counts <- t(matrix(table(all.dinucleotides), ncol = 4, nrow = 4, dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))))
    all.counts <- t(apply(all.counts, 1, function(tmp.row) { tmp.row / sum(tmp.row) } ))

return(all.counts)
}

model.score <- cmpfun(model.score)

# following code extracts sequences from annotated and random parts of the mm9 genome
# ignore it if you're using pastebin text
# querying the UCSC database for CpG islands track
session               <- browserSession()
genome(session)       <- "mm9"
query                 <- ucscTableQuery(session, "CpG Islands")
cpg.islands           <- getTable(query)
cpg.islands           <- subset(cpg.islands, cpg.islands$chrom == "chr1", select = c("chrom", "chromStart", "chromEnd", "length"))
colnames(cpg.islands) <- c("chr", "region.start", "region.end", "region.width")

# generate random positions via "sample" function, but not exceeding largest width of CpG island
random.regions            <- data.frame(chr = "chr1", region.start = sample(seq(0, seqlengths(Mmusculus)[["chr1"]] - max(cpg.islands$region.width)), nrow(cpg.islands)))
random.regions$region.end <- random.regions$region.start + cpg.islands$region.width

markov.chain.seq <- lapply(c("cpg.islands", "random.regions"),  FUN = function(tmp.sample) {
    tmp.dataset <- get(tmp.sample)

    if (nrow(tmp.dataset) > 0) {
        tmp.sview        <- as.character(DNAStringSet(Views(unmasked(Mmusculus[["chr1"]]), start = tmp.dataset$region.start, end = tmp.dataset$region.end)))
    }

    return(tmp.sview)
})

# only keep sequences from non-repetitive parts of genome
markov.chain.seq        <- lapply(markov.chain.seq, function(tmp.sample) { do.call("c", lapply(tmp.sample, function(tmp.seq) { tmp.seq[!grepl("N", tmp.seq)] })) })
names(markov.chain.seq) <- c("cpg.islands", "random.regions")

# you can substitute "lapply" with "mclapply" for parallel processing with multicore package at this point
# output of the discrimination plot
png("~/MarkovChain.png", 1000, 600)
    hist(unlist(lapply(markov.chain.seq[[1]], bits.score)), breaks = 100, xlim = c(-0.5, 0.5), ylim = c(0, 50), col = "#78B90050", main = "Markov Chain model discriminates CpG from non-CpG sequences", xlab = "bits score")
    hist(unlist(lapply(markov.chain.seq[[2]], bits.score)), breaks = 100, col = "#444D6150", add = TRUE)
legend("topright", c("CpG-rich sequences", "random sequences"), col = c("#78B900", "#444D61"), pch = 15, cex = 2, bty = "n")
dev.off()

Similar to the HMM example from previous post, the Markov chain is also based on Durbin (1998.) book.

Cheers, mintgene.


To leave a comment for the author, please follow the link and comment on his blog: mintgene » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.