**Computational Biology Blog in fasta format**, and kindly contributed to R-bloggers)

For now, I am going to introduce how to build our own Markov Chain of zero order and first order in R programming language. The definition of a zero order Markov Chain relies in that, the current state (or nucleotide) does not depends on the previous state, so there’s no “memory” and every state is untied.

For the first order Markov Chain the case is different because the current state actually depends only on the previous state. Given that points clear, a second order Markov Model will be a model that reflects that the current state only depends on the previous two states before it (This model will be useful for the study of codons, given that they are substrings of 3 nucleotides long).

Download the scripts (R script and graphviz scripts) here

Example of a Markov Chain of zero order (the current nucleotide is totally independent of the previous nucleotide).

The multinomial model is:

p(A)+p(C)+p(G)+p(T) = 1.0

0.4 +0.1 +0.1 +0.4 = 1.0

Example of the structure of the zero order model:

Example of a Markov Chain of first order (the current nucleotide only depends on the previous nucleotide).

The multinomial model per base is:

p(A|A)+p(C|A)+p(G|A)+p(T|A) = 1.0

p(A|C)+p(C|C)+p(G|C)+p(T|C) = 1.0

p(A|G)+p(C|G)+p(G|G)+p(T|G) = 1.0

p(A|T)+p(C|T)+p(G|T)+p(T|T) = 1.0

So:

0.6 + 0.1 + 0.1 + 0.2 = 1.0

0.1 + 0.5 + 0.3 + 0.1 = 1.0

0.05+ 0.2 + 0.7 + 0.05 = 1.0

0.4 + 0.05+0.05 + 0.5 = 1.0

Example of the structure of the first order model:

`# Author: Benjamin Tovar`

# Date: 13/April/2012

#

# Example of a Markov Chain of zero order (the current nucleotide is

# totally independent of the previous nucleotide).

# The multinomial model is:

# p(A)+p(C)+p(G)+p(T) = 1.0

# 0.4 +0.1 +0.1 +0.4 = 1.0

# Define the DNA alphabet that will be used to put names to the objects

alp <- c("A","C","G","T")

# Create the vector that represents the probability distribution of the model

zeroOrder <- c(0.4,0.1,0.1,0.4)

# Put the name of reference of each base

names(zeroOrder) <- alp

# Create a sequence of 1000 bases using this model.

zeroOrderSeq <- sample(alp,1000,rep=T,prob=zeroOrder)

# ***** Study the composition bias of the sequence *****

# We wil use the "seqinr" package.

# For the installation of the package, type:

# install.packages("seqinr")

# Then, load the package:

require("seqinr")

# Count the frequency of each base

# in the sequence using the "count" function

zeroOrderFreq <- count(zeroOrderSeq,1,alphabet=alp,freq=TRUE)

# Count the frequency of dinucleotides

# in the sequence using the "count" function

zeroOrderFreqDin <- count(zeroOrderSeq,2,alphabet=alp,freq=TRUE)

# Now, plot the results in the same plot:

layout(1:2)

barplot(zeroOrderFreq,col=1:4,main="Compositional bias of each nucleotide",

xlab="Base",

ylab="Base proportion")

barplot(zeroOrderFreqDin,col=rainbow(16),

main="Compositional bias of each dinucleotide",

xlab="Base",

ylab="Base proportion")

# &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

# Example of a Markov Chain of first order (the current nucleotide only

# depends on the previous nucleotide).

# The multinomial model per base is:

# p(A|A)+p(C|A)+p(G|A)+p(T|A) = 1.0

# p(A|C)+p(C|C)+p(G|C)+p(T|C) = 1.0

# p(A|G)+p(C|G)+p(G|G)+p(T|G) = 1.0

# p(A|T)+p(C|T)+p(G|T)+p(T|T) = 1.0

# So:

# 0.6 + 0.1 + 0.1 + 0.2 = 1.0

# 0.1 + 0.5 + 0.3 + 0.1 = 1.0

# 0.05+ 0.2 + 0.7 + 0.05 = 1.0

# 0.4 + 0.05+0.05 + 0.5 = 1.0

# Create the matrix that will store the probability distribution given

# a certain nucleotide:

firstOrderMat <- matrix(NA,nr=4,nc=4)

# Put names to the 2 dimensions of the matrix

colnames(firstOrderMat) <- rownames(firstOrderMat) <- alp

# Add the probability distribution per base:

firstOrderMat[1,] <- c(0.6,0.1,0.1,0.2) # Given an A in the 1st position

firstOrderMat[2,] <- c(0.1,0.5,0.3,0.1) # Given a C in the 1st position

firstOrderMat[3,] <- c(0.05,0.2,0.7,0.05)# Given a G in the 1st position

firstOrderMat[4,] <- c(0.4,0.05,0.05,0.5)# Given a T in the 1st position

# Now we got a matrix

# > firstOrderMat

# A C G T

# A 0.60 0.10 0.10 0.20

# C 0.10 0.50 0.30 0.10

# G 0.05 0.20 0.70 0.05

# T 0.40 0.05 0.05 0.50

# In order to continue, we need an initial probability distribution to know

# which base is the most probable to start up the sequence.

inProb <- c(0.4,0.1,0.1,0.4); names(inProb) <- alp

# So, the sequence will have a 40% to start with an A or a T and 10% with C or G

# Create a function to generate the sequence.

# NOTE: To load the function to the current environment, just copy

# the entire function and paste it inside the R prompt.

generateFirstOrderSeq <- function(lengthSeq,

alphabet,

initialProb,

firstOrderMatrix){

# lengthSeq = length of the sequence

# alphabet = alphabet that compounds the sequence

# initialProb = initial probability distribution

# firstOrderMatrix = matrix that stores the probability distribution of a

# first order Markov Chain

# Construct the object that stores the sequence

outputSeq <- rep(NA,lengthSeq)

# Which is the first base:

outputSeq[1] <- sample(alphabet,1,prob=initialProb)

# Let the computer decide:

for(i in 2:length(outputSeq)){

prevNuc <- outputSeq[i-1]

currentProb <- firstOrderMatrix[prevNuc,]

outputSeq[i] <- sample(alp,1,prob=currentProb)

}

cat("** DONE: Sequence computation is complete **\n")

return(outputSeq)

}

# Use the generateFirstOrderSeq function to generate a sequence of 1000 bases

# long

firstOrderSeq <- generateFirstOrderSeq(1000,alp,inProb,firstOrderMat)

# ***** Study the composition bias of the sequence *****

# We wil use the "seqinr" package.

# For the installation of the package, type:

# install.packages("seqinr")

# Then, load the package:

require("seqinr")

# Count the frequency of each base

# in the sequence using the "count" function

firstOrderFreq <- count(firstOrderSeq,1,alphabet=alp,freq=TRUE)

# Count the frequency of dinucleotides

# in the sequence using the "count" function

firstOrderFreqDin <- count(firstOrderSeq,2,alphabet=alp,freq=TRUE)

# Now, plot the results in the same plot:

layout(1:2)

barplot(firstOrderFreq,col=1:4,main="Compositional bias of each nucleotide",

xlab="Base",

ylab="Base proportion")

barplot(firstOrderFreqDin,col=rainbow(16),

main="Compositional bias of each dinucleotide",

xlab="Base",

ylab="Base proportion")

## Lets plot the 4 plots in one window

layout(matrix(1:4,nr=2,nc=2))

# Results from the zero order

barplot(zeroOrderFreq,col=1:4,

main="Compositional bias of each nucleotide\nZero Order Markov Chain",

xlab="Base",

ylab="Base proportion")

barplot(zeroOrderFreqDin,col=rainbow(16),

main="Compositional bias of each dinucleotide\nZero Order Markov Chain",

xlab="Base",

ylab="Base proportion")

# Results from the first order

barplot(firstOrderFreq,col=1:4,

main="Compositional bias of each nucleotide\nFirst Order Markov Chain",

xlab="Base",

ylab="Base proportion")

barplot(firstOrderFreqDin,col=rainbow(16),

main="Compositional bias of each dinucleotide\nFirst Order Markov Chain",

xlab="Base",

ylab="Base proportion")

# end.

**leave a comment**for the author, please follow the link and comment on their blog:

**Computational Biology Blog in fasta format**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...