Getting Started with Markov Chains: Part 2
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
by Joseph Rickert
In a previous post, I showed some elementary properties of discrete time Markov Chains could be calculated, mostly with functions from the markovchain package. In this post, I would like to show a little bit more of the functionality available in that package by fitting a Markov Chain to some data. In this first block of code, I load the gold data set from the forecast package which contains daily morning gold prices in US dollars from January 1, 1985 through March 31, 1989. Next, since there are few missing values in the sequence, I impute them with a simple “ad hoc” process by substituting the previous day's price for one that is missing. There are two statements in the loop because there are a number of instances where there are two missing values in a row. Note that some kind of imputation is necessary because I will want to compute the autocorrelation of the series, and like many R functions acf() does not like NAs. (it doesn't make sense to compute with NAs.)
library(forecast)
library(markovchain)
data(gold) # Load gold time series
# Impute missing values
gold1 <- gold
for(i in 1:length(gold)){
gold1[i] <- ifelse(is.na(gold[i]),gold[i-1],gold[i])
gold1[i] <- ifelse(is.na(gold1[i]),gold1[i-1],gold1[i])
}
plot(gold1, xlab = “Days”, ylab = “US$”, main = “Gold prices 1/1/85 – 1/31/89”)
This is an interesting series with over 1,000 points, but definitely not stationary; so it is not a good candidate for trying to model as a Markov Chain. The series produced by taking first differences is more reasonable. The series flat, oscillating about a mean (0.07) slightly above zero and the autocorrelation trails off as one might expect for a stationary series.
# Take first differences to try and get stationary series
goldDiff <- diff(gold1)
par(mfrow = c(1,2))
plot(goldDiff,ylab=””,main=”1st differences of gold”)
acf(goldDiff)
Next, we set up for modeling by constructing a series of labels. In this analysis, I settle for constructing a two state chain that reflects whether the series of differences assumes positive or non-positive values. Note that I have introduced a few zero values into the series because of my crude imputation process. Here, i just lump them in with the negatives.
# construct a series of labels
goldSign <- vector(length=length(goldDiff))
for (i in 1:length(goldDiff)){
goldSign[i] <- ifelse(goldDiff[i] > 0, “POS”,”NEG”)
}
Next, we can use some of the statistical tests built into the markovchain package to assess our assumptions so far. The function verifyMarkovProperty() attempts to verify that a sequence satisfies the Markov property by performing Chi squared tests on a series of contingency tables where the columns are sequences of past to present to future transitions and the rows are sequences of state transitions. Large p values indicate that one should not reject the null hypothesis that the Markov property holds for a specific transition. The output of verifyMarkovProperty() is a list with an entry for each possible state transition. The package vignette An introduction to the markovchain package presents the details of how this works. The following shows the output for the NEG to POS transitions.
# Verify the Markov property
vmp <- verifyMarkovProperty(goldSign)
vmp[2]
# $NEGPOS
# $NEGPOS$statistic
# X-squared
# 0
#
# $NEGPOS$parameter
# df
# 1
#
# $NEGPOS$p.value
# [1] 1
#
# $NEGPOS$method
# [1] “Pearson's Chi-squared test with Yates' continuity correction”
#
# $NEGPOS$data.name
# [1] “table”
#
# $NEGPOS$observed
# SSO TSO-SSO
# NEG 138 116
# POS 164 139
#
# $NEGPOS$expected
# SSO TSO-SSO
# NEG 137.7163 116.2837
# POS 164.2837 138.7163
#
# $NEGPOS$residuals
# SSO TSO-SSO
# NEG 0.02417181 -0.02630526
# POS -0.02213119 0.02408452
#
# $NEGPOS$stdres
# SSO TSO-SSO
# NEG 0.04843652 -0.04843652
# POS -0.04843652 0.04843652
#
# $NEGPOS$table
# SSO TSO
# NEG 138 254
# POS 164 303
The assessOrder() function uses a Chi squared test to test that hypothesis that the sequence is consistent with a first order Markov Chain
assessOrder(goldSign)
# The assessOrder test statistic is: 0.4142521
# the Chi-Square d.f. are: 2 the p-value is: 0.8129172
# $statistic
# [1] 0.4142521
#
# $p.value
# [1] 0.8129172
There is an additional function assessStationarity() to test for the stationarity of the sequence. However in this case, the chisq.test() function at the heart of things reports that the p-values are unreliable.
Next, we use the markovchainFit() function to fit a Markov Chain to the data using the maximum likelihood estimator explained in the vignette mentioned above. The output from the function includes the estimated transition matrix, the estimated error and lower and upper endpoint transition matrices which provide a confidence interval for the transition matrix.
goldMC <- markovchainFit(data = goldSign, method="mle", name = "gold mle")
goldMC$estimate
# gold mle
# A 2 – dimensional discrete Markov Chain with following states
# NEG POS
# The transition matrix (by rows) is defined as follows
# NEG POS
# NEG 0.4560144 0.5439856
# POS 0.5519126 0.4480874
goldMC$standardError
# NEG POS
# NEG 0.02861289 0.03125116
# POS 0.03170655 0.02856901
goldMC$confidenceInterval
# [1] 0.95
#
# $lowerEndpointMatrix
# NEG POS
# NEG 0.4089504 0.4925821
# POS 0.4997599 0.4010956
#
# $upperEndpointMatrix
# NEG POS
# NEG 0.5030784 0.5953892
# POS 0.6040652 0.4950793
The transition matrix does show some interesting structure with it being more likely for the chain to go from a negative to a positive value than to stay negative. And, once it is positive, the chain is more likely to stay positive than go negative.
Finally, we use the predict() function to produce a three day, look ahead forecast for the situation where the series has been negative for the last two days.
predict(object = goldMC$estimate, newdata = c(“POS”,”POS”),n.ahead=3)
#”NEG” “POS” “NEG”
I still have not exhausted what is in the markovchain package. Perhaps, in an other post I will look at the functions for continuous time chains. What is presented here though should be enough to have some fun hunting for Markov Chains in all kinds of data.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.