[This article was first published on mind of a Markov chain » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Previously, I did a simple Bayesian projection of recent temperature using proxy data and the methods shown in McShane and Wyner (2010). I showed that when you take out the last 30 years of data (1969~1998), the projection does not track the recent uptick in temperatures well. The “projection” is a simple unparametric bootstrap which draws parameter values from the posterior and fits them to the predictors (proxies).

Now let’s try this for the reconstruction period. The strategy is the same as before. Estimate the posterior using JAGS and backcast by starting at 1998 and iteratively move forward given parameter draws. Do this 100 times, and we get the 95% credible interval.

I did this over the instrumental period as well, for two calibration periods: one for the CRU dataset, and another that took out the most recent 30 year period.

(click for source code: probably has to be combined with the code in the previous article). (data @ McShane’s website)

```setwd("../")
setwd("mysession/")

# Parameters:
startdate <- 1000
numpc <- 10
numlags <- 2
numsim <- 100

# Get proxies going back to start date:
sel <- allproxy1209info[,"StartYear"] <= startdate
tmp <- colnames(allproxy1209)[-1][sel]
proxy <- allproxy1209[,tmp]
sel <- rowSums(is.na(proxy))==0
proxy <- proxy[sel,]
years <- allproxy1209[sel,1]
rownames(proxy) <- years
sel <- instrument[,"Year"] %in% years
temp <- instrument[sel,-1]
rownames(temp) <- instrument[sel,"Year"]

# Run proxy pca:
mypc <- prcomp(proxy, center=TRUE, scale=TRUE)

# calibrate, then project
mylm2 <- lm(temp[,1] ~ ., data.frame(subset(mypc\$x[,1:10], as.numeric(rownames(mypc\$x)) >= 1850)))
yhat2 <- predict(mylm2, data.frame(subset(mypc\$x[,1:10], as.numeric(rownames(mypc\$x)) < 1850)))

# from CRUTEMP3v
# http://www.cru.uea.ac.uk/cru/data/temperature/crutem3vnh.txt
cru_recent <- c(.725, .613, .790, .863, .811, .828, .889, .920, 1.025, .773, .789)
names(cru_recent) <- 1999:2009

# plot non-held
plot(as.numeric(rownames(mypc\$x)), rep(c(-1, 1), len = length(rownames(mypc\$x))), pch = "", xlab = "Year", ylab = "Temp Anomaly")
lines(as.numeric(rownames(temp)), temp[,1], col = "salmon", lwd = 2)
lines(as.numeric(names(yhat2)), yhat2, col = "orange", lwd = 2)
lines(as.numeric(names(cru_recent)), cru_recent, col = "red", lwd = 2)

# recon whole dataset
recon.sim <- 100
inits <- as.numeric(mcshane.data[1,2:3]) # say known
recon.forecast <- matrix(0, nrow = recon.sim, ncol = 1001)
for (i in 1:recon.sim) {
for (t in 1:1001) {
if (t == 1)
dat <- inits
if (t == 2)
dat <- c(recon.forecast[i,t-1], inits)
if (t >= 3)
dat <- recon.forecast[i,t-(1:2)]
#  PC
pc <- as.numeric(mypc\$x[1001-t+1,1:10])
# process error
mu <- sum(apply(mcshane.beta[], 2, function(x) sample(x, 1)) *
c(1, as.numeric(dat), pc))
# resid error
recon.forecast[i,t] <- rnorm(1, mu,
1/sqrt(sample(mcshane.precsigma[], 1)))
}
}

# plot ahoy!
lines(supsmu(as.numeric(rownames(mypc\$x)), rev(apply(t(recon.forecast), 1, function(x) quantile(x, .5))), span = .005), col = "green", lwd = 3, type = "l")
lines(supsmu(as.numeric(rownames(mypc\$x)), rev(apply(t(recon.forecast), 1, function(x) quantile(x, .975))), span = .005), col = "green", lwd = 1, lty = 2)
lines(supsmu(as.numeric(rownames(mypc\$x)), rev(apply(t(recon.forecast), 1, function(x) quantile(x, .025))), span = .005), col = "green", lwd = 1, lty = 2)

# -- hold out
reconH.sim <- 100
inits <- as.numeric(mcshane.data[1,2:3]) # say known
reconH.forecast <- matrix(0, nrow = reconH.sim, ncol = 1001)
for (i in 1:reconH.sim) {
for (t in 1:1001) {
if (t == 1)
dat <- inits
if (t == 2)
dat <- c(reconH.forecast[i,t-1], inits)
if (t >= 3)
dat <- reconH.forecast[i,t-(1:2)]
#  PC
pc <- as.numeric(mypc\$x[1001-t+1,1:10])
# process error
mu <- sum(apply(hold.beta[], 2, function(x) sample(x, 1)) *
c(1, as.numeric(dat), pc))
# resid error
reconH.forecast[i,t] <- rnorm(1, mu,
1/sqrt(sample(hold.precsigma[], 1)))
}
}

# plot ahoy!
lines(supsmu(as.numeric(rownames(mypc\$x)), rev(apply(t(reconH.forecast), 1, function(x) quantile(x, .5))), span = .005), col = "purple", lwd = 3, type = "l")
lines(supsmu(as.numeric(rownames(mypc\$x)), rev(apply(t(reconH.forecast), 1, function(x) quantile(x, .975))), span = .005), col = "purple", lwd = 1, lty = 2)
lines(supsmu(as.numeric(rownames(mypc\$x)), rev(apply(t(reconH.forecast), 1, function(x) quantile(x, .025))), span = .005), col = "purple", lwd = 1, lty = 2)
``` The orange line is the frequentist estimate directly from MW2010 (w/o AR2 structure), the green line is the Bayesian estimate with 95% credible interval, and the purple line is the Bayesian interval w/o the last 30 years of data to calibrate.

One thing you’ll notice is that both projections underestimate the most recent warming. Not surprisingly, the green (which uses the whole instrumental period) does better in this period. What’s even more interesting is the deviation in the early years between the two projections.

When I saw the results of M&W2010, I noticed the acute angle the “hockey stick” has compared to other recent reconstructions (Mann et al. 2008; Li et al. 2010; Li et al. 2007; Hopcroft et al. 2009; Smith 2010; Tingley 2009; Hegerl et al. 2007; Lee and Zwiers 2008). The method seems to be fairly sensitive to certain assumptions. The paper pushed the hardest for the model with 10 principal components and an AR(2) structure. Given that the data that’s used is from Mann et al. (2008), I thought it was a little weird that there was a large discrepancy between the two.

Now, there’s obviously a ton of things to be done to check (and much of it has been done), but I was interested in the inability to predict the short run-ups in recent periods. My hypothesis was, the tremendous increase in greenhouse gases has created an atmosphere (no pun intended) that is wholly different from anything in the past. I could test this by borrowing ideas from Li et al. (2010) — to use forcing data in the model.

Traditionally, reconstructions of past climate have come from aggregating proxy data. But Li et al. (2010) created a Bayesian hierarchical model (BHM) that enables separating out error between residuals, process and priors (instead of just two out of three). Climate forcing (data) can predict temperature, so Li et al. used forcing data as the temperature driver, while measurement/observation error was articulated through the proxies. The BHM is still incomplete (only uses CO2, volcano, sun as forcings), but it’s shown to reduce both the bias and variance in the models.

I might do a BHM later, but for now, an conceptually easy compromise is to collapse this hierarchy, and just use it as a linear predictor to the original model. $textnormal{temp}_t = beta_0 + beta_1textnormal{temp}_{t-1} + beta_2textnormal{temp}_{t-2} + sum_{i=3}^{12} beta_i(textnormal{PC})_{t,i} + beta_{13}log(textnormal{CO2}_t) + sigma^2_t$

I only include a CO2 predictor for simplicity, and the usual uninformative priors. I did the same simple projection process as before, and overlaid the results in blue:

data @ Institute for Mathematics Applied to Geosciences (IMAGe)

```# now do the Li et al. (2010)
# http://www.image.ucar.edu/Data/paleoindex.html
rownames(li.solar) <- li.solar[,1]
rownames(li.volcano) <- li.volcano[,1]
rownames(li.gas) <- li.gas[,1]

years <- 1850:1998
yrs <- 998:1998
l.gas <- scale(log(li.gas[as.character(c(yrs,1999)),2]))
forcings <- cbind(#scale(li.solar[as.character(yrs),2]),
#scale(li.volcano[as.character(yrs),2]),
#                  scale(log(li.gas[as.character(yrs),2])),
l.gas[1:(length(l.gas)-1)])
#                  l.gas[2:length(l.gas)])
rownames(forcings) <- yrs

## produce the data along with mcshane.data
mcshane.data2 <- data.frame(mcshane.data,
#                            solar = rev(forcings[as.character(years),1]),
#                            volcano = rev(forcings[as.character(years),2]),
#                            gas = rev(forcings[as.character(years),3]))
gas = rev(forcings[as.character(years),1]))
#         gas1 = rev(forcings[as.character(years),2]))

# calibrate, then project
mylm3 <- lm(temp[,1] ~ ., subset(data.frame(mypc\$x[,1:10], forcings), as.numeric(rownames(mypc\$x)) >= 1850))
yhat3 <- predict(mylm3, subset(data.frame(mypc\$x[,1:10], forcings), as.numeric(rownames(mypc\$x)) < 1850))

plot(as.numeric(rownames(mypc\$x)), rep(c(-1, 1), len = length(rownames(mypc\$x))), pch = "", xlab = "Year", ylab = "Temp Anomaly")
lines(as.numeric(rownames(temp)), temp[,1], col = "salmon", lwd = 2)
lines(as.numeric(names(yhat2)), yhat2, col = "orange", lwd = 2)
lines(as.numeric(names(cru_recent)), cru_recent, col = "red", lwd = 2)
lines(as.numeric(names(yhat3)), yhat3, col = "blue", lwd = 2)

nrows <- 149
beta0 <- rep(0,length(mcshane.data2))
precbeta <- diag(.001, length(mcshane.data2))

li.all <- list(y = mcshane.data2[,1],
Lagy = as.matrix(mcshane.data2[,2:3]),
PC = as.matrix(mcshane.data2[,4:13]),
#               solar = mcshane.data2\$solar,
#               volcano = mcshane.data2\$volcano,
#               gas = mcshane.data2\$gas,
gas = mcshane.data2\$gas,
#               gas1 = mcshane.data2\$gas.1,
nrows = nrows, beta0 = beta0, precbeta = precbeta)

li.inits <- list(beta=beta0)

li.jags <- jags.model(file = "lijags.bug",
data = li.all,
inits = li.inits,
n.chain = 1)

# burn in
update(li.jags, 1000)

# monitor
li.out <- jags.samples(li.jags, c("beta", "mu", "precsigma"),
n.iter = 10000, thin = 40)

li.beta <- as.mcmc.list(li.out\$beta)
li.mu <- as.mcmc.list(li.out\$mu)
li.precsigma <- as.mcmc.list(li.out\$precsigma)

li.sim <- 100
inits <- as.numeric(mcshane.data2[1,2:3]) # say known
li.forecast <- matrix(0, nrow = li.sim, ncol = 1001)
for (i in 1:li.sim) {
for (t in 1:1001) {
if (t == 1)
dat <- inits
if (t == 2)
dat <- c(li.forecast[i,t-1], inits)
if (t >= 3)
dat <- li.forecast[i,t-(1:2)]
#  PC
pc <- as.numeric(mypc\$x[1001-t+1,1:10])
# forcings
force <- as.numeric(forcings[1001-t+1,])
# process error
mu <- sum(apply(li.beta[], 2, function(x) sample(x, 1)) *
c(1, as.numeric(dat), pc, force))
# resid error
li.forecast[i,t] <- rnorm(1, mu,
1/sqrt(sample(li.precsigma[], 1)))
}
}

# plot w/ forcings
lines(supsmu(as.numeric(rownames(mypc\$x)), rev(apply(t(li.forecast), 1, function(x) quantile(x, .5))), span = .005), col = "blue", lwd = 3, type = "l")
lines(supsmu(as.numeric(rownames(mypc\$x)), rev(apply(t(li.forecast), 1, function(x) quantile(x, .975))), span = .005), col = "blue", lwd = 1, lty = 2)
lines(supsmu(as.numeric(rownames(mypc\$x)), rev(apply(t(li.forecast), 1, function(x) quantile(x, .025))), span = .005), col = "blue", lwd = 1, lty = 2)
legend(1400, 1.1, c("M&W PC10 reconstruction", "CRU 1850~1998", "CRU 1999~2009", "Bayesian reconstruction 50% quantile", "Bayesian reconstruction 95% interval", "Bayesian reconstruction 50% quantile w/o last 30 years", "Bayesian reconstruction 95% interval w/o last 30 years", "Bayesian reconstruction 50% quantile w/ log(CO2) predictor", "Bayesian reconstruction 95% interval w/ log(CO2) predictor"), col = c("orange", "salmon", "red", "green", "green", "purple", "purple", "blue", "blue"), lty = c(1,1,1,1,2,1,2,1,2), lwd = c(2,2,2,2,1,2,1,2,1), box.lwd = 0)
``` This model with the CO2 predictor predicts the recent upswing in temperature the best. Not only that, the predictions of past temperatures are lower than the other two models. The purple line omits the last thirty years, while the blue line does not omit the last thirty years worth of data, but they are still the most similar.

My guess why this happens, is that the hidden CO2 effect “corrupts” the green reconstruction because you’re parameterizing against something that has not happened before. The purple line has less data, but it’s at least calibrating against data where CO2 forcing is not as strong, similar to anything before the modern period. The blue model with the CO2 predictor corrects for both, and predicts both the end and beginning of the series well.

Now, the usual disclaimers: this is not a rigorous analysis. The stationarity assumption may be violated easily with the CO2 forcing (or any of the other PCs), and I’ve obviously not considered other forcing that will certainty change the proportional effect of log(CO2). So don’t go bonkers!

Refs:

1. BLAKELEY B. MCSHANE AND ABRAHAM J. WYNER (2010). A STATISTICAL ANALYSIS OF MULTIPLE TEMPERATURE PROXIES: ARE RECONSTRUCTIONS OF SURFACE TEMPERATURES OVER THE LAST 1000 YEARS RELIABLE? Annals of Applied Statistics, 4 (3) (link)
2. Bo Li, Douglas W. Nychka and Caspar M. Ammann (2010). The Value of Multi-proxy Reconstruction of Past Climate Journal of the American Statistical Association(link)