The hockeystick revisited

[This article was first published on sieste » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Previous posts:
Correlation of temperature proxies with observations
The “best” proxies for temperature reconstruction

Okay, I couldn’t resist. I wanted to provide some more in depth analysis of temperature proxies, but I just went ahead and did my own little reconstruction of Northern hemisphere annual average temperatures over the past millenium using McShane et al.‘s (2011) data. First of all, look at the following plot:

NA matrix of proxies for temperature reconstructions

produced by this R code:

##this R script produces an image which inicates that 
##proxy x at year y is NA by a red dot at (x,y)
#load the data
t.nh <- instrument$CRU_NH_reform
allproxy1209 <- allproxy1209[,-c(1,grep("lutannt",names(allproxy1209)))]
na.mat <-*1

#open device
#plot NA-matrix red-white coded
#some axes and text
#legend and close

It shows proxies along the x-axis and time along the y-axis. A red dot is plotted whenever a proxy is unavailable in the particular year. Huh? Around 1500, the data becomes quite sparse. How am I supposed to fit a decent statistical model using a regressor matrix that consists mainly of NA‘s? Apparently, there exist sophisticated methods for imputation of missing data but I don’t like to invent data. My approach is rather that if an input is unavailable you cannot use it to build a model. So here’s what I do:

Normalize the proxy data to zero mean and unit standard deviation
Loop through the years 1000 through 2000:
    check which proxies are available for this particular year
    repeat 10 times:
        randomly sample 100 years from the 150 years of observation period
        using only the available proxies, fit a cross-validated lasso model to these 100 observations
        using this model, reconstruct temperature of the particular year

So 10000 lasso models have to be fit, possibly using hundreds of predictors. I guess some parallelism is appropriate, conveniently provided by the package multicore (which rocks by the way).
Using this R code

# loop through lines of proxy matrix (years):
#       filter available proxies
#       fit a cv lasso model to observations, using only these proxies
#       reconstruct temperature of the respective year


#read data, remove "year" and lutannt from proxy data
t.nh <- instrument$CRU_NH_reform
y.inst <- instrument$Year
allproxy1209 <- allproxy1209[,-c(1,grep("lutannt",names(allproxy1209)))]

#normalize proxies
m <- colMeans(allproxy1209,na.rm=T)
sd <- colMeans(allproxy1209^2,na.rm=T)-m^2
allproxy1209 <- scale(allproxy1209,center=m,scale=sqrt(sd))
#intersect years
y.ind <- which(instrument$Year %in% rownames(allproxy1209))
proxy.inst <- allproxy1209[paste(instrument$Year[y.ind]),]
t.nh <- instrument$CRU_NH_reform[y.ind]
y.inst <- y.inst[y.ind]
#erase na's from proxies during observation period
na.ind <- unique(which(,arr.ind=T)[,1])
if (!is.null(na.ind)) {
    proxy.inst <- proxy.inst[-na.ind,]
    t.nh <- t.nh[-na.ind]
    y.inst <- y.inst[-na.ind]

##loop through years 1000...2000 and reconstruct based on available proxies
#function for a single year <- function(y) {
    #remove na columns
    na.col.ind <- which([paste(y),]))
    if(length(na.col.ind)==0) na.col.ind <- dim(proxy.inst)[2]+1
    r <- vector()
    for (i in seq(10)) { <- sample(seq(dim(proxy.inst)[1]),100)
        x <- proxy.inst[,-na.col.ind]
        #fit cv-lasso model and predict
        lasso.model <- cv.glmnet(x=x,y=t.nh[])
        coefz <- as.matrix(coef(lasso.model,s=lasso.model$lambda.min))
        r <- rbind(r,t(c(1,allproxy1209[paste(y),-na.col.ind])) %*% coefz)
    #return reconstructions
recon <- mclapply(seq(1000,2000),,mc.cores=8)

recon <- t(apply(as.matrix(recon),1,unlist))
#open device
plot(NULL,xlim=c(1000,2000),ylim=c(-1,1),xlab="year",ylab="temp. anomaly")
#plot observations
#plot reconstructions
lapply(seq(10),function(z) lines(seq(1000,2000),recon[,z]))
legend(x=1000,y=1,c("lasso reconst.","observations"),lwd=c(1,4),col=c("black","#AAAAFF"))
#close device

I produced the following plot (see the earlier posts above about where to get the data):

The reconstruction is quite flat (plus noise), except during the observation period where it goes up. It requires some fantasy to see a little ice age or a medieval warm period.

A possible reason for this is that there is no actual relation between temperature observations and proxies. McShane et al. (2011) show that the lasso might fail to shrink coefficients to zero if the corresponding regressor is autocorrelated (which is true for most the proxies). Alternatively, past temperatures might indeed have never been as high as they are today. I should remark that my hockey stick above looks much less frightening than the one from the IPCC report.

To leave a comment for the author, please follow the link and comment on their blog: sieste » R. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)