[This article was first published on sieste » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Okay, I couldn’t resist. I wanted to provide some more in depth analysis of temperature proxies, but I just went ahead and did my own little reconstruction of Northern hemisphere annual average temperatures over the past millenium using McShane et al.‘s (2011) data. First of all, look at the following plot:

produced by this R code:

##this R script produces an image which inicates that
##proxy x at year y is NA by a red dot at (x,y)
#load the data
load("../data/data_allproxy1209.Rdata")
load("../data/data_instrument.Rdata")
t.nh <- instrument$CRU_NH_reform allproxy1209 <- allproxy1209[,-c(1,grep("lutannt",names(allproxy1209)))] na.mat <- is.na(allproxy1209)*1 ##plot #open device jpeg(file="na.plot.jpg",width=500,height=500,quality=100) #plot NA-matrix red-white coded image(t(na.mat[paste(seq(2000,1000)),]),col=c("white","red"),breaks=c(-.5,.5,1.5),axes=F) #some axes and text axis(side=2,at=seq(0,1,length.out=11),labels=paste(seq(2000,1000,length.out=11)),line=.5) mtext(side=2,text="Year",line=3) axis(side=1,at=c(0,1),labels=c("",""),line=.5) mtext(side=1,text="proxy",line=1) #legend and close legend(x=.7,y=.2,c("NA","available"),fill=c("red","white")) dev.off()  It shows proxies along the x-axis and time along the y-axis. A red dot is plotted whenever a proxy is unavailable in the particular year. Huh? Around 1500, the data becomes quite sparse. How am I supposed to fit a decent statistical model using a regressor matrix that consists mainly of NA‘s? Apparently, there exist sophisticated methods for imputation of missing data but I don’t like to invent data. My approach is rather that if an input is unavailable you cannot use it to build a model. So here’s what I do: Normalize the proxy data to zero mean and unit standard deviation Loop through the years 1000 through 2000: check which proxies are available for this particular year repeat 10 times: randomly sample 100 years from the 150 years of observation period using only the available proxies, fit a cross-validated lasso model to these 100 observations using this model, reconstruct temperature of the particular year end end  So 10000 lasso models have to be fit, possibly using hundreds of predictors. I guess some parallelism is appropriate, conveniently provided by the package multicore (which rocks by the way). Using this R code # loop through lines of proxy matrix (years): # filter available proxies # fit a cv lasso model to observations, using only these proxies # reconstruct temperature of the respective year library(glmnet) library(multicore) #read data, remove "year" and lutannt from proxy data load("../data/data_allproxy1209.Rdata") load("../data/data_instrument.Rdata") t.nh <- instrument$CRU_NH_reform
y.inst <- instrument$Year allproxy1209 <- allproxy1209[,-c(1,grep("lutannt",names(allproxy1209)))] #normalize proxies m <- colMeans(allproxy1209,na.rm=T) sd <- colMeans(allproxy1209^2,na.rm=T)-m^2 allproxy1209 <- scale(allproxy1209,center=m,scale=sqrt(sd)) #intersect years y.ind <- which(instrument$Year %in% rownames(allproxy1209))
proxy.inst <- allproxy1209[paste(instrument$Year[y.ind]),] t.nh <- instrument$CRU_NH_reform[y.ind]
y.inst <- y.inst[y.ind]
#erase na's from proxies during observation period
na.ind <- unique(which(is.na(proxy.inst),arr.ind=T)[,1])
if (!is.null(na.ind)) {
proxy.inst <- proxy.inst[-na.ind,]
t.nh <- t.nh[-na.ind]
y.inst <- y.inst[-na.ind]
}

##loop through years 1000...2000 and reconstruct based on available proxies
#function for a single year
recon.fun <- function(y) {
#remove na columns
na.col.ind <- which(is.na(allproxy1209[paste(y),]))
if(length(na.col.ind)==0) na.col.ind <- dim(proxy.inst)[2]+1
r <- vector()
for (i in seq(10)) {
i.cv <- sample(seq(dim(proxy.inst)[1]),100)
x <- proxy.inst[i.cv,-na.col.ind]
#fit cv-lasso model and predict
lasso.model <- cv.glmnet(x=x,y=t.nh[i.cv])
coefz <- as.matrix(coef(lasso.model,s=lasso.model\$lambda.min))
r <- rbind(r,t(c(1,allproxy1209[paste(y),-na.col.ind])) %*% coefz)
}
#return reconstructions
r
}
recon <- mclapply(seq(1000,2000),recon.fun,mc.cores=8)
save(file="lasso1.Rdata",list=c("recon"))

##plot
load("lasso1.Rdata")
recon <- t(apply(as.matrix(recon),1,unlist))
#open device
jpeg(file="lasso1.jpg",width=500,height=500,quality=100)
plot(NULL,xlim=c(1000,2000),ylim=c(-1,1),xlab="year",ylab="temp. anomaly")
#plot observations
lines(y.inst,t.nh,col="#AAAAFF",lwd=4)
#plot reconstructions
lapply(seq(10),function(z) lines(seq(1000,2000),recon[,z]))
#legend
legend(x=1000,y=1,c("lasso reconst.","observations"),lwd=c(1,4),col=c("black","#AAAAFF"))
#close device
dev.off()


I produced the following plot (see the earlier posts above about where to get the data):

The reconstruction is quite flat (plus noise), except during the observation period where it goes up. It requires some fantasy to see a little ice age or a medieval warm period.

A possible reason for this is that there is no actual relation between temperature observations and proxies. McShane et al. (2011) show that the lasso might fail to shrink coefficients to zero if the corresponding regressor is autocorrelated (which is true for most the proxies). Alternatively, past temperatures might indeed have never been as high as they are today. I should remark that my hockey stick above looks much less frightening than the one from the IPCC report.

To leave a comment for the author, please follow the link and comment on their blog: sieste » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts.(You will not see this message again.)

Click here to close (This popup will not appear again)