# Correlation of temperature proxies with observations

April 28, 2012

(This article was first published on sieste » R, and kindly contributed to R-bloggers)

The climate change debate centers largely on the claim that the annual global mean temperatures of the past few decades have been the highest in the past millennium. How do we know what the annual global mean temperature was in, say, the year 1351 AD? The answer: through temperature proxies. Such proxies include tree rings, lake sediment cores, and air bubbles trapped in ice cores. The idea is that certain properties of these proxies are related to or influenced by temperature, and if we can find this relationship, we can reconstruct past temperatures from the proxies.

Last year a statistical analysis of temperature proxies was published (McShane 2011, http://arxiv.org/pdf/1104.4002). The authors argue that these proxies might be unreliable, and they point out a number of statistical pitfalls in reconstructing temperatures from them. In an exemplary manner, the authors have made their data publicly available at http://lib.stat.cmu.edu/aoas/398/supplement.zip. In this post (and a couple of future ones) I will have a look at this data set using R.

First of all, `unzip supplement.zip` produces a directory `CodeForRelease` which includes the files
`data_allproxy.R` and `data_instrument.R`: the proxy data reaching up to 10000 years into the past, and the temperature observations of the past 150 years. I renamed these files to `*.Rdata` because my R scripts already use the suffix `*.R`.

The first thing I want to do is look at correlations between the observed Northern Hemisphere annual mean temperatures and the proxy data during the observation period 1850-2000. There are more than 1000 proxies in the data set, so it is worthwhile to look at the distribution of all the correlation coefficients. If there is a strong linear relationship between mean temperature and the proxies, we should find a lot of correlation coefficients either close to $1$ or close to $-1$.

The R code snippet to calculate the distribution of correlation coefficients is this:

```r
# (assumes allproxy1209 and instrument have been loaded
# from the renamed data files)
# read data, remove "year" column from proxy data
t.nh <- instrument$CRU_NH_reform
proxy.inst <- allproxy1209[paste(instrument$Year), -1]
# calculate all correlation coefficients and estimate their distribution
cc <- sapply(seq_len(ncol(proxy.inst)),
             function(z) cor(t.nh, proxy.inst[, z], use = "complete.obs"))
dcc <- density(cc)
```
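As an aside, the `use="complete.obs"` argument matters because many proxy series have missing years; without it, `cor` returns `NA` as soon as a single value is missing. A quick stand-alone illustration with toy vectors (not the actual data):

```r
# toy illustration of cor() with missing values
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, 8, 11)
r.naive    <- cor(x, y)                        # NA: x contains a missing value
r.complete <- cor(x, y, use = "complete.obs")  # drops the incomplete pair first
```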

Similar to McShane (2011), I produce another data set of pseudo proxies. Each time series of a true proxy is replaced by a time series with similar statistical properties, but which by construction has no relationship to the temperature record. I calculate three kinds of pseudo proxies: independent Gaussian, autoregressive first order, and autoregressive second order. The idea is that the pseudo proxies show us the typical correlation coefficients we would see if there were no relationship between proxies and temperature. By comparing the correlation coefficients of the true proxies with those of the pseudo proxies, we can judge whether the true proxies are significantly correlated with the observations. Here is the code snippet to calculate the three pseudo proxies, their correlations with the observations, and the distributions of the correlation coefficients:

```r
# fit an AR(p) model to each of the proxies and build a new matrix
# of pseudo proxies, where each "true" proxy is replaced by a
# realization of its AR(p) fit
build.pseudoproxy <- function(data, ar.order) {
  # na.rm=TRUE also works for proxies without any missing values
  m <- mean(data, na.rm = TRUE)
  s <- sd(data, na.rm = TRUE)
  data <- (data - m) / s
  if (ar.order == 0) {
    pproxy <- rnorm(length(data))
  } else {
    arm <- ar(data, order.max = ar.order, aic = FALSE, na.action = na.omit)
    pproxy <- arima.sim(model = list(order = c(ar.order, 0, 0), ar = arm$ar),
                        n = length(data))
  }
  as.numeric(pproxy * s + m)
}
pseudo.prox.0 <- sapply(proxy.inst, build.pseudoproxy, ar.order = 0)
pseudo.prox.1 <- sapply(proxy.inst, build.pseudoproxy, ar.order = 1)
pseudo.prox.2 <- sapply(proxy.inst, build.pseudoproxy, ar.order = 2)
# calculate correlation coefficients of observations with pseudo
# proxies, and their distributions
cc0 <- apply(pseudo.prox.0, 2, function(z) cor(t.nh, z))
dcc0 <- density(cc0)
cc1 <- apply(pseudo.prox.1, 2, function(z) cor(t.nh, z))
dcc1 <- density(cc1)
cc2 <- apply(pseudo.prox.2, 2, function(z) cor(t.nh, z))
dcc2 <- density(cc2)
```
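To see why autocorrelation alone widens the null distribution of correlation coefficients, here is a small stand-alone experiment (simulated series of length 151, matching the 1850-2000 observation period; not part of the original analysis): many independent AR(1) series are correlated with one fixed target series, and the spread of the resulting coefficients grows with the AR coefficient phi.

```r
set.seed(1)
n <- 151  # length of the 1850-2000 observation period
# a fixed "temperature" stand-in with some autocorrelation
target <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
# standard deviation of the null correlations for increasing autocorrelation
null.sd <- sapply(c(0, 0.5, 0.9), function(phi) {
  cc.null <- replicate(2000, {
    x <- if (phi == 0) rnorm(n)
         else as.numeric(arima.sim(model = list(ar = phi), n = n))
    cor(target, x)
  })
  sd(cc.null)
})
# null.sd increases with phi: the stronger the autocorrelation,
# the wider the distribution of spurious correlations
```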

Finally, we plot the densities in the same diagram, which is done by this piece of code:

```r
plot(NULL, type = "l", lwd = 2, xlim = c(-1, 1), ylim = c(0, 5),
     xlab = "corrcoef.", ylab = "density")
lines(dcc0$x, dcc0$y, lwd = 2, col = "orange")
lines(dcc1$x, dcc1$y, lwd = 2, col = "blue")
lines(dcc2$x, dcc2$y, lwd = 2, col = "green")
lines(dcc$x, dcc$y, lwd = 4, col = "black")
legend(x = -1, y = 4.5,
       c("proxies", "pseudo AR(0)", "pseudo AR(1)", "pseudo AR(2)"),
       lwd = c(4, rep(2, 3)), lty = rep(1, 4),
       col = c("black", "orange", "blue", "green"))
```

And this is the output:

(Figure: estimated densities of the correlation coefficients for the true proxies and the three types of pseudo proxies.)

The pseudo proxies produce correlation coefficients peaked at zero, and the distribution becomes wider the higher the autoregressive order, i.e. the more autocorrelated the proxies are. What is interesting is that the "true" proxies have a distribution of correlation coefficients that is not very different from that of the pseudo proxies: the mode is only slightly shifted to the right, and there is a small bump around $0.5$.
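"Not very different" could be made precise, for instance with a two-sample Kolmogorov-Smirnov test between the true-proxy correlations `cc` and one set of pseudo-proxy correlations such as `cc1` (this is my addition, not an analysis from the paper). A sketch with simulated stand-ins for the two vectors:

```r
set.seed(2)
# stand-ins for cc and cc1: the true-proxy distribution is
# slightly shifted to the right, as in the plot above
cc.true <- rnorm(1000, mean = 0.05, sd = 0.12)
cc.null <- rnorm(1000, mean = 0.00, sd = 0.10)
ks <- ks.test(cc.true, cc.null)
# a small p-value indicates the two distributions do differ,
# even if the difference looks modest by eye
```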

We can conclude that, if there is any (linear) relationship between the proxies and the actual temperatures, it is rather weak. There is nothing wrong with being critical of reconstructions of past temperatures based on these proxies; at the very least, one should expect large uncertainties.

Stay tuned. More analysis of this data set is coming soon.
