# CRU graph yet again (with R)

December 13, 2009
By

(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

IowaHawk has a excellent article attempting to reproduce the infamous CRU climate graph using OpenOffice: Fables of the Reconstruction. We thought we would show how to produced similarly bad results using R.

If the re-constructed technique is close to what was originally done then so many bad moves were taken that you can’t learn much of anything from the original “result.” This points out some of the pratfalls of not performing hold-out tests, not examining the modeling diagnostics and not remembering that linear regression models fail to low-variance models (i.e. when they fail they do a good job predicting the mean and vastly under-estimate variance).

Our article not an article on global warming, but an article on analysis technique. Human driven global warming is either happening or not happening independent of any bad analysis. Finding the physical truth is a bigger harder job than eliminating some bad reports (the opposite of a bad report is not necessarily the truth). Bad analyses can have many different sources (mistakes, trying to jump ahead of your colleagues on something you believe is true, trying to fake something you believe is false or be figments of overly harsh critics) and we have not heard enough to make any accusations.

First: load the data (I re-formatted it at bit so R can read it: jonesmannrogfig2c.txt, data1400.dat_.txt ) , perform the principle components reduction and fit a first
model.

> library(lattice)
> d1400r <- as.matrix(d1400[,2:23])
> pcomp <- prcomp(na.omit(d1400r))
> plot(pcomp)
> vars <- data.frame(cbind(Year=d1400[,1],d1400r %*% pcomp$rotation),row.names=d1400[,1]) > jones <- read.table('jonesmannrogfig2c.txt',sep='\t',header=TRUE) > datUnion <- merge(vars,jones,all=TRUE) > datUnion$avgTemp <- with(datUnion,(NH+CET+Central.Europe+Fennoscandia)/4.0)
> model <- lm(avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 ,dat=datUnion)
> summary(model)

Call:
lm(formula = avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5, data = datUnion)

Residuals:
Min         1Q     Median         3Q        Max
-0.8811679 -0.2658117  0.0008174  0.2933058  1.0450044

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.0065252  0.5750696   0.011   0.9910
PC1         -0.0001683  0.0003912  -0.430   0.6679
PC2         -0.0003678  0.0010114  -0.364   0.7168
PC3          0.0003177  0.0014821   0.214   0.8307
PC4          0.0044084  0.0019351   2.278   0.0246 *
PC5          0.0188520  0.0205137   0.919   0.3601
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4505 on 113 degrees of freedom
(484 observations deleted due to missingness)
Multiple R-squared: 0.05223,	Adjusted R-squared: 0.01029
F-statistic: 1.245 on 5 and 113 DF,  p-value: 0.2927


We used only 5 principle components as modeling variables, because as is typical of principle component analysis- beyond the first few components the components become vanishingly small and unsuitable to use in modeling (see graph pcomp below).

However, this gave a model with far smaller R-squared than people are reporting, so lets add in a lot of components like everybody else does (bad!).

> model <- lm(avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 +PC10 +PC11 +PC12 + PC13 ,dat=datUnion)
> summary(model)

Call:
lm(formula = avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 +
PC8 + PC9 + PC10 + PC11 + PC12 + PC13, data = datUnion)

Residuals:
Min       1Q   Median       3Q      Max
-0.87249 -0.25951  0.03996  0.25055  0.99039

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.431e-01  1.424e+00   0.522   0.6028
PC1         -1.796e-04  3.665e-04  -0.490   0.6253
PC2         -4.179e-04  9.759e-04  -0.428   0.6694
PC3          3.306e-05  1.430e-03   0.023   0.9816
PC4          3.416e-03  1.803e-03   1.894   0.0609 .
PC5          4.032e-02  1.978e-02   2.039   0.0440 *
PC6         -3.260e-03  2.660e-02  -0.123   0.9027
PC7         -7.134e-02  3.620e-02  -1.971   0.0514 .
PC8         -1.339e-01  7.895e-02  -1.696   0.0928 .
PC9          7.577e-02  5.734e-02   1.321   0.1892
PC10         2.700e-01  5.878e-02   4.594 1.22e-05 ***
PC11         8.562e-02  6.741e-02   1.270   0.2068
PC12        -8.057e-02  1.053e-01  -0.765   0.4461
PC13        -4.099e-02  1.064e-01  -0.385   0.7008
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4141 on 105 degrees of freedom
(484 observations deleted due to missingness)
Multiple R-squared: 0.2558,	Adjusted R-squared: 0.1637
F-statistic: 2.777 on 13 and 105 DF,  p-value: 0.001961


This is a degenerate model that essentially didn’t fit (thought the significance on PC10 component fools the fitter, but PC10 can’t be usable- it is essentially noise). Graphically we can see the fit is not very useful (despite having a little bit of R-squared) by looking at the graph of the fit plotted in the region of fitting. Notice how the fit variance is much smaller than the true data variance even in the region of training data, this is typical of bad regression fits.

> dRange <- datUnion[datUnion$Year>=1856 & datUnion$Year<=1980,]
> xyplot(avgTemp + prediction ~Year,dat=dRange,type='l',auto.key=TRUE)


Now the statement they wanted to make is that the present looks nothing like the past. The past is only available through the fit model so what you would hope is that the model looks like the present and then the model itself separates the past and present. Instead as you see in the graphs above and below this fails two ways: the model looks nothing like the present and the model’s past looks a lot like the model’s present.

> datUnion$prediction <- predict(model,newdata=datUnion) > xyplot(avgTemp + prediction ~Year,dat=datUnion,type=c('p','smooth'),auto.key=TRUE)  What we could do to falsely drive the conclusion (which itself may or may not be true, it just is not supported by this technique, model or data) is create the infamous graph where we switch from modeled data in the past to actual data in the present and then act surprised that the two did not line up (which they did at no step during the fitting). I don’t have the heart to unify the colors or remove the legend, but here is the graph below: > datUnion$dinked <- datUnion$prediction > datUnion$dinked[!is.na(datUnion\$avg)] <- NA
> xyplot(avgTemp + dinked ~Year,dat=datUnion,type=c('p','smooth'),auto.key=TRUE)


The reason the blue points look different than the others is they came from the average temperature data instead of the model (where everything else came from). Switching the series is essentially assuming the conclusion that recent past looks very different than the far past.

Essentially this methodology was so poor it could not have illustrated or contradicted recent global warming. There are plenty of warning signs that the model fitting are problematic and the conclusion illustrated in the last graph can not actually be proved or disproved from this data (the proxy variables are too weak to be useful, that is not to say there are not other better proxy variables or modeling techniques). The problems of the presentation are, of course, not essential problems in detecting global warming (which likely is occurring and likely will be a drain on future quality of life) but problems found in a single bad analysis.

Related posts:

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...