For a luddite like me, this is a big step – posting something on the inter-web. I’m not on Facebook. I don’t know what Twitter is. Hell, I don’t even own a smartphone. But, I’ve been a devoted follower of Myles’ blog for some time, and he was kind enough to let his fellow-geek-of-a-brother contribute something to everyday analytics, so who was I to pass up such an opportunity?
The impetus for my choice of analysis was this: in celebration of Earth Day, my colleagues and I watched a film about global climate change, which was a nice excuse to eat pizza and slouch in an office chair while sipping Dr. Pepper instead of doing other, presumably useful, things. Anyway, a good chunk of the film centred on the evidence for anthropogenic greenhouse gas emissions altering the global climate system.
While I’ve seen lots of evidence for recent increases in air temperature in the mid-latitude areas of the planet, there’s nothing quite so convincing as doing your own analysis. So, I downloaded climate data from Environment Canada and did my own climate change analysis. I’m an aquatic scientist, not a climate scientist, so if I’ve made any egregious mistakes here, perhaps someone well-versed in climatology will show me the error of my ways, and I’ll learn something. Anyway, here we go.
Let’s start with mean monthly temperatures from daily means (the average, for each month of the year, of the daily mean temperatures) for the city of Toronto, for which a fairly good record exists (1940 to 2012). Here’s what the data look like:
So, you can see the clear trend in the data, can’t you? Trend analysis is a tricky undertaking for a number of reasons, one of which is that variation can exist on a number of temporal scales. We’re looking at temperatures here, so obviously we would expect significant seasonality in the data, and we are not disappointed:
So, this is all nice and good from a data visualization standpoint, but we need to perform some statistics in order to quantify the rate of change, and to decide if the change is significant in the statistical sense. Below are the results from linear regression analyses of temperature vs. year using the original monthly means, the deseasonalized data, and the annual means.
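| Dependent (Response) Variable | Slope (°C/yr) | R² | p-value |
| --- | --- | --- | --- |
| Monthly Mean Temperature | 0.022 | 0.001 | not significant at p < 0.05 |
| Deseasonalized Monthly Temperatures | 0.022 | 0.05 | 5.82 × 10⁻¹² |
| Annual Mean Temperature | 0.022 | 0.20 | 4.65 × 10⁻⁵ |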
All 3 analyses yielded a slope of 0.022 °C/yr, which is to say, the total change over the 70 years analysed was about 1.54 °C (0.022 °C/yr × 70 yr). The regression based on monthly mean temperatures had a very low goodness of fit (R² = 0.001) and was not significant at the conventional cut-off level of p < 0.05. This is not surprising given the scatter we observed in the data due to seasonality. It is therefore also no surprise that the deseasonalized data had much better goodness of fit (R² = 0.05), as did the annual mean temperatures (R² = 0.20). The much higher level of statistical significance of the regression on deseasonalized data than on the annual means is likely a function of the higher power of the analysis (i.e., 876 data points vs. only 73).
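For anyone who wants to try this at home, here is a minimal sketch of the three regressions in plain Python. It runs on made-up data (a 0.022 °C/yr trend plus a seasonal cycle and random noise), not the Environment Canada record, and it assumes that "deseasonalizing" means subtracting each calendar month's long-term mean, which is one common approach but not necessarily the one used for the numbers above. The helper `ols` and all the constants are illustrative.

```python
import math
import random

def ols(x, y):
    """Return (slope, intercept, r_squared) for a least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx
    return slope, my - slope * mx, sxy ** 2 / (sxx * syy)

# Synthetic stand-in for the Toronto record (NOT the real data):
# 73 years x 12 months, a 0.022 degC/yr trend, a seasonal cycle, and noise.
rng = random.Random(1)
years = list(range(1940, 2013))
time, temp, month_of = [], [], []
for yr in years:
    for m in range(12):
        t = yr + (m + 0.5) / 12                                     # mid-month, in decimal years
        seasonal = -13.0 * math.cos(2 * math.pi * (m + 0.5) / 12)   # coldest in January
        time.append(t)
        month_of.append(m)
        temp.append(8.0 + 0.022 * (t - 1940) + seasonal + rng.gauss(0, 2.0))

# Deseasonalize (assumed method): subtract each calendar month's long-term mean.
month_mean = [
    sum(v for v, m in zip(temp, month_of) if m == k) / len(years) for k in range(12)
]
deseason = [v - month_mean[m] for v, m in zip(temp, month_of)]

# Annual means: average the 12 monthly values of each year.
annual = [sum(temp[i * 12:(i + 1) * 12]) / 12 for i in range(len(years))]

slope_m, _, r2_monthly = ols(time, temp)
slope_d, _, r2_deseason = ols(time, deseason)
slope_a, _, r2_annual = ols(years, annual)

for name, slope, r2 in [("monthly", slope_m, r2_monthly),
                        ("deseasonalized", slope_d, r2_deseason),
                        ("annual", slope_a, r2_annual)]:
    print(f"{name:>15}: slope = {slope:.4f} degC/yr, R^2 = {r2:.3f}")

# Total change implied by a slope over a period: slope x number of years.
print(f"0.022 degC/yr over 70 yr = {0.022 * 70:.2f} degC")
```

The slopes come out nearly identical in all three cases, while R² climbs as the seasonal scatter is removed or averaged away, which is the same pattern as in the results above.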
Before we get too carried away here interpreting these results, is there anything we’re forgetting? Right, those annoying underlying assumptions of the statistical test we just used. According to Zar (1999), for simple linear regression these are:
1. For any value of X there exists in the population a normal distribution of Y values. This also means that, for each value of X there exists in the population a normal distribution of ε’s.
2. Must assume homogeneity of variances; that is, the variances of these population distributions of Y values (and of ε’s) must all be equal to one another.
3. The actual relationship is linear.
4. The values of Y are to have come at random from the sampled population and are to be independent of one another.
5. The measurements of X are obtained without error. This…requirement…is typically impossible; so what we are doing in practice is assuming that the errors in the X data are negligible, or at least are small compared with the measurement errors in Y.
Hmm, this suddenly became a lot more complicated. Let’s check the validity of these assumptions for the regression of the deseasonalized monthly temperatures vs. year. Well, we can safely say that number 5 is not a concern, i.e., that the dates were measured without error, but what about the others? Arguably, the data are not actually linear, because of the fall in temperature between 1960 and 1970, so this is something of a concern. The Shapiro-Wilk test tells us that the residuals are not significantly non-normal (assumption 1) but just barely (p = 0.056). We can visualize this via a Q-Q (Quantile-Quantile) plot of the residuals:
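The points on a Q-Q plot are easy enough to compute by hand: sort the residuals and pair each one with the standard-normal quantile for its rank. The sketch below does that in plain Python on made-up residuals, and also reports the correlation between the two sets of quantiles as a rough straightness score. That correlation is a stand-in for eyeballing the plot, not the Shapiro-Wilk test, and the function names and plotting positions ((i + 0.5)/n) are illustrative choices.

```python
import random
from statistics import NormalDist, mean

def qq_points(residuals):
    """Pair each sorted residual with the standard-normal quantile for its rank."""
    n = len(residuals)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(residuals)))

def qq_correlation(points):
    """Pearson correlation between theoretical and sample quantiles (1.0 = straight line)."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Made-up residuals: one roughly normal set, one clearly skewed set.
rng = random.Random(2)
normal_resid = [rng.gauss(0, 2.0) for _ in range(876)]
skewed_resid = [rng.expovariate(1.0) for _ in range(876)]

r_normal = qq_correlation(qq_points(normal_resid))
r_skewed = qq_correlation(qq_points(skewed_resid))
print(f"Q-Q correlation, normal residuals: {r_normal:.4f}")
print(f"Q-Q correlation, skewed residuals: {r_skewed:.4f}")
```

Residuals that are close to normal fall along a straight line and give a correlation very near 1; skewed residuals bend away from the line and score noticeably lower.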
What about assumption 2, homogeneity of variances? This is typically assessed by plotting the residuals against the fitted values, like so:
There does not appear to be a systematic change in the magnitude of the residuals as a function of the predicted values, or at least nothing overly worrisome, so we’re good here, too.
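If you would rather have a number to go with the eyeball test, one crude option is to sort the residuals by fitted value, split them in half, and compare the variances of the two halves. The sketch below does this on made-up data. It is an informal check under my own assumptions, not a formal test of homogeneity of variances, and `split_variance_ratio` is a hypothetical helper.

```python
import random

def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def split_variance_ratio(fitted, residuals):
    """Residual variance in the upper half of fitted values divided by the lower half."""
    ordered = [r for _, r in sorted(zip(fitted, residuals))]
    half = len(ordered) // 2
    return variance(ordered[half:]) / variance(ordered[:half])

# Made-up example: fitted values rising steadily, with two kinds of residuals.
rng = random.Random(4)
n = 876
fitted = [8.0 + 1.6 * i / (n - 1) for i in range(n)]
even_resid = [rng.gauss(0, 2.0) for _ in range(n)]                       # constant spread
fanning_resid = [rng.gauss(0, 1.0 + 2.0 * i / (n - 1)) for i in range(n)]  # spread grows

ratio_even = split_variance_ratio(fitted, even_resid)
ratio_fanning = split_variance_ratio(fitted, fanning_resid)
print(f"variance ratio, constant spread: {ratio_even:.2f}")
print(f"variance ratio, fanning spread:  {ratio_fanning:.2f}")
```

A ratio near 1 is what "nothing overly worrisome" looks like; a ratio well above or below 1 suggests the residuals fan out or narrow as the fitted values increase.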
Last, but certainly not least, do our data represent independent measurements? This last assumption is frequently a problem in trend analysis. While each temperature was presumably measured on a different day, in the statistical sense this does not necessarily imply that the measurements are not autocorrelated. Several years of data could be influenced by an external factor which influences temperature over a multi-year timescale (El Niño?) which would cause the data from sequential years to be strongly correlated. Such temporal autocorrelation (serial dependence) can be visualized using an autocorrelation function (ACF):
The plot tells us that at a variety of lag periods (differences between years) the level of autocorrelation is significant (i.e., the ACF is above the blue line). The Durbin-Watson test confirms that the overall level of autocorrelation in the residuals is highly significant (p = 4.04 × 10⁻¹³).
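Both quantities are simple to compute directly. The sketch below calculates the sample ACF and the Durbin-Watson statistic in plain Python for a made-up, deliberately autocorrelated residual series. It assumes the line on the plot is the usual approximate 95% bound of ±1.96/√n, and it only computes the Durbin-Watson statistic (values near 2 mean no lag-1 autocorrelation, values well below 2 mean positive autocorrelation), not its p-value.

```python
import random

def acf(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag."""
    n = len(series)
    m = sum(series) / n
    dev = [v - m for v in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    return num / sum(e * e for e in residuals)

# Made-up residuals with built-in serial dependence (AR(1) with coefficient 0.5).
rng = random.Random(3)
phi = 0.5
resid = [rng.gauss(0, 1)]
for _ in range(875):
    resid.append(phi * resid[-1] + rng.gauss(0, 1))
mean_resid = sum(resid) / len(resid)
resid = [e - mean_resid for e in resid]   # regression residuals have zero mean

rho = acf(resid, 24)
bound = 1.96 / len(resid) ** 0.5          # assumed approximate 95% bound
dw = durbin_watson(resid)
significant_lags = [k for k in range(1, 25) if abs(rho[k]) > bound]

print(f"lag-1 autocorrelation = {rho[1]:.3f} (bound = {bound:.3f})")
print(f"Durbin-Watson statistic = {dw:.3f}")
print("lags outside the bound:", significant_lags)
```

A handy rule of thumb is that the Durbin-Watson statistic is roughly 2 × (1 − lag-1 autocorrelation), so the two diagnostics tell much the same story at lag 1.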
So, strictly speaking, linear regression is not appropriate for our data due to the presence of nonlinearity and serial correlation, which violate two of the five assumptions of linear regression analysis. Now, don’t get me wrong, people violate these assumptions all the time. Hell, you may have already violated them earlier today if you’re anything like I was in my early days of grad school. But, as I said, this is my first blog post ever, and I don’t want to come across as some sloppy, apathetic, slap-dash, get-away-with-whatever-the-peer-reviewers-don’t-call-me-out-on type scientist - so let’s shoot for real statistical rigour here!
Fortunately, this is not too onerous a task, as there is a test that was tailor-made for trend analysis, and doesn’t have the somewhat strict requirements of linear regression. Enter the Hirsch-Slack Test, a variation of the Seasonal Kendall Trend Test, which corrects for both seasonality and temporal autocorrelation. I could get into more explanation as to how the test works, but this post is getting to be a little long, and hopefully you trust me by now. So, drum roll please….
The Hirsch-Slack test gives very similar results to those obtained using linear regression; it indicates a highly significant (p = 1.48 × 10⁻⁴) increasing trend in temperature (0.020 °C/yr), which is very close to the slope of 0.022 °C/yr obtained by linear regression.
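To give a flavour of how this family of tests works, here is a minimal sketch of the plain Seasonal Kendall test with its median-of-pairwise-slopes trend estimate, run on made-up data. Be warned that this is the simpler relative of the test used above: it treats the months as independent and so leaves out the Hirsch-Slack correction for serial dependence, and it also skips the adjustment for tied values. The function `seasonal_kendall` and the synthetic series are illustrative only.

```python
import math
import random
from statistics import NormalDist, median

def sign(v):
    return (v > 0) - (v < 0)

def seasonal_kendall(values_by_season):
    """Seasonal Kendall test (no serial-dependence or tie correction).

    values_by_season: one list per season (month), one value per year, in time order.
    Returns (z, two_sided_p, slope_per_year).
    """
    s_total, var_total, slopes = 0, 0.0, []
    for series in values_by_season:
        n = len(series)
        for i in range(n - 1):
            for j in range(i + 1, n):
                diff = series[j] - series[i]
                s_total += sign(diff)             # Kendall's S: later vs. earlier year
                slopes.append(diff / (j - i))     # pairwise slope within the same month
        var_total += n * (n - 1) * (2 * n + 5) / 18
    if s_total > 0:
        z = (s_total - 1) / math.sqrt(var_total)  # continuity correction
    elif s_total < 0:
        z = (s_total + 1) / math.sqrt(var_total)
    else:
        z = 0.0
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p, median(slopes)

# Made-up record (NOT the real data): 12 months x 73 years, 0.022 degC/yr trend.
rng = random.Random(5)
by_month = []
for m in range(12):
    seasonal = -13.0 * math.cos(2 * math.pi * (m + 0.5) / 12)
    by_month.append([8.0 + seasonal + 0.022 * yr + rng.gauss(0, 2.0) for yr in range(73)])

z, p, slope = seasonal_kendall(by_month)
print(f"Z = {z:.2f}, p = {p:.2e}, slope = {slope:.4f} degC/yr")
```

Because values are only ever compared with other values from the same calendar month, the seasonal cycle drops out without any explicit deseasonalizing step.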
So, no matter which way you slice it, there was a significant increase in Toronto’s temperature over the past 70 years. I’m curious about what caused the dip in temperature between ~1960 and ~1970, and have a feeling it may reflect changes in aerosols and other aspects of air quality related to urbanization, but don’t feel comfortable speculating too much. Perhaps it reflects some regional or global variation related to volcanic activity or something; I really have no idea. Obviously, if we’d performed the analysis on the years 1970 to 2010 the slope (i.e., rate of temperature increase) would have been much higher than for the entire period of record.
I was also curious if Toronto was a good model for the rest of Canada given that it is a large, rapidly growing city, and changes in temperature there could have been related to urban factors, such as the changes in air quality I already speculated about. For this reason, I performed the same analysis on data from rural Coldwater (near where Myles and I grew up) and obtained very similar results, which suggests the trend is not unique to the city of Toronto.
In case you’re wondering, the vast majority (98%) of Canadians believe the global climate is changing, according to a recent poll by Insightrix Research (but note that far fewer believe that human activity is solely to blame). So, perhaps the results of this analysis won’t be a surprise to very many people, but I did find it satisfying to perform the analysis myself, and with local data.
Well, that’s all for now - time to brace ourselves for the coming heat of summer. I think I need a nice, cold beer.
References & Resources
Zar, J.H. (1999). Biostatistical Analysis, 4th ed. Upper Saddle River, New Jersey: Prentice Hall.
Hirsch, R.M. & Slack, J.R. (1984). A Nonparametric Trend Test for Seasonal Data With Serial Dependence. Water Resources Research 20(6), 727-732. doi: 10.1029/WR020i006p00727
National Post: Climate Change is real, Canadians say, but they can't agree on the cause
Climate Data at Canadian National Climate Data and Information Archive
Joel Harrison, PhD, Aquatic Scientist