(This article was first published on **The Pith of Performance**, and kindly contributed to R-bloggers)

During a lunchtime discussion among recent GCaP class attendees, the topic of weather came up and I casually mentioned that the weather in Melbourne, Australia, can be very changeable because the continent is so old that there is very little geographical relief to moderate the prevailing winds coming from the west.

In general, Melbourne is said to have a Mediterranean climate, but it can also get cold blasts of air coming up from Antarctic regions in the winter. Fortunately, the island state of Tasmania acts as something of a geographical barrier against these winds. Understanding possible relationships between these effects presents an interesting exercise in correlation analysis.

### Gathering Weather Data

Weather data for all major Australian cities is available from the Bureau of Meteorology. The subsequent discussion will employ weather records for the past calendar year (2013) collected from Melbourne, from Perth in Western Australia, and from Hobart and Launceston in Tasmania. The city of Perth has been in the news lately because it’s the base for aircraft searching for wreckage of Malaysia Airlines flight MH370. The available weather indicators include daily minimum and maximum temperatures and rainfall.

Figure 1 shows maximum temperatures in degrees Celsius. The trough occurs in the middle of the calendar year because that’s the winter season in Australia.

Which city is most strongly correlated with Melbourne’s temperatures? It’s impossible to decide based on the raw data alone. To answer such questions more rigorously we can use the cross correlation function (CCF) in R.

### Cross Correlation Plots

Applying the `ccf` function to the data in Fig. 1:

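    # Load the daily weather records for each city (file paths elided)
    df.mel  <- read.table("~/.../mel.csv",  header=TRUE, sep=",")
    df.per  <- read.table("~/.../per.csv",  header=TRUE, sep=",")
    df.hob  <- read.table("~/.../hob.csv",  header=TRUE, sep=",")
    df.laun <- read.table("~/.../laun.csv", header=TRUE, sep=",")

    # Convert the daily maximum temperatures into time series objects
    mel.ts  <- ts(df.mel$MaxT)
    per.ts  <- ts(df.per$MaxT)
    hob.ts  <- ts(df.hob$MaxT)
    laun.ts <- ts(df.laun$MaxT)

    # Cross correlate each city (x-series) with Melbourne (y-series)
    ccf(per.ts,  mel.ts)
    ccf(hob.ts,  mel.ts)
    ccf(laun.ts, mel.ts)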

produces the plots shown in Fig. 2.

Like a ripple in a pond, there can be a delay or lag between an event exhibiting itself in one time series and its effect showing up in the other time series.

The CCF is defined as the set of correlations (the heights of the vertical line segments in Fig. 2) between two time series $x_{t+h}$ and $y_t$ for lags $h = 0, \pm1, \pm2, \ldots$. A negative value for $h$ represents a correlation between the x-series at a time **before** $t$ and the y-series at time $t$. If, for example, the lag $h = -3$, then the cross correlation value would give the correlation between $x_{t-3}$ and $y_t$. Negative line segments correspond to events that are anti-correlated.

The CCF helps to identify lags of $x_t$ that could be predictors of the $y_t$ series.

- When $h < 0$ (left side of plots in Fig. 2), $x$ *leads* $y$.
- When $h > 0$ (right side of plots in Fig. 2), $x$ *lags* $y$.
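The sign convention is easy to get backwards, so here is a minimal sketch that checks it on synthetic data rather than the weather records. The white-noise signal `z`, the three-step shift, and the seed are all assumptions made purely for illustration. The y-series is built as a copy of the x-series delayed by three steps, so $x$ leads $y$ and the peak should appear at a negative lag. The last few lines recompute the $h = -3$ value by hand from the definition above.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
n <- 365
z <- rnorm(n + 3)                 # hypothetical white-noise signal (not weather data)

x <- z[4:(n + 3)]                 # x[t] = z[t + 3]
y <- z[1:n]                       # y[t] = z[t] = x[t - 3], so x leads y by 3 steps

r <- ccf(x, y, lag.max = 10, plot = FALSE)
peak.lag <- r$lag[which.max(r$acf)]
peak.lag                          # a negative lag, because x leads y

# Recompute the h = -3 value by hand: pair x[t - 3] with y[t] for t = 4, ..., n.
# The sums are divided by n (not n - h) to match the sample CCF estimator.
h <- 3
num <- sum((x[1:(n - h)] - mean(x)) * (y[(h + 1):n] - mean(y))) / n
den <- sqrt(mean((x - mean(x))^2) * mean((y - mean(y))^2))
manual <- num / den
ccf.value <- r$acf[r$lag == -h]
all.equal(manual, ccf.value)      # the hand calculation agrees with ccf()
```

Swapping the arguments, `ccf(y, x)`, mirrors the plot about $h = 0$, which is why the order of the two series matters when reading Fig. 2.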

For the weather correlation analysis, we would like to identify which series is leading or influencing the Melbourne time series.

### Interpreting the CCF Plots

The dominant or fundamental signal over 365 days in Fig. 1 resembles one period of a sine wave. The first row in Fig. 3 shows two pure sine waves (red and blue) that are in phase with each other (*left column*). The correlation plot (*right column*) shows a peak at $h=0$, in the middle of the plot, indicating that the two curves are most strongly correlated when there is no horizontal displacement between the curves.

The second row in Fig. 3 shows sine waves that are 90 degrees out of phase with each other (*left column*). The correlation plot (*right column*) shows that these two curves are most weakly correlated at zero lag. Conversely, they are most strongly correlated at $h=-16$ (left side of CCF plot) and most strongly anti-correlated at $h=+16$ (right side of CCF plot).

The third row in Fig. 3 is similar to the first row but with some Gaussian noise added to both signals. The correlation plot shows a slight loss of symmetry but otherwise doesn’t indicate much additional structure because the randomness of the noise in both signals tends to cancel out.

Figure 4 has the same signals as Fig. 3 but with 365 sample points to match the weather data in Fig. 1. This has the effect of broadening out the correlation plots and, indeed, they do more closely resemble the correlation plots in Fig. 2.
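The behaviour of the pure sine waves can be reproduced with a few lines of R. This is only a sketch under stated assumptions: the period of 64 samples (so that 90 degrees corresponds to 16 samples), the choice of four full cycles, and which wave plays the role of $x$ are my guesses, not values taken from the figures.

```r
period <- 64                       # assumed samples per cycle: 90 degrees = 16 samples
n <- 4 * period                    # assumed record length: four full cycles
idx <- 0:(n - 1)

x  <- sin(2 * pi * idx / period)
y0 <- sin(2 * pi * idx / period)           # in phase with x
y1 <- sin(2 * pi * (idx - 16) / period)    # delayed a quarter cycle, so x leads y1

in.phase   <- ccf(x, y0, lag.max = 32, plot = FALSE)
quadrature <- ccf(x, y1, lag.max = 32, plot = FALSE)

# Lag of the strongest positive correlation in each case
lag.in.phase   <- in.phase$lag[which.max(in.phase$acf)]
lag.quadrature <- quadrature$lag[which.max(quadrature$acf)]

# Correlation at zero lag for the out-of-phase pair (essentially zero)
zero.lag.quad <- quadrature$acf[quadrature$lag == 0]

c(in.phase = lag.in.phase, quadrature = lag.quadrature)
```

The in-phase pair peaks at zero lag, while the quarter-cycle delay moves the peak to the left of the plot, close to $h = -16$, and leaves essentially no correlation at $h = 0$.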

Returning to the weather data, we can examine the numerical CCF values directly:

    # Perth-Melbourne analysis
    ccf(per.ts, mel.ts, plot=FALSE)

    # Produces the numerical output:
    #
    # Autocorrelations of series ‘X’, by lag
    #
    #    -22   -21   -20   -19   -18   -17   -16   -15   -14   -13   -12   -11   -10    -9    -8    -7    -6
    #  0.542 0.498 0.488 0.511 0.525 0.545 0.563 0.550 0.549 0.554 0.588 0.576 0.599 0.594 0.549 0.540 0.615
    #     -5    -4    -3    -2    -1     0     1     2     3     4     5     6     7     8     9    10    11
    #  0.631 0.617 0.656 0.595 0.508 0.475 0.512 0.559 0.605 0.618 0.555 0.500 0.512 0.543 0.548 0.536 0.533
    #     12    13    14    15    16    17    18    19    20    21    22
    #  0.525 0.504 0.523 0.520 0.489 0.478 0.494 0.501 0.484 0.503 0.519

    # Find the largest correlation
    pm <- ccf(per.ts, mel.ts)
    max.pmc <- max(pm$acf)
    max.pmc
    # Produces the output: [1] 0.6564855

    # At what lag?
    pm$lag[which(pm$acf > max.pmc-0.01 & pm$acf < max.pmc+0.01)]
    # Produces the output: [1] -3

We can carry out the same analysis for the Hobart-Melbourne time series:

    # Hobart-Melbourne analysis
    hm <- ccf(hob.ts, mel.ts)
    max.hmc <- max(hm$acf)
    hm$lag[which(hm$acf > max.hmc-0.01 & hm$acf < max.hmc+0.01)]
    # 0.8269252 occurs at lag h = 0

and the Launceston-Melbourne time series:

    # Launceston-Melbourne analysis
    lm <- ccf(laun.ts, mel.ts)
    max.lmc <- max(lm$acf)
    lm$lag[which(lm$acf > max.lmc-0.01 & lm$acf < max.lmc+0.01)]
    # Two lags satisfy this criterion:
    # 0.801 occurs at lag h = 0
    # 0.791 occurs at lag h = -1
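Since the same three steps are repeated for each city, they could be wrapped in a small function. The helper below, `ccf.peaks`, is hypothetical (it is not part of the original analysis), and the data it is run on is synthetic because the weather files are not reproduced here. It returns every lag whose correlation lies within a tolerance of the maximum, which is how the two near-equal Launceston lags show up.

```r
# Hypothetical helper: lags whose correlation is within `tol` of the peak value
ccf.peaks <- function(x, y, tol = 0.01, lag.max = NULL) {
  r <- ccf(x, y, lag.max = lag.max, plot = FALSE)
  vals <- as.vector(r$acf)
  lags <- as.vector(r$lag)
  keep <- vals > max(vals) - tol
  data.frame(lag = lags[keep], correlation = vals[keep])
}

# Synthetic stand-in data (an assumption): y is a noisy copy of x delayed by 2 steps
set.seed(2)                        # arbitrary seed, only for reproducibility
n <- 365
z <- rnorm(n + 2)
x <- z[3:(n + 2)]                  # x[t] = z[t + 2], so x leads by two steps
y <- z[1:n] + rnorm(n, sd = 0.3)   # y[t] = z[t] plus independent noise

peaks <- ccf.peaks(x, y)
peaks                              # a single row, at a negative lag
```

With the real series, a call such as `ccf.peaks(laun.ts, mel.ts)` should list the same lags as the hand-written filter above, assuming the data are loaded as shown earlier.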

Next, we need to interpret all these statistics.

### Analysis and Conclusions

It does indeed take about three days for weather to cross the 2000 miles between Perth and Melbourne. But the correlation at lag $h = -3$ is only 0.66, whereas it’s around 0.8 for Hobart and Launceston. The coordinates (latitude and longitude, in decimal degrees) of the respective cities are:

- Melbourne: -37.813611, 144.963056
- Perth: -31.952222, 115.858889
- Launceston: -41.441944, 147.145
- Hobart: -42.880556, 147.325
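As a rough sanity check on the distances involved, the coordinates can be turned into great-circle distances with the haversine formula. This is a sketch, not part of the original analysis: the function name `haversine.km` and the mean Earth radius of 6371 km are my assumptions, and a straight-line distance says nothing about the path a weather system actually follows.

```r
# Great-circle distance via the haversine formula.
# radius = 6371 km is an assumed mean Earth radius.
haversine.km <- function(lat1, lon1, lat2, lon2, radius = 6371) {
  to.rad <- pi / 180
  dlat <- (lat2 - lat1) * to.rad
  dlon <- (lon2 - lon1) * to.rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to.rad) * cos(lat2 * to.rad) * sin(dlon / 2)^2
  2 * radius * asin(sqrt(a))
}

# Coordinates as listed above
mel.lat <- -37.813611
mel.lon <- 144.963056
cities <- data.frame(
  city = c("Perth", "Launceston", "Hobart"),
  lat  = c(-31.952222, -41.441944, -42.880556),
  lon  = c(115.858889, 147.145, 147.325)
)

cities$km.to.melbourne    <- haversine.km(cities$lat, cities$lon, mel.lat, mel.lon)
cities$miles.to.melbourne <- cities$km.to.melbourne / 1.609344
cities
```

Perth comes out at roughly 2,700 km (about 1,700 miles) from Melbourne in a straight line, compared with a few hundred kilometres for the two Tasmanian cities, which is consistent with a lag of several days for Perth and a lag of zero or one day for Tasmania.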

The prevailing westerly winds originate with the *Roaring Forties* in the Indian Ocean. That name is a reference to 40 degrees south latitude. Melbourne is located at about 38 degrees south latitude. Perth, on the other hand, is located at a latitude much further north; it’s even north of Sydney! In addition, there is a considerable desert region (roughly two thirds of the breadth of the continent) between Perth and Melbourne. Therefore, we can expect the correlations between Perth and Melbourne temperatures to be weaker than those associated with the Tasmanian cities.

Hobart is further south than Melbourne and, although it’s on the eastern side of the Tasmanian island, there is no other land mass between the longitudes at Perth and Hobart. Hence, at zero lag, Hobart is more strongly correlated with Melbourne than Perth is.

Launceston is closest to Melbourne by latitude and, at zero lag, has a similar correlation to that for Hobart. No surprise there. There is one difference, however. A similar correlation exists at $h = -1$, which means Launceston *leads* Melbourne by a day, even though it is slightly *east* of Melbourne by about two degrees longitude. How can that be? One possibility is that it represents the effect of more southerly winds, such as those originating with the *Screaming Sixties* and circulating roughly counter-clockwise around the east coast of Tasmania. Cross correlated cross winds.

I’ll talk more about time series analysis in the upcoming Guerrilla Data Analysis Techniques class.
