Tapply on steroids – by (climate trends in Scottish seasons)

[This article was first published on R – scottishsnow, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A while back I wrote a post on using the R tool tapply, which is my most popular post. Tapply allows you to repeat a function over a vector split into groups, e.g. you’ve two columns of data with the first being Date and the second being a variable (snow days, banana imports, whatever) and you want to look at your variable split by month, year, etc.. The examples in this post use Met Office data I’ve previously used for September weather and Scottish winters. You can download these and write them to a database to follow this post using my Bitbucket repository. An example of tapply is to find the mean over each season, i.e.:


# Setup
library(RSQLite)

# Get all Scottish data from db
data = dbGetQuery(clim.db, "SELECT Year, Month AS Season,
    Variable, Vals FROM data WHERE Country='Scotland' AND
    (Month='SPR' OR Month='SUM' OR Month='AUT' OR Month='WIN')")

# reorder factor
data$Season = factor(data$Season, levels=c("SPR", "SUM",
    "AUT", "WIN"))
# relabel factor
data$Season = factor(data$Season, labels=c("Spring", "Summer",
    "Autumn", "Winter"))

# reorder factor
data$Variable = factor(data$Variable, levels=c("frost", "maxtemp",
    "meantemp", "mintemp", "precipdays", "precip", "sun"))
# relabel factor
data$Variable = factor(data$Variable, labels=c("Frost days",
    "Max temp (degC)", "Mean temp (degC)", "Min temp (degC)",
    "Precipitation days", "Precipitation (mm)", "Sunshine (hrs)"))

# ----------------------------------------------
# ----------------------------------------------

# Find the mean value for each season
data.mean = tapply(data$Vals, list(data$Season, data$Variable), mean)

# Find the mean value and ignore NAs
data.mean = tapply(data$Vals, list(data$Season, data$Variable),
    mean, na.rm=T)

So now data.mean is a matrix of means with rows as seasons and columns as climate attributes. Great! But not the point of this post, I covered this previously. What tapply can’t do is iterate over more than a vector, i.e. it can only apply your function to a single vector at a time. If you wanted to iterate over two columns finding the correlations between them, it can’t do this. Which brings me neatly onto the function by. Using our previously loaded data, we can now make a Pearson’s correlation between years and variables. This is basically r, of r² fame, where the sign indicates the direction of correlation; i.e. a negative value indicates a negative correlation (y decreases for increasing x) and visa versa. The closer to 1 or -1 this correlation gets the better the line fit to the data points. However this can be misleading, e.g. if you only have two points you will always get a correlation of 1 or -1. Here this is in practice:


# Correlation
data.cor = by(data, list(data$Season, data$Variable),
    function(i){cor(x=i[,1], y=i[,4], method="pearson")})
# So the explaining variable (x) is the first column of data (Year)
# and the dependent variable (y) is the fourth column of data (Vals)

Now data.cor is a list (sort of) of correlations for each variable and season. Have a look, they’re pretty terrible. This list is slightly awkward to work with, the following code puts it into a useable format:


# Tidy correlation list
data.cor = stack(data.frame(rbind(data.cor)))
data.cor = data.frame(Season=sort(unique(data$Season)),
    Variable=data.cor[,2], Correlation=round(data.cor[,1], 2))

We can also repeat this for linear models/regression lines/best fit:


# Regression line
data.lm = by(data, list(data$Season, data$Variable),
    function(i){lm(i[,1] ~ i[,4])})

# Tidy regression data
data.lm = data.frame(Season=sort(unique(data$Season)),
    Variable=rep(sort(unique(data$Variable)),
    each=length(data.lm)/length(unique(data$Variable))),
    round(t(sapply(data.lm, coef)), 1))
names(data.lm)[3:4] = c("Intercept", "Slope")

Excitingly, we can turn these results into text strings for plot annotation:


# Turn correlation and regression into text annotation
# \n means a new line
text.ann = data.frame(data.lm[,1:2], lab=paste0("Correlation = ",
    data.cor[,3], "\nSlope = ", data.lm[,4], "\nIntercept = ",
    data.lm[,3]))

And then create a plot matrix of our data with results annotated on top:


# Plot results
library(ggplot2)

# Get plotting position for annotation
text.ann$pl = rep(tapply(data$Vals, data$Variable, max, na.rm=T),
    each=4)

# Write plot to png
png("~/results.png", width=1600, height=2000, res=210)
ggplot(data, aes(x=Year, y=Vals)) +
    geom_point(shape=1) +
    stat_smooth(method="lm", colour="black", size=1) +
    facet_grid(Variable ~ Season, scales="free_y") +
    ylab("Climate variables") +
    geom_text(data=text.ann[-c(6,10,14,25,26),],
        aes(label=lab, y=pl*.9), x=1930, size=2) +
    geom_text(data=text.ann[c(6,10,14,25,26),],
        aes(label=lab, y=pl*.3), x=1930, size=2)
dev.off()

The results of this are the plot below:

Seasonal Scottish climate trends (click image for higher resolution).

Seasonal Scottish climate trends (click image for higher resolution).


To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)