[This article was first published on Nor Talk Too Wise » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

(This is the second of a series of ongoing posts on using Graph Algebra in the Social Sciences.)

First-order linear difference equations are powerful, yet simple modeling tools.  They can provide access to useful substantive insights to real-world phenomena.  They can have powerful predictive ability when used appropriately.  Additionally, they may be classified in any number of ways in accordance with the parameters by which they are defined.  And though they are not immune to any of a host of issues, a thoughtful approach to their application can always yield meaningful information, if not for discussion then for further refinement of the model.

Let’s look at that example from the last post:

df <- read.csv(file="http://nortalktoowise.com/code/datasets/Electricity.csv", head=TRUE, sep=",") attach(df) lagvar <- function(x,y){return(c(rep(NA,y),x[-((length(x)-y+1):length(x))]))} lagvar1 <- lagvar(electricalusage,1) model <- lm(electricalusage ~ lagvar1) y1 <- 290 y2 <- 0 t <- 0 a <- model$coefficients[] b <- model$coefficients[] timeserieslength <- nrow(df) for (i in 1:timeserieslength) { y2[i] <- (a*y1[i])+b t[i] <- i if (i < timeserieslength) y1[i+1]=y2[i]} plot(year, electricalusage, xlab="Year", ylab="Electricity Usage (in per capita KWh)", main="Figure 1: Annual Electricity Usage, 1920-70", pch=19) lines(year, y2, lwd=2)

Alright, this probably needs a little bit of explaining.  This first few lines are just getting the data from the URL, and attaching the set so we can call the variable names directly.  I’ll have to justify that weird function.  Instead of breaking it down character for character, I’m just going to get away with explaining what it does.  lagvar is function that takes a table, looks at the variable you tell it to (in this case, electricalusage), and copies it into a new array.  Why would you want that?  Well, it has the added flexibility of delaying the variable by a number of rows (which you can readily specify.  This one, for instance, takes our dataset:

> df year electricalusage 1  1920             339 2  1921             347 3  1922             359 4  1923             368 5  1924             378 ... 47 1966            5265 48 1967            5577 49 1968            6057 50 1969            6571 51 1970            7066

and generates lagvar1, which “lags” electricalusage by one row:

> newdata year electricalusage lagvar1 1  1920             339      NA 2  1921             347     339 3  1922             359     347 4  1923             368     359 5  1924             378     368 ... 47 1966            5265    4933 48 1967            5577    5265 49 1968            6057    5577 50 1969            6571    6057 51 1970            7066    6571

This is extremely useful for calculating differences over time.  Then we run a simple linear regression: electricalusage as explained by lagvar1.  However, the regression is not the end-all of this process (and a million freshman statistics students gasp in horror).  We just run the regression to estimate our slope and intercept (a and b respectively).  The rest of the code if effectively identical to the example from my last post, though I must make one small confession: you might have noticed that y1 is a constant, while the other parameter variables are called from the dataset or linear model.  The truth is that I don’t know an easy mathematical way to pick a perfect y1.  It should be pretty close to electricalusage at the beginning of the graph, which means 339-ish.  It’s actually a bit lower (290) because you must pick a value which represents the electricalusage in a place where the line doesn’t go, just the tiniest bit behind the graph.  In other words, just estimate it.  If your curve is off-target, just fiddle around with the y1 value until it looks right.  But if your y1 is crappy, your prediction curve is still going to beat the hell out of the linear model:

So, what’s Y* in this graph?  Mathematically it is b/(1-a) = 64, but the substantive implication is silly.  The close fit seen in figure 1 is compelling for predictive accuracy (and this may indeed be a fine graph for future predictions), but taking this model at face value still leaves us with a completely incorrect substantive conclusion: Human beings back to the beginning of time have had about 64 kWhrs at their disposal annually.  While this may not play a meaningful part in any model which incorporates these data, it is an example of the need to at least be aware of the substantive implications of a model.

Next time: When a first-order linear difference equation doesn’t cut it.

References: