Dynamic Modeling 3: When the first-order difference model doesn’t cut it

June 12, 2010

(This article was first published on Nor Talk Too Wise » R, and kindly contributed to R-bloggers)

Data must be selected carefully.  The predictive usefulness of the model is grossly diminished if outliers taint the available data.  Figure 1, for instance, shows the Defense spending (as a fraction of the national budget) between 1948 and 1968.

Note how the trend curve (as defined by our linear difference model from the last post: see appendix for a fuller description) is a very poor predictor.  Whatever is going on here isn’t a first-order process.  So, let’s neglect the model entirely for a moment.  The huge variations in spending between 1950 and 1952 indicate there were years within the selected time span for which the defense spending dramatically increased because of some exogenous shock, and then spending trended downwards.

One way we can judge the usefulness a little more scientifically is to run a regression on the differences.  In other words, Plot Y(t+1) on top of Y(t), and insert the regression line.  Check it out:

In general, the regression is a pretty good fit for the clustered points in the middle.  However, we also have some nasty outliers.  Something about this data isn’t a first-order process.  Wikipedia to the Rescue.  It is history that inconveniently interrupts our model, causing these outliers.  President Harry Truman cut back military spending in the wake of World War II.  However, any hope Truman may have had for shifting the national focus away from foreign military affairs was ruined by the onset of the Korean War late in 1950.  Thus we see the large spike in spending evident in Figure 1.  A predictive first-order model which spans this event with just these data will be limited in its effectiveness.  The grossly underfitted line of Figure 1 is little better than useless for indicating the spending during any given year.  A much more effective model would start by following the onset of the Korean War (a completely unpredictable event if prediction were solely based on these data), and trace the evolution of the expenditure as it decreased from its start-of-conflict high in 1952, as we see in Figure 3.

It ain’t perfect, but it’s definitely much better.  So, dynamic modeling with first-order linear difference equations has an enormous array of applications.  However, it is easy to be seduced by the numbers without carefully considering the data and the substantive implications of these findings.  Any inattentiveness in this respect may easily lead to meaningless, incorrect, or downright silly conclusions.

Next Time: Saving the code, changing the method.


Code adapted from http://www.courtneybrown.com/classes/ModelingSocialPhenomena/Assignments/Assignment2NationalDefenseOutlaysCourtneyBrownMathModeling.htm

Dataset taken from http://www.courtneybrown.com/classes/ModelingSocialPhenomena/Assignments/Assignment2CourtneyBrownMathModeling.htm


This is the R code to generate these Graphs:

df <- read.csv(file="http://nortalktoowise.com/code/datasets/Defense.csv", head=TRUE, sep=",")
lagvar <- function(x,y){return(c(rep(NA, y),x[-((length(x)-y+1):length(x))]))}
lagvar1 <- lagvar(defensespending,1)
model <- lm(defensespending ~ lagvar1)
y2 <- 0
t <- 0
y1 <- .3
a <- model$coefficients[[2]]
b <- model$coefficients[[1]]
timeserieslength <- nrow(df)
for (i in 1:timeserieslength) {
	y2[i] <- (a*y1[i])+b
	t[i] <- i
	if (i < timeserieslength) y1[i+1]=y2[i]}
plot(t, defensespending, xlab="Years", ylab="Defense Spending as Fraction of Budget", main="Figure 1: U.S. Defense Expenditure, 1948-68", pch=19)
lines(t, y2, lwd=2)
plot(lagvar1, defensespending, xlab="", ylab="", pch=19)
title(xlab="Y", ylab="Y(t+1)", main="Figure 2: Plot of the first differences", cex=1.5, col="black", font=2)
abline(model, lwd=2)
newdf <- df[which(year>1951, arr.in=TRUE),]
lagvar2 <- lagvar(newdf$defensespending,1)
model2 <- lm(newdf$defensespending ~ lagvar2)
y2 <- 0
t <- 0
y1 <- .68
a <- model2$coefficients[[2]]
b <- model2$coefficients[[1]]
timeserieslength <- nrow(newdf)
for (i in 1:timeserieslength) {
	y2[i] <- (a*y1[i])+b
	t[i] <- i
	if (i < timeserieslength) y1[i+1]=y2[i]}
plot(newdf$year, newdf$defensespending, xlab="Years", ylab="Defense Spending as Fraction of Budget", main="Figure 3: U.S. Defense Expenditure, 1952-68", pch=19)
lines(newdf$year, y2, lwd=2)

Please forgive my poor style (reusing variable names and whatnot).  It works, and that’s enough for me at the moment.

To leave a comment for the author, please follow the link and comment on their blog: Nor Talk Too Wise » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , ,

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Dommino data lab

Quantide: statistical consulting and training




CRC R books series

Six Sigma Online Training

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)