**Nor Talk Too Wise » R**, and kindly contributed to R-bloggers)

Data must be selected carefully. The predictive usefulness of the model is grossly diminished if outliers taint the available data. Figure 1, for instance, shows the Defense spending (as a fraction of the national budget) between 1948 and 1968.

Note how the trend curve (as defined by our linear difference model from the last post: see appendix for a fuller description) is a very poor predictor. Whatever is going on here isn’t a first-order process. So, let’s neglect the model entirely for a moment. The huge variations in spending between 1950 and 1952 indicate there were years within the selected time span for which the defense spending dramatically increased because of some exogenous shock, and then spending trended downwards.

One way we can judge the usefulness a little more scientifically is to run a regression on the differences. In other words, Plot Y(t+1) on top of Y(t), and insert the regression line. Check it out:

In general, the regression is a pretty good fit for the clustered points in the middle. However, we also have some nasty outliers. Something about this data isn’t a first-order process. Wikipedia to the Rescue. It is history that inconveniently interrupts our model, causing these outliers. President Harry Truman cut back military spending in the wake of World War II. However, any hope Truman may have had for shifting the national focus away from foreign military affairs was ruined by the onset of the Korean War late in 1950. Thus we see the large spike in spending evident in Figure 1. A predictive first-order model which spans this event *with just these data* will be limited in its effectiveness. The grossly underfitted line of Figure 1 is little better than useless for indicating the spending during any given year. A much more effective model would start by following the onset of the Korean War (a completely unpredictable event if prediction were solely based on these data), and trace the evolution of the expenditure as it decreased from its start-of-conflict high in 1952, as we see in Figure 3.

It ain’t perfect, but it’s definitely much better. So, dynamic modeling with first-order linear difference equations has an enormous array of applications. However, it is easy to be seduced by the numbers without carefully considering the data and the substantive implications of these findings. Any inattentiveness in this respect may easily lead to meaningless, incorrect, or downright silly conclusions.

Next Time: Saving the code, changing the method.

**References:**

Code adapted from http://www.courtneybrown.com/classes/ModelingSocialPhenomena/Assignments/Assignment2NationalDefenseOutlaysCourtneyBrownMathModeling.htm

Dataset taken from http://www.courtneybrown.com/classes/ModelingSocialPhenomena/Assignments/Assignment2CourtneyBrownMathModeling.htm

**Appendix:**

This is the R code to generate these Graphs:

df <- read.csv(file="http://nortalktoowise.com/code/datasets/Defense.csv", head=TRUE, sep=",") attach(df) lagvar <- function(x,y){return(c(rep(NA, y),x[-((length(x)-y+1):length(x))]))} lagvar1 <- lagvar(defensespending,1) model <- lm(defensespending ~ lagvar1) y2 <- 0 t <- 0 y1 <- .3 a <- model$coefficients[[2]] b <- model$coefficients[[1]] timeserieslength <- nrow(df) for (i in 1:timeserieslength) { y2[i] <- (a*y1[i])+b t[i] <- i if (i < timeserieslength) y1[i+1]=y2[i]} plot(t, defensespending, xlab="Years", ylab="Defense Spending as Fraction of Budget", main="Figure 1: U.S. Defense Expenditure, 1948-68", pch=19) lines(t, y2, lwd=2) windows() plot(lagvar1, defensespending, xlab="", ylab="", pch=19) title(xlab="Y", ylab="Y(t+1)", main="Figure 2: Plot of the first differences", cex=1.5, col="black", font=2) abline(model, lwd=2) windows() newdf <- df[which(year>1951, arr.in=TRUE),] lagvar2 <- lagvar(newdf$defensespending,1) model2 <- lm(newdf$defensespending ~ lagvar2) y2 <- 0 t <- 0 y1 <- .68 a <- model2$coefficients[[2]] b <- model2$coefficients[[1]] timeserieslength <- nrow(newdf) for (i in 1:timeserieslength) { y2[i] <- (a*y1[i])+b t[i] <- i if (i < timeserieslength) y1[i+1]=y2[i]} plot(newdf$year, newdf$defensespending, xlab="Years", ylab="Defense Spending as Fraction of Budget", main="Figure 3: U.S. Defense Expenditure, 1952-68", pch=19) lines(newdf$year, y2, lwd=2)

Please forgive my poor style (reusing variable names and whatnot). It works, and that’s enough for me at the moment.

**leave a comment**for the author, please follow the link and comment on their blog:

**Nor Talk Too Wise » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...