# Evaluating model performance – A practical example of the effects of overfitting and data size on prediction

May 3, 2014
By

(This article was first published on me nugget, and kindly contributed to R-bloggers)

Following my last post on decision making trees and machine learning, where I presented some tips gathered from the "Pragmatic Programming Techniques" blog, I have again been impressed by its clear presentation of strategies regarding the evaluation of model performance. I have seen some of these topics presented elsewhere - especially graphics showing the link between model complexity and prediction error (i.e. "overfitting") - but this particular presentation made me want to go back to this topic and try to make a practical example in R that I could use when teaching.

Effect of overfitting on prediction
The above graph shows polynomial fitting of various degrees to an artificial data set - The "real" underlying model is a 3rd-degree polynomial (y ~ b3*x^3 + b2*x^2 + b1*x + a). One gets a good idea that the higher degree models are incorrect give the single-term removal significance tests provided by the summary function (e.g. 5th-degree polynomial model):

Call:
lm(formula = ye ~ poly(x, degree = 5), data = df)
Residuals:
Min 1Q Median 3Q Max
-4.4916 -2.0382 -0.4417 2.2340 8.1518
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.3696 0.4304 68.242 < 2e-16 ***
poly(x, degree = 5)1 74.4980 3.0432 24.480 < 2e-16 ***
poly(x, degree = 5)2 54.0712 3.0432 17.768 < 2e-16 ***
poly(x, degree = 5)3 23.5394 3.0432 7.735 9.72e-10 ***
poly(x, degree = 5)4 -3.0043 3.0432 -0.987 0.329
poly(x, degree = 5)5 1.1392 3.0432 0.374 0.710
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.043 on 44 degrees of freedom
Multiple R-squared: 0.9569, Adjusted R-squared: 0.952
F-statistic: 195.2 on 5 and 44 DF, p-value: < 2.2e-16

Nevertheless, a more robust analysis of prediction error is through a cross-validation - by splitting the data into training and validation sub-sets. The following example does this split at 50% training and 50% validation, with 500 permutations.

So, here we have the typical trend of increasing prediction error with model complexity (via cross-validation - CV) when the model is overfit (i.e. > 3rd-degree polynomial, vertical grey dashed line). As reference, the horizontal grey dashed line shows the original amount of error added, which is where the CV error reaches a minimum.

Effect of data size on prediction
Another interesting aspect presented in the post is the use of CV in estimating the relationship between prediction error and the amount of data used in the model fitting (credit given to Andrew Ng from Stanford). This is helpful concept when determining what the benefit in prediction would be following an invest in more data sampling:

Here we see that, given a fixed model complexity, training error and CV error converges. Again, the horizontal grey dashed line indicates the actual measurement error of the response variable. So, in this example, there is not much improvement in prediction error following a data size of ca. 100. Interestingly, the example also demonstrates that even with an overfit model containing a 7th-degree polynomial, the increased prediction error is overcome with a larger data set. For comparison, the same exercise done with the correct 3rd-degree model shows that even the smaller data set achieves a relatively low prediction error even when the data size is small (2.6 MAE in 3rd-degree poly. vs 3.7 MAE in 7th-degree poly.):

Code to reproduce example:

### Data and model fittingset.seed(1111)n <- 50x <- sort(runif(n, -2, 2))y <- 3*x^3 + 5*x^2 + 0.5*x + 20 # a 3 polynomial modelerr <- rnorm(n, sd=3)ye <- y + errdf <- data.frame(x, ye)nterm <- c(1,2,3,5,7,9) png("model_fit~terms.png", width=5, height=4, units="in", res=400, type="cairo")par(mar=c(4,4,1,1))plot(ye~x, df, ylab="y")PAL <- colorRampPalette(c("blue", "cyan", "yellow", "red"))COLS <- PAL(length(nterm))for(i in seq(nterm)){ fit <- lm(ye ~ poly(x, degree=nterm[i]), data=df) newdat <- data.frame(x=seq(min(df$x), max(df$x),,100)) lines(newdat$x, predict(fit, newdat), col=COLS[i])}legend("topleft", legend=paste0(nterm, c("", "", "*", "", "", "")), title="polynomial degrees", bty="n", col=COLS, lty=1)dev.off() ### Term significancefit <- lm(ye ~ poly(x, degree=5), data=df)summary(fit) ### Error as a function of model complexityset.seed(1111)n <- 50nterm <- seq(12)perms <- 500frac.train <- 0.5 #training fraction of datarun <- data.frame(n, nterm, train.err=NaN, cv.err=NaN)runx <- sort(runif(n, -2, 2))y <- 3*x^3 + 5*x^2 + 0.5*x + 20 # a 3 polynomial modelerr <- rnorm(n, sd=3)ye <- y + errdf <- data.frame(x, ye) for(i in seq(nrow(run))){ pred.train <- matrix(NaN, nrow=nrow(df), ncol=perms) pred.valid <- matrix(NaN, nrow=nrow(df), ncol=perms) for(j in seq(perms)){ train <- sample(nrow(df), nrow(df)*frac.train) valid <- seq(nrow(df))[-train] dftrain <- df[train,] dfvalid <- df[valid,] fit <- lm(ye ~ poly(x, degree=run$nterm[i]), data=dftrain)  pred.train[train,j] <- predict(fit)  pred.valid[valid,j] <- predict(fit, dfvalid) } run$train.err[i] <- mean(abs(df$ye - pred.train), na.rm=TRUE) # sqrt(mean((df$ye - pred.train)^2, na.rm=TRUE)) run$cv.err[i] <- mean(abs(df$ye - pred.valid), na.rm=TRUE) # sqrt(mean((df$ye - pred.valid)^2, na.rm=TRUE)) print(i)} png("error~complexity.png", width=5, height=4, units="in", res=400, type="cairo")par(mar=c(4,4,1,1))ylim <- range(run$train.err, run$cv.err)plot(run$nterm, run$train.err, log="y", col=1, t="o", ylim=ylim, xlab="Model complexity [polynomial degrees]", ylab="Mean absolute error [MAE]")lines(run$nterm, run$cv.err, col=2, t="o")abline(v=3, lty=2, col=8)abline(h=mean(abs(err)), lty=2, col=8)legend("top", legend=c("Training error", "Cross-validation error"), bty="n", col=1:2, lty=1, pch=1)dev.off()   ### Error as a function of data sizeset.seed(1111)n <- round(exp(seq(log(50), log(500),, 10)))nterm <- 7perms <- 500frac.train <- 0.5 #training fraction of datarun <- data.frame(n, nterm, train.err=NaN, cv.err=NaN)runx <- sort(runif(max(n), -2, 2))y <- 3*x^3 + 5*x^2 + 0.5*x + 20 # a 3 polynomial modelerr <- rnorm(max(n), sd=3)ye <- y + errDF <- data.frame(x, ye) for(i in seq(nrow(run))){ df <- DF[1:run$n[i],] pred.train <- matrix(NaN, nrow=nrow(df), ncol=perms) pred.valid <- matrix(NaN, nrow=nrow(df), ncol=perms) for(j in seq(perms)){ train <- sample(nrow(df), nrow(df)*frac.train) valid <- seq(nrow(df))[-train] dftrain <- df[train,] dfvalid <- df[valid,] fit <- lm(ye ~ poly(x, degree=run$nterm[i]), data=dftrain)  pred.train[train,j] <- predict(fit)  pred.valid[valid,j] <- predict(fit, dfvalid) } run$train.err[i] <- mean(abs(df$ye - pred.train), na.rm=TRUE) # sqrt(mean((df$ye - pred.train)^2, na.rm=TRUE)) run$cv.err[i] <- mean(abs(df$ye - pred.valid), na.rm=TRUE) # sqrt(mean((df$ye - pred.valid)^2, na.rm=TRUE)) print(i)} png(paste0("error~data_size_", paste0(nterm, "term"), ".png"), width=5, height=4, units="in", res=400, type="cairo")par(mar=c(4,4,1,1))ylim <- range(run$train.err, run$cv.err)plot(run$n, run$train.err, log="xy", col=1, t="o", ylim=ylim, xlab="Data size [n]", ylab="Mean absolute error [MAE]")lines(run$n, run$cv.err, col=2, t="o")abline(h=mean(abs(err)), lty=2, col=8)legend("bottomright", legend=paste0("No. of polynomial degrees = ", nterm), bty="n")legend("top", legend=c("Training error", "Cross-validation error"), bty="n", col=1:2, lty=1, pch=1)dev.off()
Created by Pretty R at inside-R.org