AIC & BIC vs. Crossvalidation


Model selection is the process of finding, within a set of candidate models, the one that best balances model fit against complexity (Burnham & Anderson 2002). I have always used AIC for that, but the same job can be done by crossvalidation: Stone (1977) showed that AIC and leave-one-out crossvalidation are asymptotically equivalent. I wanted to see this for myself through a simple exercise.

My goal was to (1) generate artificial data from a known model, (2) fit polynomial models of increasing complexity to the data, and (3) see whether both AIC and crossvalidation would correctly identify the underlying model. Out of curiosity I also included BIC (Bayesian Information Criterion).
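For reference, the two criteria are defined as follows (for a model with k estimated parameters fitted to n observations, with maximized likelihood \hat{L}):

 \mathrm{AIC} = 2k - 2 \ln \hat{L}
 \mathrm{BIC} = k \ln n - 2 \ln \hat{L}

Lower values indicate a more parsimonious model in both cases.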

Here is the model that I used to generate the data:

 y = 5 - 2x + x^2 + 2x^3 + \varepsilon
 \varepsilon \sim \mathrm{Normal}(\mu = 0, \sigma^2 = 1)

I then fitted seven polynomials to the data, starting with a straight line (1st degree) and going up to the 7th degree:

Figure 1 | The dots are artificially generated data (by the model specified above). The lines are seven fitted polynomials of increasing degree, from 1 (red straight line) to 7.
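Each of these fits is a single lm() call with poly(); a minimal sketch of the whole sequence, equivalent to the fitting loop in the full code at the end of the post:

# fit polynomials of degree 1 through 7 to the same data
fits <- lapply(1:7, function(d) lm(y ~ poly(x, d)))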

My next step was to find which of the seven models is the most parsimonious. I calculated AIC and BIC (R functions AIC() and BIC()) and the leave-one-out crossvalidation error for each of the models. This is the function that I used to do the crossvalidation:

crossvalidate <- function(x, y, degree)
{
  preds <- numeric(length(x))
  for(i in 1:length(x))
  {
    # leave the i-th observation out
    x.in <- x[-i]
    y.in <- y[-i]
    x.out <- x[i]
    # fit to the remaining points and predict the held-out one
    m <- lm(y.in ~ poly(x.in, degree=degree))
    preds[i] <- predict(m, newdata=data.frame(x.in=x.out))
  }
  # the crossvalidated sum of squared errors:
  return(sum((y - preds)^2))
}
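Calling the function once per candidate degree gives the leave-one-out error of each model; for example (cv3 is just an illustrative name):

# leave-one-out squared error of the cubic model
cv3 <- crossvalidate(x, y, degree=3)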

Here are the results:

Figure 2 | Comparison of the effectiveness of AIC, BIC and crossvalidation in selecting the most parsimonious model (black arrow) from the set of 7 polynomials that were fitted to the data (Fig. 1).

All three methods correctly identified the 3rd-degree polynomial as the best model. So it works. Interestingly, all three methods penalize lack of fit much more heavily than redundant complexity. I knew this about AIC, which is notorious for penalizing overly complex models too weakly. BIC should penalize complexity more than AIC does (Hastie et al. 2009); indeed, with n = 100 observations each additional parameter costs ln(100) ≈ 4.6 points of BIC but only 2 points of AIC, and Fig. 2 shows the stronger penalty clearly. Still, the difference is not that pronounced. I was surprised to see that crossvalidation is also quite lenient about excess complexity; perhaps this really is because crossvalidation and AIC are asymptotically equivalent (although the curves in Fig. 2 do not look identical).

References
1. Burnham K. P. & Anderson D. R. (2002) Model selection and multimodel inference: A practical information-theoretic approach. Springer.
2. Hastie T., Tibshirani R. & Friedman J. (2009) The elements of statistical learning: Data mining, inference, and prediction. Springer.
3. Shao J. (1993) Linear model selection by cross-validation. Journal of the American Statistical Association, 88, 486-494.
4. Stone M. (1977) An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B, 39, 44-47.

R code for the whole exercise

# the figures require the ggplot2 library and
# all packages it depends on
library(ggplot2)
library(grid)  # for unit() and arrow() used in the annotation arrows

# set a seed so the simulation is reproducible (any value works)
set.seed(42)
# generate the x predictor
x <- runif(100, -2, 2)
# generate the y response
y <- 2*x^3 + x^2 - 2*x + 5 + rnorm(100)
xy <- data.frame(x=x, y=y)
# specify the maximum polynomial degree that will be explored
max.poly <- 7

# creating the data.frame which will store the model predictions
# that will be used for the smooth curves in Fig. 1
x.new <- seq(min(x), max(x), by=0.1)
degree <- rep(1:max.poly, each=length(x.new))
predicted <- numeric(length(x.new)*max.poly)
new.dat <- data.frame(x=rep(x.new, times=max.poly),
                      degree,
                      predicted)

# fitting lm() polynomials of increasing degree
# (up to max.poly) and storing their predictions
# in the new.dat data.frame
for(i in 1:max.poly)
{
  new.dat[new.dat$degree==i, "predicted"] <- predict(lm(y~poly(x, i)),
                                                     newdata=data.frame(x=x.new))
}

# plotting the data and the fitted models
p <- ggplot() +
     geom_point(aes(x, y), data=xy, colour="darkgrey") +
     geom_line(aes(x, predicted,
                   colour=as.character(degree)),
               data=new.dat) +
     scale_colour_discrete(name = "Degree")
p

# creating empty data.frame that will store
# AIC and BIC values of all of the models
AIC.BIC <- data.frame(criterion=c(rep("AIC",max.poly),
                                  rep("BIC",max.poly)),
                      value=numeric(max.poly*2),
                      degree=rep(1:max.poly, times=2))

# calculating AIC and BIC values of each model
for(i in 1:max.poly)
{
  AIC.BIC[i,2] <- AIC(lm(y~poly(x,i)))
  AIC.BIC[i+max.poly,2] <- BIC(lm(y~poly(x,i)))
}
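The same values can be computed without the explicit loop; an equivalent one-liner using sapply() (just a compact alternative, not the code used for the figures):

AIC.BIC$value <- c(sapply(1:max.poly, function(i) AIC(lm(y~poly(x,i)))),
                   sapply(1:max.poly, function(i) BIC(lm(y~poly(x,i)))))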

# function that will perform the "leave one out"
# crossvalidation for a y~poly(x, degree) polynomial
crossvalidate <- function(x, y, degree)
{
  preds <- numeric(length(x))
  for(i in 1:length(x))
  {
    # leave the i-th observation out
    x.in <- x[-i]
    y.in <- y[-i]
    x.out <- x[i]
    # fit to the remaining points and predict the held-out one
    m <- lm(y.in ~ poly(x.in, degree=degree))
    preds[i] <- predict(m, newdata=data.frame(x.in=x.out))
  }
  # the crossvalidated sum of squared errors:
  return(sum((y - preds)^2))
}

# crossvalidating all of the polynomial models
# and storing their squared errors in
# the "a" object
a <- data.frame(cross=numeric(max.poly))
for(i in 1:max.poly)
{
  a[i,1] <- crossvalidate(x, y, degree=i)
}
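To read the winning degree off programmatically rather than from the plots, which.min() does the job (assuming the objects built above):

# degree with the lowest crossvalidated squared error
which.min(a$cross)
# degrees with the lowest AIC and BIC
which.min(AIC.BIC$value[AIC.BIC$criterion=="AIC"])
which.min(AIC.BIC$value[AIC.BIC$criterion=="BIC"])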

# plotting AIC and BIC against model complexity
# (which is the polynomial degree)
AIC.plot <- qplot(degree, value, data=AIC.BIC,
                  geom="line", linetype=criterion) +
                  xlab("Polynomial degree") +
                  ylab("Criterion value") +
                  labs(title="Information theory & Bayes")+
                  geom_segment(aes(x=3, y=400,
                                   xend=3, yend=325),
                  arrow = arrow(length = unit(0.3, "cm"),
                  angle=20, type="closed")) +
                  theme(legend.position=c(0.8,0.5))
AIC.plot

# plotting the crossvalidated squared errors against
# model complexity
cross.plot <- qplot(1:max.poly,cross, data=a, geom=c("line"))+
                    xlab("Polynomial degree") +
                    ylab("Squared error") +
                    geom_segment(aes(x=3, y=400,
                                     xend=3, yend=200),
                    arrow = arrow(length = unit(0.3, "cm"),
                    angle=20, type="closed")) +
                    labs(title="Crossvalidation")
cross.plot
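To place the two panels side by side as in Fig. 2, one option (assuming the gridExtra package is installed) is:

library(gridExtra)
grid.arrange(AIC.plot, cross.plot, ncol=2)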
