Early stopping and penalized likelihood

[This article was first published on Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers.]

Maximum likelihood gives the best fit to the training data but in general overfits, yielding overly noisy parameter estimates that don't perform so well when predicting new data. A popular solution to this overfitting problem takes advantage of the iterative nature of most maximum likelihood algorithms: stop early. In general, an iterative optimization algorithm moves from a starting point toward the maximum of some objective function. If the starting point has some good properties, then early stopping can work well, keeping some of the virtues of the starting point while still respecting the data.
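To make that concrete, here is a minimal R sketch (my illustration, not from any particular paper): gradient ascent on a Gaussian log-likelihood, started at zero. The iterates travel from the starting point toward the maximum likelihood estimate, and the error relative to the truth typically dips before climbing back up; the dimension, step size, and number of iterations are all arbitrary choices for illustration.

```r
## Illustrative sketch: early stopping of gradient ascent on a
## Gaussian log-likelihood. All settings here are made up.
set.seed(123)
theta_true <- seq(-1, 1, length.out = 10)  # true parameter vector
y <- rnorm(10, mean = theta_true, sd = 1)  # noisy data; the MLE is just y

theta <- rep(0, 10)  # starting point with "good properties": all zeros
eps <- 0.1           # step size
err <- numeric(50)
for (k in 1:50) {
  theta <- theta + eps * (y - theta)  # one step up the log-likelihood
  err[k] <- sqrt(mean((theta - theta_true)^2))  # error vs. the truth
}
which.min(err)  # the error typically dips, then climbs toward the MLE's
```

The err curve here is the sort of plot described below; the catch is that you can only draw it if you know theta_true.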

This trick can be performed the other way, too, starting with the data and then processing it to move it toward a model. That’s how the iterative proportional fitting algorithm of Deming and Stephan (1940) works to fit multivariate categorical data to known margins.
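Here is a hedged R sketch of iterative proportional fitting on a 2x2 table; the seed counts and target margins are invented for illustration. Each sweep rescales rows and then columns to match the known margins; run to convergence, the table matches the margins, while stopping early leaves it closer to the raw data.

```r
## Illustrative iterative proportional fitting (Deming and Stephan 1940).
## The seed table and margins below are made-up numbers.
seed <- matrix(c(30, 10, 20, 40), nrow = 2)  # observed cross-tabulation
row_target <- c(55, 45)                      # known row margins
col_target <- c(60, 40)                      # known column margins

tab <- seed
for (iter in 1:100) {
  tab <- sweep(tab, 1, row_target / rowSums(tab), `*`)  # rescale rows
  tab <- sweep(tab, 2, col_target / colSums(tab), `*`)  # rescale columns
}
rowSums(tab)  # approximately row_target
colSums(tab)  # approximately col_target
```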

In any case, the trick is to stop at the right point: not so soon that you're ignoring the data, but not so late that you end up with something too noisy. Here's an example of what you might want:

[Figure: a plot of estimation error relative to the true value over the iterations, with a sweet spot between stopping too early and too late.]

The trouble is, you don’t actually know the true value so you can’t directly use this sort of plot to make a stopping decision.

Everybody knows that early stopping of an iterative maximum likelihood algorithm is approximately equivalent to maximum penalized likelihood estimation (that is, finding the posterior mode) under some prior distribution centered at the starting point. By stopping early, you're compromising between the prior estimate and the data estimate.
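In the Gaussian case the correspondence can be written out exactly. Here is a small sketch (mine, not from the paper): k gradient steps from zero shrink the maximum likelihood estimate by a fixed factor, and the same shrinkage arises as the posterior mode under a normal prior centered at the starting point, once the prior scale is chosen to match. The step size and iteration count are arbitrary.

```r
## Gaussian toy example: early stopping vs. posterior mode.
## Model: y ~ N(theta, 1); prior: theta ~ N(0, tau^2). Numbers invented.
y   <- 2.5   # maximum likelihood estimate for a single normal mean
eps <- 0.1   # step size
k   <- 5     # number of iterations before stopping

## Gradient ascent from 0 has the closed form theta_k = (1 - (1-eps)^k) * y
shrink     <- 1 - (1 - eps)^k
theta_stop <- shrink * y

## The posterior mode is y * tau^2 / (tau^2 + 1); matching shrinkage
## factors gives the prior scale implied by stopping at iteration k
tau2       <- shrink / (1 - shrink)
theta_mode <- y * tau2 / (tau2 + 1)

c(theta_stop, theta_mode)  # identical: the same prior-data compromise
```

Each extra iteration corresponds to a weaker implied prior (a larger tau2), which is why stopping later means trusting the data more.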

Early stopping is simple to implement, but I prefer thinking in terms of the posterior mode because I can better understand an algorithm if I can interpret it as optimizing some objective function.

A key question, though, is where to stop. Different stopping rules correspond to different solutions. For example, one appealing rule for maximum likelihood is to stop when the chi-squared discrepancy between data and fitted model falls below some preset level, such as its unconditional expected value. This would stop some of the more extreme varieties of overfitting. Such a procedure, however, is not the same as penalized maximum likelihood with a fixed prior, as it represents a different way of setting the tuning parameter.
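As a sketch of how such a rule might look (my illustration; the model and step size are invented, not taken from the paper): with n unit-variance observations, the chi-squared discrepancy at a fit theta is sum((y - theta)^2), with unconditional expectation n, and the rule stops the first time the discrepancy drops below that.

```r
## Illustrative stopping rule: take maximum likelihood steps until the
## chi-squared discrepancy falls below its expected value n. Made-up setup.
set.seed(1)
n <- 50
theta_true <- rnorm(n, 0, 1)
y <- rnorm(n, mean = theta_true, sd = 1)  # data; the MLE is y itself

theta <- rep(0, n)  # start at the prior center
eps <- 0.05
while (sum((y - theta)^2) > n) {      # stop once discrepancy <= E[chi^2] = n
  theta <- theta + eps * (y - theta)  # one maximum-likelihood step
}
sqrt(mean((theta - theta_true)^2))  # early-stopped error ...
sqrt(mean((y - theta_true)^2))      # ... is smaller than the raw MLE's here
```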

I discussed this in my very first statistics paper, “Constrained maximum entropy methods in an image reconstruction problem” (follow the link above). The topic of early stopping came up in conversation not long ago and so I think this might be worth posting. The particular models and methods discussed in this article are not really of interest any more, but I think the general principles are still relevant.

The key ideas of the article appear in section 5 on page 433.

Non-statistical content

The background to the above paper is as follows. In 1986 or 1987 I was browsing the statistics section of the once-great Stanford bookstore and noticed (and bought) an interesting-looking book on maximum entropy and Bayesian methods. The book was indeed thought-provoking, especially its chapters by Jaynes, Gull, and Skilling. Some months later I saw an announcement of a maximum entropy conference to be held in Cambridge, England, in the summer of 1988. That sounded appealing, so I submitted an abstract, which was accepted. This was my first academic talk, and in preparing it I followed Hal Stern's advice: don't try to blow them away, just do something solid. The conference was ok. I went to every talk (there were no parallel sessions) and wrote about 100 postcards. (I had to keep myself busy somehow while all this was happening.)

The participants at the conference were very pleasant. I was the only statistician there, and they treated me with respect. They were pretty much in the margins of their academic fields and didn’t seem to care much about status.

My own presentation came right before lunch on the last day, or maybe it was the second-to-last day. I do remember that the talks before mine went too long, to the extent that by the time my 15-minute talk came up, it was already 20 minutes late! It went ok, though, and I’ve tried to follow Hal’s advice ever since. (Also Joe Schafer’s advice, to reduce the number of words on the screen.)

The other thing I remember from the conference was that many of the participants were politically conservative. One speaker, who was from Texas, told me that he registered as a Democrat so that in the primaries he could vote for the most left-wing candidate who would then be the easiest target for the Republicans. Somebody else told me that the organizers of the conference were extreme Thatcherites. And I have no idea of the politics of Ed Jaynes, but based on photographs, he certainly looks like a Republican. (I couldn’t tell much from this mini-bio, but I did learn that Jaynes spent some time at the Naval Research Lab, a place where I worked several decades later.)

I don’t see any significance to the politics of the maximum-entropy researchers; it just seemed worth noting because it’s unusual in the civilian scientific research fields I’ve encountered.
