Overfitting

[This article was first published on John Myles White » Statistics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

What do you think when you see a model like the one below?


Overfitting

Does this strike you as a good model? Or as a bad model?

There’s no right or wrong answer to this question, but I’d like to argue that models that are able to match white noise are typically bad things, especially when you don’t have a clear cross-validation paradigm that will allow you to demonstrate that your model’s ability to match complex data isn’t a form of overfitting.

There are many objective reasons to suspect complicated models, but I’d like to offer up a subjective one. A model that fits complex data as perfectly as the model above is likely to not be an interpretable model1 because it is essentially a noisy copy of the data. If the model looks so much like the data, why construct a model at all? Why not just use the raw data?

Unless the functional form of a model and its dependence on inputs is simple, I’m very suspicious of any statistical method that produces outputs like those shown above. If you want a model to do more than produce black-box predictions, it should probably provide predictions that are relatively smooth. At the least it should reveal comprehensible and memorable patterns that are non-smooth. While there are fields in which neither of these goals is possible (and others where it’s not desirable), I think the default reaction to a model fit like the one above should be: “why does the model make such complex predictions? Isn’t that a mistake? How many degrees of freedom does it have that it can so closely fit such noisy data?”

  1. Although it might be a great predictive model if you can confirm that the fit above is the quality of the fit to held-out data!

To leave a comment for the author, please follow the link and comment on their blog: John Myles White » Statistics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)