Learning Data Science: Why a High R^2 Can Be Misleading

Learning Machines

4 hours ago

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A high can make a regression model look impressively accurate — but this number can be deceptive. If you want to understand why a high is not always a sign of a good model, read on!

In the post, Learning Data Science: Modelling Basics, we built a simple model to predict income from age. R printed a model summary containing something called R-squared, but we did not yet discuss what that value actually means.

At first sight, a high looks highly reassuring. In our example, the linear model achieved an close to 90%. That sounds impressive.

However, just as high classification accuracy can be misleading — as discussed in ZeroR: The Simplest Possible Classifier, or Why High Accuracy can be Misleading — a high can also create a false sense of confidence.

To understand why, it helps to examine the formula itself and then revisit the three models from the previous post: the mean model, the linear model, and the polynomial model.

The Meaning of

The coefficient of determination is defined as:

At first glance, the formula appears intimidating, but its basic idea is relatively simple.

The denominator

measures the total variation in the target variable. It quantifies how strongly the observed values differ from their mean.

The numerator

measures the remaining unexplained error after fitting the model.

Thus, measures the proportion of variation explained by the model.

An of:

0 means the model explains none of the variation,
1 means the model explains all variation perfectly.

This sounds straightforward enough. The difficulty is that perfectly explaining the observed data is not necessarily the same thing as building a useful predictive model.

The Mean Model

Let us begin with the simplest possible regression model.

Suppose we completely ignore age and simply predict the average income for every individual:

This is effectively the regression equivalent of ZeroR. The model does not learn any relationship at all.

In this case:

Therefore, the residual sum of squares becomes identical to the total sum of squares:

Substituting this into the formula gives:

The model explains none of the variation in the data.

This corresponds to the underfitting case discussed previously: the model is too simple to capture the underlying structure.

The Polynomial Model

Now consider the opposite extreme.

Instead of fitting a straight line, suppose we fit a polynomial of sufficiently high degree. In fact, if we have observations with distinct age values, a polynomial of degree up to can pass exactly through all observed data points:

In that case:

for all observations, implying:

and therefore:

The model achieves a perfect fit.

At first sight, this appears ideal. In practice, however, such a model often performs poorly on unseen data because it has adapted itself not only to the underlying relationship, but also to random fluctuations and noise within the training data.

This is the classical overfitting problem.

A perfect may therefore indicate not a particularly good model, but a model that has become too flexible.

The Linear Model

The linear model from the previous post lies between these two extremes.

It is simple enough to avoid memorizing every random fluctuation, yet flexible enough to capture a meaningful trend in the data.

This balance between simplicity and flexibility is one of the central themes in statistical learning.

The idea was summarized in the previous post with the following plot:

and by the famous observation attributed to George Box:

“All models are wrong, but some are useful.”

The objective in modelling is therefore not to maximize complexity or maximize , but to find a model that generalizes well beyond the observed sample.

Why Alone Is Insufficient

The key limitation of is that it evaluates fit on the observed data only.

It does not directly measure:

predictive performance on unseen data,
robustness,
causal validity, or
generalization ability.

As model complexity increases, almost always increases as well. A sufficiently flexible model can often achieve values very close to 1 even when its predictions on new data are poor.

For this reason, practical data science relies on additional evaluation methods such as:

train-test splits,
cross-validation,
regularization,
adjusted , and
out-of-sample testing.

The goal is not to reproduce historical observations perfectly, but to construct models that remain useful when confronted with new data.

A high can therefore mean two very different things:

the model has identified a genuine structure,
or the model has merely adapted itself too closely to the training data.

Distinguishing between these possibilities is one of the central challenges of machine learning and statistical modelling.

To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The Meaning of <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&amp;ssl=1" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21">

The Mean Model

The Polynomial Model

The Linear Model

Why <img loading="lazy" decoding="async" src="https://i1.wp.com/blog.ephorie.de/wp-content/ql-cache/quicklatex.com-c7d6931063ed333ca39b952ccfd482b8_l3.png?resize=21%2C15&amp;ssl=1" alt="R^2" title="Rendered by QuickLaTeX.com" height="15" width="21"> Alone Is Insufficient

Related

The Meaning of

Why Alone Is Insufficient