Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In the post, Learning Data Science: Modelling Basics, we built a simple model to predict income from age. R printed a model summary containing something called R-squared, but we did not yet discuss what that value actually means.
At first sight, a high $R^2$ looks like the sign of a good model.
However, just as high classification accuracy can be misleading (as discussed in ZeroR: The Simplest Possible Classifier, or Why High Accuracy can be Misleading), a high $R^2$ can be misleading as well.
To understand why, it helps to examine the formula itself and then revisit the three models from the previous post: the mean model, the linear model, and the polynomial model.
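The Meaning of $R^2$
The coefficient of determination is defined as:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Here $y_i$ are the observed values, $\hat{y}_i$ are the model's predictions, and $\bar{y}$ is the mean of the observed values.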
At first glance, the formula appears intimidating, but its basic idea is relatively simple.
The denominator
$$\sum_{i=1}^{n} (y_i - \bar{y})^2$$
measures the total variation in the target variable. It quantifies how strongly the observed values differ from their mean.
The numerator
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
measures the remaining unexplained error after fitting the model.
Thus, $R^2$ is the proportion of the total variation that the model explains.
An $R^2$ of:
- 0 means the model explains none of the variation,
- 1 means the model explains all variation perfectly.
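The definition can be checked by hand. The following is a minimal sketch, assuming a small made-up data set (the `age` and `income` values are hypothetical and not taken from the original post). It computes $R^2$ from the two sums of squares and compares the result with the value that `summary()` reports for an `lm` fit.

```r
# Hypothetical example data (made up for illustration, not from the original post)
age    <- c(22, 30, 38, 45, 52, 60)
income <- c(1900, 2300, 2450, 2800, 2700, 3100)

# Fit a simple linear model of income on age
fit <- lm(income ~ age)

# Numerator: residual sum of squares (unexplained error)
ss_res <- sum((income - fitted(fit))^2)

# Denominator: total sum of squares (variation around the mean)
ss_tot <- sum((income - mean(income))^2)

# R-squared computed from its definition and as reported by R
r2_manual <- 1 - ss_res / ss_tot
r2_lm     <- summary(fit)$r.squared

print(c(manual = r2_manual, lm = r2_lm))
```

Both numbers should agree up to floating-point precision, since for a least-squares model with an intercept `summary()` reports this same quantity.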
This sounds straightforward enough. The difficulty is that perfectly explaining the observed data is not necessarily the same thing as building a useful predictive model.
The Mean Model
Let us begin with the simplest possible regression model.
Suppose we completely ignore age and simply predict the average income for every individual:
$$\widehat{\text{income}} = \overline{\text{income}}$$
This is effectively the regression equivalent of ZeroR. The model does not learn any relationship at all.
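In this case, every prediction equals the mean:
$$\hat{y}_i = \bar{y} \quad \text{for all } i$$
Therefore, the residual sum of squares becomes identical to the total sum of squares:
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
Substituting this into the formula gives:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - 1 = 0$$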
The model explains none of the variation in the data.
This corresponds to the underfitting case discussed previously: the model is too simple to capture the underlying structure.
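A short sketch can confirm this numerically. It assumes made-up income values (hypothetical, not from the original post) and fits the mean model as an intercept-only regression, `lm(income ~ 1)`.

```r
# Hypothetical income values (made up for illustration)
income <- c(1900, 2300, 2450, 2800, 2700, 3100)

# Intercept-only model: every prediction is the mean of income
mean_model <- lm(income ~ 1)

# Residual and total sums of squares coincide for this model
ss_res <- sum((income - fitted(mean_model))^2)
ss_tot <- sum((income - mean(income))^2)

r2_mean <- 1 - ss_res / ss_tot
print(r2_mean)  # zero, up to floating-point rounding
```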
The Polynomial Model
Now consider the opposite extreme.
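Instead of fitting a straight line, suppose we fit a polynomial of sufficiently high degree. In fact, if we have $n$ observations with distinct ages, a polynomial of degree $n - 1$ can pass through every data point exactly.
In that case:
$$\hat{y}_i = y_i$$
for all observations, implying:
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 0$$
and therefore:
$$R^2 = 1 - 0 = 1$$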
The model achieves a perfect fit.
At first sight, this appears ideal. In practice, however, such a model often performs poorly on unseen data because it has adapted itself not only to the underlying relationship, but also to random fluctuations and noise within the training data.
This is the classical overfitting problem.
A perfect $R^2$ is therefore no guarantee of a good model.
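The perfect fit is easy to reproduce. The sketch below assumes six made-up observations (hypothetical `age` and `income` values, not from the original post) and fits a polynomial of degree $n - 1 = 5$ using `poly()`. $R^2$ is computed by hand, because `summary()` warns that its output may be unreliable for an essentially perfect fit.

```r
# Hypothetical example data (made up for illustration)
age    <- c(22, 30, 38, 45, 52, 60)
income <- c(1900, 2300, 2450, 2800, 2700, 3100)
n <- length(age)

# Polynomial of degree n - 1: as many coefficients as observations
poly_model <- lm(income ~ poly(age, n - 1))

# The fitted curve passes through every point, so the residuals vanish
ss_res <- sum((income - fitted(poly_model))^2)
ss_tot <- sum((income - mean(income))^2)

r2_poly <- 1 - ss_res / ss_tot
print(r2_poly)  # 1, up to floating-point rounding

# Prediction for a hypothetical age outside the observed range
print(predict(poly_model, newdata = data.frame(age = 65)))
```

The last line is worth inspecting: a curve forced through every training point is free to swing wildly as soon as it leaves the observed range, which is where the cost of the perfect fit tends to show up.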
The Linear Model
The linear model from the previous post lies between these two extremes.
It is simple enough to avoid memorizing every random fluctuation, yet flexible enough to capture a meaningful trend in the data.
This balance between simplicity and flexibility is one of the central themes in statistical learning.
The idea was summarized in the previous post with the following plot:
and by the famous observation attributed to George Box:
“All models are wrong, but some are useful.”
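The objective in modelling is therefore not to maximize complexity or maximize $R^2$, but to build a model that is useful.
Why $R^2$ Alone Is Insufficient
The key limitation of $R^2$ is that it only measures how well the model fits the data it was trained on.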
It does not directly measure:
- predictive performance on unseen data,
- robustness,
- causal validity, or
- generalization ability.
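As model complexity increases, $R^2$ on the training data generally increases as well, even when the additional flexibility only captures noise.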
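How training $R^2$ behaves as complexity grows can be checked with a short sketch. It assumes the same kind of made-up data as before (hypothetical values, not from the original post) and fits polynomials of degree 1 to 5. Because each model contains the previous one as a special case, the least-squares fit on the training data cannot get worse from one degree to the next.

```r
# Hypothetical example data (made up for illustration)
age    <- c(22, 30, 38, 45, 52, 60)
income <- c(1900, 2300, 2450, 2800, 2700, 3100)

ss_tot  <- sum((income - mean(income))^2)
degrees <- 1:5

# Training R-squared for polynomial models of increasing degree
r2_train <- sapply(degrees, function(d) {
  fit <- lm(income ~ poly(age, d))
  1 - sum(residuals(fit)^2) / ss_tot
})

print(data.frame(degree = degrees, r2_train = round(r2_train, 4)))
```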
For this reason, practical data science relies on additional evaluation methods such as:
- train-test splits,
- cross-validation,
- regularization,
- adjusted $R^2$, and
- out-of-sample testing.
The goal is not to reproduce historical observations perfectly, but to construct models that remain useful when confronted with new data.
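Two of these checks, a train-test split and adjusted $R^2$, can be sketched in a few lines. Everything in the example is an assumption made for illustration: the data are simulated from a linear relationship plus noise, the seed and split sizes are arbitrary, and the helper `r2()` is a hypothetical function defined here. The out-of-sample $R^2$ uses the mean of the test data in the denominator, which is one common convention among several.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Simulated data: an assumed linear relationship plus random noise
n      <- 40
age    <- runif(n, min = 20, max = 65)
income <- 1000 + 30 * age + rnorm(n, sd = 300)
dat    <- data.frame(age, income)

# Random train-test split (30 training rows, 10 test rows)
train_idx <- sample(n, size = 30)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Hypothetical helper: R-squared of predictions y_hat against observations y
r2 <- function(y, y_hat) 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)

# Compare polynomial models of increasing degree
results <- t(sapply(c(1, 3, 8), function(d) {
  fit <- lm(income ~ poly(age, d), data = train)
  c(degree   = d,
    r2_train = r2(train$income, fitted(fit)),
    r2_test  = r2(test$income, predict(fit, newdata = test)),
    adj_r2   = summary(fit)$adj.r.squared)
}))

print(round(results, 3))
```

The training column can only go up with the degree, whereas the test column is free to fall and can even become negative when a model predicts worse than the test mean. Adjusted $R^2$ penalizes the extra coefficients, so it does not have to rise with complexity either.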
A high $R^2$ can therefore mean one of two things:
- the model has identified a genuine structure,
- or the model has merely adapted itself too closely to the training data.
Distinguishing between these possibilities is one of the central challenges of machine learning and statistical modelling.