Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I occasionally see queries on various social media as to overfitting — what is it?, etc. I’ll post an example here. (I mentioned it at my talk the other night on our novel approach to missing values, but had a bug in the code. Here is the correct account.)

The dataset is prgeng, on wages of programmers and engineers in Silicon Valley as of the 2000 Census. It’s included in our polyreg package, which we developed as an alternative to neural networks. But it is quite useful in its own right, as  it makes it very convenient to fit multivariate polynomial models. (Not as easy as it sounds; e.g. one must avoid cross products of orthogonal dummy variables, powers of those variables, etc.)

First I split the data into training and test sets:

> set.seed(88888)
> getPE()
> pe1 <- pe[,c(1,2,4,6,7,12:16,3)]
> testidxs <- sample(1:nrow(pe1),1000)
> testset <- pe1[testidxs,]
> trainset <- pe1[-testidxs,]


As a base, I fit an ordinary degree-1 model and found the mean absolute prediction error:

> lmout <- lm(wageinc ~ .,data=trainset)
> predvals <- predict(lmout,testset[,-11])
> mean(abs(predvals - testset[,11]))
[1] 25456.98


Next, I tried a quadratic model:

> pfout <- polyFit(trainset,deg=2)
> mean(abs(predict(pfout,testset[,-11])
- testset[,11]))
[1] 24249.83


Note that, though originally there were 10 predictors, now with polynomial terms we have 46.

I kept going:

 deg MAPE # terms 3 23951.69 118 4 23974.76 226 5 24340.85 371 6 24554.34 551 7 36463.61 767 8 74296.09 1019

One must keep in mind the effect of sampling variation, and repeated trials would be useful here, but it seems that the data can stand at least a cubic fit and possibly as much as degree 5 or even 6. To be conservative, it would seem wise to stop at degree 3. That’s also consistent with the old Tukey rule of thumb that we should have p <- sqrt(n), n being about 20,000 here.

In any event, the effects of overfitting here are dramatic, starting at degree 7.

It should be noted that I made no attempt to clean the data, nor to truncate predicted values at 0, etc.

It should also be noted that, starting at degree 4, R emitted warnings, “prediction from a rank-deficient fit may be misleading.” It is well known that at high degrees, polynomial terms can have multicollinearity issues.

Indeed, this is a major point we make in our arXiv paper cited above. There we argue that neural networks are polynomial models in disguise, with the effective degree of the polynomial increasing a lot at each successive layer, and thus multicollinearity increasing from layer to layer. We’ve confirmed this empirically. We surmise that this is a major cause of convergence problems in NNs.

Finally, whenever I talk about polynomials and NNs, I hear the standard (and correct) concern that polynomial grow rapidly at the edges of the data. True, but I would point out that if you accept NNs = polynomials, then the same is true for NNs.

We’re still working on the polynomials/NNs project. More developments to be announced soon. But for those who are wondering about overfitting, the data here should make the point.