We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit.
The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k-grams and the model can be anything from Naive Bayes to conditional random fields. This sort of modeling situation exposes the modeler to a lot of training bias. You can get models that look good on training data even though they have no actual value on new data (very poor generalization performance). In this sort of situation you are very vulnerable to having fit mere noise.
Often there is a feeling that if a model is doing really well on training data then there must be some way to bound generalization error and at least get useful performance on new test and production data. This is, of course, false, as we will demonstrate by building deliberately useless features that allow various models to perform well on training data. What is actually happening is you are working through variations of worthless models that only appear to be good on training data due to overfitting. And more “tweaking, tuning, and fixing” only appears to improve things, because as you peek at your test data (some of which you really should have held out until the very end of the project for final acceptance) your test data becomes less exchangeable with future new data and more exchangeable with your training data (and thus less helpful in detecting overfit).
Any researcher who does not have proper per-feature significance checks or holdout testing procedures will be fooled into promoting faulty models.
Many predictive NLP (natural language processing) applications require the use of very many very rare (almost unique) text features. A simple example would be 4-grams, or sequences of 4 consecutive words from a document. At some point you are tracking phrases that occur in only 1 to 2 documents in your training corpus. A tempting intuition is that each of these rare features is in fact a low utility clue for document classification. The hope is that if we track enough of them then enough are available when scoring a given document to make a reliable classification.
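To make the rarity concrete, here is a minimal sketch in base R that counts how many documents each 4-gram appears in. The three-sentence corpus and the helper name `fourGrams` are made up for this illustration; a real pipeline would use a proper tokenizer.

```r
# Toy corpus: three short "documents" (invented for illustration only).
docs <- c(
  "the quick brown fox jumps over the lazy dog",
  "the quick brown fox sleeps under the old tree",
  "a slow green turtle walks past the lazy dog"
)

# Hypothetical helper: the distinct word 4-grams of one document.
fourGrams <- function(doc) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  if (length(words) < 4) {
    return(character(0))
  }
  starts <- seq_len(length(words) - 3)
  unique(vapply(starts,
                function(i) paste(words[i:(i + 3)], collapse = " "),
                character(1)))
}

# Document frequency: in how many documents does each 4-gram appear?
gramsByDoc <- lapply(docs, fourGrams)
docFreq <- table(unlist(gramsByDoc))

# How many 4-grams occur in exactly 1 document, exactly 2 documents, ...
print(table(as.vector(docFreq)))
```

Even in this tiny corpus, 16 of the 17 distinct 4-grams occur in a single document, so almost every feature is estimated from one example.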
These features may in fact be useful, but you must be careful to have procedures to determine which features are in fact useful and which are mere noise. The issue is that rare features are only seen in a few training examples, so it is hard to reliably estimate their value during training. We will demonstrate (in R) some absolutely useless variables masquerading as actual signal during training. Our example is artificial, but if you don’t have proper holdout testing procedures you can easily fall into a similar trap.
Our code to create a bad example is as follows:
runExample <- function(rows,features,rareFeature,trainer,predictor) {
   print(sys.call(0)) # print call and arguments
   set.seed(123525) # make result deterministic
   yValues <- factor(c('A','B'))
   xValues <- factor(c('a','b','z'))
   d <- data.frame(y=sample(yValues,replace=T,size=rows),
                   group=sample(1:100,replace=T,size=rows))
   if(rareFeature) {
      mkRandVar <- function() {
         v <- rep(xValues[[3]],rows)
         signalIndices <- sample(1:rows,replace=F,size=2)
         v[signalIndices] <- sample(xValues[1:2],replace=T,size=2)
         v
      }
   } else {
      mkRandVar <- function() {
         sample(xValues[1:2],replace=T,size=rows)
      }
   }
   varValues <- as.data.frame(replicate(features,mkRandVar()))
   varNames <- colnames(varValues)
   d <- cbind(d,varValues)
   dTrain <- subset(d,group<=50)
   dTest <- subset(d,group>50)
   formula <- as.formula(paste('y',paste(varNames,collapse=' + '),sep=' ~ '))
   model <- trainer(formula,data=dTrain)
   tabTrain <- table(truth=dTrain$y,
                     predict=predictor(model,newdata=dTrain,yValues=yValues))
   print('train set results')
   print(tabTrain)
   print(fisher.test(tabTrain))
   tabTest <- table(truth=dTest$y,
                    predict=predictor(model,newdata=dTest,yValues=yValues))
   print('holdout test set results')
   print(tabTest)
   print(fisher.test(tabTest))
}
This block of code builds a universe of examples of size rows. The ground truth we are trying to predict is whether y is “A” or “B”. Each row has a number of features (equal to features). These features are considered rare if we have rareFeature=T (if so, the feature spends almost all of its time parked at the constant “z”). The point is that each and every feature in this example is random and built without looking at the actual truth values or y’s (and is therefore useless). We split the universe of data into a 50/50 test/train split. We then build a model on the training data and show the performance of predicting the y category on both the test and train sets. We use the Fisher contingency table test to see if we have what looks like a significant model. In all cases we get a deceptively good (very low) p-value on training that does not translate to any real effect on test data. We show the effect for Naive Bayes (a common text classifier), decision trees, logistic regression, and random forests (note that for the non-Naive Bayes classifiers we use non-rare features to trick them into thinking there is a model).
Basically, if you don’t at least look at model diagnostics (such as coefficient p-values in logistic regression) or look at test significance, you can fool yourself into thinking you have a model that is good in training. You may even feel that with the right sort of smoothing it should at least be usable in test. It will not. The most you can hope for is a training procedure that notices there is no useful signal. You can’t model your way out of having no useful features.
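As a small illustration of what one rare feature looks like, the following sketch repeats the construction used in the rareFeature=T branch of runExample for a single column. The seed is arbitrary and only there to make the sketch repeatable.

```r
set.seed(8112)  # arbitrary seed, only so the sketch is repeatable
rows <- 200
xValues <- factor(c('a','b','z'))

# Same construction as mkRandVar() in the rareFeature=T branch.
v <- rep(xValues[[3]], rows)
signalIndices <- sample(1:rows, replace=FALSE, size=2)
v[signalIndices] <- sample(xValues[1:2], replace=TRUE, size=2)

# Outcome drawn independently of the feature, as in runExample.
y <- sample(factor(c('A','B')), replace=TRUE, size=rows)

print(table(v))               # 198 rows at 'z', 2 rows at 'a' or 'b'
print(table(feature=v, y=y))  # the two non-'z' rows pair with y purely by chance
```

Only two rows ever leave “z”, so whatever y values those two rows happen to carry is all the evidence a model has about the feature.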
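One possible form of a per-feature check is sketched below: test each feature against the outcome on its own and then correct for the number of features tested. This is an illustrative procedure and not the one used for the results in this article; the data sizes, the seed, and the choice of a Bonferroni correction are all assumptions made for the sketch.

```r
set.seed(35235)  # arbitrary seed, chosen only to make the sketch repeatable
rows <- 100
features <- 50

# Outcome and features are generated independently, so no feature carries signal.
y <- factor(sample(c('A','B'), replace=TRUE, size=rows))
vars <- as.data.frame(replicate(features,
                                sample(c('a','b'), replace=TRUE, size=rows)))

# Fisher exact test of each feature against the outcome, one feature at a time.
pValues <- vapply(vars,
                  function(v) fisher.test(table(v, y))$p.value,
                  numeric(1))

# Raw p-values: with many pure-noise features some may fall below 0.05 by chance.
print(sum(pValues < 0.05))

# Bonferroni-adjusted p-values account for having tested many features.
adjusted <- p.adjust(pValues, method='bonferroni')
print(sum(adjusted < 0.05))
```

The adjusted count can never exceed the raw count, which is the point: a feature has to clear a much higher bar when it is one of many candidates.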
The results we get are as follows:

Naive Bayes train (looks good when it is not):
> library(e1071)
> runExample(rows=200,features=400,rareFeature=T,
   trainer=function(formula,data) { naiveBayes(formula,data) },
   predictor=function(model,newdata,yValues) {
      predict(model,newdata,type='class')
   }
)
runExample(rows = 200, features = 400, rareFeature = T, trainer = function(formula,
    data) {
    naiveBayes(formula, data)
}, predictor = function(model, newdata, yValues) {
    predict(model, newdata, type = "class")
})
[1] "train set results"
     predict
truth  A  B
    A 45  2
    B  0 49

	Fisher's Exact Test for Count Data

data:  tabTrain
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 131.2821      Inf
sample estimates:
odds ratio
       Inf

Naive Bayes holdout test (is bad):
[1] "holdout test set results"
     predict
truth  A  B
    A 17 41
    B 14 32

	Fisher's Exact Test for Count Data

data:  tabTest
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.3752898 2.4192687
sample estimates:
odds ratio
 0.9482474
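The two Naive Bayes tables can be re-checked directly by typing the reported counts into fisher.test. This is just a worked check of the numbers above; the variable names are made up for the sketch.

```r
# Counts copied from the Naive Bayes results above (rows = truth, columns = predict).
# matrix() fills column by column, so each c(...) lists the first column, then the second.
nbTrain <- matrix(c(45, 0, 2, 49), nrow=2,
                  dimnames=list(truth=c('A','B'), predict=c('A','B')))
nbTest <- matrix(c(17, 14, 41, 32), nrow=2,
                 dimnames=list(truth=c('A','B'), predict=c('A','B')))

trainP <- fisher.test(nbTrain)$p.value
testP <- fisher.test(nbTest)$p.value

print(trainP)  # tiny: the training table looks overwhelmingly significant
print(testP)   # near 1: the holdout table is indistinguishable from guessing
```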

Decision tree train (looks good when it is not):
> library(rpart)
> runExample(rows=200,features=400,rareFeature=F,
   trainer=function(formula,data) { rpart(formula,data) },
   predictor=function(model,newdata,yValues) {
      predict(model,newdata,type='class')
   }
)
runExample(rows = 200, features = 400, rareFeature = F, trainer = function(formula,
    data) {
    rpart(formula, data)
}, predictor = function(model, newdata, yValues) {
    predict(model, newdata, type = "class")
})
[1] "train set results"
     predict
truth  A  B
    A 42  5
    B 16 33

	Fisher's Exact Test for Count Data

data:  tabTrain
p-value = 7.575e-09
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  5.27323 64.71322
sample estimates:
odds ratio
  16.69703

Decision tree holdout test (is bad):
[1] "holdout test set results"
     predict
truth  A  B
    A 33 25
    B 27 19

	Fisher's Exact Test for Count Data

data:  tabTest
p-value = 1
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.3932841 2.1838878
sample estimates:
odds ratio
 0.9295556

Logistic regression train (looks good when it is not):
> runExample(rows=200,features=400,rareFeature=F,
   trainer=function(formula,data) {
      glm(formula,data,family=binomial(link='logit'))
   },
   predictor=function(model,newdata,yValues) {
      yValues[ifelse(predict(model,newdata=newdata,type='response')>=0.5,2,1)]
   }
)
runExample(rows = 200, features = 400, rareFeature = F, trainer = function(formula,
    data) {
    glm(formula, data, family = binomial(link = "logit"))
}, predictor = function(model, newdata, yValues) {
    yValues[ifelse(predict(model, newdata = newdata, type = "response") >=
        0.5, 2, 1)]
})
[1] "train set results"
     predict
truth  A  B
    A 47  0
    B  0 49

	Fisher's Exact Test for Count Data

data:  tabTrain
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 301.5479      Inf
sample estimates:
odds ratio
       Inf

Logistic regression test (is bad):
[1] "holdout test set results"
     predict
truth  A  B
    A 35 23
    B 25 21

	Fisher's Exact Test for Count Data

data:  tabTest
p-value = 0.5556
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.5425696 3.0069854
sample estimates:
odds ratio
  1.275218

Warning messages:
1: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
  prediction from a rank-deficient fit may be misleading
2: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
  prediction from a rank-deficient fit may be misleading

Random Forests train (looks good, but is not):
> library(randomForest)
> runExample(rows=200,features=400,rareFeature=F,
   trainer=function(formula,data) { randomForest(formula,data) },
   predictor=function(model,newdata,yValues) {
      predict(model,newdata,type='response')
   }
)
runExample(rows = 200, features = 400, rareFeature = F, trainer = function(formula,
    data) {
    randomForest(formula, data)
}, predictor = function(model, newdata, yValues) {
    predict(model, newdata, type = "response")
})
[1] "train set results"
     predict
truth  A  B
    A 47  0
    B  0 49

	Fisher's Exact Test for Count Data

data:  tabTrain
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 301.5479      Inf
sample estimates:
odds ratio
       Inf

Random Forests holdout test (is bad):
[1] "holdout test set results"
     predict
truth  A  B
    A 21 37
    B 13 33

	Fisher's Exact Test for Count Data

data:  tabTest
p-value = 0.4095
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.5793544 3.6528127
sample estimates:
odds ratio
  1.435704
The point is: good training performance means nothing (unless your trainer is in fact reporting cross-validated results). To avoid overfit you must at least examine model diagnostics and per-variable model coefficient significances, and you should always report results on truly held-out data. It is not enough to look only at model-fit significance on training data. An additional risk arises when you are likely to encounter a mixture of rare useful features and rare noise features. As we have illustrated above, the model fitting procedures can’t always tell the difference between features and noise. So it is easy to expect that the noise features can drown out rare useful features in practice. This should remind all of us of the need for good variable curation, selection, and principled dimension reduction (domain knowledge sensitive and y-sensitive, not just broad principal components analysis). Lots of features (the so-called “wide data” style of analytics) are not always easy to work with (as opposed to “tall data”, which is always good as you have more examples to falsify bad relations).
We took the liberty of using the title “Bad Bayes” because this is where we have most often seen the use of many weak variables without enough data to really establish per-variable significance.
For more on feature selection and model testing please see Zumel, Mount, “Practical Data Science with R”.
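As a rough illustration of the difference between in-sample and cross-validated results, the sketch below fits a logistic regression to pure noise and scores it both ways. It is a simplification and not part of the experiment above: it uses numeric rather than categorical features, far fewer of them, and a hand-rolled 5-fold split; the helper name `accuracy` and the seed are also assumptions.

```r
set.seed(5623)  # arbitrary seed for repeatability
rows <- 100
features <- 20
nFolds <- 5

# Pure-noise data: numeric features generated independently of y.
d <- as.data.frame(matrix(rnorm(rows * features), nrow=rows))
d$y <- sample(c(0, 1), replace=TRUE, size=rows)

# Hypothetical helper: fraction of rows classified correctly at a 0.5 threshold.
accuracy <- function(model, data) {
  pred <- predict(model, newdata=data, type='response') >= 0.5
  mean(pred == (data$y == 1))
}

# In-sample accuracy: fit and score on the same rows.
fullModel <- glm(y ~ ., data=d, family=binomial(link='logit'))
trainAcc <- accuracy(fullModel, d)

# Cross-validated accuracy: each row is scored by a model that never saw it.
fold <- sample(rep(1:nFolds, length.out=rows))
cvAcc <- mean(vapply(1:nFolds, function(k) {
  m <- glm(y ~ ., data=d[fold != k, ], family=binomial(link='logit'))
  accuracy(m, d[fold == k, ])
}, numeric(1)))

print(c(train=trainAcc, crossValidated=cvAcc))
```

Since the features carry no signal, any in-sample accuracy above chance is overfit; the cross-validated figure is the one to compare against a coin flip.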