qeML Example: Issues of Overfitting, Dimension Reduction Etc.


What about variable selection? Which predictor variables/features should we use? No matter what anyone tells you, this is an unsolved problem. But there are lots of useful methods. See the qeML vignettes on feature selection and overfitting for detailed background on the issues involved.

We note at the outset what our concluding statement will be: Even a very simple, very clean-looking dataset like this one may be much more nuanced than it looks. Real life is not like those simplistic textbooks, eh?

Here I’ll discuss qeML::qeLeaveOut1Var. (I usually omit parentheses in referring to function names; see https://tinyurl.com/4hwr2vf.) The idea is simple: For each variable, find prediction accuracy with and without that variable.
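To make the idea concrete, here is a minimal sketch of that loop. This is not the actual qeML implementation; it assumes only that each qeML prediction function returns its holdout accuracy in a testAcc component.

# hedged sketch of the leave-out-1-variable idea, NOT the qeML internals
# assumes each qe* function returns holdout accuracy in $testAcc
looSketch <- function(data,yName,qeFtnName,nReps) {
   preds <- setdiff(names(data),yName)
   sapply(c('full',preds), function(v) {
      # 'full' uses all predictors; otherwise drop variable v
      d <- if (v == 'full') data else data[,names(data) != v]
      mean(replicate(nReps,do.call(qeFtnName,list(d,yName))$testAcc))
   })
}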

Let’s try it on the famous NYC taxi trip data, included (with modification) in qeML. First, note that qeML prediction calls automatically split the data into training and test sets, and compute test accuracy on the latter (mean absolute prediction error for a numeric outcome, overall misclassification error for a categorical one).
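For example (a quick illustration; the holdout accuracy is returned in the testAcc component):

z <- qeLin(nyctaxi,'tripTime')   # fits on a random training set
z$testAcc   # mean absolute prediction error on the holdout set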

The call qeLeaveOut1Var(nyctaxi,'tripTime','qeLin',10) predicts trip time using qeML's linear model. (The latter wraps lm, but adds some conveniences and follows the standard qeML call form.) Since the test set is random (as is our data), we'll do 10 repetitions and average the results. Instead of qeLin, we could have used any other qeML prediction function, e.g. qeKNN for k-Nearest Neighbors.

> qeLeaveOut1Var(nyctaxi,'tripTime','qeLin',10)
         full trip_distance  PULocationID  DOLocationID     DayOfWeek
     238.4611      353.2409      253.2761      246.3186      239.2277
There were 50 or more warnings (use warnings() to see the first 50)

We’ll discuss the warnings shortly, but not surprisingly, trip distance is the most important variable. The pickup and dropoff locations also seem to have predictive value, though day of the week may not.

But let's take a closer look. There were 224 pickup locations. (Run levels(nyctaxi$PULocationID) to see this.) That's 223 dummy ("one-hot") variables; are some more predictive than others? To explore that with qeLeaveOut1Var, we could make the dummies explicit, so that each dummy is removed one at a time:

nyct <- factorsToDummies(nyctaxi,omitLast=TRUE)

This function is actually from the regtools package, included in qeML. Then we could try, say,

nyct <- as.data.frame(nyct)
qeLeaveOut1Var(nyct,'tripTime','qeLin',10)

But with so many dummies, this would take a long time to run. We could directly look at mean trip times for each pickup location to get at least some idea of their individual predictive power,

tapply(nyctaxi$tripTime,nyctaxi$PULocationID,mean)    # mean trip time, by pickup location
tapply(nyctaxi$tripTime,nyctaxi$PULocationID,length)  # number of trips, by pickup location

Many locations have very little data, so we’d have to deal with that. Note too the possibility of overfitting.
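One hedged way to handle the sparse cells would be to keep only locations with at least, say, 25 trips; the cutoff here is arbitrary, purely for illustration:

cellCounts <- table(nyctaxi$PULocationID)   # trips per pickup location
bigLocs <- names(cellCounts)[cellCounts >= 25]   # arbitrary illustrative cutoff
nycBig <- nyctaxi[nyctaxi$PULocationID %in% bigLocs,]
nycBig$PULocationID <- droplevels(nycBig$PULocationID)   # discard now-empty levels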

> dim(nyct)
[1] 10000  479

An old rule of thumb is to use fewer than sqrt(n) predictors, here sqrt(10000) = 100. That's just a guide, but the 478 predictor columns in nyct far exceed it. (Note: even our analysis using the original factors still converts to dummies internally; nyctaxi has only 4 predictor columns, but lm expands the factors just as in nyct.)
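In code:

sqrt(nrow(nyct))   # 100, the rough upper bound from the rule of thumb
ncol(nyct) - 1     # 478 predictor columns, far beyond that bound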

We may wish to delete pickup location entirely. Or possibly use PCA for dimension reduction,

z <- qePCA(nyctaxi,'tripTime','qeLin',pcaProp=0.75)

This qeML call says, “Compute PCA on the predictors, retain enough principal components to account for 0.75 of the total variance, and then run qeLin on the resulting PCs.”
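Prediction then works as with any qeML fit, e.g. for the first case in the data (shown just as a usage sketch):

newCase <- nyctaxi[1,names(nyctaxi) != 'tripTime']   # predictors only
predict(z,newCase)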

But…remember those warning messages? Running warnings(), we see messages like “6 rows removed from test set, due to new factor levels.” The problem is that, in dividing the data into training and test sets, some pickup or dropoff locations appeared only in the test set, making those cases impossible to predict from the training data. Moreover, the dummy columns for those locations are all 0s in the training set; zero variance means trouble for PCA. We might then run qeML::constCols to find the 0-variance columns, delete them, and try qePCA again.
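A hedged sketch of that repair, assuming constCols returns the indices of the constant columns:

ccs <- constCols(nyct)   # assumption: returns indices of the 0-variance columns
nyct1 <- nyct[,-ccs]   # drop them
w <- qePCA(nyct1,'tripTime','qeLin',pcaProp=0.75)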

And we haven’t even mentioned using, say, qeLASSO or qeXGBoost instead of qeLin, etc. But the point is clear: Even a very simple, very clean-looking application like this one may be much more nuanced than it looks.
