# qeML Example: Issues of Overfitting, Dimension Reduction Etc.

*This article was first published on **Mad (Data) Scientist**, and kindly contributed to R-bloggers.*


What about variable selection? Which predictor variables/features should we use? No matter what anyone tells you, *this is an unsolved problem.* But there are lots of useful methods. See the **qeML** vignettes on feature selection and overfitting for detailed background on the issues involved.

We note at the outset what our concluding statement will be: *Even a very simple, very clean-looking dataset like this one may be much more nuanced than it looks.* Real life is not like those simplistic textbooks, eh?

Here I’ll discuss **qeML::qeLeaveOut1Var**. (I usually omit parentheses in referring to function names; see https://tinyurl.com/4hwr2vf.) The idea is simple: For each variable, find prediction accuracy with and without that variable.
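The idea can be sketched in a few lines of base R. This is an illustrative sketch on synthetic data, not **qeLeaveOut1Var** itself; the real function also does holdout splitting and averaging over repetitions, and works with any **qeML** prediction function.

```r
# Leave-one-out-a-variable, sketched with base R's lm() on synthetic data.
# (Illustration only; qeLeaveOut1Var adds holdout splitting and averaging.)
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2*d$x1 + 0.5*d$x2 + rnorm(n)   # x3 is pure noise

maeFor <- function(dat) {             # in-sample mean absolute error of lm fit
  fit <- lm(y ~ ., data = dat)
  mean(abs(dat$y - fitted(fit)))
}

full <- maeFor(d)
dropped <- sapply(c('x1','x2','x3'), function(v)
  maeFor(d[ , setdiff(names(d), v)]))
round(c(full = full, dropped), 3)
# Dropping x1 should hurt accuracy the most; dropping x3 should barely matter.
```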

Let’s try it on the famous NYC taxi trip data, included (with modification) in **qeML**. First, note that **qeML** prediction calls automatically split the data into training and test sets, and compute test accuracy (mean absolute prediction error or overall misclassification error) on the latter.
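That training/test mechanism can itself be sketched in base R. The variable names and the 10% holdout fraction below are illustrative assumptions, not **qeML**'s actual internals.

```r
# Sketch of a random holdout split and test-set mean absolute prediction
# error, the general scheme qeML prediction calls perform automatically.
set.seed(2)
n <- 1000
d <- data.frame(x = runif(n))
d$y <- 3*d$x + rnorm(n, sd = 0.1)

testIdx <- sample(n, round(0.1 * n))   # 10% holdout, chosen for illustration
trn <- d[-testIdx, ]
tst <- d[testIdx, ]
fit <- lm(y ~ x, data = trn)
mape <- mean(abs(tst$y - predict(fit, tst)))
mape   # should be roughly the noise level
```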

The call **qeLeaveOut1Var(nyctaxi,'tripTime','qeLin',10)** predicts trip time using **qeML**'s linear model. (The latter wraps **lm**, but adds some things and sets the standard **qeML** call form.) Since the test set is random (as is our data), we'll do 10 repetitions and average the results. Instead of **qeLin**, we could have used any other **qeML** prediction function, e.g. **qeKNN** for k-Nearest Neighbors.

```r
> qeLeaveOut1Var(nyctaxi,'tripTime','qeLin',10)
        full trip_distance  PULocationID  DOLocationID     DayOfWeek
    238.4611      353.2409      253.2761      246.3186      239.2277
There were 50 or more warnings (use warnings() to see the first 50)
```

We’ll discuss the warnings shortly, but not surprisingly, trip distance is the most important variable. The pickup and dropoff locations also seem to have predictive value, though day of the week may not.

But let's take a closer look. There were 224 pickup locations. (Run **levels(nyctaxi$PULocationID)** to see this.) That's 223 dummy ("one-hot") variables; are some more predictive than others? To explore that in **qeLeaveOut1Var**, we could make the dummies explicit, so each dummy is removed one at a time:

```r
nyct <- factorsToDummies(nyctaxi,omitLast=TRUE)
```

This function is actually from the **regtools** package, included in **qeML**. Then we could try, say,
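The dummy-variable conversion itself can be illustrated with base R's **model.matrix** (base R drops the *first* factor level rather than the last, but the idea is the same as **omitLast=TRUE**: one level is omitted to avoid redundancy):

```r
# One-hot encoding sketch: each factor level becomes a 0/1 column,
# with one level dropped to avoid redundancy.
f <- factor(c('Mon','Tue','Wed','Mon'))
m <- model.matrix(~ f)[ , -1]   # drop the intercept column; 2 of 3 levels remain
m
```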

```r
nyct <- as.data.frame(nyct)
qeLeaveOut1Var(nyct,'tripTime','qeLin',10)
```

But with so many dummies, this would take a long time to run. We could directly look at mean trip times for each pickup location to get at least some idea of their individual predictive power,

```r
tapply(nyctaxi$tripTime,nyctaxi$PULocationID,mean)
tapply(nyctaxi$tripTime,nyctaxi$PULocationID,length)
```

Many locations have very little data, so we’d have to deal with that. Note too the possibility of overfitting.
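On synthetic data, the pattern looks like this; the group names, sample sizes, and the count cutoff of 20 are all arbitrary choices for illustration:

```r
# Group means are only trustworthy where cell counts are reasonably large.
set.seed(3)
loc <- factor(sample(c('A','B','C'), 200, replace = TRUE,
                     prob = c(.6, .35, .05)))       # 'C' is rare
time <- c(10, 15, 20)[as.integer(loc)] + rnorm(200)
means  <- tapply(time, loc, mean)
counts <- tapply(time, loc, length)
counts                   # 'C' will have far fewer observations
means[counts >= 20]      # keep only well-populated cells (cutoff is arbitrary)
```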

```r
> dim(nyct)
[1] 10000   479
```

An old rule of thumb is to use fewer than sqrt(n) variables, here sqrt(10000) = 100. That's just a guide, but it is much less than 479. (Note: Even our analysis using the original factors still converts to dummies internally; **nyctaxi** has just 4 predictor columns, but **lm** will expand them as in **nyct**.)

We may wish to delete pickup location entirely. Or we could use PCA for dimension reduction:

```r
z <- qePCA(nyctaxi,'tripTime','qeLin',pcaProp=0.75)
```

This **qeML** call says, "Compute PCA on the predictors, retain enough principal components to account for 0.75 of the total variance, and then run **qeLin** on the resulting PCs."
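The PC-selection step behind a **pcaProp**-style argument can be sketched with base R's **prcomp**. This is a sketch of the general technique on synthetic data, not **qePCA**'s actual code:

```r
# Keep the smallest number of PCs whose cumulative variance proportion
# reaches the threshold, then regress on those PCs.
set.seed(4)
X <- matrix(rnorm(200*10), 200, 10)
X[ , 2] <- X[ , 1] + rnorm(200, sd = 0.1)   # induce some correlation
y <- drop(X %*% rnorm(10)) + rnorm(200)

p <- prcomp(X, scale. = TRUE)
varProp <- cumsum(p$sdev^2) / sum(p$sdev^2)
k <- which(varProp >= 0.75)[1]              # PCs needed for 75% of variance
fit <- lm(y ~ p$x[ , 1:k])
k
```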

But…remember those warning messages? Running **warnings()**, we see messages like "6 rows removed from test set, due to new factor levels." The problem is that, in dividing the data into training and test sets, some pickup or dropoff locations appeared only in the test set, making those cases impossible to predict. So, many of the columns in the training set are all 0s, thus 0 variance, thus problems with PCA. We might then run **qeML::constCols** to find out which columns have 0 variance, then delete those, and try **qePCA** again.
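The zero-variance check itself is simple; here is a base-R sketch of that kind of detection (the tiny matrix is made up for illustration):

```r
# A column that is constant (e.g. all 0s in the training set) has zero
# variance and breaks PCA; find and drop such columns.
m <- cbind(a = c(1,2,3), b = c(0,0,0), c = c(5,5,5))
constant <- which(apply(m, 2, function(col) var(col) == 0))
names(constant)                        # the offending columns
m2 <- m[ , -constant, drop = FALSE]    # drop them before running PCA
```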

And we haven’t even mentioned using, say, **qeLASSO** or **qeXGBoost** instead of **qeLin**, etc. But the point is clear: *Even a very simple, very clean-looking application like this one may be much more nuanced than it looks.*
