qeML Example: Nonparametric Quantile Regression

[This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers.]

In this post, I will first introduce the concept of quantile regression (QR), a powerful technique that is rarely taught in stat courses. I’ll give an example from the quantreg package, and then will show how qeML can be used to do model-free QR estimation. Along the way, I will also illustrate the use of closures in R.

Notation: We are predicting a scalar Y (including the case of dummy/one-hot variables) from a feature vector X.

In its simplest form, QR estimates the conditional median of Y given X, as opposed to the usual conditional mean, using a linear model. As we all know, the median is less affected by outliers than is the mean, so QR is giving us outlier robustness. As a bonus, we dispense with the homoskedasticity assumption, i.e. constant Var(Y|X).

But it’s more than that. We can model any conditional quantile, e.g. estimate the 80th percentile weight for each human height. Quantile analysis has a variety of applications.
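To make "modeling a quantile" concrete, here is a minimal base-R sketch on simulated weights (not the mlb data, and not code from quantreg or qeML). It relies on the standard fact that a quantile at level tau minimizes the so-called check (or pinball) loss, which is the criterion that linear QR minimizes over the coefficients. The name checkLoss and the simulation settings are illustrative assumptions.

```r
# Check ("pinball") loss of a candidate value cand at quantile level tau.
# checkLoss is an illustrative name, not a quantreg or qeML function.
checkLoss <- function(cand, y, tau) {
  r <- y - cand
  sum(r * (tau - (r < 0)))   # tau*r if r >= 0, (1-tau)*|r| if r < 0
}

set.seed(1)
y <- rnorm(1000, mean = 180, sd = 20)  # simulated weights, not the mlb data

# minimize the check loss over candidate values
fit <- optimize(checkLoss, interval = range(y), y = y, tau = 0.80)
fit$minimum        # minimizer of the check loss
quantile(y, 0.80)  # sample 80th percentile; the two should nearly agree
```

With tau = 0.5 the same loss is proportional to the sum of absolute deviations, whose minimizer is the median.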

One can conduct QR in R with the quantreg package, written by Prof. Roger Koenker, one of the major names in the QR field. Here is an example, using the qeML dataset mlb:

> library(qeML)  # provides the mlb dataset
> data(mlb)
> library(quantreg)
> z <- rq(Weight ~ Height,data=mlb,tau=0.80)
> summary(z)

Call: rq(formula = Weight ~ Height, tau = 0.8, data = mlb)

tau: [1] 0.8

Coefficients:
            Value      Std. Error t value    Pr(>|t|)  
(Intercept) -201.66667   17.58797  -11.46617    0.00000
Height         5.66667    0.23856   23.75376    0.00000

As you can see, the call form here is like that of the R linear model function lm, and we could have had multiple predictors, e.g. age in addition to height.

But what if we don’t believe a linear model is appropriate? Of course, as usual we may consider adding polynomial terms, and there is also a package quantreg.nonpar. But we can obtain model-free estimates easily using qeKNN in qeML.

Standard k-Nearest Neighbors estimation is simple. Say we wish to predict the weight of someone 70 inches tall and 28 years old. We find the k data points in our training data closest to the vector (70,28), then compute the mean weight among those k people. That mean is our predicted weight for the new person, who has known height and age but unknown weight.
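The steps just described can be sketched by hand in base R. This is only an illustration on simulated data whose column names mimic the mlb example; the coefficients used to simulate the weights and the choice k = 25 are arbitrary assumptions. It also uses raw Euclidean distance, whereas practical implementations often rescale the features first.

```r
# Toy training data (simulated; column names mimic the mlb example)
set.seed(1)
n <- 500
train <- data.frame(Height = rnorm(n, 73, 2.5),
                    Age = runif(n, 20, 40))
train$Weight <- -150 + 4.7 * train$Height + 0.9 * train$Age + rnorm(n, 0, 15)

newx <- c(70, 28)  # height 70 inches, age 28
k <- 25            # number of neighbors (arbitrary choice)

# Euclidean distance from each training point to newx
xmat <- as.matrix(train[, c("Height", "Age")])
dists <- sqrt(rowSums(sweep(xmat, 2, newx)^2))

# indices of the k closest points, then the mean weight among them
nearIdxs <- order(dists)[1:k]
predWt <- mean(train$Weight[nearIdxs])
predWt
```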

But qeKNN offers the user more flexibility, via an argument smoothingFtn. Instead of computing mean Y among the neighbors, we can specify the median, or even specify that a small linear model be fit to the neighboring data. The latter may be useful if the new person to be predicted is either very short or very tall, as things tend to be biased near the edges of a dataset. If the new person is 77 inches tall, most or all people in our neighboring data will be shorter than this, thus lighter, so our prediction based on the mean will be biased downward.

But we can also specify our own smoothingFtn, perfect for QR. We simply define a function that gives us the desired Y quantile among the neighbors.

The call form is

smoothingFtn(nearIdxs,x,y,predpt)

Here x and y are our X and Y training data, predpt is the new X value at which we wish to predict (redundant in most cases), and nearIdxs are the indices in x and y of the nearest neighbors to predpt. Note that at the time kNN calls smoothingFtn, the indices have already been computed.
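To see how a function with this call form gets used, consider the following toy driver. toyKNN is a hypothetical stand-in written for this illustration; it is not qeKNN's actual implementation, whose internals may differ. The data are simulated.

```r
# toyKNN is a hypothetical stand-in for illustration only; qeKNN's
# internals may differ. It finds the neighbors, then delegates the
# final step to smoothingFtn, using the four-argument call form.
toyKNN <- function(x, y, predpt, k, smoothingFtn) {
  dists <- sqrt(rowSums(sweep(x, 2, predpt)^2))
  nearIdxs <- order(dists)[1:k]   # computed before smoothingFtn is called
  smoothingFtn(nearIdxs, x, y, predpt)
}

# two smoothing functions with the required signature
meanFtn <- function(nearIdxs, x, y, predpt) mean(y[nearIdxs])
medFtn  <- function(nearIdxs, x, y, predpt) median(y[nearIdxs])

set.seed(2)
x <- cbind(Height = rnorm(300, 73, 2.5), Age = runif(300, 20, 40))
y <- -150 + 4.7 * x[, "Height"] + rnorm(300, 0, 15)  # simulated weights

toyKNN(x, y, c(70, 28), k = 25, smoothingFtn = meanFtn)
toyKNN(x, y, c(70, 28), k = 25, smoothingFtn = medFtn)
```

Neither meanFtn nor medFtn touches x or predpt, which is what "redundant in most cases" refers to; a smoothing function that fits a small linear model to the neighbors would need both.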

Our code is then

sftn <- function(nearIdxs,x,y,predpt)
{
   nearYs <- y[nearIdxs]
   quantile(nearYs,0.80)
}

u <- mlb[c('Height','Age','Weight')]
set.seed(9999) # qeML ftns do random holdout
z <- qeKNN(u,'Weight',smoothingFtn=sftn)
predict(z,c(70,28)) # prints 200

It would be nice, though, to run this for a general quantile level q, rather than the special case 0.80. But we can’t do that directly: the smoothingFtn argument to qeKNN must be a function object, and there is no provision for passing an extra argument on to smoothingFtn. We can accomplish what we want via R closures.

makeSmFtn <- function(q) function(nearIdxs,x,y,predpt) quantile(y[nearIdxs],q)

To understand this, one must first know more about the R reserved word function. Consider this simple example:

f <- function(x) x^2

Here we are saying, “R, please create a function for me. Its formal argument will be named x, and it will compute and return the square of that quantity. After you create that function–an object, just like other R entities–assign it to f.” In other words, function creates functions. As I like to tell my students,

The function of the function named function is to create functions!

Now, going back to makeSmFtn above, it creates a function object (one that wraps the call to quantile) and returns that object, just as function did for f above. The key point is that here the value of q will be “baked in” to that object.
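The "baking in" can be checked without qeML at all. In the sketch below, the definition of makeSmFtn is repeated so that the snippet runs on its own, and the returned functions are called directly on a toy vector; the vector and the pretend neighbor indices are made up for illustration.

```r
# Repeated here so the snippet is self-contained
makeSmFtn <- function(q) function(nearIdxs, x, y, predpt) quantile(y[nearIdxs], q)

sm20 <- makeSmFtn(0.20)   # each call returns a new function object
sm80 <- makeSmFtn(0.80)

y <- c(150, 160, 170, 180, 190, 200, 210)  # toy Y values
idxs <- 1:5                                 # pretend these are the neighbors

# x and predpt are unused by these functions, so NULL suffices here
sm20(idxs, NULL, y, NULL)  # 20th percentile of y[1:5], i.e. 158
sm80(idxs, NULL, y, NULL)  # 80th percentile of y[1:5], i.e. 182

# the captured value lives in each closure's environment
environment(sm80)$q        # 0.8
```

Both sm20 and sm80 have the same four-argument signature, yet each carries its own q.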

So our call to qeKNN for general q would be

z <- qeKNN(u,'Weight',smoothingFtn=makeSmFtn(q))
