Quantile Regression (home made)

June 14, 2018
By

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

After my series of post on classification algorithms, it’s time to get back to R codes, this time for quantile regression. Yes, I still want to get a better understanding of optimization routines, in R. Before looking at the quantile regression, let us compute the median, or the quantile, from a sample.

Median

Consider a sample \{y_1,\cdots,y_n\}. To compute the median, solve\min_\mu \left\lbrace\sum_{i=1}^n|y_i-\mu|\right\rbracewhich can be solved using linear programming techniques. More precisely, this problem is equivalent to\min_{\mu,\mathbf{a},\mathbf{b}}\left\lbrace\sum_{i=1}^na_i+b_i\right\rbracewith a_i,b_i\geq 0 and y_i-\mu=a_i-b_i, \forall i=1,\cdots,n.
To illustrate, consider a sample from a lognormal distribution,

1
2
3
4
5
n = 101 
set.seed(1)
y = rlnorm(n)
median(y)
[1] 1.077415

For the optimization problem, use the matrix form, with 3n constraints, and 2n+1 parameters,

1
2
3
4
5
6
7
library(lpSolve)
A1 = cbind(diag(2*n),0) 
A2 = cbind(diag(n), -diag(n), 1)
r = lp("min", c(rep(1,2*n),0),
rbind(A1, A2),c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y))
tail(r$solution,1) 
[1] 1.077415

It looks like it’s working well…

Quantile

Of course, we can adapt our previous code for quantiles

1
2
3
4
tau = .3
quantile(x,tau)
      30% 
0.6741586

The linear program is now\min_{\mu,\mathbf{a},\mathbf{b}}\left\lbrace\sum_{i=1}^n\tau a_i+(1-\tau)b_i\right\rbracewith a_i,b_i\geq 0 and y_i-\mu=a_i-b_i, \forall i=1,\cdots,n. The R code is now

1
2
3
4
5
6
A1 = cbind(diag(2*n),0) 
A2 = cbind(diag(n), -diag(n), 1)
r = lp("min", c(rep(tau,n),rep(1-tau,n),0),
rbind(A1, A2),c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y))
tail(r$solution,1) 
[1] 0.6741586

So far so good…

Quantile Regression (simple)

Consider the following dataset, with rents of flat, in a major German city, as function of the surface, the year of construction, etc.

1
base=read.table("http://freakonometrics.free.fr/rent98_00.txt",header=TRUE)

The linear program for the quantile regression is now\min_{\mu,\mathbf{a},\mathbf{b}}\left\lbrace\sum_{i=1}^n\tau a_i+(1-\tau)b_i\right\rbracewith a_i,b_i\geq 0 and y_i-[\beta_0^\tau+\beta_1^\tau x_i]=a_i-b_i\forall i=1,\cdots,n. So use here

1
2
3
4
5
6
7
8
9
10
11
12
require(lpSolve) 
tau = .3
n=nrow(base)
X = cbind( 1, base$area)
y = base$rent_euro
A1 = cbind(diag(2*n), 0,0) 
A2 = cbind(diag(n), -diag(n), X) 
r = lp("min",
       c(rep(tau,n), rep(1-tau,n),0,0), rbind(A1, A2),
       c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y)) 
tail(r$solution,2)
[1] 148.946864   3.289674

Of course, we can use R function to fit that model

1
2
3
4
5
library(quantreg)
rq(rent_euro~area, tau=tau, data=base)
Coefficients:
(Intercept)        area 
 148.946864    3.289674

Here again, it seems to work quite well. We can use a different probability level, of course, and get a plot

1
2
3
4
5
6
7
8
9
10
11
12
13
plot(base$area,base$rent_euro,xlab=expression(paste("surface (",m^2,")")),
     ylab="rent (euros/month)",col=rgb(0,0,1,.4),cex=.5)
sf=0:250
yr=r$solution[2*n+1]+r$solution[2*n+2]*sf
lines(sf,yr,lwd=2,col="blue")
tau = .9
r = lp("min",
       c(rep(tau,n), rep(1-tau,n),0,0), rbind(A1, A2),
       c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y)) 
tail(r$solution,2)
[1] 121.815505   7.865536
yr=r$solution[2*n+1]+r$solution[2*n+2]*sf
lines(sf,yr,lwd=2,col="blue")

Quantile Regression (multiple)

Now that we understand how to run the optimization program with one covariate, why not try with two ? For instance, let us see if we can explain the rent of a flat as a (linear) function of the surface and the age of the building.

1
2
3
4
5
6
7
8
9
10
11
12
require(lpSolve) 
tau = .3
n=nrow(base)
X = cbind( 1, base$area, base$yearc )
y = base$rent_euro
A1 = cbind(diag(2*n), 0,0,0) 
A2 = cbind(diag(n), -diag(n), X) 
r = lp("min",
       c(rep(tau,n), rep(1-tau,n),0,0,0), rbind(A1, A2),
       c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y)) 
tail(r$solution,3)
[1] 0.000000 3.257562 0.077501

Unfortunately, this time, it is not working well…

1
2
3
4
5
library(quantreg)
rq(rent_euro~area+yearc, tau=tau, data=base)
Coefficients:
 (Intercept)         area        yearc 
-5542.503252     3.978135     2.887234

Results are quite different. And actually, another technique can confirm the later (IRLS – Iteratively Reweighted Least Squares)

1
2
3
4
5
6
7
8
eps = residuals(lm(rent_euro~area+yearc, data=base))
for(s in 1:500){
  reg = lm(rent_euro~area+yearc, data=base, weights=(tau*(eps>0)+(1-tau)*(eps<0))/abs(eps))
  eps = residuals(reg)
}
reg$coefficients
 (Intercept)         area        yearc 
-5484.443043     3.955134     2.857943

I could not figure out what went wrong with the linear program. Not only coefficients are very different, but also predictions…

1
2
3
yr = r$solution[2*n+1]+r$solution[2*n+2]*base$area+r$solution[2*n+3]*base$yearc
plot(predict(reg),yr)
abline(a=0,b=1,lty=2,col="red")


It’s now time to investigate….

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)