Some More Results on the Theory of Statistical Learning

March 8, 2015

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

Yesterday, I mentioned a popular graph discussed when studying the theoretical foundations of statistical learning. But there is usually another one, which is the following,

Let us get back to the underlying formulas. On the training sample, we have some empirical risk, defined as

$$R_n=\frac{1}{n}\sum_{i=1}^n L(y_i,\widehat{m}_n(\boldsymbol{x}_i))$$

for some loss function $L$. Why is it complicated?

From the law of large numbers,

$$\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^n U_i=\mathbb{E}[U]$$

when the $U_i$'s are i.i.d., with $U_i\sim U$. Here we look for

$$\lim_{n\rightarrow\infty}\underbrace{\frac{1}{n}\sum_{i=1}^n L(y_i,\widehat{m}_n(\boldsymbol{x}_i))}_{R_n}$$

It is difficult to say anything about that limit: the $(y_i,\boldsymbol{x}_i)$'s are independent, but the $L(y_i,\widehat{m}_n(\boldsymbol{x}_i))$'s are not, because of $\widehat{m}_n(\cdot)$ (which depends on the entire sample).

But if we look at the empirical risk on a validation sample, independent of the one used to fit $\widehat{m}_n(\cdot)$, the law of large numbers does apply, and

$$\lim_{n\rightarrow\infty}\underbrace{\frac{1}{n}\sum_{i=1}^n L(\tilde{y}_i,\widehat{m}_n(\tilde{\boldsymbol{x}}_i))}_{\tilde{R}_n}=\mathbb{E}[L(Y,\widehat{m}_n(\boldsymbol{X}))]$$

One can prove that, with probability $1-\alpha$,

$$\widehat{R}_n\leq R_n+\sqrt{\frac{VC\,[\log(2n/VC)+1]-\log[\alpha/4]}{n}}$$

which depends on $n$ (as discussed in the previous post), but also on that $VC$ parameter, the so-called Vapnik-Chervonenkis dimension.
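Just to see what that penalty looks like numerically, here is a small helper (a sketch, not from the original post; the name vc_bound and the choice alpha=.05 are mine) evaluating the square-root term of the bound above:

vc_bound = function(n, VC, alpha = .05){
  # penalty term: sqrt( (VC [log(2n/VC)+1] - log(alpha/4)) / n )
  sqrt((VC * (log(2 * n / VC) + 1) - log(alpha / 4)) / n)
}
vc_bound(n = 200, VC = 3)    # around 0.33, modest penalty for a simple model
vc_bound(n = 200, VC = 50)   # around 0.89, much larger penalty for a complex one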

I won’t spend hours on that dimension, but the idea is that it is related to model complexity. For instance, in dimension one (one covariate), if $m(\cdot)$ is a polynomial of degree $d$, then $VC=d+1$. In dimension two (two covariates), if $m(\cdot)$ is a (bivariate) polynomial of degree $d$, then $VC=(d+1)(d+2)/2$, while it would be $2(d+1)$ if $m(\cdot)$ is additive, with two polynomials of degree $d$.
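As a rough sanity check of those counts (not in the original post), one can count the coefficients of the corresponding regressions; the simulated data frame below is only a placeholder:

d = 3
dat = data.frame(X1 = runif(50), X2 = runif(50), Y = rnorm(50))
# full bivariate polynomial of degree d: (d+1)(d+2)/2 coefficients, intercept included
length(coef(lm(Y ~ polym(X1, X2, degree = d), data = dat)))   # 10 = (3+1)(3+2)/2
# additive model, one polynomial of degree d per covariate: 2d+1 coefficients here,
# the same order of magnitude as the 2(d+1) mentioned above
length(coef(lm(Y ~ poly(X1, d) + poly(X2, d), data = dat)))   # 7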

Let us try to get a graph which looks like the one above, using the same idea as the one in our previous post.

MissClassU = rep(NA, 25)
MissClassV = rep(NA, 25)
n = 200
# training sample: two uniform covariates, P(Y=1 | x) = (x1 + x2)/2
U = data.frame(X1 = runif(n), X2 = runif(n))
p = (U$X1 + U$X2) / 2
U$Y = rbinom(n, size = 1, prob = p)
# validation sample, generated in the same way
V = data.frame(X1 = runif(n), X2 = runif(n))
p = (V$X1 + V$X2) / 2
V$Y = rbinom(n, size = 1, prob = p)
for(s in 1:25){
  # logistic regression, additive in two polynomials of degree s
  reg = glm(Y ~ poly(X1, s) + poly(X2, s), data = U, family = binomial)
  # classify as 1 when the predicted probability exceeds 1/2
  pd = function(x1, x2) predict(reg, newdata = data.frame(X1 = x1, X2 = x2),
                                type = "response") > .5
  # misclassification rates on the training (U) and validation (V) samples
  MissClassU[s] = mean(abs(pd(U$X1, U$X2) - U$Y))
  MissClassV[s] = mean(abs(pd(V$X1, V$X2) - V$Y))
}

If we plot the misclassification rate, as a function of the polynomial degree, in purple on the validation sample and in black on the training sample, we get the following.
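A minimal sketch of the plotting code (mine, not the original post's), using the two vectors computed above:

plot(1:25, MissClassU, type = "b", col = "black",
     ylim = range(c(MissClassU, MissClassV)),
     xlab = "polynomial degree", ylab = "misclassification rate")
lines(1:25, MissClassV, type = "b", col = "purple")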

Again, this is based on one sample only. We can run the experiment on hundreds of samples, and see how the average misclassification risk changes with complexity.

library(splines)   # for the bs() basis used below

missclassification = function(s){
  MCU = rep(NA, 500)
  MCV = rep(NA, 500)
  for(i in 1:500){
    # simulate a new training sample
    U = data.frame(X1 = runif(n), X2 = runif(n))
    p = (U$X1 + U$X2) / 2
    U$Y = rbinom(n, size = 1, prob = p)
    # additive logistic model, here with a B-spline basis (s degrees of freedom per covariate)
    reg = glm(Y ~ bs(X1, s) + bs(X2, s), data = U, family = binomial)
    pd = function(x1, x2) predict(reg, newdata = data.frame(X1 = x1, X2 = x2),
                                  type = "response") > .5
    MCU[i] = mean(abs(pd(U$X1, U$X2) - U$Y))
    # simulate a new validation sample
    V = data.frame(X1 = runif(n), X2 = runif(n))
    p = (V$X1 + V$X2) / 2
    V$Y = rbinom(n, size = 1, prob = p)
    MCV[i] = mean(abs(pd(V$X1, V$X2) - V$Y))
  }
  # average misclassification rates over the 500 replications
  MissClassU = mean(MCU)
  MissClassV = mean(MCV)
  return(c(MissClassU, MissClassV))
}
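The function is not called in the code above; a possible way to use it (the range of degrees of freedom below is my choice, for illustration) is to loop over the complexity parameter and plot the two averaged rates:

DEG = 1:10
M = sapply(DEG, missclassification)   # row 1: training average, row 2: validation average
plot(DEG, M[1, ], type = "b", col = "black", ylim = range(M),
     xlab = "degrees of freedom", ylab = "average misclassification rate")
lines(DEG, M[2, ], type = "b", col = "purple")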

Here, we cannot see an optimal complexity, because the risk on the validation samples keeps increasing. Which makes sense, since our data are generated from a linear model, so the optimal transformation should be obtained with a linear transformation (and not polynomials of higher degree).
