# Some heuristics about local regression and kernel smoothing

October 8, 2013

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

In a standard linear model, we assume that $\mathbb{E}(Y\vert X=x)=\beta_0+\beta_1 x$. Alternatives can be considered, when the linear assumption is too strong.

• Polynomial regression

A natural extension might be to assume some polynomial function,

$\mathbb{E}(Y\vert X=x)=\beta_0+\beta_1 x+\beta_2 x^2 +\cdots +\beta_k x^k$

Again, in the standard linear model approach (with a conditional normal distribution using the GLM terminology), parameters $\boldsymbol{\beta}=(\beta_0,\beta_1,\cdots,\beta_k)$ can be obtained using least squares, where a regression of $Y$ on $\boldsymbol{X}=(1,X,X^2,\cdots,X^k)$ is considered.
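To make the least squares step concrete, here is a minimal sketch that builds the design matrix $(1,X,X^2,\cdots,X^k)$ by hand, solves the normal equations, and compares the result with `lm()`. The data are a simulated noisy sine curve, assuming an x-range of $[0,10]$; the degree `k = 3` and the names `X`, `beta_ls` and `beta_lm` are illustrative choices, not part of the original post.

```r
# Simulated data (assumption: x ranges over [0, 10])
set.seed(1)
xr = seq(0, 10, by = .1)
yr = sin(xr/2) + rnorm(length(xr))/2

k = 3                                      # polynomial degree (illustrative)
X = outer(xr, 0:k, "^")                    # columns 1, x, x^2, ..., x^k
beta_ls = solve(t(X) %*% X, t(X) %*% yr)   # normal equations

# Same model through lm(), with raw (non-orthogonal) polynomial terms
beta_lm = coef(lm(yr ~ poly(xr, k, raw = TRUE)))

cbind(normal_equations = drop(beta_ls), lm = unname(beta_lm))
```

Note that `poly(x, k)` without `raw = TRUE` uses orthogonal polynomials: the fitted values are the same, but the coefficients are not the $\beta_j$ of the formula above.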

Even if this polynomial model is not the real one, it might still be a good approximation for $\mathbb{E}(Y\vert X=x)=h(x)$. Indeed, by the Stone-Weierstrass theorem, if $h(\cdot)$ is continuous on some closed, bounded interval, then it can be uniformly approximated on that interval by polynomial functions.

Just to illustrate, consider the following (simulated) dataset

```r
set.seed(1)
n = 10   # assumed upper bound of the x-range: the plots below use [0,10]
xr = seq(0,n,by=.1)
yr = sin(xr/2)+rnorm(length(xr))/2
db = data.frame(x=xr,y=yr)
plot(db)
```

with the standard regression line

```r
reg = lm(y ~ x,data=db)
abline(reg,col="red")
```

Consider some polynomial regression. If the degree of the polynomial function is large enough, any kind of pattern can be obtained,

`reg=lm(y~poly(x,5),data=db)`

But if the degree is too large, then too many ‘oscillations’ are obtained,

`reg=lm(y~poly(x,25),data=db)`

and the estimation might be seen as no longer robust: if we change one point, there might be important (local) changes

```r
plot(db)
# degree-25 fit on the original data (dashed line)
lines(xr,predict(reg),col="red",lty=2)
# move one observation down by 2 and refit (solid line)
yrm=yr;yrm[31]=yr[31]-2
regm=lm(yrm~poly(xr,25))
lines(xr,predict(regm),col="red")
```
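One rough way to quantify this lack of robustness is to measure how far the fitted curve moves when that single observation is changed. The sketch below is only an illustration: the helper `max_shift` is a hypothetical name, and the data are simulated assuming an x-range of $[0,10]$.

```r
# Simulated data (assumption: x ranges over [0, 10])
set.seed(1)
xr = seq(0, 10, by = .1)
yr = sin(xr/2) + rnorm(length(xr))/2
yrm = yr; yrm[31] = yr[31] - 2            # same perturbation as above

# Largest change in the fitted values when one observation is moved
max_shift = function(degree){
  fit  = fitted(lm(yr  ~ poly(xr, degree)))
  fitm = fitted(lm(yrm ~ poly(xr, degree)))
  max(abs(fit - fitm))
}

shifts = sapply(c(5, 25), max_shift)
names(shifts) = c("degree 5", "degree 25")
shifts
```

The higher the degree, the more the fit follows individual observations, so the degree-25 curve should move more than the degree-5 one.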
• Local regression

Actually, if our interest is to have locally a good approximation of  $h(\cdot)$, why not use a local regression?

This can be done easily using a weighted regression, where, in the least squares formulation, we consider

$\min\left\{ \sum_{i=1}^n \omega_i [Y_i-(\beta_0+\beta_1 X_i)]^2 \right\}$

(it is possible to consider weights in the GLM framework, but let’s keep that for another post). Two comments here:

• here I consider a linear model, but any polynomial model can be considered. Even a constant one. In that case, the optimization problem is

$\min\left\{ \sum_{i=1}^n \omega_i [Y_i-\beta_0]^2 \right\}$ which can be solved explicitly, since

$\widehat{\beta}_0=\frac{\sum \omega_i Y_i}{\sum \omega_i}$

• so far, nothing was mentioned about the weights. The idea is simple: if you want a good prediction at point $x_0$, then $\omega_i$ should decrease with the distance between $X_i$ and $x_0$: if $X_i$ is too far from $x_0$, then it should not have too much influence on the prediction.
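The explicit solution in the constant case can be checked numerically. In this small sketch, the weights `exp(-abs(xr - x0))` are an arbitrary illustrative choice (any positive weights would do), and the data are simulated assuming an x-range of $[0,10]$.

```r
# Simulated data (assumption: x ranges over [0, 10])
set.seed(1)
xr = seq(0, 10, by = .1)
yr = sin(xr/2) + rnorm(length(xr))/2

x0 = 2                                    # point where we want a prediction
w  = exp(-abs(xr - x0))                   # illustrative weights, decreasing with distance

beta0_closed = sum(w*yr)/sum(w)                       # weighted mean
beta0_lm     = unname(coef(lm(yr ~ 1, weights = w)))  # weighted least squares

c(closed_form = beta0_closed, lm = beta0_lm)
```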

For instance, if we want to have a prediction at some point $x_0$, consider $\omega_i\propto \boldsymbol{1}(\vert X_i-x_0 \vert<1)$. With this model, we remove observations too far away.

Actually, here, it is the same as

`reg=lm(yr~xr,subset=which(abs(xr-x0)<1))`
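As a quick check of this equivalence, the sketch below fits the same local linear model twice, once with `subset` and once with 0/1 weights. The point `x0 = 2` is an illustrative choice and the data are simulated assuming an x-range of $[0,10]$.

```r
# Simulated data (assumption: x ranges over [0, 10])
set.seed(1)
xr = seq(0, 10, by = .1)
yr = sin(xr/2) + rnorm(length(xr))/2

x0 = 2
reg_subset  = lm(yr ~ xr, subset  = which(abs(xr - x0) < 1))
reg_weights = lm(yr ~ xr, weights = as.numeric(abs(xr - x0) < 1))

# Both rows should contain the same intercept and slope
rbind(subset = coef(reg_subset), weights = coef(reg_weights))
```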

A more general idea is to consider some kernel function $K(\cdot)$ that gives the shape of the weight function, and some bandwidth (usually denoted $h$, but written $b$ here since $h(\cdot)$ already denotes the regression function) that gives the length of the neighborhood, so that

$\omega_i = K\left(\frac{x_0-X_i}{b}\right)$

With a local constant model, this is actually the so-called Nadaraya-Watson estimator of the function $h(\cdot)$.
In the previous case, the weights came from a uniform kernel $K(x)=\boldsymbol{1}(x\in[-1/2,+1/2])$, with bandwidth $2$.

But using this weight function, with a strong discontinuity, may not be the best idea… Why not a Gaussian kernel,

$K(x)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)$

This can be done using

```r
fitloc0 = function(x0){
  w=dnorm((xr-x0))                  # Gaussian weights (bandwidth 1)
  reg=lm(y~1,data=db,weights=w)     # local constant fit
  return(predict(reg,newdata=data.frame(x=x0)))}
```
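Since the local model is a constant, the same prediction can be written directly as a ratio of weighted sums. The sketch below is one possible way to do it: the function `nw` and its arguments `b` (bandwidth) and `K` (kernel) are hypothetical names, and the data are simulated assuming an x-range of $[0,10]$.

```r
# Simulated data (assumption: x ranges over [0, 10])
set.seed(1)
xr = seq(0, 10, by = .1)
yr = sin(xr/2) + rnorm(length(xr))/2

# Nadaraya-Watson estimate at x0, with kernel K and bandwidth b
nw = function(x0, x, y, b = 1, K = dnorm){
  w = K((x0 - x)/b)
  sum(w*y)/sum(w)
}

nw_at_2 = nw(2, xr, yr)                                        # Gaussian kernel, b = 1
lm_at_2 = unname(coef(lm(yr ~ 1, weights = dnorm(xr - 2))))    # weighted regression on a constant
c(nw = nw_at_2, lm = lm_at_2)

# Uniform kernel on [-1/2, 1/2] with bandwidth 2: a plain local average
K_unif = function(u) as.numeric(abs(u) <= 1/2)
nw(2, xr, yr, b = 2, K = K_unif)
```

Writing the estimator this way makes it easy to swap the kernel or the bandwidth without refitting a model with `lm()`.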

On our dataset, we can plot

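```r
ul=seq(0,10,by=.01)
vl0=Vectorize(fitloc0)(ul)          # local constant fit on a fine grid
linearlocalconst=function(x0){
  w=dnorm((xr-x0))                  # weights of the observations around x0
  plot(db,cex=abs(w)*4)             # point size proportional to the weight
  lines(ul,vl0,col="red")           # the whole local regression curve
  axis(3)
  axis(2)
  reg=lm(y~1,data=db,weights=w)     # local constant fit at x0
  u=seq(0,10,by=.02)
  v=predict(reg,newdata=data.frame(x=u))
  lines(u,v,col="red",lwd=2)        # horizontal line: the fitted constant
  abline(v=c(0,x0,10),lty=2)
}
linearlocalconst(2)
```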

Here, we want a local regression at point 2. The horizontal line is the local regression at that point (the size of each point is proportional to its weight). The curve, in red, is the evolution of the local regression as the point moves.

Let us use an animation to visualize the construction of the curve. One can use

`library(animate)`

but for some reason, I cannot install the package easily on Linux. And it is not a big deal. We can still use a loop to generate some graphs

```r
vx0=seq(1,9,by=.1)
vx0=c(vx0,rev(vx0))                 # sweep forward, then backward
graphloc=function(i){
  name=paste("local-reg-",100+i,".png",sep="")
  png(name,600,400)
  linearlocalconst(vx0[i])
  dev.off()}

for(i in 1:length(vx0)) graphloc(i)
```

and then, in a terminal, I simply use

`    convert -delay 25 /home/freak/local-reg-1*.png /home/freak/local-reg.gif`

Of course, it is possible to consider a linear model, locally,

```r
fitloc1 = function(x0){
  w=dnorm((xr-x0))
  reg=lm(y~poly(x,degree=1),data=db,weights=w)
  return(predict(reg,newdata=data.frame(x=x0)))}
```

or even a quadratic (local) regression,

```r
fitloc2 = function(x0){
  w=dnorm((xr-x0))
  reg=lm(y~poly(x,degree=2),data=db,weights=w)
  return(predict(reg,newdata=data.frame(x=x0)))}
```

Of course, we can change the bandwidth.
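The functions above implicitly use a bandwidth of 1, since the weights are `dnorm(xr-x0)`. A possible way to make the bandwidth explicit is sketched below, with a local linear fit. The helper `fitloc_b`, the values in `bandwidths` and the colors are illustrative assumptions, and the data are simulated assuming an x-range of $[0,10]$.

```r
# Simulated data (assumption: x ranges over [0, 10])
set.seed(1)
xr = seq(0, 10, by = .1)
yr = sin(xr/2) + rnorm(length(xr))/2
db = data.frame(x = xr, y = yr)

# Local linear fit at x0, Gaussian kernel with bandwidth b (hypothetical helper)
fitloc_b = function(x0, b){
  w = dnorm((xr - x0)/b)
  reg = lm(y ~ x, data = db, weights = w)
  unname(predict(reg, newdata = data.frame(x = x0)))
}

ul = seq(0, 10, by = .1)
bandwidths = c(.5, 1, 2)                  # illustrative values
curves = sapply(bandwidths, function(b) sapply(ul, fitloc_b, b = b))

cols = c("red", "blue", "darkgreen")
plot(db)
matlines(ul, curves, lty = 1, col = cols) # one curve per bandwidth
legend("topright", legend = paste("b =", bandwidths), lty = 1, col = cols)
```

A small bandwidth follows the observations closely, while a large one gets closer to the global regression line.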

To conclude the technical part of this post, observe that, in practice, we have to choose the shape of the weight function (the so-called kernel). But there are (simple) techniques to select the “optimal” bandwidth $b$. The idea of cross validation is to consider

$\min\left\{ \sum_{i=1}^n [Y_i-\widehat{Y}_i(b)]^2 \right\}$

where $\widehat{Y}_i(b)$ is the prediction obtained using a local regression technique, with bandwidth $b$. To get a more reliable choice of the bandwidth, $\widehat{Y}_i(b)$ is obtained using a model estimated on a sample where the $i$th observation was removed (otherwise, the criterion would favor very small bandwidths that simply reproduce the observations). But again, that is not the main point in this post, so let’s keep that for another one…
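Still, here is a minimal sketch of that leave-one-out idea, for a local constant regression with a Gaussian kernel. The names `loo_pred`, `cv` and `grid_b`, as well as the grid of candidate bandwidths, are assumptions made for the illustration, and the data are simulated assuming an x-range of $[0,10]$.

```r
# Simulated data (assumption: x ranges over [0, 10])
set.seed(1)
xr = seq(0, 10, by = .1)
yr = sin(xr/2) + rnorm(length(xr))/2

# Prediction at X_i from a local constant fit that leaves observation i out
loo_pred = function(i, b){
  w = dnorm((xr[-i] - xr[i])/b)
  sum(w*yr[-i])/sum(w)
}

# Cross-validation criterion: sum of squared leave-one-out errors
cv = function(b) sum((yr - sapply(seq_along(xr), loo_pred, b = b))^2)

grid_b = seq(.2, 3, by = .1)              # candidate bandwidths (illustrative)
cv_values = sapply(grid_b, cv)
b_cv = grid_b[which.min(cv_values)]       # bandwidth minimizing the criterion
b_cv
```

A grid search is used here only for simplicity; the selected value depends on the grid and on the simulated sample.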

Perhaps we can try on some real data? Inspired by a great post on http://f.briatte.org/teaching/ida/092_smoothing.html, by François Briatte, consider the Global Episode Opinion Survey ratings for some TV show, http://geos.tv/index.php/index?sid=189 , like Dexter.

```r
library(XML)
file = "geos-tww.csv"
html = htmlParse("http://www.geos.tv/index.php/list?sid=189&collection=all")
html = xpathApply(html, "//table[@id='collectionTable']")[[1]]
data = readHTMLTable(html)          # convert the HTML table into a data frame
data = data[,-3]
names(data)=c("no",names(data)[-1])
data=data[-(61:64),]
```

Let us reshape the dataset,

```r
data$no = 1:96
data$mu = as.numeric(substr(as.character(data$Mean), 0, 4))
data$se =  sd(data$mu,na.rm=TRUE)/sqrt(as.numeric(as.character(data$Count)))
data$season = 1 + (data$no - 1)%/%12
data$season = factor(data$season)
plot(data$no,data$mu,ylim=c(6,10))
segments(data$no,data$mu-1.96*data$se,
data$no,data$mu+1.96*data$se,col="light blue")
```

As done by François, we compute some kind of standard error, just to reflect uncertainty. But we won’t really use it.

```r
plot(data$no,data$mu,ylim=c(6,10))
abline(v=12*(0:8)+.5,lty=2)
# one linear regression per season (12 episodes each)
for(s in 1:8){reg=lm(mu~no,data=data,subset=season==s)
ep=(s-1)*12+1:12
lines(ep,predict(reg,newdata=data.frame(no=ep)),col="red") }
```

Here, we assume that all seasons should be considered as completely independent… which might not be a great assumption.

```r
db = data
NW = ksmooth(db$no,db$mu,kernel = "normal",bandwidth=5)
plot(data$no,data$mu)
lines(NW,col="red")
```

We can try to look at the curve with a larger bandwidth. The problem is that there is a missing value, at the end. If we (arbitrarily) fill it, we can run a kernel regression,

```r
db$mu[95]=7
NW = ksmooth(db$no,db$mu,kernel = "normal",bandwidth=12)
plot(data$no,data$mu,ylim=c(6,10))
lines(NW,col="red")
```
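One caveat when comparing with the hand-made functions above: the `bandwidth` of `ksmooth` is not the standard deviation of the Gaussian kernel. According to its documentation, kernels are scaled so that their quartiles are at $\pm 0.25\times$`bandwidth`. The sketch below, on simulated data (assuming an x-range of $[0,10]$, with an illustrative bandwidth `b = 2`), compares `ksmooth` with a hand-made Nadaraya-Watson estimator using the corresponding standard deviation `s`.

```r
# Simulated data (assumption: x ranges over [0, 10])
set.seed(1)
xr = seq(0, 10, by = .1)
yr = sin(xr/2) + rnorm(length(xr))/2

b  = 2                                    # ksmooth bandwidth (illustrative)
ks = ksmooth(xr, yr, kernel = "normal", bandwidth = b, x.points = xr)

# Standard deviation of a Gaussian whose quartiles are at +/- 0.25*b
s = 0.25*b/qnorm(0.75)
nw_hand = sapply(xr, function(x0){
  w = dnorm((xr - x0)/s)
  sum(w*yr)/sum(w)
})

# Largest gap between the two curves; it should be small
max(abs(ks$y - nw_hand))
```

The two curves are not expected to match exactly, since `ksmooth` may ignore observations far in the tails of the kernel.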