Some heuristics about local regression and kernel smoothing

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In a standard linear model, we assume that . Alternatives can be considered, when the linear assumption is too strong.

  • Polynomial regression

A natural extension might be to assume some polynomial function,

Again, in the standard linear model approach (with a conditional normal distribution using the GLM terminology), parameters can be obtained using least squares, where a regression of  on  is considered.

Even if this polynomial model is not the real one, it might still be a good approximation for . Actually, from Stone-Weierstrass theorem, if  is continuous on some interval, then there is a uniform approximation of  by polynomial functions.

Just to illustrate, consider the following (simulated) dataset

set.seed(1)
xr = seq(0,n,by=.1)
yr = sin(xr/2)+rnorm(length(xr))/2
db = data.frame(x=xr,y=yr)
plot(db)

with the standard regression line

reg = lm(y ~ x,data=db)
abline(reg,col="red")

Consider some polynomial regression. If the degree of the polynomial function is large enough, any kind of pattern can be obtained,

reg=lm(y~poly(x,5),data=db)

But if the degree is too large, then too many ‘oscillations’ are obtained,

reg=lm(y~poly(x,25),data=db)

and the estimation might be be seen as no longer robust: if we change one point, there might be important (local) changes

plot(db)
attach(db)
lines(xr,predict(reg),col="red",lty=2)
yrm=yr;yrm[31]=yr[31]-2 
regm=lm(yrm~poly(xr,25)) 
lines(xr,predict(regm),col="red")
  • Local regression

Actually, if our interest is to have locally a good approximation of  , why not use a local regression?

This can be done easily using a weighted regression, where, in the least square formulation, we consider

(it is possible to consider weights in the GLM framework, but let’s keep that for another post). Two comments here:

  • here I consider a linear model, but any polynomial model can be considered. Even a constant one. In that case, the optimization problem is

which can be solve explicitly, since

  • so far, nothing was mentioned about the weights. The idea is simple, here: if you can a good prediction at point , then  should be proportional to some distance between  and : if  is too far from , then it should not have to much influence on the prediction.

For instance, if we want to have a prediction at some point , consider . With this model, we remove observations too far away,

Actually, here, it is the same as

reg=lm(yr~xr,subset=which(abs(xr-x0)<1)

A more general idea is to consider some kernel function  that gives the shape of the weight function, and some bandwidth (usually denoted h) that gives the length of the neighborhood, so that

This is actually the so-called Nadaraya-Watson estimator of function .
In the previous case, we did consider a uniform kernel , with bandwith ,

But using this weight function, with a strong discontinuity may not be the best idea… Why not a Gaussian kernel,

This can be done using

fitloc0 = function(x0){
w=dnorm((xr-x0))
reg=lm(y~1,data=db,weights=w)
return(predict(reg,newdata=data.frame(x=x0)))}

On our dataset, we can plot

ul=seq(0,10,by=.01)
vl0=Vectorize(fitloc0)(ul)
u0=seq(-2,7,by=.01)
linearlocalconst=function(x0){
w=dnorm((xr-x0))
plot(db,cex=abs(w)*4)
lines(ul,vl0,col="red")
axis(3)
axis(2)
reg=lm(y~1,data=db,weights=w)
u=seq(0,10,by=.02)
v=predict(reg,newdata=data.frame(x=u))
lines(u,v,col="red",lwd=2)
abline(v=c(0,x0,10),lty=2)
}
linearlocalconst(2)

Here, we want a local regression at point 2. The horizonal line below is the regression (the size of the point is proportional to the wieght). The curve, in red, is the evolution of the local regression

Let us use an animation to visualize the construction of the curve. One can use

library(animate)

but for some reasons, I cannot install the package easily on Linux. And it is not a big deal. We can still use a loop to generate some graphs

vx0=seq(1,9,by=.1)
vx0=c(vx0,rev(vx0))
graphloc=function(i){
name=paste("local-reg-",100+i,".png",sep="")
png(name,600,400)
linearlocalconst(vx0[i])
dev.off()}

for(i in 1:length(vx0)) graphloc(i)

and then, in a terminal, I simply use

    convert -delay 25 /home/freak/local-reg-1*.png /home/freak/local-reg.gif

Of course, it is possible to consider a linear model, locally,

fitloc1 = function(x0){
w=dnorm((xr-x0))
reg=lm(y~poly(x,degree=1),data=db,weights=w)
return(predict(reg,newdata=data.frame(x=x0)))}

or even a quadratic (local) regression,

fitloc2 = function(x0){
w=dnorm((xr-x0))
reg=lm(y~poly(x,degree=2),data=db,weights=w)
return(predict(reg,newdata=data.frame(x=x0)))}

Of course, we can change the bandwidth

To conclude the technical part this post, observe that, in practise, we have to choose the shape of the weight function (the so-called kernel). But there are (simple) technique to select the “optimal” bandwidth h. The idea of cross validation is to consider

where  is the prediction obtained using a local regression technique, with bandwidth . And to get a more accurate (and optimal) bandwith  is obtained using a model estimated on a sample where the ith observation was removed. But again, that is not the main point in this post, so let’s keep that for another one…

Perhaps we can try on some real data? Inspired from a great post on http://f.briatte.org/teaching/ida/092_smoothing.html, by François Briatte, consider the Global Episode Opinion Survey, from some TV show, http://geos.tv/index.php/index?sid=189 , like Dexter.

library(XML)
library(downloader)
file = "geos-tww.csv"
html = htmlParse("http://www.geos.tv/index.php/list?sid=189&collection=all")
html = xpathApply(html, "//table[@id='collectionTable']")[[1]]
data = readHTMLTable(html)
data = data[,-3]
names(data)=c("no",names(data)[-1])
data=data[-(61:64),]

Let us reshape the dataset,

data$no = 1:96
data$mu = as.numeric(substr(as.character(data$Mean), 0, 4))
data$se =  sd(data$mu,na.rm=TRUE)/sqrt(as.numeric(as.character(data$Count)))
data$season = 1 + (data$no - 1)%/%12
data$season = factor(data$season)
plot(data$no,data$mu,ylim=c(6,10))
segments(data$no,data$mu-1.96*data$se,
data$no,data$mu+1.96*data$se,col="light blue")

As done by François, we compute some kind of standard error, just to reflect uncertainty. But we won’t really use it.

plot(data$no,data$mu,ylim=c(6,10))
abline(v=12*(0:8)+.5,lty=2)
for(s in 1:8){reg=lm(mu~no,data=db,subset=season==s)
lines((s-1)*12+1:12,predict(reg)[1:12],col="red") }

Henre, we assume that all seasons should be considered as completely independent… which might not be a great assumption.

db = data
NW = ksmooth(db$no,db$mu,kernel = "normal",bandwidth=5)
plot(data$no,data$mu)
lines(NW,col="red")

We can try to look the curve with a larger bandwidth. The problem is that there is a missing value, at the end. If we (arbitrarily) fill it, we can run a kernel regression,

db$mu[95]=7
NW = ksmooth(db$no,db$mu,kernel = "normal",bandwidth=12) 
plot(data$no,data$mu,ylim=c(6,10)) 
lines(NW,col="red")

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)