Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Let us continue our discussion on smoothing techniques in regression. Assume that .$\mathbb{E}(Y\vert X=x)=h(x)$ where $h(\cdot)$ is some unkown function, but assumed to be sufficently smooth. For instance, assume that $h(\cdot)$ is continuous, that $h'(\cdot)$ exists, and is continuous, that  $h''(\cdot)$ exists and is also continuous, etc. If $h(\cdot)$ is smooth enough, Taylor’s expansion can be used. Hence, for $x\in(\alpha,\beta)$

$h(x)=h(\alpha)+\sum_{k=1}^ d \frac{(x-\alpha)^k}{k!}h^{(k)}(x_0)+\frac{1}{d!}\int_{\alpha}^x [x-t]^d h^{(d+1)}(t)dt$

which can also be writen as

$h(x)=\sum_{k=0}^ d a_k (x-\alpha)^k +\frac{1}{d!}\int_{\alpha}^x [x-t]^d h^{(d+1)}(t)dt$

for some $a_k$‘s. The first part is simply a polynomial.

The second part, is some integral. Using Riemann integral, observe that

$\frac{1}{d!}\int_{\alpha}^x [x-t]^d h^{(d+1)}(t)dt\sim \sum_{i=1}^ j b_i (x-x_i)_+^d$

for some $b_i$‘s, and some

$\alpha < x_1< x_2< \cdots < x_{j-1} < x_j < \beta$

Thus,

$h(x) \sim \sum_{k=0}^ d a_k (x-\alpha)^k +\sum_{i=1}^ j b_i (x-x_i)_+^d$

Nice! We have our linear regression model. A natural idea is then to consider a regression of $Y$ on $\boldsymbol{X}$ where

$\boldsymbol{X} = (1,X,X^2,\cdots,X^d,(X-x_1)_+^d,\cdots,(X-x_k)_+^d )$

given some knots $\{x_1,\cdots,x_k\}$. To make things easier to understand, let us work with our previous dataset,

plot(db)

If we consider one knot, and an expansion of order 1,

attach(db)
library(splines)
B=bs(xr,knots=c(3),Boundary.knots=c(0,10),degre=1)
reg=lm(yr~B)
lines(xr[xr<=3],predict(reg)[xr<=3],col="red")
lines(xr[xr>=3],predict(reg)[xr>=3],col="blue")

The prediction obtained with this spline can be compared with regressions on subsets (the doted lines)

reg=lm(yr~xr,subset=xr<=3)
lines(xr[xr<=3],predict(reg)[xr<=3],col="red",lty=2)
reg=lm(yr~xr,subset=xr>=3)
lines(xr[xr>=3],predict(reg),col="blue",lty=2)

It is different, since we have here three parameters (and not four, as for the regressions on the two subsets). One degree of freedom is lost, when asking for a continuous model. Observe that it is possible to write, equivalently

reg=lm(yr~bs(xr,knots=c(3),Boundary.knots=c(0,10),degre=1),data=db)

So, what happened here?

B=bs(xr,knots=c(2,5),Boundary.knots=c(0,10),degre=1)
matplot(xr,B,type="l")
abline(v=c(0,2,5,10),lty=2)

Here, the functions that appear in the regression are the following

Now, if we run the regression on those two components, we get

B=bs(xr,knots=c(2,5),Boundary.knots=c(0,10),degre=1)
matplot(xr,B,type="l")
abline(v=c(0,2,5,10),lty=2)

If we add one knot, we get

the prediction is

reg=lm(yr~B)
lines(xr,predict(reg),col="red")

Of course, we can choose much more knots,

B=bs(xr,knots=1:9,Boundary.knots=c(0,10),degre=1)
reg=lm(yr~B)
lines(xr,predict(reg),col="red")

We can even get a confidence interval

reg=lm(yr~B)
P=predict(reg,interval="confidence")
plot(db,col="white")
polygon(c(xr,rev(xr)),c(P[,2],rev(P[,3])),col="light blue",border=NA)
points(db)
reg=lm(yr~B)
lines(xr,P[,1],col="red")
abline(v=c(0,2,5,10),lty=2)

And if we keep the  two knots we chose previously, but consider Taylor’s expansion of order 2, we get

B=bs(xr,knots=c(2,5),Boundary.knots=c(0,10),degre=2)
matplot(xr,B,type="l")
abline(v=c(0,2,5,10),lty=2)

So, what’s going on? If we consider the constant, and the first component of the spline based matrix, we get

k=2
plot(db)
B=cbind(1,B)
lines(xr,B[,1:k]%*%coefficients(reg)[1:k],col=k-1,lty=k-1)

If we add the constant term, the first term and the second term, we get the part on the left, before the first knot,

k=3
lines(xr,B[,1:k]%*%coefficients(reg)[1:k],col=k-1,lty=k-1)

and with three terms from the spline based matrix, we can get the part between the two knots,

k=4
lines(xr,B[,1:k]%*%coefficients(reg)[1:k],col=k-1,lty=k-1)

and finallty, when we sum all the terms, we get this time the part on the right, after the last knot,

k=5
lines(xr,B[,1:k]%*%coefficients(reg)[1:k],col=k-1,lty=k-1)

This is what we get using a spline regression, quadratic, with two (fixed) knots. And can can even get confidence intervals, as before

reg=lm(yr~B)
P=predict(reg,interval="confidence")
plot(db,col="white")
polygon(c(xr,rev(xr)),c(P[,2],rev(P[,3])),col="light blue",border=NA)
points(db)
reg=lm(yr~B)
lines(xr,P[,1],col="red")
abline(v=c(0,2,5,10),lty=2)

The great idea here is to use functions $(x-x_i)_+$, that will insure continuity at point $x_i$.

Of course, we can use those splines on our Dexter application,

Here again, using linear spline function, it is possible to impose a continuity constraint,

plot(data$no,data$mu,ylim=c(6,10))
abline(v=12*(0:8)+.5,lty=2)
reg=lm(mu~bs(no,knots=c(12*(1:7)+.5),Boundary.knots=c(0,97),
degre=1),data=db)
lines(c(1:94,96),predict(reg),col="red")

But we can also consider some quadratic splines,

plot(data$no,data$mu,ylim=c(6,10))
abline(v=12*(0:8)+.5,lty=2)
reg=lm(mu~bs(no,knots=c(12*(1:7)+.5),Boundary.knots=c(0,97),
degre=2),data=db)
lines(c(1:94,96),predict(reg),col="red")