**R-english – Freakonometrics**, and kindly contributed to R-bloggers)

There are some graphs that you cannot forget. One graph that I found puzzling was mentioned on Andrew Gelman’s blog, a few years back, and was related to rupture detection

What I remember from this graph is that if you want to get a rupture, you can easily find one…

Recently, I had to review a paper, and Imbens & Lemieux (2008) was mentioned (the paper on Regression Discontinuity). It is a great reference and a major contribution, undoublty. But now, each time I read about ruptures and discontinuities, I have this graph in mind, and I start having doubt. I there really a discontinuity ? or is it somehow artificial, simply because the author is looking for a discontinuity ? For instance, consider the following (simulated) dataset

`f=function(x) pnorm(x,mean=3)-x/4`

n=100

set.seed(1)

X=runif(n)*6

Y=f(X)+rnorm(n)/8

DB=data.frame(X,Y)

It is a smooth continuous model. To test for discontinuity, a standard procedure seems to fit models on the left, and on the right. For instance two linear regression models,

`s=2`

i=1

reg1=lm(Y~poly(X,i),data=DB,subset=X<=s)

reg2=lm(Y~poly(X,i),daia=DB,subset=X>=s)

u=seq(0,6,by=.025)

v1=predict(reg1,newdata=data.frame(X=u[u<=s]))

v2=predict(reg2,newdata=data.frame(X=u[u>=s]))

lines(u[u<=s],v1,lwd=2)

lines(u[u>=s],v2,lwd=2)

or two quadratic ones,

`i=2`

reg1=lm(Y~poly(X,i),data=DB,subset=X<=s)

reg2=lm(Y~poly(X,i),daia=DB,subset=X>=s)

u=seq(0,6,by=.025)

v1=predict(reg1,newdata=data.frame(X=u[u<=s]))

v2=predict(reg2,newdata=data.frame(X=u[u>=s]))

lines(u[u<=s],v1,lwd=2)

lines(u[u>=s],v2,lwd=2)

We observe here a discontinuity… but is it significant ? Consider here simulations of the dataset, and try several breakpoints

`SIMU=function(ns=1e3){`

MAT=matrix(NA,ns,21)

n=100

for(j in 1:ns){

X=runif(n)*6

Y=f(X)+rnorm(n)/8

DB=data.frame(X,Y)

saut=function(s){

reg1=lm(Y~poly(X,2),data=DB,subset=X<=s)

reg2=lm(Y~poly(X,2),data=DB,subset=X>=s)

predict(reg2,newdata=data.frame(X=s))-predict(reg1,newdata=data.frame(X=s))

}

S=seq(.5,5.5,by=.25)

MAT[j,]=Vectorize(saut)(S)

}

return(MAT)}

If we plot those differences, between the model on the left and the model on the right of the breakpoint, we get

`SM=SIMU(1e3)`

S=seq(.5,5.5,by=.25)

plot(S,SM[1,],type="l",ylim=c(-.7,.5),col="light blue")

for(j in 2:100) lines(S,SM[j,],col="light blue")

abline(h=0,lty=2)

lines(S,apply(SM,2,mean),col="red",lwd=2)

lines(S,apply(SM,2,function(x) quantile(x,.9)),col="red")

lines(S,apply(SM,2,function(x) quantile(x,.1)),col="red")

for various breakpoints. The good thing is that it looks like we should reject the idea of having a breakpoint here, since it might not be “significant”. But what if the change of concavity was sharper, in our simulated dataset

`f=function(x) pnorm(x,mean=3)-x/4`

n=100

set.seed(1)

X=runif(n)*6

Y=f(X)+rnorm(n)/8

DB=data.frame(X,Y)

With a simular code, we might now accept the idea of having a discontinuity, that we do not have actually…

But more puzzeling, if we get back to the initial statement in the post of Andrew Gelman, it might actually be possible to chose the degrees of the polynomial regressions to validate the discontinuity hypothesis. Let us get back on our original dataset.

`SIMU=function(ns){`

MAT=matrix(NA,ns,21)

n=100

for(j in 1:ns){

X=runif(n)*6

Y=f(X)+rnorm(n)/8

DB=data.frame(X,Y)

saut=function(s){

reg1=reg2=rep(NA,3)

for(i in 1:3){

reg1[i]=predict(lm(Y~poly(X,i),data=DB,subset=X<=s),newdata=data.frame(X=s))

reg2[i]=predict(lm(Y~poly(X,i),data=DB,subset=X>=s),newdata=data.frame(X=s))

}

max(abs(rep(reg1,each=3)-rep(reg2,3)))}

S=seq(.5,5.5,by=.25)

MAT[j,]=Vectorize(saut)(S)

}

return(MAT)}

Here we try a linear regression, a quadratic and a cubic one. If we generate one thousand datasets, we get

`SM=SIMU(1e3)`

S=seq(.5,5.5,by=.25)

plot(S,SM[1,],type="l",ylim=c(-.1,.9),col="light blue")

for(j in 2:100) lines(S,SM[j,],col="light blue")

abline(h=0,lty=2)

lines(S,apply(SM,2,mean),col="red",lwd=2)

lines(S,apply(SM,2,function(x) quantile(x,.9)),col="red")

lines(S,apply(SM,2,function(x) quantile(x,.1)),col="red")

i.e. wherever we seek a discontinuity, we can actually find one. Even if we did generate a very smooth regression model….

**leave a comment**for the author, please follow the link and comment on their blog:

**R-english – Freakonometrics**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...