My residuals look weird… aren’t they ?

November 3, 2010
By

(This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers)

Since I got the same question twice, let us look at it quickly....  Some students show me a graph (from a Poisson regression) which looks like that,

and they asked "isn't it weird ?", i.e."residuals are null or positive... this is not what we should have, right ?". Actually, residuals are always centered in a glm regression (if you keep the intercept). Let us generate a dataset,
> n=500
> set.seed(1)
> X=runif(n)
> E=rnorm(n)*.3
> XB=-5+2*X+E
> lambda=exp(XB)
> Y=rpois(n,lambda)
> reg=glm(Y~X,family=poisson)
> plot(predict(reg),residuals(reg,"pearson"))
If we look carefully, residuals in the lower part, they are strictly negative... And since, on average, we should have zero, then we are very close to 0.
> mean(residuals(reg,"pearson"))
[1] -0.003381202

Actually, the true value is either
> u=seq(-10,2,by=.1)
> lines(u,(0-exp(u))/sqrt(exp(u)),lwd=.5,col="red")
or
> lines(u,(1-exp(u))/sqrt(exp(u)),lwd=.5,col="blue")

So what happens ? Here, the value of the parameter (the expected value of our Poisson distribution) is very small
> mean(Y)
[1] 0.026

A crude estimator for the glm would be to consider a regression only on the intercept. In that case, the error would be either 0-0.026, or 1-0.026, i.e. for 97.5%, we have -0.026 (which is small, close to 0, but negative), while for 2.5% we have 0.974 (which is huge compared to the previous value). Well, on that graph, we plot Pearson's residuals, i..e. we divide the difference by the expected value... so we have those two curves....
Actually, there are only two clouds since the probability to have 1 (when lambda is close to 2.5%) is small, and extremely small to exceed 2...
On the graph below, the average value of Y increases (and then decreases): initially, almost all the points are on the red curve (i.e. we observe 0, and predict something extremely small), then we have more and more points on the blue curve (for more and more individuals we have 1), then we start to see some on the green curve (i.e. 2), etc. Note that areas where we observe clouds remain unchanged (i.e. the colored curves).
> P=seq(-5,-2,by=.05)
> P=c(P,rev(P))
> for(i in 1:length(P)){
+ set.seed(1)
+ XB=P[i]+2*X+E
+ lambda=exp(XB)
+ Y=rpois(n,lambda)
+ reg=glm(Y~X,family=poisson)
+ u=seq(-10,2,by=.1)
+ plot(predict(reg),residuals(reg,"pearson"),ylim=c(-1,5),xlim=c(-6,0))
+ abline(h=0)
+ lines(u,(0-exp(u))/sqrt(exp(u)),lwd=.5,col="red")
+ lines(u,(1-exp(u))/sqrt(exp(u)),lwd=.5,col="blue")
+ lines(u,(2-exp(u))/sqrt(exp(u)),lwd=.5,col="green")
+ rlines(u,(3-exp(u))/sqrt(exp(u)),lwd=.5,col="orange")
+ lines(u,(3-exp(u))/sqrt(exp(u)),lwd=.5,col="orange")
+ lines(u,(4-exp(u))/sqrt(exp(u)),lwd=.5,col="purple")
+ points(predict(reg),residuals(reg,"pearson"))
+ }


So the graph is perfectly normal since we have a discrete distribution (while we predict something continuous).

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics - Tag - R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , ,

Comments are closed.