# My residuals look weird… aren’t they ?

November 3, 2010
By

(This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers)

Since I got the same question twice, let us look at it quickly….
Some students show me a graph (from a Poisson regression) which looks
like that,

and they asked “isn’t it weird ?“, i.e.”residuals are null or positive… this is not what we should have, right ?“. Actually, residuals are always centered in a glm regression (if you keep the intercept). Let us generate a dataset,
> n=500
> set.seed(1)
> X=runif(n)
> E=rnorm(n)*.3
> XB=-5+2*X+E
> lambda=exp(XB)
> Y=rpois(n,lambda)
> reg=glm(Y~X,family=poisson)
> plot(predict(reg),residuals(reg,”pearson”))
If
we look carefully, residuals in the lower part, they are strictly
negative… And since, on average, we should have zero, then we are
very close to 0.
> mean(residuals(reg,”pearson”))
[1] -0.003381202

Actually, the true value is either
> u=seq(-10,2,by=.1)
> lines(u,(0-exp(u))/sqrt(exp(u)),lwd=.5,col=”red”)
or
> lines(u,(1-exp(u))/sqrt(exp(u)),lwd=.5,col=”blue”)

So what happens ? Here, the value of the parameter (the expected value of our Poisson distribution) is very small
> mean(Y)
[1] 0.026

A
crude estimator for the glm would be to consider a regression only on
the intercept. In that case, the error would be either 0-0.026, or
1-0.026, i.e. for 97.5%, we have -0.026 (which is small, close to 0,
but negative), while for 2.5% we have 0.974 (which is huge compared to
the previous value). Well, on that graph, we plot Pearson’s residuals, i..e. we divide the difference by the expected value… so we have those two curves….
Actually, there are only two clouds since the probability to have 1 (when lambda is close to 2.5%) is small, and extremely small to exceed 2…
On the graph below, the average value of Y increases (and then decreases): initially, almost all the points are on the red curve (i.e. we observe 0, and predict something extremely small), then we have more and more points on the blue
curve (for more and more individuals we have 1), then we start to see
some on the green curve (i.e. 2), etc. Note that areas where we observe
clouds remain unchanged (i.e. the colored curves).
> P=seq(-5,-2,by=.05)
> P=c(P,rev(P))
> for(i in 1:length(P)){
+ set.seed(1)
+ XB=P[i]+2*X+E
+ lambda=exp(XB)
+ Y=rpois(n,lambda)
+ reg=glm(Y~X,family=poisson)
+ u=seq(-10,2,by=.1)
+ plot(predict(reg),residuals(reg,”pearson”),ylim=c(-1,5),xlim=c(-6,0))
+ abline(h=0)
+ lines(u,(0-exp(u))/sqrt(exp(u)),lwd=.5,col=”red”)
+ lines(u,(1-exp(u))/sqrt(exp(u)),lwd=.5,col=”blue”)
+ lines(u,(2-exp(u))/sqrt(exp(u)),lwd=.5,col=”green”)
+ rlines(u,(3-exp(u))/sqrt(exp(u)),lwd=.5,col=”orange”)
+ lines(u,(3-exp(u))/sqrt(exp(u)),lwd=.5,col=”orange”)
+ lines(u,(4-exp(u))/sqrt(exp(u)),lwd=.5,col=”purple”)
+ points(predict(reg),residuals(reg,”pearson”))
+ }

So the graph is perfectly normal since we have a discrete distribution (while we predict something continuous).

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Tags: , , , ,