My residuals look weird… aren’t they ?

November 3, 2010

(This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers)

Since I got the same question twice, let us look at it quickly….
 Some students show me a graph (from a Poisson regression) which looks
like that,

and they asked “isn’t it weird ?“, i.e.”residuals are null or positive… this is not what we should have, right ?“. Actually, residuals are always centered in a glm regression (if you keep the intercept). Let us generate a dataset,
> n=500
> set.seed(1)
> X=runif(n)
> E=rnorm(n)*.3
> XB=-5+2*X+E
> lambda=exp(XB)
> Y=rpois(n,lambda)
> reg=glm(Y~X,family=poisson)
> plot(predict(reg),residuals(reg,”pearson”))
we look carefully, residuals in the lower part, they are strictly
negative… And since, on average, we should have zero, then we are
very close to 0.
> mean(residuals(reg,”pearson”))
[1] -0.003381202

Actually, the true value is either
> u=seq(-10,2,by=.1)
> lines(u,(0-exp(u))/sqrt(exp(u)),lwd=.5,col=”red”)
> lines(u,(1-exp(u))/sqrt(exp(u)),lwd=.5,col=”blue”)

So what happens ? Here, the value of the parameter (the expected value of our Poisson distribution) is very small
> mean(Y)
[1] 0.026

crude estimator for the glm would be to consider a regression only on
the intercept. In that case, the error would be either 0-0.026, or
1-0.026, i.e. for 97.5%, we have -0.026 (which is small, close to 0,
but negative), while for 2.5% we have 0.974 (which is huge compared to
the previous value). Well, on that graph, we plot Pearson’s residuals, i..e. we divide the difference by the expected value… so we have those two curves….
Actually, there are only two clouds since the probability to have 1 (when lambda is close to 2.5%) is small, and extremely small to exceed 2…
On the graph below, the average value of Y increases (and then decreases): initially, almost all the points are on the red curve (i.e. we observe 0, and predict something extremely small), then we have more and more points on the blue
curve (for more and more individuals we have 1), then we start to see
some on the green curve (i.e. 2), etc. Note that areas where we observe
clouds remain unchanged (i.e. the colored curves).
> P=seq(-5,-2,by=.05)
> P=c(P,rev(P))
> for(i in 1:length(P)){
+ set.seed(1)
+ XB=P[i]+2*X+E
+ lambda=exp(XB)
+ Y=rpois(n,lambda)
+ reg=glm(Y~X,family=poisson)
+ u=seq(-10,2,by=.1)
+ plot(predict(reg),residuals(reg,”pearson”),ylim=c(-1,5),xlim=c(-6,0))
+ abline(h=0)
+ lines(u,(0-exp(u))/sqrt(exp(u)),lwd=.5,col=”red”)
+ lines(u,(1-exp(u))/sqrt(exp(u)),lwd=.5,col=”blue”)
+ lines(u,(2-exp(u))/sqrt(exp(u)),lwd=.5,col=”green”)
+ rlines(u,(3-exp(u))/sqrt(exp(u)),lwd=.5,col=”orange”)
+ lines(u,(3-exp(u))/sqrt(exp(u)),lwd=.5,col=”orange”)
+ lines(u,(4-exp(u))/sqrt(exp(u)),lwd=.5,col=”purple”)
+ points(predict(reg),residuals(reg,”pearson”))
+ }

So the graph is perfectly normal since we have a discrete distribution (while we predict something continuous).

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics - Tag - R-english. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , ,

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training


CRC R books series

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)