**Freakonometrics - Tag - R-english**, and kindly contributed to R-bloggers)

Since I got the same question twice, let us look at it quickly….

Some students show me a graph (from a Poisson regression) which looks

like that,

and they asked “isn’t it weird ?“, i.e.”residuals are null or positive… this is not what we should have, right ?“. Actually, residuals *are* always centered in a glm regression (if you keep the intercept). Let us generate a dataset,

> n=500

> set.seed(1)

> X=runif(n)

> E=rnorm(n)*.3

> XB=-5+2*X+E

> lambda=exp(XB)

> Y=rpois(n,lambda)

> reg=glm(Y~X,family=poisson)

> plot(predict(reg),residuals(reg,”pearson”))

If

we look carefully, residuals in the lower part, they are strictly

negative… And since, on average, we should have zero, then we are

very close to 0.

> mean(residuals(reg,”pearson”))

[1] -0.003381202

Actually, the *true* value is either

> u=seq(-10,2,by=.1)

> lines(u,(0-exp(u))/sqrt(exp(u)),lwd=.5,col=”red”)

or

> lines(u,(1-exp(u))/sqrt(exp(u)),lwd=.5,col=”blue”)

So what happens ? Here, the value of the parameter (the expected value of our Poisson distribution) is very small

> mean(Y)

[1] 0.026

A

crude estimator for the glm would be to consider a regression only on

the intercept. In that case, the error would be either 0-0.026, or

1-0.026, i.e. for 97.5%, we have -0.026 (which is small, close to 0,

but negative), while for 2.5% we have 0.974 (which is huge compared to

the previous value). Well, on that graph, we plot *Pearson’s* residuals, i..e. we divide the difference by the expected value… so we have those two curves….

Actually, there are only *two* clouds since the probability to have 1 (when lambda is close to 2.5%) is small, and extremely small to exceed 2…

On the graph below, the average value of Y increases (and then decreases): initially, almost all the points are on the red curve (i.e. we observe 0, and predict something extremely small), then we have more and more points on the blue

curve (for more and more individuals we have 1), then we start to see

some on the green curve (i.e. 2), etc. Note that areas where we observe

clouds remain unchanged (i.e. the colored curves).

> P=seq(-5,-2,by=.05)

> P=c(P,rev(P))

> for(i in 1:length(P)){

+ set.seed(1)

+ XB=P[i]+2*X+E

+ lambda=exp(XB)

+ Y=rpois(n,lambda)

+ reg=glm(Y~X,family=poisson)

+ u=seq(-10,2,by=.1)

+ plot(predict(reg),residuals(reg,”pearson”),ylim=c(-1,5),xlim=c(-6,0))

+ abline(h=0)

+ lines(u,(0-exp(u))/sqrt(exp(u)),lwd=.5,col=”red”)

+ lines(u,(1-exp(u))/sqrt(exp(u)),lwd=.5,col=”blue”)

+ lines(u,(2-exp(u))/sqrt(exp(u)),lwd=.5,col=”green”)

+ rlines(u,(3-exp(u))/sqrt(exp(u)),lwd=.5,col=”orange”)

+ lines(u,(3-exp(u))/sqrt(exp(u)),lwd=.5,col=”orange”)

+ lines(u,(4-exp(u))/sqrt(exp(u)),lwd=.5,col=”purple”)

+ points(predict(reg),residuals(reg,”pearson”))

+ }

So the graph is perfectly normal since we have a discrete distribution (while we predict something continuous).

**leave a comment**for the author, please follow the link and comment on his blog:

**Freakonometrics - Tag - R-english**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...