**R tutorial for Spatial Statistics**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

After publishing my previous post, I realized that it was way too long and so I decided to split it in 2-3 parts. If you think something is missing in the explanation here it may be related to the fact that this was originally part of the previous post (http://r-video-tutorial.blogspot.co.uk/2017/06/linear-models-anova-glms-and-mixed.html), so please look there first (otherwise please post your question in the comment section and I will try to answer).

## Dealing with non-normal data – Generalized Linear Models

### Count Data

*Y*. This is the link function, meaning the transformation of y that we need to make to insure the linearity of the response. For the model we are going to look at below, which are probably the most common, the link function is implicitly called, meaning that glm call the right function for us and we do not have to specify it explicitly. However, we can do that if needed.

From this equation you may think that instead of using glm we could log transform y and run a normal lm. The problem with that would be that lm is assuming an error term with constant variance (as we saw with the plot fitted versus residuals), but for this model the variance is not constant so that assumption would be violated. That is why we need to use glm.

` > dat = beall.webworms `

> str(dat)

'data.frame': 1300 obs. of 7 variables:

$ row : int 1 2 3 4 5 6 7 8 9 10 ...

$ col : int 1 1 1 1 1 1 1 1 1 1 ...

$ y : int 1 0 1 3 6 0 2 2 1 3 ...

$ block: Factor w/ 13 levels "B1","B10","B11",..: 1 1 1 1 1 6 6 6 6 6 ...

$ trt : Factor w/ 4 levels "T1","T2","T3",..: 1 1 1 1 1 1 1 1 1 1 ...

$ spray: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...

$ lead : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...

` hist(dat$y, main="Histogram of Worm Count", xlab="Number of Worms") `

` pois.mod = glm(y ~ trt, data=dat, family=c("poisson")) `

Once again the function summary will show some useful details about this model:

` > summary(pois.mod) `

Call:

glm(formula = y ~ trt, family = c("poisson"), data = dat)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.6733 -1.0046 -0.9081 0.6141 4.2771

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.33647 0.04688 7.177 7.12e-13 ***

trtT2 -1.02043 0.09108 -11.204 < 2e-16 ***

trtT3 -0.49628 0.07621 -6.512 7.41e-11 ***

trtT4 -1.22246 0.09829 -12.438 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 1955.9 on 1299 degrees of freedom

Residual deviance: 1720.4 on 1296 degrees of freedom

AIC: 3125.5

Number of Fisher Scoring iterations: 6

We can use the function plot to obtain more info about how the model fits our data:

` par(mfrow=c(2,2)) `

plot(pois.mod)

This creates the following plot, where the four outputs are included in the same image:

This plots tell a lot about the goodness of fit of the model. The first image in top left corner is the same we created for lm (i.e. residuals versus fitted values). This again does not show any trend, just a general underestimation. Then we have the normal QQ plot, where we see that the residuals are not normal, which violates one of the assumptions of the model. Even though we are talking about non-linear error term, we are “linearizing” the model with the link function and by specifying a different family for the error term. Therefore, we still need to obtain normal residuals.

The effects of the treatments are all negative and referred to the first level T1, meaning for example that a change from T1 to T2 will decrease the count by 1.02. We can check this effect by estimating changes between T1 and T2 with the function predict, and the option newdata:

` > predict(pois.mod, newdata=data.frame(trt=c("T1","T2"))) `

1 2

0.3364722 -0.6839588

We can compute the p-value of the model with the following line:

` > 1-pchisq(deviance(pois.mod), df.residual(pois.mod)) `

[1] 1.709743e-14

` pois.mod2 = glm(y ~ block + spray*lead, data=dat, family=c("poisson")) `

` > AIC(pois.mod, pois.mod2) `

df AIC

pois.mod 4 3125.478

pois.mod2 16 3027.438

` > mean(dat$y) `

[1] 0.7923077

> var(dat$y)

[1] 1.290164

` pois.mod3 = glm(y ~ trt, data=dat, family=c("quasipoisson")) `

` > summary(pois.mod3) `

Call:

glm(formula = y ~ trt, family = c("quasipoisson"), data = dat)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.6733 -1.0046 -0.9081 0.6141 4.2771

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.33647 0.05457 6.166 9.32e-10 ***

trtT2 -1.02043 0.10601 -9.626 < 2e-16 ***

trtT3 -0.49628 0.08870 -5.595 2.69e-08 ***

trtT4 -1.22246 0.11440 -10.686 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 1.35472)

Null deviance: 1955.9 on 1299 degrees of freedom

Residual deviance: 1720.4 on 1296 degrees of freedom

AIC: NA

Number of Fisher Scoring iterations: 6

Another way of directly comparing the two models is with the analysis of deviance, which can be performed with the function anova:

` > anova(pois.mod, pois.mod2, test="Chisq") `

Analysis of Deviance Table

Model 1: y ~ trt

Model 2: y ~ block + spray * lead

Resid. Df Resid. Dev Df Deviance Pr(>Chi)

1 1296 1720.4

2 1284 1598.4 12 122.04 < 2.2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This test compares the residual deviance of the two models to see whether they are difference and calculates a p-values. In this case the p-value is highly significant, meaning that the models are different. Since we already compared the AIC, we can conclude that pois.mod2 is significantly (low p-value) better (lower AIC) than pois.mod.

` library(MASS) `

NB.mod1 = glm.nb(y ~ trt, data=dat)

NOTE:

For GLM it is possible to also compute pseudo R-Squared to ease the interpretation of their accuracy. This can be done with the function pR2 from the package pscl. Please read below (Logistic Regression section) for an example on the use of this function.

### Logistic Regression

` > dat = johnson.blight `

> str(dat)

'data.frame': 25 obs. of 6 variables:

$ year : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...

$ area : int 0 0 0 0 50 810 120 40 0 0 ...

$ blight : int 0 0 0 0 1 1 1 1 0 0 ...

$ rain.am : int 8 9 9 6 16 10 12 10 11 8 ...

$ rain.ja : int 1 4 6 1 6 7 12 4 10 9 ...

$ precip.m: num 5.84 6.86 47.29 8.89 7.37 ...

` mod9 = glm(blight ~ rain.am, data=dat, family=binomial) `

` > summary(mod9) `

Call:

glm(formula = blight ~ rain.am, family = binomial, data = dat)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.9395 -0.6605 -0.3517 1.0228 1.6048

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -4.9854 2.0720 -2.406 0.0161 *

rain.am 0.4467 0.1860 2.402 0.0163 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 34.617 on 24 degrees of freedom

Residual deviance: 24.782 on 23 degrees of freedom

AIC: 28.782

Number of Fisher Scoring iterations: 5

` > predict(mod9, type="response") `

1 2 3 4 5 6 7

0.19598032 0.27590141 0.27590141 0.09070472 0.89680283 0.37328295 0.59273722

8 9 10 11 12 13 14

0.37328295 0.48214935 0.19598032 0.69466455 0.19598032 0.84754431 0.27590141

15 16 17 18 19 20 21

0.93143346 0.05998586 0.19598032 0.05998586 0.84754431 0.59273722 0.59273722

22 23 24 25

0.48214935 0.59273722 0.98109229 0.89680283

` >predict(mod9,newdata=data.frame(rain.am=c(15)),type="response") `

1

0.8475443

We could use the same method to compute probabilities for a series of value of rain to see what is the threshold of rain that increases the probability of blight above 50%:

` prob.NEW = predict(mod9,newdata=data.frame(rain.am=1:30),type="response") `

plot(1:30, prob.NEW, xlab="Rain", ylab="Probability of Blight")

abline(h=0.5)

As you can see we are using once again the function predict, but in this case we are estimating the probabilities for increasing values of rain. Then we are plotting the results:

From this plot it is clear that we reach a 50% probability at around 12 rainy days between April and May.

` > 1-pchisq(24.782, 23) `

[1] 0.3616226

` > 1-pchisq(34.617, 24) `

[1] 0.07428544

` install.packages("pscl") `

library(pscl)

` > pR2(mod9) `

llh llhNull G2 McFadden r2ML r2CU

-12.3910108 -17.3086742 9.8353268 0.2841155 0.3252500 0.4338984

Each of these R2 is computed in a different way and you can read the documentation to know more. In general, one of the most commonly reported is the McFadden, which however tends to be conservative, and the r2ML. In this paper you can find a complete overview/comparison of various pseudo R-Squared: http://www.glmj.org/archives/articles/Smith_v39n2.pdf

### Dealing with other distributions and transformation

` > dat = hughes.grapes `

> str(dat)

'data.frame': 270 obs. of 6 variables:

$ block : Factor w/ 3 levels "B1","B2","B3": 1 1 1 1 1 1 1 1 1 1 ...

$ trt : Factor w/ 6 levels "T1","T2","T3",..: 1 2 3 4 5 6 1 2 3 4 ...

$ vine : Factor w/ 3 levels "V1","V2","V3": 1 1 1 1 1 1 1 1 1 1 ...

$ shoot : Factor w/ 5 levels "S1","S2","S3",..: 1 1 1 1 1 1 2 2 2 2 ...

$ diseased: int 1 2 0 0 3 0 7 0 1 0 ...

$ total : int 14 12 12 13 8 9 8 10 14 10 ...

` descdist(dat$total, discrete = FALSE) `

` plot(fitdist(dat$total,distr="gamma")) `

` mod8 = glm(total ~ trt * vine, data=dat, family=Gamma(link=identity)) `

### Generalized Linear Mixed Effects models

` install.packages("lme4") `

library(lme4)

` dat = johnson.blight `

` mod10 = glm(blight ~ precip.m, data=dat, family="binomial") `

mod11 = glmer(blight ~ precip.m + (1|area), data=dat, family="binomial")

> AIC(mod10, mod11)

df AIC

mod10 2 37.698821

mod11 3 9.287692

**leave a comment**for the author, please follow the link and comment on their blog:

**R tutorial for Spatial Statistics**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.