# Linear Models, ANOVA, GLMs and Mixed-Effects models in R

This post first appeared on **R tutorial for Spatial Statistics** and was kindly contributed to R-bloggers.


As part of my new role as Lecturer in Agri-data analysis at Harper Adams University, I found myself applying a lot of techniques based on linear modelling. I also noticed that there is a lot of confusion among researchers regarding which technique should be used in each instance and how to interpret the model. For this reason I started reading books and online material to put together a sort of reference tutorial that researchers can use. This post is the result of my work so far.

Please feel free to comment, provide feedback and constructive criticism!!

## Theoretical Background – Linear Model and ANOVA

### Linear Model

The classical linear model takes the form:

y = β_0 + β_1·x + ε

where β_0 is the intercept (i.e. the value of the line at x = 0) and β_1 is the slope for the variable x, which indicates the change in y as a function of changes in x. For example, if the slope is +0.5, we can say that for each unit increase in x, y increases by 0.5. Please note that the slope can also be negative.

This equation can be expanded to accommodate more than one explanatory variable x:

y = β_0 + β_1·x_1 + β_2·x_2 + … + β_n·x_n + ε

In this case the interpretation is a bit more complex because, for example, the coefficient β_2 provides the slope for the explanatory variable x_2. This means that for a unit change in x_2 the target variable y changes by the value of β_2, provided the other explanatory variables are kept constant.

A model can also include an interaction:

y = β_0 + β_1·x_1 + β_2·x_2 + β_3·(x_1·x_2) + ε

Notice the interaction term between x_1 and x_2. In this case the interpretation becomes extremely difficult just by looking at the model.

If we collect the terms that multiply x_1, we can see that its slope becomes (β_1 + β_3·x_2), i.e. it is affected by the value of x_2 (Yan & Su, 2009). For this reason the only way we can actually determine how x_1 changes y, when the other terms are kept constant, is to plug new values of x_1 into the equation.
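To make this concrete, here is a small sketch on simulated data (not part of the original post, and not the tutorial's dataset) showing that with an interaction the slope of x_1 is β_1 + β_3·x_2, so it depends on where we evaluate x_2:

```r
# Simulate data with a known interaction: y = 1 + 2*x1 - 1*x2 + 3*x1*x2
set.seed(1)
x1 = runif(100); x2 = runif(100)
y = 1 + 2*x1 - 1*x2 + 3*x1*x2 + rnorm(100, sd=0.1)

mod = lm(y ~ x1 * x2)
b = coef(mod)  # names: "(Intercept)", "x1", "x2", "x1:x2"

# The slope of x1 at a given value of x2 is b1 + b3*x2
slope_at = function(v) unname(b["x1"] + b["x1:x2"] * v)
slope_at(0)   # approximately 2
slope_at(1)   # approximately 2 + 3 = 5
```

The same logic applies to the models fitted later in this post: with an interaction in the model, a single "slope" for x_1 no longer exists.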

### ANOVA

## Examples of ANOVA and ANCOVA in R

install.packages("agridat")

We also need other packages for the examples below. If some of these are not installed on your system, please use install.packages() again (replacing the name within the quotation marks as needed) to install them.

library(agridat)
library(ggplot2)
library(plotrix)
library(moments)
library(car)
library(fitdistrplus)
library(nlme)
library(multcomp)
library(epade)
library(lme4)

Now we can load the dataset lasrosas.corn, which has more than 3400 observations of corn yield in a field in Argentina, plus several explanatory variables, both factorial (or categorical) and continuous.

> dat = lasrosas.corn
> str(dat)
'data.frame': 3443 obs. of 9 variables:
 $ year : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
 $ lat  : num -33.1 -33.1 -33.1 -33.1 -33.1 ...
 $ long : num -63.8 -63.8 -63.8 -63.8 -63.8 ...
 $ yield: num 72.1 73.8 77.2 76.3 75.5 ...
 $ nitro: num 132 132 132 132 132 ...
 $ topo : Factor w/ 4 levels "E","HT","LO",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ bv   : num 163 170 168 177 171 ...
 $ rep  : Factor w/ 3 levels "R1","R2","R3": 1 1 1 1 1 1 1 1 1 1 ...
 $ nf   : Factor w/ 6 levels "N0","N1","N2",..: 6 6 6 6 6 6 6 6 6 6 ...

Important for the purpose of this tutorial is the target variable yield, which is what we are trying to model, and the explanatory variables: topo (topographic factor), bv (brightness value, which is a proxy for low organic matter content) and nf (factorial nitrogen levels). In addition we have rep, which is the blocking factor.

### Checking Assumptions

- Data independence
- Normality
- Equality of variances between groups
- Balanced design (i.e. all groups have the same number of samples)

> tapply(dat$yield, INDEX=dat$nf, FUN=var)
      N0       N1       N2       N3       N4       N5 
438.5448 368.8136 372.8698 369.6582 366.5705 405.5653

In this case we used tapply to calculate the variance of yield for each subgroup (i.e. each level of nitrogen). There is some variation between groups, but in my opinion it is not substantial. Now we can shift our focus to normality. There are statistical tests for normality, but again the ANOVA is flexible (particularly when the dataset is big) and can still produce correct results even when its assumptions are violated to a certain degree. For this reason, it is good practice to check normality with descriptive analysis alone, without any statistical test. For example, we could start by plotting the histogram of yield:

hist(dat$yield, main="Histogram of Yield", xlab="Yield (quintals/ha)")

qqnorm(dat$yield, main="QQplot of Yield")
qqline(dat$yield)

> skewness(dat$yield)
[1] 0.3875977

> tapply(dat$yield, INDEX=dat$nf, FUN=length)
 N0  N1  N2  N3  N4  N5 
573 577 571 575 572 575

### ANOVA with aov

means.nf = tapply(dat$yield, INDEX=dat$nf, FUN=mean)
StdErr.nf = tapply(dat$yield, INDEX=dat$nf, FUN=std.error)
BP = barplot(means.nf, ylim=c(0,max(means.nf)+10))
segments(BP, means.nf - (2*StdErr.nf), BP, means.nf + (2*StdErr.nf), lwd=1.5)
arrows(BP, means.nf - (2*StdErr.nf), BP, means.nf + (2*StdErr.nf), lwd=1.5, angle=90, code=3, length=0.05)

mod1 = aov(yield ~ nf, data=dat)

> summary(mod1)
              Df  Sum Sq Mean Sq F value   Pr(>F)    
nf             5   23987    4797    12.4 6.08e-12 ***
Residuals   3437 1330110     387                     
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> TukeyHSD(mod1, conf.level=0.95)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = yield ~ nf, data = dat)

$nf
           diff        lwr       upr     p adj
N1-N0 3.6434635  0.3353282  6.951599 0.0210713
N2-N0 4.6774357  1.3606516  7.994220 0.0008383
N3-N0 5.3629638  2.0519632  8.673964 0.0000588
N4-N0 7.5901274  4.2747959 10.905459 0.0000000
N5-N0 7.8588595  4.5478589 11.169860 0.0000000
N2-N1 1.0339723 -2.2770686  4.345013 0.9489077
N3-N1 1.7195004 -1.5857469  5.024748 0.6750283
N4-N1 3.9466640  0.6370782  7.256250 0.0089057
N5-N1 4.2153960  0.9101487  7.520643 0.0038074
N3-N2 0.6855281 -2.6283756  3.999432 0.9917341
N4-N2 2.9126917 -0.4055391  6.230923 0.1234409
N5-N2 3.1814238 -0.1324799  6.495327 0.0683500
N4-N3 2.2271636 -1.0852863  5.539614 0.3916824
N5-N3 2.4958957 -0.8122196  5.804011 0.2613027
N5-N4 0.2687320 -3.0437179  3.581182 0.9999099

> model.tables(mod1, type="effects")
Tables of effects

 nf 
         N0      N1      N2       N3      N4      N5
     -4.855  -1.212  -0.178   0.5075   2.735   3.003
rep 573.000 577.000 571.000 575.0000 572.000 575.000

> model.tables(mod1, type="means")
Tables of means
Grand mean
69.82831

 nf 
        N0     N1     N2     N3     N4     N5
     64.97  68.62  69.65  70.34  72.56  72.83
rep 573.00 577.00 571.00 575.00 572.00 575.00

### Linear Model with 1 factor

mod2 = lm(yield ~ nf, data=dat)

> summary(mod2)

Call:
lm(formula = yield ~ nf, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-52.313 -15.344  -3.126  13.563  45.337 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  64.9729     0.8218  79.060  < 2e-16 ***
nfN1          3.6435     1.1602   3.140   0.0017 ** 
nfN2          4.6774     1.1632   4.021 5.92e-05 ***
nfN3          5.3630     1.1612   4.618 4.01e-06 ***
nfN4          7.5901     1.1627   6.528 7.65e-11 ***
nfN5          7.8589     1.1612   6.768 1.53e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.67 on 3437 degrees of freedom
Multiple R-squared: 0.01771, Adjusted R-squared: 0.01629 
F-statistic: 12.4 on 5 and 3437 DF, p-value: 6.075e-12

dat$nf = relevel(dat$nf, ref="N1")
mod3 = lm(yield ~ nf, data=dat)
summary(mod3)

> summary(mod3)

Call:
lm(formula = yield ~ nf, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-52.313 -15.344  -3.126  13.563  45.337 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   68.616      0.819  83.784  < 2e-16 ***
nfN0          -3.643      1.160  -3.140 0.001702 ** 
nfN2           1.034      1.161   0.890 0.373308    
nfN3           1.720      1.159   1.483 0.138073    
nfN4           3.947      1.161   3.400 0.000681 ***
nfN5           4.215      1.159   3.636 0.000280 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.67 on 3437 degrees of freedom
Multiple R-squared: 0.01771, Adjusted R-squared: 0.01629 
F-statistic: 12.4 on 5 and 3437 DF, p-value: 6.075e-12

> anova(mod2)
Analysis of Variance Table

Response: yield
            Df  Sum Sq Mean Sq F value    Pr(>F)    
nf           5   23987  4797.4  12.396 6.075e-12 ***
Residuals 3437 1330110   387.0                      
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> Anova(mod2, type=c("III"))
Anova Table (Type III tests)

Response: yield
             Sum Sq   Df  F value    Pr(>F)    
(Intercept) 2418907    1 6250.447 < 2.2e-16 ***
nf            23987    5   12.396 6.075e-12 ***
Residuals   1330110 3437                       
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
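One caveat worth flagging (a general point about R, not something covered in the original post): Type III tests are only meaningful when factors use sum-to-zero contrasts. With R's default treatment contrasts, a "main effect" test in a model containing interactions is evaluated at the reference level only. A minimal sketch on simulated data:

```r
# Simulated example: one 3-level factor, one continuous covariate
set.seed(2)
d = data.frame(f = gl(3, 30), x = rnorm(90))
d$y = as.numeric(d$f) + d$x + rnorm(90)

# Switch to sum-to-zero contrasts before computing Type III tests
old = options(contrasts = c("contr.sum", "contr.poly"))
mod = lm(y ~ f * x, data = d)

# Marginal (Type III-style) F tests for each term; with car loaded,
# Anova(mod, type="III") would give the same kind of table
drop1(mod, . ~ ., test = "F")

options(old)  # restore the default treatment contrasts
```

For the simple one-factor model above this makes no practical difference, but it matters for the interaction models fitted later.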

### Two-way ANOVA

means.topo = tapply(dat$yield, INDEX=dat$topo, FUN=mean)
StdErr.topo = tapply(dat$yield, INDEX=dat$topo, FUN=std.error)
BP = barplot(means.topo, ylim=c(0,max(means.topo)+10))
segments(BP, means.topo - (2*StdErr.topo), BP, means.topo + (2*StdErr.topo), lwd=1.5)
arrows(BP, means.topo - (2*StdErr.topo), BP, means.topo + (2*StdErr.topo), lwd=1.5, angle=90, code=3, length=0.05)

mod1b = aov(yield ~ nf + topo, data=dat)
summary(mod1b)

             Df  Sum Sq Mean Sq F value Pr(>F)    
nf            5   23987    4797   23.21 <2e-16 ***
topo          3  620389  206796 1000.59 <2e-16 ***
Residuals  3434  709721     207                   
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

TukeyHSD(mod1b, conf.level=0.95, which=c("topo"))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = yield ~ nf + topo, data = dat)

$topo
            diff        lwr        upr p adj
HT-LO -36.240955 -38.052618 -34.429291     0
W-LO  -18.168544 -19.857294 -16.479794     0
E-LO   -6.206619  -8.054095  -4.359143     0
W-HT   18.072411  16.326440  19.818381     0
E-HT   30.034335  28.134414  31.934257     0
E-W    11.961925  10.178822  13.745028     0

### Two-Way ANOVA with Interactions

dat$topo = factor(dat$topo, levels(dat$topo)[c(2,4,1,3)])
means.INT = tapply(dat$yield, INDEX=list(dat$nf, dat$topo), FUN=mean)
bar3d.ade(means.INT, col="grey")

> means.INT
         LO       HT        W        E
N0 81.03027 41.50652 62.08192 75.13902
N1 83.06276 48.33630 65.74627 78.12808
N2 85.06879 48.79830 66.70848 78.92632
N3 85.23255 50.18398 66.16531 78.99210
N4 87.14400 52.12039 70.10682 80.39213
N5 87.94122 51.03138 69.65933 80.55078

mod1c = aov(yield ~ nf * topo, data=dat)

> summary(mod1c)
              Df  Sum Sq Mean Sq F value Pr(>F)    
nf             5   23987    4797  23.176 <2e-16 ***
topo           3  620389  206796 999.025 <2e-16 ***
nf:topo       15    1993     133   0.642  0.842    
Residuals   3419  707727     207                   
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

library(phia)
plot(interactionMeans(mod1c))

This function plots the effects of the interactions in a 2 by 2 plot, including the standard error of the coefficients, so that we can readily see which overlap:

> testInteractions(mod1c)
F Test: 
P-value adjustment method: holm
               Value Df Sum of Sq      F Pr(>F)
N0-N1 : HT-W -3.1654  1       377 1.8230      1
N0-N2 : HT-W -2.6652  1       267 1.2879      1
N0-N3 : HT-W -4.5941  1       784 3.7880      1
N0-N4 : HT-W -2.5890  1       250 1.2072      1
N0-N5 : HT-W -1.9475  1       140 0.6767      1
N1-N2 : HT-W  0.5002  1         9 0.0458      1
N1-N3 : HT-W -1.4286  1        76 0.3694      1
N1-N4 : HT-W  0.5765  1        12 0.0604      1
N1-N5 : HT-W  1.2180  1        55 0.2669      1
N2-N3 : HT-W -1.9289  1       139 0.6711      1
N2-N4 : HT-W  0.0762  1         0 0.0011      1
N2-N5 : HT-W  0.7178  1        19 0.0924      1
N3-N4 : HT-W  2.0051  1       149 0.7204      1

The table is very long, so only the first lines are included. From this it is clear that the interaction has no effect (adjusted p-values of 1); if it were significant, however, this function could give us numerous details about the specific effects.

### ANCOVA with lm

> mod3 = lm(yield ~ nf + bv, data=dat)
> summary(mod3)

Call:
lm(formula = yield ~ nf + bv, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-78.345 -10.847  -3.314  10.739  56.835 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 271.55084    4.99308  54.385  < 2e-16 ***
nfN0         -3.52312    0.95075  -3.706 0.000214 ***
nfN2          1.54761    0.95167   1.626 0.103996    
nfN3          2.08006    0.94996   2.190 0.028619 *  
nfN4          3.82330    0.95117   4.020 5.96e-05 ***
nfN5          4.47993    0.94994   4.716 2.50e-06 ***
bv           -1.16458    0.02839 -41.015  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.12 on 3436 degrees of freedom
Multiple R-squared: 0.3406, Adjusted R-squared: 0.3394 
F-statistic: 295.8 on 6 and 3436 DF, p-value: < 2.2e-16

> test.set = data.frame(nf="N1", bv=c(150,200))
> predict(mod3, test.set)
        1         2 
 96.86350  38.63438 

> test.set = data.frame(nf="N0", bv=150)
> predict(mod3, test.set)
       1 
93.34037

### Two-factors and one continuous explanatory variable

> levels(dat$topo)
[1] "E" "HT" "LO" "W"

> mod4 = lm(yield ~ nf + bv + topo, data=dat)
> summary(mod4)

Call:
lm(formula = yield ~ nf + bv + topo, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.394 -10.927  -2.211  10.364  47.338 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 171.20921    5.28842  32.374  < 2e-16 ***
nfN0         -3.73225    0.81124  -4.601 4.36e-06 ***
nfN2          1.29704    0.81203   1.597   0.1103    
nfN3          1.56499    0.81076   1.930   0.0537 .  
nfN4          3.71277    0.81161   4.575 4.94e-06 ***
nfN5          3.88382    0.81091   4.789 1.74e-06 ***
bv           -0.54206    0.03038 -17.845  < 2e-16 ***
topoHT      -24.11882    0.78112 -30.877  < 2e-16 ***
topoLO        3.13643    0.70924   4.422 1.01e-05 ***
topoW       -10.77758    0.66708 -16.156  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.75 on 3433 degrees of freedom
Multiple R-squared: 0.5204, Adjusted R-squared: 0.5191 
F-statistic: 413.8 on 9 and 3433 DF, p-value: < 2.2e-16

RES = residuals(mod4)
qqnorm(RES)
qqline(RES)

### ANCOVA with interactions

qplot(yield, bv, data=dat, geom="point", xlab="Yield", ylab="bv") + facet_wrap(~topo)+ geom_smooth(method = "lm", se = TRUE)

mod5 = lm(yield ~ nf + bv * topo, data=dat)

> Anova(mod5, type=c("III"))
Anova Table (Type III tests)

Response: yield
            Sum Sq   Df F value    Pr(>F)    
(Intercept)  20607    1 115.225 < 2.2e-16 ***
nf           23032    5  25.758 < 2.2e-16 ***
bv            5887    1  32.920 1.044e-08 ***
topo         40610    3  75.691 < 2.2e-16 ***
bv:topo      36059    3  67.209 < 2.2e-16 ***
Residuals   613419 3430                      
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> summary(mod5)

Call:
lm(formula = yield ~ nf + bv * topo, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.056 -10.328  -1.516   9.622  50.184 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  93.45783    8.70646  10.734  < 2e-16 ***
nfN1          3.96637    0.78898   5.027 5.23e-07 ***
nfN2          5.24313    0.79103   6.628 3.93e-11 ***
nfN3          5.46036    0.79001   6.912 5.68e-12 ***
nfN4          7.52685    0.79048   9.522  < 2e-16 ***
nfN5          7.73646    0.79003   9.793  < 2e-16 ***
bv           -0.27108    0.04725  -5.738 1.04e-08 ***
topoW        88.11105   12.07428   7.297 3.63e-13 ***
topoE       236.12311   17.12941  13.785  < 2e-16 ***
topoLO      -15.76280   17.27191  -0.913   0.3615    
bv:topoW     -0.41393    0.06726  -6.154 8.41e-10 ***
bv:topoE     -1.21024    0.09761 -12.399  < 2e-16 ***
bv:topoLO     0.28445    0.10104   2.815   0.0049 ** 
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.37 on 3430 degrees of freedom
Multiple R-squared: 0.547, Adjusted R-squared: 0.5454 
F-statistic: 345.1 on 12 and 3430 DF, p-value: < 2.2e-16

### GLS – For violations of independence

> summary(mod5, cor=T)
[…]
Correlation of Coefficients:
          (Intercept) nfN1  nfN2  nfN3  nfN4  nfN5  bv    topoW topoE topoLO
nfN1      -0.05                                                             
nfN2      -0.04        0.50                                                 
nfN3      -0.05        0.50  0.50                                           
nfN4      -0.05        0.50  0.50  0.50                                     
nfN5      -0.05        0.50  0.50  0.50  0.50                               
bv        -1.00        0.01 -0.01  0.01  0.00  0.00                         
topoW     -0.72        0.00  0.00  0.00  0.00  0.00  0.72                   
topoE     -0.51        0.01  0.02  0.03  0.01  0.02  0.51  0.37             
topoLO    -0.50       -0.02 -0.01  0.02 -0.01  0.02  0.50  0.36  0.26       
bv:topoW   0.70        0.00  0.00  0.00  0.00  0.00 -0.70 -1.00 -0.36 -0.35 
bv:topoE   0.48       -0.01 -0.02 -0.03 -0.01 -0.02 -0.48 -0.35 -1.00 -0.24 
bv:topoLO  0.47        0.02  0.01 -0.02  0.01 -0.02 -0.47 -0.34 -0.24 -1.00 
          bv:topoW bv:topoE
bv:topoE   0.34            
bv:topoLO  0.33     0.23   

mod6 = gls(yield ~ nf + bv * topo, data=dat, method="REML")
Anova(mod6, type=c("III"))
summary(mod6)

> anova(mod6, mod5)
     Model df      AIC      BIC    logLik
mod6     1 14 27651.21 27737.18 -13811.61
mod5     2 14 27651.21 27737.18 -13811.61
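The two fits are identical because gls was given no correlation structure; its real purpose is to let us model correlated errors. A minimal sketch on simulated data (not the corn dataset) assuming AR(1) serial correlation; for spatial data, a structure such as corExp on the coordinates would be the analogous choice:

```r
library(nlme)  # already loaded earlier in the tutorial

# Simulate a regression with AR(1) autocorrelated errors over time t
set.seed(3)
n = 50
d = data.frame(x = runif(n), t = 1:n)
d$y = 1 + 2 * d$x + as.numeric(arima.sim(list(ar = 0.6), n))

fit_ind = gls(y ~ x, data = d)                                   # assumes independence
fit_ar1 = gls(y ~ x, data = d, correlation = corAR1(form = ~ t)) # models the autocorrelation

AIC(fit_ind, fit_ar1)  # the AR(1) model should have the lower AIC here
```

On the corn data, something like `correlation = corExp(form = ~ lat + long)` would be the natural (if slow) counterpart for spatial dependence.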

## Linear Mixed-Effects Models

The simplest mixed-effects model is the random intercept model:

y_ij = β_0 + β_1·x_ij + u_j + ε_ij

where the fixed part is the same as in the ordinary linear model, and each cluster j receives its own random intercept *u_j* in addition to the standard error term ε_ij. A more complex form, that is normally used for repeated measures, is the random slope and intercept model:

y_ij = β_0 + u_j + (β_1 + v_j)·t_ij + ε_ij

which adds a random slope *v_j* related to time, so that each subject follows its own trajectory across the measurement occasions t_ij = 1, …, *T*.

### Random Intercept Model for Clustered Data

lme1 = lme(yield ~ nf + bv * topo, random= ~1|rep, data=dat)

> summary(lme1)
Linear mixed-effects model fit by REML
 Data: dat 
      AIC      BIC    logLik
 27648.36 27740.46 -13809.18

Random effects:
 Formula: ~1 | rep
        (Intercept) Residual
StdDev:    0.798407  13.3573

Fixed effects: yield ~ nf + bv * topo 
                Value Std.Error   DF    t-value p-value
(Intercept)  327.3304 14.782524 3428  22.143068       0
nfN1           3.9643  0.788049 3428   5.030561       0
nfN2           5.2340  0.790104 3428   6.624449       0
nfN3           5.4498  0.789084 3428   6.906496       0
nfN4           7.5286  0.789551 3428   9.535320       0
nfN5           7.7254  0.789111 3428   9.789976       0
bv            -1.4685  0.085507 3428 -17.173569       0
topoHT      -233.3675 17.143956 3428 -13.612232       0
topoLO      -251.9750 20.967003 3428 -12.017693       0
topoW       -146.4066 16.968453 3428  -8.628162       0
bv:topoHT      1.1945  0.097696 3428  12.226279       0
bv:topoLO      1.4961  0.123424 3428  12.121624       0
bv:topoW       0.7873  0.097865 3428   8.044485       0

> anova(lme1, mod6)
     Model df      AIC      BIC    logLik   Test  L.Ratio p-value
lme1     1 15 27648.36 27740.46 -13809.18
mod6     2 14 27651.21 27737.18 -13811.61 1 vs 2 4.857329  0.0275

### Random Intercept and Slope for repeated measures

lme2 = lme(yield ~ nf + bv * topo + year, random= ~year|rep, data=dat)
summary(lme2)
Anova(lme2, type=c("III"))

> Anova(lme2, type=c("III"))
Analysis of Deviance Table (Type III tests)

Response: yield
             Chisq Df Pr(>Chisq)    
(Intercept) 752.25  1  < 2.2e-16 ***
nf          155.57  5  < 2.2e-16 ***
bv          291.49  1  < 2.2e-16 ***
topo        236.52  3  < 2.2e-16 ***
year        797.13  1  < 2.2e-16 ***
bv:topo     210.38  3  < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> anova(lme1, lme2)
     Model df      AIC      BIC    logLik   Test  L.Ratio p-value
lme1     1 15 27648.36 27740.46 -13809.18
lme2     2 18 26938.83 27049.35 -13451.42 1 vs 2 715.5247  <.0001
Warning message:
In anova.lme(lme1, lme2) :
  fitted objects with different fixed effects. REML comparisons are not meaningful.

> ranef(lme2)
   (Intercept)          year
R1  -0.3468601 -1.189799e-07
R2  -0.5681688 -1.973702e-07
R3   0.9150289  3.163501e-07

> fixef(lme2)
  (Intercept)          nfN1          nfN2          nfN3          nfN4 
-1.133614e+04  3.918006e+00  5.132136e+00  5.368513e+00  7.464542e+00 
         nfN5            bv        topoHT        topoLO         topoW 
 7.639337e+00 -1.318391e+00 -2.049979e+02 -2.321431e+02 -1.136168e+02 
         year     bv:topoHT     bv:topoLO      bv:topoW 
 5.818826e+00  1.027686e+00  1.383705e+00  5.998379e-01

> intervals(lme2, which = "fixed")
Approximate 95% confidence intervals

 Fixed effects:
                    lower          est.         upper
(Intercept) -1.214651e+04 -1.133614e+04 -1.052576e+04
nfN1         2.526139e+00  3.918006e+00  5.309873e+00
nfN2         3.736625e+00  5.132136e+00  6.527648e+00
nfN3         3.974809e+00  5.368513e+00  6.762216e+00
nfN4         6.070018e+00  7.464542e+00  8.859065e+00
nfN5         6.245584e+00  7.639337e+00  9.033089e+00
bv          -1.469793e+00 -1.318391e+00 -1.166989e+00
topoHT      -2.353450e+02 -2.049979e+02 -1.746508e+02
topoLO      -2.692026e+02 -2.321431e+02 -1.950836e+02
topoW       -1.436741e+02 -1.136168e+02 -8.355954e+01
year         5.414742e+00  5.818826e+00  6.222911e+00
bv:topoHT    8.547273e-01  1.027686e+00  1.200644e+00
bv:topoLO    1.165563e+00  1.383705e+00  1.601846e+00
bv:topoW     4.264933e-01  5.998379e-01  7.731826e-01
attr(,"label")
[1] "Fixed effects:"

## Dealing with non-normal data – Generalized Linear Models

### Count Data

Count data are discrete and non-negative, so they are usually modelled with Poisson regression, which uses a log link function:

log(E(Y)) = β_0 + β_1·x

This guarantees positive expected counts, but it also means the coefficients (and, by default, the values returned by predict) are on the log scale and must be exponentiated to be interpreted on the scale of the counts *Y*.

> dat = beall.webworms
> str(dat)
'data.frame': 1300 obs. of 7 variables:
 $ row  : int 1 2 3 4 5 6 7 8 9 10 ...
 $ col  : int 1 1 1 1 1 1 1 1 1 1 ...
 $ y    : int 1 0 1 3 6 0 2 2 1 3 ...
 $ block: Factor w/ 13 levels "B1","B10","B11",..: 1 1 1 1 1 6 6 6 6 6 ...
 $ trt  : Factor w/ 4 levels "T1","T2","T3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ spray: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
 $ lead : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...

hist(dat$y, main="Histogram of Worm Count", xlab="Number of Worms")

pois.mod = glm(y ~ trt, data=dat, family=c("poisson"))

> summary(pois.mod)

Call:
glm(formula = y ~ trt, family = c("poisson"), data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6733  -1.0046  -0.9081   0.6141   4.2771  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.33647    0.04688   7.177 7.12e-13 ***
trtT2       -1.02043    0.09108 -11.204  < 2e-16 ***
trtT3       -0.49628    0.07621  -6.512 7.41e-11 ***
trtT4       -1.22246    0.09829 -12.438  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1955.9 on 1299 degrees of freedom
Residual deviance: 1720.4 on 1296 degrees of freedom
AIC: 3125.5

Number of Fisher Scoring iterations: 6

> predict(pois.mod, newdata=data.frame(trt=c("T1","T2")))
         1          2 
 0.3364722 -0.6839588 

> 1-pchisq(deviance(pois.mod), df.residual(pois.mod))
[1] 1.709743e-14
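For contrast, here is the same deviance goodness-of-fit check on simulated data where the Poisson model is correct, so the test should not signal lack of fit (an illustrative sketch, not from the original post):

```r
# Simulate counts that truly follow the fitted Poisson model
set.seed(4)
x = runif(500)
y = rpois(500, lambda = exp(0.5 + x))

m = glm(y ~ x, family = poisson)

# Deviance goodness-of-fit test: a large p-value means no evidence
# of lack of fit, unlike the near-zero p-value obtained above
pval = 1 - pchisq(deviance(m), df.residual(m))
pval
```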

pois.mod2 = glm(y ~ block + spray*lead, data=dat, family=c("poisson"))

> AIC(pois.mod, pois.mod2)
          df      AIC
pois.mod   4 3125.478
pois.mod2 16 3027.438

> mean(dat$y)
[1] 0.7923077
> var(dat$y)
[1] 1.290164
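A common numerical check for this is the Pearson dispersion statistic: the Pearson chi-squared statistic divided by the residual degrees of freedom, which should be close to 1 for a well-specified Poisson model. A sketch on simulated overdispersed (negative binomial) counts, not the webworm data:

```r
# Negative binomial counts: variance = mu + mu^2/size, i.e. overdispersed
set.seed(5)
y = rnbinom(500, mu = 2, size = 1)

m = glm(y ~ 1, family = poisson)

# Pearson dispersion statistic; values well above 1 flag overdispersion
disp = sum(residuals(m, type = "pearson")^2) / df.residual(m)
disp
```

When this ratio is clearly above 1, a quasi-Poisson or negative binomial model (both shown below) is the usual remedy.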

pois.mod3 = glm(y ~ trt, data=dat, family=c("quasipoisson"))

> summary(pois.mod3)

Call:
glm(formula = y ~ trt, family = c("quasipoisson"), data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6733  -1.0046  -0.9081   0.6141   4.2771  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.33647    0.05457   6.166 9.32e-10 ***
trtT2       -1.02043    0.10601  -9.626  < 2e-16 ***
trtT3       -0.49628    0.08870  -5.595 2.69e-08 ***
trtT4       -1.22246    0.11440 -10.686  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 1.35472)

    Null deviance: 1955.9 on 1299 degrees of freedom
Residual deviance: 1720.4 on 1296 degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6

library(MASS)
NB.mod1 = glm.nb(y ~ trt, data=dat)

### Logistic Regression

> dat = johnson.blight
> str(dat)
'data.frame': 25 obs. of 6 variables:
 $ year    : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
 $ area    : int 0 0 0 0 50 810 120 40 0 0 ...
 $ blight  : int 0 0 0 0 1 1 1 1 0 0 ...
 $ rain.am : int 8 9 9 6 16 10 12 10 11 8 ...
 $ rain.ja : int 1 4 6 1 6 7 12 4 10 9 ...
 $ precip.m: num 5.84 6.86 47.29 8.89 7.37 ...

mod9 = glm(blight ~ rain.am, data=dat, family=binomial)

> summary(mod9)

Call:
glm(formula = blight ~ rain.am, family = binomial, data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9395  -0.6605  -0.3517   1.0228   1.6048  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -4.9854     2.0720  -2.406   0.0161 *
rain.am       0.4467     0.1860   2.402   0.0163 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.617 on 24 degrees of freedom
Residual deviance: 24.782 on 23 degrees of freedom
AIC: 28.782

Number of Fisher Scoring iterations: 5

> predict(mod9, type="response")
         1          2          3          4          5          6          7 
0.19598032 0.27590141 0.27590141 0.09070472 0.89680283 0.37328295 0.59273722 
         8          9         10         11         12         13         14 
0.37328295 0.48214935 0.19598032 0.69466455 0.19598032 0.84754431 0.27590141 
        15         16         17         18         19         20         21 
0.93143346 0.05998586 0.19598032 0.05998586 0.84754431 0.59273722 0.59273722 
        22         23         24         25 
0.48214935 0.59273722 0.98109229 0.89680283

> predict(mod9, newdata=data.frame(rain.am=c(15)), type="response")
        1 
0.8475443
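Behind the scenes, type="response" simply applies the inverse logit to the linear predictor, which we can verify by hand. A sketch on simulated data (not the johnson.blight dataset):

```r
# Simulate a binary outcome with a known logistic relationship
set.seed(6)
x = rnorm(100)
z = rbinom(100, 1, plogis(-1 + 2 * x))

m = glm(z ~ x, family = binomial)

# Linear predictor (log-odds) at x = 0.5, then inverse logit by hand
eta = coef(m)[1] + coef(m)[2] * 0.5
p_hand = unname(plogis(eta))

# predict(..., type="response") does exactly the same transformation
p_pred = unname(predict(m, data.frame(x = 0.5), type = "response"))
all.equal(p_hand, p_pred)
```

This is why the raw coefficients of a logistic model are interpreted on the log-odds scale.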

> 1-pchisq(24.782, 23)
[1] 0.3616226

> 1-pchisq(34.617, 24)
[1] 0.07428544

install.packages("pscl")
library(pscl)

> pR2(mod9)
        llh     llhNull          G2    McFadden        r2ML        r2CU 
-12.3910108 -17.3086742   9.8353268   0.2841155   0.3252500   0.4338984

### Dealing with other distributions and transformation

> dat = hughes.grapes
> str(dat)
'data.frame': 270 obs. of 6 variables:
 $ block   : Factor w/ 3 levels "B1","B2","B3": 1 1 1 1 1 1 1 1 1 1 ...
 $ trt     : Factor w/ 6 levels "T1","T2","T3",..: 1 2 3 4 5 6 1 2 3 4 ...
 $ vine    : Factor w/ 3 levels "V1","V2","V3": 1 1 1 1 1 1 1 1 1 1 ...
 $ shoot   : Factor w/ 5 levels "S1","S2","S3",..: 1 1 1 1 1 1 2 2 2 2 ...
 $ diseased: int 1 2 0 0 3 0 7 0 1 0 ...
 $ total   : int 14 12 12 13 8 9 8 10 14 10 ...

descdist(dat$total, discrete = FALSE)

plot(fitdist(dat$total,distr="gamma"))

mod8 = glm(total ~ trt * vine, data=dat, family=Gamma(link=identity))

### Generalized Linear Mixed Effects models

install.packages("lme4")
library(lme4)

dat = johnson.blight

mod10 = glm(blight ~ precip.m, data=dat, family="binomial")
mod11 = glmer(blight ~ precip.m + (1|area), data=dat, family="binomial")

> AIC(mod10, mod11)
      df       AIC
mod10  2 37.698821
mod11  3  9.287692

## References and Further Reading

Finch, W.H., Bolin, J.E. and Kelley, K., 2014. *Multilevel modeling using R*. CRC Press.

Yan, X. and Su, X., 2009. *Linear regression analysis: theory and computing*. World Scientific.

James, G., Witten, D., Hastie, T. and Tibshirani, R., 2013. *An introduction to statistical learning* (Vol. 6). New York: Springer. http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf

Long, J. Scott. 1997. *Regression Models for Categorical and Limited Dependent Variables*. Sage. pp104-106. [For pseudo R-Squared equations, page available on google books]

Webster, R. and Oliver, M.A., 2007. *Geostatistics for environmental scientists*. John Wiley & Sons.

West, B.T., Galecki, A.T. and Welch, K.B., 2014. *Linear mixed models*. CRC Press.

Gałecki, A. and Burzykowski, T., 2013. *Linear mixed-effects models using R: A step-by-step approach*. Springer Science & Business Media.

Williams, R., 2004. *One-Way Analysis of Variance*. URL: https://www3.nd.edu/~rwilliam/stats1/x52.pdf

Witte, R. and Witte, J., 2009. *Statistics*. 9th ed. Wiley.

## Appendix 1: Assessing the accuracy of our model

### R-Squared

The proportion of the variance in y explained by the model: R² = 1 − SS_res/SS_tot, where SS_res is the residual sum of squares and SS_tot the total sum of squares. Note that R² always increases as predictors are added, even useless ones.

### Adjusted R-Squared

A version of R² penalized for the number of predictors p: Adj. R² = 1 − (1 − R²)·(n − 1)/(n − p − 1). It only increases if a new predictor improves the model more than would be expected by chance.

### Root Mean Squared Deviation or Root Mean Squared Error

RMSE = sqrt((1/n)·Σ(y_i − ŷ_i)²), expressed in the same units as the target variable.

### Mean Squared Deviation or Mean Squared Error

MSE = (1/n)·Σ(y_i − ŷ_i)², i.e. the square of the RMSE.

### Mean Absolute Deviation or Mean Absolute Error

MAE = (1/n)·Σ|y_i − ŷ_i|, which is less sensitive to occasional large errors than the (R)MSE.

### Akaike Information Criterion

AIC = 2k − 2·log(L), where k is the number of parameters and L the maximized likelihood. Lower values indicate a better trade-off between fit and complexity; AIC is only meaningful for comparing models fitted to the same data.
