**R tutorial for Spatial Statistics**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As part of my new role as Lecturer in Agri-data analysis at Harper Adams University, I found myself applying a lot of techniques based on linear modelling. Another thing I noticed is that there is a lot of confusion among researchers in regards to what technique should be used in each instance and how to interpret the model. For this reason I started reading material from books and on-line to try and create a sort of reference tutorial that researchers can use. This post is the result of my work so far.

Please feel free to comment, provide feedback and constructive criticism!!

## Theoretical Background – Linear Model and ANOVA

### Linear Model

where β_0 is the intercept (i.e. the value of the line at zero), β_1 is the slope for the variable x, which indicates the changes in y as a function of changes in x. For example if the slope is +0.5, we can say that for each unit increment in x, y increases of 0.5. Please note that the slope can also be negative.

This equation can be expanded to accommodate more that one explanatory variable x:

In this case the interpretation is a bit more complex because for example the coefficient β_2 provides the slope for the explanatory variable x_2. This means that for a unit variation of x_2 the target variable y changes by the value of β_2, if the other explanatory variables are kept constant.

notice the interaction term between x_1 and x_2. In this case the interpretation becomes extremely difficult just by looking at the model.

we can see that its slope become affected by the value of x_2 (Yan & Su, 2009), for this reason the only way we can actually determine how x_1 changes Y, when the other terms are kept constant, is to use the equation with new values of x_1.

### ANOVA

## Examples of ANOVA and ANCOVA in R

` install.packages("agridat") `

We also need to include other packages for the examples below. If some of these are not installed in your system please use again the function install.packages (replacing the name within quotation marks according to your needs) to install them.

` library(agridat) `

library(ggplot2)

library(plotrix)

library(moments)

library(car)

library(fitdistrplus)

library(nlme)

library(multcomp)

library(epade)

library(lme4)

Now we can load the dataset lasrosas.corn, which has more that 3400 observations of corn yield in a field in Argentina, plus several explanatory variables both factorial (or categorical) and continuous.

` > dat = lasrosas.corn `

> str(dat)

'data.frame': 3443 obs. of 9 variables:

$ year : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...

$ lat : num -33.1 -33.1 -33.1 -33.1 -33.1 ...

$ long : num -63.8 -63.8 -63.8 -63.8 -63.8 ...

$ yield: num 72.1 73.8 77.2 76.3 75.5 ...

$ nitro: num 132 132 132 132 132 ...

$ topo : Factor w/ 4 levels "E","HT","LO",..: 4 4 4 4 4 4 4 4 4 4 ...

$ bv : num 163 170 168 177 171 ...

$ rep : Factor w/ 3 levels "R1","R2","R3": 1 1 1 1 1 1 1 1 1 1 ...

$ nf : Factor w/ 6 levels "N0","N1","N2",..: 6 6 6 6 6 6 6 6 6 6 ...

Important for the purpose of this tutorial is the target variable yield, which is what we are trying to model, and the explanatory variables: topo (topographic factor), bv (brightness value, which is a proxy for low organic matter content) and nf (factorial nitrogen levels). In addition we have rep, which is the blocking factor.

### Checking Assumptions

- Data independence
- Normality
- Equality of variances between groups
- Balance design (i.e. all groups have the same number of samples)

` > tapply(dat$yield, INDEX=dat$nf, FUN=var) `

N0 N1 N2 N3 N4 N5

438.5448 368.8136 372.8698 369.6582 366.5705 405.5653

In this case we used tapply to calculate the variance of yield for each subgroup (i.e. level of nitrogen). There is some variation between groups but in my opinion it is not substantial. Now we can shift our focus on normality. There are tests to check for normality, but again the ANOVA is flexible (particularly where our dataset is big) and can still produce correct results even when its assumptions are violated up to a certain degree. For this reason, it is good practice to check normality with descriptive analysis alone, without any statistical test. For example, we could start by plotting the histogram of yield:

` hist(dat$yield, main="Histogram of Yield", xlab="Yield (quintals/ha)") `

` qqnorm(dat$yield, main="QQplot of Yield") `

qqline(dat$yield)

` > skewness(dat$yield) `

[1] 0.3875977

` > tapply(dat$yield, INDEX=dat$nf, FUN=length) `

N0 N1 N2 N3 N4 N5

573 577 571 575 572 575

### ANOVA with aov

` means.nf = tapply(dat$yield, INDEX=dat$nf, FUN=mean) `

StdErr.nf = tapply(dat$yield, INDEX=dat$nf, FUN= std.error)

BP = barplot(means.nf, ylim=c(0,max(means.nf)+10))

segments(BP, means.nf - (2*StdErr.nf), BP,

means.nf + (2*StdErr.nf), lwd = 1.5)

arrows(BP, means.nf - (2*StdErr.nf), BP,

means.nf + (2*StdErr.nf), lwd = 1.5, angle = 90,

code = 3, length = 0.05)

` mod1 = aov(yield ~ nf, data=dat) `

` > summary(mod1) `

Df Sum Sq Mean Sq F value Pr(>F)

nf 5 23987 4797 12.4 6.08e-12 ***

Residuals 3437 1330110 387

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

` > TukeyHSD(mod1, conf.level=0.95) `

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = yield ~ nf, data = dat)

$nf

diff lwr upr p adj

N1-N0 3.6434635 0.3353282 6.951599 0.0210713

N2-N0 4.6774357 1.3606516 7.994220 0.0008383

N3-N0 5.3629638 2.0519632 8.673964 0.0000588

N4-N0 7.5901274 4.2747959 10.905459 0.0000000

N5-N0 7.8588595 4.5478589 11.169860 0.0000000

N2-N1 1.0339723 -2.2770686 4.345013 0.9489077

N3-N1 1.7195004 -1.5857469 5.024748 0.6750283

N4-N1 3.9466640 0.6370782 7.256250 0.0089057

N5-N1 4.2153960 0.9101487 7.520643 0.0038074

N3-N2 0.6855281 -2.6283756 3.999432 0.9917341

N4-N2 2.9126917 -0.4055391 6.230923 0.1234409

N5-N2 3.1814238 -0.1324799 6.495327 0.0683500

N4-N3 2.2271636 -1.0852863 5.539614 0.3916824

N5-N3 2.4958957 -0.8122196 5.804011 0.2613027

N5-N4 0.2687320 -3.0437179 3.581182 0.9999099

` > model.tables(mod1, type="effects") `

Tables of effects

nf

N0 N1 N2 N3 N4 N5

-4.855 -1.212 -0.178 0.5075 2.735 3.003

rep 573.000 577.000 571.000 575.0000 572.000 575.000

` > model.tables(mod1, type="means") `

Tables of means

Grand mean

69.82831

nf

N0 N1 N2 N3 N4 N5

64.97 68.62 69.65 70.34 72.56 72.83

rep 573.00 577.00 571.00 575.00 572.00 575.00

### Linear Model with 1 factor

` mod2 = lm(yield ~ nf, data=dat) `

` > summary(mod2) `

Call:

lm(formula = yield ~ nf, data = dat)

Residuals:

Min 1Q Median 3Q Max

-52.313 -15.344 -3.126 13.563 45.337

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 64.9729 0.8218 79.060 < 2e-16 ***

nfN1 3.6435 1.1602 3.140 0.0017 **

nfN2 4.6774 1.1632 4.021 5.92e-05 ***

nfN3 5.3630 1.1612 4.618 4.01e-06 ***

nfN4 7.5901 1.1627 6.528 7.65e-11 ***

nfN5 7.8589 1.1612 6.768 1.53e-11 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.67 on 3437 degrees of freedom

Multiple R-squared: 0.01771, Adjusted R-squared: 0.01629

F-statistic: 12.4 on 5 and 3437 DF, p-value: 6.075e-12

` dat$nf = relevel(dat$nf, ref="N1") `

mod3 = lm(yield ~ nf, data=dat)

summary(mod3)

` > summary(mod3) `

Call:

lm(formula = yield ~ nf, data = dat)

Residuals:

Min 1Q Median 3Q Max

-52.313 -15.344 -3.126 13.563 45.337

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 68.616 0.819 83.784 < 2e-16 ***

nfN0 -3.643 1.160 -3.140 0.001702 **

nfN2 1.034 1.161 0.890 0.373308

nfN3 1.720 1.159 1.483 0.138073

nfN4 3.947 1.161 3.400 0.000681 ***

nfN5 4.215 1.159 3.636 0.000280 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.67 on 3437 degrees of freedom

Multiple R-squared: 0.01771, Adjusted R-squared: 0.01629

F-statistic: 12.4 on 5 and 3437 DF, p-value: 6.075e-12

` > anova(mod2) `

Analysis of Variance Table

Response: yield

Df Sum Sq Mean Sq F value Pr(>F)

nf 5 23987 4797.4 12.396 6.075e-12 ***

Residuals 3437 1330110 387.0

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

` > Anova(mod2, type=c("III")) `

Anova Table (Type III tests)

Response: yield

Sum Sq Df F value Pr(>F)

(Intercept) 2418907 1 6250.447 < 2.2e-16 ***

nf 23987 5 12.396 6.075e-12 ***

Residuals 1330110 3437

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

### Two-way ANOVA

` means.topo = tapply(dat$yield, INDEX=dat$topo, FUN=mean) `

StdErr.topo = tapply(dat$yield, INDEX=dat$topo, FUN= std.error)

BP = barplot(means.topo, ylim=c(0,max(means.topo)+10))

segments(BP, means.topo - (2*StdErr.topo), BP,

means.topo + (2*StdErr.topo), lwd = 1.5)

arrows(BP, means.topo - (2*StdErr.topo), BP,

means.topo + (2*StdErr.topo), lwd = 1.5, angle = 90,

code = 3, length = 0.05)

` mod1b = aov(yield ~ nf + topo, data=dat) `

summary(mod1b)

Df Sum Sq Mean Sq F value Pr(>F)

nf 5 23987 4797 23.21 <2e-16 ***

topo 3 620389 206796 1000.59 <2e-16 ***

Residuals 3434 709721 207

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

` TukeyHSD(mod1b, conf.level=0.95, which=c("topo")) `

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = yield ~ nf + topo, data = dat)

$topo

diff lwr upr p adj

HT-LO -36.240955 -38.052618 -34.429291 0

W-LO -18.168544 -19.857294 -16.479794 0

E-LO -6.206619 -8.054095 -4.359143 0

W-HT 18.072411 16.326440 19.818381 0

E-HT 30.034335 28.134414 31.934257 0

E-W 11.961925 10.178822 13.745028 0

### Two-Way ANOVA with Interactions

` dat$topo = factor(dat$topo,levels(dat$topo)[c(2,4,1,3)]) `

means.INT = tapply(dat$yield, INDEX=list(dat$nf, dat$topo), FUN=mean)

bar3d.ade(means.INT, col="grey")

` > means.INT `

LO HT W E

N0 81.03027 41.50652 62.08192 75.13902

N1 83.06276 48.33630 65.74627 78.12808

N2 85.06879 48.79830 66.70848 78.92632

N3 85.23255 50.18398 66.16531 78.99210

N4 87.14400 52.12039 70.10682 80.39213

N5 87.94122 51.03138 69.65933 80.55078

` mod1c = aov(yield ~ nf * topo, data=dat) `

` > summary(mod1c) `

Df Sum Sq Mean Sq F value Pr(>F)

nf 5 23987 4797 23.176 <2e-16 ***

topo 3 620389 206796 999.025 <2e-16 ***

nf:topo 15 1993 133 0.642 0.842

Residuals 3419 707727 207

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

` library(phia) `

plot(interactionMeans(mod1c))

This function plots the effects of the interactions in a 2 by 2 plot, including the standard error of the coefficients, so that we can readily see which overlap:

` > testInteractions(mod1c) `

F Test:

P-value adjustment method: holm

Value Df Sum of Sq F Pr(>F)

N0-N1 : HT-W -3.1654 1 377 1.8230 1

N0-N2 : HT-W -2.6652 1 267 1.2879 1

N0-N3 : HT-W -4.5941 1 784 3.7880 1

N0-N4 : HT-W -2.5890 1 250 1.2072 1

N0-N5 : HT-W -1.9475 1 140 0.6767 1

N1-N2 : HT-W 0.5002 1 9 0.0458 1

N1-N3 : HT-W -1.4286 1 76 0.3694 1

N1-N4 : HT-W 0.5765 1 12 0.0604 1

N1-N5 : HT-W 1.2180 1 55 0.2669 1

N2-N3 : HT-W -1.9289 1 139 0.6711 1

N2-N4 : HT-W 0.0762 1 0 0.0011 1

N2-N5 : HT-W 0.7178 1 19 0.0924 1

N3-N4 : HT-W 2.0051 1 149 0.7204 1

The table is very long so only the first lines are included. However, from this it is clear that the interaction has no effect (p-value of 1), but if it was this function can give us numerous details about the specific effects.

### ANCOVA with lm

` > mod3 = lm(yield ~ nf + bv, data=dat) `

> summary(mod3)

Call:

lm(formula = yield ~ nf + bv, data = dat)

Residuals:

Min 1Q Median 3Q Max

-78.345 -10.847 -3.314 10.739 56.835

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 271.55084 4.99308 54.385 < 2e-16 ***

nfN0 -3.52312 0.95075 -3.706 0.000214 ***

nfN2 1.54761 0.95167 1.626 0.103996

nfN3 2.08006 0.94996 2.190 0.028619 *

nfN4 3.82330 0.95117 4.020 5.96e-05 ***

nfN5 4.47993 0.94994 4.716 2.50e-06 ***

bv -1.16458 0.02839 -41.015 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.12 on 3436 degrees of freedom

Multiple R-squared: 0.3406, Adjusted R-squared: 0.3394

F-statistic: 295.8 on 6 and 3436 DF, p-value: < 2.2e-16

` > test.set = data.frame(nf="N1", bv=150) `

> predict(mod3, test.set)

1 2

96.86350 38.63438

>

> test.set = data.frame(nf="N0", bv=150)

> predict(mod3, test.set)

1

93.34037

### Two-factors and one continuous explanatory variable

` > levels(dat$topo) `

[1] "E" "HT" "LO" "W"

` > mod4 = lm(yield ~ nf + bv + topo, data=dat) `

>

> summary(mod4)

Call:

lm(formula = yield ~ nf + bv + topo, data = dat)

Residuals:

Min 1Q Median 3Q Max

-46.394 -10.927 -2.211 10.364 47.338

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 171.20921 5.28842 32.374 < 2e-16 ***

nfN0 -3.73225 0.81124 -4.601 4.36e-06 ***

nfN2 1.29704 0.81203 1.597 0.1103

nfN3 1.56499 0.81076 1.930 0.0537 .

nfN4 3.71277 0.81161 4.575 4.94e-06 ***

nfN5 3.88382 0.81091 4.789 1.74e-06 ***

bv -0.54206 0.03038 -17.845 < 2e-16 ***

topoHT -24.11882 0.78112 -30.877 < 2e-16 ***

topoLO 3.13643 0.70924 4.422 1.01e-05 ***

topoW -10.77758 0.66708 -16.156 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.75 on 3433 degrees of freedom

Multiple R-squared: 0.5204, Adjusted R-squared: 0.5191

F-statistic: 413.8 on 9 and 3433 DF, p-value: < 2.2e-16

` RES = residuals(mod4) `

qqnorm(RES)

qqline(RES)

### ANCOVA with interactions

` qplot(yield, bv, data=dat, geom="point", xlab="Yield", ylab="bv") + `

facet_wrap(~topo)+

geom_smooth(method = "lm", se = TRUE)

` mod5 = lm(yield ~ nf + bv * topo, data=dat) `

` > Anova(mod5, type=c("III")) `

Anova Table (Type III tests)

Response: yield

Sum Sq Df F value Pr(>F)

(Intercept) 20607 1 115.225 < 2.2e-16 ***

nf 23032 5 25.758 < 2.2e-16 ***

bv 5887 1 32.920 1.044e-08 ***

topo 40610 3 75.691 < 2.2e-16 ***

bv:topo 36059 3 67.209 < 2.2e-16 ***

Residuals 613419 3430

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

` > summary(mod5) `

Call:

lm(formula = yield ~ nf + bv * topo, data = dat)

Residuals:

Min 1Q Median 3Q Max

-46.056 -10.328 -1.516 9.622 50.184

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 93.45783 8.70646 10.734 < 2e-16 ***

nfN1 3.96637 0.78898 5.027 5.23e-07 ***

nfN2 5.24313 0.79103 6.628 3.93e-11 ***

nfN3 5.46036 0.79001 6.912 5.68e-12 ***

nfN4 7.52685 0.79048 9.522 < 2e-16 ***

nfN5 7.73646 0.79003 9.793 < 2e-16 ***

bv -0.27108 0.04725 -5.738 1.04e-08 ***

topoW 88.11105 12.07428 7.297 3.63e-13 ***

topoE 236.12311 17.12941 13.785 < 2e-16 ***

topoLO -15.76280 17.27191 -0.913 0.3615

bv:topoW -0.41393 0.06726 -6.154 8.41e-10 ***

bv:topoE -1.21024 0.09761 -12.399 < 2e-16 ***

bv:topoLO 0.28445 0.10104 2.815 0.0049 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.37 on 3430 degrees of freedom

Multiple R-squared: 0.547, Adjusted R-squared: 0.5454

F-statistic: 345.1 on 12 and 3430 DF, p-value: < 2.2e-16

### GLS – For violations of independence

` > summary(mod5, cor=T) `

[…]

Correlation of Coefficients:

(Intercept) nfN1 nfN2 nfN3 nfN4 nfN5 bv topoW topoE topoLO

nfN1 -0.05

nfN2 -0.04 0.50

nfN3 -0.05 0.50 0.50

nfN4 -0.05 0.50 0.50 0.50

nfN5 -0.05 0.50 0.50 0.50 0.50

bv -1.00 0.01 -0.01 0.01 0.00 0.00

topoW -0.72 0.00 0.00 0.00 0.00 0.00 0.72

topoE -0.51 0.01 0.02 0.03 0.01 0.02 0.51 0.37

topoLO -0.50 -0.02 -0.01 0.02 -0.01 0.02 0.50 0.36 0.26

bv:topoW 0.70 0.00 0.00 0.00 0.00 0.00 -0.70 -1.00 -0.36 -0.35

bv:topoE 0.48 -0.01 -0.02 -0.03 -0.01 -0.02 -0.48 -0.35 -1.00 -0.24

bv:topoLO 0.47 0.02 0.01 -0.02 0.01 -0.02 -0.47 -0.34 -0.24 -1.00

bv:topoW bv:topoE

nfN1

nfN2

nfN3

nfN4

nfN5

bv

topoW

topoE

topoLO

bv:topoW

bv:topoE 0.34

bv:topoLO 0.33 0.23

` mod6 = gls(yield ~ nf + bv * topo, data=dat, method="REML") `

Anova(mod6, type=c("III"))

summary(mod6)

` > anova(mod6, mod5) `

Model df AIC BIC logLik

mod6 1 14 27651.21 27737.18 -13811.61

mod5 2 14 27651.21 27737.18 -13811.61

## Linear Mixed-Effects Models

*u*and the standard error term. A more complex form, that is normally used for repeated measures is the random slope and intercept model:

*v*related to time

*T*.

### Random Intercept Model for Clustered Data

` lme1 = lme(yield ~ nf + bv * topo, random= ~1|rep, data=dat) `

` > summary(lme1) `

Linear mixed-effects model fit by REML

Data: dat

AIC BIC logLik

27648.36 27740.46 -13809.18

Random effects:

Formula: ~1 | rep

(Intercept) Residual

StdDev: 0.798407 13.3573

Fixed effects: yield ~ nf + bv * topo

Value Std.Error DF t-value p-value

(Intercept) 327.3304 14.782524 3428 22.143068 0

nfN1 3.9643 0.788049 3428 5.030561 0

nfN2 5.2340 0.790104 3428 6.624449 0

nfN3 5.4498 0.789084 3428 6.906496 0

nfN4 7.5286 0.789551 3428 9.535320 0

nfN5 7.7254 0.789111 3428 9.789976 0

bv -1.4685 0.085507 3428 -17.173569 0

topoHT -233.3675 17.143956 3428 -13.612232 0

topoLO -251.9750 20.967003 3428 -12.017693 0

topoW -146.4066 16.968453 3428 -8.628162 0

bv:topoHT 1.1945 0.097696 3428 12.226279 0

bv:topoLO 1.4961 0.123424 3428 12.121624 0

bv:topoW 0.7873 0.097865 3428 8.044485 0

` > Anova(lme2, type=c("III")) `

Analysis of Deviance Table (Type III tests)

Response: yield

Chisq Df Pr(>Chisq)

(Intercept) 752.25 1 < 2.2e-16 ***

nf 155.57 5 < 2.2e-16 ***

bv 291.49 1 < 2.2e-16 ***

topo 236.52 3 < 2.2e-16 ***

year 797.13 1 < 2.2e-16 ***

bv:topo 210.38 3 < 2.2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

` > anova(lme1, mod6) `

Model df AIC BIC logLik Test L.Ratio p-value

lme1 1 15 27648.36 27740.46 -13809.18

mod6 2 14 27651.21 27737.18 -13811.61 1 vs 2 4.857329 0.0275

### Random Intercept and Slope for repeated measures

` lme2 = lme(yield ~ nf + bv * topo + year, random= ~year|rep, data=dat) `

summary(lme2)

Anova(lme2, type=c("III"))

` > anova(lme1, lme2) `

Model df AIC BIC logLik Test L.Ratio p-value

lme1 1 15 27648.36 27740.46 -13809.18

lme2 2 18 26938.83 27049.35 -13451.42 1 vs 2 715.5247 <.0001

Warning message:

In anova.lme(lme1, lme2) :

fitted objects with different fixed effects. REML comparisons are not meaningful.

` > ranef(lme2) `

(Intercept) year

R1 -0.3468601 -1.189799e-07

R2 -0.5681688 -1.973702e-07

R3 0.9150289 3.163501e-07

` > fixef(lme2) `

(Intercept) nfN1 nfN2 nfN3 nfN4

-1.133614e+04 3.918006e+00 5.132136e+00 5.368513e+00 7.464542e+00

nfN5 bv topoHT topoLO topoW

7.639337e+00 -1.318391e+00 -2.049979e+02 -2.321431e+02 -1.136168e+02

year bv:topoHT bv:topoLO bv:topoW

5.818826e+00 1.027686e+00 1.383705e+00 5.998379e-01

` > intervals(lme2, which = "fixed") `

Approximate 95% confidence intervals

Fixed effects:

lower est. upper

(Intercept) -1.214651e+04 -1.133614e+04 -1.052576e+04

nfN1 2.526139e+00 3.918006e+00 5.309873e+00

nfN2 3.736625e+00 5.132136e+00 6.527648e+00

nfN3 3.974809e+00 5.368513e+00 6.762216e+00

nfN4 6.070018e+00 7.464542e+00 8.859065e+00

nfN5 6.245584e+00 7.639337e+00 9.033089e+00

bv -1.469793e+00 -1.318391e+00 -1.166989e+00

topoHT -2.353450e+02 -2.049979e+02 -1.746508e+02

topoLO -2.692026e+02 -2.321431e+02 -1.950836e+02

topoW -1.436741e+02 -1.136168e+02 -8.355954e+01

year 5.414742e+00 5.818826e+00 6.222911e+00

bv:topoHT 8.547273e-01 1.027686e+00 1.200644e+00

bv:topoLO 1.165563e+00 1.383705e+00 1.601846e+00

bv:topoW 4.264933e-01 5.998379e-01 7.731826e-01

attr(,"label")

[1] "Fixed effects:"

## Dealing with non-normal data – Generalized Linear Models

### Count Data

*Y*.

` > dat = beall.webworms `

> str(dat)

'data.frame': 1300 obs. of 7 variables:

$ row : int 1 2 3 4 5 6 7 8 9 10 ...

$ col : int 1 1 1 1 1 1 1 1 1 1 ...

$ y : int 1 0 1 3 6 0 2 2 1 3 ...

$ block: Factor w/ 13 levels "B1","B10","B11",..: 1 1 1 1 1 6 6 6 6 6 ...

$ trt : Factor w/ 4 levels "T1","T2","T3",..: 1 1 1 1 1 1 1 1 1 1 ...

$ spray: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...

$ lead : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...

` hist(dat$y, main="Histogram of Worm Count", xlab="Number of Worms") `

` pois.mod = glm(y ~ trt, data=dat, family=c("poisson")) `

` > summary(pois.mod) `

Call:

glm(formula = y ~ trt, family = c("poisson"), data = dat)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.6733 -1.0046 -0.9081 0.6141 4.2771

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 0.33647 0.04688 7.177 7.12e-13 ***

trtT2 -1.02043 0.09108 -11.204 < 2e-16 ***

trtT3 -0.49628 0.07621 -6.512 7.41e-11 ***

trtT4 -1.22246 0.09829 -12.438 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 1955.9 on 1299 degrees of freedom

Residual deviance: 1720.4 on 1296 degrees of freedom

AIC: 3125.5

Number of Fisher Scoring iterations: 6

` > predict(pois.mod, newdata=data.frame(trt=c("T1","T2"))) `

1 2

0.3364722 -0.6839588

` > 1-pchisq(deviance(pois.mod), df.residual(pois.mod)) `

[1] 1.709743e-14

` pois.mod2 = glm(y ~ block + spray*lead, data=dat, family=c("poisson")) `

` > AIC(pois.mod, pois.mod2) `

df AIC

pois.mod 4 3125.478

pois.mod2 16 3027.438

` > mean(dat$y) `

[1] 0.7923077

> var(dat$y)

[1] 1.290164

` pois.mod3 = glm(y ~ trt, data=dat, family=c("quasipoisson")) `

` > summary(pois.mod3) `

Call:

glm(formula = y ~ trt, family = c("quasipoisson"), data = dat)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.6733 -1.0046 -0.9081 0.6141 4.2771

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.33647 0.05457 6.166 9.32e-10 ***

trtT2 -1.02043 0.10601 -9.626 < 2e-16 ***

trtT3 -0.49628 0.08870 -5.595 2.69e-08 ***

trtT4 -1.22246 0.11440 -10.686 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 1.35472)

Null deviance: 1955.9 on 1299 degrees of freedom

Residual deviance: 1720.4 on 1296 degrees of freedom

AIC: NA

Number of Fisher Scoring iterations: 6

` library(MASS) `

NB.mod1 = glm.nb(y ~ trt, data=dat)

### Logistic Regression

` > dat = johnson.blight `

> str(dat)

'data.frame': 25 obs. of 6 variables:

$ year : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...

$ area : int 0 0 0 0 50 810 120 40 0 0 ...

$ blight : int 0 0 0 0 1 1 1 1 0 0 ...

$ rain.am : int 8 9 9 6 16 10 12 10 11 8 ...

$ rain.ja : int 1 4 6 1 6 7 12 4 10 9 ...

$ precip.m: num 5.84 6.86 47.29 8.89 7.37 ...

` mod9 = glm(blight ~ rain.am, data=dat, family=binomial) `

` > summary(mod9) `

Call:

glm(formula = blight ~ rain.am, family = binomial, data = dat)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.9395 -0.6605 -0.3517 1.0228 1.6048

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -4.9854 2.0720 -2.406 0.0161 *

rain.am 0.4467 0.1860 2.402 0.0163 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 34.617 on 24 degrees of freedom

Residual deviance: 24.782 on 23 degrees of freedom

AIC: 28.782

Number of Fisher Scoring iterations: 5

` > predict(mod9, type="response") `

1 2 3 4 5 6 7

0.19598032 0.27590141 0.27590141 0.09070472 0.89680283 0.37328295 0.59273722

8 9 10 11 12 13 14

0.37328295 0.48214935 0.19598032 0.69466455 0.19598032 0.84754431 0.27590141

15 16 17 18 19 20 21

0.93143346 0.05998586 0.19598032 0.05998586 0.84754431 0.59273722 0.59273722

22 23 24 25

0.48214935 0.59273722 0.98109229 0.89680283

` >predict(mod9,newdata=data.frame(rain.am=c(15)),type="response") `

1

0.8475443

` > 1-pchisq(24.782, 23) `

[1] 0.3616226

` > 1-pchisq(34.617, 24) `

[1] 0.07428544

` install.packages("pscl") `

library(pscl)

` > pR2(mod9) `

llh llhNull G2 McFadden r2ML r2CU

-12.3910108 -17.3086742 9.8353268 0.2841155 0.3252500 0.4338984

### Dealing with other distributions and transformation

` > dat = hughes.grapes `

> str(dat)

'data.frame': 270 obs. of 6 variables:

$ block : Factor w/ 3 levels "B1","B2","B3": 1 1 1 1 1 1 1 1 1 1 ...

$ trt : Factor w/ 6 levels "T1","T2","T3",..: 1 2 3 4 5 6 1 2 3 4 ...

$ vine : Factor w/ 3 levels "V1","V2","V3": 1 1 1 1 1 1 1 1 1 1 ...

$ shoot : Factor w/ 5 levels "S1","S2","S3",..: 1 1 1 1 1 1 2 2 2 2 ...

$ diseased: int 1 2 0 0 3 0 7 0 1 0 ...

$ total : int 14 12 12 13 8 9 8 10 14 10 ...

` descdist(dat$total, discrete = FALSE) `

` plot(fitdist(dat$total,distr="gamma")) `

` mod8 = glm(total ~ trt * vine, data=dat, family=Gamma(link=identity)) `

### Generalized Linear Mixed Effects models

` install.packages("lme4") `

library(lme4)

` dat = johnson.blight `

` mod10 = glm(blight ~ precip.m, data=dat, family="binomial") `

mod11 = glmer(blight ~ precip.m + (1|area), data=dat, family="binomial")

> AIC(mod10, mod11)

df AIC

mod10 2 37.698821

mod11 3 9.287692

## References and Further Reading

Finch, W.H., Bolin, J.E. and Kelley, K., 2014. *Multilevel modeling using R*. Crc Press.

Yan, X. and Su, X., 2009. *Linear regression analysis: theory and computing*. World Scientific.

James, G., Witten, D., Hastie, T. and Tibshirani, R., 2013. *An introduction to statistical learning* (Vol. 6). New York: springer. http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf

Long, J. Scott. 1997. *Regression Models for Categorical and Limited Dependent Variables*. Sage. pp104-106. [For pseudo R-Squared equations, page available on google books]

Webster, R. and Oliver, M.A., 2007. *Geostatistics for environmental scientists*. John Wiley & Sons.

West, B.T., Galecki, A.T. and Welch, K.B., 2014. *Linear mixed models*. CRC Press.

Gałecki, A. and Burzykowski, T., 2013. *Linear mixed-effects models using R: A step-by-step approach*. Springer Science & Business Media.

Williams, R., 2004. *One-Way Analysis of Variance*. URL: https://www3.nd.edu/~rwilliam/stats1/x52.pdf

Witte, R. and Witte, J. 2009. *Statistics. 9th ed. *Wiley.

## Appendix 1: Assessing the accuracy of our model

### R-Squared

### Adjusted R-Squared

### Root Mean Squared Deviation or Root Mean Squared Error

### Mean Squared Deviation or Mean Squared Error

### Mean Absolute Deviation or Mean Absolute Error

### Akaike Information Criterion

**leave a comment**for the author, please follow the link and comment on their blog:

**R tutorial for Spatial Statistics**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.