As part of my new role as Lecturer in Agri-data analysis at Harper Adams University, I found myself applying a lot of techniques based on linear modelling. Another thing I noticed is that there is a lot of confusion among researchers in regards to what technique should be used in each instance and how to interpret the model. For this reason I started reading material from books and on-line to try and create a sort of reference tutorial that researchers can use. This post is the result of my work so far.
Please feel free to comment, provide feedback and constructive criticism!!
Theoretical Background – Linear Model and ANOVA
The classic linear model forms the basis for ANOVA (with categorical treatments) and ANCOVA (which deals with continuous explanatory variables). Its basic equation is the following:
This equation can be expanded to accommodate more that one explanatory variable x:
In case our model includes interactions, the linear equation would be changed as follows:
In fact, if we rewrite the equation focusing for example on x_1:
This linear model can be applied to continuous target variables, in this case we would talk about an ANCOVA for exploratory analysis, or a linear regression if the objective was to create a predictive model.
The Analysis of variance is based on the linear model presented above, the only difference is that its reference point is the mean of the dataset. When we described the equations above we said that to interpret the results of the linear model we would look at the slope term; this indicates the rate of changes in Y if we change one variable and keep the rest constant. The ANOVA calculates the effects of each treatment based on the grand mean, which is the mean of the variable of interest.
In mathematical terms ANOVA solves the following equation (Williams, 2004):
where y is the effect on group j of treatment τ_1, while μ is the grand mean (i.e. the mean of the whole dataset). From this equation is clear that the effects calculated by the ANOVA are not referred to unit changes in the explanatory variables, but are all related to changes on the grand mean.
Examples of ANOVA and ANCOVA in R
For this example we are going to use one of the datasets available in the package agridatavailable in CRAN:
We also need to include other packages for the examples below. If some of these are not installed in your system please use again the function install.packages (replacing the name within quotation marks according to your needs) to install them.
library(agridat) library(ggplot2) library(plotrix) library(moments) library(car) library(fitdistrplus) library(nlme) library(multcomp) library(epade) library(lme4)
Now we can load the dataset lasrosas.corn, which has more that 3400 observations of corn yield in a field in Argentina, plus several explanatory variables both factorial (or categorical) and continuous.
> dat = lasrosas.corn > str(dat) 'data.frame': 3443 obs. of 9 variables: $ year : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ... $ lat : num -33.1 -33.1 -33.1 -33.1 -33.1 ... $ long : num -63.8 -63.8 -63.8 -63.8 -63.8 ... $ yield: num 72.1 73.8 77.2 76.3 75.5 ... $ nitro: num 132 132 132 132 132 ... $ topo : Factor w/ 4 levels "E","HT","LO",..: 4 4 4 4 4 4 4 4 4 4 ... $ bv : num 163 170 168 177 171 ... $ rep : Factor w/ 3 levels "R1","R2","R3": 1 1 1 1 1 1 1 1 1 1 ... $ nf : Factor w/ 6 levels "N0","N1","N2",..: 6 6 6 6 6 6 6 6 6 6 ...
Important for the purpose of this tutorial is the target variable yield, which is what we are trying to model, and the explanatory variables: topo (topographic factor), bv (brightness value, which is a proxy for low organic matter content) and nf (factorial nitrogen levels). In addition we have rep, which is the blocking factor.
Since we are planning to use an ANOVA we first need to check that our data fits with its assumptions. ANOVA is based on three assumptions:
- Data independence
- Equality of variances between groups
- Balance design (i.e. all groups have the same number of samples)
Let’s see how we can test for them in R. Clearly we are talking about environmental data so the assumption of independence is not met, because data are autocorrelated with distance. Theoretically speaking, for spatial data ANOVA cannot be employed and more robust methods should be employed (e.g. REML); however, over the years it has been widely used for analysis of environmental data and it is accepted by the community. That does not mean that it is the correct method though, and later on in this tutorial we will see the function to perform linear modelling with REML.
The third assumption is the one that is most easy to assess using the function tapply:
> tapply(dat$yield, INDEX=dat$nf, FUN=var) N0 N1 N2 N3 N4 N5 438.5448 368.8136 372.8698 369.6582 366.5705 405.5653
In this case we used tapply to calculate the variance of yield for each subgroup (i.e. level of nitrogen). There is some variation between groups but in my opinion it is not substantial. Now we can shift our focus on normality. There are tests to check for normality, but again the ANOVA is flexible (particularly where our dataset is big) and can still produce correct results even when its assumptions are violated up to a certain degree. For this reason, it is good practice to check normality with descriptive analysis alone, without any statistical test. For example, we could start by plotting the histogram of yield:
hist(dat$yield, main="Histogram of Yield", xlab="Yield (quintals/ha)")
By looking at this image it seems that our data are more or less normally distributed. Another plot we could create is the QQplot (http://www.itl.nist.gov/div898/handbook/eda/section3/qqplot.htm):
qqnorm(dat$yield, main="QQplot of Yield") qqline(dat$yield)
For normally distributed data the points should all be on the line. This is clearly not the case but again the deviation is not substantial. The final element we can calculate is the skewness of the distribution, with the function skewnessin the package moments:
> skewness(dat$yield)  0.3875977
According to Webster and Oliver (2007) is the skewness is below 0.5, we can consider the deviation from normality not big enough to transform the data. Moreover, according to Witte and Witte (2009) if we have more than 10 samples per group we should not worry too much about violating the assumption of normality or equality of variances.
To see how many samples we have for each level of nitrogen we can use once again the function tapply:
> tapply(dat$yield, INDEX=dat$nf, FUN=length) N0 N1 N2 N3 N4 N5 573 577 571 575 572 575
As you can see we have definitely more than 10 samples per group, but our design is not balanced (i.e. some groups have more samples). This implies that the normal ANOVA cannot be used, this is because the standard way of calculating the sum of squares is not appropriate for unbalanced designs (look here for more info: http://goanna.cs.rmit.edu.au/~fscholer/anova.php).
In summary, even though from the descriptive analysis it appears that our data are close to being normal and have equal variance, our design is unbalanced, therefore the normal way of doing ANOVA cannot be used. In other words we cannot function aovfor this dataset. However, since this is a tutorial we are still going to start by applying the normal ANOVA with aov.
ANOVA with aov
The first thing we need to do is think about the hypothesis we would like to test. For example, we could be interested in looking at nitrogen levels and their impact on yield. Let’s start with some plotting to better understand our data:
means.nf = tapply(dat$yield, INDEX=dat$nf, FUN=mean) StdErr.nf = tapply(dat$yield, INDEX=dat$nf, FUN= std.error) BP = barplot(means.nf, ylim=c(0,max(means.nf)+10)) segments(BP, means.nf - (2*StdErr.nf), BP, means.nf + (2*StdErr.nf), lwd = 1.5) arrows(BP, means.nf - (2*StdErr.nf), BP, means.nf + (2*StdErr.nf), lwd = 1.5, angle = 90, code = 3, length = 0.05)
This code first uses the function tapply to compute mean and standard error of the mean for yield in each nitrogen group. Then it plots the means as bars and creates error bars using the standard error (please remember that with a normal distribution ± twice the standard error provides a 95% confidence interval around the mean value). The result is the following image:
By plotting our data we can start figuring out what is the interaction between nitrogen levels and yield. In particular, there is an increase in yield with higher level of nitrogen. However, some of the error bars are overlapping, and this may suggest that their values are not significantly different. For example, by looking at this plot N0 and N1 have error bars very close to overlap, but probably not overlapping, so it may be that N1 provides a significant different from N0. The rest are all probably significantly different from N0. For the rest their interval overlap most of the times, so their differences would probably not be significant.
We could formulate the hypothesis that nitrogen significantly affects yield and that the mean of each subgroup are significantly different. Now we just need to test this hypothesis with a one-way ANOVA:
mod1 = aov(yield ~ nf, data=dat)
The code above uses the function aov to perform an ANOVA; we can specify to perform a one-way ANOVA simply by including only one factorial term after the tilde (~) sign. We can plot the ANOVA table with the function summary:
> summary(mod1) Df Sum Sq Mean Sq F value Pr(>F) nf 5 23987 4797 12.4 6.08e-12 *** Residuals 3437 1330110 387 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
It is clear from this output that nitrogen significantly affects yield, so we tested our first hypothesis. To test the significance for individual levels of nitrogen we can use the Tukey’s test:
> TukeyHSD(mod1, conf.level=0.95) Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = yield ~ nf, data = dat) $nf diff lwr upr p adj N1-N0 3.6434635 0.3353282 6.951599 0.0210713 N2-N0 4.6774357 1.3606516 7.994220 0.0008383 N3-N0 5.3629638 2.0519632 8.673964 0.0000588 N4-N0 7.5901274 4.2747959 10.905459 0.0000000 N5-N0 7.8588595 4.5478589 11.169860 0.0000000 N2-N1 1.0339723 -2.2770686 4.345013 0.9489077 N3-N1 1.7195004 -1.5857469 5.024748 0.6750283 N4-N1 3.9466640 0.6370782 7.256250 0.0089057 N5-N1 4.2153960 0.9101487 7.520643 0.0038074 N3-N2 0.6855281 -2.6283756 3.999432 0.9917341 N4-N2 2.9126917 -0.4055391 6.230923 0.1234409 N5-N2 3.1814238 -0.1324799 6.495327 0.0683500 N4-N3 2.2271636 -1.0852863 5.539614 0.3916824 N5-N3 2.4958957 -0.8122196 5.804011 0.2613027 N5-N4 0.2687320 -3.0437179 3.581182 0.9999099
There are significant differences between the control and the rest of the levels of nitrogen, plus other differences between N4 and N5 compared to N1, but nothing else. If you look back at the bar chart we produced before, and look carefully at the overlaps between error bars, you will see that for example N1, N2, and N3 have overlapping error bars, thus they are not significantly different. On the contrary, N1 has no overlaps with either N4 and N5 , which is what we demonstrated in the ANOVA.
The function model.tables provides a quick way to print the table of effects and the table of means:
> model.tables(mod1, type="effects") Tables of effects nf N0 N1 N2 N3 N4 N5 -4.855 -1.212 -0.178 0.5075 2.735 3.003 rep 573.000 577.000 571.000 575.0000 572.000 575.000
These values are all referred to the gran mean, which we can simply calculate with the function mean(dat$yield) and it is equal to 69.83. This means that the mean for N0 would be 69.83-4.855 = 64.97. we can verify that with another call to the function model.tables, this time with the option type=”means”:
> model.tables(mod1, type="means") Tables of means Grand mean 69.82831 nf N0 N1 N2 N3 N4 N5 64.97 68.62 69.65 70.34 72.56 72.83 rep 573.00 577.00 571.00 575.00 572.00 575.00
Linear Model with 1 factor
The same results can be obtain by fitting a linear model with the function lm, only their interpretation would be different. The assumption for fitting a linear models are again independence (which is always violated with environmental data), and normality.
Let’s look at the code:
mod2 = lm(yield ~ nf, data=dat)
This line fits the same model but with the standard linear equation. This become clearer by looking at the summary table:
> summary(mod2) Call: lm(formula = yield ~ nf, data = dat) Residuals: Min 1Q Median 3Q Max -52.313 -15.344 -3.126 13.563 45.337 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 64.9729 0.8218 79.060 < 2e-16 *** nfN1 3.6435 1.1602 3.140 0.0017 ** nfN2 4.6774 1.1632 4.021 5.92e-05 *** nfN3 5.3630 1.1612 4.618 4.01e-06 *** nfN4 7.5901 1.1627 6.528 7.65e-11 *** nfN5 7.8589 1.1612 6.768 1.53e-11 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 19.67 on 3437 degrees of freedom Multiple R-squared: 0.01771, Adjusted R-squared: 0.01629 F-statistic: 12.4 on 5 and 3437 DF, p-value: 6.075e-12
There are several information in this table that we should clarify. First of all it already provides with some descriptive measures for the residuals, from which we can see that their distribution is relatively normal (first and last quartiles have similar but opposite values and the same is true for minimum and maximum). Then we have the table of the coefficients, with the intercept and all the slopes. As you can see the level N0 is not shown in the list; this is called the reference level, which means that all the other are referenced back to it. In other words, the value of the intercept is the mean of nitrogen level 0 (in fact is the same we calculated above 64.97). To calculate the means for the other groups we need to sum the value of the reference level with the slopes. For example N1 is 64.97 + 3.64 = 68.61 (the same calculated from the ANOVA). The p-value and the significance are again in relation to the reference level, meaning for example that N1 is significantly different from N0 (reference level) and the p-value is 0.0017. This is similar to the Tukey’s test we performed above, but it is only valid in relation to N0. We need to change the reference level, and fit another model, to get the same information for other nitrogen levels:
dat$nf = relevel(dat$nf, ref="N1") mod3 = lm(yield ~ nf, data=dat) summary(mod3)
Now the reference level is N1, so all the results will tell us the effects of nitrogen in relation to N1.
> summary(mod3) Call: lm(formula = yield ~ nf, data = dat) Residuals: Min 1Q Median 3Q Max -52.313 -15.344 -3.126 13.563 45.337 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 68.616 0.819 83.784 < 2e-16 *** nfN0 -3.643 1.160 -3.140 0.001702 ** nfN2 1.034 1.161 0.890 0.373308 nfN3 1.720 1.159 1.483 0.138073 nfN4 3.947 1.161 3.400 0.000681 *** nfN5 4.215 1.159 3.636 0.000280 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 19.67 on 3437 degrees of freedom Multiple R-squared: 0.01771, Adjusted R-squared: 0.01629 F-statistic: 12.4 on 5 and 3437 DF, p-value: 6.075e-12
For example, we can see that N0 has a lower value compared to N1, and that only N0, N4 and N5 are significantly different from N1, which is what we saw from the bar chart and what we found from the Tukey’s test.
Interpreting the output of the function aov is much easier compare to lm. However, in many cases we can only use the function lm (for example in an ANCOVA where alongside categorical we have continuous explanatory variables) so it is important that we learn how to interpret its summary table.
We can obtain the ANOVA table with the function anova:
> anova(mod2) Analysis of Variance Table Response: yield Df Sum Sq Mean Sq F value Pr(>F) nf 5 23987 4797.4 12.396 6.075e-12 *** Residuals 3437 1330110 387.0 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This uses the type I sum of squares (more info at: http://www.utstat.utoronto.ca/reid/sta442f/2009/typeSS.pdf), which is the standard way and it is not indicated for unbalanced designs. The function Anova in the package car has the option to select which type of sum of squares to calculate and we can specify type=c(“III”)to correct for the unbalanced design:
> Anova(mod2, type=c("III")) Anova Table (Type III tests) Response: yield Sum Sq Df F value Pr(>F) (Intercept) 2418907 1 6250.447 < 2.2e-16 *** nf 23987 5 12.396 6.075e-12 *** Residuals 1330110 3437 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In this example the two results are the same, probably the large sample size helps in this respect. However, for smaller samples this distinction may become important. For this reason, if your design is unbalanced please remember not to use the function aov, but always lmand Anova with the option for type III sum of squares.
So far we have looked on the effect of nitrogen on yield. However, in the dataset we also have a factorial variable named topo, which stands for topographic factor and has 4 levels: W = West slope, HT = Hilltop, E = East slope, LO = Low East. We already formulated an hypothesis about nitrogen, so now we need to formulate an hypothesis about topo as well. Once again we can do that by using the function tapply and a simple bar charts with error bars. Look at the code below:
means.topo = tapply(dat$yield, INDEX=dat$topo, FUN=mean) StdErr.topo = tapply(dat$yield, INDEX=dat$topo, FUN= std.error) BP = barplot(means.topo, ylim=c(0,max(means.topo)+10)) segments(BP, means.topo - (2*StdErr.topo), BP, means.topo + (2*StdErr.topo), lwd = 1.5) arrows(BP, means.topo - (2*StdErr.topo), BP, means.topo + (2*StdErr.topo), lwd = 1.5, angle = 90, code = 3, length = 0.05)
Here we are using the same exact approach we used before to formulate an hypothesis about nitrogen. We first calculate mean and standard error of yield for each level of topo, and then plot a bar chart with error bars.
The result is the plot below:
From this plot it is clear that the topographic factor has an effect on yield. In particular, hilltop areas have low yield while the low east corner of the field has high yield. From the error bars we can say with a good level of confidence that probably all the differences will be significant, at least up to an alpha of 95% (significant level, meaning a p-value of 0.05).
We can test this hypothesis with a two way ANOVA, by simply adding the term topo to the equation:
mod1b = aov(yield ~ nf + topo, data=dat) summary(mod1b) Df Sum Sq Mean Sq F value Pr(>F) nf 5 23987 4797 23.21 <2e-16 *** topo 3 620389 206796 1000.59 <2e-16 *** Residuals 3434 709721 207 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the summary table it is clear that both factors have a significant effect on yield, but just by looking at this it is very difficult to identify clearly which levels are the significant ones. Top do that we need the Tukey’s test:
TukeyHSD(mod1b, conf.level=0.95, which=c("topo")) Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = yield ~ nf + topo, data = dat) $topo diff lwr upr p adj HT-LO -36.240955 -38.052618 -34.429291 0 W-LO -18.168544 -19.857294 -16.479794 0 E-LO -6.206619 -8.054095 -4.359143 0 W-HT 18.072411 16.326440 19.818381 0 E-HT 30.034335 28.134414 31.934257 0 E-W 11.961925 10.178822 13.745028 0
The zero p-values indicate a large significance for each combination, as it was clear from the plot. With the function model.table you can easily obtain a table of means or effects, if you are interested.
Two-Way ANOVA with Interactions
One step further we can take to get more insights into our data is add an interaction between nitrogen and topo, and see if this can further narrow down the main sources of yield variation. Once again we need to start our analysis by formulating an hypothesis. Since we are talking about an interaction we are now concern in finding a way to plot yield responses for varying nitrogen level and topographic position, so we need a 3d bar chart. We can do that with the function bar3d.ade from the package epade, so please install this package and load it.
Then please look at the following R code:
dat$topo = factor(dat$topo,levels(dat$topo)[c(2,4,1,3)]) means.INT = tapply(dat$yield, INDEX=list(dat$nf, dat$topo), FUN=mean) bar3d.ade(means.INT, col="grey")
The first line is only used to reorder the levels in the factorial variable topo. This is because from the previous plot we clearly saw that HT is the one with the lowest yield, followed by W, E and LO. We are doing this only to make the 3d bar chart more readable.
The next line applies once again the function tapply, this time to calculate the mean of yield for subgroups divided by nitrogen and topographic factors. The result is a matrix that looks like this:
> means.INT LO HT W E N0 81.03027 41.50652 62.08192 75.13902 N1 83.06276 48.33630 65.74627 78.12808 N2 85.06879 48.79830 66.70848 78.92632 N3 85.23255 50.18398 66.16531 78.99210 N4 87.14400 52.12039 70.10682 80.39213 N5 87.94122 51.03138 69.65933 80.55078
This can be used directly within the function bar3d.ade to create the 3d bar chart below:
From this plot we can see two things very clearly: the first is that there is an increase in yield from HT to LO in the topographic factor, the second is that we have again and increase from N0 to N1 in the nitrogen levels. These were all expected since we already noticed them before. What we do not see in these plot is any particular influence from the interaction between topography and nitrogen. For example, if you look at HT, you have an increase in yield from N0 to N5 (expected) and overall the yield is lower than the other bars (again expected). If there was an interaction we would expect this general pattern to change, for example with relatively high yield on the hilltop at high nitrogen level, or very low yield in the low east side with N0. This does not happen and all the bars follow an expected pattern, so we can hypothesise that the interaction will not be significant.
To test that we need to run another ANOVA with an interaction term:
mod1c = aov(yield ~ nf * topo, data=dat)
This formula test for both main effects and their interaction. To see the significance we can use the summary table:
> summary(mod1c) Df Sum Sq Mean Sq F value Pr(>F) nf 5 23987 4797 23.176 <2e-16 *** topo 3 620389 206796 999.025 <2e-16 *** nf:topo 15 1993 133 0.642 0.842 Residuals 3419 707727 207 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From this we can conclude that our hypothesis was correct and that the interaction has no effect on yield.
We can have a better ides of the interaction effect by using some functions in the package phia:
This function plots the effects of the interactions in a 2 by 2 plot, including the standard error of the coefficients, so that we can readily see which overlap:
We already knew from the 3d plot that there is a general increase between N0 and N5 that mainly drives the changes we see in the data. However, from the top-right plot we can see that topo plays a little role between N0 and the other (in fact the black line only slightly overlap with the other), but it has no effect on N1 to N5.
We can look at the numerical break-out of what we see in the plot with another function:
> testInteractions(mod1c) F Test: P-value adjustment method: holm Value Df Sum of Sq F Pr(>F) N0-N1 : HT-W -3.1654 1 377 1.8230 1 N0-N2 : HT-W -2.6652 1 267 1.2879 1 N0-N3 : HT-W -4.5941 1 784 3.7880 1 N0-N4 : HT-W -2.5890 1 250 1.2072 1 N0-N5 : HT-W -1.9475 1 140 0.6767 1 N1-N2 : HT-W 0.5002 1 9 0.0458 1 N1-N3 : HT-W -1.4286 1 76 0.3694 1 N1-N4 : HT-W 0.5765 1 12 0.0604 1 N1-N5 : HT-W 1.2180 1 55 0.2669 1 N2-N3 : HT-W -1.9289 1 139 0.6711 1 N2-N4 : HT-W 0.0762 1 0 0.0011 1 N2-N5 : HT-W 0.7178 1 19 0.0924 1 N3-N4 : HT-W 2.0051 1 149 0.7204 1
The table is very long so only the first lines are included. However, from this it is clear that the interaction has no effect (p-value of 1), but if it was this function can give us numerous details about the specific effects.
ANCOVA with lm
The Analysis of covariance (ANCOVA) fits a new model where the effects of the treatments (or factorial variables) is corrected for the effect of continuous covariates, for which we can also see the effects on yield.
The code is very similar to what we saw before, and again we can perform an ANCOVA with the lm function; the only difference is that here we are including an additional continuous explanatory variable in the model:
> mod3 = lm(yield ~ nf + bv, data=dat) > summary(mod3) Call: lm(formula = yield ~ nf + bv, data = dat) Residuals: Min 1Q Median 3Q Max -78.345 -10.847 -3.314 10.739 56.835 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 271.55084 4.99308 54.385 < 2e-16 *** nfN0 -3.52312 0.95075 -3.706 0.000214 *** nfN2 1.54761 0.95167 1.626 0.103996 nfN3 2.08006 0.94996 2.190 0.028619 * nfN4 3.82330 0.95117 4.020 5.96e-05 *** nfN5 4.47993 0.94994 4.716 2.50e-06 *** bv -1.16458 0.02839 -41.015 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 16.12 on 3436 degrees of freedom Multiple R-squared: 0.3406, Adjusted R-squared: 0.3394 F-statistic: 295.8 on 6 and 3436 DF, p-value: < 2.2e-16
By printing the summary table we can already see some differences compared to the model we only nitrogen as explanatory variable. The first is related to the Adjusted R-squared (which is simply the R-squared corrected for the number of predictors so that it is less affected by overfitting), which in this case is around 0.3. If we look back at the summary table of the model with only nitrogen, the R-squared was only 0.01. This means that by adding the continuous variable bv we are able to massively increase the explanatory power of the model; in fact, this new model is capable of explaining 33% of the variation in yield. Moreover, we can also see that other terms become significant, for example N3. This is because the inclusion of bv changes the entire model and its interpretation becomes less obvious compared to the simple bar chart we plotted at the beginning.
The interpretation of the ANCOVA model is more complex that the one for the one-way ANOVA. In fact, the intercept value has changed and it is not the mean of the reference level N1. This is because the model now changes based on the covariate bv. The slope can be used to assess the relative impact of each term; for example, N0 has a negative impact on yield in relation to its reference level. Therefore, shifting from a nitrogen level N1 to N0 decreases the yield by -3.52, if bv is kept constant.
Take a look at the following code:
> test.set = data.frame(nf="N1", bv=150) > predict(mod3, test.set) 1 2 96.86350 38.63438 > > test.set = data.frame(nf="N0", bv=150) > predict(mod3, test.set) 1 93.34037
Here we are using the model (mod3) to estimate new values of yield based on set parameters. In the first example we set nf to N1 (reference level) and bv constant at 150. With the function predictwe can see estimate these new values using mod3. For N1 and bv equal to 150 the yield is 96.86.
In the second example we did the same but for nitrogen level N0. The result is a yield equal to 93.34, that is a difference of exactly 3.52, which is the slope of the model.
For computing the ANOVA table, we can again use either the function anova (if the design is balanced) or Anova with type III (for unbalanced designs).
Two-factors and one continuous explanatory variable
Let’s look now at another example with a slightly more complex model where we include two factorial and one continuous variable. We also include in the model the variable topo. We can check these with the function levels:
> levels(dat$topo)  "E" "HT" "LO" "W"
Please notice that E is the first and therefore is the reference level for this factor. Now let’s fit the model and look at the summary table:
> mod4 = lm(yield ~ nf + bv + topo, data=dat) > > summary(mod4) Call: lm(formula = yield ~ nf + bv + topo, data = dat) Residuals: Min 1Q Median 3Q Max -46.394 -10.927 -2.211 10.364 47.338 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 171.20921 5.28842 32.374 < 2e-16 *** nfN0 -3.73225 0.81124 -4.601 4.36e-06 *** nfN2 1.29704 0.81203 1.597 0.1103 nfN3 1.56499 0.81076 1.930 0.0537 . nfN4 3.71277 0.81161 4.575 4.94e-06 *** nfN5 3.88382 0.81091 4.789 1.74e-06 *** bv -0.54206 0.03038 -17.845 < 2e-16 *** topoHT -24.11882 0.78112 -30.877 < 2e-16 *** topoLO 3.13643 0.70924 4.422 1.01e-05 *** topoW -10.77758 0.66708 -16.156 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 13.75 on 3433 degrees of freedom Multiple R-squared: 0.5204, Adjusted R-squared: 0.5191 F-statistic: 413.8 on 9 and 3433 DF, p-value: < 2.2e-16
The adjusted R-squared increases again and now we are able to explain around 52% of the variation in yield.
The interpretation is similar to what we said before, the only difference is that here both factors have a reference level. So for example, the effect of topoHT is related to the reference level, which is the one not shown E. So if we change the topographic position from E to HT, while keeping everything else in the model constant (meaning same value of bv and same nitrogen level), we obtain a decrease in yield of 24.12.
Another thing we can see from this table is that the p-value change, and for example N3 becomes less significant. This is probably because when we consider more variables the effect of N3 on yield is explained by other variables, maybe partly bv and partly topo.
One last thing we can check, and this is something we should check every time we perform an ANOVA or a linear model is the normality of the residuals. We already saw that the summary table provides us with some data about the residuals distribution (minimum, first quartile, median, third quartile and maximum) that gives us a good indication of normality, since the distribution is centred around 0. However, we can also use other tools to check this. For example a QQ plot:
RES = residuals(mod4) qqnorm(RES) qqline(RES)
The function residuals automatically extract the residuals from the model, which can then be used to create the following plot:
It looks approximately normal, but to have a further confirmation we can use again the function skewness, which returns a value below 0.5, so we can consider this a normal distribution.
ANCOVA with interactions
Let’s now add a further layer of complexity by adding an interaction term between bv and topo. Once again we need to formulate an hypothesis before proceeding to test it. Since we are interested in an interaction between a continuous variable (bv) and a factorial variable (topo) on yield, we could try to create scatterplots of yield versus bv, for the different levels in topo. We can easily do that with the package ggplot2:
qplot(yield, bv, data=dat, geom="point", xlab="Yield", ylab="bv") + facet_wrap(~topo)+ geom_smooth(method = "lm", se = TRUE)
Explaining every bit of the three lines of code above would require some time and it is beyond the scope of this tutorial. In essence, these lines create a scatterplot yield versus bv for each subgroup of topo and then fit a linear regression line through the points. For more info about the use of ggplot2 please start by looking here: http://www.statmethods.net/advgraphs/ggplot2.html
This create the plot below:
From this plot it is clear that the four lines have different slopes, so the interaction between bv and topo may well be significant and help us further increase the explanatory power of our model. We can test that by adding this interaction:
mod5 = lm(yield ~ nf + bv * topo, data=dat)
We can use the function Anova to check the significance:
> Anova(mod5, type=c("III")) Anova Table (Type III tests) Response: yield Sum Sq Df F value Pr(>F) (Intercept) 20607 1 115.225 < 2.2e-16 *** nf 23032 5 25.758 < 2.2e-16 *** bv 5887 1 32.920 1.044e-08 *** topo 40610 3 75.691 < 2.2e-16 *** bv:topo 36059 3 67.209 < 2.2e-16 *** Residuals 613419 3430 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As you can see this interaction is significant. To check the details we can look at the summary table:
> summary(mod5) Call: lm(formula = yield ~ nf + bv * topo, data = dat) Residuals: Min 1Q Median 3Q Max -46.056 -10.328 -1.516 9.622 50.184 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 93.45783 8.70646 10.734 < 2e-16 *** nfN1 3.96637 0.78898 5.027 5.23e-07 *** nfN2 5.24313 0.79103 6.628 3.93e-11 *** nfN3 5.46036 0.79001 6.912 5.68e-12 *** nfN4 7.52685 0.79048 9.522 < 2e-16 *** nfN5 7.73646 0.79003 9.793 < 2e-16 *** bv -0.27108 0.04725 -5.738 1.04e-08 *** topoW 88.11105 12.07428 7.297 3.63e-13 *** topoE 236.12311 17.12941 13.785 < 2e-16 *** topoLO -15.76280 17.27191 -0.913 0.3615 bv:topoW -0.41393 0.06726 -6.154 8.41e-10 *** bv:topoE -1.21024 0.09761 -12.399 < 2e-16 *** bv:topoLO 0.28445 0.10104 2.815 0.0049 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 13.37 on 3430 degrees of freedom Multiple R-squared: 0.547, Adjusted R-squared: 0.5454 F-statistic: 345.1 on 12 and 3430 DF, p-value: < 2.2e-16
The R-squared is a bit higher, which means that we can explain more of the variability in yield by adding the interaction. For the interpretation, once again everything is related to the reference levels in the factors, even the interaction. So for example, bv:topoW tells us that the interaction between bv and topo changes the yield negatively if we change from HT to W, keeping everything else constant.
For information about individual changes we would need to use the model to estimate new data as we did for mod3.
GLS – For violations of independence
As we mentioned, there are certain assumptions we need to check before starting an analysis with linear models. Assumptions about normality and equality of variance can be relaxed, particularly if sample sizes are large enough. However, other assumptions for example balance in the design and independence tend to be stricter, and we need to be careful in violating them.
We can check for independence by looking at the correlation among the coefficient directly in the summary table:
> summary(mod5, cor=T) […] Correlation of Coefficients: (Intercept) nfN1 nfN2 nfN3 nfN4 nfN5 bv topoW topoE topoLO nfN1 -0.05 nfN2 -0.04 0.50 nfN3 -0.05 0.50 0.50 nfN4 -0.05 0.50 0.50 0.50 nfN5 -0.05 0.50 0.50 0.50 0.50 bv -1.00 0.01 -0.01 0.01 0.00 0.00 topoW -0.72 0.00 0.00 0.00 0.00 0.00 0.72 topoE -0.51 0.01 0.02 0.03 0.01 0.02 0.51 0.37 topoLO -0.50 -0.02 -0.01 0.02 -0.01 0.02 0.50 0.36 0.26 bv:topoW 0.70 0.00 0.00 0.00 0.00 0.00 -0.70 -1.00 -0.36 -0.35 bv:topoE 0.48 -0.01 -0.02 -0.03 -0.01 -0.02 -0.48 -0.35 -1.00 -0.24 bv:topoLO 0.47 0.02 0.01 -0.02 0.01 -0.02 -0.47 -0.34 -0.24 -1.00 bv:topoW bv:topoE nfN1 nfN2 nfN3 nfN4 nfN5 bv topoW topoE topoLO bv:topoW bv:topoE 0.34 bv:topoLO 0.33 0.23
If we exclude the interaction, which would clearly be correlated with the single covariates, the rest of the coefficients are not much correlated. From this we may conclude that our assumption of independence holds true for this dataset. In cases where from this table we see a relatively high correlation among coefficients, we would need to use a more robust method of maximum likelihood (ML) and residuals maximum likelihood (REML) for computing the coefficients. This can be done with the function gls in the package nlme, using the same syntax as for lm:
mod6 = gls(yield ~ nf + bv * topo, data=dat, method="REML") Anova(mod6, type=c("III")) summary(mod6)
As you can see despite the different function (gls instead of lm), the rest of the syntax is the same as before. We can still use the function Anova to print the ANOVA table and summary to check the individual coefficients.
Moreover, we can also use the function anovato compare the two models (the one from glsand lm) and see which is the best performer:
> anova(mod6, mod5) Model df AIC BIC logLik mod6 1 14 27651.21 27737.18 -13811.61 mod5 2 14 27651.21 27737.18 -13811.61
The indexes AIC, BIC and logLik are all used to check the accuracy of the model and should be as low as possible. For more info please look at the appendix about assessing the accuracy of our model.
Linear Mixed-Effects Models
This class of models are used to account for more than one source of random variation. For example, assume we have a dataset where again we are trying to model yield as a function of nitrogen level. However, this time the data were collected in many different farms. In this case would need to be consider a cluster and the model would need to take this clustering into account. Another common set of experiments where linear mixed-effects models are used is repeated measures where time provide an additional source of correlation between measures. For these models we do not need to worry about the assumptions from previous models, since these are very robust against all of them. For example, for unbalanced design with blocking, probably these methods should be used instead of the standard ANOVA.
At the beginning on this tutorial we explored the equation that supports linear model:
This equation can be seen as split into two components, the fixed and random effects. For fixed effect we refer to those variables we are using to explain the model. These may be factorial (in ANOVA), continuous or a mixed of the two (ANCOVA) and they can also be the blocks used in our design. The other component in the equation is the random effect, which provides a level of uncertainty that it is difficult to account in the model. For example, when we work with yield we might see differences between plants grown from similar soils and conditions. These may be related to the seeds or to other factors and are part of the within-subject variation that we cannot explain.
There are times however where in the data there are multiple sources of random variation. For example, data may be clustered in separate field or separate farms. This will provide an additional source of random variation that needs to be taken into account in the model. To do so the standard equation can be amended in the following way:
This is referred to as a random intercept model, where the random variation is split into a cluster specific variation u and the standard error term. A more complex form, that is normally used for repeated measures is the random slope and intercept model:
Where we add a new source of random variation v related to time T.
Random Intercept Model for Clustered Data
Just to explain the syntax to use linear mixed-effects model in R for cluster data, we will assume that the factorial variable rep in our dataset describe some clusters in the data. To fit a mixed-effects model we are going to use the function lme from the package nlme. This function can work with unbalanced designs:
lme1 = lme(yield ~ nf + bv * topo, random= ~1|rep, data=dat)
The syntax is very similar to all the models we fitted before, with a general formula describing our target variable yield and all the treatments, which are the fixed effects of the model. Then we have the option random, which allows us to include an additional random component for the clustering factor rep. In this case the ~1 indicates that the random effect will be associated with the intercept.
Once again we can use the function summary to explore our results:
> summary(lme1) Linear mixed-effects model fit by REML Data: dat AIC BIC logLik 27648.36 27740.46 -13809.18 Random effects: Formula: ~1 | rep (Intercept) Residual StdDev: 0.798407 13.3573 Fixed effects: yield ~ nf + bv * topo Value Std.Error DF t-value p-value (Intercept) 327.3304 14.782524 3428 22.143068 0 nfN1 3.9643 0.788049 3428 5.030561 0 nfN2 5.2340 0.790104 3428 6.624449 0 nfN3 5.4498 0.789084 3428 6.906496 0 nfN4 7.5286 0.789551 3428 9.535320 0 nfN5 7.7254 0.789111 3428 9.789976 0 bv -1.4685 0.085507 3428 -17.173569 0 topoHT -233.3675 17.143956 3428 -13.612232 0 topoLO -251.9750 20.967003 3428 -12.017693 0 topoW -146.4066 16.968453 3428 -8.628162 0 bv:topoHT 1.1945 0.097696 3428 12.226279 0 bv:topoLO 1.4961 0.123424 3428 12.121624 0 bv:topoW 0.7873 0.097865 3428 8.044485 0
We can also use the function Anovato display the ANOVA table:
> Anova(lme2, type=c("III")) Analysis of Deviance Table (Type III tests) Response: yield Chisq Df Pr(>Chisq) (Intercept) 752.25 1 < 2.2e-16 *** nf 155.57 5 < 2.2e-16 *** bv 291.49 1 < 2.2e-16 *** topo 236.52 3 < 2.2e-16 *** year 797.13 1 < 2.2e-16 *** bv:topo 210.38 3 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We might be interested in understanding if fitting a more complex model provides any advantage in terms of accuracy, compared with a model where no additional random effect is included. To do so we can compare this new model with mod6, which we created with the glsfunction and includes the same treatment structure. We can do that with the function anova, specifying both models:
> anova(lme1, mod6) Model df AIC BIC logLik Test L.Ratio p-value lme1 1 15 27648.36 27740.46 -13809.18 mod6 2 14 27651.21 27737.18 -13811.61 1 vs 2 4.857329 0.0275
As you can see there is a decrease in AIC for the model fitted with lme, and the difference is significant (p-value below 0.05). Therefore this new model where clustering is accounted for is better than the one without an additional random effect, even though only slightly. In this case we would need to decide if fitting a more complex model (which is probably more difficult to explain to readers) is the best way.
Random Intercept and Slope for repeated measures
If we collected data at several time steps we are looking at a repeated measures analysis. The code to create such a model is the following:
lme2 = lme(yield ~ nf + bv * topo + year, random= ~year|rep, data=dat) summary(lme2) Anova(lme2, type=c("III"))
The syntax is very similar to what we wrote before, except that now the random component includes both time and clusters. Again we can use summary and Anovato get more info about the model. We can also use again the function anova to compare this with the previous model:
> anova(lme1, lme2) Model df AIC BIC logLik Test L.Ratio p-value lme1 1 15 27648.36 27740.46 -13809.18 lme2 2 18 26938.83 27049.35 -13451.42 1 vs 2 715.5247 <.0001 Warning message: In anova.lme(lme1, lme2) : fitted objects with different fixed effects. REML comparisons are not meaningful.
From this output it is clear that the new model is better that the one before and their difference in highly significant.
We can extract only the effects for the random components using the function ranef:
> ranef(lme2) (Intercept) year R1 -0.3468601 -1.189799e-07 R2 -0.5681688 -1.973702e-07 R3 0.9150289 3.163501e-07
This tells us the changes in yield for each cluster and time step. We can also do the same for the fixed effects, and this will return the coefficients of the model:
> fixef(lme2) (Intercept) nfN1 nfN2 nfN3 nfN4 -1.133614e+04 3.918006e+00 5.132136e+00 5.368513e+00 7.464542e+00 nfN5 bv topoHT topoLO topoW 7.639337e+00 -1.318391e+00 -2.049979e+02 -2.321431e+02 -1.136168e+02 year bv:topoHT bv:topoLO bv:topoW 5.818826e+00 1.027686e+00 1.383705e+00 5.998379e-01
To have an idea of their confidence interval we can use the function intervals:
> intervals(lme2, which = "fixed") Approximate 95% confidence intervals Fixed effects: lower est. upper (Intercept) -1.214651e+04 -1.133614e+04 -1.052576e+04 nfN1 2.526139e+00 3.918006e+00 5.309873e+00 nfN2 3.736625e+00 5.132136e+00 6.527648e+00 nfN3 3.974809e+00 5.368513e+00 6.762216e+00 nfN4 6.070018e+00 7.464542e+00 8.859065e+00 nfN5 6.245584e+00 7.639337e+00 9.033089e+00 bv -1.469793e+00 -1.318391e+00 -1.166989e+00 topoHT -2.353450e+02 -2.049979e+02 -1.746508e+02 topoLO -2.692026e+02 -2.321431e+02 -1.950836e+02 topoW -1.436741e+02 -1.136168e+02 -8.355954e+01 year 5.414742e+00 5.818826e+00 6.222911e+00 bv:topoHT 8.547273e-01 1.027686e+00 1.200644e+00 bv:topoLO 1.165563e+00 1.383705e+00 1.601846e+00 bv:topoW 4.264933e-01 5.998379e-01 7.731826e-01 attr(,"label")  "Fixed effects:"
Dealing with non-normal data – Generalized Linear Models
As you remember, when we first introduced the simple linear model we defined a set of assumptions that need to be met to apply this model. We already talked about methods to deal with deviations from the assumption of independence, equality of variances and balanced designs and the fact that, particularly if our dataset is large, we may reach robust results even if our data are not perfectly normal. However, there are datasets for which the target variable has a completely different distribution from the normal, in these cases we need to change our modelling method and employ generalized linear models. Common scenarios where this model should be considered are for example researchers where the variable of interest is binary, for example presence or absence of a species, or where we are interested in modelling counts, for example the number of insects present in a particular location. In these cases, where the target variable is not continuous but rather discrete or categorical, the assumption of normality is usually not met. In this section we will focus on the two scenarios mentioned above, but GLM can be used to deal with data distributed in many different ways, and we will introduce how to deal with more general cases.
Data of this type, i.e. counts or rates, are characterized by the fact that their lower bound is always zero. This does not fit well with a normal linear model, where the regression line may well estimate negative values. For this type of variable we can employ a Poisson Regression, which fits the following model:
As you can see the equation is very similar to the standard linear model, the difference is that to insure that all Y are positive (since we cannot have negative values for count data) we are estimating the log of Y.
In R fitting this model is very easy. For this example we are going to use another dataset available in the package agridat, named beall.webworms, which represents counts of webworms in a beet field, with insecticide treatments:
> dat = beall.webworms > str(dat) 'data.frame': 1300 obs. of 7 variables: $ row : int 1 2 3 4 5 6 7 8 9 10 ... $ col : int 1 1 1 1 1 1 1 1 1 1 ... $ y : int 1 0 1 3 6 0 2 2 1 3 ... $ block: Factor w/ 13 levels "B1","B10","B11",..: 1 1 1 1 1 6 6 6 6 6 ... $ trt : Factor w/ 4 levels "T1","T2","T3",..: 1 1 1 1 1 1 1 1 1 1 ... $ spray: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ... $ lead : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
We can check the distribution of our data with the function hist:
hist(dat$y, main="Histogram of Worm Count", xlab="Number of Worms")
We are going to fit a simple model first to see how to interpret its results, and then compare it with a more complex model:
pois.mod = glm(y ~ trt, data=dat, family=c("poisson"))
Once again the function summary will show some useful details about this model:
> summary(pois.mod) Call: glm(formula = y ~ trt, family = c("poisson"), data = dat) Deviance Residuals: Min 1Q Median 3Q Max -1.6733 -1.0046 -0.9081 0.6141 4.2771 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.33647 0.04688 7.177 7.12e-13 *** trtT2 -1.02043 0.09108 -11.204 < 2e-16 *** trtT3 -0.49628 0.07621 -6.512 7.41e-11 *** trtT4 -1.22246 0.09829 -12.438 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 1955.9 on 1299 degrees of freedom Residual deviance: 1720.4 on 1296 degrees of freedom AIC: 3125.5 Number of Fisher Scoring iterations: 6
The first valuable information is related to the residuals of the model, which should be symmetrical as for any normal linear model. From this output we can see that minimum and maximum, as well as the first and third quartiles, are similar, so this assumption is confirmed. Then we can see that the variable trt (i.e. treatment factor) is highly significant for the model, with very low p-values. Its effect are all negative and referred to the first level T1, meaning for example that a change from T1 to T2 will decrease the count by 1.02. We can check this effect by estimating changes between T1 and T2 with the function predict, and the option newdata:
> predict(pois.mod, newdata=data.frame(trt=c("T1","T2"))) 1 2 0.3364722 -0.6839588
Another important piece of information are the Null and Residuals deviances, which allow us to compute the probability that this model is better than the Null hypothesis, which states that a constant (with no variables) model would be better. We can compute the p-value of the model with the following line:
> 1-pchisq(deviance(pois.mod), df.residual(pois.mod))  1.709743e-14
This p-value is very low, meaning that this model fits the data well. However, it may not be the best possible model, and we can use the AIC parameter to compare it to other models. For example, we could include more variables:
pois.mod2 = glm(y ~ block + spray*lead, data=dat, family=c("poisson"))
How does this new model compare with the previous?
> AIC(pois.mod, pois.mod2) df AIC pois.mod 4 3125.478 pois.mod2 16 3027.438
As you can see the second model has a lower AIC, meaning that fits the data better than the first.
One of the assumptions of the Poisson distribution is that its mean and variance have the same value. We can check by simply comparing mean and variance of our data:
> mean(dat$y)  0.7923077 > var(dat$y)  1.290164
In cases such as this when the variance is larger than the mean (in this case we talk about overdispersed count data) we should employ different methods, for example a quasipoisson distribution:
pois.mod3 = glm(y ~ trt, data=dat, family=c("quasipoisson"))
The summary function provides us with the dispersion parameter, which for a Poisson distribution should be 1:
> summary(pois.mod3) Call: glm(formula = y ~ trt, family = c("quasipoisson"), data = dat) Deviance Residuals: Min 1Q Median 3Q Max -1.6733 -1.0046 -0.9081 0.6141 4.2771 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.33647 0.05457 6.166 9.32e-10 *** trtT2 -1.02043 0.10601 -9.626 < 2e-16 *** trtT3 -0.49628 0.08870 -5.595 2.69e-08 *** trtT4 -1.22246 0.11440 -10.686 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for quasipoisson family taken to be 1.35472) Null deviance: 1955.9 on 1299 degrees of freedom Residual deviance: 1720.4 on 1296 degrees of freedom AIC: NA Number of Fisher Scoring iterations: 6
Since the dispersion parameter is 1.35, we can conclude that our data are not terrible dispersed, so maybe a Poisson regression would still be appropriate for this dataset.
However, there are cases where the data are very overdispersed. In those cases, when we see that the distribution has lots of peaks we need to employ the negative binomial regression, with the function glm.nb available in the package MASS:
library(MASS) NB.mod1 = glm.nb(y ~ trt, data=dat)
Another popular for of regression that can be tackled with GLM is the logistic regression, where the variable of interest is binary (0 and 1, presence and absence or any other binary outcome). In this case the regression model takes the following equation:
Again, the equation is identical to the standard linear model, but what we are computing from this model is the log of the probability that one of the two outcomes will occur.
For this example we are going to use another dataset available in the package agridatcalled johnson.blight, where the binary variable of interest is the presence or absence of blight (either 0 or 1) in potatoes:
> dat = johnson.blight > str(dat) 'data.frame': 25 obs. of 6 variables: $ year : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ... $ area : int 0 0 0 0 50 810 120 40 0 0 ... $ blight : int 0 0 0 0 1 1 1 1 0 0 ... $ rain.am : int 8 9 9 6 16 10 12 10 11 8 ... $ rain.ja : int 1 4 6 1 6 7 12 4 10 9 ... $ precip.m: num 5.84 6.86 47.29 8.89 7.37 ...
In R fitting this model is very easy:
mod9 = glm(blight ~ rain.am, data=dat, family=binomial)
we are now using the binomial distribution for a logistic regression. To check the model we can rely again on summary:
> summary(mod9) Call: glm(formula = blight ~ rain.am, family = binomial, data = dat) Deviance Residuals: Min 1Q Median 3Q Max -1.9395 -0.6605 -0.3517 1.0228 1.6048 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.9854 2.0720 -2.406 0.0161 * rain.am 0.4467 0.1860 2.402 0.0163 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 34.617 on 24 degrees of freedom Residual deviance: 24.782 on 23 degrees of freedom AIC: 28.782 Number of Fisher Scoring iterations: 5
This table is very similar to the one created for count data, so a lot of the discussion above can be used here. The main difference is in the way we can interpret the coefficients, because we need to remember that here we are calculating the logit function of the probability, so 0.4467 (coefficient for rain.am) is not the actual probability associated with an increase in rain. However, what we can say by just looking at the coefficients is that rain has a positive effect on blight, meaning that more rain increases the chances of finding blight in potatoes.
To estimate probabilities we need to use the function predict:
> predict(mod9, type="response") 1 2 3 4 5 6 7 0.19598032 0.27590141 0.27590141 0.09070472 0.89680283 0.37328295 0.59273722 8 9 10 11 12 13 14 0.37328295 0.48214935 0.19598032 0.69466455 0.19598032 0.84754431 0.27590141 15 16 17 18 19 20 21 0.93143346 0.05998586 0.19598032 0.05998586 0.84754431 0.59273722 0.59273722 22 23 24 25 0.48214935 0.59273722 0.98109229 0.89680283
This calculates the probability associated with the values of rain in the dataset. To know the probability associated with new values of rain we can again use predict with the option newdata:
>predict(mod9,newdata=data.frame(rain.am=c(15)),type="response") 1 0.8475443
This tells us that when rain is equal to 15 we have 84% chances of finding blight (i.e. chances of finding 1) in potatoes.
To assess the accuracy of the model we can use two approaches, the first is based on the deviances listed in the summary. The Residual deviance compares this model with the one that fits the data perfectly. So if we calculate the following p-value (using the deviance and df in the summary table for residuals):
> 1-pchisq(24.782, 23)  0.3616226
We see that because it is higher than 0.05 we cannot reject that this model fits the data as well as the perfect model, therefore our model seems to be good.
We can repeat the same procedure for the Null hypothesis, which again tests whether this model fits the data well:
> 1-pchisq(34.617, 24)  0.07428544
Since this is again not significant it suggests (contrary to what we obtained before) that this model is not very good.
An additional and probably easier to understand way to assess the accuracy of a logistic model is calculating the pseudo R2, which can be done by installing the package pscl:
Now we can run the following function:
> pR2(mod9) llh llhNull G2 McFadden r2ML r2CU -12.3910108 -17.3086742 9.8353268 0.2841155 0.3252500 0.4338984
From this we can see that our model explains around 30-40% of the variation in blight, which is not particularly good. We can use this index to compare models, as we did for AIC.
Dealing with other distributions and transformation
As mentioned, GLM can be used for fitting linear models not only in the two scenarios we described above, but in any occasion where data do not comply with the normality assumption. For example, we can look at another dataset available in agridat, where the variable of interest is slightly non-normal:
> dat = hughes.grapes > str(dat) 'data.frame': 270 obs. of 6 variables: $ block : Factor w/ 3 levels "B1","B2","B3": 1 1 1 1 1 1 1 1 1 1 ... $ trt : Factor w/ 6 levels "T1","T2","T3",..: 1 2 3 4 5 6 1 2 3 4 ... $ vine : Factor w/ 3 levels "V1","V2","V3": 1 1 1 1 1 1 1 1 1 1 ... $ shoot : Factor w/ 5 levels "S1","S2","S3",..: 1 1 1 1 1 1 2 2 2 2 ... $ diseased: int 1 2 0 0 3 0 7 0 1 0 ... $ total : int 14 12 12 13 8 9 8 10 14 10 ...
The variable total presents a skewness of 0.73, which means that probably with a transformation it should fit with a normal distribution. However, for the sake of the discussion we will assume it cannot be transformed. So now our problem is identify the best distribution for our data, to do so we can use the function descdist in the package fitdistrplus we already loaded:
descdist(dat$total, discrete = FALSE)
this returns the following plot:
Where we can see that our data (blue dot) are close to normal and maybe closer to a gamma distribution. So now we can further check this using another function from the same package:
which creates the following plot:
From this we can see that in fact our data seem to be close to a gamma distribution, so now we can proceed with modelling:
mod8 = glm(total ~ trt * vine, data=dat, family=Gamma(link=identity))
in the option family we included the name of the distribution, plus a link function that is used if we want to transform our data (in this case the function identity is for leaving data not transformed).
This is what we do to model other types of data that do not fit with a normal distribution. Other possible families supported by GLM are:
binomial, gaussian, Gamma, inverse.gaussian, poisson, quasi, quasibinomial, quasipoisson
Other possible link functions (which availability depends on the family) are:
logit, probit, cauchit, cloglog, identity, log, sqrt, 1/mu^2, inverse.
Generalized Linear Mixed Effects models
As linear model, linear mixed effects model need to comply with normality. If our data deviates too much we need to apply the generalized form, which is available in the package lme4:
For this example we will use again the dataset johnson.blight:
dat = johnson.blight
Now we can fit a GLME model with random effects for area, and compare it with a model with only fixed effects:
mod10 = glm(blight ~ precip.m, data=dat, family="binomial") mod11 = glmer(blight ~ precip.m + (1|area), data=dat, family="binomial") > AIC(mod10, mod11) df AIC mod10 2 37.698821 mod11 3 9.287692
As you can see this new model reduces the AIC substantially.
The same function can be used for Poisson regression, but it does not work for quasipoisson overdispersed data. However, within lme4 there is the function glmer.nb for negative binomial mixed effect. The syntax is the same as glmer, except that in glmer.nb we do not need to include family.
References and Further ReadingFinch, W.H., Bolin, J.E. and Kelley, K., 2014. Multilevel modeling using R. Crc Press.
Yan, X. and Su, X., 2009. Linear regression analysis: theory and computing. World Scientific.
James, G., Witten, D., Hastie, T. and Tibshirani, R., 2013. An introduction to statistical learning (Vol. 6). New York: springer. http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Sage. pp104-106. [For pseudo R-Squared equations, page available on google books]
Webster, R. and Oliver, M.A., 2007. Geostatistics for environmental scientists. John Wiley & Sons.
West, B.T., Galecki, A.T. and Welch, K.B., 2014. Linear mixed models. CRC Press.
Gałecki, A. and Burzykowski, T., 2013. Linear mixed-effects models using R: A step-by-step approach. Springer Science & Business Media.
Williams, R., 2004. One-Way Analysis of Variance. URL: https://www3.nd.edu/~rwilliam/stats1/x52.pdf
Witte, R. and Witte, J. 2009. Statistics. 9th ed. Wiley.
Appendix 1: Assessing the accuracy of our model
There are several ways to check the accuracy of our models, some are printed directly in R within the summary output, others are just as easy to calculate with specific functions.
This is probably the most commonly used statistics and allows us to understand the percentage of variance in the target variable explained by the model. It can be computed as a ratio of the regression sum of squares and the total sum of squares. This is one of the standard measures of accuracy that R prints out through the function summary for linear models and ANOVAs.
This is a form of R-squared that is adjusted for the number of terms in the model. It can be computed as follows:
Where R2 is the R-squared of the model, n is the sample size and p is the number of terms (or predictors) in the model. This index is extremely useful to determine possible overfitting in the model. This happens particularly when the sample size is small, in such cases if we fill the model with predictors we may end up increasing the R-squared simply because the model starts adapting to the noise in the data and not properly describing the data.
Root Mean Squared Deviation or Root Mean Squared Error
The previous indexes measure the amount of variance in the target variable that can be explained by our model. This is a good indication but in some cases we are more interested in quantifying the error in the same measuring unit of the variable. In such cases we need to compute indexes that average the residuals of the model. The problem is the residuals are both positive and negative and their distribution should be fairly symmetrical. This means that their average will always be zero. So we need to find other indexes to quantify the average residuals, for example by averaging the squared residuals:
This is the square root of the mean the squared residuals, with
being the estimated value at point t, being the observed value in t and being the sample size. The RMSE has the same measuring unit of the variable y.
Mean Squared Deviation or Mean Squared Error
This is simply the numerator of the previous equation, but it is not used often. The issue with both the RMSE and the MSE is that since they square the residuals they tend to be more affected by large residuals. This means that even if our model explains the large majority of the variation in the data very well, with few exceptions; these exceptions will inflate the value of RMSE.
Mean Absolute Deviation or Mean Absolute Error
To solve the problem with large residuals we can use the mean absolute error, where we average the absolute value of the residuals:
This index is more robust against large residuals. Since RMSE is still widely used, even though its problems are well known, it is always better calculate and present both in a research paper.
Akaike Information Criterion
This index is another popular index we have used along the text to compare different models. It is very popular because it corrects the RMSE for the number of predictors in the model, thus allowing to account for overfitting. It can be simply computed as follows:
Where again p is the number of terms in the model.