Raccoon | Ch 2.3 – 1-way Anova

January 12, 2017
By

(This article was first published on R blog | Quantide - R training & consulting, and kindly contributed to R-bloggers)

This article is part of Quantide’s web book “Raccoon – Statistical Models with R“. Raccoon is Quantide’s third web book after “Rabbit – Introduction to R” and “Ramarro – R for Developers“. See the full project here.

The second chapter of Racooon is focused on T-test and Anova. Through example it shows theory and R code of:

This post is the third section of the chapter, about 1-way Anova.

Throughout the web-book we will widely use the package qdata, containing about 80 datasets. You may find it here: https://github.com/quantide/qdata.

require(nortest)
require(car)
require(ggplot2)
require(dplyr)
require(gridExtra)
require(tidyr)
require(qdata)

Example: Tissues (1-way ANOVA)

Data description

Three quality inspectors of a plant, Henry, Anne, and Andrew (operators) measure the strenght of car seat tissues. The company managers want to test the reproducibility, between operators, of company’s measurement system.
The main goal of study is then to verify if operators’ measures are confrontable.
Each operator measured 25 pieces of car seat tissues.
Globally, 75 samples of tissue randomly chosen from the same production batch were measured.

Data loading

data(tissue)
head(tissue)
##   Operator Strength
## 1     Anne    11.30
## 2     Anne    10.62
## 3     Anne    10.36
## 4     Anne    10.23
## 5     Anne    10.42
## 6     Anne    12.64
str(tissue)
## 'data.frame':    75 obs. of  2 variables:
##  $ Operator: Factor w/ 3 levels "Andrew","Anne",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Strength: num  11.3 10.6 10.4 10.2 10.4 ...

Data is already ordered by operator (Anne, Henry, and Andrew)

Descriptives

Plot of Strengt with Operator factor variable in abscissa:

tissue_mean <- tissue %>% group_by(Operator) %>% summarise(Strength=mean(Strength))

ggp <- ggplot() +
  geom_point(data = tissue, mapping = aes(x=Operator, y=Strength), colour="darkblue") + 
  geom_line(data=tissue_mean, mapping = aes(x=Operator,y=Strength), colour="red", group=1)

print(ggp)

anova14

Stripchart of Strenght by Operator with connect line

A different way to show similar information is through a plot of univariate effects:

plot.design(Strength ~ Operator, data = tissue)

anova15Plot of univariate effects of Operator on Strenght

This plot shows the mean for each level of grouping factor compared to the grand mean (the center wider line).

Inference and models

Now we will try models with different contrast types.
Initially the model with default contrasts will be fitted to data, then the contr.sum and the contr.helmert contrast models will be shown.

Default contrasts:

options("contrasts")

## $contrasts
## [1] "contr.treatment" "contr.poly"

options(contrasts = c("contr.treatment", "contr.poly"))
fm_treatment <- aov(Strength~Operator, data = tissue)
summary(fm_treatment)

##             Df Sum Sq Mean Sq F value Pr(>F)  
## Operator     2   6.62    3.31   3.851 0.0258 *
## Residuals   72  61.90    0.86                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary.lm(fm_treatment)

## 
## Call:
## aov(formula = Strength ~ Operator, data = tissue)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9064 -0.5482 -0.0388  0.4656  2.4100 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    10.4364     0.1854  56.280  < 2e-16 ***
## OperatorAnne   -0.2064     0.2622  -0.787  0.43384    
## OperatorHenry  -0.7076     0.2622  -2.698  0.00868 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9272 on 72 degrees of freedom
## Multiple R-squared:  0.09663,    Adjusted R-squared:  0.07154 
## F-statistic: 3.851 on 2 and 72 DF,  p-value: 0.02577

aov()’s predefined output gives a global p-value of 0.0258 on Operator effect.

Note that summary.lm() shows the aov resulting object fm_treatment as if it simply were an lm-type object.

From its output, Andrew’s mean appears significantly different from 0 (\(beta_0\) = 10.4364, p = 2e-16),

Henry seems significantly different from Andrew (0.7076 lesser, p-value=0.00868), whereas Anne’s seem not significantly different from Andrew (0.2064 lesser, p-value=0.43384).  

Instead of summary() and summary.lm(), summary.aov() and summary() may be used, obtaining the same outputs:

fm_treatment <- lm(Strength~Operator, data = tissue)
summary.aov(fm_treatment)

##             Df Sum Sq Mean Sq F value Pr(>F)  
## Operator     2   6.62    3.31   3.851 0.0258 *
## Residuals   72  61.90    0.86                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(fm_treatment)

## 
## Call:
## lm(formula = Strength ~ Operator, data = tissue)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9064 -0.5482 -0.0388  0.4656  2.4100 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    10.4364     0.1854  56.280  < 2e-16 ***
## OperatorAnne   -0.2064     0.2622  -0.787  0.43384    
## OperatorHenry  -0.7076     0.2622  -2.698  0.00868 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9272 on 72 degrees of freedom
## Multiple R-squared:  0.09663,    Adjusted R-squared:  0.07154 
## F-statistic: 3.851 on 2 and 72 DF,  p-value: 0.02577

In this sense, aov(),lm() and their resulting objects are substantially interchangeable.

Now let’s see what changes if constrasts settings are modified. The first contrast setting is contr.sum:

options(contrasts = c("contr.sum", "contr.poly"))
fm_sum <- aov(Strength~Operator, data = tissue)
summary(fm_sum)

##             Df Sum Sq Mean Sq F value Pr(>F)  
## Operator     2   6.62    3.31   3.851 0.0258 *
## Residuals   72  61.90    0.86                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary.lm(fm_sum)

## 
## Call:
## aov(formula = Strength ~ Operator, data = tissue)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9064 -0.5482 -0.0388  0.4656  2.4100 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.13173    0.10706  94.635   <2e-16 ***
## Operator1    0.30467    0.15141   2.012   0.0479 *  
## Operator2    0.09827    0.15141   0.649   0.5184    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9272 on 72 degrees of freedom
## Multiple R-squared:  0.09663,    Adjusted R-squared:  0.07154 
## F-statistic: 3.851 on 2 and 72 DF,  p-value: 0.02577

The interpretation of the contr.sum coefficients is not very simple: contrasts are the \(k-1\) differences (where \(k\) is the number of factor levels) between each level and a reference level; the resulting estimated coefficients contain the differences of each level mean (excluded the reference level) with respect to the grand mean; in fact, the model intercept is equal to grand mean.  

aov(), however, gives the same results obtained with default contrasts.

Now let’s change contrast settings by selecting Helmert contrasts

options(contrasts = c("contr.helmert", "contr.poly"))
fm_helm <- aov(Strength~Operator, data = tissue)
summary(fm_helm)

##             Df Sum Sq Mean Sq F value Pr(>F)  
## Operator     2   6.62    3.31   3.851 0.0258 *
## Residuals   72  61.90    0.86                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary.lm(fm_helm)

## 
## Call:
## aov(formula = Strength ~ Operator, data = tissue)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9064 -0.5482 -0.0388  0.4656  2.4100 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.1317     0.1071  94.635  < 2e-16 ***
## Operator1    -0.1032     0.1311  -0.787  0.43384    
## Operator2    -0.2015     0.0757  -2.661  0.00959 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9272 on 72 degrees of freedom
## Multiple R-squared:  0.09663,    Adjusted R-squared:  0.07154 
## F-statistic: 3.851 on 2 and 72 DF,  p-value: 0.02577

Helmert’s contrasts are even more complex than the others; they analyze:  

  1. Initially, the difference between first level mean and the second level mean
  2. Then, the difference between the mean of first two (pooled) levels and the third level mean
  3. Then, the difference between the mean of first three (pooled) levels and the fourth level mean
  4. And so on for \(k-1\) contrasts

The main advantage of using Helmert contrasts is that they are orthogonal; the tests on linear model coefficients are then stocastically independent: the Type I and Type II error rate for the tests on Helmert coefficients are unrelated. The main disavantage, of course, is their complexity in coefficients explanations.

Now, let’s restore predefined contrasts:

options(contrasts = c("contr.treatment", "contr.poly"))

Summarizing the results just seen about contrasts:

  • contr.treatment makes easier to read results, but the tests on coefficients are NOT independent
  • contr.sum makes NO independent tests on coefficients, but compares pairs of means; also, the coefficients represent the distance of level’s mean with respect to grand mean, however they are not always easily interpretable
  • contr.helmert makes independent tests on coefficients, but the coefficients are more difficult to interpret
  • anyway, until now, the global ANOVA (aov())results are always the same: aov() analyzes effects globally, not single means

Residuals analysis

By plotting a model object (lm or aov) up to six residuals diagnostic plots may be shown.

By default, the plot() method applied on a model object shows four graphs:

  1. A plot of raw residuals against fitted values
  2. A Normal Q-Q plot on standardized residuals
  3. A scale-Location plot of \(sqrt{vert std. residuals vert}\) against fitted values. Here, \(std. residuals\) means raw residuals divided by \(hatsigma cdot sqrt{1-h_{ii}}\), where \(h_{ii}\) are the diagonal entries of the “hat” matrix (see Appendices chapter).
  4. A plot of standardized Pearson residuals against leverages. If the leverages are constant (as is the case in the balanced design of this example) the plot uses factor level combinations instead of the leverages for the x-axis.

The next lines of code split the graphical device in 4 areas (2 x 2) and plot diagnostic graphs in only one display

op <- par(mfrow = c(2, 2))
plot(fm_treatment)
par(op)

anova16

Compound diagnostic residual plot of Strenght Vs. Operator ANOVA model

The last line of above code restores the graphical device.

The residual plots confirm that normality and homoscedasticity assumptions are met, and no outliers appear. This confirms the model results.

 

The post Raccoon | Ch 2.3 – 1-way Anova appeared first on Quantide – R training & consulting.

To leave a comment for the author, please follow the link and comment on their blog: R blog | Quantide - R training & consulting.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)