**Pachá (Batteries Included)**, and kindly contributed to R-bloggers)

In the first part of this entry I did show some `mtcars`

examples, where `am`

can be useful to explain ANOVA as its observations are defined as:

am_i = \begin{cases}1 &\text{ if car } i \text{ is manual} \cr 0 &\text{ if car } i \text{ is automatic}\end{cases}

$$

Now I’ll show another example to continue the last example from Part 1 and I’ll move to something involved more variables.

# ANOVA with one dummy variable

Consider a model where the outcome is `mpg`

and the design matrix is \(X = (\vec{1} \: \vec{x}_2)\).

From the last entry let

x_2 = \begin{cases}1 &\text{ if car } i \text{ is automatic} \cr 0 &\text{ if car } i \text{ is manual}\end{cases}

$$

This will lead to this estimate:

\hat{\vec{\beta}} = \begin{bmatrix}\bar{y}_1 \cr \bar{y}_2 – \bar{y}_1\end{bmatrix}

$$

Fitting the model gives:

```
y = mtcars$mpg
x1 = mtcars$am
x2 = ifelse(x1 == 1, 0, 1)
fit = lm(y ~ x2)
summary(fit)
```

```
Call:
lm(formula = y ~ x2)
Residuals:
Min 1Q Median 3Q Max
-9.3923 -3.0923 -0.2974 3.2439 9.5077
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.392 1.360 17.941 < 2e-16 ***
x2 -7.245 1.764 -4.106 0.000285 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
```

So to see the relationship between the estimates and the group means I need additional steps:

```
x0 = rep(1,length(y))
X = cbind(x0,x2)
beta = solve(t(X)%*%X) %*% (t(X)%*%y)
beta
```

```
[,1]
x0 24.392308
x2 -7.244939
```

I did obtain the same estimates with `lm`

command so now I calculate the group means:

```
x1 = ifelse(x1 == 0, NA, x1)
x2 = ifelse(x2 == 0, NA, x2)
m1 = mean(y*x1, na.rm = TRUE)
m2 = mean(y*x2, na.rm = TRUE)
beta0 = m1
beta2 = m2-m1
beta0;beta2
```

```
[1] 24.39231
```

```
[1] -7.244939
```

In this case this means that the slope for the two groups is the same but the intercept is different, and therefore exists a negative effect of automatic transmission on miles per gallon in average terms.

Again I’ll verify the equivalency between `lm`

and `aov`

in this particular case:

```
y = mtcars$mpg
x1 = mtcars$am
x2 = ifelse(x1 == 1, 0, 1)
fit2 = aov(y ~ x2)
fit2$coefficients
```

```
(Intercept) x2
24.392308 -7.244939
```

I can calculate the residuals by hand:

```
mean_mpg = mean(mtcars$mpg)
fitted_mpg = fit3$coefficients[1] + fit3$coefficients[2]*mtcars$am
observed_mpg = mtcars$mpg
TSS = sum((observed_mpg - mean_mpg)^2)
ESS = sum((fitted_mpg - mean_mpg)^2)
RSS = sum((observed_mpg - fitted_mpg)^2)
TSS;ESS;RSS
```

```
[1] 1126.047
```

```
[1] 405.1506
```

```
[1] 720.8966
```

Here its verified that \(TSS = ESS + RSS\) but aside from that I can extract information from `aov`

:

```
summary(fit2)
```

```
Df Sum Sq Mean Sq F value Pr(>F)
x2 1 405.2 405.2 16.86 0.000285 ***
Residuals 30 720.9 24.0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

And check that, as expected, \(ESS\) is the variance explained by `x2`

.

I also can run ANOVA over `lm`

with:

```
anova(fit)
```

```
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x2 1 405.15 405.15 16.86 0.000285 ***
Residuals 30 720.90 24.03
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The table provides information on the effect of `x2`

over `y`

. In this case the null hypothesis is rejected because of the large F-value and the associated p-values.

Considering a 0.05 significance threshold I can say, with 95% of confidence, that the regression slope is statistically different from zero or that there is a difference in group means between automatic and manual transmission.

# ANOVA with three dummy variables

Now let’s explore something more complex than `am`

. Reading the documentation I wonder if `cyl`

has an impact on `mpg`

so I explore that variable:

```
str(mtcars)
```

```
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
```

```
unique(mtcars$cyl)
```

```
[1] 6 4 8
```

One (wrong) possibility is to write:

```
y = mtcars$mpg
x1 = mtcars$cyl; x1 = ifelse(x1 == 4, 1, 0)
x2 = mtcars$cyl; x2 = ifelse(x1 == 6, 1, 0)
x3 = mtcars$cyl; x3 = ifelse(x1 == 8, 1, 0)
fit = lm(y ~ x1 + x2 + x3)
summary(fit)
```

```
Call:
lm(formula = y ~ x1 + x2 + x3)
Residuals:
Min 1Q Median 3Q Max
-6.2476 -2.2846 -0.4556 2.6774 7.2364
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.6476 0.7987 20.844 < 2e-16 ***
x1 10.0160 1.3622 7.353 3.44e-08 ***
x2 NA NA NA NA
x3 NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.66 on 30 degrees of freedom
Multiple R-squared: 0.6431, Adjusted R-squared: 0.6312
F-statistic: 54.06 on 1 and 30 DF, p-value: 3.436e-08
```

Here the `NAs`

mean there are variables that are linearly related to the other variables (e.g. the variable pointed with `NA`

is an average of one or more of the rest of the variables, like \(x_2 = 2x_1 + x_3\) or another linear combination), then there’s no unique solution to the regression without dropping variables.

But R has a command called `as.factor()`

that is useful in these cases and also can save you some lines of code in other cases:

```
fit2 = lm(mpg ~ as.factor(cyl), data = mtcars)
summary(fit2)
```

```
Call:
lm(formula = mpg ~ as.factor(cyl), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.2636 -1.8357 0.0286 1.3893 7.2364
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.6636 0.9718 27.437 < 2e-16 ***
as.factor(cyl)6 -6.9208 1.5583 -4.441 0.000119 ***
as.factor(cyl)8 -11.5636 1.2986 -8.905 8.57e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09
```

The `aov`

version of this is:

```
fit3 = aov(mpg ~ as.factor(cyl), data = mtcars)
TukeyHSD(fit3)
```

```
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = mpg ~ as.factor(cyl), data = mtcars)
$`as.factor(cyl)`
diff lwr upr p adj
6-4 -6.920779 -10.769350 -3.0722086 0.0003424
8-4 -11.563636 -14.770779 -8.3564942 0.0000000
8-6 -4.642857 -8.327583 -0.9581313 0.0112287
```

As I said many times in this entry, ANOVA is linear regression. Interpreting the coefficients is up to you.

**leave a comment**for the author, please follow the link and comment on their blog:

**Pachá (Batteries Included)**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...