# Linear Regression and ANOVA shaken and stirred (Part 2)

**Pachá (Batteries Included)**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In the first part of this entry I did show some `mtcars`

examples, where `am`

can be useful to explain ANOVA as its observations are defined as:

am_i = \begin{cases}1 &\text{ if car } i \text{ is manual} \cr 0 &\text{ if car } i \text{ is automatic}\end{cases}

$$

Now I’ll show another example to continue the last example from Part 1 and I’ll move to something involved more variables.

# ANOVA with one dummy variable

Consider a model where the outcome is `mpg`

and the design matrix is \(\renewcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\R}{\mathbb{R}} X = (\vec{1} \: \vec{x}_2)\).

From the last entry let

x_2 = \begin{cases}1 &\text{ if car } i \text{ is automatic} \cr 0 &\text{ if car } i \text{ is manual}\end{cases}

$$

This will lead to this estimate:

\hat{\vec{\beta}} = \begin{bmatrix}\bar{y}_1 \cr \bar{y}_2 – \bar{y}_1\end{bmatrix}

$$

Fitting the model gives:

y = mtcars$mpg x1 = mtcars$am x2 = ifelse(x1 == 1, 0, 1) fit = lm(y ~ x2) summary(fit) Call: lm(formula = y ~ x2) Residuals: Min 1Q Median 3Q Max -9.3923 -3.0923 -0.2974 3.2439 9.5077 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 24.392 1.360 17.941 < 2e-16 *** x2 -7.245 1.764 -4.106 0.000285 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 4.902 on 30 degrees of freedom Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385 F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285

So to see the relationship between the estimates and the group means I need additional steps:

x0 = rep(1,length(y)) X = cbind(x0,x2) beta = solve(t(X)%*%X) %*% (t(X)%*%y) beta [,1] x0 24.392308 x2 -7.244939

I did obtain the same estimates with `lm`

command so now I calculate the group means:

x1 = ifelse(x1 == 0, NA, x1) x2 = ifelse(x2 == 0, NA, x2) m1 = mean(y*x1, na.rm = TRUE) m2 = mean(y*x2, na.rm = TRUE) beta0 = m1 beta2 = m2-m1 beta0;beta2 [1] 24.39231 [1] -7.244939

In this case this means that the slope for the two groups is the same but the intercept is different, and therefore exists a negative effect of automatic transmission on miles per gallon in average terms.

Again I'll verify the equivalency between `lm`

and `aov`

in this particular case:

y = mtcars$mpg x1 = mtcars$am x2 = ifelse(x1 == 1, 0, 1) fit2 = aov(y ~ x2) fit2$coefficients (Intercept) x2 24.392308 -7.244939

I can calculate the residuals by hand:

mean_mpg = mean(mtcars$mpg) fitted_mpg = fit3$coefficients[1] + fit3$coefficients[2]*mtcars$am observed_mpg = mtcars$mpg TSS = sum((observed_mpg - mean_mpg)^2) ESS = sum((fitted_mpg - mean_mpg)^2) RSS = sum((observed_mpg - fitted_mpg)^2) TSS;ESS;RSS [1] 1126.047 [1] 405.1506 [1] 720.8966

Here its verified that \(TSS = ESS + RSS\) but aside from that I can extract information from `aov`

:

summary(fit2) Df Sum Sq Mean Sq F value Pr(>F) x2 1 405.2 405.2 16.86 0.000285 *** Residuals 30 720.9 24.0 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And check that, as expected, \(ESS\) is the variance explained by `x2`

.

I also can run ANOVA over `lm`

with:

anova(fit) Analysis of Variance Table Response: y Df Sum Sq Mean Sq F value Pr(>F) x2 1 405.15 405.15 16.86 0.000285 *** Residuals 30 720.90 24.03 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table provides information on the effect of `x2`

over `y`

. In this case the null hypothesis is rejected because of the large F-value and the associated p-values.

Considering a 0.05 significance threshold I can say, with 95% of confidence, that the regression slope is statistically different from zero or that there is a difference in group means between automatic and manual transmission.

# ANOVA with three dummy variables

Now let's explore something more complex than `am`

. Reading the documentation I wonder if `cyl`

has an impact on `mpg`

so I explore that variable:

str(mtcars) 'data.frame': 32 obs. of 11 variables: $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... $ disp: num 160 160 108 258 360 ... $ hp : num 110 110 93 110 175 105 245 62 95 123 ... $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... $ wt : num 2.62 2.88 2.32 3.21 3.44 ... $ qsec: num 16.5 17 18.6 19.4 17 ... $ vs : num 0 0 1 1 0 1 0 1 1 1 ... $ am : num 1 1 1 0 0 0 0 0 0 0 ... $ gear: num 4 4 4 3 3 3 3 4 4 4 ... $ carb: num 4 4 1 1 2 1 4 2 2 4 ... unique(mtcars$cyl) [1] 6 4 8

One (wrong) possibility is to write:

y = mtcars$mpg x1 = mtcars$cyl; x1 = ifelse(x1 == 4, 1, 0) x2 = mtcars$cyl; x2 = ifelse(x1 == 6, 1, 0) x3 = mtcars$cyl; x3 = ifelse(x1 == 8, 1, 0) fit = lm(y ~ x1 + x2 + x3) summary(fit) Call: lm(formula = y ~ x1 + x2 + x3) Residuals: Min 1Q Median 3Q Max -6.2476 -2.2846 -0.4556 2.6774 7.2364 Coefficients: (2 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) 16.6476 0.7987 20.844 < 2e-16 *** x1 10.0160 1.3622 7.353 3.44e-08 *** x2 NA NA NA NA x3 NA NA NA NA --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 3.66 on 30 degrees of freedom Multiple R-squared: 0.6431, Adjusted R-squared: 0.6312 F-statistic: 54.06 on 1 and 30 DF, p-value: 3.436e-08

Here the `NAs`

mean there are variables that are linearly related to the other variables (e.g. the variable pointed with `NA`

is an average of one or more of the rest of the variables, like \(x_2 = 2x_1 + x_3\) or another linear combination), then there's no unique solution to the regression without dropping variables.

My model will include these variables:

x_2 = \begin{cases}1 &\text{ if car } i \text{ has 6 cylinders} \cr 0 &\text{ otherwise }\end{cases}

\quad \quad

x_3 = \begin{cases}1 &\text{ if car } i \text{ has 8 cylinders} \cr 0 &\text{ otherwise }\end{cases}

$$

In this particular case regression coefficients are given by this estimate:

\hat{\vec{\beta}} = \begin{bmatrix}\bar{y}_1 \cr \bar{y}_2 \end{bmatrix}

$$

But R has a command called `as.factor()`

that is useful in these cases and also can save you some lines of code in other cases:

fit2 = lm(mpg ~ as.factor(cyl), data = mtcars) summary(fit2) Call: lm(formula = mpg ~ as.factor(cyl), data = mtcars) Residuals: Min 1Q Median 3Q Max -5.2636 -1.8357 0.0286 1.3893 7.2364 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 26.6636 0.9718 27.437 < 2e-16 *** as.factor(cyl)6 -6.9208 1.5583 -4.441 0.000119 *** as.factor(cyl)8 -11.5636 1.2986 -8.905 8.57e-10 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 3.223 on 29 degrees of freedom Multiple R-squared: 0.7325, Adjusted R-squared: 0.714 F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09

The `aov`

version of this is:

fit3 = aov(mpg ~ as.factor(cyl), data = mtcars) TukeyHSD(fit3) Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = mpg ~ as.factor(cyl), data = mtcars) $`as.factor(cyl)` diff lwr upr p adj 6-4 -6.920779 -10.769350 -3.0722086 0.0003424 8-4 -11.563636 -14.770779 -8.3564942 0.0000000 8-6 -4.642857 -8.327583 -0.9581313 0.0112287

As I said many times in this entry, ANOVA is linear regression. Interpreting the coefficients is up to you.

**leave a comment**for the author, please follow the link and comment on their blog:

**Pachá (Batteries Included)**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.