Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post will describe linear regression as from the book Veterinary Epidemiologic Research, describing the examples provided with R.

Regression analysis is used for modeling the relationship between a single variable Y (the outcome, or dependent variable) measured on a continuous or near-continuous scale and one or more predictor (independent or explanatory variable), X. If only one predictor is used, we have a simple regression model, and if we have more than one predictor, we have a multiple regression model. Let’s download the data coming together with the book. There’s a choice of various format, including R:

temp <- tempfile()
load(unz(tmp, "ver2_data_R/daisy2.rdata")) # extract the target file from the temporary file
unlink(tmp) # remove the temporary file

### some data management
daisy2 <- daisy2[daisy2$h7 == 1, ] # we only use a subset of the data daisy2[, c(4, 7, 11, 15:18)] <- lapply(daisy2[, c(4, 7, 11, 15:18)], factor) library(Hmisc) daisy2 <- upData(daisy2, labels = c(region = 'Region', herd = 'Herd number', cow = 'Cow number', study_lact = 'Study lactation number', herd_size = 'Herd size', mwp = "Minimum wait period for herd", parity = 'Lactation number', milk120 = 'Milk volume in first 120 days of lactation', calv_dt = 'Calving date', cf = 'Calving to first service interval', fs = 'Conception at first service', cc = 'Calving to conception interval', wpc = 'Interval from wait period to conception', spc = 'Services to conception', twin = 'Twins born', dyst = 'Dystocia at calving', rp = 'Retained placenta at calving', vag_disch = 'Vaginal discharge observed', h7 = 'Indicator for 7 herd subset'), levels = list(fs = list('No' = 0, 'Yes' = 1), twin = list('No' = 0, 'Yes' = 1), dyst = list('No' = 0, 'Yes' = 1), rp = list('No' = 0, 'Yes' = 1), vag_disch = list('No' = 0, 'Yes' = 1)), units = c(milk120 = "litres")) summary(daisy2$milk120, na.rm = TRUE)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1110    2742    3226    3225    3693    5630
sd(daisy2$milk120, na.rm = TRUE)  703.364 daisy2 <- daisy2[!is.na(daisy2$milk120), ] # get rid of missing observations for milk production


The regression model could be written: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$. A first simple model, with a categorical independent variable would be in R:

lm.milk <- lm(milk120 ~ parity, data = daisy2)
(lm.milk.sum <- summary(lm.milk))

Call:
lm(formula = milk120 ~ parity, data = daisy2)

Residuals:
Milk volume in first 120 days of lactation [litres]
Min       1Q   Median       3Q      Max
-2234.73  -384.50    -0.63   376.52  2188.48

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  2629.63      29.31  89.712  < 2e-16 ***
parity2       715.30      42.45  16.849  < 2e-16 ***
parity3       794.67      44.05  18.039  < 2e-16 ***
parity4       834.51      48.50  17.205  < 2e-16 ***
parity5       812.19      50.77  15.998  < 2e-16 ***
parity6       929.63      66.92  13.893  < 2e-16 ***
parity7       795.26     218.86   3.634 0.000287 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 613.5 on 1761 degrees of freedom
Multiple R-squared: 0.2419,	Adjusted R-squared: 0.2393
F-statistic: 93.65 on 6 and 1761 DF,  p-value: < 2.2e-16


The output shows in order: the model specified, the distribution of residuals (which should be normal with mean = 0 and so median close to 0), the $\beta$ coefficients and their standard errors, the standard error of the residuals, the $R^2$ and adjusted $R^2$ and then a F test for the null hypothesis $H_0: \beta_1 = \beta_2 = ...\beta_k = 0$ (all the coefficients equal zero except the intercept, i.e. is our model better to explain variance than a model predicting the dependent variable mean for all data points). But where’s the ANOVA table?

anova(lm.milk)
Analysis of Variance Table

Response: milk120
Df    Sum Sq  Mean Sq F value    Pr(>F)
parity       6 211463774 35243962  93.653 < 2.2e-16 ***
Residuals 1761 662708166   376325
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


How would you calculate this F-test yourself? The F value is given by $F = \frac{(SST - SSE) / (p - 1)}{SSE / (n - p)}$, with SST the total sum of square, SSE the sum of square for the residuals, n and p the degrees of freedom for total and residuals. We thus have $F_{(p - 1, n - p)}$ to get a p-value.

(src.var.total <- sum((daisy2$milk120 - mean(daisy2$milk120))^2)) #source of variation: total
 874171940
(src.var.res <- sum(lm.milk$res^2)) #source of variation: error (or residual)  662708166 (f.stat <- ((src.var.total - src.var.res) / (lm.milk.sum$fstatistic)) /
+   (src.var.res / lm.milk.sum$fstatistic)) #F-statistic 93.65301 1 - pf(f.stat, lm.milk.sum$fstatistic, lm.milk.sum$fstatistic) # p-value 0  We could also compute ourself $R^2$: 1 - sum(lm.milk$res^2) / sum((daisy2$milk120 - mean(daisy2$milk120))^2)
 0.2419018


We can also show some actual values, predicted ones and residuals:

head(data.frame(daisy2[, c(7:8)], fitted.value = fitted(lm.milk),
+      residual = resid(lm.milk)), n = 10)
parity milk120 fitted.value   residual
1       5  3505.8     3441.824   63.97631
2       5  3691.3     3441.824  249.47631
3       5  4173.0     3441.824  731.17626
4       5  3727.3     3441.824  285.47631
5       5  3090.8     3441.824 -351.02369
6       4  5041.2     3464.141 1577.05893
7       5  3861.2     3441.824  419.37621
8       5  4228.4     3441.824  786.57616
9       6  3431.1     3559.258 -128.15761
10      5  4445.5     3441.824 1003.67626


Next we could plot actual values and intervals for prediction both for the mean and new observations. We first create a new data frame to hold the values of parity for which we want the predictions, then compute the prediction and confidence bands and plot them with ggplo2.

confidence <- data.frame(parity = as.factor(1:7))
# Confidence bands
confidence.band <- predict(lm.milk, int = "c" , newdata = confidence)
confidence.band <- as.data.frame(cbind(confidence.band, parity = c(1:7)))
# Prediction bands
prediction.band <- predict(lm.milk, int = "p" , newdata = confidence)
prediction.band <- as.data.frame(cbind(prediction.band, parity = c(1:7)))

library(ggplot2)
ggplot(data = daisy2, aes(x = as.numeric(parity), y = milk120)) +
geom_point() +
geom_smooth(data = confidence.band, aes(x = parity, y = lwr), method = lm,
se = FALSE, colour = "blue") +
geom_smooth(data = confidence.band, aes(x = parity, y = upr), method = lm,
se = FALSE, colour = "blue") +
geom_smooth(data = prediction.band, aes(x = parity, y = lwr), method = lm,
se = FALSE, colour = "green") +
geom_smooth(data = prediction.band, aes(x = parity, y = upr), method = lm,
se = FALSE, colour = "green") +
xlab("Parity") + ylab("Milk volume in first 120 days of lactation") Finally, let’s have a look at a multiple regression and see how we can assess the significance of predictor variables. We will look at the effect of milk production, parity, retained placenta and vaginal discharge on the waiting period of the cows.

daisy2$milk120.sq <- daisy2$milk120^2
daisy2$parity.cat[as.numeric(daisy2$parity) >= 3] <- "3+"
daisy2$parity.cat[daisy2$parity == 2] <- "2"
daisy2$parity.cat[as.numeric(daisy2$parity) < 2] <- "1"

lm.wpc <- lm(wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp + vag_disch, data = daisy2)
summary(lm.wpc)

Call:
lm(formula = wpc ~ milk120 + milk120.sq + as.factor(parity.cat) +
rp + vag_disch, data = daisy2)

Residuals:
Interval from wait period to conception
Min     1Q Median     3Q    Max
-69.21 -37.98 -14.66  24.46 219.46

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)              1.232e+02  2.134e+01   5.775 9.29e-09 ***
milk120                 -3.426e-02  1.339e-02  -2.558   0.0106 *
milk120.sq               4.507e-06  2.011e-06   2.241   0.0251 *
as.factor(parity.cat)2   5.083e+00  4.056e+00   1.253   0.2103
as.factor(parity.cat)3+  8.887e+00  3.665e+00   2.425   0.0154 *
rpYes                    9.771e+00  4.572e+00   2.137   0.0327 *
vag_dischYes             9.838e+00  6.086e+00   1.616   0.1062
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 51.47 on 1529 degrees of freedom
(232 observations deleted due to missingness)
Multiple R-squared: 0.01306,	Adjusted R-squared: 0.009191
F-statistic: 3.373 on 6 and 1529 DF,  p-value: 0.002622


The F-test tells us if any of the predictors is useful in predicting the length of the waiting period. The F-statistic is significant and we can reject the null hypothesis that no predictors are useful. If we had failed to reject the null hypothesis, we would still be interested in the possibility of non-linear transformations of variables (already done here) or the presence of outliers, masking the true effect. Also we could have not enough data to demonstrate a real effect. Now that we rejected this null hypothesis, maybe not all predictors are necessary to predict the response.
We can see from the summary table which predictors are significant. However we can test a single predictor, with an F-test:

lm.wpc2 <- lm(wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp, data = daisy2)
anova(lm.wpc, lm.wpc2)
Analysis of Variance Table

Model 1: wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp + vag_disch
Model 2: wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp
Res.Df     RSS Df Sum of Sq      F Pr(>F)
1   1529 4050745
2   1530 4057666 -1   -6921.1 2.6125 0.1062


or with a t-test, as $t_i = \hat\beta_i/SE(\hat\beta_i)$ with $n - p$ degrees of freedom. Note that $t^2$ equals the F-statistic.
The same approach can be used to test a group of predictors. For example, if we want to see if we have to keep both milk production and its squared value in the model:

lm.wpc3 <- lm(wpc ~ as.factor(parity.cat) + rp + vag_disch, data = daisy2)
anova(lm.wpc, lm.wpc3)
Analysis of Variance Table

Model 1: wpc ~ milk120 + milk120.sq + as.factor(parity.cat) + rp + vag_disch
Model 2: wpc ~ as.factor(parity.cat) + rp + vag_disch
Res.Df     RSS Df Sum of Sq      F   Pr(>F)
1   1529 4050745
2   1531 4076138 -2    -25394 4.7925 0.008416 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 