[This article was first published on R blog | Quantide - R training & consulting, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This article is part of Quantide’s web book “Raccoon – Statistical Models with R“. Raccoon is Quantide’s third web book after “Rabbit – Introduction to R” and “Ramarro – R for Developers“. See the full project here.

The second chapter of Racooon is focused on T-test and Anova. Through example it shows theory and R code of:

This post is the second section of the chapter, about 2-sample t-test and paired t.

Throughout the web-book we will widely use the package qdata, containing about 80 datasets. You may find it here: https://github.com/quantide/qdata.

require(nortest)
require(car)
require(ggplot2)
require(dplyr)
require(gridExtra)
require(tidyr)
require(qdata)

Example: Chicken hormones (2 -sample $$t$$-test)

Data description

To compare two growth hormones, 18 chicken were tested: 10 being assigned to hormone A and 8 to hormone B. The gains in weights (grams) over the period of experiment were measured.

Let us load and have a look at the data

data(hormones)
head(hormones)

## # A tibble: 6 × 2
##   hormone  gain
##
## 1       A   615
## 2       A   645
## 3       A   840
## 4       A   645
## 5       A   365
## 6       A   450

str(hormones)

## Classes 'tbl_df', 'tbl' and 'data.frame':    18 obs. of  2 variables:
##  $hormone: Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ... ##$ gain   : int  615 645 840 645 365 450 530 756 785 560 ...

Descriptives

We begin by calculating some descriptive statistics:

hormones %>% group_by(hormone) %>%
summarise(min = min(gain),
first_qu = quantile(gain, 0.25),
median = median(gain),
mean = mean(gain),
third_qu = quantile(gain, 0.75),
max = max(gain),
sd = sd(gain))

## # A tibble: 2 × 8
##   hormone   min first_qu median    mean third_qu   max       sd
##
## 1       A   365    537.5  630.0 619.100   728.25   840 149.4482
## 2       B   560    742.0  827.5 822.625   916.25  1035 154.5574

And now we draw a boxplot and a stripchart (in this case in vertical) of the data:

ggp <- ggplot(data = hormones, mapping = aes(x=hormone, y=gain, fill=hormone)) +
geom_boxplot()

print(ggp)

Box-and-whiskers plot of gain by hormone

hormones_mean <- hormones %>% group_by(hormone) %>% summarise(gain=mean(gain))

ggp <- ggplot(data = hormones, mapping = aes(x=hormone, y=gain)) +
geom_jitter(position = "dodge", color="darkblue") +
geom_point(data=hormones_mean, mapping = aes(x=hormone,y=gain), colour="red", group=1) +
geom_line(data=hormones_mean, mapping = aes(x=hormone,y=gain), colour="red", group=1)

print(ggp)

Stripchart with connect line of gain by hormone

The geom_point() function and geom_line() functions add the red points of two group averages to second graph and then link them with a segment.
group_by() and summarise() functions applies separately the mean() function to values of gain corresponding to hormone variable levels. In this case, if the segment is near to parallel to the x-axis, then the two means are probably equal. Anyway, both graphs and statistics show that hormone B tends to generate heavier chickens.

Inference and models

Let us check the normality of gain in the two groups through q-q plots

hormones_a <- hormones %>% filter(hormone =="A")
hormones_b <- hormones %>% filter(hormone =="B")

ggp1 <- ggplot(data = hormones_a, mapping = aes(sample = gain)) +
stat_qq(color="#F8766D", size=2) +
geom_abline(intercept=mean(hormones_a$gain),slope=sd(hormones_a$gain), color="red", linetype=2) +
ylab("hormone A") + ggtitle("q-q plot of hormone A")

ggp2 <- ggplot(data = hormones_b, mapping = aes(sample = gain)) +
stat_qq(color="#00C094", size=2) +
geom_abline(mapping = aes(intercept=mean(hormones_b$gain),slope=sd(hormones_b$gain)), color="red", linetype=2) +
ylab("hormone B") + ggtitle("q-q plot of hormone B")

grid.arrange(ggp1, ggp2, nrow = 2)

Normal probability plot of gain by hormone

From graph, no evidence of non-normality is evident. Now let us perform Anderson-Darling’s test for gain in the two groups, exploiting the function tapply:

tapply(hormones$gain, hormones$hormone, ad.test)

## $A ## ## Anderson-Darling normality test ## ## data: X[[i]] ## A = 0.15886, p-value = 0.926 ## ## ##$B
##
##  Anderson-Darling normality test
##
## data:  X[[i]]
## A = 0.13781, p-value = 0.9557

The last line of above code, returns a 2-element list, with AD test for each sub-group. There is no evidence to reject the normality hypothesis within the two groups.
The next line of code tests the homoscedasticity assumption, that is if the variances within the two groups are equal:

var.test(gain ~ hormone, data = hormones)

##
##  F test to compare two variances
##
## data:  gain by hormone
## F = 0.93498, num df = 9, denom df = 7, p-value = 0.9031
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1938497 3.9241513
## sample estimates:
## ratio of variances
##          0.9349792

bartlett.test(gain ~ hormone,data=hormones)

##
##  Bartlett test of homogeneity of variances
##
## data:  gain by hormone
## Bartlett's K-squared = 0.0083868, df = 1, p-value = 0.927

leveneTest(gain ~ hormone,data=hormones)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  0.0321 0.8601
##       16

All three previous tests accept $$H_0$$, i.e., there is no evidence to say that the variability changes across levels.

Now the $$t$$-test compares the two group means.

The null hypothesis is $$H_0: beta_1 = beta_2$$, where $$beta_1$$ and $$beta_2$$ are the expectations of dependent variable for hormones A and B, respectively.

t.test(gain ~ hormone, data = hormones, var.equal = TRUE)

##
##  Two Sample t-test
##
## data:  gain by hormone
## t = -2.8283, df = 16, p-value = 0.01211
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -356.07303  -50.97697
## sample estimates:
## mean in group A mean in group B
##         619.100         822.625

The $$t$$-test rejects the null hypothesis.

Now the same test may be conducted using a model approach.
The following lines of code perform the model fitting and print the main results.

mod <- lm(gain ~ hormone, data = hormones)
mod

##

## Call:
## lm(formula = gain ~ hormone, data = hormones)
##
## Coefficients:
## (Intercept)     hormoneB
##       619.1        203.5

The parameters shown are the model intercept and the “slope” coefficient for hormone B. The intercept in this case is the estimate of mean growth for A hormone, whereas the slope in this case is the estimate of difference between the mean growth obtained with hormon B and the mean growth obtained with hormone A.
In other words, the mean growth estimate for hormoneA is 619.1, while the mean growth estimate for hormoneB is 619.1 + 203.5 = 822.6. We will return to parameterization issue later.

Using summary the significancy of parameters is also shown:

summary(mod)

##
## Call:
## lm(formula = gain ~ hormone, data = hormones)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -262.62  -83.48   10.90  120.77  220.90
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   619.10      47.97  12.905 7.12e-10 ***
## hormoneB      203.53      71.96   2.828   0.0121 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 151.7 on 16 degrees of freedom
## Multiple R-squared:  0.3333, Adjusted R-squared:  0.2916
## F-statistic: 7.999 on 1 and 16 DF,  p-value: 0.01211

The test results are equal to the ones of $$t$$-test, and the $$t$$-test value is equal to minus $$t$$-test value of hormoneB coefficient.

Why these results? “Behind the scene”, the model has been built using the following **model matrix** ($$underline{X}$$)

model.matrix(mod)

##    (Intercept) hormoneB
## 1            1        0
## 2            1        0
## 3            1        0
## 4            1        0
## 5            1        0
## 6            1        0
## 7            1        0
## 8            1        0
## 9            1        0
## 10           1        0
## 11           1        1
## 12           1        1
## 13           1        1
## 14           1        1
## 15           1        1
## 16           1        1
## 17           1        1
## 18           1        1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$hormone ## [1] "contr.treatment" What this matrix means? For modeling purposes, the independent variable hormone has been “split” in two dummy variables: hormoneA and hormoneB: • hormoneA is 1 when data contain measures relative to hormone A, and is 0 when data contain measures relative to hormone B • hormoneB is 1 when data contain measures relative to hormone B, and is 0 when data contain measures relative to hormone A Obviously, when hormoneA is 0, hormoneB is 1 and vice versa. Then, one dummy variable (in this case, hormoneA), can be removed from the model, because it does not add any useful information, and makes the model mathematically non-manageable. Consequently, the model matrix will contains only the “all 1s” intercept column (that refers to the case when hormone A is used) and the “displacement” hormoneBcolumn which represents the constant added to mean growth when using hormone B instead of hormone A. In other words, the model actually estimated is the following: $$growth_i = beta_0 + beta_1 cdot delta_B(hormone_i) + varepsilon_i$$ Where $$delta_B(hormone_i)$$ represents the function that returns 1 only when the hormone of $$i$$-th unit is B, otherwise, returns 0. $$beta_0$$ in this manner is the value of mean growth when using hormone A (i.e., when $$delta_B(hormone_i)==0$$), while $$beta_1$$ is the “jump” in mean of growth when using hormone B instead of hormone A. Another approach available to manage such type of test is through the analysis of variance model, i.e. by using the aov() function, that directly performs one analysis of variance: fm <- aov(mod) Printing a summary method for analysis of variance objects reflects the different approach to what is essentially the same information as returned by lm(). While a printed linear-model object shows individual coefficients, printed aov objects show terms, which may correspond to several coefficients. However, methods for linear models can always be applied explicitly on aov objects, such as those for the coefficients or residuals. summary(fm) ## Df Sum Sq Mean Sq F value Pr(>F) ## hormone 1 184100 184100 7.999 0.0121 * ## Residuals 16 368229 23014 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Above output shows that hormone, with all its levels, is globally significant (p=0.012) in explaining the dependent variable variability. In this case, since the independent variable (factor) contains two levels only, the resulting p-value for hormone obtained from aov is identical to the p-value of hormone B obtained using lm() (actually, in this example they are the same test). If the independent variable (factor) contains more than two levels, then aov() returns the global significancy (not the “by level significancy”) of independent variable effect on dependent variable. A few more concise code to obtain the same results as above is fm <- aov(gain ~ hormone, data = hormones) summary(fm) ## Df Sum Sq Mean Sq F value Pr(>F) ## hormone 1 184100 184100 7.999 0.0121 * ## Residuals 16 368229 23014 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Next lines of code show that actually aov objects are lm objects with also analysis of variance results added. class(mod) ## [1] "lm" class(fm) ## [1] "aov" "lm" The following commands are useful to read/define the type of contrasts (i.e., the type of coding of regressors matrix $$underline{X}$$) used when unordered or ordered factors are used. options("contrasts") ##$contrasts
##         unordered           ordered
## "contr.treatment"      "contr.poly"

options(contrasts = c("contr.treatment", "contr.poly"))

The first line of code simply shows the current predefined type of contrasts for unordered and ordered factors, while the second line of code sets the predefined contrasts.

To extract coefficients from an already calculated (lm or aov) model, one must use the coefficients method:

coefficients(fm)

## (Intercept)    hormoneB
##     619.100     203.525

Next line of code shows that the hormoneB coefficient is simply the difference between average growths for hormoneA and hormoneB chickens.

diff(tapply(hormones$gain, hormones$hormone, mean))

##       B
## 203.525

R can use several types of contrasts: contr.treatment sets to zero the coefficient of first factor level, that becomes the intercept, and consequently all the means of other levels are relative to the first level. Another option is contr.sum, that sets to zero the sum of coefficients, and then the reference value of coefficients becomes the gran mean, calculated on all factor levels. R uses contr.treatment as default.

Let’s try to change type of contrasts:

options(contrasts = c("contr.sum", "contr.poly"))

fm <- aov(gain~hormone, data = hormones)

The aov() results are the same:

summary(fm)

##             Df Sum Sq Mean Sq F value Pr(>F)
## hormone      1 184100  184100   7.999 0.0121 *
## Residuals   16 368229   23014
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

but, on the whole, the results are changed:

coefficients(fm)

## (Intercept)    hormone1
##    720.8625   -101.7625

The values of the coefficients are not the same because with contr.treatment the intercept represents the mean in the first group (hormone A) and the “slope” represents the difference between the mean in the second group (hormone B) and the mean in the first group; conversely, with contr.sum the intercept represents the grand mean, and the slope represents the difference between the mean in the first group and the grand mean.
The differences in coefficients are perhaps more perceptible looking at the model matrix:

model.matrix(fm)

##    (Intercept) hormone1
## 1            1        1
## 2            1        1
## 3            1        1
## 4            1        1
## 5            1        1
## 6            1        1
## 7            1        1
## 8            1        1
## 9            1        1
## 10           1        1
## 11           1       -1
## 12           1       -1
## 13           1       -1
## 14           1       -1
## 15           1       -1
## 16           1       -1
## 17           1       -1
## 18           1       -1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$hormone ## [1] "contr.sum" Now, firstly we will restore predefined contrasts options(contrasts = c("contr.treatment", "contr.poly")) and then we will fit a model without intercept to data. In general, removing the intercept is a discouraged practice, but sometimes no intercept models can help to enhance coefficients meaning: mod <- lm(gain ~ hormone-1, data = hormones) coefficients(mod) ## hormoneA hormoneB ## 619.100 822.625 The above coefficients represent, respectively, the mean in the group hormoneA and in the group hormoneB. The model matrix can be useful to understand the meaning of the coefficients: model.matrix(mod) ## hormoneA hormoneB ## 1 1 0 ## 2 1 0 ## 3 1 0 ## 4 1 0 ## 5 1 0 ## 6 1 0 ## 7 1 0 ## 8 1 0 ## 9 1 0 ## 10 1 0 ## 11 0 1 ## 12 0 1 ## 13 0 1 ## 14 0 1 ## 15 0 1 ## 16 0 1 ## 17 0 1 ## 18 0 1 ## attr(,"assign") ## [1] 1 1 ## attr(,"contrasts") ## attr(,"contrasts")$hormone
## [1] "contr.treatment"

Next code shows what can happen when one uses a no-intercept model and does not address for that in analysis of variance:

fm <- aov(mod)
summary(fm)

##           Df  Sum Sq Mean Sq F value   Pr(>F)
## hormone    2 9246543 4623272   200.9 4.63e-12 ***
## Residuals 16  368229   23014
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coefficients(fm)

## hormoneA hormoneB
##  619.100  822.625

The significancy of factor is much more strong than in previous calculations, but this is wrong, because this result actually compares the model with 0 intercept with the model that uses one mean for hormoneAand one mean for hormoneB.

Using the following code gives results equivalent to t-test:

anova(mod,lm(gain ~ 1, data = hormones))

## Analysis of Variance Table
##
## Model 1: gain ~ hormone - 1
## Model 2: gain ~ 1
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)
## 1     16 368229
## 2     17 552328 -1   -184100 7.9994 0.01211 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova() performs an analysis of variance to compare two nested models (see the paragraph on Comparing two nested models in Some Theory on Linear Models chapter).
In this case we are comparing a model with the group variable hormone against the model with grand mean only.
The result is the same as seen individually in lm() and aov() outputs.

Model diagnostics

Since the assumpions on linear models, the residuals should arise from a gaussian distribution; then, a Normal Probability Plot of residuals will be plotted to check this assumption.

res <- cbind(hormones,res=fm$residuals) ad.test(res$res)

##
##  Anderson-Darling normality test
##
## data:  res$res ## A = 0.18963, p-value = 0.8859 ggp <- ggplot(data = res, mapping = aes(sample = res, color=hormone)) + stat_qq(size=2) + geom_abline(mapping = aes(intercept=mean(res), slope=sd(res)), colour="red", linetype=2) print(ggp) Overlaid Normal probability plots of residuals The two colors correspond to two groups. Residuals seem gaussian, irrespective of group. The Anderson-Darling (AD) test confirms that: ad.test(fm$residuals)

##
##  Anderson-Darling normality test
##
##  $Car_B: num 28.6 24.4 37.3 39.9 21.7 ... Descriptives A simple boxplot: carctl2 % gather(car, value, Car_A, Car_B) ggp <- ggplot(data = carctl2, mapping = aes(x=car, y=value, fill=car)) + geom_boxplot() print(ggp) Box-and-whiskers plot of parking times for the two cars A stripchart with mean points and mean line: carctl_mean % group_by(car) %>% summarise(value=mean(value)) ggp <- ggplot(data = carctl2, mapping = aes(x=car, y=value)) + geom_jitter(position = "dodge", color="darkblue") + geom_point(data=carctl_mean, mapping = aes(x=car,y=value), colour="red", group=1) + geom_line(data=carctl_mean, mapping = aes(x=car,y=value), colour="red", group=1) print(ggp) Stripchart with connect line of parking times for the two cars And now the $$t$$-test. Before testing the means, a check for normality is made by using qq-plot and Anderson-Darling test ggp <- ggplot(data = carctl2, mapping = aes(sample = value, color=car)) + stat_qq() + geom_abline(data= carctl2 %>% filter(car=="Car_A"), mapping=aes(intercept=mean(value), slope=sd(value), color=car), size=1) + geom_abline(data= carctl2 %>% filter(car=="Car_B"), mapping=aes(intercept=mean(value), slope=sd(value), color=car), size=1) print(ggp) Normal probability plot of parking times of two cars There is not evidence of non-normality within groups, and the AD tests confirm above results: lapply(X=carctl, FUN=ad.test) ##$Car_A
##
##  Anderson-Darling normality test
##
## data:  X[[i]]
## A = 0.19988, p-value = 0.8647
##
##
## $Car_B ## ## Anderson-Darling normality test ## ## data: X[[i]] ## A = 0.36913, p-value = 0.3928 and finally the two independent samples $$t$$-test: t.test(x=carctl$Car_A, y=carctl$Car_B) ## ## Welch Two Sample t-test ## ## data: carctl$Car_A and carctl$Car_B ## t = 1.2205, df = 36.778, p-value = 0.2301 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -2.056978 8.285186 ## sample estimates: ## mean of x mean of y ## 34.56418 31.45008 Following this result, the null hypothesis is not rejected (at a 5% level). But if we produce a scatterplot of Car_A Vs Car_B, ggp <- ggplot(data = carctl, mapping = aes(x=Car_A, y=Car_B)) + geom_point(col="darkblue") print(ggp) Scattrplot of parking times of Car_A Vs. Car_B we can see that data of two cars are clearly NOT independent!! Measures on Car_A are strongly related to measures on Car_B, because each driver drove either cars. Then the correct test for such as data is: carctl % mutate(car_diff = Car_A - Car_B) t.test(x=carctl$car_diff, mu=0)

##
##  One Sample t-test
##
## data:  carctl$car_diff ## t = 3.8735, df = 19, p-value = 0.001023 ## alternative hypothesis: true mean is not equal to 0 ## 95 percent confidence interval: ## 1.431407 4.796801 ## sample estimates: ## mean of x ## 3.114104 or, equivalently, a paired $$t$$-test(x=carctl$$Car_A, y=carctl$$Car_B, paired=TRUE) t.test(x=carctl$Car_A, y=carctl$Car_B, paired=TRUE) ## ## Paired t-test ## ## data: carctl$Car_A and carctl$Car_B ## t = 3.8735, df = 19, p-value = 0.001023 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## 1.431407 4.796801 ## sample estimates: ## mean of the differences ## 3.114104 In either cases the results are the same: a difference statistically significant exists between mean parking times of two cars. In this case the Normality test must be performed on differences between measures within same driver. Check of normality by Normal probability plot: ggp <- ggplot(data = carctl, mapping = aes(sample = car_diff)) + stat_qq(color="darkblue", size=2) + geom_abline(mapping = aes(intercept=mean(car_diff), slope=sd(car_diff)), colour="red", linetype=2) print(ggp) Normal probability plot of difference of parking times of two cars And the Anderson-Darling test: ad.test(x=carctl$car_diff)

##
##  Anderson-Darling normality test
##
## data:  carctl$car_diff ## A = 0.29177, p-value = 0.5704 For either tests, the Normality assumption is not rejected. Now we try the model approach to the same test. The model approach in this case follows the same procedure seen for the one-sample t-test, but using the differences between times as dependent variable. Fitting of model: mod <- lm(car_diff ~ 1, data = carctl) mod ## ## Call: ## lm(formula = car_diff ~ 1, data = carctl) ## ## Coefficients: ## (Intercept) ## 3.114 summary(mod) ## ## Call: ## lm(formula = car_diff ~ 1, data = carctl) ## ## Residuals: ## Min 1Q Median 3Q Max ## -8.2986 -2.1332 0.4141 2.3877 5.6064 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.114 0.804 3.873 0.00102 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.595 on 19 degrees of freedom or, equivalently one may use also: mod <- lm(I(Car_A-Car_B)~1, data=carctl) summary(mod) ## ## Call: ## lm(formula = I(Car_A - Car_B) ~ 1, data = carctl) ## ## Residuals: ## Min 1Q Median 3Q Max ## -8.2986 -2.1332 0.4141 2.3877 5.6064 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.114 0.804 3.873 0.00102 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.595 on 19 degrees of freedom q-q plot (or Normal probability plot) of residuals shall be identical to previous one. The Anderson-Darling test too is identical to previuosly made test: ad.test(mod$residuals)

##
##  Anderson-Darling normality test
##
## data:  mod\$residuals
## A = 0.29177, p-value = 0.5704

equivalently:

ad.test(residuals(mod))

##
##  Anderson-Darling normality test
##
## data:  residuals(mod)
## A = 0.29177, p-value = 0.5704

Note that for repeated measures experiments, several analytical options are also available: for example through the Error() function in formula specification, or using mixed-model methods (lme4 or nlmepackages are the most used and known).
In this manual we won’t deep these arguments, because they would require a distinct section for each of them.

The post Raccoon | Ch 2.2 – 2 sample t-test and paired t appeared first on Quantide - R training & consulting.