**Software for Exploratory Data Analysis and Statistical Modelling**, and kindly contributed to R-bloggers)

Analysis of Variance (**ANOVA**) is a commonly used statistical technique for investigating data by comparing the means of subsets of the data. The base case is the one-way **ANOVA** which is an extension of two-sample t test for independent groups covering situations where there are more than two groups being compared.

In one-way **ANOVA** the data is sub-divided into groups based on a single classification factor and the standard terminology used to describe the set of factor levels is **treatment** even though this might not always have meaning for the particular application. There is variation in the measurements taken on the individual components of the data set and ANOVA investigates whether this variation can be explained by the grouping introduced by the classification factor.

As an example we consider one of the data sets available with R relating to an experiment into plant growth. The purpose of the experiment was to compare the yields on the plants for a control group and two treatments of interest. The response variable was a measurement taken on the dried weight of the plants.

The first step in the investigation is to take a copy of the data frame so that we can make some adjustments as necessary while leaving the original data alone. We use the **factor** function to re-define the labels of the **group** variables that will appear in the output and graphs:

plant.df = PlantGrowth plant.df$group = factor(plant.df$group, labels = c("Control", "Treatment 1", "Treatment 2"))

The **labels** argument is a list of names corresponding to the levels of the **group** factor variable.

A boxplot of the distributions of the dried weights for the three competing groups is created using the **ggplot** package:

require(ggplot2) ggplot(plant.df, aes(x = group, y = weight)) + geom_boxplot(fill = "grey80", colour = "blue") + scale_x_discrete() + xlab("Treatment Group") + ylab("Dried weight of plants")

The **geom_boxplot()** option is used to specify background and outline colours for the boxes. The axis labels are created with the **xlab()** and **ylab()** options. The plot that is produce looks like this:

Initial inspection of the data suggests that there are differences in the dried weight for the two treatments but it is not so clear cut to determine whether the treatments are different to the control group. To investigate these differences we fit the one-way ANOVA model using the **lm** function and look at the parameter estimates and standard errors for the treatment effects. The function call is:

plant.mod1 = lm(weight ~ group, data = plant.df)

We save the model fitted to the data in an object so that we can undertake various actions to study the goodness of the fit to the data and other model assumptions. The standard summary of a **lm** object is used to produce the following output:

> summary(plant.mod1) Call: lm(formula = weight ~ group, data = plant.df) Residuals: Min 1Q Median 3Q Max -1.0710 -0.4180 -0.0060 0.2627 1.3690 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.0320 0.1971 25.527 <2e-16 *** groupTreatment 1 -0.3710 0.2788 -1.331 0.1944 groupTreatment 2 0.4940 0.2788 1.772 0.0877 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.6234 on 27 degrees of freedom Multiple R-squared: 0.2641, Adjusted R-squared: 0.2096 F-statistic: 4.846 on 2 and 27 DF, p-value: 0.01591

The model output indicates some evidence of a difference in the average growth for the 2nd treatment compared to the control group. An analysis of variance table for this model can be produced via the **anova** command:

> anova(plant.mod1) Analysis of Variance Table Response: weight Df Sum Sq Mean Sq F value Pr(>F) group 2 3.7663 1.8832 4.8461 0.01591 * Residuals 27 10.4921 0.3886 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This table confirms that there are differences between the groups which were highlighted in the model summary. The function **confint** is used to calculate confidence intervals on the treatment parameters, by default 95% confidence intervals:

> confint(plant.mod1) 2.5 % 97.5 % (Intercept) 4.62752600 5.4364740 groupTreatment 1 -0.94301261 0.2010126 groupTreatment 2 -0.07801261 1.0660126

The model residuals can be plotted against the fitted values to investigate the model assumptions. First we create a data frame with the fitted values, residuals and treatment identifiers:

plant.mod = data.frame(Fitted = fitted(plant.mod1), Residuals = resid(plant.mod1), Treatment = plant.df$group)

and then produce the plot:

ggplot(plant.mod, aes(Fitted, Residuals, colour = Treatment)) + geom_point()

which produces this graph:

We can see that there is no major problem with the diagnostic plot but some evidence of different variabilities in the spread of the residuals for the three treatment groups.

**leave a comment**for the author, please follow the link and comment on his blog:

**Software for Exploratory Data Analysis and Statistical Modelling**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...