Easy cell statistics for factorial designs

December 2, 2011

(This article was first published on statMethods blog, and kindly contributed to R-bloggers)

A common task when analyzing multi-group designs is obtaining descriptive statistics for various cells and cell combinations.

Many functions can help you accomplish this, including aggregate() and by() in the base installation, summaryBy() in the doBy package, and describeBy() (formerly describe.by()) in the psych package. However, I find it easiest to use the melt() and cast() functions in the reshape package.
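For comparison, here is a minimal sketch of how one such summary might look with base R's aggregate(), using the built-in mtcars data introduced below. The names `cell_stats` and `by_am` are illustrative choices and are not part of the original article.

```r
# Base-R sketch: N, mean, and SD of mpg by transmission type (am) in mtcars.
# `cell_stats` and `by_am` are hypothetical names used only for this example.
cell_stats <- function(x) c(N = length(x), Mean = mean(x), SD = sd(x))

by_am <- aggregate(mpg ~ am, data = mtcars, FUN = cell_stats)

# aggregate() stores the three statistics in a single matrix column named "mpg";
# do.call(data.frame, ...) flattens it into mpg.N, mpg.Mean, and mpg.SD
by_am <- do.call(data.frame, by_am)
print(by_am)
```

This works, but each new grouping or outcome variable means rewriting the formula and flattening the result again, which is part of why the melt() and cast() route can feel more convenient.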

As an example, consider the mtcars data frame (included in the base installation), which contains road test information on automobiles assessed in 1974. Suppose that you want the means, standard deviations, and sample sizes for miles per gallon (mpg), horsepower (hp), and weight (wt). You want these statistics at four levels: for all cars in the dataset, separately by transmission type (am), separately by number of gears (gear), and for the cells formed by crossing these two variables.

You can accomplish this with the following code:

```
options(digits = 3)
library(reshape)

# define and name the statistics of interest
stats <- function(x) c(N = length(x), Mean = mean(x), SD = sd(x))

# label the levels of the classification variables (optional)
mtcars$am   <- factor(mtcars$am, levels = c(0, 1),
                      labels = c("Automatic", "Manual"))
mtcars$gear <- factor(mtcars$gear, levels = c(3, 4, 5),
                      labels = c("3-Gear", "4-Gear", "5-Gear"))

# melt the dataset into long format (one row per car per outcome variable)
dfm <- melt(mtcars,
            # outcome variables
            measure.vars = c("mpg", "hp", "wt"),
            # classification variables
            id.vars = c("am", "gear"))

# statistics for the entire sample
cast(dfm, variable ~ ., stats)

# statistics for cells defined by transmission type
cast(dfm, am + variable ~ ., stats)

# statistics for cells defined by number of gears
cast(dfm, gear + variable ~ ., stats)

# statistics for cells defined by each am x gear combination
cast(dfm, am + gear + variable ~ ., stats)
```

The output is given below:

```
  variable  N   Mean     SD
1      mpg 32  20.09  6.027
2       hp 32 146.69 68.563
3       wt 32   3.22  0.978

         am variable  N   Mean     SD
1 Automatic      mpg 19  17.15  3.834
2 Automatic       hp 19 160.26 53.908
3 Automatic       wt 19   3.77  0.777
4    Manual      mpg 13  24.39  6.167
5    Manual       hp 13 126.85 84.062
6    Manual       wt 13   2.41  0.617

    gear variable  N   Mean      SD
1 3-Gear      mpg 15  16.11   3.372
2 3-Gear       hp 15 176.13  47.689
3 3-Gear       wt 15   3.89   0.833
4 4-Gear      mpg 12  24.53   5.277
5 4-Gear       hp 12  89.50  25.893
6 4-Gear       wt 12   2.62   0.633
7 5-Gear      mpg  5  21.38   6.659
8 5-Gear       hp  5 195.60 102.834
9 5-Gear       wt  5   2.63   0.819

          am   gear variable  N   Mean      SD
1  Automatic 3-Gear      mpg 15  16.11   3.372
2  Automatic 3-Gear       hp 15 176.13  47.689
3  Automatic 3-Gear       wt 15   3.89   0.833
4  Automatic 4-Gear      mpg  4  21.05   3.070
5  Automatic 4-Gear       hp  4 100.75  29.010
6  Automatic 4-Gear       wt  4   3.30   0.157
7     Manual 4-Gear      mpg  8  26.27   5.414
8     Manual 4-Gear       hp  8  83.88  24.175
9     Manual 4-Gear       wt  8   2.27   0.461
10    Manual 5-Gear      mpg  5  21.38   6.659
11    Manual 5-Gear       hp  5 195.60 102.834
12    Manual 5-Gear       wt  5   2.63   0.819
```
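
To see why the cast() formulas behave this way, it can help to inspect the melted data directly. The sketch below melts a fresh copy of mtcars under an illustrative name (`dfm_demo`, not from the original article) and spot-checks one set of cell means with base R.

```r
library(reshape)

# Melt mtcars again under a separate, hypothetical name so the sketch
# runs on its own and does not disturb the objects created above
dfm_demo <- melt(mtcars,
                 id.vars = c("am", "gear"),
                 measure.vars = c("mpg", "hp", "wt"))

# Long format: the two classification variables plus "variable" and "value"
str(dfm_demo)

# One row per car per outcome variable: 32 cars x 3 variables
nrow(dfm_demo)

# Spot check: mean mpg by transmission type, computed without cast()
with(subset(dfm_demo, variable == "mpg"), tapply(value, am, mean))
```

Each cast() call then simply groups the rows of the long data by the terms on the left-hand side of the formula and applies the statistics function to the `value` column within each group. The spot-check means should agree with the Mean column for mpg in the transmission-type table.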

The approach is easily generalized to any number of grouping variables (factors), dependent/outcome variables, and statistics, and gives you a powerful tool for slicing and dicing data.
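
As a rough sketch of that generalization, the example below adds a fourth statistic (the median), swaps in a different pair of outcome variables, and crosses transmission type with number of cylinders. The names `stats2`, `dfm2`, and `res`, and the choice of cyl and qsec, are assumptions made for illustration rather than anything from the original article.

```r
library(reshape)

# Hypothetical extension of the statistics function: add the median
stats2 <- function(x) {
  c(N = length(x), Mean = mean(x), SD = sd(x), Median = median(x))
}

# Assumed choices for this sketch: classify by am and cyl,
# summarize mpg and quarter-mile time (qsec)
dfm2 <- melt(mtcars,
             id.vars = c("am", "cyl"),
             measure.vars = c("mpg", "qsec"))

# One row per am x cyl x variable cell, one column per statistic
res <- cast(dfm2, am + cyl + variable ~ ., stats2)
print(res)
```

Only the statistics function, the melt() arguments, and the left-hand side of the cast() formula change; the overall pattern stays the same. Note that a cell containing a single observation would return NA for the standard deviation.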
