Almost every biomedical research paper requires a “Table 1: baseline patient
characteristics.” Many developers have published tools to help streamline the
construction of such tables. The qwraps2::summary_table function is my
contribution to the toolbox.
I have constructed hundreds of Table 1s while working as a biostatistics
consultant in a biostatistics department. I’ve also constructed many helper
functions to try and streamline the production of such tables. However, every
project, every data set, every lead author, every target journal, etc., will
present slightly different requirements for the contents and formatting of the
tables. As such, functions that tried to “do it all” required
constant modification to provide the needed output for each nuanced project.
The tableone package does a lot of good things and is a great tool for quickly
building baseline summary tables. What my experience has taught me is that each
row group, or even each row, might require some specific formatting. A function
that treats all continuous variables one way and all categorical variables
another way may work for many cases, but not all.
The approach I now take to building these tables is to explicitly define the
summary statistics I want for each variable in the data set, along with the
formatting of those summary statistics, in a way that is easy to use with one
or more grouping variables.
The function summary_table within my qwraps2 package is the tool I
and a few colleagues have started to rely on for building baseline patient
characteristic tables. (qwraps2, “quick wraps 2”, is a package of formatting
functions I’ve found useful for formatting results and generating some graphics
when authoring .Rmd and .Rnw files.)
Load and attach the qwraps2 namespace. We’ll set the qwraps2_markup option to
markdown. If this option is not set, qwraps2 uses
getOption("qwraps2_markup", "latex"), which makes LaTeX the default markup
language.
library(qwraps2)
options(qwraps2_markup = 'markdown') # default is latex
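The fallback described above is ordinary base R behavior: getOption returns its second argument when the option is unset. A minimal sketch using only base R (the variable names are illustrative and not part of qwraps2):

```r
# Temporarily unset the option, remembering any previous value.
old <- options(qwraps2_markup = NULL)

# With the option unset, getOption() returns the supplied default.
unset_value <- getOption("qwraps2_markup", "latex")

# Once the option is set, the default is ignored.
options(qwraps2_markup = "markdown")
set_value <- getOption("qwraps2_markup", "latex")

print(c(unset = unset_value, set = set_value))

# Restore whatever value was in place before this sketch ran.
options(old)
```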
We’ll use the mtcars dataset for our examples. Let’s report several summary
statistics for miles per gallon, number of cylinders, and weight of the
vehicles. The following summary is provided to illustrate the functions and
thus will include some summaries that would not be used in a publication.
The data summary we want will be:
- Miles Per Gallon
  - min
  - mean (sd)
  - median (iqr)
  - max
- Cylinders
  - mean
  - n (%) of four-cylinder engines
  - n (%) of six-cylinder engines
  - n (%) of eight-cylinder engines
- Weight
  - range
For cylinders we’ll report several things: the mean number of cylinders and the
count (%) of 4-, 6-, and 8-cylinder cars. In a publication we would likely not
report such a summary, treating cylinders as both a continuous and a categorical
value. However, doing so here helps to illustrate the flexibility of the
summary_table method.
Outlining the wanted summary statistics above as a list-of-lists helps to
explain the construction of the summary object constructed below. The
summary_table method takes two arguments: .data, a data.frame or a grouped_df
object, and summaries, a list-of-lists of right-hand-sided formulae defining
the summary statistics.
The construction of the summary table is achieved via dplyr::summarize_.
The mtcar_summaries object constructed below defines each needed row of the
summary table via a formula. I’ve included the qwraps2 namespace for
clarity.
mtcar_summaries <-
  list("Miles Per Gallon" =
         list("min:"             = ~ min(mpg),
              "mean (sd)"        = ~ qwraps2::mean_sd(mpg, denote_sd = "paren"),
              "median (IQR)"     = ~ qwraps2::median_iqr(mpg),
              "max:"             = ~ max(mpg)),
       "Cylinders:" =
         list("mean"             = ~ mean(cyl),
              "mean (formatted)" = ~ qwraps2::frmt(mean(cyl)),
              "4 cyl, n (%)"     = ~ qwraps2::n_perc0(cyl == 4),
              "6 cyl, n (%)"     = ~ qwraps2::n_perc0(cyl == 6),
              "8 cyl, n (%)"     = ~ qwraps2::n_perc0(cyl == 8)),
       "Weight" =
         list("Range"            = ~ paste(range(wt), collapse = ", "))
  )
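Each entry above is a one-sided formula, which stores an unevaluated expression rather than a value. As a rough illustration of why that is convenient, the right-hand side can be pulled out and evaluated against any data frame. This is a base R sketch with a hypothetical helper, eval_summaries, and is not how qwraps2 implements summary_table:

```r
# Hypothetical helper for illustration only; not part of qwraps2.
# Evaluates the right-hand side of each one-sided formula within `data`.
eval_summaries <- function(data, summaries) {
  lapply(summaries, function(row_group) {
    vapply(row_group, function(f) {
      # For a one-sided formula, f[[1]] is `~` and f[[2]] is the expression.
      as.character(eval(f[[2]], envir = data, enclos = environment(f)))
    }, character(1))
  })
}

# A small list-of-lists that uses only base R functions.
base_summaries <- list(
  "Miles Per Gallon" = list("min" = ~ min(mpg), "max" = ~ max(mpg)),
  "Cylinders"        = list("mean" = ~ mean(cyl))
)

res <- eval_summaries(mtcars, base_summaries)
print(res)
```

Because the expressions are stored unevaluated, the same list can be reused on the whole data set or on any subset of it.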
The table is constructed and printed with ease:
summary_table(mtcars, mtcar_summaries)
##
##
## | |mtcars (N = 32) |
## |:-----------------------------|:--------------------|
## |**Miles Per Gallon** | |
## | min: |10.4 |
## | mean (sd) |20.09 (6.03) |
## | median (IQR) |19.20 (15.43, 22.80) |
## | max: |33.9 |
## |**Cylinders:** | |
## | mean |6.1875 |
## | mean (formatted) |6.19 |
## | 4 cyl, n (%) |11 (34) |
## | 6 cyl, n (%) |7 (22) |
## | 8 cyl, n (%) |14 (44) |
## |**Weight** | |
## | Range |1.513, 5.424 |
The markdown output, rendered as HTML, is:

| | mtcars (N = 32) |
|:---|:---|
| **Miles Per Gallon** | |
| min: | 10.4 |
| mean (sd) | 20.09 (6.03) |
| median (IQR) | 19.20 (15.43, 22.80) |
| max: | 33.9 |
| **Cylinders:** | |
| mean | 6.1875 |
| mean (formatted) | 6.19 |
| 4 cyl, n (%) | 11 (34) |
| 6 cyl, n (%) | 7 (22) |
| 8 cyl, n (%) | 14 (44) |
| **Weight** | |
| Range | 1.513, 5.424 |
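To make the "n (%)" rows concrete: each is a count of TRUE values together with the corresponding percentage. The sketch below uses a hypothetical stand-in, n_perc_sketch, written in base R. It reproduces the cylinder values shown for mtcars, but it is only an illustration and makes no claim about how qwraps2::n_perc0 is implemented or how it rounds in general.

```r
# Hypothetical stand-in for illustration; qwraps2::n_perc0 is the real function.
n_perc_sketch <- function(x) {
  # x is a logical vector: sum() counts the TRUEs, mean() gives the proportion.
  sprintf("%d (%.0f)", sum(x), 100 * mean(x))
}

cyl_counts <- vapply(c(4, 6, 8),
                     function(k) n_perc_sketch(mtcars$cyl == k),
                     character(1))
print(cyl_counts)
```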
To show the same summary by a grouping variable, pass a grouped data frame.
Here we group by am (Transmission: 0 = automatic, 1 = manual):
summary_table(dplyr::group_by(mtcars, am), mtcar_summaries)
##
##
## | |am: 0 (N = 19) |am: 1 (N = 13) |
## |:-----------------------------|:--------------------|:--------------------|
## |**Miles Per Gallon** | | |
## | min: |10.4 |15.0 |
## | mean (sd) |17.15 (3.83) |24.39 (6.17) |
## | median (IQR) |17.30 (14.95, 19.20) |22.80 (21.00, 30.40) |
## | max: |24.4 |33.9 |
## |**Cylinders:** | | |
## | mean |6.947368 |5.076923 |
## | mean (formatted) |6.95 |5.08 |
## | 4 cyl, n (%) |3 (16) |8 (62) |
## | 6 cyl, n (%) |4 (21) |3 (23) |
## | 8 cyl, n (%) |12 (63) |2 (15) |
## |**Weight** | | |
## | Weight |2.465, 5.424 |1.513, 3.57 |
| | am: 0 (N = 19) | am: 1 (N = 13) |
|:---|:---|:---|
| **Miles Per Gallon** | | |
| min: | 10.4 | 15.0 |
| mean (sd) | 17.15 (3.83) | 24.39 (6.17) |
| median (IQR) | 17.30 (14.95, 19.20) | 22.80 (21.00, 30.40) |
| max: | 24.4 | 33.9 |
| **Cylinders:** | | |
| mean | 6.947368 | 5.076923 |
| mean (formatted) | 6.95 | 5.08 |
| 4 cyl, n (%) | 3 (16) | 8 (62) |
| 6 cyl, n (%) | 4 (21) | 3 (23) |
| 8 cyl, n (%) | 12 (63) | 2 (15) |
| **Weight** | | |
| Weight | 2.465, 5.424 | 1.513, 3.57 |
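Conceptually, the grouped table applies each summary once per level of the grouping variable. A base R sketch of that idea, using split() rather than dplyr (an assumption made here purely for illustration; the variable names are hypothetical), reproduces the column sizes and the "min:" row above:

```r
# Split the data into one data frame per transmission type.
groups <- split(mtcars, mtcars$am)

# Group sizes, matching the N in each column header.
group_n <- vapply(groups, nrow, integer(1))

# One summary statistic evaluated within each group.
group_min_mpg <- vapply(groups, function(d) min(d$mpg), numeric(1))

print(group_n)
print(group_min_mpg)
```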
Lastly, to build one table with a column for the whole data set and a column
for each transmission type, cbind the two summary tables:
cbind(summary_table(mtcars, mtcar_summaries),
      summary_table(dplyr::group_by(mtcars, am), mtcar_summaries))
##
##
## | |mtcars (N = 32) |am: 0 (N = 19) |am: 1 (N = 13) |
## |:-----------------------------|:--------------------|:--------------------|:--------------------|
## |**Miles Per Gallon** | | | |
## | min: |10.4 |10.4 |15.0 |
## | mean (sd) |20.09 (6.03) |17.15 (3.83) |24.39 (6.17) |
## | median (IQR) |19.20 (15.43, 22.80) |17.30 (14.95, 19.20) |22.80 (21.00, 30.40) |
## | max: |33.9 |24.4 |33.9 |
## |**Cylinders:** | | | |
## | mean |6.1875 |6.947368 |5.076923 |
## | mean (formatted) |6.19 |6.95 |5.08 |
## | 4 cyl, n (%) |11 (34) |3 (16) |8 (62) |
## | 6 cyl, n (%) |7 (22) |4 (21) |3 (23) |
## | 8 cyl, n (%) |14 (44) |12 (63) |2 (15) |
## |**Weight** | | | |
## | Range |1.513, 5.424 |2.465, 5.424 |1.513, 3.57 |
| | mtcars (N = 32) | am: 0 (N = 19) | am: 1 (N = 13) |
|:---|:---|:---|:---|
| **Miles Per Gallon** | | | |
| min: | 10.4 | 10.4 | 15.0 |
| mean (sd) | 20.09 (6.03) | 17.15 (3.83) | 24.39 (6.17) |
| median (IQR) | 19.20 (15.43, 22.80) | 17.30 (14.95, 19.20) | 22.80 (21.00, 30.40) |
| max: | 33.9 | 24.4 | 33.9 |
| **Cylinders:** | | | |
| mean | 6.1875 | 6.947368 | 5.076923 |
| mean (formatted) | 6.19 | 6.95 | 5.08 |
| 4 cyl, n (%) | 11 (34) | 3 (16) | 8 (62) |
| 6 cyl, n (%) | 7 (22) | 4 (21) | 3 (23) |
| 8 cyl, n (%) | 14 (44) | 12 (63) | 2 (15) |
| **Weight** | | | |
| Range | 1.513, 5.424 | 2.465, 5.424 | 1.513, 3.57 |
Using dplyr::group_by allows you to build the table with more than one
grouping variable. For example:
cbind(summary_table(mtcars, mtcar_summaries),
      summary_table(dplyr::group_by(mtcars, am, vs), mtcar_summaries))
##
##
## | |mtcars (N = 32) |am: 0 vs: 0 (N = 12) |am: 0 vs: 1 (N = 7) |am: 1 vs: 0 (N = 6) |am: 1 vs: 1 (N = 7) |
## |:-----------------------------|:--------------------|:--------------------|:--------------------|:--------------------|:--------------------|
## |**Miles Per Gallon** | | | | | |
## | min: |10.4 |10.4 |17.8 |15.0 |21.4 |
## | mean (sd) |20.09 (6.03) |15.05 (2.77) |20.74 (2.47) |19.75 (4.01) |28.37 (4.76) |
## | median (IQR) |19.20 (15.43, 22.80) |15.20 (14.05, 16.62) |21.40 (18.65, 22.15) |20.35 (16.78, 21.00) |30.40 (25.05, 31.40) |
## | max: |33.9 |19.2 |24.4 |26.0 |33.9 |
## |**Cylinders:** | | | | | |
## | mean |6.1875 |8.000000 |5.142857 |6.333333 |4.000000 |
## | mean (formatted) |6.19 |8.00 |5.14 |6.33 |4.00 |
## | 4 cyl, n (%) |11 (34) |0 (0) |3 (43) |1 (17) |7 (100) |
## | 6 cyl, n (%) |7 (22) |0 (0) |4 (57) |3 (50) |0 (0) |
## | 8 cyl, n (%) |14 (44) |12 (100) |0 (0) |2 (33) |0 (0) |
## |**Weight** | | | | | |
## | Range |1.513, 5.424 |3.435, 5.424 |2.465, 3.46 |2.14, 3.57 |1.513, 2.78 |
| | mtcars (N = 32) | am: 0 vs: 0 (N = 12) | am: 0 vs: 1 (N = 7) | am: 1 vs: 0 (N = 6) | am: 1 vs: 1 (N = 7) |
|:---|:---|:---|:---|:---|:---|
| **Miles Per Gallon** | | | | | |
| min: | 10.4 | 10.4 | 17.8 | 15.0 | 21.4 |
| mean (sd) | 20.09 (6.03) | 15.05 (2.77) | 20.74 (2.47) | 19.75 (4.01) | 28.37 (4.76) |
| median (IQR) | 19.20 (15.43, 22.80) | 15.20 (14.05, 16.62) | 21.40 (18.65, 22.15) | 20.35 (16.78, 21.00) | 30.40 (25.05, 31.40) |
| max: | 33.9 | 19.2 | 24.4 | 26.0 | 33.9 |
| **Cylinders:** | | | | | |
| mean | 6.1875 | 8.000000 | 5.142857 | 6.333333 | 4.000000 |
| mean (formatted) | 6.19 | 8.00 | 5.14 | 6.33 | 4.00 |
| 4 cyl, n (%) | 11 (34) | 0 (0) | 3 (43) | 1 (17) | 7 (100) |
| 6 cyl, n (%) | 7 (22) | 0 (0) | 4 (57) | 3 (50) | 0 (0) |
| 8 cyl, n (%) | 14 (44) | 12 (100) | 0 (0) | 2 (33) | 0 (0) |
| **Weight** | | | | | |
| Range | 1.513, 5.424 | 3.435, 5.424 | 2.465, 3.46 | 2.14, 3.57 | 1.513, 2.78 |
The formatting of the output is controlled by the
qwraps2:::print.qwraps2_summary_table and qwraps2::qable functions.
args(qwraps2:::print.qwraps2_summary_table)
## function (x, rgroup = attr(x, "rgroups"), rnames = rownames(x),
## cnames = colnames(x), ...)
## NULL
args(qwraps2::qable)
## function (x, rtitle, rgroup, rnames = rownames(x), cnames = colnames(x),
## markup = getOption("qwraps2_markup", "latex"), ...)
## NULL
The print method for qwraps2_summary_table objects calls qable, which is a
wrapper around knitr::kable. The row groups, rgroup, row names, rnames,
and column names, cnames, are explicitly set in the
print.qwraps2_summary_table method. The ... passes additional arguments to
qwraps2::qable, which can then continue to pass them to knitr::kable.
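The chain of ... forwarding described here is a general R idiom. The sketch below mimics it with three hypothetical stand-in functions (print_sketch, qable_sketch, and render_sketch are invented for illustration and are not the qwraps2 or knitr implementations): each layer picks off the arguments it names and passes the rest along.

```r
# All three functions are hypothetical stand-ins for illustration.

# Plays the role of knitr::kable: consumes `align`.
render_sketch <- function(x, align = "l", ...) {
  paste0("[align=", align, "] ", x)
}

# Plays the role of qwraps2::qable: consumes `rtitle`, forwards the rest.
qable_sketch <- function(x, rtitle = "", ...) {
  render_sketch(paste0(rtitle, ": ", x), ...)
}

# Plays the role of the print method: forwards everything it does not name.
print_sketch <- function(x, ...) {
  qable_sketch(x, ...)
}

out <- print_sketch("table body",
                    rtitle = "Vehicle Characteristics",
                    align = "lcc")
print(out)
```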
A quick example of modifying a table:
by_am <- summary_table(dplyr::group_by(mtcars, am), mtcar_summaries)
print(by_am,
      cnames = c("_Automatic_", "_Manual_"),  # am: 0 = automatic, am: 1 = manual
      rtitle = "Vehicle Characteristics",
      align = "lcc")
##
##
## |Vehicle Characteristics | _Automatic_ | _Manual_ |
## |:-----------------------------|:--------------------:|:--------------------:|
## |**Miles Per Gallon** | | |
## | min: | 10.4 | 15.0 |
## | mean (sd) | 17.15 (3.83) | 24.39 (6.17) |
## | median (IQR) | 17.30 (14.95, 19.20) | 22.80 (21.00, 30.40) |
## | max: | 24.4 | 33.9 |
## |**Cylinders:** | | |
## | mean | 6.947368 | 5.076923 |
## | mean (formatted) | 6.95 | 5.08 |
## | 4 cyl, n (%) | 3 (16) | 8 (62) |
## | 6 cyl, n (%) | 4 (21) | 3 (23) |
## | 8 cyl, n (%) | 12 (63) | 2 (15) |
## |**Weight** | | |
## | Weight | 2.465, 5.424 | 1.513, 3.57 |

| Vehicle Characteristics | _Automatic_ | _Manual_ |
|:---|:---:|:---:|
| **Miles Per Gallon** | | |
| min: | 10.4 | 15.0 |
| mean (sd) | 17.15 (3.83) | 24.39 (6.17) |
| median (IQR) | 17.30 (14.95, 19.20) | 22.80 (21.00, 30.40) |
| max: | 24.4 | 33.9 |
| **Cylinders:** | | |
| mean | 6.947368 | 5.076923 |
| mean (formatted) | 6.95 | 5.08 |
| 4 cyl, n (%) | 3 (16) | 8 (62) |
| 6 cyl, n (%) | 4 (21) | 3 (23) |
| 8 cyl, n (%) | 12 (63) | 2 (15) |
| **Weight** | | |
| Weight | 2.465, 5.424 | 1.513, 3.57 |
I hope that some readers will find this approach to building summary tables to
be useful. If you find bugs or have suggestions on how to extend and improve
this tool, please create an issue on GitHub.