Flowcharts that belong in the analysis pipeline

Max Gordon

[This article was first published on R – G-Forge, and kindly contributed to R-bloggers].

Flowcharts should be beautiful, just like this CC photo from Wasif Malik.

Thanks to Alan Haynes and
his excellent suggestions, I have spent some time improving the
flowchart component of the Gmisc package. The result is not meant to be
another decorative diagram tool. It is meant for the kind of figures
researchers keep redrawing by hand: CONSORT diagrams, cohort derivation
charts, screening flows, data-cleaning audit trails, and the small but
important maps that explain how a study population came to be.

I like tools such as Excalidraw
for thinking. They are fast, expressive, and excellent for
conversations. But when a figure enters a manuscript, the needs change.
Counts must be updated. Exclusions must match the analysis script.
Treatment arms should align. Follow-up losses should be traceable. The
figure should survive reviewer round three without becoming a manual
editing project.

That is the space where flowchart() in Gmisc is useful:
the diagram becomes part of the research workflow.

The figure above is the kind of chart I want Gmisc to make feel
natural. It is still a grid graphic in R, but it has the visual grammar
of a manuscript figure: grouped arms, side exclusions, count badges,
phase labels, and arrows that do not need nudging after every text
change.

Every figure in this post is generated by code, and the code is
included below each image. They all share the same two-line
preamble:

library(Gmisc)
library(grid)

To save any of them to a file, wrap the call in a graphics device,
e.g.

png("01-consort-color.png", width = 9, height = 7, units = "in", res = 180, bg = "white")
# ... the flowchart code ...
dev.off()
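
If several figures are saved this way, the device handling can be wrapped once. The sketch below is one possible way to do that. Note that save_figure() is a hypothetical helper written for this pattern, not a Gmisc function, and its size defaults simply copy the call above.

```r
library(grid)

# Hypothetical helper, not part of Gmisc: open a PNG device, run the
# drawing code, and close the device even if the drawing code fails.
save_figure <- function(path, draw, width = 9, height = 7, res = 180) {
  png(path, width = width, height = height, units = "in", res = res, bg = "white")
  on.exit(dev.off(), add = TRUE)
  draw()
  invisible(path)
}

# Usage with a placeholder drawing; a flowchart(...) |> print() call
# would go inside the function instead.
out <- save_figure(tempfile(fileext = ".png"), function() {
  grid.newpage()
  grid.text("placeholder figure")
})
```

The on.exit() call matters mostly during iteration: a failed drawing call otherwise leaves the device open and later plots end up in the wrong file.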

The CONSORT figure above is produced by:

options(boxGrobTxtPadding = unit(3, "mm"))

box_fill <- gpar(fill = "#DDEEFF", col = "#336699", lwd = 1.5)
con_gp <- gpar(col = "#336699", lwd = 1.5, fill = "#336699")
side_gp <- gpar(col = "#CC8800", lwd = 1.2, fill = "#CC8800")
excl_fill <- gpar(fill = "#FFF8E1", col = "#CC8800", lwd = 1.2)
heading_gp <- gpar(fill = "#C8DAF7", col = "#2F5F9F", lwd = 1.1)
badge_gp <- gpar(fill = "#336699", col = NA)
badge_txt_gp <- gpar(col = "white", cex = 0.65)

main_arm_margin <- 0.28
main_x <- 0.5
exclusion_margin <- 0.05

grid.newpage()
flowchart(
  assessed = boxGrob(
    "Patients assessed for eligibility",
    x = main_x, box_gp = box_fill,
    badge_label = "840", badge_gp = badge_gp, badge_txt_gp = badge_txt_gp
  ),
  randomised = boxGrob(
    "Randomised",
    x = main_x, box_gp = box_fill,
    badge_label = "126", badge_gp = badge_gp, badge_txt_gp = badge_txt_gp
  ),
  arms = list(
    cast = boxGrob(
      "Randomised to\ncast immobilisation",
      box_gp = box_fill,
      badge_label = "62", badge_gp = badge_gp, badge_txt_gp = badge_txt_gp
    ),
    surgical = boxGrob(
      "Randomised to\nsurgery",
      box_gp = box_fill,
      badge_label = "64", badge_gp = badge_gp, badge_txt_gp = badge_txt_gp
    )
  ),
  lost = list(
    lost_cast = boxGrob(
      "Lost to follow-up (n = 2)\n  1 no response\n  1 other surgery",
      just = "left", box_gp = excl_fill
    ),
    lost_surgical = boxGrob(
      "Lost to follow-up (n = 3)\n  2 no response\n  1 other surgery",
      just = "left", box_gp = excl_fill
    )
  ),
  analysis = list(
    analysis_cast = boxGrob(
      "Included in\nprimary analysis",
      box_gp = box_fill,
      badge_label = "60", badge_gp = badge_gp, badge_txt_gp = badge_txt_gp
    ),
    analysis_surgical = boxGrob(
      "Included in\nprimary analysis",
      box_gp = box_fill,
      badge_label = "61", badge_gp = badge_gp, badge_txt_gp = badge_txt_gp
    )
  )
) |>
  spread(axis = "y", margin = unit(5, "mm"), exclude = "lost") |>
  align(
    axis = "y",
    subelement = "lost",
    references = list("arms", "analysis")
  ) |>
  equalizeWidths(subelement = list("arms", "analysis")) |>
  spread(axis = "x", subelement = "arms", margin = main_arm_margin) |>
  spread(axis = "x", subelement = "analysis", margin = main_arm_margin) |>
  spread(axis = "x", subelement = "lost", margin = exclusion_margin) |>
  phaseLabel("arms", "Allocation", box_gp = heading_gp) |>
  phaseLabel("analysis", "Analysis", box_gp = heading_gp) |>
  insert(list(excluded = boxGrob(
    "Excluded (n = 714)\n  477 stable ankle mortise\n   64 incongruent ankle mortise\n   30 previous serious trauma\n  143 other reasons",
    just = "left", box_gp = excl_fill
  )), after = "assessed") |>
  move(subelement = "excluded", x = 1 - exclusion_margin, just = "right") |>
  align(
    axis = "y",
    subelement = "excluded",
    references = list("assessed", "randomised")
  ) |>
  connect("assessed", "excluded", type = "L", lty_gp = side_gp, arrow_size = 3, smooth = TRUE) |>
  connect("randomised", "arms", type = "N", lty_gp = con_gp, arrow_size = 3, smooth = TRUE) |>
  connect("assessed", "randomised", type = "v", lty_gp = con_gp, arrow_size = 3, smooth = TRUE) |>
  connect("arms", "lost", type = "L", lty_gp = side_gp, arrow_size = 3, smooth = TRUE) |>
  connect("arms", "analysis", type = "v", lty_gp = con_gp, arrow_size = 3) |>
  print()
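
Because the counts live in code, they can also be checked before anything is drawn. The following is a minimal sketch in base R that collects the numbers from the CONSORT example in one list and asserts that they reconcile. The list layout and names are assumptions made for this illustration, not something Gmisc requires.

```r
# Counts copied from the CONSORT example above; the list structure is
# an assumption made for this sketch.
counts <- list(
  assessed   = 840,
  excluded   = c(stable = 477, incongruent = 64, trauma = 30, other = 143),
  randomised = 126,
  arms       = c(cast = 62, surgical = 64),
  lost       = c(cast = 2, surgical = 3),
  analysed   = c(cast = 60, surgical = 61)
)

# Each stage must account for everyone who entered it.
stopifnot(
  counts$assessed - sum(counts$excluded) == counts$randomised,
  sum(counts$arms) == counts$randomised,
  all(counts$arms - counts$lost == counts$analysed)
)
```

If a database refresh changes one number and not its neighbours, the script stops before a mismatched figure is produced.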

A figure that can change with the analysis

The biggest advantage of drawing a flowchart in code is not that code
is elegant. It is that research figures are rarely finished when we
think they are.

The inclusion count changes after a database refresh. A reviewer asks
for a sensitivity analysis. Someone notices that two exclusion
categories should be split. The statistician reruns the cohort
definition. If the diagram is hand-drawn, every one of those changes
creates a small risk of mismatch between the paper and the actual
analysis.

If the chart is generated, it can sit beside the code that produced
the numbers.

flowchart(...) |>
  spread(axis = "y") |>
  spread(subelement = "arms", axis = "x") |>
  connect("randomised", "arms", type = "N")

That is the mental model: define boxes, arrange boxes, connect boxes.
The final result can still be polished, but it remains reproducible.
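
As an illustration of the chart sitting beside the numbers, the box text itself can be derived from the data instead of being typed in. The sketch below uses base R only. The screening table is a toy reconstruction from the counts in the CONSORT example, and the column name reason is an assumption.

```r
# Toy screening log rebuilt from the counts in the CONSORT example;
# a real project would read this from the study database instead.
screening <- data.frame(
  reason = rep(
    c(NA, "stable ankle mortise", "incongruent ankle mortise",
      "previous serious trauma", "other reasons"),
    times = c(126, 477, 64, 30, 143)
  )
)

n_assessed <- nrow(screening)
n_randomised <- sum(is.na(screening$reason))

# Tally exclusion reasons in the order they first appear.
reasons <- screening$reason[!is.na(screening$reason)]
reason_counts <- table(factor(reasons, levels = unique(reasons)))

# Build the multi-line label with right-aligned counts.
excluded_label <- paste0(
  "Excluded (n = ", sum(reason_counts), ")\n",
  paste0("  ", formatC(as.integer(reason_counts), width = 3), " ",
         names(reason_counts), collapse = "\n")
)

# excluded_label can now be passed as the text of the exclusion box, and
# as.character(n_assessed) as a badge label.
```

When the extract changes, rerunning the script updates both the analysis and the text in the boxes.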

Cohort derivation from data people already have

Most clinical researchers do not start with a perfect trial flow.
They start with a registry extract, an EHR table, a REDCap project, an
Excel sheet from a collaborator, or a combination of all of them.

That workflow deserves a clear figure too.

This kind of diagram is useful because it does not only show who was
included. It shows how the study base was assembled: what sources were
linked, where exclusions entered, and which analytic populations came
out at the end.

I find this especially helpful for observational studies. A table can
report baseline characteristics, but a flowchart explains the
construction of the cohort. It gives the reader a quick answer to: “What
happened between the raw data and the model?”

source_gp <- gpar(fill = "#E8F5E9", col = "#2E7D32", lwd = 1.4)
link_gp <- gpar(fill = "#E3F2FD", col = "#1565C0", lwd = 1.4)
cohort_gp <- gpar(fill = "#FFF8E1", col = "#C69214", lwd = 1.4)
side_gp <- gpar(fill = "#FCE4EC", col = "#AD1457", lwd = 1.2)
final_gp <- gpar(fill = "#EDE7F6", col = "#512DA8", lwd = 1.4)
con_gp <- gpar(col = "#455A64", fill = "#455A64", lwd = 1.4)
excl_gp <- gpar(col = "#AD1457", fill = "#AD1457", lwd = 1.2)

source_margin <- 0.05
output_margin <- 0.05
main_x <- 0.5
main_path <- list("linked", "cohort")
exclusion_right <- 0.95
exclusion_gap <- unit(5, "pt")
exclusion_line_offset <- unit(14, "mm")

grid.newpage()
flowchart(
  sources = list(
    ehr = boxGrob("Hospital EHR\nadmissions\nn = 241,820",
                  box_gp = source_gp),
    registry = boxGrob("Quality registry\nprocedures\nn = 38,420",
                       box_gp = source_gp),
    deaths = boxGrob("Population registry\nfollow-up\nn = 100%",
                     box_gp = source_gp)
  ),
  linked = boxGrob(
    "Linked study base\nunique patients with follow-up\nn = 29,614",
    x = main_x,
    box_gp = link_gp,
    width = unit(72, "mm")
  ),
  exclusions = list(
    prior = boxGrob("Previous diagnosis\nn = 4,108",
                    just = "left", box_gp = side_gp,
                    width = unit(42, "mm")),
    missing = boxGrob("Missing key\ncovariates\nn = 962",
                      just = "left", box_gp = side_gp,
                      width = unit(42, "mm")),
    outside = boxGrob("Outside study\nwindow\nn = 1,327",
                      just = "left", box_gp = side_gp,
                      width = unit(42, "mm"))
  ),
  cohort = boxGrob(
    "Primary cohort\nn = 23,217",
    box_gp = cohort_gp,
    width = unit(62, "mm")
  ),
  outputs = list(
    primary = boxGrob("Primary analysis\ncomplete case\nn = 22,144",
                      box_gp = final_gp),
    imputed = boxGrob("Sensitivity analysis\nmultiple imputation\nn = 23,217",
                      box_gp = final_gp),
    negative = boxGrob("Negative control\noutcome check\nn = 21,903",
                       box_gp = final_gp)
  )
) |>
  spread(axis = "y", margin = unit(8, "mm"), exclude = "exclusions") |>
  equalizeWidths(subelement = main_path) |>
  align(axis = "x", subelement = "cohort", reference = "linked") |>
  move(subelement = c("exclusions", "prior"),
       y = position("linked", position = "center", type = "y") - exclusion_gap,
       just = c(NA, "top")) |>
  move(subelement = c("exclusions", "missing"),
       y = position(c("exclusions", "prior"), position = "bottom", type = "y") - exclusion_gap,
       just = c(NA, "top")) |>
  move(subelement = c("exclusions", "outside"),
       y = position(c("exclusions", "missing"), position = "bottom", type = "y") - exclusion_gap,
       just = c(NA, "top")) |>
  equalizeWidths(subelement = "sources") |>
  equalizeWidths(subelement = "exclusions", width = unit(42, "mm")) |>
  equalizeWidths(subelement = "outputs") |>
  move(subelement = "exclusions", x = exclusion_right, just = "right") |>
  spread(axis = "x", subelement = "sources", margin = source_margin, type = "center") |>
  spread(axis = "x", subelement = "outputs", margin = output_margin, type = "center") |>
  connect("sources", "linked", type = "vertical_axis", lty_gp = con_gp, arrow_size = 3) |>
  connect("linked", "cohort", type = "v", lty_gp = con_gp, arrow_size = 3, smooth = TRUE) |>
  connect("linked", "exclusions",
          type = "side", lty_gp = excl_gp, arrow_size = 3,
          side = "right", end_side = "left",
          side_route = "outside",
          side_offset = exclusion_line_offset,
          label = "Excluded\nn = 6,397",
          label_gp = gpar(col = "#AD1457", cex = 0.8)) |>
  connect("cohort", "outputs", type = "N", lty_gp = con_gp, arrow_size = 3, smooth = TRUE) |>
  print()
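
The numbers in a cohort figure like this usually come from applying exclusion criteria one after another. A minimal base R sketch of that bookkeeping follows. The study base is a toy reconstruction from the counts in the figure, the flag column names are assumptions, and each patient carries at most one flag to keep the example short.

```r
# Toy study base rebuilt from the counts in the figure above.
flag <- rep(c("none", "prior", "missing", "outside"),
            times = c(23217, 4108, 962, 1327))
study_base <- data.frame(
  prior_diagnosis    = flag == "prior",
  missing_covariates = flag == "missing",
  outside_window     = flag == "outside"
)

# Apply the criteria in order and record how many rows each one removes.
remaining <- study_base
excluded <- integer(0)
for (criterion in names(study_base)) {
  hit <- remaining[[criterion]]
  excluded[criterion] <- sum(hit)
  remaining <- remaining[!hit, , drop = FALSE]
}

# Labels for the side connector and the cohort box.
side_label <- sprintf("Excluded\nn = %s", format(sum(excluded), big.mark = ","))
cohort_label <- sprintf("Primary cohort\nn = %s", format(nrow(remaining), big.mark = ","))
```

Applying the criteria sequentially means a patient who meets several criteria is counted once, under the first criterion that removes them, which is the convention most derivation charts follow.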

The audit trail is part of the story

Another common workflow is less glamorous but just as important: data
validation.

Many research projects have a small data-engineering pipeline even
when nobody calls it that. Data arrive through forms, imports, manual
entry, and collaborator spreadsheets. Then someone checks missing
fields, duplicates, impossible dates, inconsistent IDs, and
outliers.

That process is often hidden in prose. A compact flowchart can make
it visible without turning the methods section into a systems manual. It
is also a useful project-management figure: the same chart can be shown
to clinicians, data managers, statisticians, and co-authors.

Note how the box shapes carry meaning here. Ellipses, databases,
documents, tapes, and diamonds all come from dedicated
box*Grob() helpers:

input_gp <- gpar(fill = "#F3F8FF", col = "#3B73C5", lwd = 1.3)
process_gp <- gpar(fill = "#FFF4C7", col = "#C69214", lwd = 1.3)
issue_gp <- gpar(fill = "#FCE4EC", col = "#AD1457", lwd = 1.2)
output_gp <- gpar(fill = "#E8F5E9", col = "#2E7D32", lwd = 1.3)
note_gp <- gpar(fill = "#FFFFFF", col = "#607D8B", lwd = 1, lty = 2)
con_gp <- gpar(col = "#555555", fill = "#555555", lwd = 1.3)
issue_con_gp <- gpar(col = "#AD1457", fill = "#AD1457", lwd = 1.1)

main_path <- list("validation", "clean")
issue_column_x <- 0.08
log_column_x <- 0.92
input_shape_width <- unit(42, "mm")
input_shape_height <- unit(24, "mm")
issue_shape_width <- unit(48, "mm")
issue_shape_height <- unit(14, "mm")

grid.newpage()
flowchart(
  inputs = list(
    web = boxEllipseGrob("REDCap\nform",
                         width = input_shape_width,
                         height = input_shape_height,
                         box_gp = input_gp),
    import = boxDatabaseGrob("CSV\nimport",
                             width = input_shape_width,
                             height = input_shape_height,
                             box_gp = input_gp),
    manual = boxDocumentGrob("Manual\nentry",
                             width = input_shape_width,
                             height = input_shape_height,
                             box_gp = input_gp)
  ),
  shape_note = boxGrob(
    "Shape indicates\nsource type",
    just = "left",
    width = unit(36, "mm"),
    box_gp = note_gp
  ),
  validation = boxTapeGrob(
    "Validation queue\nIDs, dates, ranges, missingness",
    width = unit(.58, "npc"),
    height = unit(.14, "npc"),
    box_gp = process_gp
  ),
  issues = list(
    missing = boxDiamondGrob("Missing\nfields",
                             width = issue_shape_width,
                             height = issue_shape_height,
                             box_gp = issue_gp),
    duplicate = boxDiamondGrob("Duplicate\nID",
                               width = issue_shape_width,
                               height = issue_shape_height,
                               box_gp = issue_gp),
    outlier = boxDiamondGrob("Outlier\nvalue",
                             width = issue_shape_width,
                             height = issue_shape_height,
                             box_gp = issue_gp)
  ),
  log = boxDocumentsGrob(
    "Issue log\nqueries sent\nchanges reviewed",
    width = unit(48, "mm"),
    height = unit(.44, "npc"),
    box_gp = issue_gp
  ),
  clean = boxDatabaseGrob(
    "Analysis-ready dataset\nlocked for report",
    width = unit(.44, "npc"),
    height = unit(.16, "npc"),
    box_gp = output_gp
  )
) |>
  spread(axis = "y", margin = unit(7, "mm"),
         exclude = list("issues", "shape_note")) |>
  spread(axis = "x", subelement = "inputs",
         from = 0, to = 0.7, margin = 0.05,
         type = "center") |>
  equalizeWidths(subelement = main_path) |>
  align(axis = "x", subelement = "validation", reference = "inputs") |>
  align(axis = "x", subelement = "clean", reference = "validation") |>
  align(axis = "y", subelement = "shape_note", reference = "inputs") |>
  align(axis = "y", subelement = "log",
        references = list("validation", "clean")) |>
  spread(axis = "y", subelement = "issues",
         from = position("log", position = "top", type = "y"),
         to = position("log", position = "bottom", type = "y"),
         margin = unit(2, "mm")) |>
  move(subelement = "shape_note", x = 0.95, just = "right") |>
  move(subelement = "issues", x = issue_column_x, just = "left") |>
  move(subelement = "log", x = log_column_x, just = "right") |>
  connect("inputs", "validation",
          type = "vertical_axis", lty_gp = con_gp, arrow_size = 3) |>
  connect("issues", "log",
          type = "horizontal_axis", lty_gp = issue_con_gp, arrow_size = 3) |>
  connect("validation", "clean", type = "vertical_axis",
          lty_gp = con_gp, arrow_size = 3, smooth = TRUE) |>
  print()
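
The issue boxes in such a chart can carry counts that come straight from the validation checks. The sketch below shows one way to tally them in base R. The records, the column names, and the plausibility range for systolic blood pressure are all made up for this illustration.

```r
# Made-up records with one missing date, one duplicated ID, and one
# implausible blood pressure value.
records <- data.frame(
  id = c(1, 2, 2, 3, 4, 5),
  visit_date = as.Date(c("2024-01-10", "2024-01-12", "2024-01-12",
                         NA, "2024-02-01", "2024-02-03")),
  sbp = c(120, 135, 135, 128, 410, 118)
)

is_missing   <- !complete.cases(records)
is_duplicate <- duplicated(records$id)
is_outlier   <- !is.na(records$sbp) & (records$sbp < 60 | records$sbp > 260)

issues <- c(
  missing   = sum(is_missing),
  duplicate = sum(is_duplicate),
  outlier   = sum(is_outlier)
)

# Rows that pass every check form the analysis-ready dataset.
n_clean <- sum(!(is_missing | is_duplicate | is_outlier))

# Text for the issue boxes, e.g. "Missing\nfields\nn = 1".
issue_labels <- sprintf("%s\nn = %d",
                        c("Missing\nfields", "Duplicate\nID", "Outlier\nvalue"),
                        issues)
```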

Follow-up is rarely just down the page

Longitudinal studies often need to distinguish between people who are
lost, censored, withdrawn, dead, or still contributing information up to
a time point. A simple downward flow can imply that everyone leaving a
box disappears from the analysis, which is not always true.

Dotted return arrows are useful for this. They can show that a
participant left direct follow-up but still contributes information to
the final analysis up to censoring. That is a visual detail, but it
communicates an analytical idea.

This is where small flowchart improvements matter. Not because the
reader cares about the drawing API, but because the figure can express
the study design more faithfully.

options(boxGrobTxtPadding = unit(1, "mm"))

main_gp <- gpar(fill = "#FFFFFF", col = "#263238", lwd = 1.2)
arm_gp <- gpar(fill = "#E3F2FD", col = "#1565C0", lwd = 1.3)
ex_gp <- gpar(fill = "#FFF8E1", col = "#C69214", lwd = 1.2)
con_gp <- gpar(col = "#1565C0", fill = "#1565C0", lwd = 1.3)
side_gp <- gpar(col = "#C69214", fill = "#C69214", lwd = 1.2)
dotted_gp <- gpar(col = "#455A64", fill = "#455A64", lwd = 1.1, lty = 2)

arm_from <- .24
arm_to <- .76
box_width <- unit(54, "mm")
ex_width <- unit(45, "mm")
ex_page_margin <- 0.03           # excluded columns hug the page edge by this npc margin
side_offset <- unit(4, "mm")     # side branches step out this far before turning to the excluded box
fan_in_offset <- unit(2, "mm")   # dotted return line runs 2 mm outside the excluded boxes

grid.newpage()
flowchart(
  rando = boxGrob("Randomised\nN = 197", box_gp = main_gp),
  groups = list(
    boxGrob("96 assigned to intervention\n95 received treatment",
            box_gp = arm_gp),
    boxGrob("101 assigned to control\n93 received treatment",
            box_gp = arm_gp)
  ),
  ex1 = list(
    boxGrob("8 died\n1 withdrew consent", just = "left", box_gp = ex_gp),
    boxGrob("18 died\n1 withdrew consent", just = "left", box_gp = ex_gp)
  ),
  groups1 = list(
    boxGrob("87 completed day 30\nfollow-up", box_gp = arm_gp),
    boxGrob("79 completed day 30\nfollow-up", box_gp = arm_gp)
  ),
  ex2 = list(
    boxGrob("8 died", just = "left", box_gp = ex_gp),
    boxGrob("9 died\n1 withdrew consent\n2 lost to follow-up",
            just = "left", box_gp = ex_gp)
  ),
  groups2 = list(
    boxGrob("79 completed day 180\nfollow-up", box_gp = arm_gp),
    boxGrob("68 completed day 180\nfollow-up", box_gp = arm_gp)
  ),
  analysis = list(
    boxGrob("95 included in primary\noutcome analysis", box_gp = arm_gp),
    boxGrob("95 included in primary\noutcome analysis", box_gp = arm_gp)
  )
) |>
  spread(axis = "y", margin = unit(0.02, "npc")) |>
  equalizeWidths(subelement = stringr::regex("^groups|analysis"), width = box_width) |>
  equalizeHeights(subelement = stringr::regex("^groups|analysis")) |>
  equalizeWidths(subelement = stringr::regex("^ex"), width = ex_width) |>
  spread(subelement = stringr::regex("^groups|analysis"), axis = "x",
         from = arm_from, to = arm_to, type = "center") |>
  move(subelement = "rando",
       x = position("groups", position = "center", type = "x")) |>
  move(subelement = list(c("ex1", 1), c("ex2", 1)),
       x = ex_page_margin, just = "left") |>
  move(subelement = list(c("ex1", 2), c("ex2", 2)),
       x = 1 - ex_page_margin, just = "right") |>
  connect("rando", "groups", type = "N", lty_gp = con_gp, arrow_size = 3, smooth = TRUE) |>
  connect(c("groups$1", "groups1$1"), c("ex1$1", "ex2$1"),
          type = "side", lty_gp = side_gp, arrow_size = 3,
          side = "left", end_side = "right",
          side_route = "outside", side_offset = side_offset) |>
  connect(c("groups$2", "groups1$2"), c("ex1$2", "ex2$2"),
          type = "side", lty_gp = side_gp, arrow_size = 3,
          side = "right", end_side = "left",
          side_route = "outside", side_offset = side_offset) |>
  connect("groups", "groups1", type = "vertical", lty_gp = con_gp, arrow_size = 3) |>
  connect("groups1", "groups2", type = "vertical", lty_gp = con_gp, arrow_size = 3) |>
  connect("groups2", "analysis", type = "vertical", lty_gp = con_gp, arrow_size = 3) |>
  connect(list("ex1$1", "ex2$1"), "analysis$1", type = "side",
          lty_gp = dotted_gp, arrow_size = 3,
          side = "left", end_side = "left",
          side_route = "outside",
          side_offset = fan_in_offset) |>
  connect(list("ex1$2", "ex2$2"), "analysis$2", type = "side",
          lty_gp = dotted_gp, arrow_size = 3,
          side = "right", end_side = "right",
          side_route = "outside",
          side_offset = fan_in_offset) |>
  print()
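
The analytical idea behind the dotted arrows can be shown with a few lines of base R: participants who leave direct follow-up still contribute the time they accrued before leaving. The data below are made up for illustration, and the column names are assumptions.

```r
# Made-up follow-up data: days under observation and reason follow-up ended.
followup <- data.frame(
  id = 1:6,
  days = c(180, 180, 45, 180, 12, 97),
  status = c("completed", "completed", "died", "completed", "withdrew", "lost")
)

# Person-days contributed by each group, including those who left early.
person_days <- tapply(followup$days, followup$status, sum)

# Share of all observed time that comes from participants who did not
# complete follow-up.
early_share <- 1 - person_days[["completed"]] / sum(person_days)
```

A purely downward flow would suggest those person-days are discarded, and the return arrows say that they are not.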

Why this belongs in Gmisc

Gmisc has always collected the small tools I found myself needing
around medical statistics: descriptive tables, transition plots, and
grid-based figures. Flowcharts fit that same pattern. They are not a
statistical model, but they are part of how research is
communicated.

The new flowchart work in 3.4.0 is therefore aimed at the practical
problems:

making CONSORT-like diagrams less painful to draw
keeping grouped stages aligned and readable
making arrows behave predictably
supporting side paths, return paths, and repeated box patterns
producing figures that can be regenerated when the study changes

The vignette contains the full API and examples:

vignette("Grid-based_flowcharts", package = "Gmisc")

The blog figures in this post are intentionally close to things
researchers already have in their workflow: trial enrollment, registry
construction, data validation, and follow-up accounting. My hope is that
they make the flowchart tools feel less like a drawing utility and more
like a small extension of the analysis itself.
