
Speeding up tidySummarizedExperiment through query optimisation and the plyxp backend

[This article was first published on tidyomicsBlog, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

tidySummarizedExperiment logo

Contributors: Michael Love, Justin Landis, Pierre-Paul Axisa

The generality of tidySummarizedExperiment makes it easy to interface with several tidyverse packages (e.g. dplyr, tidyr, ggplot2, purrr, plotly). This is possible thanks to its approach of converting SummarizedExperiment objects to tibbles, performing operations, and converting back to the original format. This conversion process introduces substantial overhead when working with large-scale datasets. Each operation requires multiple data transformations, with the conversion to tibble format creating memory copies of the entire dataset, followed by the reverse conversion back to SummarizedExperiment. For datasets containing hundreds of samples and tens of thousands of genes, these repeated conversions can consume memory and add significant computational overhead to even simple operations such as filtering or grouping.
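To make that cost concrete, here is a sketch of the round-trip pattern that every tidy verb historically followed (illustrative only, using the airway dataset that the benchmarks below rely on):

```r
library(tidySummarizedExperiment)
library(airway)
data(airway)

# Historical pattern behind every tidy verb:
#   SummarizedExperiment -> tibble (full copy) -> dplyr verb -> SummarizedExperiment
# The intermediate tibble holds one row per gene-sample pair, so for
# airway (~64k genes x 8 samples) each call materialises roughly half a
# million rows before any actual work happens.
tbl <- airway |> as_tibble()   # the expensive expansion step
nrow(tbl)                      # = nrow(se) * ncol(se)
```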

With the new tidySummarizedExperiment release (v1.19.7), we have introduced optimisations that address these performance limitations. The optimisation is powered by:

  1. Checking the query domain (assay, colData, rowData) and executing a specialised operation for that domain.
  2. Using plyxp for complex queries that span multiple domains.

plyxp is a tidyomics package developed by Justin Landis, and first released as part of Bioconductor 3.20 in October 2024. It uses data-masking functionality from the rlang package to perform efficient operations on SummarizedExperiment objects.
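As a sketch of what this looks like in practice (helper names follow the plyxp vignette; the exact syntax may differ across plyxp versions):

```r
library(plyxp)
library(airway)
data(airway)

xp <- new_plyxp(airway)   # wrap the SummarizedExperiment

# The assay slot is the default data mask; cols()/rows() scope
# expressions to colData/rowData, so no tibble expansion is needed.
xp |> mutate(
    log_counts = log1p(counts),           # assay-wise operation
    cols(dex_chr = as.character(dex))     # sample-wise (colData) operation
)
```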

< section id="motivation-and-design-principles" class="level3">

Motivation and design principles

This benchmark supports ongoing work to improve the performance of tidySummarizedExperiment. In this benchmark, we show up to 26x improvement in operations such as mutate().

The current optimisation is grounded in three principles:

  1. Preserve dimnames and the original object structure.
  2. Avoid unnecessary tibble round-trips.
  3. Provide predictable performance across simple and mixed-slot scenarios.

< section id="example-of-code-optimisation" class="level3">

Example of code optimisation

This was the mutate() method before optimisation. The previous implementation relied on the pipeline as_tibble() |> dplyr::mutate() |> update_SE_from_tibble(.data).

The function update_SE_from_tibble interprets the input tibble and converts it back to a SummarizedExperiment. Although this step provides great generality and flexibility, it is particularly expensive because it must infer whether columns are sample-wise or feature-wise.
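To see why this inference is expensive, consider what the reconstruction must do for every new column in the long tibble: decide whether it is constant within each sample (colData), constant within each feature (rowData), or varies with both (a new assay). A hypothetical check, not the actual package internals, might look like:

```r
# Hypothetical illustration of slot inference (not the real
# update_SE_from_tibble internals): a column belongs in colData only
# if it has a single distinct value within every sample. Each such
# check scans the full gene-by-sample tibble.
is_sample_wise <- function(tbl, col) {
    tbl |>
        dplyr::summarise(
            n_values = dplyr::n_distinct(.data[[col]]),
            .by = .sample
        ) |>
        dplyr::pull(n_values) |>
        max() == 1
}
```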

< details class="code-fold"> < summary>Show pre-optimization source
mutate.SummarizedExperiment <- function(.data, ...) {
    # Legacy implementation of mutate() for SummarizedExperiment:
    # - Validates requested edits against special/view-only columns
    # - Performs mutate() via tibble round-trip, then reconstructs the SE
    # Check that we are not modifying a key column
    cols <- enquos(...) |> names()
    
    # Deprecation of special column names:
    # capture all quoted args to detect deprecated special-column usage
    .cols <- enquos(..., .ignore_empty="all") %>% 
        map(~ quo_name(.x)) %>% unlist()
    if (is_sample_feature_deprecated_used(.data, .cols)) {
        # Record deprecated usage into metadata for backward compatibility
        .data <- ping_old_special_column_into_metadata(.data)
    }
    
    # Identify view-only/special columns (sample/feature keys, etc.)
    # Use a small slice to reduce overhead while probing structure
    special_columns <- get_special_columns(
        # Decrease the size of the dataset
        .data[1:min(100, nrow(.data)), 1:min(20, ncol(.data))]
    ) |> c(get_needed_columns(.data))
    
    # Are any requested targets among special/view-only columns?
    tst <-
        intersect(
            cols,
            special_columns
        ) |> 
        length() |>
        gt(0)

    if (tst) {
        columns <-
            special_columns |>
                paste(collapse=", ")
        stop(
            "tidySummarizedExperiment says:",
            " you are trying to rename a column that is view only",
            columns,
            "(it is not present in the colData).",
            " If you want to mutate a view-only column,",
            " make a copy and mutate that one."
        )
    }

    # If Ranges column not in query, prefer faster tibble conversion
    # Skip expanding GRanges columns when not referenced
    skip_GRanges <-
        get_GRanges_colnames() %in% 
        cols |>
        not()
    
    # Round-trip: SE -> tibble -> dplyr::mutate -> SE
    .data |>
        as_tibble(skip_GRanges=skip_GRanges) |>
        dplyr::mutate(...) |>
        update_SE_from_tibble(.data)
}

The new implementation captures the easy cases, such as sample-only and feature-only metadata mutate(). If mutate() is a mixed operation that can be factored into sample- and feature-wise operations, it is handled by plyxp. Otherwise, the general solution is used.

Key components to compare:

  - The pre-optimization code always uses a tibble round-trip (as_tibble() |> dplyr::mutate() |> update_SE_from_tibble()).
  - The optimized code first analyzes scope (colData, rowData, assay, or mixed) and dispatches to specialized paths.
  - The fallback (mutate_via_tibble()) still exists for complex cases, preserving generality.

< details class="code-fold"> < summary>Show post-optimization source
mutate.SummarizedExperiment <- function(.data, ...) {

    # Check if query is composed (multiple expressions)
    if (is_composed("mutate", ...)) return(decompose_tidy_operation("mutate", ...)(.data))

    # Check for scope and dispatch accordingly
    scope_report <- analyze_query_scope_mutate(.data, ...)
    scope <- scope_report$scope

    result <-
        if (scope == "coldata_only") modify_samples(.data, "mutate", ...)
        else if (scope == "rowdata_only") modify_features(.data, "mutate", ...)
        else if (scope == "assay_only") mutate_assay(.data, ...)
        else if (scope == "mixed") modify_se_plyxp(.data, "mutate", scope_report, ...)
        else mutate_via_tibble(.data, ...)

    # Record latest mutate scope into metadata for testing/introspection
    meta <- S4Vectors::metadata(result)
    if (is.null(meta)) meta <- list()
    meta$latest_mutate_scope_report <- scope_report
    S4Vectors::metadata(result) <- meta

    return(result)
}
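The dispatch above hinges on classifying which slots a query touches. A minimal sketch of that idea (a hypothetical classify_scope(), not the package's actual analyze_query_scope_mutate()):

```r
classify_scope <- function(se, ...) {
    # Collect every symbol referenced in the quoted expressions
    vars <- unique(unlist(lapply(
        rlang::enquos(...),
        function(q) all.vars(rlang::quo_get_expr(q))
    )))
    in_col   <- vars %in% colnames(SummarizedExperiment::colData(se))
    in_row   <- vars %in% colnames(SummarizedExperiment::rowData(se))
    in_assay <- vars %in% SummarizedExperiment::assayNames(se)

    if (length(vars) && all(in_col))   return("coldata_only")
    if (length(vars) && all(in_row))   return("rowdata_only")
    if (length(vars) && all(in_assay)) return("assay_only")
    if (all(in_col | in_row | in_assay)) return("mixed")
    "general"   # unknown symbols: fall back to the tibble round-trip
}
```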
< section id="benchmarking-overview" class="level1">

Benchmarking Overview

This vignette benchmarks a set of mutate(), filter(), select(), and distinct() scenarios (see documentation), comparing performance before and after optimisation. It does so by explicitly checking out specific commits via git worktree, loading each commit's code with devtools::load_all(), running the same scenarios multiple times, and comparing the runtimes with ggplot boxplots.

< section id="setup-helper-functions" class="level3">

Setup helper functions

< details class="code-fold"> < summary>Show the code
suppressPackageStartupMessages({
  library(ggplot2)
  library(dplyr)
  library(SummarizedExperiment)
  library(rlang)
  library(devtools)
  library(airway)
  library(microbenchmark)
  library(reactable)
  library(patchwork)
})

load_branch_code <- function(worktree_dir) {
  if (!requireNamespace("devtools", quietly = TRUE)) stop("Please install devtools to run this vignette.")
  # Debug: print the directory we're looking for
  cat("Looking for worktree directory:", worktree_dir, "\n")
  cat("Directory exists:", dir.exists(worktree_dir), "\n")
  cat("Current working directory:", getwd(), "\n")
  # Check if directory exists
  if (!dir.exists(worktree_dir)) {
    stop(paste("Worktree directory does not exist:", worktree_dir))
  }
  suppressMessages(devtools::load_all(worktree_dir, quiet = TRUE))
}

create_airway_test_se <- function() {
  suppressPackageStartupMessages(library(airway))
  data(airway)
  se <- airway
  se[1:200, ]
}

benchmark_scenarios <- function() {
  list(
    coldata_simple_assignment = quo({ se |> mutate(new_dex = dex) }),
    coldata_arithmetic = quo({ se |> mutate(avgLength_plus_5 = avgLength + 5) }),
    coldata_concat = quo({ se |> mutate(sample_info = paste(cell, dex, SampleName, sep = "_")) }),
    coldata_grouped_mean = quo({ se |> group_by(dex) |> mutate(avgLength_group_mean = mean(avgLength)) |> ungroup() }),
    assay_simple_assignment = quo({ se |> mutate(counts_copy = counts) }),
    assay_plus_one = quo({ se |> mutate(counts_plus_1 = counts + 1) }),
    assay_log = quo({ se |> mutate(log_counts_manual = log2(counts + 1)) }),
    complex_conditional_coldata = quo({ se |> mutate(length_group = ifelse(avgLength > mean(avgLength), "longer", "shorter")) }),
    complex_nested = quo({ se |> mutate(complex_category = ifelse(dex == "trt" & avgLength > mean(avgLength), "treated_long", ifelse(dex == "untrt", "untreated", "other"))) }),
    mixed_assay_coldata = quo({ se |> mutate(new_counts = counts * avgLength) }),
    multiple_simple_assay = quo({ se |> mutate(normalized_counts = counts / 1000, sqrt_counts = sqrt(counts)) }),
    chained_mutates = quo({ se |> mutate(tmp = avgLength * 2) |> mutate(flag = ifelse(tmp > mean(tmp), 1, 0)) }),

    # Filter benchmarks (scoped and non-rectangular)
    filter_coldata_simple = quo({ se |> filter(dex == "trt") }),
    filter_coldata_numeric = quo({ se |> filter(avgLength > median(avgLength)) }),
    filter_assay_nonrect = quo({ se |> filter(counts > 0) }),

    # Select benchmarks (covering colData-only, rowData-only, assays-only, mixed)
    select_coldata_simple = quo({ se |> select(.sample, dex) }),
    select_rowdata_simple = quo({ se |> select(.feature) }),
    select_assay_only = quo({ se |> select(counts) }),
    select_mixed_keys_counts = quo({ se |> select(.sample, .feature, counts) }),
    select_coldata_wide = quo({ se |> select(.sample, dex, avgLength, SampleName) }),

    # Distinct benchmarks (covering colData-only, rowData-only, assays-only, mixed)
    distinct_coldata_simple = quo({ se |> distinct(dex) }),
    distinct_coldata_multiple = quo({ se |> distinct(dex, avgLength) }),
    distinct_rowdata_simple = quo({ se |> distinct(.feature) }),
    distinct_assay_only = quo({ se |> distinct(counts) }),
    distinct_mixed_keys_counts = quo({ se |> distinct(.sample, .feature, counts) }),
    distinct_coldata_wide = quo({ se |> distinct(.sample, dex, avgLength, SampleName) }),
    distinct_with_keep_all = quo({ se |> distinct(dex, .keep_all = TRUE) }),
    distinct_complex_expression = quo({ se |> distinct(dex, avgLength) })
  )
}

run_one <- function(expr_quo, reps = 5L) {
  se_base <- create_airway_test_se()
  mb <- microbenchmark::microbenchmark(
    eval_tidy(expr_quo),
    times = reps,
    setup = { se <- se_base },          # reuse the same input, avoid recreating inside the timed expr
    control = list(warmup = 2L)
  )
  # microbenchmark returns nanoseconds; convert to milliseconds
  as.numeric(mb$time) / 1e6
}

run_all_scenarios <- function(branch_label, reps = 7L) {
  scenarios <- benchmark_scenarios()
  out <- list()
  for (nm in names(scenarios)) {
    tms <- run_one(scenarios[[nm]], reps = reps)
    out[[length(out) + 1L]] <- data.frame(
      branch = branch_label,
      scenario = nm,
      replicate = seq_along(tms),
      elapsed_ms = tms,
      stringsAsFactors = FALSE
    )
  }
  bind_rows(out)
}

# Parallel version: run each scenario on a separate worker
run_all_scenarios_parallel <- function(branch_label, reps = 20L, workers = 1L, initializer = NULL) {
  scenarios <- benchmark_scenarios()
  nms <- names(scenarios)
  old_plan <- future::plan()
  on.exit(future::plan(old_plan), add = TRUE)
  future::plan(future::multisession, workers = workers)
  res <- future.apply::future_lapply(nms, function(nm) {
    if (!is.null(initializer)) initializer()
    tms <- run_one(scenarios[[nm]], reps = reps)
    data.frame(
      branch = branch_label,
      scenario = nm,
      replicate = seq_along(tms),
      elapsed_ms = tms,
      stringsAsFactors = FALSE
    )
  }, future.seed = TRUE)
  dplyr::bind_rows(res)
}
< section id="run-benchmarks-on-both-branches" class="level3">

Run benchmarks on both branches

< details class="code-fold"> < summary>Show the code
# Worktree directories (already exist in the post directory)
wt_before <- ".__bench_before__"
wt_after <- ".__bench_after__"

# Verify worktrees exist
if (!dir.exists(wt_before)) {
  stop("Worktree directory does not exist: ", wt_before)
}
if (!dir.exists(wt_after)) {
  stop("Worktree directory does not exist: ", wt_after)
}

# Before optimisation (commit 87445757)
load_branch_code(wt_before)
Looking for worktree directory: .__bench_before__ 
Directory exists: TRUE 
Current working directory: /Users/a1234450/Documents/GitHub/tidyomicsBlog/posts/2025-10-25-tidySummarizedExperiment-optimization 
< details class="code-fold"> < summary>Show the code
res_before <- run_all_scenarios(branch_label = "before_optimization", reps = 10L)

# After optimisation (commit 9f7c26e)
load_branch_code(wt_after)
Looking for worktree directory: .__bench_after__ 
Directory exists: TRUE 
Current working directory: /Users/a1234450/Documents/GitHub/tidyomicsBlog/posts/2025-10-25-tidySummarizedExperiment-optimization 
< details class="code-fold"> < summary>Show the code
res_after <- run_all_scenarios(branch_label = "after_optimization", reps = 10L)

results <- dplyr::bind_rows(res_before, res_after) |>
  dplyr::mutate(operation = dplyr::case_when(
    grepl("^filter", scenario) ~ "filter",
    grepl("^select", scenario) ~ "select",
    grepl("^distinct", scenario) ~ "distinct",
    TRUE ~ "mutate"
  ))

summary_table <- results |>
  group_by(branch, scenario) |>
  summarise(median_ms = median(elapsed_ms), .groups = "drop") |>
  tidyr::pivot_wider(names_from = branch, values_from = median_ms) |> 
  dplyr::mutate(speedup = round(before_optimization / after_optimization, 2))
< section id="visualize-with-combined-performance-plots" class="level1">

Visualize with combined performance plots

< details class="code-fold"> < summary>Show the code
dodge_w <- 0.7

p_box <- ggplot(results, aes(x = scenario, y = elapsed_ms, fill = branch)) +
  geom_boxplot(position = position_dodge(width = dodge_w), width = 0.7, outlier.shape = NA) +

  # Add jittered points aligned with the dodged boxplots
  geom_point(
    position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0, dodge.width = dodge_w), 
    alpha = 0.6, 
    size = 0.5
  ) +
  scale_y_log10() + 
  coord_flip() +
  facet_grid(operation ~ ., scales = "free_y", space = "free_y") +
  annotation_logticks(sides = "b") +
  labs(title = "Performance comparison: Before vs After optimization",
       x = "Scenario",
       y = "Elapsed (ms)") +
  theme_bw() +
  
  # Angle x labels  
  theme(legend.position = "top", axis.text.x = element_text(angle = 45, hjust = 1))

# Speedup summary panel (median before/after ratio)
speedup_plot_data <- summary_table |>
  dplyr::mutate(operation = dplyr::case_when(
    grepl("^filter", scenario) ~ "filter",
    grepl("^select", scenario) ~ "select",
    grepl("^distinct", scenario) ~ "distinct",
    TRUE ~ "mutate"
  ))

p_speedup <- ggplot(
  speedup_plot_data,
  aes(x = speedup, y = reorder(scenario, speedup))
) +
  geom_col(width = 0.7, fill = "grey70", color = "grey40") +
  facet_grid(operation ~ ., scales = "free_y", space = "free_y") +
  labs(
    title = "Median speedup by scenario",
    x = "Speedup (before/after, x)",
    y = NULL
  ) +
  theme_bw() +
  theme(legend.position = "none")

combined_plot <- p_box + p_speedup + patchwork::plot_layout(widths = c(2.3, 1))
combined_plot

< details class="code-fold"> < summary>Show the code
# Save the combined figure
ggsave("benchmark_plot.png", plot = combined_plot, width = 14, height = 8)
< section id="interpreting-the-benchmark-results" class="level3">

Interpreting the benchmark results

Across all scenarios, speedup ranges from 0.69x to 26.17x.

Operations with the strongest gains are: coldata_arithmetic (26.17x), coldata_simple_assignment (25.71x), chained_mutates (25.45x).

Lower-gain outliers are: distinct_coldata_wide (0.69x), coldata_grouped_mean (0.94x), distinct_mixed_keys_counts (0.94x).

By operation family, median speedup is: mutate (23.66x), filter (1.19x), select (1.11x), distinct (1.08x).

< section id="session-info" class="level1">

Session Info

R version 4.5.3 (2026-03-11)
Platform: x86_64-apple-darwin20
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Australia/Adelaide
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] tidySummarizedExperiment_1.19.7 tidyr_1.3.2                    
 [3] testthat_3.3.2                  ttservice_0.5.3                
 [5] patchwork_1.3.2                 reactable_0.4.5                
 [7] rlang_1.1.7                     microbenchmark_1.5.0           
 [9] airway_1.30.0                   SummarizedExperiment_1.40.0    
[11] Biobase_2.70.0                  GenomicRanges_1.62.1           
[13] Seqinfo_1.0.0                   IRanges_2.44.0                 
[15] S4Vectors_0.48.0                BiocGenerics_0.56.0            
[17] generics_0.1.4                  MatrixGenerics_1.22.0          
[19] matrixStats_1.5.0               dplyr_1.2.0                    
[21] ggplot2_4.0.2                   devtools_2.4.6                 
[23] usethis_3.2.1                  

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1    viridisLite_0.4.3   farver_2.1.2       
 [4] S7_0.2.1            fastmap_1.2.0       lazyeval_0.2.2     
 [7] digest_0.6.39       plyxp_1.4.3         lifecycle_1.0.5    
[10] ellipsis_0.3.2      magrittr_2.0.4      compiler_4.5.3     
[13] tools_4.5.3         yaml_2.3.12         data.table_1.18.2.1
[16] knitr_1.51          S4Arrays_1.10.1     labeling_0.4.3     
[19] htmlwidgets_1.6.4   pkgbuild_1.4.8      DelayedArray_0.36.0
[22] RColorBrewer_1.1-3  pkgload_1.5.0       abind_1.4-8        
[25] withr_3.0.2         purrr_1.2.1         desc_1.4.3         
[28] grid_4.5.3          fansi_1.0.7         scales_1.4.0       
[31] cli_3.6.5           rmarkdown_2.30      ragg_1.5.1         
[34] remotes_2.5.0       otel_0.2.0          rstudioapi_0.18.0  
[37] httr_1.4.8          sessioninfo_1.2.3   cachem_1.1.0       
[40] stringr_1.6.0       XVector_0.50.0      vctrs_0.7.1        
[43] Matrix_1.7-4        jsonlite_2.0.0      systems_1.3.2  
[46] crosstalk_1.2.2     plotly_4.12.0       glue_1.8.0         
[49] reactR_0.6.1        stringi_1.8.7       gtable_0.3.6       
[52] tibble_3.3.1        pillar_1.11.1       htmltools_0.5.9    
[55] brio_1.1.5          R6_2.6.1            textshaping_1.0.5  
[58] rprojroot_2.1.1     evaluate_1.0.5      lattice_0.22-9     
[61] memoise_2.0.1       SparseArray_1.10.9  xfun_0.56          
[64] fs_1.6.7            pkgconfig_2.0.3    

© 2025 tidyomics. Content is published under Creative Commons CC-BY-4.0 License for the text and BSD 3-Clause License for any code. | R-Bloggers

To leave a comment for the author, please follow the link and comment on their blog: tidyomicsBlog.
