Tidy Pairwise Operations

[This article was first published on rstats on Bryan Shalloway's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Overview

Say you want to map an operation or list of operations across all two-way1 combinations of a set of variables/columns in a dataframe. For example, you may be doing feature engineering and want to create a set of interaction terms, ratios, etc2. You may be interested in computing a summary statistic across all pairwise combinations of a given set of variables3. In some cases there may be a pairwise implementation already available, e.g. R’s cor() function for computing correlations4. In other cases one may not exist or is not easy to use5. In this post I’ll walk through an example6 explaining code and steps for setting-up arbitrary pairwise operations across sets of variables.

I’ll break my approach down into several steps:

I. Nest and pivot
II. Expand combinations
III. Filter redundancies
IV. Map function(s)
V. Return to normal dataframe
VI. Bind back to data

If your interest is only in computing summary statistics (as opposed to modifying an existing dataframe with new columns / features), then only steps I – IV are needed.

Relevant software and style:

I will primarily be using R’s tidyverse packages. I make frequent use of lists as columns within dataframes – if you are new to these, see my previous talk and the resources7 I link to in the description.

Throughout this post, wherever I write “dataframe” I really mean “tibble” (a dataframe with minor changes to default options and printing behavior). Also note that I am using dplyr 0.8.3 rather than the newly released 1.0.08.

Other resources and open issues (updated 2020-06-14):

In particular, the comments in issue 44 for the corrr package contain excellent solutions for doing pairwise operations (the subject of this post)9. Issue 94 also features discussion on this topic. Throughout this post I will reference other alternative code/approaches (especially in the footnotes and the Appendix).

Data:

I’ll use the ames housing dataset across examples.

ames <- AmesHousing::make_ames()

Specifically, I’ll focus on ten numeric columns that, based on a random sample of 1000 rows, show the highest correlation with Sale_Price10.

library(tidyverse)

set.seed(2020)
ames_cols <- ames %>% 
  select_if(is.numeric) %>% 
  sample_n(1000) %>% 
  corrr::correlate() %>% 
  corrr::focus(Sale_Price) %>% 
  arrange(-abs(Sale_Price)) %>% 
  head(10) %>% 
  pull(rowname)

ames_subset <- select(ames, ames_cols) %>% 
  # Could normalize data or do other prep 
  # but is not pertinent for examples
  mutate_all(as.double)

I. Nest and pivot

There are a variety of ways to make lists into columns within a dataframe. In the example below, I first use summarise_all(.tbl = ames_subset, .funs = list) to create a one row dataframe where each column is a list containing a single element and each individual element corresponds with a numeric vector of length 2930.

After nesting, I pivot11 the columns leaving a dataframe with two columns:

  • var the variable names
  • vector a list where each element contains the associated vector
df_lists <- ames_subset %>% 
  summarise_all(list) %>% 
  pivot_longer(cols = everything(), 
               names_to = "var", 
               values_to = "vector") %>% 
  print()
## # A tibble: 10 x 2
##    var            vector       
##                     
##  1 Gr_Liv_Area    
##  2 Garage_Cars    
##  3 Garage_Area    
##  4 Total_Bsmt_SF  
##  5 First_Flr_SF   
##  6 Year_Built     
##  7 Full_Bath      
##  8 Year_Remod_Add 
##  9 TotRms_AbvGrd  
## 10 Fireplaces     

See Pivot and then summarise for a nearly identical approach with just an altered order of steps. Also see Nested tibbles for how you could create a list-column of dataframes12 rather than vectors.

What if my variables are across rows not columns?

For example, pretend you want to see if Sale_Price is different across Mo_Sold. Perhaps you started by doing an F-test, found that to be significant, and now want to do pairwise t-tests across the samples of Sale_Price for each Mo_Sold. To set this up, you will want a group_by() rather than a pivot_longer() step. E.g.

ames %>% 
  group_by(Mo_Sold) %>% 
  summarise(Sale_Price = list(Sale_Price)) 
## # A tibble: 12 x 2
##    Mo_Sold Sale_Price 
##            
##  1       1 
##  2       2 
##  3       3 
##  4       4 
##  5       5 
##  6       6 
##  7       7 
##  8       8 
##  9       9 
## 10      10 
## 11      11 
## 12      12 

At which point your data is in fundamentally the same form as was created in the previous code chunk (at least for if we only care about computing summary metrics that don’t require vectors of equal length13) so you can move onto II. Expand combinations.

If the variables needed for your combinations of interest are across both rows and columns, you may want to use both pivot_longer() and group_by() steps and may need to make a few small modifications.

II. Expand combinations

I then use tidyr::nesting() within tidyr::expand() to make all 2-way combinations of our rows.

df_lists_comb <- expand(df_lists,
                        nesting(var, vector),
                        nesting(var2 = var, vector2 = vector)) %>% 
  print()
## # A tibble: 100 x 4
##    var        vector        var2           vector2      
##                                   
##  1 Fireplaces  Fireplaces     
##  2 Fireplaces  First_Flr_SF   
##  3 Fireplaces  Full_Bath      
##  4 Fireplaces  Garage_Area    
##  5 Fireplaces  Garage_Cars    
##  6 Fireplaces  Gr_Liv_Area    
##  7 Fireplaces  Total_Bsmt_SF  
##  8 Fireplaces  TotRms_AbvGrd  
##  9 Fireplaces  Year_Built     
## 10 Fireplaces  Year_Remod_Add 
## # ... with 90 more rows

See Expand via join for an alternative approach using the dplyr::*_join() operations.

You could make a strong case that this step should be after III. Filter redundancies14. However putting it beforehand makes the required code easier to write and to read.

III. Filter redundancies

Filter-out redundant columns, sort the rows, better organize the columns.

df_lists_comb <- df_lists_comb %>% 
  filter(var != var2) %>% 
  arrange(var, var2) %>% 
  mutate(vars = paste0(var, ".", var2)) %>% 
  select(contains("var"), everything()) %>% 
  print()
## # A tibble: 90 x 5
##    var          var2           vars                    vector       vector2     
##                                                      
##  1 Fireplaces   First_Flr_SF   Fireplaces.First_Flr_SF 

If your operation of interest is associative15, apply a filter to remove additional redundant combinations.

c_sort_collapse <- function(...){
  c(...) %>% 
    sort() %>% 
    str_c(collapse = ".")
}

df_lists_comb_as <- df_lists_comb %>% 
  mutate(vars = map2_chr(.x = var, 
                         .y = var2, 
                         .f = c_sort_collapse)) %>%
  distinct(vars, .keep_all = TRUE)

IV. Map function(s)

Each row of your dataframe now contains the relevant combinations of variables and is ready to have any arbitrary function(s) mapped across them.

Example with summary statistic16:

For example, let’s say we want to compute the p-value of the correlation coefficient for each pair17.

pairs_cor_pvalues <- df_lists_comb_as %>% 
  mutate(cor_pvalue = map2(vector, vector2, cor.test) %>% map_dbl("p.value"),
         vars = fct_reorder(vars, -cor_pvalue)) %>% 
  arrange(cor_pvalue) %>% 
  print()
## # A tibble: 45 x 6
##    var        var2         vars                vector     vector2     cor_pvalue
##                                                 
##  1 First_Flr~ Total_Bsmt_~ First_Flr_SF.Total~ 

For fun, let’s plot the most significant associations onto a bar graph.

pairs_cor_pvalues %>% 
  head(15) %>% 
  mutate(cor_pvalue_nlog = -log(cor_pvalue)) %>% 
  ggplot(aes(x = vars, 
             y = cor_pvalue_nlog, 
             fill = is.infinite(cor_pvalue_nlog) %>% factor(c(T, F))))+
  geom_col()+
  coord_flip()+
  theme_bw()+
  labs(title = "We are confident that garage area and # of garage cars are correlated",
       y = "Negative log of p-value of correlation coefficient",
       x = "Variable combinations",
       fill = "Too high to\nmeaningfully\ndifferentiate:")+
  theme(plot.title.position = "plot")

You could use this approach to calculate any pairwise summary statistic. For example, see gist where I calculate the K-S statistic across each combination of a group of distributions.

If you only care about computing summary statistics on your pairwise combinations, (and not adding new columns onto your original dataframe) you can stop here.

Example with transformations18:

Back to the feature engineering example, perhaps we want to create new features of the difference and quotient of each combination of our variables.

new_features_prep1 <- df_lists_comb %>% 
  mutate(difference = map2(vector, vector2, `-`),
         ratio = map2(vector, vector2, `/`))

V. Return to normal dataframe

The next set of steps will put our data back into a more traditional form consistent with our starting dataframe/tibble.

First let’s revert our data to a form similar to where it was at the end of I. Nest and pivot where we had two columns:

  • one with our variable names
  • a second containing a list-column of vectors
new_features_prep2 <- new_features_prep1 %>% 
  pivot_longer(cols = c(difference, ratio)) %>% # 1
  mutate(name_vars = str_c(var, name, var2, sep = ".")) %>% # 2
  select(name_vars, value) # 3

At the end of each line of code above is a number corresponding with the following explanations:

  1. if we had done just one operation, this step would not be needed, but we did multiple operations, created multiple list-columns (difference and ratio) which we need to get into a single list-column
  2. create new variable name that combines constituent variable names with name of transformation
  3. remove old columns

Next we simply apply the inverse of those operations performed in I. Nest and pivot.

new_features <- new_features_prep2 %>% 
  pivot_wider(values_from = value,
              names_from = name_vars) %>%
  unnest(cols = everything())

The new features will add a good number of columns onto our original dataset19.

dim(new_features)
## [1] 2930  180

VI. Bind back to data

I then bind the new features back onto the original subsetted dataframe.

ames_data_features <- bind_cols(ames_subset, new_features)

At which point I could do further exploring, feature engineering, model building, etc.

Functionalize

I put these steps into a few (unpolished) functions found at this gist20.

devtools::source_gist("https://gist.github.com/brshallo/f92a5820030e21cfed8f823a6e1d56e1")

operations_combinations() takes in your dataframe, the set of numeric columns to create pairwise combinations from, and a list of functions21 to apply.

Example creating & evaluating features

Let’s use the new operations_combinations() function to create new columns for the differences and quotients between all pairwise combinations of our variables of interest.

ames_data_features_example <- operations_combinations(
  df = mutate_if(ames, is.numeric, as.double),
  one_of(ames_cols),
  funs = list("/", "-"),
  funs_names = list("ratio", "difference"),
  associative = FALSE
)

Perhaps you want to calculate some measure of association between your features and a target of interest. To keep things simple, I’ll remove any columns that contain any NA’s or infinite values.

features_keep <- ames_data_features_example %>% 
  keep(is.numeric) %>% 
  keep(~sum(is.na(.) | is.infinite(.)) == 0) %>% 
  colnames()

Maybe, for some reason, you want to see the statistical significance of the correlation of each feature with Sale_Price when weighting by Lot_Area. I’ll calculate these across variables (and a random sample of 1500 observations) then plot them on a histogram.

set.seed(1234)
ames_data_features_example %>% 
  sample_n(1500) %>% 
  summarise_at(
    .vars = features_keep[!(features_keep %in% c("Sale_Price", "Lot_Area"))],
    .funs = ~weights::wtd.cor(., Sale_Price, weight = Lot_Area)[1]) %>% 
  gather() %>% # gather() is an older version of pivot_longer() w/ fewer parameters
  ggplot(aes(x = value))+
  geom_vline(xintercept = 0, colour = "lightgray", size = 2)+
  geom_histogram()+
  scale_x_continuous(labels = scales::comma)+
  labs(title = "Distribution of correlations with Sale_Price",
       subtitle = "Weighted by Lot Area",
       x = "Weighted correlation coefficient")

If doing predictive modeling or inference you may want to fit any transformations and analysis into a tidymodels pipeline or other framework. For some brief notes on this see Interactions example, tidymodels.

When is this approach inappropriate?

Combinatorial growth is very fast22. As you increase either the number of variables in your pool or the size of each set, you will quickly bump into computational limitations.

Tidyverse packages are optimized to be efficient. However operations with matrices or other specialized formats23 are generally faster24 than with dataframes/tibbles. If you are running into computational challenges but prefer to stick with a tidyverse aesthetic (which uses dataframes as a cornerstone), you might:

  • Use heuristics to reduce the number of variables or operations you need to perform (e.g. take a sample, use a preliminary filter, a step-wise like iteration, etc.)
  • Look for packages that abstract the storage and computationally heavy operations away25 and then return back an output in a convenient form26
  • Improve the efficiency of your code (e.g. filter redundancies before rather than after expanding combinations)27
  • Consider parralelizing
  • Use matrices28

There is sometimes an urge to do everything in a tidy way, which is not necessary. For example, you could use an approach like the one I walk through to calculate pairwise correlations between each of your variables. However, the cor() function would do this much more efficiently if called on a matrix or traditional dataframe without list-columns (though you could also use the corrr package within the tidymodels suite which calls cor() in the back-end29).

However, for many operations…

  • there may not be an efficient pairwise implementation available / accessible
  • the slower computation may not matter or can be mitigated in some way

These situations30 are where the approach I walked through is most appropriate.

Appendix

Interactions example, tidymodels

A good example for creating and evaluating interaction terms31 is in The Brute-Force Approach to Identifying Predictive Interactions, Simple Screening section of Max Kuhn and Kjell Johnson’s (free) online book “Feature Engineering and Selection: A Practical Approach for Predictive Models”.

The source code shows another approach for combining variables. The author uses…

  • combn() to create all combinations of variable names which are then…
  • turned into formulas and passed into recipes::step_interact(), specifying the new columns to be created32
  • for each interaction term…
  • in each associated model being evaluated

The example uses a mix of packages and styles and is not a purely tidy approach – tidymodels has also gone through a lot of development since “Feature Engineering and Selection…” was published in 201933. Section 11.2 on Greedy Search Methods, Simple Filters is also highly relevant.

Expand via join

You can take advantage of join34 behavior to create all possible row combinations. In this case, the output will be the same as shown when using expand() (except row order will be different).

left_join(mutate(df_lists, id = 1),
          mutate(df_lists, id = 1) %>% rename_at(vars(-one_of("id")), paste0, "2")) %>%
  select(-id)

Nested tibbles

Creates list of tibbles rather than list of vectors – typically the first way lists as columns in dataframes is introduced.

ames_subset %>% 
  pivot_longer(everything(), names_to = "var", values_to = "list") %>% 
  group_by(var) %>% 
  nest()

Pivot and then summarise

(Almost) equivalent to the example in I. Nest and pivot. Steps just run in a different order (row order will also be different).

ames_test %>% 
  pivot_longer(cols = everything(), 
             names_to = "var", 
             values_to = "vector") %>% 
  group_by(var) %>% 
  summarise_all(list)

Gif for social media

AmesHousing::make_ames() %>% 
  select(Year = Year_Sold, Price = Sale_Price) %>% 
  # I.
  group_by(Year) %>% 
  summarise(Price = list(Gr_Liv_Area)) %>% 
  ungroup() %>% 
  # II.
  expand(nesting(Year, Price),
         nesting(Year2 = Year, Price2 = Price)
  ) %>%
  # III.
  filter(Year != Year2) %>% 
  mutate(Years = map2_chr(.x = Year, 
                          .y = Year2, 
                          .f = c_sort_collapse)) %>%
  distinct(Years, .keep_all = TRUE) %>% 
  select(-Years) %>% 
  #IV.
  mutate(ks_test = map2(Price, 
                        Price2, 
                        stats::ks.test) %>% map_dbl("p.value")
  )

Actual gif was created by embedding above code into a presentation and exporting it as a gif and then making a few minor edits.

Session info

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.4.0   stringr_1.4.0   dplyr_1.0.1     purrr_0.3.4    
## [5] readr_1.3.1     tidyr_1.1.1     tibble_3.0.3    ggplot2_3.3.2  
## [9] tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.3.2            usethis_1.4.0       lubridate_1.7.4    
##  [4] devtools_2.0.0      RColorBrewer_1.1-2  httr_1.4.0         
##  [7] rprojroot_1.3-2     tools_3.5.1         backports_1.1.8    
## [10] utf8_1.1.4          R6_2.4.1            rpart_4.1-13       
## [13] Hmisc_4.1-1         colorspace_1.4-1    nnet_7.3-12        
## [16] withr_2.2.0         gridExtra_2.3       tidyselect_1.1.0   
## [19] prettyunits_1.1.1   processx_3.4.2      curl_3.3           
## [22] compiler_3.5.1      cli_2.0.2           rvest_0.3.4        
## [25] htmlTable_1.12      mice_3.8.0          xml2_1.2.0         
## [28] desc_1.2.0          labeling_0.3        bookdown_0.11      
## [31] checkmate_2.0.0     scales_1.1.1        corrr_0.4.2.9000   
## [34] callr_3.4.3         digest_0.6.25       foreign_0.8-70     
## [37] rmarkdown_1.13      base64enc_0.1-3     pkgconfig_2.0.3    
## [40] htmltools_0.5.0     sessioninfo_1.1.1   htmlwidgets_1.3    
## [43] rlang_0.4.7         readxl_1.3.1        rstudioapi_0.11    
## [46] farver_2.0.3        generics_0.0.2      jsonlite_1.6.1     
## [49] gtools_3.8.2        acepack_1.4.1       magrittr_1.5       
## [52] Formula_1.2-3       Matrix_1.2-14       Rcpp_1.0.4.6       
## [55] munsell_0.5.0       fansi_0.4.1         lifecycle_0.2.0    
## [58] weights_1.0.1       stringi_1.4.6       yaml_2.2.1         
## [61] pkgbuild_1.1.0      grid_3.5.1          gdata_2.18.0       
## [64] crayon_1.3.4        lattice_0.20-35     haven_2.1.0        
## [67] splines_3.5.1       hms_0.5.2           knitr_1.29         
## [70] ps_1.3.2            pillar_1.4.6        pkgload_1.0.2      
## [73] glue_1.4.1          evaluate_0.14       blogdown_0.15      
## [76] latticeExtra_0.6-28 data.table_1.12.8   remotes_2.1.0      
## [79] modelr_0.1.4        vctrs_0.3.2         testthat_2.3.2     
## [82] cellranger_1.1.0    gtable_0.3.0        assertthat_0.2.1   
## [85] xfun_0.16           broom_0.7.0         AmesHousing_0.0.3  
## [88] survival_2.42-3     memoise_1.1.0       cluster_2.0.7-1    
## [91] ellipsis_0.3.1

  1. Will focus on two-way example in this post, but could use similar methods to make more generalizable solution across n-way examples. If I were to do this, the code below would change. E.g.

    • to use pmap*() operations over map2*() operations
    • I’d need to make some functions that make it so I can remove all the places where I have var and var2 type column names hard-coded
    • Alternatively, I might shift approaches and make better use of combn()

  2. Though this “throw everything and the kitchen-sink” approach may not always be a good idea.

  3. I’ve done this type of operation in a variety of ways. Sometimes without any really good reason as to why I used one approach or another. It isn’t completely clear (at least to me) the recommended way of doing these type of operations within the tidyverse – hence the diversity of my approaches in the past and deciding to document the typical steps in the approach I take… via writing this post.

  4. Or the tidymodels implementation corrr::correlate() in the corrr package.

  5. or is not in a style you prefer

  6. I’ll also reference related approaches / small tweaks (putting those materials in the Appendix. This is by no means an exhaustive list (e.g. don’t have an example with a for loop or with a %do% operator). The source code of my post on Ambiguous Absolute Value signs shows a related but more complex / messy approach on a combinatorics problem.

  7. In particular, the chapters on “Iteration” and “Many Models” in R for Data Science. I would also recommend Rebecca Barter’s Learn to purrr blog post.

  8. The new dplyr 1.0.0. contains new functions that would have been potentially useful for several of these operations. I highly recommend checking these updates out in the various recent posts by Hadley Wickham. Some of the major updates (potentially relevant to the types of operations I’ll be discussing in my post):

    • new approach for across-column operations (replacing _at(), _if(), _all() variants with across() function)
    • brought-back rowwise operations
    • emphasize ability to output tibbles / multiple columns in core dplyr verbs. This is something I had only taken advantage of occassionally in the past (example), but will look to use more going forward.

  9. f I’d spotted this issue initially I’m not sure I would have written this post. However what this post offers is a more verbose treatment of the problem which may be useful for people newer to pairwise operations or the tidyverse.

  10. For technical reasons, I also converted all integer types to doubles – was getting integer overflow problems in later operations before changing. Thread on integer overflow in R. In this post I’m not taking a disciplined approach to feature engineering. For example it may make sense to normalize the variables so that variable combinations would be starting on a similar scale. This could be done using recipes::step_normalize() or with code like dplyr::mutate_all(df, ~(. - mean(.)) / sd(.)) .

  11. Note that this part of the problem is one where I actually find using tidyr::gather() easier – but I’ve been forcing myself to switch over to using the pivot_() functions over spread() and gather().

  12. The more common approach.

  13. If your variables are across rows you are likely concerned with getting summary metrics rather than creating new features – as if your data is across rows there is nothing guaranteeing you have the same number of observations or that they are lined-up appropriately. If you are interested in creating new features, you should probably have first reshaped your data to ensure each column represented a variable.

  14. As switching these would be more computationally efficient – see When is this approach inappropriate? for notes related to this. Switching the order here would suggest using approaches with thecombn() function.

  15. I.e. has the same output regardless of the order of the variables. E.g. multiplication or addition but not subtraction or division.

  16. Function(s) that output vectors of length 1 (or less than length of input vectors).

  17. Note that the pairwise implementation psych::corr.test() could have been used on your original subsetted dataframe, see stack overflow thread.

  18. Function(s) that output vector of length equal to length of input vectors.

  19. Did not print this output because cluttered-up page with so many column names.

  20. Steps I - III and V & VI are essentially direct copies of the code above. The approach I took with Step IV may take more effort to follow as it requires understanding a little rlang and could likely have been done more simply.

  21. Must have two vectors as input, but do not need to be infix functions.

  22. Non-technical article discussing combinatorial explosion in context of company user growth targets: Exponential Growth Isn’t Cool. Combinatorial Growth Is..

  23. E.g. data.table dataframes

  24. Hence, if you are doing operations across combinations of lots of variables it may not make sense to do the operations directly within dataframes.

  25. Much (if not most) of the tidyverse (and the R programming language generally) is about creating a smooth interface between the analyst/scientist and the back-end complexity of the operations they are performing. Projects like sparklyr, DBI, reticulate, tidymodels, and brms (to name a few) represent cases where this interface role of R is most apparent.

  26. For tidyverse packages, this is often returned into or in the form of a dataframe.

  27. Could make better use of combn() function to help.

  28. Depending on the complexity may just need to brush-up on your linear algebra.

  29. corrr can also be used to run the operation on databases that may have larger data than you could fit on your computer.

  30. Likely more common for many, if not most, analysts and data scientists.

  31. I.e. multiplying two variables together

  32. Created upon the recipe being baked or juiced – if you have not checked it out, recipes is AWESOME!

  33. Maybe at a future date I’ll make a post writing out the example here using the newer approaches now available in tidymodels. Gist of combn_ttible()… starting place for if I ever get to that write-up.

  34. Could also have used right_join() or full_join().

To leave a comment for the author, please follow the link and comment on their blog: rstats on Bryan Shalloway's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)