Tidy Pairwise Operations
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
Overview
Say you want to map an operation or list of operations across all twoway^{1} combinations of a set of variables/columns in a dataframe. For example, you may be doing feature engineering and want to create a set of interaction terms, ratios, etc^{2}. You may be interested in computing a summary statistic across all pairwise combinations of a given set of variables^{3}. In some cases there may be a pairwise implementation already available, e.g. R’s cor()
function for computing correlations^{4}. In other cases one may not exist or is not easy to use^{5}. In this post I’ll walk through an example^{6} explaining code and steps for settingup arbitrary pairwise operations across sets of variables.
I’ll break my approach down into several steps:
I. Nest and pivot
II. Expand combinations
III. Filter redundancies
IV. Map function(s)
V. Return to normal dataframe
VI. Bind back to data
If your interest is only in computing summary statistics (as opposed to modifying an existing dataframe with new columns / features), then only steps I – IV are needed.
Relevant software and style:
I will primarily be using R’s tidyverse
packages. I make frequent use of lists as columns within dataframes – if you are new to these, see my previous talk and the resources^{7} I link to in the description.
Throughout this post, wherever I write “dataframe” I really mean “tibble” (a dataframe with minor changes to default options and printing behavior). Also note that I am using dplyr
0.8.3 rather than the newly released 1.0.0^{8}.
Other resources and open issues (updated 20200614):
In particular, the comments in issue 44 for the corrr
package contain excellent solutions for doing pairwise operations (the subject of this post)^{9}. Issue 94 also features discussion on this topic. Throughout this post I will reference other alternative code/approaches (especially in the footnotes and the Appendix).
Data:
I’ll use the ames housing dataset across examples.
ames < AmesHousing::make_ames()
Specifically, I’ll focus on ten numeric columns that, based on a random sample of 1000 rows, show the highest correlation with Sale_Price
^{10}.
library(tidyverse) set.seed(2020) ames_cols < ames %>% select_if(is.numeric) %>% sample_n(1000) %>% corrr::correlate() %>% corrr::focus(Sale_Price) %>% arrange(abs(Sale_Price)) %>% head(10) %>% pull(rowname) ames_subset < select(ames, ames_cols) %>% # Could normalize data or do other prep # but is not pertinent for examples mutate_all(as.double)
I. Nest and pivot
There are a variety of ways to make lists into columns within a dataframe. In the example below, I first use summarise_all(.tbl = ames_subset, .funs = list)
to create a one row dataframe where each column is a list containing a single element and each individual element corresponds with a numeric vector of length 2930.
After nesting, I pivot^{11} the columns leaving a dataframe with two columns:
var
the variable namesvector
a list where each element contains the associated vector
df_lists < ames_subset %>% summarise_all(list) %>% pivot_longer(cols = everything(), names_to = "var", values_to = "vector") %>% print() ## # A tibble: 10 x 2 ## var vector #### 1 Gr_Liv_Area
## 2 Garage_Cars ## 3 Garage_Area ## 4 Total_Bsmt_SF ## 5 First_Flr_SF ## 6 Year_Built ## 7 Full_Bath ## 8 Year_Remod_Add ## 9 TotRms_AbvGrd ## 10 Fireplaces
See Pivot and then summarise for a nearly identical approach with just an altered order of steps. Also see Nested tibbles for how you could create a listcolumn of dataframes^{12} rather than vectors.
What if my variables are across rows not columns?
For example, pretend you want to see if Sale_Price
is different across Mo_Sold
. Perhaps you started by doing an Ftest, found that to be significant, and now want to do pairwise ttests across the samples of Sale_Price
for each Mo_Sold
. To set this up, you will want a group_by()
rather than a pivot_longer()
step. E.g.
ames %>% group_by(Mo_Sold) %>% summarise(Sale_Price = list(Sale_Price)) ## # A tibble: 12 x 2 ## Mo_Sold Sale_Price #### 1 1
## 2 2 ## 3 3 ## 4 4 ## 5 5 ## 6 6 ## 7 7 ## 8 8 ## 9 9 ## 10 10 ## 11 11 ## 12 12
At which point your data is in fundamentally the same form as was created in the previous code chunk (at least for if we only care about computing summary metrics that don’t require vectors of equal length^{13}) so you can move onto II. Expand combinations.
If the variables needed for your combinations of interest are across both rows and columns, you may want to use both pivot_longer()
and group_by()
steps and may need to make a few small modifications.
II. Expand combinations
I then use tidyr::nesting()
within tidyr::expand()
to make all 2way combinations of our rows.
df_lists_comb < expand(df_lists, nesting(var, vector), nesting(var2 = var, vector2 = vector)) %>% print() ## # A tibble: 100 x 4 ## var vector var2 vector2 ##
## 1 Fireplaces
Fireplaces ## 2 Fireplaces First_Flr_SF ## 3 Fireplaces Full_Bath ## 4 Fireplaces Garage_Area ## 5 Fireplaces Garage_Cars ## 6 Fireplaces Gr_Liv_Area ## 7 Fireplaces Total_Bsmt_SF ## 8 Fireplaces TotRms_AbvGrd ## 9 Fireplaces Year_Built ## 10 Fireplaces Year_Remod_Add ## # ... with 90 more rows
See Expand via join for an alternative approach using the dplyr::*_join()
operations.
You could make a strong case that this step should be after III. Filter redundancies^{14}. However putting it beforehand makes the required code easier to write and to read.
III. Filter redundancies
Filterout redundant columns, sort the rows, better organize the columns.
df_lists_comb < df_lists_comb %>% filter(var != var2) %>% arrange(var, var2) %>% mutate(vars = paste0(var, ".", var2)) %>% select(contains("var"), everything()) %>% print() ## # A tibble: 90 x 5 ## var var2 vars vector vector2 ##
## 1 Fireplaces First_Flr_SF Fireplaces.First_Flr_SF
If your operation of interest is associative^{15}, apply a filter to remove additional redundant combinations.
c_sort_collapse < function(...){ c(...) %>% sort() %>% str_c(collapse = ".") } df_lists_comb_as < df_lists_comb %>% mutate(vars = map2_chr(.x = var, .y = var2, .f = c_sort_collapse)) %>% distinct(vars, .keep_all = TRUE)
IV. Map function(s)
Each row of your dataframe now contains the relevant combinations of variables and is ready to have any arbitrary function(s) mapped across them.
Example with summary statistic^{16}:
For example, let’s say we want to compute the pvalue of the correlation coefficient for each pair^{17}.
pairs_cor_pvalues < df_lists_comb_as %>% mutate(cor_pvalue = map2(vector, vector2, cor.test) %>% map_dbl("p.value"), vars = fct_reorder(vars, cor_pvalue)) %>% arrange(cor_pvalue) %>% print() ## # A tibble: 45 x 6 ## var var2 vars vector vector2 cor_pvalue ##
## 1 First_Flr~ Total_Bsmt_~ First_Flr_SF.Total~ For fun, let’s plot the most significant associations onto a bar graph.
pairs_cor_pvalues %>% head(15) %>% mutate(cor_pvalue_nlog = log(cor_pvalue)) %>% ggplot(aes(x = vars, y = cor_pvalue_nlog, fill = is.infinite(cor_pvalue_nlog) %>% factor(c(T, F))))+ geom_col()+ coord_flip()+ theme_bw()+ labs(title = "We are confident that garage area and # of garage cars are correlated", y = "Negative log of pvalue of correlation coefficient", x = "Variable combinations", fill = "Too high to\nmeaningfully\ndifferentiate:")+ theme(plot.title.position = "plot")You could use this approach to calculate any pairwise summary statistic. For example, see gist where I calculate the KS statistic across each combination of a group of distributions.
If you only care about computing summary statistics on your pairwise combinations, (and not adding new columns onto your original dataframe) you can stop here.
Example with transformations^{18}:
Back to the feature engineering example, perhaps we want to create new features of the difference and quotient of each combination of our variables.
new_features_prep1 < df_lists_comb %>% mutate(difference = map2(vector, vector2, ``), ratio = map2(vector, vector2, `/`))
V. Return to normal dataframe
The next set of steps will put our data back into a more traditional form consistent with our starting dataframe/tibble.
First let’s revert our data to a form similar to where it was at the end of I. Nest and pivot where we had two columns:
 one with our variable names
 a second containing a listcolumn of vectors
new_features_prep2 < new_features_prep1 %>% pivot_longer(cols = c(difference, ratio)) %>% # 1 mutate(name_vars = str_c(var, name, var2, sep = ".")) %>% # 2 select(name_vars, value) # 3
At the end of each line of code above is a number corresponding with the following explanations:
 if we had done just one operation, this step would not be needed, but we did multiple operations, created multiple listcolumns (
difference
andratio
) which we need to get into a single listcolumn  create new variable name that combines constituent variable names with name of transformation
 remove old columns
Next we simply apply the inverse of those operations performed in I. Nest and pivot.
new_features < new_features_prep2 %>% pivot_wider(values_from = value, names_from = name_vars) %>% unnest(cols = everything())
The new features will add a good number of columns onto our original dataset^{19}.
dim(new_features) ## [1] 2930 180
VI. Bind back to data
I then bind the new features back onto the original subsetted dataframe.
ames_data_features < bind_cols(ames_subset, new_features)
At which point I could do further exploring, feature engineering, model building, etc.
Functionalize
I put these steps into a few (unpolished) functions found at this gist^{20}.
devtools::source_gist("https://gist.github.com/brshallo/f92a5820030e21cfed8f823a6e1d56e1")
operations_combinations()
takes in your dataframe, the set of numeric columns to create pairwise combinations from, and a list of functions^{21} to apply.
Example creating & evaluating features
Let’s use the new operations_combinations()
function to create new columns for the differences and quotients between all pairwise combinations of our variables of interest.
ames_data_features_example < operations_combinations( df = mutate_if(ames, is.numeric, as.double), one_of(ames_cols), funs = list("/", ""), funs_names = list("ratio", "difference"), associative = FALSE )
Perhaps you want to calculate some measure of association between your features and a target of interest. To keep things simple, I’ll remove any columns that contain any NA’s or infinite values.
features_keep < ames_data_features_example %>% keep(is.numeric) %>% keep(~sum(is.na(.)  is.infinite(.)) == 0) %>% colnames()
Maybe, for some reason, you want to see the statistical significance of the correlation of each feature with Sale_Price
when weighting by Lot_Area
. I’ll calculate these across variables (and a random sample of 1500 observations) then plot them on a histogram.
set.seed(1234) ames_data_features_example %>% sample_n(1500) %>% summarise_at( .vars = features_keep[!(features_keep %in% c("Sale_Price", "Lot_Area"))], .funs = ~weights::wtd.cor(., Sale_Price, weight = Lot_Area)[1]) %>% gather() %>% # gather() is an older version of pivot_longer() w/ fewer parameters ggplot(aes(x = value))+ geom_vline(xintercept = 0, colour = "lightgray", size = 2)+ geom_histogram()+ scale_x_continuous(labels = scales::comma)+ labs(title = "Distribution of correlations with Sale_Price", subtitle = "Weighted by Lot Area", x = "Weighted correlation coefficient")
If doing predictive modeling or inference you may want to fit any transformations and analysis into a tidymodels
pipeline or other framework. For some brief notes on this see Interactions example, tidymodels.
When is this approach inappropriate?
Combinatorial growth is very fast^{22}. As you increase either the number of variables in your pool or the size of each set, you will quickly bump into computational limitations.
Tidyverse packages are optimized to be efficient. However operations with matrices or other specialized formats^{23} are generally faster^{24} than with dataframes/tibbles. If you are running into computational challenges but prefer to stick with a tidyverse aesthetic (which uses dataframes as a cornerstone), you might:
 Use heuristics to reduce the number of variables or operations you need to perform (e.g. take a sample, use a preliminary filter, a stepwise like iteration, etc.)
 Look for packages that abstract the storage and computationally heavy operations away^{25} and then return back an output in a convenient form^{26}
 Improve the efficiency of your code (e.g. filter redundancies before rather than after expanding combinations)^{27}
 Consider parralelizing
 Use matrices^{28}
There is sometimes an urge to do everything in a tidy way, which is not necessary. For example, you could use an approach like the one I walk through to calculate pairwise correlations between each of your variables. However, the cor()
function would do this much more efficiently if called on a matrix or traditional dataframe without listcolumns (though you could also use the corrr
package within the tidymodels
suite which calls cor()
in the backend^{29}).
However, for many operations…
 there may not be an efficient pairwise implementation available / accessible
 the slower computation may not matter or can be mitigated in some way
These situations^{30} are where the approach I walked through is most appropriate.
Appendix
Interactions example, tidymodels
A good example for creating and evaluating interaction terms^{31} is in The BruteForce Approach to Identifying Predictive Interactions, Simple Screening section of Max Kuhn and Kjell Johnson’s (free) online book “Feature Engineering and Selection: A Practical Approach for Predictive Models”.
The source code shows another approach for combining variables. The author uses…
combn()
to create all combinations of variable names which are then… turned into formulas and passed into
recipes::step_interact()
, specifying the new columns to be created^{32}…  for each interaction term…
 in each associated model being evaluated
The example uses a mix of packages and styles and is not a purely tidy approach – tidymodels
has also gone through a lot of development since “Feature Engineering and Selection…” was published in 2019^{33}. Section 11.2 on Greedy Search Methods, Simple Filters is also highly relevant.
Expand via join
You can take advantage of join^{34} behavior to create all possible row combinations. In this case, the output will be the same as shown when using expand()
(except row order will be different).
left_join(mutate(df_lists, id = 1), mutate(df_lists, id = 1) %>% rename_at(vars(one_of("id")), paste0, "2")) %>% select(id)
Nested tibbles
Creates list of tibbles rather than list of vectors – typically the first way lists as columns in dataframes is introduced.
ames_subset %>% pivot_longer(everything(), names_to = "var", values_to = "list") %>% group_by(var) %>% nest()
Pivot and then summarise
(Almost) equivalent to the example in I. Nest and pivot. Steps just run in a different order (row order will also be different).
ames_test %>% pivot_longer(cols = everything(), names_to = "var", values_to = "vector") %>% group_by(var) %>% summarise_all(list)
Session info
sessionInfo() ## R version 3.5.1 (20180702) ## Platform: x86_64w64mingw32/x64 (64bit) ## Running under: Windows 10 x64 (build 18363) ## ## Matrix products: default ## ## locale: ## [1] LC_COLLATE=English_United States.1252 ## [2] LC_CTYPE=English_United States.1252 ## [3] LC_MONETARY=English_United States.1252 ## [4] LC_NUMERIC=C ## [5] LC_TIME=English_United States.1252 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] forcats_0.4.0 stringr_1.4.0 dplyr_1.0.1 purrr_0.3.4 ## [5] readr_1.3.1 tidyr_1.1.1 tibble_3.0.3 ggplot2_3.3.2 ## [9] tidyverse_1.2.1 ## ## loaded via a namespace (and not attached): ## [1] fs_1.3.2 usethis_1.4.0 lubridate_1.7.4 ## [4] devtools_2.0.0 RColorBrewer_1.12 httr_1.4.0 ## [7] rprojroot_1.32 tools_3.5.1 backports_1.1.8 ## [10] utf8_1.1.4 R6_2.4.1 rpart_4.113 ## [13] Hmisc_4.11 colorspace_1.41 nnet_7.312 ## [16] withr_2.2.0 gridExtra_2.3 tidyselect_1.1.0 ## [19] prettyunits_1.1.1 processx_3.4.2 curl_3.3 ## [22] compiler_3.5.1 cli_2.0.2 rvest_0.3.4 ## [25] htmlTable_1.12 mice_3.8.0 xml2_1.2.0 ## [28] desc_1.2.0 labeling_0.3 bookdown_0.11 ## [31] checkmate_2.0.0 scales_1.1.1 corrr_0.4.2.9000 ## [34] callr_3.4.3 digest_0.6.25 foreign_0.870 ## [37] rmarkdown_1.13 base64enc_0.13 pkgconfig_2.0.3 ## [40] htmltools_0.5.0 sessioninfo_1.1.1 htmlwidgets_1.3 ## [43] rlang_0.4.7 readxl_1.3.1 rstudioapi_0.11 ## [46] farver_2.0.3 generics_0.0.2 jsonlite_1.6.1 ## [49] gtools_3.8.2 acepack_1.4.1 magrittr_1.5 ## [52] Formula_1.23 Matrix_1.214 Rcpp_1.0.4.6 ## [55] munsell_0.5.0 fansi_0.4.1 lifecycle_0.2.0 ## [58] weights_1.0.1 stringi_1.4.6 yaml_2.2.1 ## [61] pkgbuild_1.1.0 grid_3.5.1 gdata_2.18.0 ## [64] crayon_1.3.4 lattice_0.2035 haven_2.1.0 ## [67] splines_3.5.1 hms_0.5.2 knitr_1.29 ## [70] ps_1.3.2 pillar_1.4.6 pkgload_1.0.2 ## [73] glue_1.4.1 evaluate_0.14 blogdown_0.15 ## [76] latticeExtra_0.628 data.table_1.12.8 remotes_2.1.0 ## [79] modelr_0.1.4 vctrs_0.3.2 testthat_2.3.2 ## [82] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1 ## [85] xfun_0.16 broom_0.7.0 AmesHousing_0.0.3 ## [88] survival_2.423 memoise_1.1.0 cluster_2.0.71 ## [91] ellipsis_0.3.1

Will focus on twoway example in this post, but could use similar methods to make more generalizable solution across nway examples. If I were to do this, the code below would change. E.g.
 to use
pmap*()
operations overmap2*()
operations  I’d need to make some functions that make it so I can remove all the places where I have
var
andvar2
type column names hardcoded  Alternatively, I might shift approaches and make better use of
combn()
 to use

Though this “throw everything and the kitchensink” approach may not always be a good idea.↩

I’ve done this type of operation in a variety of ways. Sometimes without any really good reason as to why I used one approach or another. It isn’t completely clear (at least to me) the recommended way of doing these type of operations within the tidyverse – hence the diversity of my approaches in the past and deciding to document the typical steps in the approach I take… via writing this post.↩

Or the tidymodels implementation
corrr::correlate()
in the corrr package.↩ 
or is not in a style you prefer↩

I’ll also reference related approaches / small tweaks (putting those materials in the Appendix. This is by no means an exhaustive list (e.g. don’t have an example with a
for
loop or with a%do%
operator). The source code of my post on Ambiguous Absolute Value signs shows a related but more complex / messy approach on a combinatorics problem.↩ 
In particular, the chapters on “Iteration” and “Many Models” in R for Data Science. I would also recommend Rebecca Barter’s Learn to purrr blog post.↩

The new
dplyr
1.0.0. contains new functions that would have been potentially useful for several of these operations. I highly recommend checking these updates out in the various recent posts by Hadley Wickham. Some of the major updates (potentially relevant to the types of operations I’ll be discussing in my post): new approach for acrosscolumn operations (replacing
_at()
,_if()
,_all()
variants withacross()
function)  broughtback rowwise operations
 emphasize ability to output tibbles / multiple columns in core
dplyr
verbs. This is something I had only taken advantage of occassionally in the past (example), but will look to use more going forward.
 new approach for acrosscolumn operations (replacing

f I’d spotted this issue initially I’m not sure I would have written this post. However what this post offers is a more verbose treatment of the problem which may be useful for people newer to pairwise operations or the tidyverse.↩

For technical reasons, I also converted all integer types to doubles – was getting integer overflow problems in later operations before changing. Thread on integer overflow in R. In this post I’m not taking a disciplined approach to feature engineering. For example it may make sense to normalize the variables so that variable combinations would be starting on a similar scale. This could be done using
recipes::step_normalize()
or with code likedplyr::mutate_all(df, ~(.  mean(.)) / sd(.))
.↩ 
Note that this part of the problem is one where I actually find using
tidyr::gather()
easier – but I’ve been forcing myself to switch over to using thepivot_()
functions overspread()
andgather()
.↩ 
The more common approach.↩

If your variables are across rows you are likely concerned with getting summary metrics rather than creating new features – as if your data is across rows there is nothing guaranteeing you have the same number of observations or that they are linedup appropriately. If you are interested in creating new features, you should probably have first reshaped your data to ensure each column represented a variable.↩

As switching these would be more computationally efficient – see When is this approach inappropriate? for notes related to this. Switching the order here would suggest using approaches with the
combn()
function.↩ 
I.e. has the same output regardless of the order of the variables. E.g. multiplication or addition but not subtraction or division.↩

Function(s) that output vectors of length 1 (or less than length of input vectors).↩

Note that the pairwise implementation
psych::corr.test()
could have been used on your original subsetted dataframe, see stack overflow thread.↩ 
Function(s) that output vector of length equal to length of input vectors.↩

Did not print this output because clutteredup page with so many column names.↩

Steps I  III and V & VI are essentially direct copies of the code above. The approach I took with Step IV may take more effort to follow as it requires understanding a little
rlang
and could likely have been done more simply.↩ 
Must have two vectors as input, but do not need to be infix functions.↩

Nontechnical article discussing combinatorial explosion in context of company user growth targets: Exponential Growth Isn’t Cool. Combinatorial Growth Is..↩

E.g. data.table dataframes↩

Hence, if you are doing operations across combinations of lots of variables it may not make sense to do the operations directly within dataframes.↩

Much (if not most) of the
tidyverse
(and the R programming language generally) is about creating a smooth interface between the analyst/scientist and the backend complexity of the operations they are performing. Projects like sparklyr, DBI, reticulate, tidymodels, and brms (to name a few) represent cases where this interface role of R is most apparent.↩ 
For tidyverse packages, this is often returned into or in the form of a dataframe.↩

Could make better use of
combn()
function to help.↩ 
Depending on the complexity may just need to brushup on your linear algebra.↩

corrr
can also be used to run the operation on databases that may have larger data than you could fit on your computer.↩ 
Likely more common for many, if not most, analysts and data scientists.↩

I.e. multiplying two variables together↩

Created upon the recipe being baked or juiced – if you have not checked it out, recipes is AWESOME!↩

Maybe at a future date I’ll make a post writing out the example here using the newer approaches now available in
tidymodels
. Gist ofcombn_ttible()
… starting place for if I ever get to that writeup.↩ 
Could also have used
right_join()
orfull_join()
.↩
Rbloggers.com offers daily email updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/datascience job.
Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.