
Investigating Ranks, Monotonicity, and Spearman’s Rho with R


The Problem

I have a bunch of data that can be categorized into many small groups. Each small group has a set of values for an ordered set of intervals. Having observed that the values for most groups seem to increase with the order of the interval, I hypothesize that there is a statistically significant, monotonically increasing trend.

An Analogy

To make this abstract problem more relatable, imagine the following scenario.

There are many companies (groups) selling products in an industry that is on the verge of widespread growth. Each company sets a projection (not a goal) for end-of-year sales (value) and adjusts this projection once per quarter (i.e. four times a year) (interval) after analyzing its to-date sales. Having observed that most companies that follow this scheme tend to increase their projections for next-quarter sales (based on an initial, beginning-of-the-year projection that is too low), the market analyst (me) wonders if there is a non-trivial, positive trend across the quarters of the year. 1

To summarize how the “variables” in this analogy relate to my abstract description of the situation: the companies are the groups, the end-of-year sales projections are the values, and the quarters are the intervals.

(Actually, there is one more variable in this context, given that I am interested in relative value—the rank of the value, relative to the other values.)

And, to make the example concrete, here is what the data (for four companies) might look like.

| company | q1 | q2 | q3 | q4 |
|---------|------:|------:|------:|------:|
| A | 3.6 | 8.5 | 9.2 | 9.4 |
| B | -0.73 | 0.04 | 0.28 | 0.47 |
| C | 6604 | 4631 | 6604 | 7217 |
| D | -49 | -9 | 400 | 87 |
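
(If you'd like to follow along in R, the toy data above could be entered as a tibble. This is just a sketch of mine; the object name data_example is not used later in the post.)

library("tibble")

# The toy data from the table above.
data_example <- tribble(
  ~company,   ~q1,   ~q2,   ~q3,   ~q4,
  "A",        3.6,   8.5,   9.2,   9.4,
  "B",      -0.73,  0.04,  0.28,  0.47,
  "C",       6604,  4631,  6604,  7217,
  "D",        -49,    -9,   400,    87
)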

(Bad) Solution: Linear Regression

At first thought, running a univariate linear regression might seem like a good way of attacking this problem. However, there are a couple of basic “gotchas” that make ordinary linear regression a not-so-great idea for this scenario:

Aside from these caveats, the value for a given interval is not relevant—rather, its relationship with all other values is, and, more specifically, its relationships with the previous and subsequent values. 2

(Better) Solution: Spearman’s Rho

Given the nature of the data (which one might say is non-linear) and my intent to quantify ordinality between two variables, it turns out that Spearman’s rho, in theory, provides exactly the measurement that I want: it quantifies the association between paired samples using the ranks of the variables (not their values) relative to their samples. Notably, its statistical significance (i.e. a p-value) can be calculated (traditionally, using a t-test), which should be handy for my goal of identifying non-triviality.
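
(As a quick illustration of what Spearman’s rho is doing, consider the values for company A from the earlier table. This is a sketch of mine, not part of the original analysis: Spearman’s rho is simply Pearson’s correlation computed on ranks.)

value <- c(3.6, 8.5, 9.2, 9.4) # company A's quarterly projections
quarter <- 1:4                 # the (ordered) quarters

cor(value, quarter, method = "spearman")      # Spearman's rho: 1
cor(rank(value), rank(quarter))               # the same thing, via Pearson's r on the ranks
cor.test(value, quarter, method = "spearman") # rho = 1, with a p-value of ~0.083 for n = 4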

Nonetheless, even though this metric seems promising, it will certainly be sensitive to the small sample size of each group (assuming that it is calculated for each group). Don’t believe me? Check out how the Spearman’s rho values change for the numeric columns in the built-in iris dataset (which has 150 rows) when they are calculated for just the first 10 rows.
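
(Here is a minimal way to see this for yourself, using only base R.)

# Spearman's rho for the numeric columns of iris, using all 150 rows...
cor(iris[, 1:4], method = "spearman")
# ...versus only the first 10 rows; the estimates shift quite a bit.
cor(head(iris[, 1:4], 10), method = "spearman")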

Another Solution: Custom Heuristic

So, what can be done? Well, even with the promise that Spearman’s rho holds for quantification and significance inference, I thought that I would try to create some kind of easily understandable heuristic that I could explain to someone else without having to delve into statistical theory. Of course, it would be remiss of me not to compare (and validate) the results of my heuristic with those of statistical theory after creating it.

Having this plan in mind, I began to think about how I would define my heuristic, which, in essence, tries to quantify monotonicity. But what exactly constitutes monotonicity? Surprisingly, that’s a more complex question than it might seem.

For my purposes, I don’t necessarily care if the set of values is strictly increasing or decreasing, but they should be “sufficiently” increasing or decreasing. For example, while it is clear that the sequence 1, 2, 3, 4 is strictly monotonic (increasing) and the sequence 1, 2, 4, 3 is not, I would consider the latter “sufficiently” monotonic. On the other hand, I would not consider something like 4, 2, 3, 1 sufficiently monotonic, because the 1 and 4 are “badly” misplaced in one another’s appropriate places, which are at the extreme ends of the sequence. Moreover, if I were intent on identifying increasing monotonicity (as opposed to decreasing monotonicity), I would consider 4, 3, 2, 1 “bad”, even though it is strictly monotonically decreasing. But what about something like 1, 4, 3, 2 (again, assuming that I am evaluating increasing monotonicity)? Even though the 2 and 4 are swapped, I might still consider this sequence “sufficiently” monotonic because the 1 and 3 are placed correctly and the 2 and 4 are “not too far apart”. Anyways, it’s easy to see how having some kind of formal definition/calculation/criteria for monotonicity is handy.
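
(For contrast with this “sufficient” notion, strict monotonicity is trivial to check. The helper below is just a sketch of mine for the increasing case.)

# TRUE only when every element is strictly greater than the previous one.
is_strictly_increasing <- function(x) all(diff(x) > 0)

is_strictly_increasing(c(1, 2, 3, 4)) # TRUE
is_strictly_increasing(c(1, 2, 4, 3)) # FALSE, yet "sufficiently" monotonic to me
is_strictly_increasing(c(4, 2, 3, 1)) # FALSE, and not "sufficient" either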

The “Algorithm”

After some thinking, I came up with the following algorithm (if one can even call it that).

(NOTE: I explicitly list the variable names that I use in the code that follows to help the reader understand the relationships between these steps and the implementation.)

  1. Given an n-length sequence of arbitrary values, assign each value an integer between 1 and n to represent its “actual” rank. This rank is assigned based on relative value in the set of real numbers. 3
  2. Create a matrix of all permutations of actual rank and “assigned” rank. To be clear, this “assigned” rank is independent of the actual rank and value. To keep things straightforward, these assigned ranks use the same set of possible rank values dictated by the actual ranks (i.e. integers between 1 and n).
  3. Calculate the absolute difference between the “actual” and “assigned” ranks for each value in the sequence. Subtract this distance from the maximum rank value in the sequence. The resulting value is what I call the “inverse monotonic distance” (mntc_distinv in the following code).
  4. Repeat the calculation of inverse monotonic distance for all groups (grp) of “actual” (y0) and “assigned” (x0) ranks.
  5. Sum up the inverse monotonic distance for each value in the permutation group and take the average of this sum for each group. 4 Re-scale this per-group value to a 0-to-1 basis.
  6. Identify any group (grp) corresponding to a summed-averaged-re-scaled value (mntc_distinv) in the upper 50% quantile of all values (i.e. assigned the value "(0.5,1]" for the mntc_tier2 variable) as “sufficiently” monotonic. (The mntc_tier2 variable can be interpreted as my heuristic.)

Notably, even though the result set is split at the 50% threshold (which is a subjective choice), this does not mean that 50% of all possible groups are classified as “sufficiently” monotonic. (According to my method, only 33% are for n = 4.)
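
(Before getting to the implementation, here is a tiny worked example of the distance calculation for a single group. Note that the implementation below also weights each element’s inverse distance by its assigned rank x0; the numbers here reproduce the values printed later for the group x1234y1243.)

x0 <- 1:4           # "assigned" ranks
y0 <- c(1, 2, 4, 3) # "actual" ranks (the 3 and 4 are swapped)

max(x0) - abs(x0 - y0)              # per-element inverse distances: 4 4 3 3
x0 * (max(x0) - abs(x0 - y0))       # weighted by x0, as in the code below: 4 8 9 12
mean(x0 * (max(x0) - abs(x0 - y0))) # the per-group value before re-scaling: 8.25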

Implementing the Heuristic

Ok, that is enough discussion. What follows is the implementation.

NOTE: In order to keep the focus on how the code implements the methodology, I recommend reviewing the code but not worrying too much about the details (such as the internal workings of my custom functions). Rather, I’d recommend inspecting in detail only the parts that are printed out (and going back later to understand the complexities, if curious).

library("dplyr")
library("ggplot2")
library("tidyr")
# These packages are also used, but their functions are called explicitly.
# library("purrr")
# library("broom")
# library("combinat")
# library("rlang")

The only choice that I need to make to begin is the length of the set of values (i.e. n), which should be a “small” integer. I’ll choose 4, simply because 3 seems like it is “too small” and because the subsequent visualization(s) become “cluttered” (and interpretation becomes less direct) if a number 5 or greater is chosen.

The following code chunk corresponds to steps 1 and 2 in my methodology, which are basically just set-up steps.

# Create all pairings of "assigned" (x) and "actual" (y) rank permutations for a
# sequence of length n, returned in long format (one row per element per pairing).
create_permns <- function(n = 1L) {
  
  n_seq <- seq(1L, n, by = 1L)
  
  combs <-
    combinat::permn(n_seq) %>%
    purrr::map( ~ paste(.x, collapse = "")) %>%
    unlist() %>%
    as.integer()
  data_xy <-
    tibble(grp_x = combs, grp_y = combs) %>%
    expand(grp_x, grp_y) %>%
    mutate(grp = paste0("x", grp_x, "y", grp_y))
  
  into_seq <- seq(1L, n, by = 1L) %>% as.character()
  sep_seq <- seq(1L, n - 1L, by = 1L)
  wrangle_data_xy <-
    function(data = NULL, which = NULL) {
      col_grp <- rlang::sym(paste0("grp_", which))
      col_0 <- rlang::sym(paste0(which, "0"))
      data %>%
        separate(!!col_grp, into_seq, sep = sep_seq, remove = FALSE) %>%
        gather(idx, !!col_0, matches("^[0-9]$")) %>%
        mutate_at(vars(idx, !!col_0), funs(as.integer))
    }
  inner_join(data_xy %>% wrangle_data_xy("x"),
               data_xy %>% wrangle_data_xy("y")) %>%
    select(-idx) %>%
    arrange(grp_x, grp_y)
}

n <- 4L
data_permns <- create_permns(n = n)
data_permns

| grp_x | grp_y | grp | x0 | y0 |
|-------|-------|-----|----|----|
| 1234 | 1234 | x1234y1234 | 1 | 1 |
| 1234 | 1234 | x1234y1234 | 2 | 2 |
| 1234 | 1234 | x1234y1234 | 3 | 3 |
| 1234 | 1234 | x1234y1234 | 4 | 4 |
| 1234 | 1243 | x1234y1243 | 1 | 1 |
| 1234 | 1243 | x1234y1243 | 2 | 2 |
| 1234 | 1243 | x1234y1243 | 3 | 4 |
| 1234 | 1243 | x1234y1243 | 4 | 3 |
| 1234 | 1324 | x1234y1324 | 1 | 1 |
| 1234 | 1324 | x1234y1324 | 2 | 3 |

(# of total rows: 2304)

Now, I implement the initial calculation of “inverse monotonic distance” (mntc_distinv).

add_mntc_cols <- function(data = NULL) {
  data %>%
    group_by(grp) %>%
    arrange(x0, .by_group = TRUE) %>% 
    # Flag elements that continue a running max or min (i.e. a monotonic run).
    mutate(mntc = ifelse((y0 == cummax(y0)) | (y0 == cummin(y0)), 1L, 0L)) %>% 
    # "Inverse monotonic distance", weighted by the assigned rank x0.
    mutate(mntc_distinv = as.integer(x0 * (max(x0) - abs(x0 - y0)))) %>% 
    ungroup()
}
data_mntc <- add_mntc_cols(data_permns)
data_mntc

| grp_x | grp_y | grp | x0 | y0 | mntc | mntc_distinv |
|-------|-------|-----|----|----|------|-------------:|
| 1234 | 1234 | x1234y1234 | 1 | 1 | 1 | 4 |
| 1234 | 1234 | x1234y1234 | 2 | 2 | 1 | 8 |
| 1234 | 1234 | x1234y1234 | 3 | 3 | 1 | 12 |
| 1234 | 1234 | x1234y1234 | 4 | 4 | 1 | 16 |
| 1234 | 1243 | x1234y1243 | 1 | 1 | 1 | 4 |
| 1234 | 1243 | x1234y1243 | 2 | 2 | 1 | 8 |
| 1234 | 1243 | x1234y1243 | 3 | 4 | 1 | 9 |
| 1234 | 1243 | x1234y1243 | 4 | 3 | 0 | 12 |
| 1234 | 1324 | x1234y1324 | 1 | 1 | 1 | 4 |
| 1234 | 1324 | x1234y1324 | 2 | 3 | 1 | 6 |

(# of total rows: 2304)

Next is the calculation of the transformed (i.e. summed-averaged-re-scaled) version of the “inverse monotonic distance” (mntc_distinv), as well as the split of the mntc_distinv into upper and lower 50% quantiles (mntc_tier2).

# Re-scale a numeric vector to the range [0, 1].
unitize <- function(x = NULL) {
  (x - min(x)) / (max(x) - min(x))
}
summarise_mntc <- function(data = NULL) {
  data %>%
    group_by(grp) %>% 
    # Average the per-element inverse monotonic distances for each group...
    summarise_at(vars(mntc_distinv), funs(mean)) %>% 
    ungroup() %>% 
    # ...re-scale the per-group values to [0, 1]...
    mutate_at(vars(mntc_distinv), funs(unitize)) %>% 
    # ...and split them into lower and upper 50% tiers.
    mutate(mntc_tier2 = cut(mntc_distinv, 2))
}
summ_mntc <- summarise_mntc(data_mntc)
summ_mntc

Now, with the “algorithm” fully implemented, I can begin to evaluate the results.

Exactly how many values make up each 50% quantile?

| mntc_tier2 | n | n_pct |
|------------|----:|------:|
| (-0.001,0.5] | 384 | 66.67 |
| (0.5,1] | 192 | 33.33 |
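
(For reference, a tally like the one above can be computed along these lines; this is a sketch assuming the summ_mntc object created above.)

summ_mntc %>%
  count(mntc_tier2) %>%
  mutate(n_pct = 100 * n / sum(n))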

What does the distribution of all “inverse monotonic distance” values look like?

The positive identification (in yellow) of combinations along the left-to-right, lower-to-upper diagonal is what I would expect. These are the values where x0 and y0 are perfectly matched. Conversely, values along the other diagonal are generally colored in purple, as I would expect. These combinations consist of sequences of x0 and y0 values that are “negatively” symmetric (e.g. (1, 2, 3, 4) and (4, 3, 2, 1)).
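
(The heat map itself is not reproduced here, but something along these lines would re-create it; this is my own sketch, assuming the summ_mntc object from above and a viridis-like fill scale.)

summ_mntc %>%
  # Recover the x and y permutation labels from the grp identifier (e.g. "x1234y1243").
  separate(grp, into = c("grp_x", "grp_y"), sep = "y") %>%
  mutate(grp_x = gsub("^x", "", grp_x)) %>%
  ggplot(aes(x = grp_x, y = grp_y, fill = mntc_distinv)) +
  geom_tile() +
  scale_fill_viridis_c() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))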

Checking the Heuristic

Ok, my heuristic seems valid, but how can I know for sure that it is reasonable? I mentioned before that Spearman’s rho should serve as a good measure, so I’ll take a look at it now.

add_cortest_cols <- function(data = NULL) {
  data %>%
    group_by(grp) %>%
    nest() %>%
    # Run a Spearman's rho correlation test on the x0 and y0 ranks of each group.
    mutate(cortest =
             purrr::map(data, ~ broom::tidy(cor.test(.$x0, .$y0, method = "spearman")))
    ) %>%
    unnest(cortest, .drop = TRUE) %>%
    select(grp, estimate, p.value)
}
summarise_mntc_wcortest <- function(data = NULL) {
   summ <- summarise_mntc(data)
   data %>%
     add_cortest_cols() %>% 
     inner_join(summ, by = "grp")
}

summ_mntc_wcortest <- summarise_mntc_wcortest(data_mntc)
summ_mntc_wcortest

| grp | estimate | p.value | mntc_distinv | mntc_tier2 | n_pct |
|-----|---------:|--------:|-------------:|------------|------:|
| x1234y1234 | 1.0 | 0.08 | 1.00 | (0.5,1] | 100 |
| x1234y1243 | 0.8 | 0.33 | 0.67 | (0.5,1] | 100 |
| x1234y1324 | 0.8 | 0.33 | 0.76 | (0.5,1] | 100 |
| x1234y1342 | 0.4 | 0.75 | 0.38 | (-0.001,0.5] | 100 |
| x1234y1423 | 0.4 | 0.75 | 0.48 | (-0.001,0.5] | 100 |
| x1234y1432 | 0.2 | 0.92 | 0.43 | (-0.001,0.5] | 100 |
| x1234y2134 | 0.8 | 0.33 | 0.86 | (0.5,1] | 100 |
| x1234y2143 | 0.6 | 0.42 | 0.52 | (0.5,1] | 100 |
| x1234y2314 | 0.4 | 0.75 | 0.57 | (0.5,1] | 100 |
| x1234y2341 | -0.2 | 0.92 | 0.14 | (-0.001,0.5] | 100 |

(# of total rows: 576)

What exactly is the distribution of the Spearman’s rho estimates and p-values?

| abs(estimate) | p.value | n |
|--------------:|--------:|----:|
| 0.4 | 0.75 | 192 |
| 0.8 | 0.33 | 144 |
| 0.2 | 0.92 | 96 |
| 0.0 | 1.00 | 48 |
| 0.6 | 0.42 | 48 |
| 1.0 | 0.08 | 48 |


Now, to understand how the Spearman’s rho estimates and p.values correspond to my heuristic, I’ll simply overlay the combinations that are identified as significant onto my previous heat map of rank combinations. Because I’m erring on the side of flexibility in defining “sufficient” monotonicity, I’ll say that the pairs corresponding to the bottom two tiers of p-values (corresponding to 0.0833 and 0.33) constitute “sufficient” monotonicity.

![](viz_summ_mntc_wcortest-1.png)

It looks like there is a large amount of overlap between my heuristic classification of “sufficient” monotonicity and that identified by a more statistical approach.

Now I’ll repeat the simulation for other values of n. (Because computations start to become intensive with n = 6, and because the n = 2 case is relatively trivial, I’ll evaluate values of 3, 4, and 5 for n. 6)

ns <- tibble(n = 3L:5L)
# Repeat the whole pipeline (permutations, distances, summaries, correlation tests)
# for each value of n, using the nest-mutate-unnest idiom mentioned in footnote 6.
summ_mntc_byn <-
  ns %>% 
  mutate(data = purrr::map(n, ~(create_permns(.x) %>% add_mntc_cols()))) %>% 
  mutate(summ = purrr::map(data, summarise_mntc_wcortest)) %>% 
  unnest(summ, .drop = TRUE) %>% 
  ungroup() %>% 
  arrange(n)

What is the breakdown of mntc_tier2 values?

| n | mntc_tier2 | nn | nn_pct |
|---|------------|-----:|------:|
| 3 | (-0.001,0.5] | 24 | 66.7 |
| 3 | (0.5,1] | 12 | 33.3 |
| 4 | (-0.001,0.5] | 384 | 66.7 |
| 4 | (0.5,1] | 192 | 33.3 |
| 5 | (-0.001,0.5] | 9720 | 67.5 |
| 5 | (0.5,1] | 4680 | 32.5 |

What about the distribution of mntc_distinv values? And of the estimates and p.values?

| n | abs(estimate) | p.value | mntc_tier2 | nn | nn_pct |
|---|--------------:|--------:|------------|-----:|------:|
| 3 | 0.5 | 1.0 | (-0.001,0.5] | 18 | 50.0 |
| 3 | 1.0 | 0.3 | (-0.001,0.5] | 6 | 16.7 |
| 3 | 0.5 | 1.0 | (0.5,1] | 6 | 16.7 |
| 3 | 1.0 | 0.3 | (0.5,1] | 6 | 16.7 |
| 4 | 0.0 | 1.0 | (-0.001,0.5] | 48 | 8.3 |
| 4 | 0.2 | 0.9 | (-0.001,0.5] | 72 | 12.5 |
| 4 | 0.4 | 0.8 | (-0.001,0.5] | 144 | 25.0 |
| 4 | 0.6 | 0.4 | (-0.001,0.5] | 24 | 4.2 |
| 4 | 0.8 | 0.3 | (-0.001,0.5] | 72 | 12.5 |
| 4 | 1.0 | 0.1 | (-0.001,0.5] | 24 | 4.2 |
| 4 | 0.2 | 0.9 | (0.5,1] | 24 | 4.2 |
| 4 | 0.4 | 0.8 | (0.5,1] | 48 | 8.3 |
| 4 | 0.6 | 0.4 | (0.5,1] | 24 | 4.2 |
| 4 | 0.8 | 0.3 | (0.5,1] | 72 | 12.5 |
| 4 | 1.0 | 0.1 | (0.5,1] | 24 | 4.2 |
| 5 | 0.0 | 1.0 | (-0.001,0.5] | 600 | 4.2 |
| 5 | 0.1 | 1.0 | (-0.001,0.5] | 1800 | 12.5 |
| 5 | 0.2 | 0.8 | (-0.001,0.5] | 1440 | 10.0 |
| 5 | 0.3 | 0.7 | (-0.001,0.5] | 1680 | 11.7 |
| 5 | 0.4 | 0.5 | (-0.001,0.5] | 600 | 4.2 |

(# of total rows: 36)

The distributions are sparse due to the relatively small number of unique values for each metric (mntc_distinv, p.value, etc.). 7 Consequently, it is a bit difficult to extract much meaningful insight about the relationships among the metrics. To really understand how the distributions and relationships scale with larger values of n, mathematical theory would need to be applied.

Nonetheless, without jumping more into statistical theory, it seems to me that the identification of rank combinations as significant by my heuristic classification and by Spearman’s rho (assuming that one uses the traditional p-value-below-a-threshold approach) would become more dissimilar as the value of n increases. This is because my classification simply splits all possible values into two sets for any value of n, meaning that the percentage of all possible combinations classified as “sufficiently” monotonic is relatively insensitive to the value of n. 8 On the other hand, the Spearman’s rho p-values would become more refined with larger values of n.

Anyways, I believe that my heuristic serves my purposes well. I only really intended it to be used for small values of n. Also, I intended to create a “relaxed” definition of monotonicity, so having only a very small percentage of all possible rank combinations meet the criteria would have actually been undesirable.

Conclusion

In the end, I think I did more work than I really needed to do to answer my original question about quantifying monotonicity and inferring significance, but I think, in all, this was a worthwhile exploration.



  1. The market analyst does not necessarily hypothesize why annual projections tend to be high (i.e. perhaps due to over-confidence) ^
  2. Additionally, prediction is not the concern—rather, quantification of trend is. (While regression certainly can help with trend identification, its capability to create predictions is perhaps its better use.) ^
  3. In reality, the rank values could also be any arbitrary value on the real number scale. ^
  4. A sum of sums (instead of an average of sums) could be used here and the subsequent results would not change. ^
  5. Here, there are only n + 1 (i.e. 5) unique abs(estimate)s and p.values. This result is not generally true. (For example, when choosing n = 5, there will be more than 6 unique values of each metric.) ^
  6. This presents a good opportunity to implement a version of the “nest-mutate-unnest” idiom that can be very effective for creating many models. The “many models” chapter in the R For Data Science book provides an excellent example of this process. ^
  7. Unfortunately this is due to the nature of the data and the simulation, so nothing can be done about it. ^
  8. Note that the 33% number found for n = 4 is not generally true, although this percentage does not seem to change drastically with different values of n. ^
