Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is the third post of a series on the concept of “network centrality” with applications in R and the package netrankr. The last part introduced the concept of neighborhood-inclusion and its implications for centrality. In this post, we extend the concept to a broader class of dominance relations by deconstructing indices into a series of building blocks and introduce new ways of evaluating centrality.

library(igraph)
library(ggraph)
library(tidyverse)
library(netrankr)

# Introduction

Neighborhood-inclusion seems to underlie many different centrality indices and as such serves (or better: could serve) as a defining property of a centrality index. That is:

An index is a measure of centrality if and only if it preserves the partial ranking induced by neighborhood-inclusion

While this gives as a theoretical basis for centrality, it comes with a bit of problem. Namely, that we do not expect many comparable pairs in (large) real-world networks. Take the third example network from the first post.

g3 <- readRDS("example_3.rds")
P <- neighborhood_inclusion(g3)
comparable_pairs(P)
## [1] 0.00600532

Only 0.6% of pairs are comparable. This means that centrality indices are at liberty to rank 99.4% of pairs in any order. So, neighborhood-inclusion may not be very restrictive in large networks. This is much like with structural equivalence. It is a very intuitive concept for equivalences in networks (having exactly the same neighbors), however we do not expect many equivalent pairs in large networks. This does not render the concept useless, but requires some relaxations.

In the following, we introduce dominance relations, which incrementally extend the partial ranking of neighborhood-inclusion and thus tighten the partial ranking that indices preserve. We start by illustrating how indices can be decomposed into a series of building blocks.

# Deconstructing Indices

Centrality indices assess structural importance based on a great variety of different graph theoretic notions, like shortest paths (closeness) or walks (subgraph centrality). Implicitly, though, they all follow a simple recipe:

• derive an indirect relation
• aggregate the indirect relations

As mentioned above, indirect relation are commonly derived via graph trajectories such as paths and walks, for instance to compute distances. The aggregation is usually a simple summation of all relations of a node, but others are possible too (e.g. $$\max$$ as in eccentricity). In its most generic form, we can thus write a centrality index as $c(i)= \sum_j \tau(x)_{ij}$ where $$x$$ are the observed relations (basically the adjacency matrix) and $$\tau$$ a generic transformation function. Replacing $$\tau$$ with $$id$$, we obtain degree centrality and setting $$\tau=dist$$, we obtain closeness. A suitable function $$\tau$$ can be defined for all centrality indices, such that any index can basically be seen as degree in an appropriately transformed network.

The package netrankr provides a set of 24 different indirect relations that can be used to construct indices. A few common examples are given below.

#closeness
g %>%
indirect_relations(type='dist_sp') %>%
aggregate_positions(type='invsum')

#betweenness
g %>%
indirect_relations(type='depend_sp') %>%
aggregate_positions(type='sum')

#eigenvector
g %>%
indirect_relations(type='walks',FUN=walks_limit_prop) %>%
aggregate_positions(type='sum')

#subgraph
g %>%
indirect_relations(type='walks',FUN=walks_exp) %>%
aggregate_positions(type='self')

(Consult the help for indirect_relations() to see all options) Note that we make use of the %>% operator. This should appeal to the recipe idea from above: network %>% indirect_relation %>% aggregation. The package also includes a handy RStudio addin, which can be used to build the pipelines more easily.

Defining indices in this way is certainly more cumbersome than using, say, betweennes(g). However, it allows us to intervene at any step and do something else.

# Extended Dominance Relations

To illustrate the “something else”, we look at our small example network again.

g1 <- readRDS("example_1.rds")

Following the recipe, you have decided, that the relations of interest for your analysis are the distances between nodes. The problem is, aggregating them into an index can still be done in various ways. Three distance based centrality examples are shown below.

#classic closeness
c_C <- g1 %>%
indirect_relations(type="dist_sp") %>%
aggregate_positions(type="invsum")

#harmonic closeness
c_HC <- g1 %>%
indirect_relations(type="dist_sp",FUN=dist_inv) %>%
aggregate_positions(type="sum")

#residual-type closeness
c_RC <- g1 %>%
indirect_relations(type="dist_sp",FUN=dist_2pow) %>%
aggregate_positions(type="sum")

Any of the above indices starts with the derivation of distances, but then proceeds with a different form of aggregation: $c_C(i)=\frac{1}{\sum dist(i,j)},\quad c_{HC}(i)=\sum\frac{1}{dist(i,j)}, \quad c_{RC}(i)=\sum 2^{-dist(i,j)}$ Possibilities are virtually endless for aggregating distances into an index. From the previous part, we know that any of these indices preserve neighborhood-inclusion. Once we have settled for a relation, as in this case, we can extend the partial ranking using the following considerations: If a $$dist(i,k)$$ is larger than $$dist(j,k)$$ for all nodes $$k$$, then no matter how we aggregate the relations (as long as it is monotonic), $$i$$ will always be less central then $$j$$. With a bit more formalism: $dist(i,k) \geq dist(j,k) \text{ for all } k \implies c_x(i)\leq c_x(j)$ where $$c_x$$ is an arbitrary centrality index based on distances. This actually defines a new dominance relation among nodes. In fact, we can go a step further. It does not really matter in which order we aggregate the distances, the result will always be the same. Hence, we can permute all relations of a single node without affecting the result. A convenient choice of permutation is simply to reorder all relations in descending order. Afterwards, we can compute the dominance relations as above. (More formal details on these dominance relations can be found in this article.)

We can compute this new dominance relation using the function positional_dominance(). The benefit parameter is set to FALSE since large distances are not beneficial for a node to have. Setting map=TRUE invokes the above mentioned reordering. For comparison, we also compute neighborhood-inclusion again.

P <- neighborhood_inclusion(g1)
D <- g1 %>%
indirect_relations(type="dist_sp") %>%
positional_dominance(benefit=FALSE,map=TRUE)

c("neighborhood-inclusion"=comparable_pairs(P),"distance dominance"=comparable_pairs(D))
## neighborhood-inclusion     distance dominance
##              0.1636364              0.8727273

By fixing our relation of interest to distances and allowing reordering of relations, we went from only 16% of comparable pairs to 87%! Hence, no matter what index based on distance we use, results will always be very similar. As a sanity check, we can verify that all distance based indices from above preserve the dominance relations.

c("classic"=is_preserved(D,c_C),"harmonic"=is_preserved(D,c_HC),"residual"=is_preserved(D,c_RC))
##  classic harmonic residual
##     TRUE     TRUE     TRUE

# Partial Centrality

By now, we should have understood that there are various kinds of partial rankings in networks, which form the basis of centrality. Indices extend these partial rankings into one possible ranking, but, as we will see later, there might be hundreds of thousands of possible rankings. And hence, hundreds of thousands of indices that produce these rankings. Instead of inventing hundreds of thousands of indices, why not just study the partial rankings? Or why not be extremely bold, and try to analyse all possible rankings at once?

In this section, we consider the former question, by introducing rank intervals. A rank interval of a node is the set of ranks a node can have in any ranking that extends a given partial ranking. Let us consider two extreme cases. A node that is neither dominated nor dominates any other node can potentially have any rank. So its rank interval is $$[1,n]$$. (We use the convention, that $$n$$ is the top rank and $$1$$ the lowest possible rank). On the other hand, if a node dominates all other nodes, it can only be ranked on the top. So its rank interval is just a point.

netrankr includes the function rank_intervals() which returns the rank intervals for all nodes in the network. A visual representation of the intervals can be obtained with the plot_rank_intervals() function, as done below for the first example network and neighborhood-inclusion as partial ranking input. We also included an optional data.frame containing centrality scores of five indices.

cent_scores <- data.frame(
degree=degree(g1),
betweenness=betweenness(g1),
closeness=closeness(g1),
eigen=eigen_centrality(g1)$vector, subgraph=subgraph.centrality(g1)) plot_rank_intervals(P,cent.df = cent_scores,ties.method = "average") The rank intervals are extremely big for this network. Node 10, for instance can take any possible rank. The most constraint interval is that of node 1, containing 6 possible ranks. The rank intervals can be interpreted as a sort of confidence interval for centrality. The larger the interval, the less explanatory power a single index may have Again, consider Node 10. It is the most central node according to subgraph centrality, but ranks very low in betweenness. We have learned that we can extend neighborhood-inclusion by choosing a relation of interest as basis for our analysis. For the example network, we considered distances. The below figure shows the resulting rank intervals. cent_scores <- data.frame( classic=c_C, harmonic=c_HC, residual=c_RC) plot_rank_intervals(D,cent.df = cent_scores) Notice how much smaller they got. The intervals of node 1 and 2 even collapse into a single point. They will thus always be ranked at the bottom in any distance based centrality ranking. Rank intervals are a convenient choice to assess the possibilities of rankings in a network. It is important to understand, though, that the ranks in each interval do not occur with uniform probability. A rank interval $$[6,7]$$ does not mean that the node is ranked 6th in 50% of all possible rankings. We address the rank probabilities in the next section. # Probabilistic Centrality A node ranking can be defined as a mapping $rk: V \to \{1,\ldots,n\},$ where we use the convention that $$u$$ is the top ranked node if $$rk(u)=n$$ and the bottom ranked one if $$rk(u)=1$$. The set of all possible rankings can then be characterized as $\mathcal{R}(\leq)=\{rk:V \to \{1,\ldots,n\}\; : \; u\leq v \implies rk(u)\leq rk(v)\}.$ This set contains all rankings that could be obtained for centrality indices that preserve the partial ranking of a dominance relation “$$\leq$$”. Once $$\mathcal{R}(\leq)$$ is calculated, it can be used for a probabilistic assessment of centrality, analyzing all possible rankings at once. Examples include relative rank probabilities (How likely is it, that a node $$u$$ is more central than another node $$v$$?) or expected ranks (How central do we expect a node $$u$$ to be). netrankr includes the function exact_rank_prob(), which helps to answer the above posted questions. We stick with our small example network and apply the function to both, neighborhood-inclusion and dominance based on distances. probNI <- exact_rank_prob(P) probD <- exact_rank_prob(D) The function returns a large list of different outputs, which we discuss in the following. The number of possible rankings is stored in lin.ext. c("neighborhood-inclusion"=probNI$lin.ext,"distances"=probD$lin.ext) ## neighborhood-inclusion distances ## 739200 20 So, for this tiny network, there are still more than 700,000 possibilities to rank the nodes differently. If we restrict ourselves to distances, we end up with only 20. The rank probabilities (for example how likely is it that node $$u$$ is ranked on top?) are stored in the matrix rank.prob. We here focus on the probability to be the most central node. top_rank <- ncol(probNI$rank.prob)
probNI$rank.prob[,11] ## V1 V2 V3 V4 V5 V6 ## 0.00000000 0.00000000 0.00000000 0.13636364 0.16363636 0.10909091 ## V7 V8 V9 V10 V11 ## 0.10909091 0.13636364 0.09090909 0.09090909 0.16363636 Node 5 and 11 have the highest probability to be the most central node. You can think of the probabilities as follows: If we would apply thousands of indices to the network, in 16% of the cases will node 5 be the most central node. Relative rank probabilities (How likely is it that $$u$$ is less central than $$v$$?) are stored in the matrix relative.rank. round(probNI$relative.rank,2)
##       V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  V11
## V1  0.00 0.67 1.00 0.95 1.00 1.00 1.00 0.95 0.86 0.86 1.00
## V2  0.33 0.00 0.67 1.00 0.92 0.83 0.83 1.00 0.75 0.75 0.92
## V3  0.00 0.33 0.00 0.80 1.00 0.75 0.75 0.80 0.64 0.64 1.00
## V4  0.05 0.00 0.20 0.00 0.56 0.44 0.44 0.50 0.38 0.38 0.56
## V5  0.00 0.08 0.00 0.44 0.00 0.38 0.38 0.44 0.32 0.32 0.50
## V6  0.00 0.17 0.25 0.56 0.62 0.00 0.50 0.56 0.43 0.43 0.62
## V7  0.00 0.17 0.25 0.56 0.62 0.50 0.00 0.56 0.43 0.43 0.62
## V8  0.05 0.00 0.20 0.50 0.56 0.44 0.44 0.00 0.38 0.38 0.56
## V9  0.14 0.25 0.36 0.62 0.68 0.57 0.57 0.62 0.00 0.50 0.68
## V10 0.14 0.25 0.36 0.62 0.68 0.57 0.57 0.62 0.50 0.00 0.68
## V11 0.00 0.08 0.00 0.44 0.50 0.37 0.37 0.44 0.32 0.32 0.00

For example, the probability that node 2 is less central than node 1 is $$0.33$$. The closer a probability to $$0.5$$ (see node 11 and 5), the less reason exists to put either node on top of the other.

The last return values of interest are the expected ranks and the standard deviation in expected.rank and rank.spread. The expected ranks can be seen as a sort of baseline ranking. Applying hundreds of random indices, this is the ranking we could expect to get on average.

exp_rk <- round(probNI$expected.rank,2) sd_rk <- round(probNI$rank.spread,2)
tibble(nodes=1:11,expected=exp_rk,sd=sd_rk) %>%
ggplot(aes(x=reorder(nodes,expected)))+
geom_segment(aes(y=expected-sd,yend=expected+sd,xend=reorder(nodes,expected)))+
geom_point(aes(y=expected))+
theme_minimal()+
labs(y="Rank",x="Node")

The standard deviations are quite large for neighborhood-inclusion, which was to be expected from the big rank intervals. The below figure shows the expected ranks for the distance based dominance.

exp_rk <- round(probD$expected.rank,2) sd_rk <- round(probD$rank.spread,2)
tibble(nodes=1:11,expected=exp_rk,sd=sd_rk) %>%
ggplot(aes(x=reorder(nodes,expected)))+
geom_segment(aes(y=expected-sd,yend=expected+sd,xend=reorder(nodes,expected)))+
geom_point(aes(y=expected))+
theme_minimal()+
labs(y="Rank",x="Node")

As a word of warning: The problem of finding all possible rankings for a partial ranking is computationally a hard problem. So it is advisable to use exact_rank_prob() only for small networks. Some benchmark results and approximation methods for larger networks can be found here.

# Summary

After this post, it is time to take stock of what we have done so far. To date, putting it bluntly, network centrality is nothing more than the application of indices to a network:

The only degree of freedom is the choice of index and it is hard to justify choices without resorting to data-driven reasoning, as in “We used betweenness because it worked best”.

The introduced neighborhood-inclusion and more specific dominance concepts allow for additional ways of analyzing centrality in networks, described in this superficial diagram.

Any centrality analysis starts with identifying the relation of interest, which replaces the choice of index. The relation of interest is usually some graph-theoretic property (e.g. distances) that we assume to be indicative for centrality. The relations of a node to all others is called its position. The aggregation of the relations leads to the definition of indices, hence the usual centrality concept. However, positions can also be compared via positional dominance, the dominance relation introduced in this post, leading to partial centrality rankings and the option to calculate probabilistic centrality rankings.

So far, we have mostly been toying around with small contrived networks. The final post will illustrate the introduced methodology by means of a more realistic application example.