# Network Centrality in R: New ways of measuring Centrality

**schochastics**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is the third post of a series on the concept of “network centrality” with

applications in R and the package `netrankr`

. The last part introduced the concept of

neighborhood-inclusion and its implications for centrality. In this post, we

extend the concept to a broader class of dominance relations by deconstructing indices

into a series of building blocks and introduce new ways of evaluating centrality.

library(igraph) library(ggraph) library(tidyverse) library(netrankr)

# Introduction

Neighborhood-inclusion seems to underlie many different centrality indices and as such

serves (or better: could serve) as a defining property of a centrality index. That is:

An index is a measure of centrality if and only if it preserves the partial ranking induced by neighborhood-inclusion

While this gives as a theoretical basis for centrality, it comes with a bit of problem.

Namely, that we do not expect many comparable pairs in (large) real-world networks.

Take the third example network from the first post.

g3 <- readRDS("example_3.rds") P <- neighborhood_inclusion(g3) comparable_pairs(P)

## [1] 0.00600532

Only 0.6% of pairs are comparable. This means that

centrality indices are at liberty to rank 99.4% of pairs in any order. So, neighborhood-inclusion

may not be very restrictive in large networks. This is much like with *structural equivalence*.

It is a very intuitive concept for equivalences in networks (having exactly the same neighbors), however we do not expect many equivalent pairs in large networks. This does not render the concept useless, but requires some relaxations.

In the following, we introduce dominance relations, which incrementally extend the

partial ranking of neighborhood-inclusion and thus tighten the partial ranking

that indices preserve. We start by illustrating how indices can be decomposed into a series

of building blocks.

# Deconstructing Indices

Centrality indices assess structural importance based on a great variety of different

graph theoretic notions, like shortest paths (closeness) or walks (subgraph centrality).

Implicitly, though, they all follow a simple recipe:

- derive an indirect relation
- aggregate the indirect relations

As mentioned above, indirect relation are commonly derived via graph trajectories such as paths

and walks, for instance to compute distances. The aggregation is usually a simple summation of all relations of a node, but others are

possible too (e.g. \(\max\) as in eccentricity). In its most generic form, we can thus write a centrality index as

\[

c(i)= \sum_j \tau(x)_{ij}

\]

where \(x\) are the observed relations (basically the adjacency matrix) and \(\tau\)

a generic transformation function. Replacing \(\tau\) with \(id\), we obtain degree centrality

and setting \(\tau=dist\), we obtain closeness. A suitable function \(\tau\) can be

defined for all centrality indices, such that any index can basically be seen as degree in an

appropriately transformed network.

The package `netrankr`

provides a set of 24 different indirect relations that can be used

to construct indices. A few common examples are given below.

#closeness g %>% indirect_relations(type='dist_sp') %>% aggregate_positions(type='invsum') #betweenness g %>% indirect_relations(type='depend_sp') %>% aggregate_positions(type='sum') #eigenvector g %>% indirect_relations(type='walks',FUN=walks_limit_prop) %>% aggregate_positions(type='sum') #subgraph g %>% indirect_relations(type='walks',FUN=walks_exp) %>% aggregate_positions(type='self')

(Consult the help for `indirect_relations()`

to see all options)

Note that we make use of the `%>%`

operator. This should appeal to the recipe

idea from above: `network %>% indirect_relation %>% aggregation`

. The package also

includes a handy RStudio addin, which can be used to build the pipelines more easily.

Defining indices in this way is certainly more cumbersome than using, say, `betweennes(g)`

.

However, it allows us to intervene at any step and *do something else*.

# Extended Dominance Relations

To illustrate the “something else”, we look at our small example network again.

g1 <- readRDS("example_1.rds")

Following the recipe, you have decided, that the *relations of interest* for your analysis

are the distances between nodes. The problem is, aggregating them into an index can still

be done in various ways. Three distance based centrality examples are shown below.

#classic closeness c_C <- g1 %>% indirect_relations(type="dist_sp") %>% aggregate_positions(type="invsum") #harmonic closeness c_HC <- g1 %>% indirect_relations(type="dist_sp",FUN=dist_inv) %>% aggregate_positions(type="sum") #residual-type closeness c_RC <- g1 %>% indirect_relations(type="dist_sp",FUN=dist_2pow) %>% aggregate_positions(type="sum")

Any of the above indices starts with the derivation of distances, but then proceeds with

a different form of aggregation:

\[

c_C(i)=\frac{1}{\sum dist(i,j)},\quad c_{HC}(i)=\sum\frac{1}{dist(i,j)}, \quad c_{RC}(i)=\sum 2^{-dist(i,j)}

\]

Possibilities are virtually endless for aggregating distances into an index.

From the previous part, we know that any of these indices preserve neighborhood-inclusion.

Once we have settled for a relation, as in this case, we can extend the partial ranking

using the following considerations: If a \(dist(i,k)\) is larger than \(dist(j,k)\) for all

nodes \(k\), then no matter how we aggregate the relations (as long as it is monotonic),

\(i\) will always be less central then \(j\). With a bit more formalism:

\[

dist(i,k) \geq dist(j,k) \text{ for all } k \implies c_x(i)\leq c_x(j)

\]

where \(c_x\) is an arbitrary centrality index based on distances. This actually defines

a new dominance relation among nodes. In fact, we can go a step further. It does

not really matter in which order we aggregate the distances, the result will always be the same.

Hence, we can permute all relations of a single node without affecting the result.

A convenient choice of permutation is simply to reorder all relations in descending

order. Afterwards, we can compute the dominance relations as above. (More formal details on these dominance relations can be found in this article.)

We can compute this new dominance relation using the function `positional_dominance()`

.

The `benefit`

parameter is set to `FALSE`

since large distances are not beneficial for

a node to have. Setting `map=TRUE`

invokes the above mentioned reordering. For comparison,

we also compute neighborhood-inclusion again.

P <- neighborhood_inclusion(g1) D <- g1 %>% indirect_relations(type="dist_sp") %>% positional_dominance(benefit=FALSE,map=TRUE) c("neighborhood-inclusion"=comparable_pairs(P),"distance dominance"=comparable_pairs(D))

## neighborhood-inclusion distance dominance ## 0.1636364 0.8727273

By fixing our relation of interest to distances and allowing reordering of relations, we

went from only 16% of comparable pairs to 87%! Hence, no matter what index based on distance we use,

results will always be very similar. As a sanity check, we can verify that all distance based indices from above preserve the dominance relations.

c("classic"=is_preserved(D,c_C),"harmonic"=is_preserved(D,c_HC),"residual"=is_preserved(D,c_RC))

## classic harmonic residual ## TRUE TRUE TRUE

# Partial Centrality

By now, we should have understood that there are various kinds of partial rankings in networks,

which form the basis of centrality. Indices extend these partial rankings into **one** possible ranking,

but, as we will see later, there might be hundreds of thousands of possible rankings. And hence, hundreds of

thousands of indices that produce these rankings. Instead of inventing hundreds of thousands of

indices, why not just study the partial rankings? Or why not be extremely bold, and try to

analyse **all** possible rankings at once?

In this section, we consider the former question, by introducing *rank intervals*. A rank interval

of a node is the set of ranks a node can have in any ranking that extends a given partial ranking.

Let us consider two extreme cases. A node that is neither dominated nor dominates any other node can potentially

have any rank. So its rank interval is \([1,n]\). (We use the convention, that \(n\) is the top rank and \(1\) the lowest possible rank).

On the other hand, if a node dominates all other nodes, it can only be ranked on the top. So its rank interval is just a point.

`netrankr`

includes the function `rank_intervals()`

which returns the rank intervals for all nodes in the network.

A visual representation of the intervals can be obtained with the `plot_rank_intervals()`

function, as done below for the

first example network and neighborhood-inclusion as partial ranking input. We also included an optional `data.frame`

containing centrality scores of five indices.

cent_scores <- data.frame( degree=degree(g1), betweenness=betweenness(g1), closeness=closeness(g1), eigen=eigen_centrality(g1)$vector, subgraph=subgraph.centrality(g1)) plot_rank_intervals(P,cent.df = cent_scores,ties.method = "average")

The rank intervals are extremely big for this network. Node 10, for instance can take any possible rank.

The most constraint interval is that of node 1, containing 6 possible ranks. The rank intervals can be interpreted as

a sort of confidence interval for centrality. The larger the interval, the less explanatory power

a single index may have Again, consider Node 10. It is the most central node according to subgraph centrality, but ranks very low in

betweenness.

We have learned that we can extend neighborhood-inclusion by choosing a relation of interest as

basis for our analysis. For the example network, we considered distances.

The below figure shows the resulting rank intervals.

cent_scores <- data.frame( classic=c_C, harmonic=c_HC, residual=c_RC) plot_rank_intervals(D,cent.df = cent_scores)

Notice how much smaller they got. The intervals of node 1 and 2 even collapse into a single point.

They will thus always be ranked at the bottom in any distance based centrality ranking.

Rank intervals are a convenient choice to assess the possibilities of rankings in a network.

It is important to understand, though, that the ranks in each interval do not occur with

uniform probability. A rank interval \([6,7]\) does not mean that the node is ranked 6th in 50%

of all possible rankings. We address the *rank probabilities* in the next section.

# Probabilistic Centrality

A node ranking can be defined as a mapping

\[rk: V \to \{1,\ldots,n\},\]

where we use the convention that \(u\) is the top ranked node if \(rk(u)=n\) and the

bottom ranked one if \(rk(u)=1\). The set of all possible rankings can then be characterized as

\[

\mathcal{R}(\leq)=\{rk:V \to \{1,\ldots,n\}\; : \; u\leq v \implies rk(u)\leq rk(v)\}.

\]

This set contains all rankings that could be obtained for centrality indices that preserve the

partial ranking of a dominance relation “\(\leq\)”.

Once \(\mathcal{R}(\leq)\) is calculated, it can be used for a probabilistic assessment of centrality,

analyzing all possible rankings at once. Examples include *relative rank probabilities*

(How likely is it, that a node \(u\) is more central than another node \(v\)?) or

*expected ranks* (How central do we expect a node \(u\) to be).

`netrankr`

includes the function `exact_rank_prob()`

, which helps to answer the above posted questions.

We stick with our small example network and apply the function to both, neighborhood-inclusion

and dominance based on distances.

probNI <- exact_rank_prob(P) probD <- exact_rank_prob(D)

The function returns a large list of different outputs, which we discuss in the following.

The number of possible rankings is stored in `lin.ext`

.

c("neighborhood-inclusion"=probNI$lin.ext,"distances"=probD$lin.ext)

## neighborhood-inclusion distances ## 739200 20

So, for this tiny network, there are still more than 700,000 possibilities to rank the nodes differently.

If we restrict ourselves to distances, we end up with only 20.

The rank probabilities (for example how likely is it that node \(u\) is ranked on top?)

are stored in the matrix `rank.prob`

. We here focus on the probability to be the most central node.

top_rank <- ncol(probNI$rank.prob) probNI$rank.prob[,11]

## V1 V2 V3 V4 V5 V6 ## 0.00000000 0.00000000 0.00000000 0.13636364 0.16363636 0.10909091 ## V7 V8 V9 V10 V11 ## 0.10909091 0.13636364 0.09090909 0.09090909 0.16363636

Node 5 and 11 have the highest probability to be the most central node. You can think of the

probabilities as follows: If we would apply thousands of indices to the network, in 16% of the cases

will node 5 be the most central node.

Relative rank probabilities (How likely is it that \(u\) is less central than \(v\)?) are stored in the matrix `relative.rank`

.

round(probNI$relative.rank,2)

## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 ## V1 0.00 0.67 1.00 0.95 1.00 1.00 1.00 0.95 0.86 0.86 1.00 ## V2 0.33 0.00 0.67 1.00 0.92 0.83 0.83 1.00 0.75 0.75 0.92 ## V3 0.00 0.33 0.00 0.80 1.00 0.75 0.75 0.80 0.64 0.64 1.00 ## V4 0.05 0.00 0.20 0.00 0.56 0.44 0.44 0.50 0.38 0.38 0.56 ## V5 0.00 0.08 0.00 0.44 0.00 0.38 0.38 0.44 0.32 0.32 0.50 ## V6 0.00 0.17 0.25 0.56 0.62 0.00 0.50 0.56 0.43 0.43 0.62 ## V7 0.00 0.17 0.25 0.56 0.62 0.50 0.00 0.56 0.43 0.43 0.62 ## V8 0.05 0.00 0.20 0.50 0.56 0.44 0.44 0.00 0.38 0.38 0.56 ## V9 0.14 0.25 0.36 0.62 0.68 0.57 0.57 0.62 0.00 0.50 0.68 ## V10 0.14 0.25 0.36 0.62 0.68 0.57 0.57 0.62 0.50 0.00 0.68 ## V11 0.00 0.08 0.00 0.44 0.50 0.37 0.37 0.44 0.32 0.32 0.00

For example, the probability that node 2 is less central than node 1 is \(0.33\).

The closer a probability to \(0.5\) (see node 11 and 5), the less reason exists to put either node on top of the other.

The last return values of interest are the expected ranks and the standard deviation in `expected.rank`

and `rank.spread`

.

The expected ranks can be seen as a sort of baseline ranking. Applying hundreds of random indices, this is the ranking we could expect to get on average.

exp_rk <- round(probNI$expected.rank,2) sd_rk <- round(probNI$rank.spread,2) tibble(nodes=1:11,expected=exp_rk,sd=sd_rk) %>% ggplot(aes(x=reorder(nodes,expected)))+ geom_segment(aes(y=expected-sd,yend=expected+sd,xend=reorder(nodes,expected)))+ geom_point(aes(y=expected))+ theme_minimal()+ labs(y="Rank",x="Node")

The standard deviations are quite large for neighborhood-inclusion, which was to be expected from the big rank

intervals. The below figure shows the expected ranks for the distance based dominance.

exp_rk <- round(probD$expected.rank,2) sd_rk <- round(probD$rank.spread,2) tibble(nodes=1:11,expected=exp_rk,sd=sd_rk) %>% ggplot(aes(x=reorder(nodes,expected)))+ geom_segment(aes(y=expected-sd,yend=expected+sd,xend=reorder(nodes,expected)))+ geom_point(aes(y=expected))+ theme_minimal()+ labs(y="Rank",x="Node")

As a word of warning: The problem of finding all possible rankings for a partial ranking is

computationally a hard problem. So it is advisable to use `exact_rank_prob()`

only for small

networks. Some benchmark results and approximation methods for larger networks can be found here.

# Summary

After this post, it is time to take stock of what we have done so far.

To date, putting it bluntly, network centrality is nothing more than the application of indices

to a network:

The only degree of freedom is the choice of index and it is hard to justify choices

without resorting to data-driven reasoning, as in “We used betweenness because it worked best”.

The introduced neighborhood-inclusion and more specific

dominance concepts allow for additional ways of analyzing centrality in networks,

described in this superficial diagram.

Any centrality analysis starts with identifying the *relation of interest*, which

replaces the choice of index. The relation of interest is usually some graph-theoretic

property (e.g. distances) that we assume to be indicative for centrality. The relations

of a node to all others is called its *position*. The aggregation of the relations leads

to the definition of indices, hence the usual centrality concept. However, positions can also

be compared via *positional dominance*, the dominance relation introduced in this post, leading to partial centrality rankings and

the option to calculate probabilistic centrality rankings.

So far, we have mostly been toying around with small contrived networks. The final post

will illustrate the introduced methodology by means of a more realistic application example.

**leave a comment**for the author, please follow the link and comment on their blog:

**schochastics**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.