# Predictability in Network Models

**Jonas Haslbeck - r**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Network models have become a popular way to abstract complex systems and gain insights into relational patterns among observed variables in almost any area of science. The majority of these applications focuses on analyzing the structure of the network. However, if the network is not directly observed (Alice and Bob are friends) but *estimated* from data (there is a relation between smoking and cancer), we can analyze – in addition to the network structure – the predictability of the nodes in the network. That is, we would like to know: how well can an arbitrarily picked node in the network predicted by all remaining nodes in the network?

Predictability is interesting for several reasons:

- It gives us an idea of how
*practically relevant*edges are: if node A is connected to many other nodes but these only explain, let’s say, only 1% of its variance, how interesting are the edges connected to A? - We get an indication of how to design an
*intervention*in order to achieve a change in a certain set of nodes and we can estimate how efficient the intervention will be - It tells us to which extent different parts of the network are
*self-determined or determined by other factors*that are not included in the network

In this blogpost, we use the R-package mgm to estimate a network model and compute node wise predictability measures for a dataset on Post Traumatic Stress Disorder (PTSD) symptoms of Chinese earthquake victims. We visualize the network model and predictability using the qgraph package and discuss how the combination of network model and node wise predictability can be used to design effective interventions on the symptom network.

## Load Data

We load the data which the authors made freely available:

The datasets contains complete responses to 17 PTSD symptoms of 344 individuals. The answer categories for the intensity of symptoms ranges from 1 ‘not at all’ to 5 ‘extremely’. The exact wording of all symptoms is in the paper of McNally and colleagues.

## Estimate Network Model

We estimate a Mixed Graphical Model (MGM), where we treat all variables as continuous-Gaussian variables. Hence we set the type of all variables to `type = 'g'`

and the number of categories for each variable to 1, which is the default for continuous variables `lev = 1`

:

For more info on how to estimate Mixed Graphical Models using the mgm package see this previous post or the mgm paper.

## Compute Predictability of Nodes

After estimating the network model we are ready to compute the predictability for each node. Node wise predictability (or error) can be easily computed, because the graph is estimated by taking each node in turn and regressing all other nodes on it. As a measure for predictability we pick the propotion of explained variance, as it is straight forward to interpret: 0 means the node at hand is not explained at all by other nodes in the nentwork, 1 means perfect prediction. We centered all variables before estimation in order to remove any influence of the intercepts. For a detailed description of how to compute predictions and to choose predictability measures, check out this preprint. In case there are additional variable types (e.g. categorical) in the network, we can choose an appropriate measure for these variables (e.g. % correct classification, see `?predict.mgm`

).

We calculated the percentage of variance explained in each of the nodes in the network. Next, we visualize the estimated network and discuss its structure in relation to explained variance.

## Visualize Network & Predictability

We provide the estimated weighted adjacency matrix and the node wise predictability measures as arguments to `qgraph()`

…

… and get the following network visualization:

[Click here for the original post with larger figures]

Each variable is represented by a node and the edges correspond to partial correlations, because in this dataset the MGM consists only of conditional Gaussian variables. The green color of the edges indicates that all partial correlations in this graph are positive, and the edge-width is proportional to the absolute value of the partial correlation. The blue pie chart behind the node indicates the predictability measure for each node (more blue = higher predictability).

We see that intrusive memories, traumatic dreams and flashbacks cluster together. Also, we observe that avoidance of thoughts (avoidth) about trauma interacts with avoidance of acitivies reminiscent of the trauma (avoidact) and that hypervigilant (hyper) behavior is related to feeling easily startled (startle). But there are also less obvious interactions, for instance between anger and concentration problems.

Now, if we would like to reduce sleep problems, the network model suggests to intervene on the variables anger and startle. But what the network structure does not tell us is *how much* we could possibly change sleep through the variables anger and startle. The predictability measure gives us an answer to this question: 53.1%. If the goal was to intervene on amnesia, we see that all adjacent nodes in the network explain only 32.7% of its variance. In addition, we see that there are many small edges connected to amnesia, suggesting that it is hard to intervene on amnesia via other nodes in the symptom network. Thus, one would possibly try to find additional variables that are not included in the network that interact with amnesia or try to intervene on amnesia directly.

## Limitations!

Of course, there are limitations to interpreting explained variance as predicted treatment outcome: first, we cannot know the causal direction of the edges, so any edge could point in one or both directions. However, if there is no edge, there is also no causal effect in any direction. Also, it is often reasonable to combine the network model with general knoweldge: for instance, it seems more likely that amnesia causes being upset than the other way around. Second, we estimated the model on cross-sectional data (each row is one person) and hence assume that all people are the same, which is an assumption that is always violated to some extent. To solve this problem we would need (many) repeated measurements of a single person, in order to estimate a model specific to that person. This also solves the first problem to some degree as we can use the direction of time as the direction of causality. One would then use models that predict all symptoms at time point t by all symptoms at an earlier time point, let’s say t-1. An example of such a model is the Vector Autoregressive (VAR) model.

## Compare Within vs. Out of Sample Predictability

So far we looked into how well we can predict nodes by all other nodes within our sample. But in most situations we are interested in the predictability of nodes in new, unseen data. In what follows, we compare the within sample predictability with the out of sample predictability.

We first split the data in two parts: a training part (60% of the data), which we use to estimate the network model and a test part, which we will use to compute predictability measures on:

Next, we estimate the network only on the training data and compute the predictability measure both on the training data and the test data:

We now look at the mean predictability over nodes for the training- and test dataset:

As to be expected, the explained variance is higher in the training dataset. This is because we fit the model to structure that is specific to the training data and is not present in the population (noise). Note that both means are lower than the mean we would get by taking the mean of the explained variances above, because we used less observation to estimate the model and hence have less power to detect edges.

While the explained variance values are lower in the test set, there is a strong correlation between the explained variance of a node in the training- and the test set

which means that if a node has high explained variance in the training set, it tends to also have a high explained variance in the test set.

**leave a comment**for the author, please follow the link and comment on their blog:

**Jonas Haslbeck - r**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.