Predictability in Network Models


Network models have become a popular way to abstract complex systems and gain insights into relational patterns among observed variables in almost any area of science. The majority of these applications focus on analyzing the structure of the network. However, if the network is not directly observed (Alice and Bob are friends) but estimated from data (there is a relation between smoking and cancer), we can analyze – in addition to the network structure – the predictability of the nodes in the network. That is, we would like to know: how well can an arbitrarily picked node in the network be predicted by all remaining nodes in the network?

Predictability is interesting for several reasons:

  1. It gives us an idea of how practically relevant edges are: if node A is connected to many other nodes but these explain, let's say, only 1% of its variance, how interesting are the edges connected to A?
  2. We get an indication of how to design an intervention in order to achieve a change in a certain set of nodes and we can estimate how efficient the intervention will be
  3. It tells us to which extent different parts of the network are self-determined or determined by other factors that are not included in the network

In this blogpost, we use the R-package mgm to estimate a network model and compute node-wise predictability measures for a dataset on Post Traumatic Stress Disorder (PTSD) symptoms of Chinese earthquake victims. We visualize the network model and predictability using the qgraph package and discuss how the combination of network model and node-wise predictability can be used to design effective interventions on the symptom network.

Load Data

We load the data which the authors made freely available:

data <- read.csv('http://psychosystems.org/wp-content/uploads/2014/10/Wenchuan.csv')
data <- na.omit(data)
p <- ncol(data)
dim(data)
[1] 344  17

The dataset contains complete responses to 17 PTSD symptoms from 344 individuals. The answer categories for the intensity of symptoms range from 1 'not at all' to 5 'extremely'. The exact wording of all symptoms can be found in the paper by McNally and colleagues.
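As a quick, optional sanity check (not part of the original analysis), we can verify that all 17 symptoms are indeed scored on the 1-5 scale:

# Check the observed response range per symptom (should be 1 to 5)
apply(data, 2, range)

# Frequency table for the first symptom (intrusion)
table(data[, 1])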

Estimate Network Model

We estimate a Mixed Graphical Model (MGM), where we treat all variables as continuous Gaussian variables. Hence we set the type of all variables to type = 'g' and the number of categories for each variable to 1, which is the default for continuous variables (lev = 1):

install.packages('mgm')
library(mgm)
fit_obj <- mgmfit(data = data, 
                  type = rep('g', p),   # treat all variables as Gaussian
                  lev = rep(1, p),      # continuous variables have 1 'level'
                  rule.reg = 'OR')      # OR-rule to combine the node-wise estimates

For more info on how to estimate Mixed Graphical Models using the mgm package see this previous post or the mgm paper.

Compute Predictability of Nodes

After estimating the network model we are ready to compute the predictability for each node. Node-wise predictability (or error) can easily be computed, because the graph is estimated by taking each node in turn and regressing it on all remaining nodes. As a measure of predictability we pick the proportion of explained variance, as it is straightforward to interpret: 0 means the node at hand is not explained at all by the other nodes in the network, 1 means perfect prediction. We centered all variables before estimation in order to remove any influence of the intercepts. For a detailed description of how to compute predictions and how to choose predictability measures, check out this preprint. In case there are additional variable types (e.g. categorical) in the network, we can choose an appropriate measure for these variables (e.g. % correct classification, see ?predict.mgm).

pred_obj <- predict(fit_obj, data, 
                    error.continuous = 'VarExpl')

pred_obj$error
    Variable Error ErrorType
1  intrusion 0.583   VarExpl
2     dreams 0.590   VarExpl
3      flash 0.513   VarExpl
4      upset 0.615   VarExpl
5    physior 0.601   VarExpl
6    avoidth 0.648   VarExpl
7   avoidact 0.626   VarExpl
8    amnesia 0.327   VarExpl
9    lossint 0.419   VarExpl
10   distant 0.450   VarExpl
11      numb 0.333   VarExpl
12    future 0.450   VarExpl
13     sleep 0.531   VarExpl
14     anger 0.483   VarExpl
15    concen 0.604   VarExpl
16     hyper 0.602   VarExpl
17   startle 0.605   VarExpl
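To illustrate the idea behind these numbers, we can regress a single symptom on all remaining symptoms with a plain linear regression and look at the R². Note that this is only a rough sketch: mgm uses lasso-regularized node-wise regressions, so the unregularized R² below will typically be somewhat higher than the 0.583 reported above for intrusion.

# Rough illustration of node-wise predictability (not the exact mgm computation):
# regress the first symptom (intrusion) on all remaining symptoms
y <- scale(data[, 1])        # centered/scaled response
X <- scale(data[, -1])       # centered/scaled predictors
lm_fit <- lm(y ~ X)
summary(lm_fit)$r.squared    # proportion of explained variance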

We calculated the proportion of variance explained for each node in the network. Next, we visualize the estimated network and discuss its structure in relation to explained variance.

Visualize Network & Predictability

We provide the estimated weighted adjacency matrix and the node-wise predictability measures as arguments to qgraph():

install.packages('qgraph')
library(qgraph)
figDir <- ''  # path to the directory where the figure should be saved
jpeg(paste0(figDir, 'McNallyNetwork.jpg'), width = 1500, height = 1500)
qgraph(fit_obj$wadj, # weighted adjacency matrix as input
       layout = 'spring', 
       pie = pred_obj$error$Error, # provide errors as input
       pieColor = rep('#377EB8',p),
       node.color = fit_obj$edgecolor,
       labels = colnames(data))
dev.off()

… and get the following network visualization:

[Figure: PTSD symptom network estimated with mgm; blue pie charts around the nodes show node-wise predictability (explained variance)]


Each variable is represented by a node and the edges correspond to partial correlations, because in this dataset the MGM consists only of conditional Gaussian variables. The green color of the edges indicates that all partial correlations in this graph are positive, and the edge width is proportional to the absolute value of the partial correlation. The blue pie chart behind each node indicates its predictability (more blue = higher predictability).

We see that intrusive memories, traumatic dreams and flashbacks cluster together. Also, we observe that avoidance of thoughts (avoidth) about the trauma interacts with avoidance of activities reminiscent of the trauma (avoidact), and that hypervigilant behavior (hyper) is related to feeling easily startled (startle). But there are also less obvious interactions, for instance between anger and concentration problems.
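To see which of these connections are strongest numerically, we can sort the entries of the weighted adjacency matrix. This is just a small sketch using the fit object from above, not part of the original post:

# List the strongest edges in the estimated network
wadj <- fit_obj$wadj
rownames(wadj) <- colnames(wadj) <- colnames(data)
edges <- which(upper.tri(wadj) & wadj != 0, arr.ind = TRUE)
edge_df <- data.frame(node1  = rownames(wadj)[edges[, 1]],
                      node2  = colnames(wadj)[edges[, 2]],
                      weight = wadj[edges])
head(edge_df[order(-abs(edge_df$weight)), ])   # six strongest edges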

Now, if we would like to reduce sleep problems, the network model suggests intervening on the variables anger and startle. But what the network structure does not tell us is how much we could possibly change sleep through the variables anger and startle. The predictability measure gives us an answer to this question: 53.1%. If the goal was to intervene on amnesia, we see that all adjacent nodes in the network explain only 32.7% of its variance. In addition, we see that there are many small edges connected to amnesia, suggesting that it is hard to intervene on amnesia via other nodes in the symptom network. Thus, one would possibly try to find additional variables that are not included in the network but interact with amnesia, or try to intervene on amnesia directly.
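One way to make this reasoning concrete is to put node strength (the sum of absolute edge weights attached to a node) next to the explained variance, so we can see which symptoms are both well connected and well predicted by their neighbors. Again, this is an illustrative sketch based on the objects created above:

# Compare node strength with explained variance per symptom
strength <- rowSums(abs(fit_obj$wadj))
overview <- data.frame(Variable = colnames(data),
                       Strength = round(strength, 2),
                       VarExpl  = pred_obj$error$Error)
overview[order(-overview$VarExpl), ]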

Limitations!

Of course, there are limitations to interpreting explained variance as predicted treatment outcome. First, we cannot know the causal direction of the edges, so any edge could point in one or both directions. However, if there is no edge, there is also no causal effect in any direction. Also, it is often reasonable to combine the network model with general knowledge: for instance, it seems more likely that amnesia causes being upset than the other way around. Second, we estimated the model on cross-sectional data (each row is one person) and hence assume that all people are the same, an assumption that is always violated to some extent. To solve this problem we would need (many) repeated measurements of a single person, in order to estimate a model specific to that person. This also solves the first problem to some degree, as we can use the direction of time as the direction of causality. One would then use models that predict all symptoms at time point t by all symptoms at an earlier time point, let's say t-1. An example of such a model is the Vector Autoregressive (VAR) model.
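To give an idea of what such a model looks like, here is a minimal sketch of a lag-1 VAR estimated with plain node-wise regressions. It assumes a hypothetical data frame ts_data containing repeated measurements of a single person on the same 17 symptoms, with rows ordered in time; no such data is used in this post.

# Minimal sketch of a lag-1 VAR via node-wise regressions (hypothetical data)
lag1_var <- function(ts_data) {
  X_lag <- as.matrix(ts_data[-nrow(ts_data), ])   # predictors: symptoms at t-1
  Y     <- as.matrix(ts_data[-1, ])               # outcomes:   symptoms at t
  # one regression per symptom; returns a (p+1) x p matrix of intercepts and lagged effects
  sapply(seq_len(ncol(Y)), function(j) coef(lm(Y[, j] ~ X_lag)))
}
# coefs <- lag1_var(ts_data)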

Compare Within vs. Out of Sample Predictability

So far we looked into how well we can predict nodes by all other nodes within our sample. But in most situations we are interested in the predictability of nodes in new, unseen data. In what follows, we compare the within-sample predictability with the out-of-sample predictability.

We first split the data into two parts: a training part (60% of the data), which we use to estimate the network model, and a test part, which we will use to compute predictability measures on:

set.seed(1)
ind <- sample(c(TRUE, FALSE), prob = c(.6, .4), size = nrow(data), replace = TRUE) # divide data into two parts (60% training set, 40% test set)

Next, we estimate the network only on the training data and compute the predictability measure both on the training data and the test data:

fit_obj_cv <- mgmfit(data = data[ind,], 
                    type = rep('g', p),
                    lev = rep(1, p),
                    rule.reg = 'OR')

pred_obj_train <- predict(fit_obj_cv, data[ind,], error.continuous = 'VarExpl')  # compute predictions on training data (60%)
pred_obj_test <- predict(fit_obj_cv, data[!ind,], error.continuous = 'VarExpl')  # compute predictions on test data (40%)

We now look at the mean predictability over nodes for the training and the test dataset:

mean(pred_obj_train$error$Error) # mean explained variance training data
[1] 0.5384118

mean(pred_obj_test$error$Error) # mean explained variance test data
[1] 0.4494118

As is to be expected, the explained variance is higher in the training dataset. This is because we fit the model to structure that is specific to the training data and not present in the population (noise). Note that both means are lower than the mean of the explained variances reported above, because we used fewer observations to estimate the model and hence have less power to detect edges.

While the explained variance values are lower in the test set, there is a strong correlation between the explained variance of a node in the training set and in the test set:

cor(pred_obj_train$error$Error, pred_obj_test$error$Error)
[1] 0.8018155

which means that if a node has high explained variance in the training set, it tends to also have a high explained variance in the test set.
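A quick way to see this relationship is to plot the per-node explained variance in the test set against the training set; the following sketch uses the objects created above:

# Plot explained variance per node: training vs. test set
plot(pred_obj_train$error$Error, pred_obj_test$error$Error,
     xlab = 'Explained variance (training set)',
     ylab = 'Explained variance (test set)',
     pch = 19, xlim = c(0, 1), ylim = c(0, 1))
abline(0, 1, lty = 2)   # identity line: points below it are predicted worse out of sample
text(pred_obj_train$error$Error, pred_obj_test$error$Error,
     labels = pred_obj_train$error$Variable, pos = 3, cex = 0.7)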
