Boundary Conditions and Anatomy – Exploring Correlated Data Simulation in R
Today we have been assigned the task of identifying boundary conditions for a benchtop durability test of an implantable, artificial heart valve. In other words, we need to identify credible parameters for a physical test such that our test engineers can challenge the device under severe but realistic geometries and loads. To facilitate this task our clinical team has analyzed images and extracted pressure measurements and geometric features for a sample of n=300 patients. The rest of this post explores what we should do with these data.
For the sake of simplicity, assume that the three parameters we care about are the ellipticity of the vessel cross section, the curvature of the vessel in the region of interest, and the blood pressure. Features such as these are important because they influence both the equilibrium geometry and the magnitude of forces acting on the implantable valve (in other words: the boundary conditions). The image below shows a schematic/example of ellipticity and vessel curvature in the LVOT and aortic valve annulus as observed in CT imaging.1
There are two main challenges when working with these data:
- How do we use our sample to simulate the full population?
- How do we use the simulated, full population to identify groups of interest and recommend boundary conditions for the test?
Here is our dataset.2 It turns out that each of these features can be well described by a lognormal distribution, and we will assume that this is confirmed by prior domain knowledge. Large pressures, large ellipticities, and a small radius of curvature are all bad because they represent more extreme geometries and/or loads that the implant must resist without migrating or fracturing.
As usual, we’re working in R and will lean on the tidyverse packages to accelerate things. I wasn’t intending for this post to go as long as it did, so I’m also offering this table of contents to show I bear no malice.
library(readxl)
library(knitr)
library(DiagrammeR)
library(fitdistrplus)
library(MASS)
library(ggrepel)
library(ks)
library(broom)
library(ggExtra)
library(GGally)
library(car)
library(rgl)
library(anySim)
library(tidyverse)
- Correlations in the Original Dataset
- Approach 1 – Manually Transform Everything to Normal
- Step 1 – Fit Distributions to Each Variable
- Step 2 – Transform all variables to normal
- Step 3 – Fit normal distributions to each transformed variable
- Step 4 – Draw joint distribution using mvrnorm() or equivalent function
- Step 5 – Back-transform simulated data to original distribution
- Step 6 – Evaluate parameters and marginal distributions of the back-transformed data
- Compare Original Data to Simulated Data
- Approach 2 – AnySim
- Using the Simulated Population to Define Desired Test Conditions and Groups of Interest
- Identify a percentage of worst-case patients relative to some value of interest
- Identify probability contours using ks package and kernel density estimation
- Density percentiles from kde estimate
- KDE estimates in the range of the variables
- Density Plot with Probability Contours in 2d
- Density Plot with Probability Contours in 3d
Start by reading in the data and taking a look at the format.
sample_data <- readRDS(file = "sim_anatomy_data.rds")

sample_data
## # A tibble: 300 x 3
##    ellip  curv pressure
##    <dbl> <dbl>    <dbl>
##  1  1.26  4.51     92.7
##  2  1.28  5.02    183.
##  3  1.29  4.03    154.
##  4  1.23  2.14    109.
##  5  1.13  3.67    124.
##  6  1.22  2.37    114.
##  7  1.10  3.06    113.
##  8  1.04  2.31    105.
##  9  1.11  5.31    115.
## 10  1.09  2.04    109.
## # ... with 290 more rows
As expected, 300 rows with our 3 features of interest.
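As a quick, optional sanity check on the lognormal claim (a sketch of my own, not part of the original analysis), fitdistrplus can fit one of the variables and show its diagnostic plots:

# optional sketch: eyeball the lognormal fit for one variable
ellip_check <- fitdist(sample_data$ellip, "lnorm")
plot(ellip_check)     # density, Q-Q, CDF, and P-P panels
gofstat(ellip_check)  # AIC/BIC and goodness-of-fit statistics

The same check can be repeated for curv and pressure; the post takes the lognormal form as given from domain knowledge, so this is just reassurance.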
Key Point:
Since we are asked to find a worst-case set of values, it might seem reasonable or tempting to extract the maximum value from each group (or maybe something like the 95th percentile) and report those values together as a conservative worst-case for ellipticity, curvature, and pressure. Our test engineers would then set up a benchtop test to challenge our prototype devices in those same conditions to see if they survive. The problem with this approach is that each row of data is from a specific patient, so the variables may be correlated. It could be that those severe, 95th percentile values for each variable never occur together in the same patient. If we choose them together, we would over-test the device and over-design the device, potentially setting the program way behind. We must instead look at the data as a joint distribution and investigate correlations and covariance among the variables.
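To make that concrete, here is a small sketch (mine, not from the original analysis) that counts how many of the 300 sampled patients are simultaneously at or beyond a severe 95th-percentile condition in all three features (recall that for curvature the severe direction is a small radius):

# sketch: how often do per-variable "worst-case" percentiles co-occur in one patient?
naive_wc <- sample_data %>%
  summarise(
    ellip_wc    = quantile(ellip, 0.95),    # large ellipticity is severe
    curv_wc     = quantile(curv, 0.05),     # small radius of curvature is severe
    pressure_wc = quantile(pressure, 0.95)  # high pressure is severe
  )

sample_data %>%
  filter(
    ellip >= naive_wc$ellip_wc,
    curv <= naive_wc$curv_wc,
    pressure >= naive_wc$pressure_wc
  ) %>%
  nrow()

With correlated features this count is typically zero or close to it, which is exactly the over-testing trap described above: the stacked worst-case patient may not exist.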
Let’s verify the shape of the distributions for each feature and see if there is any correlation between the variables:
ellip_curv_plt <- sample_data %>%
  ggplot(aes(x = ellip, y = curv)) +
  geom_point(alpha = .5) +
  labs(
    title = "Patient Data From n=300 Scans",
    subtitle = "Vessel Ellipticity and Vessel Curvature Joint Distribution",
    x = "Ellipticity",
    y = "Curvature (mm)"
  )

ellip_pressure_plt <- sample_data %>%
  ggplot(aes(x = ellip, y = pressure)) +
  geom_point(alpha = .5, color = "firebrick") +
  labs(
    title = "Patient Data From n=300 Scans",
    subtitle = "Vessel Ellipticity and Blood Pressure Joint Distribution",
    x = "Ellipticity",
    y = "Pressure (mm Hg)"
  )

curv_pressure_plt <- sample_data %>%
  ggplot(aes(x = curv, y = pressure)) +
  geom_point(alpha = .5, color = "limegreen") +
  labs(
    title = "Patient Data From n=300 Scans",
    subtitle = "Vessel Curvature and Blood Pressure Joint Distribution",
    x = "Curvature (mm)",
    y = "Pressure (mm Hg)"
  )

ellip_curv_mplt <- ggExtra::ggMarginal(ellip_curv_plt, type = "density", fill = "#2c3e50", alpha = .5)
ellip_pressure_mplt <- ggExtra::ggMarginal(ellip_pressure_plt, type = "density", fill = "firebrick", alpha = .5)
curv_pressure_mplt <- ggExtra::ggMarginal(curv_pressure_plt, type = "density", fill = "limegreen", alpha = .5)
Correlations in the Original Dataset
ggcorr() from the GGally package is very convenient for visualizing correlations.
sample_data %>%
  ggcorr(
    high = "#20a486ff",
    low = "#fde725ff",
    label = TRUE,
    hjust = .75,
    size = 3,
    label_size = 3,
    label_round = 3,
    nbreaks = 3
  ) +
  labs(
    title = "Correlation Matrix - n=300 Patient Set",
    subtitle = "Pearson Method Using Pairwise Observations"
  )
We see that there are some positive correlations in this dataset.
To build out the sample into a simulated population, we will fit distributions via maximum likelihood and use the fitted models to generate a large number of draws. If each variable were independent the job would be easy - just execute a few rlnorm() calls and bind the results together. The job is more challenging when the variables are correlated because they must be simulated jointly. I will show 2 approaches in the sections below. Note that the 2nd approach is more efficient, but it helped me to walk through the first one to understand the workflow. If you are impatient, skip ahead to the section on Approach 2.
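For contrast, here is a sketch (my own, for illustration) of that independent approach: the marginals come out right but the correlation structure is lost, which is why the rest of this post simulates the variables jointly.

# sketch: independent lognormal draws reproduce the marginals but not the correlations
set.seed(2015)
independent_sim <- tibble(
  ellip    = rlnorm(10000, mean(log(sample_data$ellip)), sd(log(sample_data$ellip))),
  curv     = rlnorm(10000, mean(log(sample_data$curv)), sd(log(sample_data$curv))),
  pressure = rlnorm(10000, mean(log(sample_data$pressure)), sd(log(sample_data$pressure)))
)

cor(independent_sim)  # off-diagonal correlations near 0 instead of ~0.2-0.4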
Approach 1 - Manually Transform Everything to Normal
For cases where each variable is normal or can be easily transformed to normal, there is a straightforward workflow to generate a simulated joint distribution using the mvrnorm() function from the MASS package:
In our case, we can easily convert our data from lognormal to normal and will then be able to use the mvrnorm() function to draw from a multivariate normal distribution and then undo the transformation later to recover a simulated population with desired correlations (as shown above). From there we can identify patients of interest, whether they be extreme challenging cases or a central, common group.
Per the workflow above, start by fitting the native data to lognormal distributions using fitdist() and extract the parameters. Storing all the parameters as objects is a bit tedious and I only do it here so we can make a nice summary table of everything at the end.
Step 1 - Fit Distributions to Each Variable
ellip_fit <- fitdist(sample_data$ellip, "lnorm")
curv_fit <- fitdist(sample_data$curv, "lnorm")
pressure_fit <- fitdist(sample_data$pressure, "lnorm")

# store lognormal parameters of original data
ellip_meanlog <- ellip_fit$estimate[["meanlog"]]
ellip_sdlog <- ellip_fit$estimate[["sdlog"]]
curv_meanlog <- curv_fit$estimate[["meanlog"]]
curv_sdlog <- curv_fit$estimate[["sdlog"]]
pressure_meanlog <- pressure_fit$estimate[["meanlog"]]
pressure_sdlog <- pressure_fit$estimate[["sdlog"]]

# store correlations in original data
cor_ec <- cor(x = sample_data$ellip, y = sample_data$curv)
cor_ep <- cor(x = sample_data$ellip, y = sample_data$pressure)
cor_cp <- cor(x = sample_data$curv, y = sample_data$pressure)

# store covariances in original data
cov_ellip_curv <- cov(x = sample_data$ellip, y = sample_data$curv)
cov_ellip_ellip <- cov(x = sample_data$ellip, y = sample_data$ellip)
cov_curv_curv <- cov(x = sample_data$curv, y = sample_data$curv)
cov_ellip_pressure <- cov(x = sample_data$ellip, y = sample_data$pressure)
cov_pressure_pressure <- cov(x = sample_data$pressure, y = sample_data$pressure)
cov_curv_pressure <- cov(x = sample_data$curv, y = sample_data$pressure)

# summarize the parameters and reshape a bit
original_data_param_tbl <- tibble(
  ellip_meanlog = ellip_meanlog,
  ellip_sdlog = ellip_sdlog,
  curv_meanlog = curv_meanlog,
  curv_sdlog = curv_sdlog,
  pressure_meanlog = pressure_meanlog,
  pressure_sdlog = pressure_sdlog,
  ellip_curv_correlation = cor_ec,
  ellip_pressure_correlation = cor_ep,
  curv_pressure_correlation = cor_cp,
  ellip_ellip_covariance = cov_ellip_ellip,
  ellip_curv_covariance = cov_ellip_curv,
  curv_curv_covariance = cov_curv_curv,
  ellip_pressure_covariance = cov_ellip_pressure,
  pressure_pressure_covariance = cov_pressure_pressure,
  curv_pressure_covariance = cov_curv_pressure
) %>%
  pivot_longer(cols = everything(), names_to = "feature", values_to = "value") %>%
  mutate(dataset = "original_data") %>%
  mutate_if(is.character, as_factor)

# View summary table of original data
original_data_param_tbl %>% kable(align = "c", digits = 3)
feature | value | dataset |
---|---|---|
ellip_meanlog | 0.193 | original_data |
ellip_sdlog | 0.064 | original_data |
curv_meanlog | 1.158 | original_data |
curv_sdlog | 0.309 | original_data |
pressure_meanlog | 4.783 | original_data |
pressure_sdlog | 0.191 | original_data |
ellip_curv_correlation | 0.268 | original_data |
ellip_pressure_correlation | 0.369 | original_data |
curv_pressure_correlation | 0.213 | original_data |
ellip_ellip_covariance | 0.006 | original_data |
ellip_curv_covariance | 0.022 | original_data |
curv_curv_covariance | 1.157 | original_data |
ellip_pressure_covariance | 0.659 | original_data |
pressure_pressure_covariance | 530.683 | original_data |
curv_pressure_covariance | 5.285 | original_data |
Step 2 - Transform all variables to normal
A simple log operation brings the lognormal variable to normal.
# transform original, lognormal data to normal
normal_sample_data <- sample_data %>%
  mutate(
    n_ellip = log(ellip),
    n_curv = log(curv),
    n_pressure = log(pressure)
  )

normal_sample_data %>%
  head() %>%
  kable(align = "c", digits = 3)
ellip | curv | pressure | n_ellip | n_curv | n_pressure |
---|---|---|---|---|---|
1.255 | 4.506 | 92.739 | 0.228 | 1.505 | 4.530 |
1.285 | 5.019 | 182.970 | 0.251 | 1.613 | 5.209 |
1.289 | 4.027 | 153.858 | 0.254 | 1.393 | 5.036 |
1.234 | 2.139 | 108.669 | 0.210 | 0.760 | 4.688 |
1.133 | 3.673 | 123.633 | 0.125 | 1.301 | 4.817 |
1.219 | 2.373 | 113.944 | 0.198 | 0.864 | 4.736 |
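As an optional check (a sketch; the post doesn't require it), the transformed columns should now look approximately normal:

# optional sketch: normality check on one of the transformed columns
qqnorm(normal_sample_data$n_curv)
qqline(normal_sample_data$n_curv)
shapiro.test(normal_sample_data$n_curv)

A roughly straight Q-Q plot and an unremarkable Shapiro-Wilk result are consistent with the log transform having done its job.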
Step 3 - Fit normal distributions to each transformed variable
We don’t actually have to formally fit normal distributions since it is convenient to obtain the mean and standard deviation at any time using the mean() or sd() functions. But we will extract and store correlations and covariances for the simulation to come.
# get correlations of transformed, normal data
ncor_ec <- cor(x = normal_sample_data$n_ellip, y = normal_sample_data$n_curv)
ncor_ep <- cor(x = normal_sample_data$n_ellip, y = normal_sample_data$n_pressure)
ncor_cp <- cor(x = normal_sample_data$n_curv, y = normal_sample_data$n_pressure)

# get covariances of transformed, normal data
n_cov_ellip_curv <- cov(x = normal_sample_data$n_ellip, y = normal_sample_data$n_curv)
n_cov_ellip_ellip <- cov(x = normal_sample_data$n_ellip, y = normal_sample_data$n_ellip)
n_cov_curv_curv <- cov(x = normal_sample_data$n_curv, y = normal_sample_data$n_curv)
n_cov_ellip_pressure <- cov(x = normal_sample_data$n_ellip, y = normal_sample_data$n_pressure)
n_cov_pressure_pressure <- cov(x = normal_sample_data$n_pressure, y = normal_sample_data$n_pressure)
n_cov_curv_pressure <- cov(x = normal_sample_data$n_curv, y = normal_sample_data$n_pressure)
Step 4 - Draw joint distribution using mvrnorm() or equivalent function
Time to actually draw the correlated values. I store them here in an object called mult_norm.
# draw from multivariate normal with parameters from transformed normal distributions and correlation
set.seed(0118)

mult_norm <- as_tibble(MASS::mvrnorm(
  10000,
  c(
    mean(normal_sample_data$n_ellip),
    mean(normal_sample_data$n_curv),
    mean(normal_sample_data$n_pressure)
  ),
  matrix(c(
    n_cov_ellip_ellip, n_cov_ellip_curv, n_cov_ellip_pressure,
    n_cov_ellip_curv, n_cov_curv_curv, n_cov_curv_pressure,
    n_cov_ellip_pressure, n_cov_curv_pressure, n_cov_pressure_pressure
  ), 3, 3)
)) %>%
  rename(
    n_ellip_sim = V1,
    n_curv_sim = V2,
    n_pressure_sim = V3
  )
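As a side note (an alternative I find convenient, not the post's method above), the mean vector and covariance matrix can also be built directly from the transformed columns, which avoids writing out each pairwise covariance by hand:

# sketch: equivalent, more compact inputs for mvrnorm()
norm_cols <- normal_sample_data %>% select(n_ellip, n_curv, n_pressure)

mult_norm_alt <- as_tibble(
  MASS::mvrnorm(10000, mu = colMeans(norm_cols), Sigma = cov(norm_cols))
)

cov() on the three selected columns returns the same 3x3 covariance matrix assembled manually above, with row and column names carried along.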
Step 5 - Back-transform simulated data to original distribution
Exponentiating the data brings it back to lognormal.
# convert back to lognormal
log_norm <- mult_norm %>%
  mutate(
    ellip_sim = exp(n_ellip_sim),
    curv_sim = exp(n_curv_sim),
    pressure_sim = exp(n_pressure_sim)
  )

log_norm %>%
  head() %>%
  kable(align = "c", digits = 3)
n_ellip_sim | n_curv_sim | n_pressure_sim | ellip_sim | curv_sim | pressure_sim |
---|---|---|---|---|---|
0.254 | 1.600 | 5.248 | 1.290 | 4.952 | 190.266 |
0.233 | 1.038 | 5.107 | 1.262 | 2.823 | 165.178 |
0.236 | 1.152 | 4.812 | 1.266 | 3.165 | 123.018 |
0.313 | 1.003 | 5.048 | 1.368 | 2.727 | 155.636 |
0.224 | 1.622 | 5.192 | 1.251 | 5.066 | 179.912 |
0.197 | 1.486 | 4.822 | 1.218 | 4.422 | 124.185 |
Step 6 - Evaluate parameters and marginal distributions of the back-transformed data
# evaluate the marginal distributions of the simulated data
ellip_sim_fit <- fitdistrplus::fitdist(log_norm$ellip_sim, "lnorm")
curv_sim_fit <- fitdistrplus::fitdist(log_norm$curv_sim, "lnorm")
pressure_sim_fit <- fitdistrplus::fitdist(log_norm$pressure_sim, "lnorm")
Obtain and store the correlation, covariances, and parameters of simulated set:
# get correlation and covariances of simulated data
sim_cor_ec <- cor(x = log_norm$ellip_sim, y = log_norm$curv_sim)
sim_cor_ep <- cor(x = log_norm$ellip_sim, y = log_norm$pressure_sim)
sim_cor_cp <- cor(x = log_norm$curv_sim, y = log_norm$pressure_sim)

sim_cov_ellip_curv <- cov(x = log_norm$ellip_sim, y = log_norm$curv_sim)
sim_cov_ellip_ellip <- cov(x = log_norm$ellip_sim, y = log_norm$ellip_sim)
sim_cov_curv_curv <- cov(x = log_norm$curv_sim, y = log_norm$curv_sim)
sim_cov_ellip_pressure <- cov(x = log_norm$ellip_sim, y = log_norm$pressure_sim)
sim_cov_pressure_pressure <- cov(x = log_norm$pressure_sim, y = log_norm$pressure_sim)
sim_cov_curv_pressure <- cov(x = log_norm$curv_sim, y = log_norm$pressure_sim)

# store parameters of simulated data
ellip_sim_meanlog <- ellip_sim_fit$estimate[["meanlog"]]
ellip_sim_sdlog <- ellip_sim_fit$estimate[["sdlog"]]
curv_sim_meanlog <- curv_sim_fit$estimate[["meanlog"]]
curv_sim_sdlog <- curv_sim_fit$estimate[["sdlog"]]
pressure_sim_meanlog <- pressure_sim_fit$estimate[["meanlog"]]
pressure_sim_sdlog <- pressure_sim_fit$estimate[["sdlog"]]

# collect parameters from simulated data
sim_data_param_tbl <- tibble(
  ellip_meanlog = ellip_sim_meanlog,
  ellip_sdlog = ellip_sim_sdlog,
  curv_meanlog = curv_sim_meanlog,
  curv_sdlog = curv_sim_sdlog,
  pressure_meanlog = pressure_sim_meanlog,
  pressure_sdlog = pressure_sim_sdlog,
  ellip_curv_correlation = sim_cor_ec,
  ellip_pressure_correlation = sim_cor_ep,
  curv_pressure_correlation = sim_cor_cp,
  ellip_curv_covariance = sim_cov_ellip_curv,
  ellip_ellip_covariance = sim_cov_ellip_ellip,
  curv_curv_covariance = sim_cov_curv_curv,
  ellip_pressure_covariance = sim_cov_ellip_pressure,
  pressure_pressure_covariance = sim_cov_pressure_pressure,
  curv_pressure_covariance = sim_cov_curv_pressure
) %>%
  pivot_longer(cols = everything(), names_to = "feature", values_to = "value") %>%
  mutate(dataset = "simulated_data") %>%
  mutate_if(is.character, as_factor)

sim_data_param_tbl %>% kable(align = "c")
feature | value | dataset |
---|---|---|
ellip_meanlog | 0.1932042 | simulated_data |
ellip_sdlog | 0.0630117 | simulated_data |
curv_meanlog | 1.1626798 | simulated_data |
curv_sdlog | 0.3092643 | simulated_data |
pressure_meanlog | 4.7878497 | simulated_data |
pressure_sdlog | 0.1900026 | simulated_data |
ellip_curv_correlation | 0.2505145 | simulated_data |
ellip_pressure_correlation | 0.3644292 | simulated_data |
curv_pressure_correlation | 0.1956149 | simulated_data |
ellip_curv_covariance | 0.0203344 | simulated_data |
ellip_ellip_covariance | 0.0058779 | simulated_data |
curv_curv_covariance | 1.1209300 | simulated_data |
ellip_pressure_covariance | 0.6534943 | simulated_data |
pressure_pressure_covariance | 547.0647415 | simulated_data |
curv_pressure_covariance | 4.8440727 | simulated_data |
Compare Original Data to Simulated Data
A bit more wrangling lets us compare the features of the original dataset to those of the new, simulated population to see if they agree.
compare_tbl <- bind_rows(original_data_param_tbl, sim_data_param_tbl) %>%
  pivot_wider(id_cols = everything(), names_from = dataset)

compare_tbl %>% kable(align = "c", digits = 3)
feature | original_data | simulated_data |
---|---|---|
ellip_meanlog | 0.193 | 0.193 |
ellip_sdlog | 0.064 | 0.063 |
curv_meanlog | 1.158 | 1.163 |
curv_sdlog | 0.309 | 0.309 |
pressure_meanlog | 4.783 | 4.788 |
pressure_sdlog | 0.191 | 0.190 |
ellip_curv_correlation | 0.268 | 0.251 |
ellip_pressure_correlation | 0.369 | 0.364 |
curv_pressure_correlation | 0.213 | 0.196 |
ellip_ellip_covariance | 0.006 | 0.006 |
ellip_curv_covariance | 0.022 | 0.020 |
curv_curv_covariance | 1.157 | 1.121 |
ellip_pressure_covariance | 0.659 | 0.653 |
pressure_pressure_covariance | 530.683 | 547.065 |
curv_pressure_covariance | 5.285 | 4.844 |
Everything appears to align well, but this sure took a while. Wouldn’t it be nice if there was a faster way than to manually fit and extract values from mvrnorm() ? Fortunately, we’re in the R ecosystem, where somebody smart has usually tackled the problem and provided the tools to the community. The tool that I had success with is the AnySim package.3
Here’s how to perform the simulation in a much more efficient way. Note: this is how I created the original dataset of 300 that we’ve been working with.
Approach 2 - AnySim
The AnySim workflow:
It may look similar in the flowchart, but in practice it's much easier than Approach 1. Here's how it works in only a few lines of code:
set.seed(13)

# Define the target distribution functions (ICDFs) of each random variable
ellip_dist <- "qlnorm"
curv_dist <- "qlnorm"
pressure_dist <- "qlnorm"

# store the 3 ICDFs in a vector
dist_vec <- c(ellip_dist, curv_dist, pressure_dist)

# Define the parameters of the target distribution functions - store them in a list
ellip_params <- list(meanlog = 0.20, sdlog = .067)
curv_params <- list(meanlog = 1.15, sdlog = 0.3)
pressure_params <- list(meanlog = 4.80, sdlog = 0.2)

# this is a weird way to do it but I'm following along with an example from the AnySim vignette :)
params_list <- list(NULL)
params_list[[1]] <- ellip_params
params_list[[2]] <- curv_params
params_list[[3]] <- pressure_params

# Define the target correlation matrix
corr_matrix <- matrix(c(
  1, 0.21, 0.4,
  0.21, 1, .21,
  0.4, 0.21, 1
),
ncol = 3, nrow = 3, byrow = T
)

# Estimate the parameters of the auxiliary Gaussian model
aux_gaussion_param_tbl <- EstCorrRVs(
  R = corr_matrix, dist = dist_vec, params = params_list,
  NatafIntMethod = "GH", NoEval = 9, polydeg = 8
)

# Generate 10000 synthetic realizations of the 3 correlated RVs
correlated_ln_draws_tbl <- as_tibble(SimCorrRVs(n = 10000, paramsRVs = aux_gaussion_param_tbl)) %>%
  rename(
    ellip = V1,
    curv = V2,
    pressure = V3
  )

correlated_ln_draws_tbl %>%
  head(10) %>%
  kable(align = "c")
ellip | curv | pressure |
---|---|---|
1.267618 | 2.158739 | 133.3040 |
1.198681 | 4.437176 | 127.1982 |
1.375663 | 2.649938 | 160.2052 |
1.236829 | 3.621202 | 141.5427 |
1.318572 | 2.541340 | 129.3427 |
1.255885 | 3.236722 | 146.9361 |
1.326279 | 3.387357 | 167.8579 |
1.240926 | 4.254830 | 145.7207 |
1.191865 | 1.853877 | 87.7240 |
1.315273 | 4.144667 | 90.5485 |
Let’s see if we were able to produce the desired relationships between the variables:
# Fit synthetic data
ellip_ln_fit <- tidy(fitdistr(correlated_ln_draws_tbl$ellip, "log-normal")) %>%
  mutate(
    var = "ellipticity",
    dataset = "sim_draws"
  )

curv_ln_fit <- tidy(fitdistr(correlated_ln_draws_tbl$curv, "log-normal")) %>%
  mutate(
    var = "curvature",
    dataset = "sim_draws"
  )

pressure_ln_fit <- tidy(fitdistr(correlated_ln_draws_tbl$pressure, "log-normal")) %>%
  mutate(
    var = "pressure",
    dataset = "sim_draws"
  )

recovered_params_tbl <- bind_rows(ellip_ln_fit, curv_ln_fit, pressure_ln_fit)

# Show target values
target_tbl <- tibble(
  term = rep(c("meanlog", "sdlog"), 3),
  estimate = c(.20, .067, 1.15, .3, 4.80, .20),
  var = c("ellipticity", "ellipticity", "curvature", "curvature", "pressure", "pressure"),
  dataset = "target values"
)

# Compare simulated values to target values
bind_rows(recovered_params_tbl, target_tbl) %>%
  select(-std.error) %>%
  pivot_wider(id_cols = everything(), names_from = dataset, values_from = estimate) %>%
  mutate_if(is.numeric, round, 2) %>%
  kable(align = rep("c"))
term | var | sim_draws | target values |
---|---|---|---|
meanlog | ellipticity | 0.20 | 0.20 |
sdlog | ellipticity | 0.07 | 0.07 |
meanlog | curvature | 1.15 | 1.15 |
sdlog | curvature | 0.30 | 0.30 |
meanlog | pressure | 4.80 | 4.80 |
sdlog | pressure | 0.20 | 0.20 |
Excellent agreement. Now check to see if the correlations were preserved:
# Check correlations
correlated_ln_draws_tbl %>%
  ggcorr(
    high = "#20a486ff",
    low = "#fde725ff",
    label = TRUE,
    hjust = .75,
    size = 3,
    label_size = 3,
    label_round = 3,
    nbreaks = 3
  ) +
  labs(
    title = "Correlation Matrix",
    subtitle = "Pearson Method Using Pairwise Observations"
  )
Again, perfect agreement to 2 decimal places. Thank you AnySim!
Now that I’ve shown 2 ways to preserve the correlation structure in a simulation, we can return to the original question: what are the worst-case (or most common) values of this (simulated) population?4
Using the Simulated Population to Define Desired Test Conditions and Groups of Interest
The simulated population is useful because its density properties provide a means to determine how extreme any values of interest are. The trick is that we have to define the boundary of interest based on the question we are trying to answer. Two common questions we have pertain to finding the most common set of patients, or the most extreme patients relative to some point or region of interest. I discuss both below.
Identify a percentage of worst-case patients relative to some value of interest
Consider a 2d test case where we are interested in the joint distribution of 2 variables: curvature and pressure. These would be appropriate for something like a migration test, where the pressure tends to pull the implant out of place and a curved configuration is worse than a straight one (like a pipe elbow).
Suppose we know that the worst-case, physiologically relevant conditions occur at a radius of curvature of 1 mm and a pressure of 300 mm Hg (extremely high). How could we identify the worst-case 5% of patients using our simulated population data from above? In this case we don’t need an algorithm, just a bit of geometry.
We can leverage the standard geometric formula for distance between two points:
\[
d(p,q) = \sqrt{(p_{1}-q_{1})^{2} + (p_{2}-q_{2})^{2}}
\]
Here’s how to calculate this number for each value in the joint distribution for curvature and pressure:
# specify theoretical worst-case point in space
theoretical_worst_curv <- 1
theoretical_worst_pressure <- 300

# calculate each point's distance from the theoretical worst-case
d_data <- log_norm %>%
  rowwise() %>%
  mutate(d = ((theoretical_worst_curv - curv_sim)^2 + (theoretical_worst_pressure - pressure_sim)^2)^.5) %>%
  arrange(desc(d)) %>%
  ungroup()

# make ecdf to map points to percentiles of empirical distribution
d_ECDF_fcn <- ecdf(d_data$d)

# map the ecdf over all the calculated distances to convert them to percentiles
d_pct_data <- d_data %>%
  mutate(d_pct = map_dbl(d, d_ECDF_fcn)) %>%
  mutate(in_out = case_when(
    d_pct <= .05 ~ "5% Worst-Case Points",
    TRUE ~ "95% Less Severe Population"
  )) %>%
  mutate(in_out = as_factor(in_out))

wc_tbl <- tibble(x = 1, y = 300)
Now visualize.
# visualize
d_plt <- d_pct_data %>%
  ggplot(aes(x = curv_sim, y = pressure_sim, color = in_out)) +
  geom_point(alpha = .5, size = 1) +
  geom_point(aes(x = 1, y = 300), color = "firebrick") +
  geom_label_repel(
    data = wc_tbl,
    aes(x, y),
    label = "Worst-Case Combination of \n vessel curvature and pressure",
    fill = "firebrick",
    color = "white",
    segment.color = "black",
    segment.size = 1,
    # min.segment.length = unit(1, "lines"),
    nudge_y = 50,
    nudge_x = 2
  ) +
  labs(
    title = "Joint Distribution of Curvature and Pressure",
    subtitle = "5% of points nearest to worst-case are identified",
    x = "Radius of Curvature (mm)",
    y = "Blood Pressure (mm Hg)"
  ) +
  theme_classic() +
  ylim(c(0, 300)) +
  theme(
    legend.position = "bottom",
    legend.title = element_blank()
  ) +
  scale_color_viridis_d(option = "D", end = .7)

d_plt
It’s a bit of an illusion since the axes aren’t scaled 1:1. Here’s a look at a scaled version showing the teal points closest to the point of interest.
Neat! We just split the data into the most severe 5% and least severe 95% based on a known point we want to avoid. You could imagine a similar calculation if we had a line that defined a worst-case boundary like we use for the failure threshold in a Goodman Diagram.5
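For completeness, here is a hedged sketch of what that line-based version might look like. The boundary is hypothetical (pressure = 350 - 25 * curv is made up for illustration, and the distance mixes units just like the point-based version); only the distance formula changes, from point-to-point to point-to-line.

# hypothetical worst-case boundary for illustration:
# pressure = 350 - 25 * curv, i.e. 25 * curv + 1 * pressure - 350 = 0 (ax + by + c = 0 form)
line_a <- 25
line_b <- 1
line_c <- -350

d_line_data <- log_norm %>%
  mutate(d_line = abs(line_a * curv_sim + line_b * pressure_sim + line_c) /
           sqrt(line_a^2 + line_b^2)) %>%
  arrange(d_line)  # smallest distance = closest to the hypothetical boundary

# the nearest 5% to the line could then be flagged with the same ecdf() trick used above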
In other cases we just want to know the most common values (values in the region of highest probability) based on the multi-dimensional density calculation. In other words, we want to identify contours based on how close the points are to each other, not some arbitrary point in space or line. A kernel density estimate is going to be our tool of choice in all but the most basic cases. The KDE is non-parametric and can accommodate very complex joint distributions and density shapes.6
Typically, the default contours on a density plot show an output called “level” but they can be converted to denote upper percentages of highest density regions. This is the method I show below.7
Identify probability contours using ks package and kernel density estimation
This first chunk converts the data and generates the KDE. The bandwidth parameter controls the “smoothness” or granularity of the estimate and can be hard to specify in multiple dimensions. Hscv() provides a method for determining a reasonable bandwidth through cross-validation; see the documentation in the footnotes for more information if interested.
# convert simulated data tibble to matrix
d3m <- correlated_ln_draws_tbl %>% as.matrix()

# cross-validated bandwidth for kde (takes a while to calculate)
# hscv1 <- Hscv(correlated_ln_draws_tbl)
# hscv1 %>% write_rds(here::here("hscv1.rds"))
hscv1 <- read_rds(here::here("hscv1.rds"))

# generate kernel density estimate from simulated population
kd_d3m <- ks::kde(d3m, H = hscv1, compute.cont = TRUE)
Density percentiles from kde estimate
The kde object provides density percentiles that can be used to generate the contours that define the density regions in multiple dimensions.
# see the kde's calculated density thresholds for specified proportions
cont_vals_tbl <- tidy(kd_d3m$cont) %>%
  mutate(n_row = row_number()) %>%
  mutate(probs = 100 - n_row) %>%
  select(probs, x)

reference_grid_probs_tbl <- cont_vals_tbl %>%
  rename(estimate = x)

reference_grid_probs_tbl %>%
  head(10) %>%
  kable(align = rep("c"))
probs | estimate |
---|---|
99 | 0.0332949 |
98 | 0.0321061 |
97 | 0.0310773 |
96 | 0.0303446 |
95 | 0.0295937 |
94 | 0.0288550 |
93 | 0.0282270 |
92 | 0.0276995 |
91 | 0.0271418 |
90 | 0.0266409 |
reference_grid_probs_tbl %>% tail(10) %>% kable(align = rep("c"))
probs | estimate |
---|---|
10 | 0.0018141 |
9 | 0.0016195 |
8 | 0.0014430 |
7 | 0.0012654 |
6 | 0.0010786 |
5 | 0.0008795 |
4 | 0.0007005 |
3 | 0.0005288 |
2 | 0.0003976 |
1 | 0.0002657 |
KDE estimates in the range of the variables
By default the KDE provides density estimates for a grid of points that covers the space of the variables.
kd_grid_estimates <- kd_d3m
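To see what that grid looks like (a quick sketch for orientation; the exact grid size is whatever ks chose by default):

# peek at the kde grid structure
str(kd_grid_estimates$eval.points, max.level = 1)  # list of grid coordinates, one vector per variable
dim(kd_grid_estimates$estimate)                    # 3-d array of density values; here roughly 31 x 31 x 31,
                                                   # matching the ~29k rows tidied below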
If we want to know the value at each point in the simulated population we use the eval.points argument.
mc_estimates <- ks::kde(
  x = d3m,
  H = hscv1,
  compute.cont = TRUE,
  eval.points = correlated_ln_draws_tbl %>% as.matrix()
)
Here are a couple different ways to convert the kde object features into a tibble:
mc_est_tbl_10000 <- tibble(estimate = mc_estimates$estimate) %>%
  bind_cols(correlated_ln_draws_tbl)

kd_grid_est_tbl_29k <- broom:::tidy.kde(kd_grid_estimates) %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  rename(ellip = x1, curv = x2, pressure = x3) %>%
  select(-obs)

mc_est_tbl_10000 %>%
  head(10) %>%
  kable(align = "c")
estimate | ellip | curv | pressure |
---|---|---|---|
0.0140082 | 1.267618 | 2.158739 | 133.3040 |
0.0118702 | 1.198681 | 4.437176 | 127.1982 |
0.0026530 | 1.375663 | 2.649938 | 160.2052 |
0.0194232 | 1.236829 | 3.621202 | 141.5427 |
0.0122134 | 1.318572 | 2.541340 | 129.3427 |
0.0167723 | 1.255885 | 3.236722 | 146.9361 |
0.0055484 | 1.326279 | 3.387357 | 167.8579 |
0.0112270 | 1.240926 | 4.254830 | 145.7207 |
0.0082627 | 1.191865 | 1.853877 | 87.7240 |
0.0019651 | 1.315273 | 4.144667 | 90.5485 |
kd_grid_est_tbl_29k %>% head(10) %>% kable(align = "c")
estimate | ellip | curv | pressure |
---|---|---|---|
0 | 0.8880910 | 0.1331512 | 31.02793 |
0 | 0.9131427 | 0.1331512 | 31.02793 |
0 | 0.9381944 | 0.1331512 | 31.02793 |
0 | 0.9632461 | 0.1331512 | 31.02793 |
0 | 0.9882978 | 0.1331512 | 31.02793 |
0 | 1.0133495 | 0.1331512 | 31.02793 |
0 | 1.0384012 | 0.1331512 | 31.02793 |
0 | 1.0634528 | 0.1331512 | 31.02793 |
0 | 1.0885045 | 0.1331512 | 31.02793 |
0 | 1.1135562 | 0.1331512 | 31.02793 |
Identify the 5% threshold value:
# 5% contour line from kd grid based on 10k MC data
percentile_5 <- kd_d3m[["cont"]]["5%"]
Verify that 5% of the values (500/10,000) fall below the threshold:
mc_est_tbl_10000 %>% filter(estimate <= percentile_5)
## # A tibble: 500 x 4
##     estimate ellip  curv pressure
##        <dbl> <dbl> <dbl>    <dbl>
##  1 0.000243   1.19  6.67    166.
##  2 0.000371   1.37  2.29     81.7
##  3 0.000356   1.22  2.73    200.
##  4 0.000134   1.27  7.91    114.
##  5 0.000560   1.47  3.89    163.
##  6 0.0000732  1.47  4.67    282.
##  7 0.000527   1.29  6.13    185.
##  8 0.000254   1.48  2.44    136.
##  9 0.000644   1.19  6.63    135.
## 10 0.000856   1.43  4.98    156.
## # ... with 490 more rows
If we wanted to know the nearest probability contour line for every point we could make a function to do so.
get_probs_fcn <- function(value) {
  t <- reference_grid_probs_tbl %>%
    mutate(value = value) %>%
    mutate(dif = abs(estimate - value)) %>%
    filter(dif == min(dif))

  t[[1, 1]]
}
Map the function over each value in the dataset.
# mc_1_to_99_tbl <- mc_est_tbl_10000 %>%
#   mutate(nearest_prob = map_dbl(estimate, get_probs_fcn))
# mc_1_to_99_tbl %>% write_rds(here::here("mc_1_to_99_tbl.rds"))

mc_1_to_99_tbl <- read_rds(here::here("mc_1_to_99_tbl.rds"))

mc_1_to_99_tbl
## # A tibble: 10,000 x 5
##    estimate ellip  curv pressure nearest_prob
##       <dbl> <dbl> <dbl>    <dbl>        <dbl>
##  1  0.0140   1.27  2.16    133.            59
##  2  0.0119   1.20  4.44    127.            52
##  3  0.00265  1.38  2.65    160.            14
##  4  0.0194   1.24  3.62    142.            75
##  5  0.0122   1.32  2.54    129.            53
##  6  0.0168   1.26  3.24    147.            68
##  7  0.00555  1.33  3.39    168.            28
##  8  0.0112   1.24  4.25    146.            50
##  9  0.00826  1.19  1.85     87.7           39
## 10  0.00197  1.32  4.14     90.5           11
## # ... with 9,990 more rows
Let’s see what this actually did by comparing the probability contours estimated from the function above vs. the reference contours produced during generation of the kde object:
n <- mc_1_to_99_tbl %>%
  select(nearest_prob, estimate) %>%
  mutate(set = as_factor("manual_fit_10k"))

p <- cont_vals_tbl %>%
  mutate(set = as_factor("kde_output")) %>%
  rename(estimate = x, nearest_prob = probs)

bind_rows(n, p) %>%
  filter(estimate < .005 & estimate > 0) %>%
  ggplot(aes(x = estimate, y = nearest_prob)) +
  geom_point(aes(color = set)) +
  geom_vline(xintercept = percentile_5)
So it appropriately rounded each value in the simulated population to the nearest whole percentile, as desired.
Density Plot with Probability Contours in 2d
Now the fun part - visualization. Here's how I'd go about creating 2d density plots. If you don't care about specific probability contours, you can use the built-in ggplot method. All you need is the raw data points, not the kde object or its estimates; ggplot handles the density estimation for you.
Here’s ellipticity vs. curvature:
cp_plt <- correlated_ln_draws_tbl %>%
  ggplot(aes(x = ellip, y = curv)) +
  geom_point(alpha = .3, size = .5) +
  geom_density2d(size = 1.3) +
  theme_classic() +
  xlim(c(.9, 1.6)) +
  ylim(c(1, 7.5)) +
  labs(
    title = "Joint Distribution of Vessel Ellipticity and Curvature",
    subtitle = "Density Contours at Default Settings",
    x = "Ellipticity (unitless)",
    y = "Radius of Curvature (mm)"
  )

cp_plt
This is pretty good, but for specifying particular probability contours we'll need a bit more. Here's a solution that I adapted from one I found on Stack Overflow.
First, select the 2 variables of interest.
d <- correlated_ln_draws_tbl %>%
  select(ellip, curv)

## density function
kd <- ks::kde(d, compute.cont = TRUE, h = 0.05)
Now a function to extract the points of the contour line from the kde object:
get_contour <- function(kd_out = kd, prob = "5%") {
  contour_95 <- with(kd_out, contourLines(
    x = eval.points[[1]],
    y = eval.points[[2]],
    z = estimate,
    levels = cont[prob]
  )[[1]])

  as_tibble(contour_95) %>%
    mutate(prob = prob)
}
Map it over the kd object.
dat_out <- map_dfr(c("5%", "20%", "40%", "60%", "80%", "95%"), ~ get_contour(kd, .)) %>%
  group_by(prob) %>%
  mutate(n_val = 1:n()) %>%
  ungroup()

dat_out %>%
  head(10) %>%
  kable(align = "c")
level | x | y | prob | n_val |
---|---|---|---|---|
0.1050379 | 1.024919 | 2.053556 | 5% | 1 |
0.1050379 | 1.023034 | 2.112579 | 5% | 2 |
0.1050379 | 1.022015 | 2.176684 | 5% | 3 |
0.1050379 | 1.021665 | 2.240789 | 5% | 4 |
0.1050379 | 1.021762 | 2.304894 | 5% | 5 |
0.1050379 | 1.022205 | 2.368999 | 5% | 6 |
0.1050379 | 1.022956 | 2.433104 | 5% | 7 |
0.1050379 | 1.024009 | 2.497208 | 5% | 8 |
0.1050379 | 1.024919 | 2.540507 | 5% | 9 |
0.1050379 | 1.025311 | 2.561313 | 5% | 10 |
Next, clean up the kde output into a grid of x, y, and density (z) values for plotting:
kd_df <- expand_grid(x = kd$eval.points[[1]], y = kd$eval.points[[2]]) %>%
  mutate(z = c(kd$estimate %>% t()))
Now visualize again, this time with probability contours at specified values and the 5% curve labeled with geom_label_repel().
label_tbl <- dat_out %>%
  filter(
    prob == "5%",
    n_val == 100
  )

# visualize
ellip_curv_2plt <- ggplot(data = kd_df, aes(x, y)) +
  geom_tile(aes(fill = z)) +
  geom_point(data = d, aes(x = ellip, y = curv), alpha = .4, size = .4, colour = "white") +
  geom_path(aes(x, y, group = prob),
    data = dat_out %>% filter(prob %in% c("5%", "20%", "40%", "60%", "80%", "95%")),
    colour = "white", size = 1.2, alpha = .8
  ) +
  # geom_text(aes(label = prob),
  #   data = filter(dat_out, (prob %in% c("5%") & n_val == 1)),
  #   # | (prob %in% c("90%") & n_val == 20)),
  #   colour = "yellow", size = 5) +
  geom_label_repel(
    data = label_tbl,
    aes(x, y),
    label = label_tbl$prob[1],
    fill = "yellow",
    color = "black",
    segment.color = "yellow",
    # segment.size = 1,
    min.segment.length = unit(1, "lines"),
    nudge_y = .5,
    nudge_x = -.025
  ) +
  xlim(c(.95, 1.5)) +
  ylim(c(0, 7.5)) +
  labs(
    title = "Joint Distribution [Ellipticity and Radius of Curvature]",
    subtitle = "Simulated Data",
    caption = "Density Contours shown at 5%, 20%, 40%, 60%, 80%, 95%",
    x = "Ellipticity (unitless)",
    y = "Radius of Curvature (mm)"
  ) +
  scale_fill_viridis_c(end = .9) +
  theme_bw() +
  theme(legend.position = "none")

ggExtra::ggMarginal(ellip_curv_2plt, type = "density", fill = "#403891ff", alpha = .7)
Nice! Now we have a clear delineation of the most common patients and most extreme patients and we can group them as desired. Almost done - the last thing to do is to see how to extend into 3d.
Density Plot with Probability Contours in 3d
Honestly, this part is pretty easy thanks to the built-in plot.kde method. Just use the cont argument to specify which probability contours you want.
plot(x = kd_d3m, cont = c(45, 70, 95), drawpoints = FALSE, col.pt = 1)
Add points using the points3d function. In this case I add 2 sets, 1 for the 5% most extreme and 1 for the 95% most common.
# plot(x = kd_d3m, cont = c(95), drawpoints = FALSE, col.pt = 1)

mc_lowest_5_tbl <- mc_1_to_99_tbl %>%
  filter(estimate < percentile_5)

mc_6_to_100_tbl <- mc_1_to_99_tbl %>%
  filter(estimate >= percentile_5)

# points3d(x = mc_lowest_5_tbl$ellip, y = mc_lowest_5_tbl$curv, z = mc_lowest_5_tbl$pressure, color = "dodgerblue", size = 3, alpha = 1)
# points3d(x = mc_6_to_100_tbl$ellip, y = mc_6_to_100_tbl$curv, z = mc_6_to_100_tbl$pressure, color = "black", size = 3, alpha = 1)
And there we have it! A 3d joint distribution with 95% probability density contour and groups separated by color and confirmed to cover 95/5% of the population.
If you’ve made it this far, I thank you. Here is a brief summary of what I tried to capture in this post:
- Manual method to simulate a population from a sample with known correlation structure using mvrnorm()
- Simplified / Efficient way to do the same using AnySim package
- Method for identifying worst-case 5% of patients when there is a known target point to stay away from
- Method for identifying the most common 95% of patients with respect to their density
- Simple way to plot 2d density with ggplot, showing “levels”
- Generate 2d density plots with specified probability density contours and labels
- Method for creating 3d plots with specified probability density contours and in/out groups
Hamdan et al., Journal of the American College of Cardiology, Volume 59, Issue 2, 2012, Pages 119-127↩︎
The units of measure aren’t really important for the analysis but just for calibration the ellipticity is generally unitless (it is a ratio of the long axis over the short axis); curvature is usually in mm and represents the radius of curvature on the inner curve of the vessel; pressure is usually in mm Hg and can be assumed to be the systolic blood pressure.↩︎
see Water 2020, 12, 1645; doi:10.3390/w12061645↩︎
simulated population conditional on the model - no sampling error is considered here. For population inference that takes into account sampling error, see tolerance intervals and the tolerance package in R↩︎
Adapted from this Stack Overflow response: https://stackoverflow.com/questions/23437000/how-to-plot-a-contour-line-showing-where-95-of-values-fall-within-in-r-and-in↩︎