Can we rely on synthetic data to overcome data governance issue in healthcare?

[This article was first published on R tips – NHS-R Community, and kindly contributed to R-bloggers.]

I recently came across the term synthetic data and started wondering what it means. I found that it is different from dummy data, but in what ways, and how?

I became curious to find out more, as nowadays it is difficult to get hold of healthcare data (e.g., from the NHS). The most prominent issues seem to relate to data governance and access, because this information is sensitive personal data.

I investigated methods for creating ‘synthetic data’ as a tool that might help to develop better prediction models, since the data could be made available to a much larger pool of people, who could then tackle these data governance and other challenging healthcare issues.

What is Synthetic data?

The goal is to generate a data set that contains no real units, is therefore safe for public release, and retains the structure of the original data.

In other words, synthetic data contains the characteristics of the original data minus the sensitive content.

Synthetic data is generally made to validate mathematical models: the behaviour of the real data is compared against that of the data generated by the model.

How do we generate synthetic data?

The principle is to observe the statistical distributions of the original, real-world data and to reproduce fake data by drawing random numbers from them.

Consider a data set with p variables. In a nutshell, synthesis follows these steps:

  1. Take a simple random sample of x1,obs and set as x1,syn
  2. Fit model f(x2,obs|x1,obs) and draw x2,syn from f(x2,syn|x1,syn)
  3. Fit model f(x3,obs|x1,obs , x2,obs ) and draw x3,syn  from f(x3,syn|x1,syn , x2,syn )
  4. And so on, until f(xp,syn|x1,syn , x2,syn , … , xp-1,syn)

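In short, statistical models are fitted to the original data and used to generate completely new records for public release.
The joint distribution f(x1, x2, x3, …, xp) is approximated by the marginal distribution f(x1) together with a sequence of conditional distributions f(x2|x1), f(x3|x1, x2), …, f(xp|x1, …, xp-1).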
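The sequential steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the toy data, the variable names and the helper draw_conditional() are made up for this example, and each conditional model is a plain linear regression with normal noise. The synthpop package used later in this post fits classification and regression trees (CART) by default for every variable after the first, so this is not what syn() does internally.

```r
# Toy "observed" data; the variables and their relationships are invented.
set.seed(4321)
n <- 500
obs <- data.frame(x1 = rnorm(n, mean = 70, sd = 12))
obs$x2 <- 150 - 0.3 * obs$x1 + rnorm(n, sd = 15)
obs$x3 <- 40 + 0.2 * obs$x1 + 0.25 * obs$x2 + rnorm(n, sd = 8)

# Step 1: simple random sample (with replacement) of the first variable.
syn_toy <- data.frame(x1 = sample(obs$x1, size = n, replace = TRUE))

# Hypothetical helper: fit f(target | predictors) on the observed data,
# then draw the target for the synthetic rows from the fitted model.
draw_conditional <- function(target, predictors, obs, syn) {
  fit <- lm(reformulate(predictors, response = target), data = obs)
  predict(fit, newdata = syn) + rnorm(nrow(syn), mean = 0, sd = sigma(fit))
}

# Step 2: x2 given x1. Step 3: x3 given x1 and x2.
syn_toy$x2 <- draw_conditional("x2", "x1", obs, syn_toy)
syn_toy$x3 <- draw_conditional("x3", c("x1", "x2"), obs, syn_toy)

# Compare column means and the difference in correlation matrices.
round(rbind(observed = colMeans(obs), synthetic = colMeans(syn_toy)), 1)
round(cor(obs) - cor(syn_toy), 2)
```

Each synthetic variable is drawn conditional on the synthetic values already generated, not on the observed ones, which is what lets the chain of conditional models stand in for the joint distribution.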

For instance, we have the following original (real) data.

We can generate synthetic data using the algorithm described above.

We can compare the distribution of original data with synthetic data as follows:

These charts were created using the shiny app
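A similar comparison can also be made from the R console. The sketch below assumes the synthpop package is installed and uses its compare() function on a small made-up data frame (the name toy and its values are invented here and are not the NEWS data).

```r
library(synthpop)

# Toy stand-in for the observed data (made-up values, not the NEWS data).
set.seed(1)
toy <- data.frame(
  age   = round(rnorm(300, mean = 70, sd = 15)),
  pulse = round(rnorm(300, mean = 85, sd = 20)),
  resp  = round(rnorm(300, mean = 18, sd = 3))
)

# Synthesise with the package defaults; print.flag = FALSE silences the progress log.
toy_syn <- syn(toy, seed = 4321, print.flag = FALSE)

# Frequency tables and histograms of observed versus synthetic values.
compare(toy_syn, toy, vars = c("age", "pulse", "resp"))
```

With the objects from the NEWS example later in this post, the equivalent call would be compare(syn_df, df).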

National early warning score (NEWS) example in R:

### Load the observed NEWS data
setwd("C:/Users/mfaisal1/OneDrive - University of Bradford/projects/NHS_R/NHSR_synpop/")
# Drop the first column of the CSV (assumed here to be a row index)
df <- read.csv("observed_news_data.csv", header = TRUE)[, -1]

#### original data (first 10 rows)
df[1:10, ]

##    male age NEWS syst dias temp pulse resp sat sup alert died
## 1     0  68    3  150   98 36.8    78   26  96   0     0    0
## 2     1  94    1  145   67 35.0    62   18  96   0     0    0
## 3     0  85    0  169   69 36.2    54   18  96   0     0    0
## 4     1  44    0  154  106 36.9    80   17  96   0     0    0
## 5     0  77    1  122   67 36.4    62   20  95   0     0    0
## 6     0  58    1  146  106 35.3    73   20  98   0     0    0
## 7     0  25    4   65   42 35.6    72   12  99   0     0    0
## 8     0  69    0  116   56 37.2    90   16  97   0     0    0
## 9     0  91    1  162   72 35.5    60   16  99   0     0    0
## 10    0  70    1  132   96 35.3    67   16  97   0     0    0

summary(df) 

##       male            age              NEWS             syst      
##  Min.   :0.000   Min.   : 17.00   Min.   : 0.000   Min.   : 65.0  
##  1st Qu.:0.000   1st Qu.: 60.00   1st Qu.: 1.000   1st Qu.:118.0  
##  Median :0.000   Median : 74.00   Median : 2.000   Median :134.0  
##  Mean   :0.476   Mean   : 69.65   Mean   : 2.444   Mean   :135.7  
##  3rd Qu.:1.000   3rd Qu.: 84.00   3rd Qu.: 4.000   3rd Qu.:150.0  
##  Max.   :1.000   Max.   :102.00   Max.   :12.000   Max.   :220.0  
##       dias             temp           pulse            resp      
##  Min.   : 17.00   Min.   :33.10   Min.   : 40.0   Min.   :10.00  
##  1st Qu.: 63.00   1st Qu.:35.80   1st Qu.: 70.0   1st Qu.:16.00  
##  Median : 74.00   Median :36.20   Median : 84.0   Median :18.00  
##  Mean   : 74.63   Mean   :36.31   Mean   : 85.8   Mean   :18.39  
##  3rd Qu.: 84.00   3rd Qu.:36.70   3rd Qu.: 98.0   3rd Qu.:20.00  
##  Max.   :124.00   Max.   :40.20   Max.   :200.0   Max.   :43.00  
##       sat              sup            alert            died     
##  Min.   : 82.00   Min.   :0.000   Min.   :0.000   Min.   :0.00  
##  1st Qu.: 95.00   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.00  
##  Median : 97.00   Median :0.000   Median :0.000   Median :0.00  
##  Mean   : 96.38   Mean   :0.123   Mean   :0.071   Mean   :0.07  
##  3rd Qu.: 98.00   3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:0.00  
##  Max.   :100.00   Max.   :1.000   Max.   :3.000   Max.   :1.00

### Generate the synthetic NEWS data using synthpop R package

library(synthpop)
# Fix the seed so that the synthesis is reproducible
syn_df <- syn(df, seed = 4321)

## Synthesis
## -----------
##  male age NEWS syst dias temp pulse resp sat sup
##  alert died

#### synthetic data
syn_df$syn[1:10,]

##    male age NEWS syst dias temp pulse resp sat sup alert died
## 1     1  56    1  126   84 35.7    72   17  98   0     0    0
## 2     1  50    2  115   84 36.8    94   14  97   0     0    0
## 3     0  74    6  143   86 36.5    82   21  93   0     0    0
## 4     1  56    1  122   60 36.3    94   12  98   0     0    0
## 5     1  52    0  153   89 36.2    78   12  96   0     0    0
## 6     0  21    2  164   92 35.5    97   20  99   0     0    0
## 7     0  37    1  101   57 35.6    76   15  98   0     0    0
## 8     1  81    2  125   74 36.6    71   17  97   0     0    0
## 9     1  67    5  182  103 37.1    95   18  94   1     0    0
## 10    1  67    0  160   80 36.2    86   18  98   0     0    0

summary(syn_df$syn) 

##       male           age              NEWS             syst      
##  Min.   :0.00   Min.   : 17.00   Min.   : 0.000   Min.   : 65.0  
##  1st Qu.:0.00   1st Qu.: 60.00   1st Qu.: 1.000   1st Qu.:118.0  
##  Median :0.00   Median : 74.00   Median : 1.000   Median :135.0  
##  Mean   :0.47   Mean   : 69.99   Mean   : 2.414   Mean   :136.2  
##  3rd Qu.:1.00   3rd Qu.: 84.00   3rd Qu.: 4.000   3rd Qu.:150.2  
##  Max.   :1.00   Max.   :102.00   Max.   :11.000   Max.   :219.0  
##       dias            temp           pulse             resp      
##  Min.   : 17.0   Min.   :33.10   Min.   : 43.00   Min.   :12.00  
##  1st Qu.: 63.0   1st Qu.:35.80   1st Qu.: 70.00   1st Qu.:16.00  
##  Median : 74.0   Median :36.20   Median : 83.00   Median :18.00  
##  Mean   : 74.6   Mean   :36.26   Mean   : 85.04   Mean   :18.57  
##  3rd Qu.: 84.0   3rd Qu.:36.70   3rd Qu.: 97.00   3rd Qu.:20.00  
##  Max.   :124.0   Max.   :40.20   Max.   :200.00   Max.   :43.00  
##       sat              sup            alert            died      
##  Min.   : 82.00   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 95.00   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000  
##  Median : 97.00   Median :0.000   Median :0.000   Median :0.000  
##  Mean   : 96.45   Mean   :0.125   Mean   :0.059   Mean   :0.062  
##  3rd Qu.: 98.00   3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :100.00   Max.   :1.000   Max.   :3.000   Max.   :1.000

# Write the synthetic records (not just the file name) to disk
write.csv(syn_df$syn, file = "synthetic_news_data.csv")

This data set is now available in the NHSRDatasets R package. For more discussion, see the synthpop R package documentation.

Summary

In many ways, synthetic data reflects George Box’s observation that “all models are wrong, but some are useful” while providing a “useful approximation [of] those found in the real world.”

A link between the clinical outcomes of a patient visit and its costs rarely exists in practice, so being able to assess these trade-offs in synthetic data allows for measurement and enhancement of the value of care (outcomes divided by cost).

Synthetic data is likely not a 100% accurate depiction of real-world outcomes, like cost and clinical quality, but rather a useful approximation of these variables. Moreover, synthetic data is constantly improving, and methods like validation and calibration will continue to make these data sources more realistic.

Besides protecting the privacy and confidentiality of a data set, synthetic data can be used for testing fraud detection systems by creating realistic behaviour profiles for users and attackers. In machine learning, it can also be used to train and test models. Synthetic data can also help to create a baseline for future testing or studies, such as clinical trials.
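As a rough illustration of training a model on synthetic data and testing it on real records, the sketch below fits the same logistic regression to a made-up "real" data set and to its synthetic copy, then scores the real records with the synthetic-data model. It assumes synthpop is installed. The data frame real, its variables and the relationship between them are invented for this example and are not results from the NEWS data.

```r
library(synthpop)

# Made-up "real" data with a binary outcome that depends on age and a score.
set.seed(2)
n <- 1000
real <- data.frame(age = round(rnorm(n, 70, 15)), news = rpois(n, 2.4))
real$died <- factor(rbinom(n, 1, plogis(-5 + 0.02 * real$age + 0.4 * real$news)))

# Synthetic copy, generated with the package defaults.
synth <- syn(real, seed = 4321, print.flag = FALSE)$syn

# Fit the same model to both data sets and put the coefficients side by side.
fit_real  <- glm(died ~ age + news, family = binomial, data = real)
fit_synth <- glm(died ~ age + news, family = binomial, data = synth)
round(cbind(real = coef(fit_real), synthetic = coef(fit_synth)), 3)

# Score the real records with the model trained on synthetic data only,
# then compare the mean predicted risk between survivors and deaths.
risk <- predict(fit_synth, newdata = real, type = "response")
tapply(risk, real$died, mean)
```

If the synthetic data preserves the relationships in the original, the two sets of coefficients should be broadly similar. How similar is similar enough depends on the intended use.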

Dr Muhammad Faisal and Gary Hutson

The post Can we rely on synthetic data to overcome data governance issue in healthcare? appeared first on NHS-R Community.
