Can we rely on synthetic data to overcome data governance issue in healthcare?
I recently came across the term synthetic data and started wondering what it means. I found that it is different from dummy data, but in what ways, and how?
I became curious to find out more, because healthcare data (e.g., from the NHS) is difficult to get hold of nowadays. The most prominent issues seem to be linked to data governance and access, as this information is sensitive personal data.
I investigated methods for creating ‘synthetic data’ as a tool that might help to develop better prediction models, because the data could be made available to a much larger pool of people, who could then tackle these data governance and other challenging healthcare issues.
What is synthetic data?
The goal is to generate a data set that contains no real units, and is therefore safe for public release, while retaining the structure of the original data.
In other words, synthetic data keeps the statistical characteristics of the original data minus the sensitive content.
Synthetic data is often generated to validate mathematical models: the behaviour of the real data is compared against that of the data generated by the model.
How do we generate synthetic data?
The principle is to observe the statistical distributions in the original (real-world) data and reproduce fake data by drawing random numbers from those distributions.
Consider a data set with p variables. In a nutshell, synthesis follows these steps:
- Take a simple random sample of x1_obs and set it as x1_syn
- Fit model f(x2_obs | x1_obs) and draw x2_syn from f(x2_syn | x1_syn)
- Fit model f(x3_obs | x1_obs, x2_obs) and draw x3_syn from f(x3_syn | x1_syn, x2_syn)
- And so on, until f(xp_syn | x1_syn, x2_syn, …, x(p-1)_syn)
In short, we fit statistical models to the original data and use them to generate completely new records for public release.
The joint distribution f(x1, x2, x3, …, xp) is approximated by a product of conditional distributions: f(x1) · f(x2|x1) · f(x3|x1, x2) · … · f(xp|x1, …, xp-1).
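The steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the "observed" data are simulated toy values (not patient data), each conditional model is a linear regression with normal errors (swapped in here for simplicity; the synthpop package uses its own methods), and `draw_from_lm` is a hypothetical helper written for this example.

```r
# Toy "observed" data, simulated purely for illustration (not real patient data)
set.seed(1)
n <- 500
obs <- data.frame(x1 = rnorm(n, mean = 70, sd = 15))
obs$x2 <- 100 + 0.5 * obs$x1 + rnorm(n, sd = 10)
obs$x3 <- 20 + 0.3 * obs$x1 + 0.4 * obs$x2 + rnorm(n, sd = 5)

# Hypothetical helper: draw a synthetic column from a fitted linear model by
# adding normal noise (residual standard deviation) to the fitted values
draw_from_lm <- function(fit, newdata) {
  predict(fit, newdata = newdata) +
    rnorm(nrow(newdata), mean = 0, sd = sigma(fit))
}

# Step 1: x1_syn is a random sample (with replacement) of x1_obs
syn <- data.frame(x1 = sample(obs$x1, size = n, replace = TRUE))

# Step 2: fit f(x2 | x1) on the observed data, draw x2_syn given x1_syn
fit2 <- lm(x2 ~ x1, data = obs)
syn$x2 <- draw_from_lm(fit2, syn)

# Step 3: fit f(x3 | x1, x2) on the observed data, draw x3_syn given x1_syn, x2_syn
fit3 <- lm(x3 ~ x1 + x2, data = obs)
syn$x3 <- draw_from_lm(fit3, syn)

# The correlation structure of the two data sets should look similar
round(cor(obs), 2)
round(cor(syn), 2)
```

Each synthetic column is drawn conditional on the synthetic columns generated before it, which is how relationships between variables carry over without copying whole records.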
For instance, we have the following original (real) data.
We can generate synthetic data using the algorithm described above.
We can compare the distribution of original data with synthetic data as follows:
These charts were created using the shiny app
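A numeric comparison can complement the charts. The sketch below uses base R only and simulated toy data (the `observed` and `synthetic` data frames and the `compare_marginals` helper are assumptions made for this example). It puts the mean and standard deviation of each variable side by side and adds the two-sample Kolmogorov-Smirnov statistic, where values near 0 suggest similar distributions. The synthpop package also offers its own `compare()` function for this purpose.

```r
# Toy data, simulated purely for illustration (not real patient data)
set.seed(42)
observed  <- data.frame(age = rnorm(1000, 70, 15), pulse = rnorm(1000, 85, 20))
synthetic <- data.frame(age = rnorm(1000, 70, 15), pulse = rnorm(1000, 85, 20))

# Hypothetical helper: one row of summary statistics per shared variable
compare_marginals <- function(obs, syn) {
  vars <- intersect(names(obs), names(syn))
  rows <- lapply(vars, function(v) {
    # Rounded clinical values contain ties, which make ks.test() warn;
    # the statistic itself is still computed
    ks <- suppressWarnings(ks.test(obs[[v]], syn[[v]]))
    data.frame(
      variable = v,
      mean_obs = mean(obs[[v]]),
      mean_syn = mean(syn[[v]]),
      sd_obs   = sd(obs[[v]]),
      sd_syn   = sd(syn[[v]]),
      ks_stat  = unname(ks$statistic)
    )
  })
  do.call(rbind, rows)
}

res <- compare_marginals(observed, synthetic)
print(res)
```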
National early warning score (NEWS) example in R:
### Load the observed NEWS data
setwd("C:/Users/mfaisal1/OneDrive - University of Bradford/projects/NHS_R/NHSR_synpop/")
# Read the CSV and drop its first column
df <- read.csv("observed_news_data.csv",header=TRUE)[,-1]

#### original data
df[1:10,]
##    male age NEWS syst dias temp pulse resp sat sup alert died
## 1     0  68    3  150   98 36.8    78   26  96   0     0    0
## 2     1  94    1  145   67 35.0    62   18  96   0     0    0
## 3     0  85    0  169   69 36.2    54   18  96   0     0    0
## 4     1  44    0  154  106 36.9    80   17  96   0     0    0
## 5     0  77    1  122   67 36.4    62   20  95   0     0    0
## 6     0  58    1  146  106 35.3    73   20  98   0     0    0
## 7     0  25    4   65   42 35.6    72   12  99   0     0    0
## 8     0  69    0  116   56 37.2    90   16  97   0     0    0
## 9     0  91    1  162   72 35.5    60   16  99   0     0    0
## 10    0  70    1  132   96 35.3    67   16  97   0     0    0

summary(df)
##       male             age              NEWS             syst
##  Min.   :0.000   Min.   : 17.00   Min.   : 0.000   Min.   : 65.0
##  1st Qu.:0.000   1st Qu.: 60.00   1st Qu.: 1.000   1st Qu.:118.0
##  Median :0.000   Median : 74.00   Median : 2.000   Median :134.0
##  Mean   :0.476   Mean   : 69.65   Mean   : 2.444   Mean   :135.7
##  3rd Qu.:1.000   3rd Qu.: 84.00   3rd Qu.: 4.000   3rd Qu.:150.0
##  Max.   :1.000   Max.   :102.00   Max.   :12.000   Max.   :220.0
##       dias             temp            pulse            resp
##  Min.   : 17.00   Min.   :33.10   Min.   : 40.0   Min.   :10.00
##  1st Qu.: 63.00   1st Qu.:35.80   1st Qu.: 70.0   1st Qu.:16.00
##  Median : 74.00   Median :36.20   Median : 84.0   Median :18.00
##  Mean   : 74.63   Mean   :36.31   Mean   : 85.8   Mean   :18.39
##  3rd Qu.: 84.00   3rd Qu.:36.70   3rd Qu.: 98.0   3rd Qu.:20.00
##  Max.   :124.00   Max.   :40.20   Max.   :200.0   Max.   :43.00
##       sat              sup             alert            died
##  Min.   : 82.00   Min.   :0.000   Min.   :0.000   Min.   :0.00
##  1st Qu.: 95.00   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.00
##  Median : 97.00   Median :0.000   Median :0.000   Median :0.00
##  Mean   : 96.38   Mean   :0.123   Mean   :0.071   Mean   :0.07
##  3rd Qu.: 98.00   3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:0.00
##  Max.   :100.00   Max.   :1.000   Max.   :3.000   Max.   :1.00

### Generate the synthetic NEWS data using synthpop R package
setwd("C:/Users/mfaisal1/OneDrive - University of Bradford/projects/NHS_R/NHSR_synpop/")
library(synthpop)
# Fix the seed so that the synthesis is reproducible
syn_df <- syn(df,seed=4321)
## Synthesis
## -----------
##  male age NEWS syst dias temp pulse resp sat sup
##  alert died

#### synthetic data
syn_df$syn[1:10,]
##    male age NEWS syst dias temp pulse resp sat sup alert died
## 1     1  56    1  126   84 35.7    72   17  98   0     0    0
## 2     1  50    2  115   84 36.8    94   14  97   0     0    0
## 3     0  74    6  143   86 36.5    82   21  93   0     0    0
## 4     1  56    1  122   60 36.3    94   12  98   0     0    0
## 5     1  52    0  153   89 36.2    78   12  96   0     0    0
## 6     0  21    2  164   92 35.5    97   20  99   0     0    0
## 7     0  37    1  101   57 35.6    76   15  98   0     0    0
## 8     1  81    2  125   74 36.6    71   17  97   0     0    0
## 9     1  67    5  182  103 37.1    95   18  94   1     0    0
## 10    1  67    0  160   80 36.2    86   18  98   0     0    0

summary(syn_df$syn)
##       male            age              NEWS             syst
##  Min.   :0.00   Min.   : 17.00   Min.   : 0.000   Min.   : 65.0
##  1st Qu.:0.00   1st Qu.: 60.00   1st Qu.: 1.000   1st Qu.:118.0
##  Median :0.00   Median : 74.00   Median : 1.000   Median :135.0
##  Mean   :0.47   Mean   : 69.99   Mean   : 2.414   Mean   :136.2
##  3rd Qu.:1.00   3rd Qu.: 84.00   3rd Qu.: 4.000   3rd Qu.:150.2
##  Max.   :1.00   Max.   :102.00   Max.   :11.000   Max.   :219.0
##       dias            temp            pulse             resp
##  Min.   : 17.0   Min.   :33.10   Min.   : 43.00   Min.   :12.00
##  1st Qu.: 63.0   1st Qu.:35.80   1st Qu.: 70.00   1st Qu.:16.00
##  Median : 74.0   Median :36.20   Median : 83.00   Median :18.00
##  Mean   : 74.6   Mean   :36.26   Mean   : 85.04   Mean   :18.57
##  3rd Qu.: 84.0   3rd Qu.:36.70   3rd Qu.: 97.00   3rd Qu.:20.00
##  Max.   :124.0   Max.   :40.20   Max.   :200.00   Max.   :43.00
##       sat              sup             alert            died
##  Min.   : 82.00   Min.   :0.000   Min.   :0.000   Min.   :0.000
##  1st Qu.: 95.00   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000
##  Median : 97.00   Median :0.000   Median :0.000   Median :0.000
##  Mean   : 96.45   Mean   :0.125   Mean   :0.059   Mean   :0.062
##  3rd Qu.: 98.00   3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:0.000
##  Max.   :100.00   Max.   :1.000   Max.   :3.000   Max.   :1.000

### Save the synthetic data frame to a CSV file
write.csv(syn_df$syn, "synthetic_news_data.csv")
This data set is now available in the NHSRDatasets R package. For more discussion about the synthpop R package
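Since the goal stated earlier is a data set that contains no real units, one simple check before release is to count synthetic rows that exactly match an observed row. The base R sketch below uses tiny toy data frames and a hypothetical helper, `count_replicated`, written for this example. It is not a full disclosure-risk assessment; the synthpop package provides `replicated.uniques()` for a more careful version of this idea.

```r
# Hypothetical helper: number of synthetic rows identical to some observed row
count_replicated <- function(obs, syn) {
  # Collapse each row into a single string key so rows can be compared directly
  row_key <- function(d) do.call(paste, c(d, sep = "\r"))
  sum(row_key(syn[names(obs)]) %in% row_key(obs))
}

# Toy values for illustration only
obs <- data.frame(age = c(68, 94, 85), pulse = c(78, 62, 54))
syn <- data.frame(age = c(56, 94, 74), pulse = c(72, 62, 82))

# The second synthetic row (age 94, pulse 62) also appears in obs
n_replicated <- count_replicated(obs, syn)
print(n_replicated)
```

With only a few coarse variables, exact matches can arise by chance, so a non-zero count is a prompt for closer inspection rather than proof of disclosure.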
Summary
In many ways, synthetic data reflects George Box’s observation that “all models are wrong, but some are useful”, while providing a “useful approximation [of] those found in the real world”.
The connection between the clinical outcomes of a patient visit and its costs rarely exists in practice, so being able to assess these trade-offs in synthetic data allows the value of care (outcomes divided by cost) to be measured and enhanced.
Synthetic data is likely not a 100% accurate depiction of real-world outcomes, like cost and clinical quality, but rather a useful approximation of these variables. Moreover, synthetic data is constantly improving, and methods like validation and calibration will continue to make these data sources more realistic.
Besides protecting the privacy and confidentiality of a data set, synthetic data can be used to test fraud detection systems by creating realistic behaviour profiles for users and attackers. In machine learning, it can also be used to train and test models. Synthetic data can also help to create a baseline for future testing or studies, such as clinical trials.
Dr Muhammad Faisal and Gary Hutson
The post Can we rely on synthetic data to overcome data governance issue in healthcare? appeared first on NHS-R Community.