Primary Data Heterogeneity

[This article was first published on Blog –, and kindly contributed to R-bloggers].

It seems like every week we see another headline highlighting the promise of data to improve healthcare, from convolutional neural networks beating cardiologists at detecting cardiac arrhythmia to advances in computer vision feeding speculation that radiologists will all soon be out of work. Given these developments, and the fact that machine learning now touches much of our day-to-day lives, you may wonder why all discussions with physicians aren’t informed by data-driven predictions about the outcomes of care decisions. For example, suppose you’ve injured your knee skiing and are deciding whether to undergo surgical repair or physical therapy. Wouldn’t it be great if your doctor could run a model and tell you, for each treatment, the probability of success and the estimated time until you’re skiing again? So why aren’t we there? If machines can now drive cars, why isn’t it yet common to get customized real-time predictions in healthcare?

This question was on my mind for much of the 2017 Healthcare Analytics Summit, and I think the answer is primarily data heterogeneity. Consider the difference between data collected from self-driving cars and data collected to improve healthcare outcomes. A self-driving car collects various types of data from its environment, but the developers of the driving algorithm know exactly what data is coming in from each sensor and what format it will be in. Tesla collects an incredible amount of data from each of its cars, but that data is consistent and arrives in a format that Tesla engineers designed. In contrast, healthcare data is wildly disparate and non-standardized.

To make predictions about knee surgery outcomes, for example, an algorithm would ideally learn from the patients’ health history, clinical history, genetics, socioeconomic status, physical activity, diet, and more, and none of that is standardized. To make a prediction for one patient, the algorithm would need to observe relationships in these data sources for many other patients; however, those patients all have different data histories: they have been seen by different providers, had different insurers, used different wearables, etc. To train the algorithm, all those disparate data sources need to be standardized and integrated. The algorithm needs to know that where one insurer codes patient physical activity as minutes per day in a column called “physActMinPerDay” in a table called “lifestyleFactors,” another codes the same information on a 1-5 scale from inactive to very active in a column called “activity_level” in a table called “individual_attributes.” Before a model can use that information, those locations need to be identified and the 1-5 coding transformed into a minutes-per-day approximation. And this needs to happen for every variable in every table across all the different producers of healthcare encounter data, insurance data, wearable data, outcomes data, socioeconomic data, genetic data, and on and on and on.
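To make the physical-activity example concrete, here is a minimal Python sketch of what harmonizing those two insurers might look like. The table and column names come from the example above; everything else is an assumption made for illustration, including the source labels `insurer_a` and `insurer_b`, the `patient_id` field, the `harmonize` helper, and especially the mapping from the 1-5 scale to minutes per day, which is invented and not clinically validated.

```python
# Hypothetical mapping from a 1-5 activity level to approximate minutes per day.
# These numbers are illustrative assumptions, not clinical guidance.
LEVEL_TO_MINUTES = {1: 0, 2: 15, 3: 30, 4: 60, 5: 90}

# For each (hypothetical) source: where physical activity lives and how to
# convert the raw value into minutes per day.
SOURCE_SPECS = {
    "insurer_a": {
        "table": "lifestyleFactors",
        "column": "physActMinPerDay",
        "to_minutes": float,  # already in minutes per day
    },
    "insurer_b": {
        "table": "individual_attributes",
        "column": "activity_level",
        "to_minutes": lambda level: float(LEVEL_TO_MINUTES[int(level)]),
    },
}


def harmonize(source, tables):
    """Return rows from one source in a shared schema (minutes per day)."""
    spec = SOURCE_SPECS[source]
    harmonized = []
    for row in tables[spec["table"]]:
        raw = row.get(spec["column"])
        # Keep missing values missing rather than guessing.
        minutes = None if raw is None else spec["to_minutes"](raw)
        harmonized.append(
            {
                "patient_id": row["patient_id"],
                "source": source,
                "activity_min_per_day": minutes,
            }
        )
    return harmonized


# Made-up example records for the two sources.
tables_a = {
    "lifestyleFactors": [
        {"patient_id": "a1", "physActMinPerDay": 45},
        {"patient_id": "a2", "physActMinPerDay": None},
    ]
}
tables_b = {
    "individual_attributes": [
        {"patient_id": "b1", "activity_level": 4},
        {"patient_id": "b2", "activity_level": 1},
    ]
}

combined = harmonize("insurer_a", tables_a) + harmonize("insurer_b", tables_b)
for record in combined:
    print(record)
```

Even in this toy version, every new source needs its own entry describing where the variable lives and how to convert it, and the conversion itself loses information. Multiply that by every variable and every data producer and the scale of the problem becomes clear.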

Is it an impossible task? I don’t think so. In an excellent keynote at the Healthcare Analytics Summit, Dale Sanders described how the Health Catalyst Data Operating System (DOS) addresses these challenges. DOS is a data architecture that uses reusable logic to integrate various types of data from a wide variety of sources. This makes it possible to develop algorithms that learn from all of those different data sources across millions of patients, which can yield substantial predictive power. It also makes it possible to put those algorithms in the hands of physicians, to make real-time predictions on demand, and to allow fully informed patients to take charge of their care.

[Figure: schematic of the Health Catalyst Data Operating System (DOS)]

To learn more about DOS, check out this webinar. If you have questions about this post or any machine learning topic, feel free to reach out to us directly or join our Slack community.
