Is it possible to derive useful and interpretable information from the short notes that doctors write during patient visits? In our recently published paper, we propose a methodology based on Natural Language Processing tools that automatically processes these notes and reveals patterns such as groups of similar visits.

This story is a summary of the paper “Interpretable Segmentation of Medical Free-Text Records Based on Word Embeddings” by Adam Gabriel Dobrakowski, Agnieszka Mykowiecka, Małgorzata Marciniak, Wojciech Jaworski, and Przemysław Biecek, published in the Journal of Intelligent Information Systems (2021). https://doi.org/10.1007/s10844-021-00659-4

Medical concepts embeddings

First, we clean and preprocess the texts to identify medical concepts. Because the analyzed texts were in Polish, we applied the Concraft tagger and the TermoPL tool to normalize different forms of the same words and to identify key phrases. For English, this step would be much easier because you can use the concept base of the Unified Medical Language System (UMLS).

We then run the GloVe algorithm on the extracted concepts to produce embeddings. To validate their quality, we designed a term analogy task and checked that embeddings trained on our corpus perform better than pretrained embeddings that are not specific to medical terminology.

[Figure: PCA projections of the obtained embeddings]
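
To make the idea of a term analogy task concrete, here is a minimal Python sketch of one common formulation, the vector-offset analogy ("a is to b as c is to ?"). The exact task design used in the paper may differ. The function names and the tiny 3-dimensional vectors are invented for illustration only; real GloVe embeddings would have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def solve_analogy(embeddings, a, b, c):
    """Return the term d for which 'a : b' is most like 'c : d'.

    Uses the vector offset b - a + c and picks the closest remaining
    term by cosine similarity (the query terms themselves are excluded).
    """
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = (term for term in embeddings if term not in (a, b, c))
    return max(candidates, key=lambda term: cosine(embeddings[term], target))


# Toy vectors, invented for this example (not taken from the paper).
toy_embeddings = {
    "cough": [1.0, 0.0, 0.2],
    "lungs": [1.0, 1.0, 0.2],
    "rash": [0.0, 0.0, 1.0],
    "skin": [0.0, 1.0, 1.0],
    "fever": [0.5, 0.0, 0.5],
}

prediction = solve_analogy(toy_embeddings, "cough", "lungs", "rash")
print(prediction)
```

An embedding set can then be scored by the share of analogy questions it answers correctly, which allows corpus-specific and pretrained embeddings to be compared on the same questions.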

Representations of visits

Each visit description is divided into three parts: an interview with the patient, a medical examination (e.g. results of tests, observations), and recommendations given by the doctor. We use only the first two parts to create a vector representation of a visit. It is a concatenation of the simple averages of the embeddings of the concepts contained in these texts.
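
The construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy 2-dimensional embeddings, and the choice to use a zero vector when a part contains no known concepts are all hypothetical and not taken from the paper.

```python
def average(vectors, dim):
    """Component-wise mean of a list of vectors.

    Assumption: an empty list yields a zero vector of length `dim`.
    """
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def visit_vector(interview_concepts, examination_concepts, embeddings, dim):
    """Concatenate the averaged embeddings of the interview and examination parts."""
    representation = []
    for concepts in (interview_concepts, examination_concepts):
        # Concepts without an embedding are skipped.
        known = [embeddings[c] for c in concepts if c in embeddings]
        representation.extend(average(known, dim))
    return representation


# Toy 2-dimensional embeddings, invented for this example.
toy_embeddings = {
    "cough": [1.0, 0.0],
    "fever": [0.0, 1.0],
    "wheezing": [0.2, 0.8],
}

vector = visit_vector(["cough", "fever"], ["wheezing"], toy_embeddings, dim=2)
print(vector)  # [0.5, 0.5, 0.2, 0.8]
```

The resulting vector has twice the dimension of a single concept embedding, one half per part of the visit description.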

Clustering of visits

Once we have a vector representation of each visit, we can cluster the visits with any of the well-known clustering methods. We tested k-means and hierarchical clustering. We cluster visits separately within each medical specialty to obtain a structure of visits characteristic of that specialty. The clusters can be visualized with t-SNE.

[Figure: t-SNE visualization of the clusters; each dot corresponds to one visit]
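
For readers who have not seen k-means written out, below is a minimal pure-Python sketch of the standard algorithm (Lloyd's iterations) applied to toy 2-dimensional "visit vectors". It is not the implementation used in the paper; the data, the random initialization with a fixed seed, and the parameter names are assumptions made for the example.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` into `k` groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct points drawn at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two well-separated toy groups of visit vectors (invented data).
visits = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
]
labels, centroids = kmeans(visits, k=2)
print(labels)
```

In practice the visits of each specialty would be clustered in a separate call, so that every specialty gets its own set of segments.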

Recommendations inside clusters

Because we did not use the third part of the descriptions, the recommendations, when generating the visit representations, our methodology can be used to predict this part: once a visit is assigned to the appropriate cluster, the recommendations typical of that cluster can support doctors in their decisions.
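
One simple way to picture this use case is sketched below in Python: assign a new visit to the nearest cluster centroid and list the recommendations that occurred most often in that cluster. This is a hypothetical illustration, not the procedure from the paper; the function names, centroids, and recommendation terms are all invented.

```python
from collections import Counter


def nearest_cluster(visit_vector, centroids):
    """Return the id of the centroid closest to the visit (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda cid: sum(
            (x - y) ** 2 for x, y in zip(visit_vector, centroids[cid])
        ),
    )


def frequent_recommendations(cluster_id, past_visits, top_n=1):
    """Most frequent recommendation terms among past visits of one cluster."""
    counts = Counter(
        rec
        for cid, recommendations in past_visits
        if cid == cluster_id
        for rec in recommendations
    )
    return [rec for rec, _ in counts.most_common(top_n)]


# Invented toy data: cluster centroids and past visits with their recommendations.
centroids = {"A": [0.0, 0.0], "B": [1.0, 1.0]}
past_visits = [
    ("A", ["rest", "fluids"]),
    ("A", ["rest"]),
    ("B", ["ultrasound"]),
    ("B", ["ultrasound", "follow-up visit"]),
]

cluster = nearest_cluster([0.9, 0.8], centroids)
suggestions = frequent_recommendations(cluster, past_visits)
print(cluster, suggestions)
```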

We obtained groups of visits with similar diagnoses expressed in ICD-10 codes. The Correspondence Analysis plot between clusters (blue dots) and ICD-10 codes (red triangles) for the gynecology clustering reveals two groups: diseases of the genitourinary system (N), connected with Clusters 1 and 3; and pregnancy, childbirth and the puerperium (O), connected with Cluster 2.
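
A correspondence analysis starts from a contingency table, here the counts of ICD-10 codes per cluster. The Python sketch below builds such a table, grouping codes by their leading letter; the function name, the cluster labels, and the code assignments are invented toy data, and the correspondence analysis itself is not shown.

```python
from collections import Counter


def contingency_table(cluster_labels, icd_codes):
    """Count visits per (cluster, leading ICD-10 letter).

    Returns a dict mapping each cluster to a Counter of leading letters.
    """
    table = {}
    for cluster, code in zip(cluster_labels, icd_codes):
        letter = code[0].upper()
        table.setdefault(cluster, Counter())[letter] += 1
    return table


# Invented toy data: one cluster label and one ICD-10 code per visit.
cluster_labels = [1, 1, 2, 3, 2, 3]
icd_codes = ["N76", "N95", "O26", "N91", "O99", "N92"]

table = contingency_table(cluster_labels, icd_codes)
for cluster in sorted(table):
    print(cluster, dict(table[cluster]))
```

In a table like this, clusters dominated by the same letter end up close to that code group in the correspondence analysis plot.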

Software for Interpretable Segmentation

The presented methods are implemented in the memr package for R; the name is an acronym for Multisource Embeddings for Medical Records. The package can be installed from the GitHub repository https://github.com/MI2DataLab/memr and is available under the MIT license. It creates embeddings of medical free-text records written by doctors and provides a wide range of tools for visualizing data and segmenting medical visits. These tools are intended to advance computer-supported medicine by facilitating the analysis and interpretation of medical data. The package can be used for applications such as recommendation prediction and patient clustering that can aid doctors in their practice.

If you are interested in other posts about explainable, fair and responsible ML, follow #ResponsibleML on Medium.

Interpretable Segmentation of Medical Free-Text Records Based on Word Embeddings was originally published in ResponsibleML on Medium.