
Mandell O Mandell, Please Grant Me Some Insight!

[This article was first published on r on Everyday Is A School Day, and kindly contributed to R-bloggers.]

TL;DR: Bring a textbook to life by using a simple Natural Language Processing method (n-grams) to guide focused reading and build a robust differential diagnosis.

Introduction

What would you do if you encountered a clinical case with quite a few nuanced symptoms? How do you know which article to look at to build a differential diagnosis? It is helpful to have ample experience to guide clinical reasoning. But what if you have not encountered such a clinical case? The next step is usually reaching out to more experienced colleagues or mentors. What if we could also reach out to our buddy old pal, the textbook? But how do we read the book more efficiently, or perhaps more smartly?

For example, we have a 50-year-old male who presented with a productive cough and diarrhea for three weeks. CT chest showed ground glass opacities. CT abdomen/pelvis showed splenomegaly and mesenteric lymphadenopathy. No risk factors for HIV. He has a pet bird. What would your differential diagnosis be? Psittacosis would cross any Infectious Disease provider’s mind, but what other conditions may have similar symptoms?

Let’s take ground glass, diarrhea, splenomegaly, lymphadenopathy, and bird as our five keywords. How can we quickly scan through Mandell and have it show us the chapters of interest?

Thought Process

List all Mandell PDF files

# list all PDFs; remember to change '/path/to/your/mandell'
files <- list.files(path = "/path/to/your/mandell", pattern = "pdf$")

The pattern pdf$ means look for files whose names end with pdf.
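A quick way to convince yourself of what the pattern matches (the filenames here are made up, purely for illustration):

# "$" anchors the match to the end of the name, so only files ending in pdf pass
grepl("pdf$", c("12_Fever.pdf", "notes_pdf.txt", "cover.PDF"))
# [1]  TRUE FALSE FALSE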


Create an empty dataframe & Mine Texts

library(tidyverse)
library(pdftools)
library(tidytext)

# Create an empty dataframe - see comment 2.1
full_bigram <- tibble(title = character(), chapter = integer(), page = integer(), bigram = character())

# Mine Text Section
## For loop to create a ngram - see comment on 'what is an ngram?'
## since there are 324 files/chapters, we are basically parsing and adding on to our dataframe file by file

for (i in 1:324) {
  
  # use the pdftools package to parse this chapter's text into a dataframe - see comment 2.21
  # please remember to change /path/to/your/mandell (note the trailing slash)
  pdf <- pdf_text(paste0("/path/to/your/mandell/", files[i])) %>%
    as_tibble()
  
  # Extract the title of the chapter from its filename
  title <- files[i] %>%
    str_extract(pattern = "(?<=[[:digit:]]_).*(?=.pdf)") # comment 2.22
  
  # Get the chapter number
  num <- files[i] %>%
    str_extract(pattern = "^[[:digit:]]+(?=_)") # comment 2.23
  
  # Text mining
  text <- pdf %>%
    mutate(title = title, # insert the title variable from above
           chapter = as.integer(num), # insert the num variable from above as an integer
           page = row_number()) %>% # insert the row number, which is the page number, see comment 2.24
    # use tidytext::unnest_tokens to create a bigram
    unnest_tokens(bigram, value, token = "ngrams", n = 2) %>%
    # separate the bigram into 2 columns
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    # remove stop words in both columns (word1, word2), comment 2.26
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word) %>%
    # merge those 2 columns back
    unite(bigram, word1, word2, sep = " ")
  
  full_bigram <- bind_rows(full_bigram, text) # comment 2.25
}

2.1 Create an empty tibble dataframe with the columns title, chapter, page, and bigram, each with its respective data type

2.21 Parse the text of each PDF with pdftools::pdf_text

2.22 The regular expression "(?<=[[:digit:]]_).*(?=.pdf)" means extract everything sandwiched between a digit followed by an underscore and .pdf (see the worked example after these notes)

2.23 The regular expression "^[[:digit:]]+(?=_)" means: ^ == start of the string, [[:digit:]] == any digit, + == one or more, (?=_) == followed by an underscore

2.24 pdftools::pdf_text parses each PDF page as a row, hence we insert row_number() as our page number

2.25 bind_rows appends each chapter's bigram tibble to full_bigram, one chapter at a time

2.26 Stop words are common, low-information words such as "the", "of", and "a"; tidytext::stop_words supplies the list used here
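Here is the worked example referenced in notes 2.22 and 2.23, run on a made-up filename (real chapter filenames will differ, but are assumed to follow the <chapter number>_<title>.pdf pattern):

library(stringr)

fname <- "182_Legionnaires_Disease.pdf" # hypothetical filename
str_extract(fname, "(?<=[[:digit:]]_).*(?=.pdf)") # "Legionnaires_Disease"
str_extract(fname, "^[[:digit:]]+(?=_)")          # "182"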


What is an ngram?

According to Wikipedia, an n-gram is a contiguous sequence of n items from a given sample of text or speech, where n is the number of items.

For example, the phrase Mandell O Mandell, Please Grant Me Some Insight! will have the following 2-grams/bigrams: Mandell O, O Mandell, Mandell Please, Please Grant, Grant Me, Me Some, Some Insight. For best practice, make sure the letters are all lower case.
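To see this in action, here is a minimal sketch using tidytext on that same phrase; unnest_tokens lowercases the text by default, which is why the resulting bigrams come out in lower case:

library(tidyverse)
library(tidytext)

tibble(value = "Mandell O Mandell, Please Grant Me Some Insight!") %>%
  unnest_tokens(bigram, value, token = "ngrams", n = 2)
# returns one row per bigram: "mandell o", "o mandell", ..., "some insight"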


Save dataframe for future use

save(full_bigram, file = "mandell_bigram.Rdata")

This saves the full_bigram dataframe as the file mandell_bigram.Rdata in your working directory.

Create function to search for keywords

library(tidyverse)
library(tidytext)

# load data
load(file = "mandell_bigram.Rdata")

# create a find function - comment 3.1
find <- function(a, b, c, d, e, ngram = bigram, data = full_bigram){
  # uses rlang quoting; check out https://rlang.r-lib.org/reference/enquo.html
  a1 <- enquo(a)
  b1 <- enquo(b)
  c1 <- enquo(c)
  d1 <- enquo(d)
  e1 <- enquo(e)
  ngram <- enquo(ngram)
  
  find2 <- data %>%
    drop_na() %>%
    # group by chapter
    group_by(chapter) %>%
    # keep only ROWS whose bigram contains ANY of the 5 keywords
    filter(str_detect(!! ngram, paste0(!! a1, "|", !! b1, "|", !! c1, "|", !! d1, "|", !! e1))) %>%
    # keep only CHAPTERS that contain ALL 5 keywords
    filter(
        any(str_detect(!! ngram, !! a1)) & 
        any(str_detect(!! ngram, !! b1)) & 
        any(str_detect(!! ngram, !! c1)) &
        any(str_detect(!! ngram, !! d1)) &
        any(str_detect(!! ngram, !! e1))) %>%
    # count how many times each chapter mentions the 5 keywords
    # the higher the frequency, the more that chapter deserves a closer look
    # btw, count automatically adds a column 'n' with the frequency
    count(title, chapter, !! ngram)
  
  return(find2)
}

# This is where you would insert your keywords, see comment 4.1
a <- "ground glas"
b <- "bird"
c <- "lymph"
d <- "diarrhe"
e <- "spleno"

# execute the function 
result2 <- find(a,b,c,d,e)

result <- result2 %>%
  group_by(chapter) %>%
  # sum the total of 'n' (count from the function)
  mutate(total = sum(n)) %>%
  arrange(desc(total))

4.1 Ideally, you want to maximize the return of the regular expression (RegEx) search, which is why the keywords above are word stems rather than full words; a short illustration follows.
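A minimal illustration of why stems cast a wider net (the strings are made up, just to show the matching behaviour):

library(stringr)

# "diarrhe" catches diarrhea, diarrheal, ... in one pass
str_detect(c("watery diarrhea", "diarrheal illness", "bloody stool"), "diarrhe")
# [1]  TRUE  TRUE FALSE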

Look at what we have here

# view the result without all the noise by keeping only one distinct row per title

result %>%
  ungroup() %>%
  arrange(desc(total)) %>%
  distinct(title, .keep_all = TRUE) %>%
  view()

Wow, I wouldn’t have thought about toxoplasma or MAI! Interesting result. But are they reliable? We would have to look into it a bit more. Please note that we have used distinct to shorten the result, so it only shows each distinct chapter/title and its total, arranged in descending order.


Let’s look at Toxoplasma and MAI in detail

toxo <-
result %>%
  filter(str_detect(title, "Toxo"))

toxo %>% view()


Looks legit. Upon further reading of the chapter, these five keywords should indeed also prompt this condition. I don’t remember birds being a host for toxoplasma. This creates more questions, but I take it as an excellent opportunity to look further into its references and, of course, WHAT KIND OF BIRD!?!

Observe that some words containing lymph, such as lymphoma, were also captured. We have to be mindful of this.
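A quick check of why that happens: each keyword is a plain substring pattern, so any longer word containing it also matches (made-up strings for illustration):

library(stringr)

str_detect(c("lymph node", "mesenteric lymphadenopathy", "hodgkin lymphoma"), "lymph")
# [1] TRUE TRUE TRUE

# a more specific stem narrows the match, at the cost of missing other phrasings
str_detect(c("lymph node", "mesenteric lymphadenopathy", "hodgkin lymphoma"), "lymphaden")
# [1] FALSE  TRUE FALSE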


mai <-
result %>%
  filter(str_detect(title, "Myco"))

mai %>% view()

Now this is interesting. Observe that the bigrams showed exotic birds and bird stain. How curious. Looking at the actual text, it was actually referring to hypersensitivity pneumonitis, and the bigram bird stain was stitched together from two different paragraphs that happened to sit on the same line of the PDF. This is one of the limitations of using PDFs to create n-grams.


Something even more interesting is that the PDF I used to extract words was an older edition. When I looked at the most recent Mandell edition, exotic bird is no longer included as a risk factor for hypersensitivity pneumonitis. I’m loving this! In just a short period of time, one can make different connections and also appreciate the differences between editions of a textbook!

Opportunity for improvement / Future potential

  • Imagine you had n-grams of all available case reports, case series, systematic reviews, etc. That would be very helpful for building a differential diagnosis. It may be possible with the PubMed API and an ever-growing text/topic-modelling database
  • What if you had bigrams of different textbooks (e.g., rheum, pulm, radiology, etc.)? What a robust ddx we would be able to build
  • Scrape the website instead of the PDF to get a more accurate n-gram, since we would not have the bird stain problem
  • Add the marginal probability/frequency of each of the buzzwords to assess its marginal contribution (a rough sketch of this idea follows the list)
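Here is a rough sketch of that last idea, assuming the result2 tibble returned by find() above (columns title, chapter, bigram, n): tally, per chapter, how many bigram hits each keyword contributes, to see which buzzword drives a chapter's total.

library(tidyverse)

keywords <- c(a, b, c, d, e) # the same stems used earlier

map_dfr(keywords, function(k) {
  result2 %>%
    filter(str_detect(bigram, k)) %>%
    group_by(title, chapter) %>%
    summarise(keyword = k, hits = sum(n), .groups = "drop")
}) %>%
  arrange(chapter, desc(hits))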

Conclusion/Lessons Learnt

Phew, that was intense!

  • text mining with tidytext
  • extracting texts using pdftools
  • removing stopwords to improve relevant text extraction
  • NLP can be used to enhance learning and to build a robust differential diagnosis
