article extraction, doc2vec & health news coverage in online media

[This article was first published on Jason Timm, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


This post demonstrates a simple procedure for extracting articles from online news sources using the quicknews package. We also demonstrate methods for entity extraction based on a controlled vocabulary (here, the MeSH thesaurus & hierarchically-organized vocabulary), as well as a quick implementation of a doc2vec model.

Gather article metadata

While primarily an article extractor, quicknews also includes a function for scraping article details from Google News, qnews_get_newsmeta. Below we search for article metadata for eleven health-related terms.


terms <- c('covid-19', 
           'mental health',
           "alzeimer's disease",
           'substance abuse',
           'lung disease',
           'heart disease',
           'kidney disease')

metas <- lapply(terms,
                quicknews::qnews_get_newsmeta) %>%
  bind_rows() %>%
  distinct(link, .keep_all = TRUE) %>%
  mutate(nid = as.character(row_number())) %>%
  select(nid, term:link)

A summary of search results are detailed below; the function returns a maximum of 100 articles per query term.

## alzeimer's disease             cancer           covid-19           diabetes 
##                 59                 99                100                100 
##      heart disease     kidney disease       lung disease      mental health 
##                 98                 98                100                 98 
##            obesity    substance abuse            suicide 
##                100                100                100

A sample set of records and metadata features:

metas %>%
  select(-title) %>%
  sample_n(3) %>%
  DT::datatable(rownames = F,
                options = list(dom = 't'))

Article extractor

The qnews_extract_article function is designed for multi-threaded text extraction from HTML. A simple approach, with no Java dependencies. HTML markups, comments, extraneous text, etc. are removed mostly via node type, node-final punctuation, character length, and a small dictionary of “junk” phrases.

articles <- quicknews::qnews_extract_article(url = metas$link, 
                                             cores = 6)

Post article extraction, we add article metadata obtained from qnews_get_newsmeta.

articles1 <- articles %>% 
  left_join(metas %>% 
              select(nid, link, term, source), 
            by = c('doc_id' = 'link')) 

MeSH vocabulary

The PubmedMTK package includes as a data frame the MeSH thesaurus & hierarchically-organized vocabulary – comprised of 2021 versions of descriptor & trees files made available via NLM-NIH.

health_terms <- pmtk_tbl_mesh[!duplicated(pmtk_tbl_mesh$TermName),] %>%
  mutate(DescriptorName = gsub(' ', '_', DescriptorName),
         TermName = gsub(' ', '_', TermName)) %>%
  filter(!cats %in% c(## 'Chemicals and Drugs', 
                        'Publication Characteristics',
                        #'Named Groups',
                        'Information Science')) %>%
  filter(!TermName %in% c('who', 'will')) %>%
  filter(!grepl(',', TermName)) %>%
  select(DescriptorName, TermName)

The dictionary is designed with concept normalization in mind. Descriptors serve as category/concept names; terms are text instantiations of a given category/concept. The table below demonstrates this relationship for the concept “heart failure.”

hf <- health_terms %>%
  filter(DescriptorName == 'heart_failure') 

hf %>%
  head() %>%
  DT::datatable(rownames = F,
                options = list(dom = 't'))

Examples of “heart failure” from the online news corpus are highlighted below:

egs <- PubmedMTK::pmtk_locate_term(text = articles1$text,
                                   doc_id = c(1:nrow(articles1)),
                                   term = gsub('_', ' ', hf$TermName),
                                   stem = F,
                                   window = 10)

egs$kwic <- paste0('... ', egs$lhs, 
                   ' <span style="background-color:lightblue">', 
                   egs$instance, '</span> ', egs$rhs, ' ...')

egs %>%
  group_by(toupper(instance) )%>%
  slice(1:2) %>%
  ungroup() %>%
  select(doc_id, kwic) %>%
  DT::datatable(rownames = F,
                escape = F,
                options = list(dom = 't'))

Some text processing

Here, some simple text processing procedures to facilitate concept normalization via the MeSH lexicon. Below, we (1) tokenize news articles via the corpus::text_tokens function, and (2) cast tokens into a data frame. (A more complete summary of a generic NLP workflow can be found here.)

a1 <- corpus::text_tokens(x = articles1$text)
names(a1) <- articles1$nid

df <- reshape2::melt(a1)
                     c("value", "L1"),
                     skip_absent = TRUE) 

The next step is to recode multi-word phrases. MeSH includes thousands of terms/descriptors comprised of more than one word, eg, “heart failure.” The txt_recode_ngram function from the udpipe package can be used to “phrasify” these multi-word expressions, such that subsequent NLP processes will identify them as single units of meaning (as, eg, “heart_failure”).

health_terms$ngram <- stringr::str_count(health_terms$TermName,stringr::fixed('_')) + 1
multis <- subset(health_terms, ngram > 1)  

df$newness <- udpipe::txt_recode_ngram(tolower(df$token),
                                       compound = c(multis$TermName),
                                       ngram = c(multis$ngram),
                                       sep = '_')

To normalize MeSH terms (in our health-related news corpus) to MeSH descriptors (or concepts), we join the MeSH lexicon to the tokenized data frame. Moving forward, then, we have access to either term or descriptor (and any/all features included in the MeSH hierarchy).

## join lexicon
df1 <- data.table::setDT(health_terms)[df, on = "TermName == newness"]
df1[, token_id := seq_len(.N), by = doc_id]
df2 <- subset(df1, !

Lastly, we re-build corpus texts using descriptors instead of terms.

x3  <- df2 %>%
  mutate(DescriptorName =ifelse(, 
                                DescriptorName)) %>%
  group_by(doc_id) %>%
  summarise(text = paste0(DescriptorName, collapse = ' ')) %>%
  mutate(nwords = tokenizers::count_words(text)) %>%
  filter(nwords < 1000 & nchar(text) > 0)
## `summarise()` ungrouping output (override with `.groups` argument)


Per new normalized text, we use the doc2vec package to build a 100 dimension distributed memory “Paragraph Vector” model.

model.d2v <- doc2vec::paragraph2vec(x = x3, 
                                    type = "PV-DM", 
                                    dim = 100, 
                                    iter = 20,
                                    min_count = 3, 
                                    lr = 0.05, 
                                    threads = 1)

We then extract document & word embeddings from the model, and combine the two matrices. As the model situates documents and words in the same embedding space, we can investigate the relatedness among (representations of) both MeSH terms and health-related news articles.

embedding.words <- as.matrix(model.d2v, which = "words") <- as.matrix(model.d2v,   which = "docs")

## subset embeddings to vocabs
ebw <- embedding.words[rownames(embedding.words) %in% health_terms$DescriptorName,]
both <-, list(, ebw))

To visualize/explore these high dimensional representations, we implement tSNE via the Rtsne package.

tsne <- Rtsne::Rtsne(X = both, 
                     check_duplicates = T,
                     perplexity = 25)

tsne0 <- data.frame(nid = rownames(both), tsne$Y) %>%
  left_join(articles1) %>%
  mutate(type = ifelse(, 'term', 'doc'),
         nid = ifelse(!, title, nid),
         term = ifelse(, 'term', term)) 

Model visualization

Finally, a quick plot using plotly. MeSH terms represented as beige triangles; articles color-coded by search term (per results from qnews_get_newsmeta).

p <- tsne0 %>%
  mutate(id1 = stringr::str_wrap(string = nid,
                                    width = 30,
                                    indent = 1,
                                    exdent = 1)) %>%
  ggplot2::ggplot(aes(x = X1,
                      y = X2,
                      text = id1,
                      color = term,
                      shape = type
                      )) +
  geom_hline(yintercept = 0, color = 'gray') +
  geom_vline(xintercept = 0, color = 'gray') +
  ggplot2::geom_point(size = 1.25) +
    values = colorRampPalette(ggthemes::stata_pal()(8))(12)) +
  theme_minimal() +
  theme(legend.position = 'right')

                 height = 500, #width = 700,
                 tooltip = 'text') %>%
       plotly::layout(showlegend = T)

To leave a comment for the author, please follow the link and comment on their blog: Jason Timm. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)