covid19 & some computational-corpus linguistics

[This article was first published on Jason Timm, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A SHORT COURSE IN COMPUTATIONAL-CORPUS LINGUISTICS – using the R library text2vec – with a focus on working with (a controlled vocabulary of) multi-word expressions. Here we consider a Twitter corpus comprised of all tweets generated by the 535 voting members of the US Congress during the second session of the 116th Congress.

More specifically, we consider the usage of COVID19-related terms as a function of both time and party affiliation; we also investigate the conceptual relatedness of COVID19-relevant terms using a GloVe model and multi-dimensional scaling.

A cache of scalable & efficient methodologies for some common corpus-based tasks.

A COVID19 lexicon

tweets_dir <- '/home/jtimm/jt_work/GitHub/git_projects/us_lawmaker_tweets_2020/'
covid_dir <- '/home/jtimm/jt_work/GitHub/git_projects/A-covid19-lexicon/'
dictionary <-  readxl::read_xlsx ('covid_glossary_w_variants.xlsx') %>%
  filter(category != 'race-ethnicity') %>%

I have collated some COVID19-related terms from a few resources, most notably, this Yale Medicine glossary. Per this resource, each term has been categorized as one of the following:

## [1] "cv"                "interventions"     "medical_response" 
## [4] "prevention"        "socio-political"   "spread_of_disease"
## [7] "transmission"

As I have added terms, I have tried to fit them within this classification framework. I have also added a socio-political category to capture some of the civil-liberties-based rhetoric/protesting happening in the US in response to stay-at-home orders, as well as stimulus legislation, etc. COVID19 vocabulary as xlsx file. A good start, but could certainly be developed.

The table below illustrates the structure of the vocabulary for two COVID19-related concepts: ANTIVIRAL and HAND-HYGIENE. So, the descriptor_name column represents the higher-level concept; term_name column reflects the different (inflectional or orthographical) ways the concept can manifest in text. The actual form of the descriptor/concept is arbitrary.

So, as we move towards identifying/extracting COIV19-related terms from Twitter text, this vocabulary gives us the option to aggregate over terms to the higher-level concept (or descriptor). Some academic fields refer to this as the process of normalization.

dictionary %>% 
  filter(descriptor_name %in% c('antiviral', 'hand-hygiene')) %>%
  group_by(category, descriptor_name) %>%
  summarize(term_names = paste0(term_name, collapse = ' | '))  %>%
  DT::datatable(rownames = FALSE, options = list(dom = 't'))

Congressional Twitter Corpus (2020)

Again, our data set is a Twitter corpus comprised of all tweets generated by the 535 voting members of the US Congress during the second session of the 116th Congress. Code for extracting/building/updating the corpus using the R package rweet is available here, and the actual corpus as TIF/xlsx is available here. Updated 5/26/20

setwd(paste0(tweets_dir, 'tweets'))
tweets <- readxl::read_xlsx('us_lawmaker_tweets_full_2020-05-26.xlsx') %>%
  mutate(created_at = as.Date(created_at, format = "%Y-%B-%d"))

Corpus composition:

data.frame(tweets = format(nrow(tweets), big.mark = ','), 
           tokens = format(sum(tokenizers::count_words(tweets$text)), 
                           big.mark = ',')) %>%
tweets tokens
157,420 5,650,645

Next, we quickly grab some details about US lawmakers from the united states project. The Twitter corpus and lawmaker detail data sets can then be joined via Twitter handle.

leg_dets <- ''
twitters <- read.csv((url(leg_dets)), stringsAsFactors = FALSE) %>%
  rename (state_abbrev = state, district_code = district)

tweets1 <- tweets %>%
  mutate(screen_name = toupper(screen_name)) %>%
  left_join(twitters %>% 
              mutate(twitter = toupper(twitter)),
            by = c('screen_name' = 'twitter'))

Some sample tweets:

text2vec framework for NLP ::

text2vec is a beast of a text analysis R library. Here, we walk-through the building of some common text structures relevant to many downstream applications – using our congressional Twitter corpus. As text2vec implements R6 objects (a mystery to me), the framework is a bit funky. So, we present some hacks, etc. here, specifically for working with multi-word expressions – in the larger context of building document-term matrices, term-co-occurrence matrices, GloVe models & co-occurrence-based graph structures.

With the ultimate goal of investigating (1) some historical- and party-affiliation-based variation in the use of COVID19-related terms on Twitter, and (2) the conceptual relatedness of COVID19-related terms

Tokens & tokenizers

text2vec, like other text analysis frameworks, operates on a token object, which for a single document/tweet looks like the following:

tokenizers::tokenize_ptb(tweets$text[2], lowercase = TRUE)
## [[1]]
##  [1] "ohioans"           ":"                 "request"          
##  [4] "a"                 "mail-in"           "ballot"           
##  [7] "today"             "from"              "the"              
## [10] "secretary"         "of"                "state"            
## [13] "to"                "ensure"            "your"             
## [16] "vote"              "is"                "counted"          
## [19] "in"                "ohio"              "'s"               
## [22] "primary"           "election."         "the"              
## [25] "deadline"          "to"                "postmark"         
## [28] "your"              "ballot"            "is"               
## [31] "monday."           "https"             ":"                
## [34] "//"

Tokenization, even for English, is a non-trivial task. The tokenize_ptb function from the tokenizers package is pretty good (which is based on the Penn Treebank model). But there are still two instances above, eg, in which sentence-final punctuation is not tokenized: election. & monday.. So, when we go to build word-level models, election & election., eg, will be treated distinctly.

This bothers me. The code below sorts this and other issues out. Resulting/re-built text can then be fed to any simple space-based tokenizer, and things will be clean. Tokenize > clean tokens > rebuild text > re-tokenize.

## tokenizer --
t1 <- tokenizers::tokenize_ptb(tweets$text, lowercase = TRUE)
## Remove punct
t2 <- lapply(t1, gsub, 
             pattern = '([a-z0-9])([[:punct:]])', 
             replacement = '\\1 \\2') 
t3 <- lapply(t2, gsub, 
             pattern = '([[:punct:]])([a-z0-9])', 
             replacement = '\\1 \\2') 

t4 <- lapply(t3, paste0, collapse = ' ')
## Re-build
tweets$word_text <- unlist(t4)

Multi-word expressions & controlled vocabularies

Units of meaning often (ie, almost always) span multiple words and multiple grammatical categories. Here we briefly consider some supervised approaches to tricking tokenizers (and specifically text2vec) into treating a controlled vocabulary of multi-word expressions as single-units-of-meaning.

§ Some multi-word hacks

The spelling & inflectional variants of the COVID19-related concept FLATTEN THE CURVE are presented below:

flatten the curve | flatten_the_curve | flatten-the-curve | flattening the curve | flattening_the_curve | flattening-the-curve | flatteningthecurve | flattenthecurve

So, if a lawmaker on Twitter refers to the concept FLATTEN THE CURVE as flattenthecurve, without any spaces (& presumably prefixed with a hash tag), a space-based (or word-based) tokenizer will do right by the analyst investigating multi-word expressions. The same goes for flatten-the-curve and flatten_the_curve.

The form flatten the curve, however, will be tokenized as flatten and the and curve. Which is not helpful. Basically, we want to phrasify these three individual tokens as a single token. Such that in downstream applications, flatten the curve and flattenthecurve, eg, are (or can be) treated as instantiations of the same conceptual category.

The Collocations function/model from the text2vec package enables an unsupervised approach to identifying multi-word expressions, and results can be used to update token objects such that flatten the curve becomes flatten-the-curve. If flatten + the + curve is identified as an expression per the model.

Here, however, we are interested in a supervised (or controlled) approach, ie, we have our own multi-word lexicon of COVID19-related terms that we want phrasified. text2vec does not provide a straightforward way to do this. So, here we present a simple (albeit extended) hack.

multi_word_expressions <- subset(dictionary, grepl(' ', term_name))
sep = ' '
mas_que_dos <- subset(multi_word_expressions, grepl(' [a-z0-9]* ', term_name))

First: text2vec::Collocations builds out phrases in a piecemeal fashion. Long story short: in order to identify (or phrasify) flatten the curve as a multi-word expression, it must first identify (or phrasify), eg, flatten the. Then flatten-the and curve can be phrasified as flatten-the-curve. So, for multi-word expressions > 2, we have to build out some component parts. Some multi-word expressions in the COVID19 vocabulary >2 words:

##  [1] "flattening the curve"          "drive thru tests"             
##  [3] "personal protective equipment" "flatten the curve"            
##  [5] "front line worker"             "great american comeback"      
##  [7] "global economic cirsis"        "return to work"               
##  [9] "high risk population"          "god bless america"

A simple function for extracting component 2-word phrases from multi-word expressions >2:

new_two_grams <- lapply(mas_que_dos$term_name, function(x) {
             gregexpr("[^ ]+ [^ ]+", # sep = ' '
  )[[1]] }) %>%
  unlist() %>%

A look at the “pieces” of our mutli-word expressions composed of more than two words:

##  [1] "sars cov"            "drive through"       "drive thru"         
##  [4] "personal protective" "flatten the"         "flattening the"     
##  [7] "front line"          "high risk"           "shelter in"         
## [10] "long term"           "stay at"             "home order"         
## [13] "home orders"         "wash your"           "wear a"             
## [16] "work from"           "working from"        "dont bankrupt"      
## [19] "global economic"     "global health"       "god bless"          
## [22] "great american"      "made in"             "open up"            
## [25] "paycheck protection" "re open"             "return to"          
## [28] "person to"           "person transmission"

Then we add these “pieces” to the full multi-word portion of the COVID19 lexicon.

multi_word_expressions_replace <- gsub(' ', sep, multi_word_expressions$term_name)
multi_word_expressions_replace <- c(multi_word_expressions_replace,
                                    new_two_grams )

§ Some text2vec primitives

Before we can dupe text2vec into phrasifying our multi-word COVID19 terms, we first need to build two basic text2vec (data) structures: an itoken object (or iterator) & a vocabulary object. The former containing (among other things) a generic tokens object. Again, see this vignette for more technical details. Regardless of your text2vec objectives, these will (almost) always be your first two opening moves.

mo <- text2vec::itoken(tweets$word_text, 
                       preprocessor = tolower,
                       tokenizer = text2vec::space_tokenizer, 
                       n_chunks = 1,
                       ids = tweets$status_id) 
vocab <- text2vec::create_vocabulary(mo, stopwords = character(0)) #tm::stopwords()

Then we build a skeleton Collocations model per code below. But we never actually run the model.

model <- text2vec::Collocations$new(vocabulary = vocab, sep = sep) 

Instead, all we want to do is assign the parameter model$.__enclos_env__$private$phrases our list of multi-word expressions.

model$.__enclos_env__$private$phrases <- multi_word_expressions_replace

Using this dummy Collocations model, we then transform the itoken object built above. Here, transform means updating the token object to account for multi-word expressions.

it_phrases <- model$transform(mo) 
term_vocab <- text2vec::create_vocabulary(it_phrases) 
term_vocab1 <- text2vec::prune_vocabulary(term_vocab, term_count_min = 2)

## HACK 
ats <- attributes(term_vocab1)
term_vocab2 <- subset(term_vocab1, grepl('^[A-Za-z]', term) & nchar(term) > 2)
t2v_vocab <- term_vocab2
attributes(t2v_vocab) <- ats
#egs <- it_phrases$nextElem()$tokens
#egs1 <- lapply(egs, paste0, collapse = ' ')

And now we can investigate frequencies for all forms included in the congressional Twitter corpus, including (but not limited to) our multi-word expressions.

term_freqs <- term_vocab2 %>%
  left_join(dictionary , by = c('term' = 'term_name'))

descriptor_freq <- term_freqs %>%
  group_by(category, descriptor_name) %>%
  summarize(term_freq = sum(term_count)) %>%

Some relative frequencies for spelling & lexical variants for a sample of multi-word expressions from the COVID19 lexicon.

term_freqs %>%
  filter(descriptor_name %in% c('social-distancing', 'front-line-workers',
                                'flatten-the-curve')) %>%
  arrange(desc(term_count)) %>%
  mutate(tf = paste0(term, ' (', term_count, ')')) %>%
  group_by(descriptor_name) %>%
  summarize(relative = paste0(tf, collapse = ' | ')) %>%
  DT::datatable(rownames = FALSE, options = list(dom = 't'))

So, the trickier/hackier part is complete. The text2vec vocabulary object now recognizes the multi-word expressions in our COVID19 lexicon as single units of meaning. And we can carry on.

GloVe model & COVID19 semantic space

The next piece is to build a GloVe model to investigate semantic relatedness among concepts included in our COVID19 lexicon. The general workflow here is:

  1. Build a term-co-occurrence matrix (TCM),
  2. Build an n-dimensional GloVe model based on the TCM,
  3. Further reduce GloVe dimensions via tSNE, PCA, or MDS,
  4. Plot terms in a reduced 2D space.

Here, we have the additional task of aggregating the TCM from terms to descriptors (or concepts), before building the GloVe model. The code below creates a simple table that crosswalks terms to concepts.

term_vocab3 <- term_vocab2 %>%
  rename(term_name = term) %>%
  left_join(dictionary) %>%
  mutate(descriptor_name = ifelse(, 
           category = ifelse(, 
                             category)) %>%

§ Term-co-occurrence matrix

Utilizing previously constructed text2vec primitives, we use the text2vec::create_tcm function to construct a term-co-occurrence matrix, specifying a context window-size of 5 x 5.

tcm <- text2vec::create_tcm(it = it_phrases,
                            vectorizer = vectorizer,
                            skip_grams_window = 5L)

Then we implement the lvdr_aggregate_matrix function from the lexvarsdatr package to aggregate term vectors to a single descriptor vector (for forms included in the COVID19 lexicon).

tcm <- tcm[, order(colnames(tcm))]
tcm <- tcm[order(rownames(tcm)), ]

tcm1 <- lexvarsdatr::lvdr_aggregate_matrix(tfm = tcm, 
                                           group = term_vocab3$descriptor_name, 
                                           fun = 'sum')

Dimensions of TCM:

## [1] 42468 42468

§ GloVe model

We specify GloVe model parameters via the text2vec::GlobalVectors function, and build term vectors using fit_transform. Vectors are comprised of n = 128 dimensions.

glove <- text2vec::GlobalVectors$new(rank = 128, 
                                     #vocabulary = row.names(tcm1), 
                                     x_max = 10)
wv_main <- glove$fit_transform(tcm1, 
                               n_iter = 10, 
                               convergence_tol = 0.01)
wv_context <- glove$components
glove_vectors <- wv_main + t(wv_context)

§ Semantic & conceptual relatedness

With GloVe vectors in tow, options abound. Here, we demonstrate two fairly straightforward applications. The first – a quick look at nearest neighbors for a set of COVID19-related concepts. Via cosine similarity and the LSAfun::neighbors function.

eg_terms <- c('stay-at-home', 'outbreak', 
              'front-line-workers', 'vaccine',
              'relief' )

x <- lapply(eg_terms, 
            n = 10)

names(x) <- eg_terms

Top 10 nearest neighbors for stay-at-home, outbreak, front-line-workers, vaccine & relief. So, despite a relatively small corpus, some fairly nice results.

## $`stay-at-home`
##      stay-at-home          practice social-distancing flatten-the-curve 
##         1.0000000         0.5634366         0.5500531         0.4852880 
##              sick              stay             avoid           staying 
##         0.4788068         0.4424411         0.4312511         0.4096375 
##            unless               you 
##         0.3959533         0.3679505 
## $outbreak
##    outbreak coronavirus    pandemic       covid   covidー19      crisis 
##   1.0000000   0.6616114   0.5697251   0.5494554   0.5296838   0.4987309 
##    response      spread       virus     impacts 
##   0.4826594   0.4798837   0.4642697   0.4621730 
## $`front-line-workers`
## front-line-workers  essential-workers   first-responders                ppe 
##          1.0000000          0.5883052          0.5341892          0.4832370 
##             heroes      professionals             nurses          frontline 
##          0.4822017          0.4641108          0.4569190          0.4568646 
##            doctors          equipment 
##          0.4491088          0.4448107 
## $vaccine
##     vaccine  treatments     develop   therapies   treatment development 
##   1.0000000   0.5817043   0.4566783   0.4113966   0.4112393   0.3854895 
##        easy       tests     genesis    research 
##   0.3832985   0.3499214   0.3297194   0.3265058 
## $relief
##     relief    funding        aid additional    provide       bill    support 
##  1.0000000  0.6063464  0.5879983  0.5798526  0.5768965  0.5732163  0.5670530 
##    package assistance  cares-act 
##  0.5640401  0.5563580  0.5379026

For a smarter approach to visualizing nearest neighbors, as well as visualizing lexical semantic change historically, see this post. Or, a similar analysis utilizing Google n-gram data.

In the second application, we consider a two-dimensional perspective on the semantic relatedness of concepts included in our COVID19 lexicon. While we have built vectors for all forms attested in the Congressional Twitter Corpus, here we subset this matrix to just COVID concepts. Via classical multidimensional scaling, we project 128 features (per GloVe) into two-dimensional Euclidean space.

keeps <- descriptor_freq %>% filter(term_freq > 30)
glove1 <- glove_vectors[rownames(glove_vectors) %in% 
sim_mat <- text2vec::sim2(glove1, 
                             method = "cosine", 
                             norm = "l2")

# data set too small for tSNE ---

y1 <- cmdscale(1-sim_mat, eig = TRUE, k = 2)$points %>% 
  data.frame() %>%
  mutate (descriptor_name = rownames(sim_mat)) %>%
  left_join(dictionary %>% distinct(descriptor_name, category)) 

A semantic map of COVID19 concepts is presented below. Some intuitive structure for sure; some less so.

y1 %>% 
  ggplot(aes(X1,X2, label = descriptor_name)) +
  geom_point(aes(color = category), size = 3.5) +
    data  = y1,
    nudge_y =  0.025,
    segment.color = "grey80",
    direction = "y",
    hjust = 0, 
    size = 3 ) +
  ggthemes::scale_colour_stata() + 
  theme_minimal() +
  #theme_classic() +
  theme(legend.position = "bottom",
        plot.title = element_text(size=14))+ 
  labs(title="COVID19-related concepts in 2D semantic space")

See a similar application using co-occurrence and place-names to create a text-based map of the world.

Networks & lexical co-occurrence

Lastly, we build & visualize a co-occurrence network based on the previously constructed term-co-occurrence matrix. The lexvarsdatr package streamlines these processes, and enables straightforward extraction of sub-networks from large matrices. See package description for a more detailed discussion.

Below, we convert our count-based TCM to a positive point-wise mutual information-based matrix (via lvdr_calc_ppmi), and extract the 20 strongest collocates (via lvdr_extract_network) for five (cherry-picked) concepts included in the COVID19 lexicon.

network <- tcm1 %>% 
  lexvarsdatr::lvdr_calc_ppmi(make_symmetric = TRUE) %>%
  lexvarsdatr::lvdr_extract_network (
    target = c('contact-tracing', 
    n = 20)

And then visualize:

network %>%
  tidygraph::as_tbl_graph() %>%
  ggraph::ggraph() +
  ggraph::geom_edge_link(color = 'darkgray') + 
  ggraph::geom_node_point(aes(size = value, 
                              color = term,
                              shape = group)) +
  ggraph::geom_node_text(aes(label = toupper(label), 
                             filter = group == 'term'), 
                             repel = TRUE, size = 4) +
  ggraph::geom_node_text(aes(label = tolower(label), 
                             filter = group == 'feature'), 
                             repel = TRUE, size = 3) +
  theme_minimal() +
  ggtitle('A COVID19 co-occurrence network') +
  theme(legend.position = "none",
        plot.title = element_text(size=14))


So, more of a resource/guide than a post-proper. Mostly an attempt on my part to collate some scattered methods. And a bit of an ode totext2vec.

To leave a comment for the author, please follow the link and comment on their blog: Jason Timm. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)