
corpus query and grammatical constructions

  • This post demonstrates a simple collection of functions from my R package corpuslingr. The functions streamline two sets of corpus linguistics tasks:

    • annotated corpus search of grammatical constructions and complex lexical patterns in context, and
    • detailed summary and aggregation of corpus search results.

    While still in development, the package should be useful to linguists and digital humanists interested in having BYU corpora-like search & summary functionality when working with (moderately sized) personal corpora, as well as to researchers interested in performing finer-grained, more qualitative analyses of language use and variation in context. The package is available for download at my GitHub site.

    library(tidyverse)
    library(corpuslingr) #devtools::install_github("jaytimm/corpuslingr")
    library(corpusdatr) #devtools::install_github("jaytimm/corpusdatr")

    Search syntax

    Under the hood, corpuslingr search is regex- and tuple-based, akin to the RegexpParser function in Python's Natural Language Toolkit (NLTK). This facilitates search of grammatical and lexical patterns composed of:

    • different types of elements (e.g., form, lemma, or part-of-speech),
    • contiguous and/or non-contiguous elements,
    • positionally fixed and/or free (i.e., optional) elements.

    Regex character matching is streamlined with a simple "corpus querying language" modeled after the more intuitive and transparent syntax used in the online BYU suite of English corpora. This allows for convenient specification of search patterns composed of form, lemma, & pos, with all of the functionality of regex metacharacters and repetition quantifiers.

    Example searches & syntax are presented below, which load with the package as clr_ref_search_egs. A full list of part-of-speech codes can be viewed here, or via clr_ref_pos_codes.

    [Table: example search syntax]
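    To browse these examples directly in R, the reference object can be printed or rendered as a table (a minimal sketch; knitr::kable is just one convenient rendering option):

    corpuslingr::clr_ref_search_egs %>% head()

    # or render as a table, e.g. in an R Markdown document
    knitr::kable(corpuslingr::clr_ref_search_egs)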

    Corpus search

    For demo purposes, we use the cdr_slate_ann corpus from my corpusdatr package. A simple description of the corpus is available here. Using the corpuslingr::clr_set_corpus function (which builds tuples and sets character onsets/offsets), we ready the corpus for search.

    slate <- corpusdatr::cdr_slate_ann %>% 
      corpuslingr::clr_set_corpus()

    SIMPLE SEARCH

    The clr_search_gramx() function returns instantiations of a search pattern without context; it is meant for quick searches. Results are returned as a single data frame.

    • ADJECTIVE and ADJECTIVE, e.g. “happy and healthy”
    search1 <- "ADJ and ADJ"  
    slate %>%
      corpuslingr::clr_search_gramx(search=search1)%>%
      select(doc_id,token,tag)%>%
      head()

    SEARCH IN CONTEXT

    The clr_search_context() function builds on clr_search_gramx() by adding surrounding context to the search phrase. Search windows can be specified using the LW/RW parameters. The function returns a list of two data frames.

    The first, BOW, presents results in a long format, which can be used to build word embeddings, for example. The second, KWIC, presents results with the surrounding context rebuilt in more or less KWIC fashion. Both data frames serve as intermediary data structures for subsequent analyses.

    • VERB PRP$ way PREP NPHR, e.g. “make its way through the Senate”

    Per CQL above, NPHR can be used as a generic noun phrase search.

    search2 <- "VERB PRP$ way (through|into) NPHR" 
    searchResults <- slate %>%
      corpuslingr::clr_search_context(search=search2, 
                                      LW=5, 
                                      RW = 5)

    KWIC object:

    searchResults$KWIC %>% head() %>% select(-eg)

    BOW object:

    searchResults$BOW %>% head()

    Search summary

    The clr_get_freq() function enables quick aggregation of search results. It calculates token and text frequency for search terms, and allows the user to specify how to aggregate counts with the agg_var parameter.

    • VERB up, e.g. “pass up”
    search3 <- "VERB up"
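
    Before plotting, the aggregated counts can be eyeballed directly; this sketch reuses the same calls that feed the figure below:

    slate %>%
      corpuslingr::clr_search_gramx(search=search3) %>%
      corpuslingr::clr_get_freq(agg_var=c("lemma"),
                                toupper =TRUE) %>%
      head()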

    The figure below illustrates the top 20 instantiations of the grammatical construction VERB up.

    slate %>%
      corpuslingr::clr_search_gramx(search=search3)%>%
      corpuslingr::clr_get_freq(agg_var=c("lemma"),
                                toupper =TRUE) %>%
      slice(1:20)%>%
      ggplot(aes(x=reorder(lemma,txtf), y=txtf)) + 
        geom_col(width=.6, fill="steelblue") +  
        coord_flip()+
        labs(title="Top 20 instantiations of 'VERB up' by frequency")

    Although search is quicker when multiple search terms are included in a single call, in some cases it may be useful to keep results for each search term distinct using lapply():

    search3a <- c("VERB across",
                  "VERB through", 
                  "VERB out", 
                  "VERB down")
    vb_prep <- lapply(seq_along(search3a), function(y) {
        corpuslingr::clr_search_gramx(corp=slate, 
                                      search=search3a[y])%>%
        corpuslingr::clr_get_freq(agg_var=c("lemma"),
                                  toupper =TRUE) %>%
        mutate(search = search3a[y])
        }) %>%  
      bind_rows()
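
    Since the tidyverse is already loaded, the same computation can be written with purrr (a sketch, equivalent to the lapply() version above):

    vb_prep <- purrr::map_dfr(search3a, function(y) {
        corpuslingr::clr_search_gramx(corp=slate, search=y) %>%
          corpuslingr::clr_get_freq(agg_var=c("lemma"), toupper=TRUE) %>%
          mutate(search = y)
      })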

    Summary by search:

    vb_prep %>%
      group_by(search) %>%
      summarize(gramx_freq = sum(txtf), 
                gramx_type = n())


    Top 10 instantiations of each search pattern by search term:
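
    One way to pull that summary from vb_prep (a sketch; note that top_n() retains ties, so a group may exceed ten rows):

    vb_prep %>%
      group_by(search) %>%
      top_n(n=10, wt=txtf) %>%
      arrange(search, desc(txtf)) %>%
      ungroup()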

    KWIC & BOW

    KEYWORD IN CONTEXT

    clr_context_kwic() is a super simple function that rebuilds search contexts from the output of clr_search_context(). The include parameter allows the user to add details about the search pattern, e.g. part-of-speech, to the table. It works nicely with DT tables.

    • VERB NOUNPHRASE into VERBING, e.g. “trick someone into believing”
    search4 <- "VERB NPHR into VBG"  
    slate %>%
      corpuslingr::clr_search_context(search=search4, 
                                      LW=10, 
                                      RW = 10) %>%
      corpuslingr::clr_context_kwic(include=c('doc_id')) %>%
      DT::datatable(class = 'cell-border stripe', 
                    rownames = FALSE,width="100%", 
                    escape=FALSE) %>%
      DT::formatStyle(c(1:2), fontSize = '85%')

    BAG OF WORDS

    The clr_context_bow() function returns a co-occurrence vector for each search term based on the context window size specified in clr_search_context(). Again, how features are counted can be specified using the agg_var parameter. Additionally, features included in the vector can be filtered to content words using the content_only parameter.

    • Multiple search terms
    search5 <- c("Clinton", "Lewinsky", 
                 "Bradley", "McCain", 
                 "Milosevic", "Starr",  
                 "Microsoft", "Congress", 
                 "China", "Russia")

    Here we search for some prominent players of the late 90s (when articles in the cdr_slate_ann corpus were published), and plot the most frequent co-occurring features of each search term.

    co_occur <- slate %>%
      corpuslingr::clr_search_context(search=search5, 
                                      LW=15, 
                                      RW = 15)%>%
      corpuslingr::clr_context_bow(content_only=TRUE,
                                   agg_var=c('searchLemma','lemma','pos'))

    Plotting facets in ggplot2 is problematic when the same category labels recur across facets. We add a couple of hacks to address this.

    library(ggthemes) # provides theme_fivethirtyeight() & scale_fill_stata()

    co_occur %>%
      filter(pos=="NOUN")%>%
      arrange(searchLemma,cofreq)%>%
      group_by(searchLemma)%>%
      top_n(n=10,wt=jitter(cofreq))%>%
      ungroup()%>%
      #Hack1 to sort order within facet
      mutate(order = row_number(), 
             lemma=factor(paste(order,lemma,sep="_"), 
                          levels = paste(order, lemma, sep = "_")))%>%
      ggplot(aes(x=lemma, 
                 y=cofreq, 
                 fill=searchLemma)) + 
        geom_col(show.legend = FALSE) +  
        facet_wrap(~searchLemma, scales = "free_y", ncol = 2) +
      #Hack2 to modify labels
        scale_x_discrete(labels = function(x) gsub("^.*_", "", x))+
        theme_fivethirtyeight()+ 
        scale_fill_stata() +
        theme(plot.title = element_text(size=13))+ 
        coord_flip()+
        labs(title="Co-occurrence frequencies for some late 20th century players")

    Summary and Shiny

    So, that was a quick demo of some corpuslingr functions for annotated corpus search & summary of complex lexical-grammatical patterns in context.

    I have built a Shiny app to search/explore the Slate Magazine corpus available here. Code for building the app is available here. Swapping out the Slate corpus for a personal one should be fairly straightforward, with the caveat that the annotated corpus needs to be set/“tuple-ized” using the clr_set_corpus function from corpuslingr.
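
    A minimal sketch of that swap, where my_annotated_corpus is a hypothetical stand-in for your own annotated corpus (in the same format as cdr_slate_ann):

    my_corpus <- my_annotated_corpus %>% # hypothetical personal corpus
      corpuslingr::clr_set_corpus()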
