[This article was first published on Jason Timm, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post outlines a simple framework for identifying and extracting keyphrases from component texts of a corpus. We first consider some functional characteristics of descriptive keyphrases, as well as some more formal (ie, regex-based) definitions.

We then demonstrate the use of corpuslingr for identifying potential keyphrases in an annotated corpus, and present an unsupervised (and well-established) methodology (tf-idf weights) for extracting descriptive keyphrases for each text.

The Slate Magazine corpus from the corpusdatr package is used here for demo purposes.

library(corpusdatr) #devtools::install_github("jaytimm/corpusdatr")
library(corpuslingr) #devtools::install_github("jaytimm/corpuslingr")
library(tidyverse)
library(DT)

## Defining potential keyphrases

Ideally, keyphrases are semantically rich noun phrases that shed light on the content of a particular text. For illustrative purposes, three noun phrases of varying degrees of complexity and informativeness are presented below:

• flowers
• pretty flowers
• pretty flowers in suburban Poughkeepsie

The first is comprised solely of a plural noun form; the second is comprised of a noun form modified by an adjective; the third is comprised of a noun phrase modified by a prepositional phrase (which contains another noun phrase). By virtue of specifying both the type and “where” of flower, the latter would seem to have the most descriptive utility to someone perusing content (or some algorithm classifying texts) via keyphrases.

From a regex perspective, then, we want to create a search template that is as greedy as possible when it comes to noun phrase constituency; in other words, while we will settle for unmodified noun forms as keyphrases, we prefer highly modified ones. And we are not interested in pronominal forms.

So, we define a noun phrase as “zero or more adjectives followed by one or more nouns” and define potential keyphrases as follows

• Noun phrase + ( preposition + Noun phrase )

where the prepositional phrase is optional. This schema maps generically to the regex below (per Penn Treebank POS codes):

nounPhrase <- "(JJ[A-Z]* )*(NN[A-Z]* )+"
prepPhrase <- paste0("((IN )",nounPhrase,")?")
keyPhrase <- paste0(nounPhrase,prepPhrase)
keyPhrase
## [1] "(JJ[A-Z]* )*(NN[A-Z]* )+((IN )(JJ[A-Z]* )*(NN[A-Z]* )+)?"

Using the simplifying CQL made available via corpuslingr, the above regex is written as:

keyPhrase <- "(ADJ )*(NOUNX )+((PREP )(ADJ )*(NOUNX )+)?"

## Corpus search for potential keyphrases

Per this definition, the next step is to search the Slate magazine corpus for potential keyphrases. So, we first set the corpus for search (within the corpuslingr framework) using the clr_set_corpus function:

slate <- corpusdatr::cdr_slate_ann %>%
corpuslingr::clr_set_corpus()

Then we use the corpuslingr::clr_search_gramx function to extract all potential keyphrases from each text in the corpus:

kps <- corpuslingr::clr_search_gramx(search=keyPhrase, corp=slate)

Example output returned by clr_search_gramx:

##    doc_id              token      tag            lemma
## 1:      1             rulers      NNS            ruler
## 2:      1              world       NN            world
## 3:      1 populous countries   JJ NNS populous country
## 4:      1      hold on power NN IN NN    hold on power
## 5:      1          Indonesia      NNP        Indonesia
## 6:      1            Suharto      NNP          Suharto

The plot below illustrates the top fifteen instantiations of our keyphrase regex search in the Slate Magazine corpus. While the top two instantiations are unmodified noun phrases, multi-word noun phrases constitute a sizable portion of potential keyphrases as well.

kps%>%
corpuslingr::clr_get_freq(agg_var = 'tag')%>%
top_n(15,txtf)%>%
ggplot(aes(x=reorder(tag, txtf), y=txtf)) +
geom_col(width=.65, fill="cyan4") +
coord_flip()+
theme_bw() +
labs(title = "Top 15 keyphrases by pattern type")

## Selecting descriptive keyphrases with the tf-idf statisitic

The term frequency - inverse document frequency (tf-idf) statistic is a super simple, unsupervised approach to keyphrase extraction. As a metric it is meant to capture (or weigh) how frequent a given phrase is in a text (ie, text frequency) relative to how dispersed the phrase is across documents comprising the corpus (ie, document frequency).

Phrases occurring more frequently in a given text than we would expect based on their document frequency receive higher weights; theoretically, such phrases shed light on the content of a given text (relative to the content of the corpus as a whole).

The tf-idf weight for a given keyphrase in a given document, then, is calculated as the product of token frequency, tf, and inverse document frequency, idf, where the latter is logarithmically transformed.

Here we compute text frequency and document frequency for each keyphrase, and join metadata from cdr_slate_meta, which includes texts titles and descriptives. Frequencies are aggregated by keyphrase lemma constituents.

kpsAgg <- kps %>%
group_by(lemma)%>%
mutate(docf=length(unique(doc_id)))%>%
group_by(doc_id,lemma,docf)%>%
summarize(txtf=n())%>%
left_join(corpusdatr::cdr_slate_meta)%>%
mutate(docsInCorpus=length(cdr_slate_ann))%>%
select(doc_id,title,lemma,txtf,textLength,docf,docsInCorpus)

Based on these two (relative) frequencies, we compute td-idf values for each keyphrase in each document, collapse the top five phrases into a single cell, and wrap the table up with some DT.

kpsAgg%>%
mutate(td_idf = (txtf/textLength)*log(docsInCorpus/(docf+1)))%>%
mutate(td_idf=jitter(td_idf))%>%
group_by(doc_id,title)%>%
top_n(5,td_idf)%>%
arrange(doc_id,desc(td_idf))%>%
group_by(doc_id,title)%>%
summarise(key_phrases = paste(lemma, collapse=" | "))%>%
arrange(as.numeric(doc_id))%>%
DT::datatable(class = 'cell-border stripe',
rownames = FALSE,
width="450",
escape=FALSE)