place from text: geography & distributional semantics
In this post, we demonstrate some different methodologies for exploring the geographical information found in text. First, we address some of the practical issues of extracting places/place-names from an annotated corpus, and demonstrate how to (1) map their geospatial distribution via geocoding and (2) append additional geographic detail to these locations via spatial joins.
We then consider how these locations “map” in semantic space by comparing context-based word embeddings for each location. Ultimately, the endgame is to investigate the extent to which geospatial proximity is reflected (or not) in distributional similarity in a corpus. In the process, we demonstrate some methods for getting from lexical co-occurrence to a 2D semantic map via latent semantic analysis (LSA) and classical multi-dimensional scaling (MDS).
library(tidyverse)
library(ggthemes)
library(corpuslingr)  # devtools::install_github("jaytimm/corpuslingr")
library(corpusdatr)   # devtools::install_github("jaytimm/corpusdatr")
library(knitr)
From text to map
Slate corpus & geopolitical entities
For demo purposes, we use the annotated Slate magazine corpus made available as cdr_slate_ann via the corpusdatr package. The content of the articles comprising the corpus is largely political in nature, so there is plenty of reference to place and location, namely foreign and domestic political entities. The first task, then, is to get a roll call of the geopolitical entities included in the corpus.
The Slate Magazine corpus has been annotated using the spacyr package, and contains named entity tags, including geopolitical entities (GPEs). Here we collapse multi-word entities (eg, "New" "York") to single tokens (eg, "New_York"), and ready the corpus for search:
slate <- corpusdatr::cdr_slate_ann %>%
  spacyr::entity_consolidate() %>%
  corpuslingr::clr_set_corpus(ent_as_tag = TRUE)
Next, we obtain text and document frequencies for GPEs included in the corpus, and filter to only those occurring in at least 1% of the articles comprising the corpus.
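The distinction between the two frequency measures can be sketched in base R with a made-up mini-corpus (the toy tokens below are invented for illustration; the real work is done by corpuslingr::clr_get_freq):

```r
# Text frequency (txtf): total mentions across the corpus.
# Document frequency (docf): number of distinct documents containing the lemma.
toks <- data.frame(doc_id = c(1, 1, 2, 3),
                   lemma  = c("RUSSIA", "RUSSIA", "RUSSIA", "CHINA"))

txtf <- table(toks$lemma)
docf <- tapply(toks$doc_id, toks$lemma, function(x) length(unique(x)))

txtf[["RUSSIA"]]  # 3 mentions in total
docf[["RUSSIA"]]  # spread across 2 documents
```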
slate_gpe <- slate %>%
  bind_rows() %>%
  filter(tag == 'NNGPE') %>%
  corpuslingr::clr_get_freq(agg_var = 'lemma', toupper = TRUE) %>%
  filter(txtf > 9 &
         !grepl('US|USA|AMERICA|UNITED_STATES|THE_UNITED_STATES|U.S.|U.S.A', lemma))
The most frequently referenced GPEs in the Slate corpus (not including the US):
To visualize the geographical distribution of GPEs in the Slate Magazine corpus, we use the geocode function from the ggmap package to transform our corpus locations to lat/lon coordinates that can be mapped. While ggmap works best with proper addresses (eg, street, city, zip, etc), country and city names can be geolocated as well.
Note that while GPEs are geographical areas, this method approximates GPE location as a single point in lat/long space at the center (or centroid) of these areas. For our purposes here, this approximation is fine.
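For a simple region, that centroid is just the average position of its boundary; a minimal base-R sketch with a made-up rectangular "country":

```r
# A 4 x 2 rectangular region, represented by its four corners.
# Its centroid (the single point standing in for the whole area)
# is simply the mean of the corner coordinates.
corners  <- rbind(c(0, 0), c(4, 0), c(4, 2), c(0, 2))
centroid <- colMeans(corners)
centroid  # x = 2, y = 1
```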
The following pipe geocodes the GPEs, removes GPEs that Google Maps cannot geocode, and transforms the new dataframe with lat/lon coordinates into an sf spatial object. The last step enables convenient mapping/geospatial processing within the sf framework.
library(ggmap)
library(sf)

slate_gpe_geo <- ggmap::geocode(slate_gpe$lemma,
                                output = c("latlon"),
                                messaging = FALSE) %>%
  bind_cols(slate_gpe) %>%
  filter(complete.cases(.)) %>%
  sf::st_as_sf(coords = c("lon", "lat"), crs = 4326)
We then map the geolocated GPEs using the leaflet package; circle radius reflects frequency of occurrence in the Slate corpus.
library(leaflet)
library(widgetframe)

x <- slate_gpe_geo %>%
  leaflet(width = "450") %>%
  setView(lng = -5, lat = 31, zoom = 2) %>%
  addProviderTiles("CartoDB.Positron",
                   options = providerTileOptions(minZoom = 2, maxZoom = 4)) %>%
  addCircleMarkers(radius = ~txtf/25,
                   stroke = FALSE,
                   fillOpacity = .75,
                   label = ~lemma)

frameWidget(x)
The spData package conveniently makes available a variety of shapefiles/geopolitical polygons as sf objects, including a world country map. Having geocoded the GPEs, we can add features from this country map (eg, country, subregion, continent) to our GPE points via a spatial join. We use the st_join function from the sf package to accomplish this task.
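The join semantics can be sketched with a toy example (invented geometries, not the real corpus data): a point falling inside a polygon inherits that polygon's attributes, while a point outside gets NA.

```r
library(sf)

# Two toy points: one inside a unit square, one far outside it
pts <- sf::st_as_sf(data.frame(id = 1:2, x = c(0.5, 5), y = c(0.5, 5)),
                    coords = c("x", "y"))

# A single square polygon carrying a "region" attribute
poly <- sf::st_sf(region = "A",
                  geometry = sf::st_sfc(sf::st_polygon(list(
                    rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))))

# Spatial join: point 1 picks up region "A"; point 2 intersects nothing, so NA
sf::st_join(pts, poly)
```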
library(spData)

slate_gpe_details <- sf::st_join(slate_gpe_geo, spData::world)
Per the spatial join, we now have information regarding country, continent, and subregion for each GPE from the Slate Magazine corpus.
| | lemma | country | continent | subregion |
|---|---|---|---|---|
| 5 | ARGENTINA | Argentina | South America | South America |
| 6 | ARIZONA | United States | North America | Northern America |
| 7 | ARKANSAS | United States | North America | Northern America |
| 8 | ARLINGTON | United States | North America | Northern America |
We can use this information, for example, to aggregate GPE text and document frequencies to the subregion level:
slate_gpe_details %>%
  data.frame() %>%
  group_by(subregion) %>%
  summarize(txtf = sum(txtf), docf = sum(docf)) %>%
  filter(subregion != 'Northern America') %>%
  ggplot(aes(x = docf, y = txtf)) +
  geom_text(aes(label = toupper(subregion)),
            size = 3, check_overlap = TRUE, hjust = "inward") +
  labs(title = "Document vs. text frequency for GPEs outside of Northern America",
       subtitle = "By Subregion")
Corpus search and context
So, our next task is to map the GPEs in 2D (semantic) space by comparing context-based word embeddings for each location. What does a map derived from patterns of lexical co-occurrence in text look like?
The first step in accomplishing this task is to search the Slate Magazine corpus for GPEs in context. For each occurrence of each GPE in the corpus, the token and its surrounding context are extracted using the corpuslingr::clr_search_context function. Here, context is defined as the 15x15 window of words surrounding a given token of a GPE. We limit our search to the 100 most frequent GPEs.
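The windowing itself amounts to simple indexing; a minimal base-R sketch with placeholder tokens (the real extraction is handled by clr_search_context):

```r
# Placeholder token stream with a single GPE hit in the middle
tokens <- c(rep("w", 20), "Russia", rep("w", 20))

i  <- which(tokens == "Russia")  # position of the target token
LW <- 15                         # words of left context
RW <- 15                         # words of right context

# Clamp the window at the document boundaries
context <- tokens[max(1, i - LW):min(length(tokens), i + RW)]
length(context)  # 31: the target plus 15 words on either side
```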
gpe_search <- data.frame(slate_gpe_geo) %>%
  arrange(desc(txtf)) %>%
  slice(1:100) %>%
  mutate(lemma = paste0(lemma, '~GPE'))

gpe_contexts <- corpuslingr::clr_search_context(search = gpe_search$lemma,
                                                corp = slate,
                                                LW = 15, RW = 15)
A small random sample of the search results is presented below in context. The clr_context_kwic function quickly rebuilds the original user-specified search context, with the search term highlighted.
gpe_contexts %>%
  corpuslingr::clr_context_kwic(include = c('doc_id')) %>%
  sample_n(5) %>%
  DT::datatable(class = 'cell-border stripe',
                rownames = FALSE, width = "450", escape = FALSE)
LSA, MDS, and semantic space
Having extracted all contexts from the corpus, we can now build a GPE-feature matrix (ie, word embeddings by GPE) by applying the clr_context_bow function to the output of clr_search_context. We limit our definition of features to content words only, and aggregate feature frequencies by lemma.
term_feat_mat <- gpe_contexts %>%
  corpuslingr::clr_context_bow(agg_var = c('searchLemma', 'lemma'),
                               content_only = TRUE) %>%
  spread(searchLemma, cofreq) %>%
  replace(is.na(.), 0)
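The reshape from long co-occurrence counts to a wide term-feature matrix can be sketched in base R with made-up counts (the GPEs and features below are invented for illustration):

```r
# Long form: one row per (GPE, feature) co-occurrence count
cooc <- data.frame(searchLemma = c("RUSSIA", "RUSSIA", "CHINA"),
                   lemma       = c("war", "oil", "war"),
                   cofreq      = c(5, 2, 3))

# Wide form: features as rows, GPEs as columns, missing cells filled with 0
mat <- xtabs(cofreq ~ lemma + searchLemma, data = cooc)
mat
#        searchLemma
# lemma   CHINA RUSSIA
#   oil       0      2
#   war       3      5
```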
Some of the matrix:
Next, we create a cosine-based similarity matrix using the lsa::cosine function, which computes cosine measures between all GPE vectors of the term-feature matrix. The higher the cosine measure between two vectors, the greater their similarity in composition.

library(lsa)

sim_mat <- term_feat_mat %>%
  select(2:ncol(term_feat_mat)) %>%
  data.matrix() %>%
  lsa::cosine(.)

The top-left portion of this matrix is presented below:
##             AFGHANISTAN   ALABAMA    ALASKA
## AFGHANISTAN   1.0000000 0.1663644 0.1089837
## ALABAMA       0.1663644 1.0000000 0.1570805
## ALASKA        0.1089837 0.1570805 1.0000000
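The cosine measure itself is easy to compute by hand; a minimal sketch with made-up co-occurrence vectors (not the real corpus counts):

```r
# Cosine similarity: dot product of two vectors, normalized by their lengths
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Two toy 4-feature co-occurrence vectors
afghanistan <- c(3, 0, 1, 2)
alabama     <- c(1, 2, 0, 1)

round(cosine_sim(afghanistan, alabama), 3)  # 0.546
```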
The last step is to transform the similarities between co-occurrence vectors into two-dimensional space, such that context-based (ie, semantic) similarity is reflected in spatial proximity.
To accomplish this task, we apply classical scaling to the similarity matrix using the base R function cmdscale. Two-dimensional coordinates are then extracted from the points element of the cmdscale output. We join the slate_gpe_details object to this output in order to color GPEs in the plot by continent.
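The intuition behind cmdscale can be sketched with a toy dissimilarity matrix (invented values; the real input is 1 minus the cosine matrix): items with small dissimilarities should land close together in the 2D solution.

```r
# Three toy items: 1 and 2 are very similar (dissimilarity 0.1),
# while item 3 is distant from both (dissimilarity 0.9)
d <- matrix(c(0.0, 0.1, 0.9,
              0.1, 0.0, 0.9,
              0.9, 0.9, 0.0), nrow = 3)

fit <- cmdscale(d, k = 2)  # 2D coordinates, one row per item

# The recovered configuration preserves the input distances:
# items 1 and 2 sit close together, item 3 sits far away
dist(fit)
```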
As the plot demonstrates, we get a fairly good sense of geo-political space (from the perspective of Slate Magazine contributors) by comparing word embeddings derived from a corpus of only 1 million words.
cmdscale(1 - sim_mat, eig = TRUE, k = 2)$points %>%
  data.frame() %>%
  mutate(lemma = rownames(sim_mat)) %>%
  left_join(slate_gpe_details) %>%
  ggplot(aes(X1, X2)) +
  geom_text(aes(label = colnames(sim_mat), col = continent),
            size = 2.5, check_overlap = TRUE) +
  scale_colour_stata() +
  theme_fivethirtyeight() +
  theme(legend.position = "none",
        plot.title = element_text(size = 14)) +
  labs(title = "Slate GPEs in semantic space",
       subtitle = "A two-dimensional solution")
The first dimension (x-axis) seems to do a very nice job capturing a “Domestic - Foreign” distinction, with some obvious exceptions. The second dimension (y-axis) seems to capture a “City - State” distinction, or an “Urban - Non-urban” distinction. Also, there seems to be a “Europe - Non-Europe” element to the second dimension on the “Foreign” side of the plot.
Someone better versed in the geo-political happenings of the waning 20th century could likely provide a more detailed analysis here. Suffice it to say, there is some very intuitive structure to the plot above, derived entirely from co-occurrence vectors. While not exclusively geospatial, as a "map" of the geo-political "lay of the land" it certainly has utility.
So, we have woven together a set of methodologies that are often discussed in different classrooms, and demonstrated some different approaches to extracting and analyzing the geospatial information contained in text. Maps and "maps."