Psychological and geographical distance in text


This post considers a super-clever study presented in Snefjella and Kuperman (2015), in which the authors investigate the relationship between psychological distance and geographical distance using geolocated tweets. General idea/hypothesis:

The more we perceive an event/entity as (geographically) proximal to self, the more concrete our language when referencing said event/entity; the more we perceive an event/entity as (geographically) distant from self, the more abstract our language when referencing said event/entity. In other words, perceived distance is reflected in the language that we use, and in a graded way.

In support of this hypothesis, the authors demonstrate that tweets referencing a location become more abstract (i.e., less concrete) as the distance between the tweeter’s location and the referenced location increases. In this post, then, we perform a similar (yet decidedly less rigorous) analysis using the Slate Magazine corpus (ca. 1996–2000, 1K texts, 1M words) from the corpusdatr package.

Slate Magazine predominantly covers American politics and is headquartered in Washington, D.C. So, instead of the distance between the tweeter’s location and the referenced location in a tweet, we consider the distance between Washington, D.C. and the referenced location in the corpus; instead of the abstractness of language in a tweet, we consider the abstractness of language in the context surrounding the referenced location in the corpus. Imperfect, but sufficient to demonstrate some methodologies.

library(tidyverse) 
library(sf)
library(spacyr)
library(corpuslingr) #devtools::install_github("jaytimm/corpuslingr")
library(corpusdatr) #devtools::install_github("jaytimm/corpusdatr")
library(lexvarsdatr) #devtools::install_github("jaytimm/lexvarsdatr")
library(knitr)

Concreteness ratings and the lexvarsdatr package

Snefjella and Kuperman (2015) score the abstractness of tweets in their study using concreteness ratings for 40,000 English words derived in Brysbaert, Warriner, and Kuperman (2014). The ratings are made available as supplemental material to that paper, and are included in a data package I have developed called lexvarsdatr. A more complete description of the package and its contents is available here.

Per Brysbaert, Warriner, and Kuperman (2014), concreteness ratings range from 1 (abstract) to 5 (concrete); ratings reflect averages based on 30 participants. In the lexvarsdatr package, concreteness ratings (along with age-of-acquisition ratings and response times in lexical decision) are housed in the lvdr_behav_data table.
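To get a feel for the scale before building anything, we can peek at a couple of ratings directly. A minimal sketch: I am assuming the two words queried here are among the 40,000 norms, and that word forms in the table are lowercase (per the toupper() call below).

lvdr_behav_data %>%
  select(Word, concRating) %>%
  filter(Word %in% c('banana', 'justice')) # concrete vs. abstract (assumed entries)

Below, we pull out just the word forms and their concreteness ratings, and set word forms to uppercase: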

concreteness <- lvdr_behav_data%>%
  select(Word, concRating)%>%
  mutate(Word = toupper(Word))%>% # uppercase word forms, to match corpus lemmas downstream
  na.omit() # drop words without concreteness ratings

Some random examples of concreteness ratings from the dataset:

set.seed(111)
concreteness%>%
  sample_n(30)%>%
  mutate(rank=rank(concRating))%>%
  ggplot(aes(x=concRating, y=rank)) +
  geom_text(aes(label=toupper(Word)), 
            size=2.5, 
            check_overlap = TRUE,
            hjust = "inward")

The distribution of concreteness ratings for the 40,000 word forms:

ggplot(concreteness, aes(concRating)) +
  geom_histogram(binwidth = 0.1)

Context & concreteness scores

So, the goal here is to extract all corpus references to location, along with their surrounding contexts, and to score each context based on the concreteness of its constituent features (akin to dictionary-based sentiment analysis, for example). Here, a context is defined as the 15 words on either side of a given reference to location.
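
Before running the real pipeline, here is a minimal sketch of the scoring logic on a made-up five-word context (ignoring, for simplicity, the content-word filtering applied below):

# Toy context; the words are invented for illustration.
toy_context <- tibble(Word = c('THE', 'OLD', 'BRIDGE', 'NEAR', 'TOWN'))

toy_context %>%
  left_join(concreteness, by = 'Word') %>%
  replace_na(list(concRating = 0)) %>% # unrated words contribute zero
  summarise(ave_conc = round(mean(concRating), 2))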

As the annotated Slate Magazine corpus contains named entity tags, including geopolitical entities (GPEs), identifying references to location has already been taken care of. Here we collapse multi-word entities to single tokens, and ready the corpus for search using clr_set_corpus.

slate <- corpusdatr::cdr_slate_ann %>%
  spacyr::entity_consolidate() %>%
  corpuslingr::clr_set_corpus(ent_as_tag=TRUE)

Per a previous post, GPEs in the cdr_slate_ann corpus have been geocoded and are included in the corpusdatr package as an sf geometry, cdr_slate_gpe. GPEs in cdr_slate_gpe are limited to those occurring in at least 1% of texts; “USA” and its synonyms have been excluded as well.
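
For reference, a quick peek at the first few rows of the geometry (a sketch; I am assuming only the columns used below, lemma and txtf, plus the geometry itself):

corpusdatr::cdr_slate_gpe %>% head()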

Using the search syntax described here, we translate the GPEs to a searchable form, and extract all GPE contexts using the clr_search_context function from my corpuslingr package.

gpe_search <- corpusdatr::cdr_slate_gpe %>%
  top_n(100,txtf)%>% # the 100 most frequent GPEs
  mutate(lemma=paste0(lemma,'~GPE')) # searchable form: lemma~TAG

gpe_contexts <- corpuslingr::clr_search_context(
  search = gpe_search$lemma, 
  corp = slate, 
  LW=15, RW=15) # 15-word windows, left & right

Results from clr_search_context include both a BOW object and a KWIC object. Here, the former is aggregated by GPE, context, and lemma using the clr_context_bow function; concreteness ratings are then joined (by lemma). Finally, concreteness scores are calculated as the average concreteness rating of all (non-proper/non-entity) words in a given context. Words not included in the normed dataset are assigned a concreteness value of zero.

conc_by_eg <- gpe_contexts %>%
  corpuslingr::clr_context_bow(
    agg_var = c('doc_id','eg','searchLemma','lemma','tag'),
    content_only=TRUE)%>% # content words only (no proper nouns/entities)
  left_join(concreteness,by = c('lemma'='Word'))%>% # join concreteness ratings
  replace_na(list(concRating=0))%>% # unrated words score zero
  group_by(doc_id,eg,searchLemma) %>%
  summarise(n = n(), conc = sum(cofreq*concRating)) %>%
  mutate(ave_conc= round(conc/n,2)) # average concreteness per context

Distribution of concreteness scores for all contexts containing reference to a GPE (n = 5,539):

ggplot(conc_by_eg, aes(ave_conc)) +
  geom_histogram(binwidth = 0.1)

Joining these scores to the KWIC object from the results of our original search, we can investigate how an example set of contexts has been scored.

corpuslingr::clr_context_kwic(gpe_contexts,include=c('doc_id','eg','lemma')) %>%
  left_join(conc_by_eg)%>%
  sample_n(5)%>%
  select(kwic,ave_conc)%>%
  arrange(desc(ave_conc))%>%
  DT::datatable(class = 'cell-border stripe', 
                rownames = FALSE,
                width="450", 
                escape=FALSE)

Lastly, we aggregate contextual concreteness scores to obtain a single concreteness score for each GPE.

conc_by_term <-  conc_by_eg%>%
  group_by(searchLemma) %>%
  summarise(n = sum(n), conc = sum(conc)) %>%
  mutate(ave_conc= round(conc/n,2))

Results are presented in the table below, and can be sorted using column headers.

conc_by_term %>%
   DT::datatable(class = 'cell-border stripe', 
                 rownames = FALSE,
                 width="450", 
                 escape=FALSE)%>%
  DT::formatStyle('ave_conc',
    background = DT::styleColorBar(conc_by_term$ave_conc, 'cornflowerblue'),
    backgroundSize = '80% 70%',
    backgroundRepeat = 'no-repeat',
    backgroundPosition = 'right') 

Geographical distance

Method

The final piece is calculating how far each GPE is from the presumed epicenter of the Slate Magazine corpus, Washington, D.C. So, we first create an sf geometry for the nation’s capital.

dc = st_sfc(st_point(c(-77.0369, 38.9072)),crs=4326) # lon/lat for Washington, D.C.

Then we compute distances between DC and each GPE in our dataset using the st_distance function from the sf package. Distances (in miles) are simple “as the crow flies” approximations.

NOTE: Lat/long coordinates represent the center (or centroid) of a given GPE (e.g., France is represented by the geographical center of the country of France). Note as well that Paris, e.g., is treated as a GPE distinct from France. This could clearly be re-thought.

gpe_dists <- corpusdatr::cdr_slate_gpe %>% 
  mutate(miles_from_dc = round(as.numeric(st_distance(geometry,dc))*0.000621371,0)) # meters to miles
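
As a quick sanity check on the conversion, we can measure a known distance (the New York City coordinates here are my own assumption, not part of the corpus data):

nyc = st_sfc(st_point(c(-74.0060, 40.7128)),crs=4326) # lon/lat for NYC (assumed)
as.numeric(st_distance(dc, nyc))*0.000621371 # roughly 200 miles, as the crow flies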

Distances from DC:

lemma miles_from_dc geometry
AFGHANISTAN 6935 c(67.709953, 33.93911)
ALABAMA 717 c(-86.902298, 32.3182314)
ALASKA 3333 c(-149.4936733, 64.2008413)
ALBANIA 4858 c(20.168331, 41.153332)
ARGENTINA 5388 c(-63.616672, -38.416097)
ARIZONA 1914 c(-111.0937311, 34.0489281)

Plot

Finally, we join the concreteness scores and distances from DC, and plot the former as a function of the latter.

gpe_dists %>%
  inner_join(conc_by_term,by=c('lemma'='searchLemma'))%>%
  ggplot(aes(x=miles_from_dc, y=ave_conc)) + 
  #geom_point(size=.75)+
  geom_smooth(method="loess", se=T)+
  geom_text(aes(label=lemma), 
            size=2.5, 
            check_overlap = TRUE)+
  theme(legend.position = "none")+
  labs(title = "Concreteness scores vs. distance from Washington, D.C.",
       subtitle="By geo-political entity")

Some observations

  • Concreteness scores tend to be higher in the US, and distance from DC seems to have no effect on concreteness scores for GPEs in the US.
  • Crossing the Atlantic into Western Europe, concreteness scores show a marked and graded decrease until ~Eastern Europe/Northern Africa.
  • From the Middle East to locations in Southeast Asia, concreteness scores gradually return to US-like levels (although the plot gets a bit sparse at these distances).

Some very cursory explanations

Collectively, the first two observations could reflect the influence of perceived distance on language use, at least from a “here in the (concrete) USA” versus “over there in (abstract) Europe” perspective. This particular interpretation would recast the epicenter of Slate magazine narrative as the US instead of Washington DC, which probably makes sense.

The up-swing in the use of concrete language in reference to ~Asian GPEs runs counter to theory, but could perhaps be explained by the content of the conversation surrounding some of these GPEs, e.g., the Indonesian occupation of East Timor (ca. late 20th century), in which the (presumably more concrete) language of conflict trumps the effects of perceived distance. Or any number of other interpretations.

FIN

Indeed, some interesting results; ultimately, however, the focus here should be methodology, as our corpus and sample of GPEs are both relatively small. From this perspective, hopefully we have demonstrated the utility of Snefjella and Kuperman’s (2015) cross-disciplinary approach to testing psychological theory using a combination of text, behavioral, and geographical data.

References

Brysbaert, Marc, Amy Beth Warriner, and Victor Kuperman. 2014. “Concreteness Ratings for 40 Thousand Generally Known English Word Lemmas.” Behavior Research Methods 46 (3). Springer: 904–11.

Snefjella, Bryor, and Victor Kuperman. 2015. “Concreteness and Psychological Distance in Natural Language Use.” Psychological Science 26 (9). Sage Publications: 1449–60.
