A Frosty Deal?


Reading news articles on the will-they-won’t-they post-Brexit trade negotiations with the EU, one sees days of optimism jarred by days of gloom. Do negative news articles, when one wants a positive outcome, leave a deeper impression?

I wondered if I could get a more objective view from quantitative analysis of textual data. To do this, I’m going to look at hundreds of articles published in the Guardian newspaper over the course of the year to see how trade-talk sentiment has changed week-to-week.

library(tidyverse)
library(rebus)
library(wesanderson)
library(kableExtra)
library(lubridate)
library(GuardianR)
library(quanteda)
library(scales)
library(tictoc)
library(patchwork)
library(text2vec)
library(topicmodels)
theme_set(theme_bw())

cols <- wes_palette(name = "Chevalier1")

The Withdrawal Agreement between the UK and the European Union was signed on the 24th of January 2020. I’ll import Brexit-related newspaper articles from that date.

The Guardian newspaper asks for requests to span no more than 1 month at a time. So I’ll first create a set of monthly date ranges.

dates_df <- tibble(start_date = seq(ymd("2020-01-24"), today(), by = "1 month")) %>%
  mutate(end_date = start_date + months(1) - 1)

dates_df %>%
  kable()
start_date end_date
2020-01-24 2020-02-23
2020-02-24 2020-03-23
2020-03-24 2020-04-23
2020-04-24 2020-05-23
2020-05-24 2020-06-23
2020-06-24 2020-07-23
2020-07-24 2020-08-23
2020-08-24 2020-09-23

I’ll import the newspaper articles in monthly chunks. Note that access to the Guardian’s API requires a key, which may be requested from the Guardian’s Open Platform.
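The code below assumes the key is held outside the script. One common pattern (my assumption, not something specified in the original post) is to keep it in an environment variable, for example in ~/.Renviron, and read it at run time.

# Read the API key from an environment variable (GUARDIAN_KEY is a hypothetical name)
key <- Sys.getenv("GUARDIAN_KEY")

# Fail early if the key has not been set
stopifnot(nzchar(key))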

tic()

article_df <-
  dates_df %>%
  pmap_dfr(function(start_date, end_date) {
    Sys.sleep(1)
    get_guardian(
      "brexit",
      from.date = start_date,
      to.date = end_date,
      api.key = key
    )
  })

toc()

The data need a little cleaning, for example to drop multi-topic live-blog articles and to remove HTML tags and non-breaking spaces.

trade_df <-
  article_df %>%
  filter(!str_detect(id, "/live/"), sectionId %in% c("world", "politics", "business")) %>% 
  mutate(
    body = str_remove_all(body, "<.*?>") %>% str_to_lower(),
    body = str_remove_all(body, "[^a-z0-9 .-]"),
    body = str_remove_all(body, "nbsp")
  )
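As a quick check that the regular expressions behave as intended, here is a toy string of my own (not drawn from the article data) passed through the same steps.

# Toy example (mine): the same cleaning steps applied to a made-up snippet
"<p>Talks resume in Brussels&nbsp;this week, says No 10.</p>" %>%
  str_remove_all("<.*?>") %>%          # strip html tags
  str_to_lower() %>%                   # lower-case everything
  str_remove_all("[^a-z0-9 .-]") %>%   # keep only letters, digits, spaces, . and -
  str_remove_all("nbsp")               # drop non-breaking-space remnants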

A corpus then gives me a collection of texts in which each document is a newspaper article.

trade_corp <- trade_df %>% 
  corpus(docid_field = "shortUrl", text_field = "body")
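A couple of quanteda helpers give a quick sanity check of the resulting object; this is a small addition of mine rather than part of the original analysis.

# Quick checks on the corpus object
ndoc(trade_corp)               # number of articles in the corpus
head(docnames(trade_corp), 3)  # document ids are the articles' short URLs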

Although I’ve only imported articles mentioning Brexit since the Withdrawal Agreement was signed, some of these articles will not be related to the trade negotiations with the EU; there are, for example, ongoing negotiations with many other countries around the world. So, I’m going to use word embeddings to help narrow the focus to the specific context of the UK-EU trade deal.

The chief negotiator for the EU is Michel Barnier, so I’ll quantitatively identify words in close proximity to “Barnier” in the context of these Brexit news articles.

window <- 5

trade_fcm <- 
  trade_corp %>% 
  tokens() %>%   # fcm() works on tokens
  fcm(context = "window", window = window, count = "weighted", weights = window:1)

glove <- GlobalVectors$new(rank = 60, x_max = 10)

set.seed(42)

wv_main <- glove$fit_transform(trade_fcm, n_iter = 10)
## INFO  [10:06:33.114] epoch 1, loss 0.3817 
## INFO  [10:06:34.959] epoch 2, loss 0.2510 
## INFO  [10:06:36.759] epoch 3, loss 0.2225 
## INFO  [10:06:38.577] epoch 4, loss 0.2021 
## INFO  [10:06:40.438] epoch 5, loss 0.1847 
## INFO  [10:06:42.303] epoch 6, loss 0.1710 
## INFO  [10:06:44.124] epoch 7, loss 0.1605 
## INFO  [10:06:45.936] epoch 8, loss 0.1524 
## INFO  [10:06:47.754] epoch 9, loss 0.1457 
## INFO  [10:06:49.594] epoch 10, loss 0.1403
wv_context <- glove$components
word_vectors <- wv_main + t(wv_context)

search_coord <- 
  word_vectors["barnier", , drop = FALSE]

word_vectors %>% 
  sim2(search_coord, method = "cosine") %>% 
  as_tibble(rownames = NA) %>% 
  rownames_to_column("term") %>% 
  rename(similarity = 2) %>% 
  arrange(desc(similarity)) %>% 
  slice(1:10) %>%
  kable()
term similarity
barnier 1.0000000
negotiator 0.7966461
michel 0.7587372
frost 0.7093119
eus 0.6728152
chief 0.6365480
brussels 0.5856139
negotiators 0.5598537
team 0.5488111
accused 0.5301669

Word embedding is a learned modelling technique that places words in a multi-dimensional vector space such that contextually-similar words lie close together. Not surprisingly, the closest word contextually is "michel". And as he is the EU's chief negotiator, "eus", "chief" and "negotiator" also appear among the most contextually-similar words.
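As a cross-check on these scores, the cosine similarity can also be computed by hand from the same word_vectors matrix; the snippet below is a minimal sketch (assuming "negotiator" is present in the vocabulary) and should reproduce the sim2() value.

# Hand-rolled cosine similarity between "barnier" and "negotiator"
a <- word_vectors["barnier", ]
b <- word_vectors["negotiator", ]
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))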

The word embeddings algorithm, through word co-occurrence, has identified the name of Michel Barnier’s UK counterpart David Frost. So filtering articles for “Barnier”, “Frost” and “UK-EU” should help narrow the focus.

context_df <- 
  trade_df %>% 
  filter(str_detect(body, "barnier|frost|uk-eu")) 

context_corp <- 
  context_df %>% 
  corpus(docid_field = "shortUrl", text_field = "body")

I can then use quanteda's kwic function to review keywords in context and make sure I'm homing in on the texts I want. Short URLs are included below so I can click through to any article as presented by the Guardian.

set.seed(123)

context_corp %>%
  tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE
  ) %>%
  kwic(pattern = phrase(c("trade negotiation", "trade deal", "trade talks")), 
       valuetype = "regex", window = 7) %>%
  as_tibble() %>%
  left_join(article_df, by = c("docname" = "shortUrl")) %>% 
  slice_sample(n = 10) %>% 
  select(docname, pre, keyword, post, headline) %>%
  kable()
docname pre keyword post headline
https://gu.com/p/ee3qc ecj unless we have such a thin trade deal that its not worth the paper its Brexit: Boris Johnson faces Eurotunnel test
https://gu.com/p/end82 london a separate process to the troubled trade talks that got under way in london on Irish MEP in line for EU finance role vacated due to lockdown scandal
https://gu.com/p/ezjdz said the downsides with the eu free trade deal the us free trade deal and our Brexit bill hugely damaging to UK’s reputation, says ex-ambassador
https://gu.com/p/d7d9t people we have who have been negotiating trade deals forever she said while people criticise the Brexit trade talks: EU to back Spain over Gibraltar claims
https://gu.com/p/eyzhq played down the prospect of reaching a trade deal with the eu in time for december No 10 blames EU and plays down prospects of Brexit trade deal
https://gu.com/p/ez2v6 it will make it harder to strike trade deals going forward he told channel news after Brexit: UK negotiators ‘believe brinkmanship will reboot trade talks’
https://gu.com/p/d7n4t alignment with eu rules in any brexit trade deal while brussels threatened to put tariffs on Pound falls as Boris Johnson takes tough line on EU trade deal
https://gu.com/p/dnvbj personal rapport when communicating remotely related post-brexit trade talks with eu on course to fail johnson Fears Brexit talks could collapse in June but UK still optimistic
https://gu.com/p/d94j9 this situation and we work on a trade deal with them of course the united kingdom Ursula von der Leyen mocks Boris Johnson’s stance on EU trade deal
https://gu.com/p/ezkxc it threatens to damage british prospects of trade deals with the us and eu it puts Tuesday briefing: Rancour as law-breaking bill goes forward

Quanteda provides a sentiment dictionary, the Lexicoder Sentiment Dictionary (data_dictionary_LSD2015), which, in addition to identifying positive and negative words, also catches negated negatives and negated positives such as, for example, "not effective". For each week's worth of articles, I'll calculate the proportion of positive sentiment.
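Before scoring the full corpus, a toy sentence of my own shows how the four dictionary keys come into play; whether "not effective" lands in the neg_positive category depends on the dictionary's own entries, so treat the resulting counts as indicative.

# Toy example (mine): how the LSD2015 dictionary buckets a short sentence
tokens("The talks were not effective, but a deal is still possible") %>%
  tokens_lookup(dictionary = data_dictionary_LSD2015) %>%
  dfm()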

tic()

sent_df <- 
  context_corp %>% 
  tokens() %>% 
  tokens_lookup(dictionary = data_dictionary_LSD2015) %>% 
  dfm() %>% 
  convert(to = "data.frame") %>% 
  as_tibble() %>%
  left_join(context_df, by = c("doc_id" = "shortUrl")) %>% 
  mutate(
    date = ceiling_date(as_date(webPublicationDate), "week"),
    pct_pos = (positive + neg_negative) / (positive + neg_negative + negative + neg_positive)
  )

sent_df %>% 
  select(doc_id, starts_with("pos"), starts_with("neg")) %>% 
  slice(1:10) %>% 
  kable()
doc_id positive negative neg_positive neg_negative
https://gu.com/p/d6qhb 40 22 0 0
https://gu.com/p/d9e9j 27 15 0 0
https://gu.com/p/d6kzd 51 27 0 1
https://gu.com/p/d6bt2 37 7 0 0
https://gu.com/p/d9vjq 13 23 0 0
https://gu.com/p/d7n8b 57 34 1 0
https://gu.com/p/d79cn 56 48 3 1
https://gu.com/p/d6t3c 28 26 0 0
https://gu.com/p/d9xtf 33 13 1 0
https://gu.com/p/d696t 15 21 1 0
summary_df <- sent_df %>% 
  group_by(date) %>% 
  summarise(pct_pos = mean(pct_pos), n = n())

toc()
## 0.708 sec elapsed

Plotting the changing proportion of positive sentiment over time did surprise me a little. The outcome was more balanced than I expected, which perhaps confirms the deeper impression negative articles leave on me.

The upper panel shows the weekly distribution of article sentiment as violin plots, with the dashed line tracing the weekly average. Individually the articles range from 20% to 80% positive, with discernible periods of relatively negative and relatively positive sentiment.

The lower plot shows the volume of articles. As we draw closer to the crunch-point the volume appears to be picking up.

p1 <- sent_df %>% 
  ggplot(aes(date, pct_pos)) +
  geom_violin(aes(group = date), alpha = 0.5, fill = cols[1]) +
  geom_line(data = summary_df, aes(date, pct_pos), colour = cols[1], linetype = "dashed") +
  geom_hline(yintercept = 0.5, linetype = "dotted", colour = cols[4]) +
  scale_y_continuous(labels = percent_format(), limits = c(0.2, 0.8)) +
  labs(title = "Changing Sentiment Towards a UK-EU Trade Deal",
       subtitle = "Week-to-week Since the Withdrawal Agreement",
       x = NULL, y = "Positive Sentiment")

p2 <- summary_df %>% 
  ggplot(aes(date, n)) +
  geom_line(colour = cols[1]) +
  labs(x = "Weeks", y = "Article Count",
       caption = "Source: Guardian Newspaper")

p1 / p2 + 
  plot_layout(heights = c(2, 1))

Some writers exhibit more sentiment variation than others.

byline_df <- 
  sent_df %>% 
  mutate(byline = word(byline, 1, 2) %>% str_remove_all(PUNCT)) %>% 
  group_by(byline, date) %>% 
  summarise(pct_pos = mean(pct_pos), n = n())

top_3 <- byline_df %>% 
  count(byline, sort = TRUE) %>% 
  ungroup() %>% 
  filter(!is.na(byline)) %>% 
  slice(c(3, 2)) %>% 
  pull(byline)

byline_df %>% 
  filter(byline %in% top_3) %>% 
  ggplot(aes(date, pct_pos, colour = byline)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dotted", colour = cols[2]) +
  scale_y_continuous(labels = percent_format(), limits = c(0.2, 0.8)) +
  scale_colour_manual(values = cols[c(1, 4)]) +
  labs(title = "Changing Sentiment Towards a UK-EU Trade Deal",
       subtitle = "Week-to-week Since the Withdrawal Agreement",
       x = "Weeks", y = "Positive Sentiment", colour = "Byline", 
       caption = "Source: Guardian Newspaper")

R Toolbox

The table below summarises the packages and functions used in this post; it feeds a separate toolbox visualisation of package and function usage across all posts. A rough sketch of one way to scrape such counts follows the table.

Package Function
base library[12]; c[8]; function[2]; mean[2]; set.seed[2]; conflicts[1]; cumsum[1]; is.na[1]; months[1]; search[1]; seq[1]; sum[1]; Sys.sleep[1]
dplyr filter[8]; mutate[8]; as_tibble[4]; group_by[3]; if_else[3]; n[3]; select[3]; slice[3]; summarise[3]; tibble[3]; arrange[2]; desc[2]; left_join[2]; starts_with[2]; count[1]; pull[1]; rename[1]; slice_sample[1]; ungroup[1]
ggplot2 aes[5]; geom_line[3]; ggplot[3]; labs[3]; geom_hline[2]; scale_y_continuous[2]; geom_violin[1]; scale_colour_manual[1]; theme_bw[1]; theme_set[1]
GuardianR get_guardian[1]
kableExtra kable[5]
lubridate date[3]; as_date[1]; ceiling_date[1]; today[1]; ymd[1]
patchwork plot_layout[1]
purrr map[1]; map2_dfr[1]; pmap_dfr[1]; possibly[1]; set_names[1]
quanteda corpus[2]; data_dictionary_LSD2015[1]; dfm[1]; fcm[1]; kwic[1]; phrase[1]; t[1]; tokens[1]
readr read_lines[1]
rebus literal[4]; lookahead[3]; whole_word[2]; ALPHA[1]; lookbehind[1]; one_or_more[1]; or[1]; PUNCT[1]
scales percent_format[2]
stringr str_detect[5]; str_remove_all[5]; str_c[2]; str_remove[2]; str_count[1]; str_to_lower[1]; word[1]
text2vec sim2[1]
tibble enframe[1]; rownames_to_column[1]
tictoc tic[2]; toc[2]
tidyr unnest[1]
wesanderson wes_palette[1]
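As an aside, usage counts like these can be scraped from a post's source with base R's parser. The sketch below is my own illustration (the file name is a placeholder) and simply tallies function calls without attributing them to packages.

# Count function calls in a script via its parse tree ("post.R" is a placeholder path)
pd <- getParseData(parse("post.R", keep.source = TRUE))
calls <- pd$text[pd$token == "SYMBOL_FUNCTION_CALL"]
sort(table(calls), decreasing = TRUE)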
