Cyril’s Speeches

[This article was first published on R | datawookie, and kindly contributed to R-bloggers.]

The transcripts for the South African President’s speeches are available here. I’ve just added these data to the {saffer} package.


Let’s take a look.

library(saffer)
library(dplyr)

glimpse(president_speeches)

Rows: 621
Columns: 6
$ date     <date> 2016-01-07, 2016-01-21, 2016-01-23, 2016-02-06, 2016-02-09,…
$ position <chr> "Deputy President", "President", "Deputy President", "Presid…
$ person   <chr> "Cyril Ramaphosa", "Jacob Zuma", "Cyril Ramaphosa", "Jacob Z…
$ language <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", …
$ title    <chr> "Deputy President Cyril Ramaphosa’s Address to the Extra-Ord…
$ text     <chr> "Comrade Chairperson of the SPLM and President of the Republ…

We’ll focus on speeches made in English by Cyril Ramaphosa in his position as President. We’ll also retain only the date and text fields.

ramaphosa <- president_speeches %>%
  # Keep only English speeches made by Ramaphosa as President.
  filter(
    person == "Cyril Ramaphosa",
    position == "President",
    language == "en"
  ) %>%
  select(date, text)

# How many speeches?
nrow(ramaphosa)
[1] 296

We’re going to use the {tidytext} package to perform some simple analyses.


Break the text into tokens.

library(tidytext)

ramaphosa <- ramaphosa %>%
  unnest_tokens(word, text, to_lower = TRUE)
# A tibble: 475,603 x 2
   date       word       
   <date>     <chr>      
 1 2018-02-16 speaker    
 2 2018-02-16 of         
 3 2018-02-16 the        
 4 2018-02-16 national   
 5 2018-02-16 assembly   
 6 2018-02-16 ms         
 7 2018-02-16 baleka     
 8 2018-02-16 mbete      
 9 2018-02-16 chairperson
10 2018-02-16 of         
# … with 475,593 more rows
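
To make the tokenisation step concrete, here is a minimal sketch on toy data. The two sentences and the `toy` and `toy_tokens` names are made up for illustration and are not taken from the speeches.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the speech data (hypothetical text, not from the transcripts).
toy <- tibble(
  date = as.Date(c("2020-01-01", "2020-02-01")),
  text = c(
    "Fellow South Africans, good evening.",
    "We will rebuild our economy."
  )
)

# One row per word; the text column is replaced by a lower-case word column.
toy_tokens <- toy %>%
  unnest_tokens(word, text, to_lower = TRUE)

toy_tokens
```

Each word ends up on its own row and keeps the date of the speech it came from, which is what makes the later counts by word (and by month) straightforward.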

I can already see that there are some terms in there that I’d like to exclude. Let’s load the stop word list that comes with {tidytext} and add in some custom stop words.


stop_words <- rbind(
  stop_words %>% select(word),
  # Custom stop words go in this vector; the original list is not shown here.
  tibble(word = character())
)

Now remove the stop words, punctuation and all numbers.

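library(stringr)

ramaphosa <- ramaphosa %>%
  anti_join(stop_words, by = "word") %>%
  # Strip punctuation from within tokens.
  mutate(
    word = str_replace_all(word, "[:punct:]", "")
  ) %>%
  # Drop tokens that consist only of digits.
  filter(
    !str_detect(word, "^[:digit:]+$")
  )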
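
The stop word handling can be sketched on a handful of tokens. The custom words used here ("ms" and "chairperson") are hypothetical choices for illustration only, and the `custom_stop_words`, `all_stop_words`, `tokens` and `cleaned` names are likewise assumptions.

```r
library(dplyr)
library(tidytext)

# Hypothetical custom stop words, chosen for illustration only.
custom_stop_words <- tibble(word = c("ms", "chairperson"))

# Combine the stop word list shipped with {tidytext} with the custom words.
all_stop_words <- rbind(
  stop_words %>% select(word),
  custom_stop_words
)

# A few tokens like those at the start of the tokenised data.
tokens <- tibble(
  word = c("speaker", "of", "the", "national", "assembly", "ms", "chairperson")
)

# anti_join() keeps only the tokens that have no match in the stop word list.
cleaned <- tokens %>%
  anti_join(all_stop_words, by = "word")

cleaned
```

Using `anti_join()` rather than a `filter()` with `%in%` keeps everything in data frames, so extending the list is just a matter of adding rows.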

What are the most common words and how often do they occur?

(ramaphosa_count <- ramaphosa %>% count(word, sort = TRUE))
# A tibble: 15,044 x 2
   word            n
   <chr>       <int>
 1 south        2650
 2 people       2322
 3 africa       2047
 4 economic     1291
 5 country      1221
 6 african      1206
 7 development  1155
 8 government   1019
 9 investment    970
10 women         886
# … with 15,034 more rows

Who can resist a word cloud, right? We’ll create one using the versatile {ggwordcloud} package.
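
A minimal sketch of such a word cloud, using only the ten most common words and their counts from the table above. The `top_words` and `cloud` names, the seed and the `max_size` value are arbitrary choices for illustration, and the real plot would presumably use many more words.

```r
library(ggplot2)
library(ggwordcloud)

# Top ten words and counts, copied from the count table above.
top_words <- data.frame(
  word = c(
    "south", "people", "africa", "economic", "country",
    "african", "development", "government", "investment", "women"
  ),
  n = c(2650, 2322, 2047, 1291, 1221, 1206, 1155, 1019, 970, 886)
)

# Word placement is random, so fix the seed for a reproducible layout.
set.seed(13)

cloud <- ggplot(top_words, aes(label = word, size = n)) +
  geom_text_wordcloud() +
  # Make the area of each word proportional to its count.
  scale_size_area(max_size = 20) +
  theme_minimal()

cloud
```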

Interesting. But this doesn’t give us any indication of how topical issues have changed over time. Let’s look at this in another way. The plot below shows the cumulative proportional contribution of individual terms over time.
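
One way to compute a cumulative proportional contribution is sketched below on toy data. This is an assumption about what the plot shows, not necessarily the calculation behind it: for each date, the running count of a word is divided by the running count of all words. The `tokens` values and the `cumulative_share` name are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy tokens standing in for the cleaned data (hypothetical values).
tokens <- tibble(
  date = as.Date(c(
    "2020-04-01", "2020-04-01",
    "2020-05-01", "2020-05-01", "2020-05-01",
    "2020-06-01"
  )),
  word = c("people", "health", "people", "people", "health", "economy")
)

cumulative_share <- tokens %>%
  count(date, word) %>%
  # Add explicit zero counts so every word has a row for every date.
  complete(date, word, fill = list(n = 0)) %>%
  arrange(date) %>%
  # Running total per word.
  group_by(word) %>%
  mutate(cumulative_n = cumsum(n)) %>%
  # Share of each word in the running total across all words.
  group_by(date) %>%
  mutate(share = cumulative_n / sum(cumulative_n)) %>%
  ungroup()

cumulative_share
```

Plotting `share` against `date` with one line or area per word would then give a picture of how the balance between terms shifts over time.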

But I still like the word cloud. Let’s settle for a compromise between word cloud and time resolution.
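
Such a compromise could be sketched as one small word cloud per month, using facets. The toy tokens below are hypothetical, and the `tokens`, `monthly_count` and `monthly_cloud` names, the seed and the `max_size` value are arbitrary choices for illustration.

```r
library(dplyr)
library(ggplot2)
library(ggwordcloud)

# Toy tokens standing in for the cleaned data (hypothetical values).
tokens <- tibble(
  date = as.Date(c(
    "2020-04-03", "2020-04-20", "2020-04-20",
    "2020-08-09", "2020-08-09", "2020-08-31"
  )),
  word = c("coronavirus", "people", "coronavirus", "women", "women", "people")
)

# Count words within each calendar month.
monthly_count <- tokens %>%
  mutate(month = format(date, "%Y-%m")) %>%
  count(month, word, sort = TRUE)

# Word placement is random, so fix the seed for a reproducible layout.
set.seed(13)

monthly_cloud <- ggplot(monthly_count, aes(label = word, size = n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 12) +
  # One panel (and so one small word cloud) per month.
  facet_wrap(~ month) +
  theme_minimal()

monthly_cloud
```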

Nice! Quickly picking out a few themes:

  • In August 2020 he had a lot to say about women, which makes sense since that was Women’s Month.
  • In April 2020 coronavirus and people dominate, with health ascending in May 2020.
  • The emphasis turned to investment and the economy in October and November 2020.


Looking forward to updating this data over the course of 2021 and seeing how the monologue changes.
