Twitter for historic memory
A previous post here detailed a simple function (from the uspols package) for extracting a timeline of the Trump presidency from Wikipedia. In this post, we turn that timeline into a Twitter bot, one that remembers daily what happened four years ago in the Trump presidency.

A fairly trivial task via the rtweet package; what is a bit tricky, however, is (1) dealing with Twitter's 280-character limit and (2) automating the posting of Twitter threads, as Wikipedia's daily accounting of the last four years can be fairly detailed. For good measure, we demonstrate an unsupervised approach to adding hashtags to our tweet threads in the form of named entities via spacy / spacyr.
Timeline content
The Trump timeline can be extracted using the uspols::uspols_wiki_timeline() function. A static table will suffice for our purposes here.
```r
# devtools::install_github("jaytimm/uspols")
library(tidyverse)

locals1 <- uspols::uspols_wiki_timeline()

## feature-izing Events --
locals1$nsent <- tokenizers::count_sentences(locals1$Events)
locals1$nchar1 <- nchar(locals1$Events)
## --
locals1 <- subset(locals1, nchar1 > 0)
```
To detail the table content returned by uspols_wiki_timeline(), we consider the 699th day of the Trump presidency: a Thursday.
```r
eg <- locals1 %>% filter(daypres == 699)

eg %>%
  select(quarter:dow) %>%
  slice(1) %>%
  knitr::kable()
```
| quarter | weekof     | daypres | date       | dow      |
|---------|------------|---------|------------|----------|
| 2018_Q4 | 2018-12-16 | 699     | 2018-12-20 | Thursday |
As detailed below, Wikipedia summarizes each day's happenings as individual "Events"; some days were more eventful than others during the past four years. I have added a bullet column for simple enumeration; day 699, then, included four events. Per above, we have also added some Event-level features relevant to a tweet bot: sentence count (nsent) and character count (nchar1).
```r
eg %>%
  select(bullet, nsent, nchar1, Events) %>%
  DT::datatable(rownames = FALSE,
                options = list(dom = 't',
                               pageLength = nrow(eg),
                               scrollX = TRUE))
```
In this example, then, Events #3 and #4 will present problems from a character-limit perspective.
Automated tweet-sizing of a thought
Sentence-based
One approach to "tweet-sizing" Wikipedia-based events is simply to extract sentences until we exceed some character threshold. An easy solution. First, though, we eliminate periods that mark abbreviations (eg, initials and acronyms like U.S.), which makes sentence tokenizers considerably more useful.
```r
## strip periods that follow capital letters, eg 'U.S.' -> 'US' --
locals1$Events <- gsub('([A-Z])(\\.)', '\\1', locals1$Events)
```
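To make the substitution concrete, a toy before/after (the example sentence is ours, not from the timeline):

```r
x <- 'Trump meets with George H.W. Bush. Talks continue.'
gsub('([A-Z])(\\.)', '\\1', x)
## [1] "Trump meets with George HW Bush. Talks continue."
```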
The function below extracts sentences from a larger text until the cumulative character count exceeds some number, as specified by the chars parameter. As we want to add some affixal matter (eg, a thread counter) to our tweets, we set this at 250.
```r
extract_sent1 <- function(x, chars = 250) {
  z1 <- data.frame(ts = tokenizers::tokenize_sentences(x)[[1]])
  z1$nchar_sent <- nchar(z1$ts) + 1  ## +1 for the joining space
  z1$cum_char <- cumsum(z1$nchar_sent)
  z2 <- subset(z1, cum_char < chars)
  paste0(z2$ts, collapse = ' ')
}
```
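A quick sanity check on a made-up string with a small threshold:

```r
x <- 'First sentence here. Second sentence here. Third sentence here.'
extract_sent1(x, chars = 50)
## [1] "First sentence here. Second sentence here."
```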
The table below details the Events of Day 699 in tweet-ready length per our sentence extraction procedure. Per this method, we lose some (perhaps useful) detail from our thread.
```r
Events <- unlist(lapply(eg$Events, extract_sent1, chars = 250))

data.frame(nsent = tokenizers::count_sentences(Events),
           nchar1 = nchar(Events),
           Events = Events) %>%
  DT::datatable(rownames = FALSE,
                options = list(dom = 't',
                               pageLength = nrow(eg),
                               scrollX = TRUE))
```
Via ellipses
If, instead, we wanted to preserve full event content, one approach would be to split the text into ~280-character chunks (ideally respecting word boundaries) and piece the thread together via ellipses. The function below is taken directly from this SO post.
```r
cumsum_reset <- function(x, thresh = 4) {
  ans <- numeric()
  i <- 0
  while (length(x) > 0) {
    cs_over <- cumsum(x)
    ntimes <- sum(cs_over <= thresh)  ## elements that fit under the threshold
    x <- x[-(1:ntimes)]
    ans <- c(ans, rep(i, ntimes))     ## all assigned to group i
    i <- i + 1
  }
  return(ans)
}
```
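A quick illustration of the resetting behavior: values are assigned to groups such that each group's running sum stays at or under the threshold.

```r
cumsum_reset(c(1, 2, 1, 3, 1, 1, 2), thresh = 4)
## [1] 0 0 0 1 1 2 2
```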
The second function applies cumsum_reset to character counts at the word level: once a certain character-count threshold is reached, counting resets and a new chunk begins.
```r
to_thread <- function(x, chars = 250) {
  # no thread counts at present --
  ## split text into words & count characters --
  x1 <- data.frame(text = unlist(strsplit(x, ' ')))
  x1$chars <- nchar(x1$text) + 1
  x1$sub_text <- cumsum_reset(x1$chars, thresh = chars)

  ## re-assemble words into tweet-sized chunks --
  x2 <- aggregate(x1$text, list(x1$sub_text), paste, collapse = " ")

  ## flag first/middle/last chunks & affix ellipses accordingly --
  x2$ww <- 'm'
  x2$ww[1] <- 'f'
  x2$ww[nrow(x2)] <- 'l'
  x2$x <- ifelse(x2$ww %in% c('f', 'm'), paste0(x2$x, ' ...'), x2$x)
  x2$x <- ifelse(x2$ww %in% c('l', 'm'), paste0('... ', x2$x), x2$x)

  paste0(x2$x, collapse = ' || ')
}
```
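A toy call (ours, with an artificially small limit) shows the ellipses and the ' || ' chunk delimiter at work:

```r
to_thread('one two three four five six seven eight', chars = 15)
## [1] "one two three ... || ... four five six ... || ... seven eight"
```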
For a demonstration of how these functions work, we use a super long event from the Wikipedia timeline – from 12 August 2018 (day 573) – which is over 1200 characters in length.
eg1 <- locals1 %>% filter(nchar1 == max(nchar1))
Function output is summarized below. Also, we add a thread counter, and check in on character counts.
```r
eg2 <- eg1 %>%
  select(Events) %>%
  mutate(Events = to_thread(Events)) %>%  ##### --- !!
  separate_rows(Events, sep = ' \\|\\| ')

eg3 <- eg2 %>%
  mutate(Events = paste0(Events, ' [', row_number(), ' / ', nrow(eg2), ']'),
         nchar1 = nchar(Events)) %>%
  select(nchar1, Events)
```
Posting threads using rtweet
We can piece together these different workflows as a simple script to be run daily via cron. The daily script (not detailed here) first filters the Event timeline to the day's date four years ago, and then applies our tweet re-sizing functions to address any potential character-count issues.
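A minimal sketch of what that date filter might look like (hypothetical; the actual cron script is not shown here):

```r
## select events from this date four years ago --
## assumes the `date` column is a 'YYYY-MM-DD' string per the table above
four_years_ago <- seq(Sys.Date(), by = '-4 years', length.out = 2)[2]
tred1 <- locals1 %>% filter(as.Date(date) == four_years_ago)
```

The code below then illustrates how threads are composed based on Event features (ie, sentence and character counts).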
```r
tred1 <- tred1 %>%  ## the day's events (excerpted from the daily script)
  rowwise() %>%
  mutate(THREAD = case_when(
    nchar1 < 251 ~ Events,
    nchar1 > 250 & nsent > 1 ~ extract_sent1(Events),
    nchar1 > 250 & nsent == 1 ~ to_thread(Events))) %>%
  ungroup()
```
Lastly, we build threads by looping through our list of tweet-readied events, replying to each previously posted tweet via the in_reply_to_status_id parameter of the rtweet::post_tweet() function.
```r
## post the first tweet in the thread --
rtweet::post_tweet(tred2$txt3[1], token = tk)

## then post each remaining piece as a reply to the bot's latest tweet --
if (nrow(tred1) > 1) {
  for (i in 2:length(tred2$txt3)) {
    Sys.sleep(1)
    last_tweet <- rtweet::get_timeline(user = 'MemoryHistoric')$status_id[1]
    rtweet::post_tweet(tred2$txt3[i],
                       in_reply_to_status_id = last_tweet,
                       token = tk)
  }
}
```
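Per the intro, hashtags can be added in an unsupervised way via named entities. A minimal sketch of that approach using spacyr (this assumes spaCy and an English model are installed; the author's exact implementation may differ):

```r
library(spacyr)
spacy_initialize()  ## assumes spaCy + a language model are available

## extract named entities from the day's Events --
parsed <- spacy_parse(eg$Events, entity = TRUE)
ents <- entity_extract(parsed)  ## multi-word entities come joined by '_'

## collapse to hashtag form, eg 'Michael_Flynn' -> '#MichaelFlynn' --
hashtags <- paste0('#', gsub('_', '', unique(ents$entity)))
```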
Summary
See and follow MemoryHistoric on Twitter!