Given that I do quite like Twitter, I thought it would be a good idea to write about R’s interface to the Twitter API: rtweet. We can grab the package in the usual way. We’re also going to need the tidyverse for the analysis, rvest for some initial web scraping of Twitter names, lubridate for some date manipulation and stringr (loaded with the tidyverse) for some minor text mining.
install.packages(c("rtweet", "tidyverse", "rvest", "lubridate"))
library("rtweet")
library("tidyverse")
library("rvest")
library("lubridate")
Getting the tweets
So, I could just write out the names of Twitter’s 10 most followed world leaders, but what would be the fun in that? We’re going to scrape them from Twiplomacy using rvest and a Chrome extension called SelectorGadget:
world_leaders = read_html("https://twiplomacy.com/ranking/the-50-most-followed-world-leaders-in-2017/")
lead_r = world_leaders %>%
  html_nodes(".ranking-entry .ranking-user-name") %>%
  html_text() %>%
  str_replace_all("\\t|\\n|@", "")
head(lead_r)
## [1] "realdonaldtrump" "pontifex"        "narendramodi"    "pmoindia"
## [5] "potus"           "whitehouse"
The string inside html_nodes() is gathered using SelectorGadget. See this great tutorial on rvest, and for more on SelectorGadget read vignette("selectorgadget"). Tabs (\t), line breaks (\n) and the leading @ are removed with str_replace_all() from the stringr package.
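To see what that pattern does in isolation, here is a minimal sketch on a couple of made-up strings. The raw vector is hypothetical; it only mimics the whitespace-padded handles that html_text() tends to return, and is not copied from the Twiplomacy page.

```r
library("stringr")

# Hypothetical raw strings, padded the way scraped text often is
raw = c("\n\t\t@realdonaldtrump\n\t", "\n\t\t@pontifex\n\t")

# "\\t|\\n|@" matches a tab, a newline or an @ sign;
# every match is replaced by the empty string
clean = str_replace_all(raw, "\\t|\\n|@", "")
clean
```

Because the three alternatives are joined with |, a single call handles all of them, which saves chaining three separate replacements.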
Now we can collect the Twitter data using rtweet. We can use the function lookup_users() to grab basic user info such as the number of tweets, friends, favourites and followers. Analysing all 50 leaders at once would be a pain, so we’re only going to take the top 10 (WARNING: this could take a while):
lead_r_info = lookup_users(lead_r[1:10])
lead_r_info
## # A tibble: 10 x 20
##    user_id  name   screen_name  location description       url   protected
##    <chr>    <chr>  <chr>        <chr>    <chr>             <chr> <lgl>
##  1 25073877 Donal… realDonaldT… Washing… 45th President o… http… FALSE
##  2 5007043… Pope … Pontifex     Vatican… Welcome to the o… http… FALSE
##  3 18839785 Naren… narendramodi India    Prime Minister o… http… FALSE
##  4 4717417… PMO I… PMOIndia     "India " Office of the Pr… http… FALSE
##  5 8222156… Presi… POTUS        Washing… 45th President o… http… FALSE
##  6 8222156… The W… WhiteHouse   Washing… Welcome to @Whit… http… FALSE
##  7 68034431 Recep… RT_Erdogan   Ankara,… Türkiye Cumhurba… <NA>  FALSE
##  8 2196174… Sushm… SushmaSwaraj New Del… Minister of Exte… <NA>  FALSE
##  9 3669871… Joko … jokowi       Jakarta  Akun resmi Joko … <NA>  FALSE
## 10 44335525 HH Sh… HHShkMohd    Dubai, … Official Tweets … http… FALSE
## # ... with 13 more variables: followers_count <int>, friends_count <int>,
## #   listed_count <int>, statuses_count <int>, favourites_count <int>,
## #   account_created_at <dttm>, verified <lgl>, profile_url <chr>,
## #   profile_expanded_url <chr>, account_lang <chr>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>
We only want the columns of interest (name, screen_name, followers_count, friends_count, statuses_count and favourites_count) and then we want the data in long format. To do this we’re going to use select() and gather():
(lead_r_info = lead_r_info %>%
  select(name, followers_count, friends_count, statuses_count,
         favourites_count, screen_name) %>%
  gather(type, value, -name, -screen_name))
## # A tibble: 40 x 4
##    name                 screen_name     type               value
##    <chr>                <chr>           <chr>              <int>
##  1 Donald J. Trump      realDonaldTrump followers_count 48426576
##  2 Pope Francis         Pontifex        followers_count 16858642
##  3 Narendra Modi        narendramodi    followers_count 40710139
##  4 PMO India            PMOIndia        followers_count 25156203
##  5 President Trump      POTUS           followers_count 22324998
##  6 The White House      WhiteHouse      followers_count 16713369
##  7 Recep Tayyip Erdoğan RT_Erdogan      followers_count 12513987
##  8 Sushma Swaraj        SushmaSwaraj    followers_count 11461412
##  9 Joko Widodo          jokowi          followers_count  9693534
## 10 HH Sheikh Mohammed   HHShkMohd       followers_count  8764455
## # ... with 30 more rows
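If the wide-to-long step is unfamiliar, the following sketch shows it on a tiny made-up table. The leaders, handles and counts here are invented purely for illustration.

```r
library("dplyr")
library("tidyr")

# Made-up counts for two made-up accounts
wide = tibble(
  name = c("Leader A", "Leader B"),
  screen_name = c("leader_a", "leader_b"),
  followers_count = c(100L, 50L),
  friends_count = c(10L, 20L)
)

# Every column except name and screen_name is stacked into a
# type/value pair, so 2 rows x 2 count columns become 4 rows
long = wide %>%
  gather(type, value, -name, -screen_name)
long
```

The long layout is what lets facet_wrap(~type) draw one panel per count later on. In more recent versions of tidyr, pivot_longer() is the recommended replacement for gather(), but gather() still works.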
Now we can use the fantastic ggplot2 to plot the respective counts for each world leader:
ggplot(data = lead_r_info,
       aes(x = reorder(name, value), y = value, fill = type, colour = type)) +
  geom_col() +
  facet_wrap(~type, scales = "free") +
  theme_minimal() +
  theme(
    strip.background = element_blank(),
    strip.text = element_blank(),
    title = element_blank(),
    axis.text.x = element_blank()
  ) +
  coord_flip() +
  geom_text(aes(y = value, label = value), colour = "black", hjust = "inward")
Notice Donald trumps everyone in the followers and statuses panels (from what I hear he’s quite a prolific tweeter); however, Sushma Swaraj and Narendra Modi trump everyone when it comes to favourites and friends respectively.
Now we’re going to use the function get_timelines() to retrieve the last 2000 tweets by each leader. Again, this may take a while!
lead_r_tl = get_timelines(lead_r, n = 2000)
get_timelines() only gives us their Twitter handle and doesn’t return their actual name. So I’m going to use left_join() to add the column of names, to make for easier reading on the upcoming graphs:
# lead_r_info is in long format by now, so keep one row per name/handle pair
names = distinct(select(lead_r_info, name, screen_name))
lead_r_twitt = left_join(lead_r_tl, names, by = "screen_name")
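One thing to watch with this join: lead_r_info is in long format by this point, so a lookup table built from it can repeat each name/handle pair, and a left_join() against repeated keys repeats the tweets too. Here is a minimal sketch with made-up data showing why distinct() matters (all the names and handles are hypothetical):

```r
library("dplyr")

# Hypothetical long-format info table: one row per leader per count type
info_long = tibble(
  name = rep(c("Leader A", "Leader B"), times = 2),
  screen_name = rep(c("leader_a", "leader_b"), times = 2),
  type = rep(c("followers_count", "friends_count"), each = 2)
)

# Hypothetical timeline with one tweet per leader
tweets = tibble(
  screen_name = c("leader_a", "leader_b"),
  text = c("tweet 1", "tweet 2")
)

lookup_dup = select(info_long, name, screen_name)  # each pair appears twice
lookup_unique = distinct(lookup_dup)               # each pair appears once

n_dup = nrow(left_join(tweets, lookup_dup, by = "screen_name"))
n_unique = nrow(left_join(tweets, lookup_unique, by = "screen_name"))
c(duplicated_lookup = n_dup, unique_lookup = n_unique)
```

With the duplicated lookup the two tweets become four rows, which would inflate any counts or totals computed afterwards; with the de-duplicated lookup there is still one row per tweet.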
get_timelines() gives us the source of a person’s tweet, i.e. iPhone, iPad, Android etc. So, what is the most popular tweet source among world leaders?
lead_r_twitt %>%
  count(source) %>%
  ggplot(aes(x = reorder(source, n), y = n)) +
  geom_col(fill = "cornflowerblue") +
  theme_minimal() +
  theme(
    strip.background = element_blank(),
    axis.text.x = element_blank()
  ) +
  labs(
    x = NULL,
    y = NULL,
    title = "Tweet sources for world leaders"
  ) +
  coord_flip() +
  geom_text(aes(y = n, label = n), hjust = "inward")
Either world leaders really love iPhones or their social media / security teams do. Probably the latter. I can hear you all asking the question: which source is more likely to give a world leader more retweets and favourites? To answer it we’re going to summarise each source by its mean number of retweets and favourites and then gather the data into a long format for plotting:
lead_r_twitt %>%
  group_by(source) %>%
  summarise(Retweet = mean(retweet_count),
            Favourite = mean(favorite_count)) %>%
  gather(type, value, -source) %>%
  ggplot(aes(x = reorder(source, value), y = value, fill = type)) +
  geom_col() +
  facet_wrap(~type, scales = "free") +
  theme_minimal() +
  labs(
    x = "Source",
    y = NULL,
    title = "Which source is more likely to get more retweets and favourites?",
    subtitle = "Values are the mean in each group"
  ) +
  theme(
    legend.position = "none",
    axis.text.x = element_blank()
  ) +
  geom_text(aes(y = value, label = round(value, 0)), colour = "black", hjust = "inward") +
  coord_flip()
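For anyone who wants to check the summarise step on its own before plotting, here is a small sketch on invented numbers. The sources and counts are made up, and only the column names follow the ones used above.

```r
library("dplyr")

# Three made-up tweets from two made-up sources
tweets = tibble(
  source = c("Twitter for iPhone", "Twitter for iPhone", "Twitter Web Client"),
  retweet_count = c(10, 30, 5),
  favorite_count = c(100, 300, 50)
)

# One row per source, holding the mean of each count within that source
by_source = tweets %>%
  group_by(source) %>%
  summarise(Retweet = mean(retweet_count),
            Favourite = mean(favorite_count))
by_source
```

Bear in mind that a mean can be dragged around by a handful of viral tweets, so a source with few tweets may look better or worse than it really is.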
Naturally this leads me to the question of which leader, over their previous 2000 tweets, has the most overall retweets and favourites, and who has the highest average number of retweets and favourites?
lead_r_twitt %>%
  group_by(name) %>%
  summarise(rt_total = sum(retweet_count),
            fav_total = sum(favorite_count),
            rt_mean = mean(retweet_count),
            fav_mean = mean(favorite_count)) %>%
  gather(type, value, -name) %>%
  ggplot(aes(x = reorder(name, value), y = value, fill = type)) +
  geom_col() +
  labs(
    x = NULL,
    y = NULL,
    title = "Mean and total retweets/favourites for each world leader"
  ) +
  coord_flip() +
  facet_wrap(~type, scales = "free") +
  theme_minimal()
What about the mean retweets and favourites per month?
ts_plot() provides us with a quick way to turn the data into a time series plot. However, this wouldn’t work for me, so I’m doing it the dplyr way. I’m going to build a monthly time series, so first we need to aggregate our data into months. The function rollback(), from lubridate, is fantastic for this. With roll_to_first = TRUE it will roll a date back to the first day of that month, and with preserve_hms = FALSE it also gets rid of the time information.
lead_r_twit2 = lead_r_twitt %>%
  mutate(year_month = rollback(created_at, roll_to_first = TRUE, preserve_hms = FALSE)) %>%
  group_by(name, year_month) %>%
  mutate(fav_mean = mean(favorite_count),
         rt_mean = mean(retweet_count))
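To make the rollback() behaviour concrete, here is a short sketch on three made-up timestamps (the dates are hypothetical and simply stand in for the created_at column):

```r
library("lubridate")

# Three made-up tweet timestamps: two in March, one in April
created_at = ymd_hms(c("2018-03-03 14:21:05",
                       "2018-03-28 09:00:00",
                       "2018-04-15 23:59:59"))

# roll_to_first = TRUE moves each date to the 1st of its own month;
# preserve_hms = FALSE resets the time of day to midnight
year_month = rollback(created_at, roll_to_first = TRUE, preserve_hms = FALSE)
year_month
```

Both March timestamps collapse to the same value, which is what allows group_by(name, year_month) to treat them as one month. lubridate's floor_date(created_at, "month") should give an equivalent result if you prefer that spelling.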
We now have two columns, fav_mean and rt_mean, that hold the mean number of favourites and retweets for each leader in each month. We can use select() to keep the variables we want and then gather() to turn this into long data for plotting:
lead_r_twit2 = lead_r_twit2 %>%
  select(name, year_month, fav_mean, rt_mean) %>%
  gather(type, value, -name, -year_month)
Now we plot
lead_r_twit2 %>%
  ggplot(aes(x = year_month, y = value, colour = name)) +
  geom_line() +
  facet_wrap(~type, scales = "free", nrow = 2) +
  labs(
    x = NULL,
    y = NULL,
    title = "Mean number of favourites/month for world leaders"
  ) +
  theme_minimal()
Are world leaders actually bots?
botrnot is a fantastic package that uses machine learning to calculate the probability that a Twitter user is a bot. So the obvious next question is: are our world leaders bots or not?
We need to install the development package from GitHub and we also need to install the GitHub version of rtweet
The only function, botornot(), works on either given user names or the output of the get_timelines() function from rtweet. To keep in line with the rest of the blog, we’re going to use the output we’ve already created from get_timelines(), stored in lead_r_tl:
bot = botornot(lead_r_tl) %>%
  arrange(prob_bot)
For a clearer look at the probabilities I’m going to plot them with their actual names instead of the screen names
bot %>%
  rename(screen_name = user) %>%
  inner_join(distinct(names), by = "screen_name") %>%
  select(name, prob_bot) %>%
  arrange(prob_bot) %>%
  ggplot() +
  geom_col(aes(x = reorder(name, -prob_bot), y = prob_bot),
           fill = "cornflowerblue") +
  coord_flip() +
  labs(y = "Probability of being a bot",
       x = "World leader",
       title = "Probability of world leaders being a bot") +
  theme_minimal()
So apparently we are almost certain Donald J. Trump isn’t a bot and very very nearly certain the Pope is a bot!
That’s all for this time, thanks for reading!