How do you feel about Last Week Tonight?


Welcome, welcome, welcome!

One thing my husband and I enjoy a lot is watching Last Week Tonight with John Oliver every week. It is an HBO political talk show that airs on Sunday nights, and we usually watch it over dinner sometime during the week. We love the show because it covers a huge range of diverse topics and news from all over the world, plus we laugh a lot (bittersweet laughs, mostly).

I think John has a fantastic sense of humor and he is a spectacular communicator, but only if you share the way he sees the world. And because he is so enthusiastic about his views, I believe he is the kind of character you either love or hate. I suspect he (as well as the controversial topics he covers) arouses strong feelings in people, and I want to test that by analyzing the comments people leave on his Youtube videos, and on his Facebook ones as well.

I’ve been wanting to try Julia Silge and David Robinson’s tidytext package for a while now, and after I read Erin’s text analysis of the Lizzie Bennet Diaries’ Youtube captions I thought about giving Youtube a try.



Fetching Youtube videos and comments

Every episode has one main story and many short stories that are mostly available to watch online via Youtube.

I’m using the Youtube Data API and the tuber package to get the info from Youtube (I found a bug in the get_comment_threads function in the CRAN version, so I recommend you use the GitHub one instead, where it is fixed). The first time, you need to obtain authorization credentials so your application can submit API requests (you can follow this guide to do so). Then you just use the tuber::yt_oauth function, which launches a browser where you authorize the application, and you can start retrieving information.

First I search for the Youtube channel, select the correct one, and then retrieve the playlist_id that I’m going to use to fetch all the videos.

library(tuber)

app_id <- "####"
app_password <- "####"
yt_oauth(app_id, app_password)

search_channel <- yt_search("lastweektonight")
channel <- "UC3XTzVzaHQEd30rQbuvCtTQ"

channel_resources <- list_channel_resources(filter = c(channel_id = channel),
                                                part =  "contentDetails")

playlist_id <- channel_resources$items[[1]]$contentDetails$relatedPlaylists$uploads

Fetching the videos

To get all the videos I use the get_playlist_items function, but it only retrieves the first 50 elements. I know soodoku is planning on implementing an argument à la “get_all”, but in the meantime I have to do this myself to get all the videos (I took more than a few ideas from Erin’s script!).

I should warn you ⚠️ : The tuber package is all about lists, and not tidy dataframes, so I dedicate a lot of effort to tidying this data.

library(dplyr)
library(tuber)
library(purrr)
library(magrittr)
library(tibble)

get_videos <- function(playlist) {
  # pass NA as next page to get first page
  nextPageToken <- NA
  videos <- {}

  # Loop over every available page
  repeat {
    vid      <- get_playlist_items(filter = c(playlist_id = playlist),
                                   page_token = nextPageToken)

    vid_id   <- map(vid$items, "contentDetails") %>%
      map_df(magrittr::extract, c("videoId", "videoPublishedAt"))

    titles   <- lapply(vid_id$videoId, get_video_details) %>%
      map("localized") %>%
      map_df(magrittr::extract, c("title", "description"))

    videos   <- videos %>% bind_rows(tibble(id          = vid_id$videoId,
                                            created     = vid_id$videoPublishedAt,
                                            title       = titles$title,
                                            description = titles$description))

    # get the token for the next page
    nextPageToken <- ifelse(!is.null(vid$nextPageToken), vid$nextPageToken, NA)

    # if no more pages then done
    if (is.na(nextPageToken)) {
      break
    }
  }
  return(videos)
}

videos <- get_videos(playlist_id)

Then I extract the first part of the title and the description (the rest is just advertisement), and format the video’s creation date:

library(stringr)

videos <- videos %>%
  mutate(short_title = str_match(title, "^([^:]+).+")[,2],
         short_desc  = str_match(description, "^([^\n]+).+")[,2],
         vid_created = as.Date(created)) %>%
  select(-created)

Let’s take a look at the videos.

library(DT)
datatable(videos[, c(4:6)], rownames = FALSE,
          options = list(pageLength = 5)) %>% 
  formatStyle(c(1:3), `font-size` = '15px')


Fetching the comments

Now I get the comments for every video. I write my own functions for the same reason as before. The function get_video_comments retrieves comments for a given video_id, with the n parameter setting the maximum number of comments we want.

get_video_comments <- function(video_id, n = 5) {
  # start with no page token to get the first page
  nextPageToken <- NULL
  comments <- {}

  repeat {
    com <- get_comment_threads(c(video_id  = video_id),
                               part        = "id, snippet",
                               page_token  = nextPageToken,
                               text_format = "plainText")

    # extract the relevant fields from every top level comment on this page
    for (i in 1:length(com$items)) {
      com_id      <- com$items[[i]]$snippet$topLevelComment$id
      com_text    <- com$items[[i]]$snippet$topLevelComment$snippet$textDisplay
      com_video   <- com$items[[i]]$snippet$topLevelComment$snippet$videoId
      com_created <- com$items[[i]]$snippet$topLevelComment$snippet$publishedAt

      comments    <- comments %>% bind_rows(tibble(video_id    = com_video,
                                                   com_id      = com_id,
                                                   com_text    = com_text,
                                                   com_created = com_created))
      if (nrow(comments) >= n) {
        break
      }
    }

    # get the token for the next page (NA if there are no more pages)
    nextPageToken <- ifelse(!is.null(com$nextPageToken), com$nextPageToken, NA)

    # if there are no more pages, or we already have n comments, then done
    if (is.na(nextPageToken) || nrow(comments) >= n) {
      break
    }
  }
  return(comments)
}

The function get_videos_comments receives a vector of video_ids and returns n comments for every video, using the previous get_video_comments function. Then I remove empty comments, join the result with the video information, and discard videos with fewer than 100 comments.

get_videos_comments <- function(videos, n = 10) {
  # map over every video id (recycling n) and row-bind the results
  pmap_df(list(videos, n), get_video_comments)
}

raw_yt_comments <- get_videos_comments(videos$id, n = 300)

yt_comments <- raw_yt_comments %>%
  filter(com_text != "") %>% 
  left_join(videos, by = c("video_id" = "id")) %>%
  group_by(short_title) %>%
  mutate(n = n(),
         com_created = as.Date(com_created)) %>%
  ungroup() %>%
  filter(n >= 100)

And looking at the first rows we can already see some of that passion I was talking about.

datatable(head(yt_comments[, c(7, 9, 3)], 30), rownames = FALSE,
          options = list(pageLength = 5)) %>% 
  formatStyle(c(1:3), `font-size` = '15px')


Most used words and sentiment

In the tidy text world, a tidy dataset is a table with one-token-per-row. I start by tidying the yt_comments dataframe and removing the stop words (the stop_words dictionary is already included in the tidytext package).

library(tidytext)

tidy_yt_comments <- yt_comments %>%
  tidytext::unnest_tokens(word, com_text) %>%
  anti_join(stop_words, by = "word")

Positive and Negative words in comments

I’m using the bing lexicon, which categorizes words as positive or negative, to evaluate the emotion in each word. I join the words in the tidy_yt_comments dataset with the sentiment in the bing lexicon, and then count how many times each word appears.

So let’s find out the most used words in the comments!

library(ggplot2)

yt_pos_neg_words <- tidy_yt_comments %>%  
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, nn)) %>%
  ggplot(aes(word, nn, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("red2", "green3")) +
  facet_wrap(~sentiment, scales = "free_y") +
  ylim(0, 2500) +
  labs(y = NULL, x = NULL) +
  coord_flip() +
  theme_minimal()

There are a lot of strong words here! And I’m pretty sure the trump we see among the positive words is not quite the same Trump John has been talking about non-stop for the last two years… and not precisely in a positive way… I could include this word in a custom_stop_words dataframe, but I’m going to leave it like that for now.
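If you did want to drop it, a minimal sketch (mirroring the custom_stop_words idea used further down, with a hypothetical trump_stop_words dataframe; this is not done in the original analysis) could look like this:

# hypothetical: treat "trump" as a stop word before the sentiment join
trump_stop_words <- bind_rows(tibble(word = "trump", lexicon = "custom"),
                              stop_words)

tidy_yt_comments %>%
  anti_join(trump_stop_words, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)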

Also… I’m not sure why funny ends up in the negative category. I know it can be used to mean weird or something like that, but maybe I only find this surprising because I’m not a native English speaker.
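Out of curiosity, you can always look a word up in the lexicon directly. This quick check (not part of the original analysis) shows how bing tags those two words:

# how does the bing lexicon classify these words?
get_sentiments("bing") %>%
  filter(word %in% c("funny", "trump"))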

Are there more positive or negative words?

tidy_yt_comments %>%
  inner_join(get_sentiments("bing"), by = "word") %>% 
  count(word, sentiment, sort = TRUE) %>% 
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(sentiment = reorder(sentiment, nn)) %>%
  ggplot(aes(sentiment, nn)) +
  geom_col(aes(fill = sentiment), show.legend = FALSE) +
  scale_fill_manual(values = c("green3", "red2")) +
  ylab(NULL) +
  xlab(NULL) +
  coord_flip() +
  theme_minimal()

Definitely more negative than positive words. OK.

More sentiments in comments

There is a different lexicon, nrc, that classifies words into more categories: two sentiments (positive and negative) and eight basic emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust). Let’s check what we find!

tidy_yt_comments  %>%
  inner_join(get_sentiments("nrc"), by = "word") %>% 
  count(word, sentiment, sort = TRUE) %>% 
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, nn)) %>%
  ggplot(aes(word, nn, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  xlab(NULL) +
  ylab(NULL) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip() +
  theme_minimal()

OK… a few comments here. First of all, john shows up here just because it is the host’s name, so I add it to a custom_stop_words dataframe and redo the plot without it.

custom_stop_words <- bind_rows(data_frame(word = c("john"), 
                                          lexicon = c("custom")),
                               stop_words)

yt_comments %>%
  tidytext::unnest_tokens(word, com_text) %>%
  anti_join(custom_stop_words, by = "word") %>%
  inner_join(get_sentiments("nrc"), by = "word") %>% 
  count(word, sentiment, sort = TRUE) %>% 
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, nn)) %>%
  ggplot(aes(word, nn, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  scale_y_continuous(breaks = c(0, 1000, 2000)) +
  xlab(NULL) +
  ylab(NULL) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip() +
  theme_minimal()

There are some very controversial classifications in this nrc lexicon, especially the terms black, classified as negative (and sadness), and white, classified as positive (and joy, anticipation and trust). I don’t like this at all…

Anyway, these are side comments because they are about the lexicon (or human nature!) and not about this analysis. Bottom line: I don’t like this lexicon.
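If you want to see those classifications for yourself, the lexicon is easy to query; this quick check is not part of the original post:

# how does the nrc lexicon classify these two words?
get_sentiments("nrc") %>%
  filter(word %in% c("black", "white"))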

What is the most present sentiment/emotion?

yt_comments %>%
  tidytext::unnest_tokens(word, com_text) %>%
  anti_join(custom_stop_words, by = "word") %>%
  inner_join(get_sentiments("nrc"), by = "word") %>% 
  count(word, sentiment, sort = TRUE) %>% 
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(pos_neg = ifelse(sentiment %in% c("positive", "anticipation", "joy", "trust", "surprise"), 
                          "Positive", "Negative")) %>%
  ggplot(aes(reorder(sentiment, nn), nn)) +
  geom_col(aes(fill = pos_neg), show.legend = FALSE) +
  scale_fill_manual(values = c("red2", "green3")) +
  xlab(NULL) +
  ylab(NULL) +
  coord_flip()

According to this lexicon, there are more positive than negative words! The opposite of what we found using the bing lexicon. The thing about this one is that it allows us to analyze other sentiments as well. But of course I’m not going to use it anymore.
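To make the bing-versus-nrc comparison explicit, here is a minimal sketch (not in the original post) that counts every positive and negative word under each lexicon, instead of only the top 10 words per sentiment used in the plots above:

# total positive vs negative matches under each lexicon
bind_rows(
  tidy_yt_comments %>%
    inner_join(get_sentiments("bing"), by = "word") %>%
    mutate(lexicon = "bing"),
  tidy_yt_comments %>%
    inner_join(get_sentiments("nrc"), by = "word") %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    mutate(lexicon = "nrc")
) %>%
  count(lexicon, sentiment)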

Most used n-grams

Another interesting thing to do is to find the most common n-grams (sequences of n consecutive words that tend to appear together).

yt_comments %>%
  tidytext::unnest_tokens(five_gram, com_text, token = "ngrams", n = 5) %>%
  count(five_gram, sort = TRUE) %>%
  top_n(10) %>%
  mutate(five_gram = reorder(five_gram, nn)) %>%
  ggplot(aes(five_gram, nn)) +
  geom_col(fill = "red", show.legend = FALSE) +
  xlab(NULL) +
  ylab(NULL) +
  coord_flip() +
  theme_minimal()

“how is this still a” and “is this still a thing” of course ring a bell for those of us who watch the show, since it has a segment called “How is this still a thing?” questioning certain traditions or things that for some reason seemed adequate at some point in time but are now totally absurd. Like voting in the US Presidential Elections on a Tuesday, or the Sports Illustrated swimsuit issue.

The “am i the only one”, “i the only one who” and “is it just me or” 5-grams show us how much people love rhetorical questions! Like, a lot! I’m going to take a peek at these comments!

am_i_the_only_one <- yt_comments %>%
  tidytext::unnest_tokens(five_gram, com_text, token = "ngrams", n = 5) %>%
  filter(five_gram == "am i the only one") %>%
  select(com_id)

datatable(head(yt_comments[yt_comments$com_id %in% am_i_the_only_one$com_id, c(7, 3)], 30), 
          rownames = FALSE,
          options = list(pageLength = 5)) %>% 
  formatStyle(c(1:2), `font-size` = '15px')


And a very strange 5-gram: “great great great great great”… I have to check what this is about!

great_great_great_great_great <- yt_comments %>%
  tidytext::unnest_tokens(five_gram, com_text, token = "ngrams", n = 5) %>%
  filter(five_gram == "great great great great great") %>%
  select(com_id) 

datatable(head(yt_comments[yt_comments$com_id %in% great_great_great_great_great$com_id, c(7, 3)], 1), 
          rownames = FALSE,
          options = list(pageLength = 5)) %>% 
  formatStyle(c(1:2), `font-size` = '15px')


Just like I suspected, this is one very long concatenation of the word “great”. This guy is a very, very enthusiastic atheist who is referring to a very old ancestor, so it doesn’t count for this analysis.

Moving on…

Sentiment Analysis on comments

Similar to what I did for every word, I now join the words in the tidy_yt_comments dataset with the sentiment in the bing lexicon, and count how many positive and negative words there are in every comment. Then I compute the sentiment score as positive - negative, and finally join this back to the yt_comments dataset.

library(tidyr) 

yt_comment_sent <- tidy_yt_comments  %>%
  inner_join(get_sentiments("bing"), by = "word") %>% 
  count(com_id, sentiment) %>%
  spread(sentiment, nn, fill = 0) %>%
  mutate(sentiment = positive - negative) %>% 
  ungroup() %>% 
  left_join(yt_comments, by = "com_id") %>% 
  arrange(sentiment) 

The longer the comment, the higher the potential for an extreme sentiment score (there is a small sketch after the table showing one way to account for that). Let’s take a look at the extremes. The most negative comments according to the bing lexicon are:

datatable(head(yt_comment_sent[, c(10, 12, 6)], 30), 
          rownames = FALSE,
          options = list(pageLength = 1)) %>% 
  formatStyle(c(1:3), `font-size` = '15px')
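
Since a longer comment can pile up a larger raw score just because it has more words, one way to put comments on a comparable scale (a minimal sketch using the yt_comment_sent object above; the original analysis keeps the raw score) is to divide the score by the number of sentiment words in each comment:

# hypothetical length-normalized score: (positive - negative) / (positive + negative)
yt_comment_sent %>%
  mutate(sentiment_norm = sentiment / (positive + negative)) %>%
  arrange(sentiment_norm) %>%
  select(short_title, com_text, positive, negative, sentiment, sentiment_norm)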
