How do you feel about Last Week Tonight?

[This article was first published on d4tagirl, and kindly contributed to R-bloggers].

Welcome, welcome, welcome!

One thing my husband and I enjoy a lot is watching Last Week Tonight with John Oliver every week. It is an HBO political talk show that airs on Sunday nights, and we usually watch it over dinner sometime during the week. We love the show because it covers a huge range of topics and news from all over the world, plus we laugh a lot (bittersweet laughs mostly 🤷🏻‍♀️).

I think John has a fantastic sense of humor and he is a spectacular communicator, but only if you share the way he sees the world. And because he is so enthusiastic about his views, I believe he is a character you either love or hate. I suspect he (as well as the controversial topics he covers) arouses strong feelings in people, and I want to check this by analyzing the comments people leave on his Youtube videos and his Facebook ones as well.

I’ve been wanting to try Julia Silge and David Robinson’s tidytext package for a while now, and after I read Erin’s text analysis on the Lizzie Bennet Diaries’ Youtube captions I thought about giving Youtube a try 😃

Fetching Youtube videos and comments

Every episode has one main story and many short stories that are mostly available to watch online via Youtube.

I’m using the Youtube Data API and the tuber package to get the info from Youtube (I found a bug in the get_comment_threads function on the CRAN version, so I recommend you use the GitHub one instead, where that is fixed). The first time, you need to set up authorization credentials so your application can submit API requests (you can follow this guide to do so). Then you just call the tuber::yt_oauth function, which launches a browser so you can authorize the application, and you can start retrieving information.

First I search for the Youtube channel and select the correct one, then I retrieve the playlist_id that I’m going to use to fetch all the videos.

```r
library(tuber)

app_id <- "####"
app_password <- "####"
yt_oauth(app_id, app_password)

search_channel <- yt_search("lastweektonight")
channel <- "UC3XTzVzaHQEd30rQbuvCtTQ"

channel_resources <- list_channel_resources(filter = c(channel_id = channel),
                                            part =  "contentDetails")

playlist_id <- channel_resources$items[[1]]$contentDetails$relatedPlaylists$uploads
```
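To make that last line less mysterious, here is a minimal sketch of the nested list it walks through. The `mock_resources` object, the `"UU-example"` id and the `get_uploads_id()` helper are made up for illustration; only the field names (`items`, `contentDetails`, `relatedPlaylists`, `uploads`) come from the code above.

```r
# Hypothetical stand-in for the list returned by list_channel_resources();
# the id value is invented, only the nesting mirrors the code in the post.
mock_resources <- list(
  items = list(
    list(contentDetails = list(
      relatedPlaylists = list(uploads = "UU-example")
    ))
  )
)

# Same chain of `$` and `[[ ]]` as in the post
uploads_id <- mock_resources$items[[1]]$contentDetails$relatedPlaylists$uploads

# A slightly more defensive variant (hypothetical helper): fail with a
# clear message if the channel lookup returned no items.
get_uploads_id <- function(resources) {
  if (length(resources$items) == 0) {
    stop("No channel found in the response")
  }
  resources$items[[1]]$contentDetails$relatedPlaylists$uploads
}

get_uploads_id(mock_resources)
```

Wrapping the lookup in a small function is just one option; it mainly helps when you query several channels and want a readable error instead of a subscript failure.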

Fetching the videos

To get all videos I use the get_playlist_items function, but it only retrieves the first 50 elements. I know soodoku is planning on implementing an argument à la “get_all”, but in the meantime I have to implement this myself to get all the videos (I took more than a few ideas from Erin’s script!).

I should warn you ⚠️ : The tuber package is all about lists, and not tidy dataframes, so I dedicate a lot of effort to tidying this data.

```r
library(dplyr)
library(tuber)
library(purrr)
library(magrittr)
library(tibble)

get_videos <- function(playlist) {
  # pass NA as next page to get first page
  nextPageToken <- NA
  videos <- {}

  # Loop over every available page
  repeat {
    vid      <- get_playlist_items(filter = c(playlist_id = playlist),
                                   page_token = nextPageToken)

    vid_id   <- map(vid$items, "contentDetails") %>%
      map_df(magrittr::extract, c("videoId", "videoPublishedAt"))

    titles   <- lapply(vid_id$videoId, get_video_details) %>%
      map("localized") %>%
      map_df(magrittr::extract, c("title", "description"))

    videos   <- videos %>% bind_rows(tibble(id          = vid_id$videoId,
                                            created     = vid_id$videoPublishedAt,
                                            title       = titles$title,
                                            description = titles$description))

    # get the token for the next page
    nextPageToken <- ifelse(!is.null(vid$nextPageToken), vid$nextPageToken, NA)

    # if no more pages then done
    if (is.na(nextPageToken)) {
      break
    }
  }
  return(videos)
}

videos <- get_videos(playlist_id)
```
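The token-based paging idea can be tried without API credentials. The sketch below is base R only, and `fake_pages`, `fetch_page()` and `fetch_all()` are invented stand-ins for the real API call; it assumes a missing `nextPageToken` marks the last page.

```r
# Hypothetical paged source: three pages of ids linked by tokens.
# fetch_page() is a made-up stand-in for a call like get_playlist_items().
fake_pages <- list(
  first = list(items = c("a", "b"), nextPageToken = "p2"),
  p2    = list(items = c("c", "d"), nextPageToken = "p3"),
  p3    = list(items = "e")            # last page: no nextPageToken
)

fetch_page <- function(page_token = NULL) {
  key <- if (is.null(page_token)) "first" else page_token
  fake_pages[[key]]
}

fetch_all <- function() {
  token   <- NULL
  results <- list()

  repeat {
    page <- fetch_page(token)
    # Store each page and combine once at the end, which avoids
    # growing a data frame inside the loop.
    results[[length(results) + 1]] <- data.frame(id = page$items,
                                                 stringsAsFactors = FALSE)
    token <- page$nextPageToken
    if (is.null(token)) break          # no token means no more pages
  }

  do.call(rbind, results)
}

all_items <- fetch_all()
all_items$id
```

Collecting pages in a list and binding them once is a design choice for this sketch; binding inside the loop, as in the post, gives the same rows.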

Then I extract the first part of the title and description (the rest is just advertisement), and format the video’s creation date.

```r
library(stringr)

videos <- videos %>%
  # keep the text before the first colon, and the first line of the description
  mutate(short_title = str_match(title, "^([^:]+)")[,2],
         short_desc  = str_match(description, "^([^\n]+)")[,2],
         vid_created = as.Date(created)) %>%
  select(-created)
```
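To see what this step is meant to produce, here is a small base R sketch on made-up strings. The title, description and timestamp are invented, and `first_part()` is a hypothetical helper that splits on a fixed separator instead of using a regular expression.

```r
# Made-up values in the style of the channel's uploads
title       <- "Example Topic: Last Week Tonight with John Oliver (HBO)"
description <- "A one-line summary of the story.\nConnect with us online\nMore links"
created     <- "2017-05-08T06:30:00.000Z"

# Hypothetical helper: keep what comes before the first occurrence of `sep`
first_part <- function(x, sep) {
  vapply(strsplit(x, sep, fixed = TRUE), function(p) p[1], character(1))
}

short_title <- first_part(title, ":")
short_desc  <- first_part(description, "\n")

# as.Date() reads the leading year-month-day and ignores the time part
vid_created <- as.Date(created)

short_title
short_desc
vid_created
```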

Let’s take a look at the videos.

```r
library(DT)
datatable(videos[, c(4:6)], rownames = FALSE,
          options = list(pageLength = 5)) %>%
  formatStyle(c(1:3), `font-size` = '15px')
```

[Interactive table of the videos, with columns short_title, short_desc and vid_created.]

Fetching the comments

Now I get the comments for every video. I write my own functions for the same reason as before. The function get_video_comments retrieves comments for a given video_id, with the n parameter setting the maximum number of comments we want.

```r
get_video_comments <- function(video_id, n = 5) {
  nextPageToken <- NULL
  comments <- {}

  repeat {
    com <- get_comment_threads(c(video_id  = video_id),
                               part        = "id, snippet",
                               page_token  = nextPageToken,
                               text_format = "plainText")

    for (i in 1:length(com$items)) {
      com_id      <- com$items[[i]]$snippet$topLevelComment$id
      com_text    <- com$items[[i]]$snippet$topLevelComment$snippet$textDisplay
      com_video   <- com$items[[i]]$snippet$topLevelComment$snippet$videoId
      com_created <- com$items[[i]]$snippet$topLevelComment$snippet$publishedAt

      comments    <- comments %>% bind_rows(tibble(video_id    = com_video,
                                                   com_id      = com_id,
                                                   com_text    = com_text,
                                                   com_created = com_created))
      if (nrow(comments) == n) {
        break
      }

      nextPageToken <- ifelse(!is.null(com$nextPageToken), com$nextPageToken, NA)
    }

    if (is.na(nextPageToken) | nrow(comments) == n) {
      break
    }
  }
  return(comments)
}
```
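The cap on the number of comments is easy to experiment with offline. This is a minimal base R sketch with invented data: `pages` and `collect_up_to()` are hypothetical, and the second page is left empty on purpose to show why `seq_along()` is a safer loop index than `1:length()`.

```r
# Hypothetical pages of comment text; the second page is empty on purpose.
pages <- list(c("first!", "great episode"),
              character(0),
              c("so true", "lol", "meh"))

collect_up_to <- function(pages, n = 5) {
  kept <- character(0)
  for (page in pages) {
    # seq_along() yields an empty sequence for an empty page,
    # whereas 1:length(page) would iterate over c(1, 0).
    for (i in seq_along(page)) {
      kept <- c(kept, page[[i]])
      if (length(kept) == n) return(kept)   # stop as soon as the cap is hit
    }
  }
  kept
}

collect_up_to(pages, n = 4)
```

Returning from inside the loop replaces the two separate `break` checks; whether that reads better is a matter of taste.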

The function get_videos_comments receives a vector of video_ids and returns up to n comments for every video, using the previous get_video_comments function. Then I remove empty comments, join with the video information and remove videos with fewer than 100 comments.

```r
get_videos_comments <- function(videos, n = 10){
  comments <- pmap_df(list(videos, n), get_video_comments)
}

raw_yt_comments <- get_videos_comments(videos$id, n = 300)

yt_comments <- raw_yt_comments %>%
  filter(com_text != "") %>%
  left_join(videos, by = c("video_id" = "id")) %>%
  group_by(short_title) %>%
  mutate(n = n(),
         com_created = as.Date(com_created)) %>%
  ungroup() %>%
  filter(n >= 100)
```
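The "count per group, then filter" step can be checked on a toy table. The sketch below uses base R rather than dplyr so it runs without extra packages; the data and the `min_comments` threshold of 2 are made up to keep the example small.

```r
# Toy comments table: video "a" has 3 comments (one empty), video "b" has 1
comments <- data.frame(
  video_id = c("a", "a", "a", "b"),
  com_text = c("wow", "", "nice", "hmm"),
  stringsAsFactors = FALSE
)

# Drop empty comments first, as in the pipeline in the post
comments <- comments[comments$com_text != "", ]

# ave() repeats each group's size on every row of that group,
# which plays the role of group_by() + mutate(n = n())
comments$n <- ave(seq_along(comments$video_id), comments$video_id, FUN = length)

min_comments <- 2   # the post uses 100; 2 keeps the toy example small
kept <- comments[comments$n >= min_comments, ]
kept
```

Note that the count is taken after empty comments are removed, so a video can fall below the threshold because of them.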

And looking at the first rows we can already see some of that passion I was talking about 😳

```r
datatable(head(yt_comments[, c(7, 9, 3)], 30), rownames = FALSE,
          options = list(pageLength = 5)) %>%
  formatStyle(c(1:3), `font-size` = '15px')
```

[Interactive table of the first 30 comments, with columns short_title, vid_created and com_text.]

Most used words and sentiment

In the tidy text world, a tidy dataset is a table with one token per row. I start by tidying the yt_comments dataframe and removing the stop words (the stop_words dataset is already included in the tidytext package).

```r
library(tidytext)

tidy_yt_comments <- yt_comments %>%
  tidytext::unnest_tokens(word, com_text) %>%
  anti_join(stop_words, by = "word")
```
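If the one-token-per-row idea is new to you, this rough base R sketch shows the shape of the result on two made-up comments. It only approximates what unnest_tokens does, and `toy_stop_words` is a tiny invented list, not the stop_words dataset from tidytext.

```r
# Two made-up comments
comments <- data.frame(
  com_id   = c(1, 2),
  com_text = c("This show is GREAT!", "The worst, the absolute worst."),
  stringsAsFactors = FALSE
)

# Tiny illustrative stop list; tidytext's stop_words is far larger
toy_stop_words <- c("this", "is", "the")

# Lowercase, then split on anything that is not a letter or apostrophe
tokens <- strsplit(tolower(comments$com_text), "[^a-z']+")

# One row per token, keeping track of the comment it came from
tidy <- data.frame(
  com_id = rep(comments$com_id, lengths(tokens)),
  word   = unlist(tokens),
  stringsAsFactors = FALSE
)

# Drop empty strings and stop words (the anti_join step)
tidy <- tidy[tidy$word != "" & !(tidy$word %in% toy_stop_words), ]
tidy
```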

Positive and Negative words in comments

I’m using the bing lexicon, which categorizes each word as positive or negative, to evaluate the emotion in the words. I join the words in the tidy_yt_comments dataset with the sentiments in the bing lexicon, and then count how many times each word appears.
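Here is a minimal base R sketch of that join-and-count idea. The `tokens` and `toy_lexicon` tables are invented for illustration and are not the real bing data; `merge()` and `aggregate()` stand in for the inner join and the count.

```r
# Made-up tokens and a made-up two-column lexicon (NOT the real bing data)
tokens <- data.frame(word = c("great", "worst", "great", "show", "worst", "great"),
                     stringsAsFactors = FALSE)
toy_lexicon <- data.frame(word      = c("great", "worst"),
                          sentiment = c("positive", "negative"),
                          stringsAsFactors = FALSE)

# merge() with its default all = FALSE acts as an inner join:
# "show" is dropped because it is not in the lexicon
scored <- merge(tokens, toy_lexicon, by = "word")

# Count occurrences per word and sentiment, most frequent first
counts <- aggregate(list(n = rep(1L, nrow(scored))),
                    by = scored[c("word", "sentiment")], FUN = sum)
counts <- counts[order(-counts$n), ]
counts
```

Words that are missing from the lexicon simply disappear after the join, so the counts only describe the part of the vocabulary the lexicon knows about.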

So let’s find out the most used words in the comments!

<span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">

</span><span class="n">yt_pos_neg_words</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tidy_yt_comments</span><span class="w"> </span><span class="o">%>%</span><span class="w">  
  </span><span class="n">inner_join</span><span class="p">(</span><span class="n">get_sentiments</span><span class="p">(</span><span class="s2">"bing"</span><span class="p">),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"word"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">count</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">sort</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">sentiment</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">top_n</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">word</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">nn</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">word</span><span class="p">,</span><span class="w"> </span><span class="n">nn</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sentiment</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_col</span><span class="p">(</span><span class="n">show.legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_manual</span><span class="p">(</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"red2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"green3"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="n">sentiment</span><span class="p">,</span><span class="w"> </span><span class="n">scales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free_y"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ylim</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">2500</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">coord_flip</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_minimal</span><span class="p">()</span><span class="w">
</span>

There are a lot of strong words here! And I’m pretty sure this trump positive word we are seeing is not quite the same Trump John has been talking about non-stop for the last two years… and not precisely in a positive way… I could include this word in a custom_stop_words dataframe, but I’m going to leave it like that for now.
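For reference, a custom stop-word list like the one mentioned above could be built along these lines. This is only a sketch: the name custom_stop_words comes from the text, while the "custom" lexicon label and the toy tokens are assumptions made for illustration.

```r
library(dplyr)
library(tidytext)

# Add "trump" to tidytext's stop_words; the "custom" label is arbitrary
custom_stop_words <- bind_rows(
  tibble(word = "trump", lexicon = "custom"),
  stop_words
)

# Hypothetical tokens to show the effect
toy_words <- tibble(word = c("trump", "great", "the"))

toy_filtered <- toy_words %>%
  # removes both the regular stop words and the custom addition
  anti_join(custom_stop_words, by = "word")

toy_filtered
```

Using custom_stop_words in place of stop_words in the anti_join would remove the word before any sentiment is attached to it.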

Also… not sure why funny is in the negative category 🤔 I know it can be used as weird or something like that, but I think this happens because I’m not a native English speaker 🤷🏻‍♀️

Are there more positive or negative words?

<span class="n">tidy_yt_comments</span><span class="w"> </span><span class="o">%>%</span><span class="w">
  </span><span class="n">inner_join</span><span class="p">(</span><span class="n">get_sentiments</span><span class="p">(</span><span class="s2">"bing"</span><span class="p">),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"word"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">count</span><span class="p">(</span><span class="n">word&...

To leave a comment for the author, please follow the link and comment on their blog: d4tagirl.
