Using the tuber package to analyse a YouTube channel

[This article was first published on R – insightR, and kindly contributed to R-bloggers.]

By Gabriel Vasconcelos

So I decided to have a quick look at the tuber package, which extracts YouTube data in R. My cousin is a singer (a hell of a good one) and he has a YouTube channel (dan vasc), which I strongly recommend, where he posts his covers. I will focus my analysis on his channel. The tuber package is very friendly: it downloads YouTube statistics on comments, views, likes and more straight into R through the YouTube API.

First, let us look at some general information about the channel (the code for replication is at the end of the text). The table below shows the number of subscriptions, views, videos, etc. at the moment I downloaded the data (2017-12-12, 11:20pm). If you run the code on your computer the results may differ because the channel will have had more activity. Subscriptions, views and videos are channel-wide totals, while likes, dislikes and comments are summed over the 29 videos listed further down. Dan’s channel is getting close to 1 million views and he has 58 times as many likes as dislikes. His views ratio is about 13,000 views per video.

Channel   Subscriptions  Views   Videos  Likes  Dislikes  Comments
Dan Vasc  5127           743322  57      9008   155       1993
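
The two ratios quoted above can be checked directly from the table. This is a minimal sketch in base R with the table values typed in by hand; the variable names are mine and are not part of the replication code at the end.

```r
# Channel totals copied by hand from the table above (2017-12-12 snapshot)
views    <- 743322
videos   <- 57
likes    <- 9008
dislikes <- 155

# Likes per dislike and average views per video
likes_per_dislike <- likes / dislikes
views_per_video   <- views / videos

round(likes_per_dislike, 1)
round(views_per_video)
```

Note that the like and dislike totals only cover the videos from 2016 onwards, so the first ratio describes the recent part of the channel.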

 

We can also see some of the same statistics for each video. I selected only videos published after 1 January 2016, which is when the channel became more active. The list has 29 videos. You can see that the channel became even more active in 2017, and in the last month it started publishing weekly.

date        viewCount  likeCount  dislikeCount  commentCount  title
2016-03-09      95288       1968            53           371  “Heart Of Steel” – MANOWAR cover
2016-05-09      13959        556             6            85  “The Sound Of Silence” – SIMON & GARFUNKEL / DISTURBED cover
2016-07-04       9390        375             6            70  One Man Choir – Handel’s Hallelujah
2016-08-16      19146        598            12            98  “Carry On” – MANOWAR cover
2016-09-12       2524        142             0            21  “You Are Loved (Don’t Give Up)” – JOSH GROBAN cover
2016-09-26       6584        310             4            58  “Hearts On Fire” – HAMMERFALL cover
2016-10-26      10335        354             5            69  “Dawn Of Victory” – RHAPSODY OF FIRE cover
2017-04-28       9560        396             5            89  “I Don’t Wanna Miss A Thing” – AEROSMITH cover
2017-05-09        906         99             1            40  State of affairs
2017-05-26       2862        160             4            39  “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (Japanese)
2017-05-26       3026        235             3            62  “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (Português)
2017-05-26       2682        108             2            14  “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (English)
2017-06-14        559         44             1            19  HOW TO BE A YOUTUBE SINGER | ASKDANVASC 01
2017-06-17        206         16             0             2  Promotional Live || Q&A and video games
2017-07-18       3368        303             2            47  “The Bard’s Song” – BLIND GUARDIAN cover (SPYGLASS INN project)
2017-07-23       6717        350            14            51  “Numb” – LINKIN PARK cover (R.I.P. CHESTER)
2017-07-27        305         29             0            11  THE PERFECT TAKE and HOW TO MIX VOCALS | ASKDANVASC 02
2017-08-04        515         69             1            23  4000 Subscribers and Second Channel
2017-08-10       6518        365             2           120  “Hello” – ADELE cover [ROCK VERSION]
2017-08-27       2174        133             5            28  “The Rains Of Castamere” (The Lannister Song) – GAME OF THRONES cover
2017-08-31      18251        642            10           172  “Africa” – TOTO cover
2017-09-24       2562        236             6            56  “Chop Suey!” – SYSTEM OF A DOWN cover
2017-10-09       1348        168             1            48  “An American Trilogy” – ELVIS PRESLEY cover
2017-11-08       2270        192             2            59  “Beauty And The Beast” – Main Theme cover | Feat. Alina Lesnik
2017-11-16       2589        339             3            95  “Bohemian Rhapsody” – QUEEN cover
2017-11-23       1857        209             2            42  “The Phantom Of The Opera” – NIGHTWISH/ANDREW LLOYD WEBBER cover | Feat. Dragica
2017-11-24       2202        207             2            56  “Back In Black” – AC/DC cover (RIP MALCOLM YOUNG)
2017-11-30       3002        204             1            62  “Immigrant Song” – LED ZEPPELIN cover
2017-12-07       1317        201             2            86  “Sweet Child O’ Mine” – GUNS N’ ROSES cover

Now that we have seen the data, let’s explore it to look for structure. The plots below show how likes, dislikes and comments relate to views. The positive relation is obvious. However, there is some degree of nonlinearity in likes and comments: their increments become smaller as views increase. Dislikes look more linear in views, but the number of dislikes is too small to be sure.

[Figure: scatter plots of likes, dislikes and comments against views]
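
One rough way to see the flattening is to compute likes per 1,000 views for a few rows of the video table above. This is a minimal sketch with values typed in by hand; the data frame and column names are mine, and because newer videos have had less time to collect views it is only suggestive.

```r
# A few rows copied by hand from the video table above
videos <- data.frame(
  title = c("Heart Of Steel", "Carry On", "Africa",
            "Bohemian Rhapsody", "Sweet Child O' Mine"),
  views = c(95288, 19146, 18251, 2589, 1317),
  likes = c(1968, 598, 642, 339, 201)
)

# Likes per 1,000 views: falls as views grow if likes flatten out
videos$likes_per_1000_views <- round(1000 * videos$likes / videos$views, 1)

# Show the most viewed videos first
videos[order(videos$views, decreasing = TRUE), ]
```

In this small sample the most viewed video has the lowest like rate, which is consistent with the concave shape in the plot.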

Another interesting piece of information is how comments are distributed over time for each video. I selected the four most recent videos and plotted their comment time series below. All videos have a lot of activity in the first days, but it drops off quickly a few days later. Followers and subscribers probably see the videos first, and they are likely responsible for the intense activity at the beginning of each plot.

[Figure: comments per day for each of the four most recent videos]
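
The daily counts behind such a plot come from truncating each comment timestamp to its date and counting. Below is a minimal base R sketch of that step; the timestamps are made up for illustration (they only mimic the format of the publishedAt field) and the variable names are mine.

```r
# Made-up timestamps, for illustration only
published_at <- c("2017-12-07T18:02:11.000Z", "2017-12-07T19:45:30.000Z",
                  "2017-12-07T23:10:05.000Z", "2017-12-08T08:15:42.000Z",
                  "2017-12-10T14:00:00.000Z")

# as.Date() reads the leading YYYY-MM-DD part and ignores the time of day
comment_day <- as.Date(published_at)

# Count comments per day; days without comments do not appear
daily <- as.data.frame(table(date = comment_day), responseName = "n")
daily$date <- as.Date(as.character(daily$date))
daily
```

Days with no comments are simply missing from the result, so a line plot will connect across those gaps.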

The most important information might be how the channel grows over time. Dan’s channel had two important moments in 2017. It became much more active in April and it started publishing weekly in November. We can clearly see in the plot below that both strategies worked. I put two dashed lines to mark these two events (2017-04-28 and 2017-11-08). In April the number of comments increased a lot, and it increased even more in November.

[Figure: comments by day across all videos, with dashed lines marking the April and November 2017 events]

Finally, let’s have a look at what is in the comments using a word cloud (wordcloud package). I removed words that are not informative, such as “you, was, is, were”, for both English and Portuguese. The result is just below.

[Figure: word cloud of the comments]
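
The counting step behind a word cloud is simple: split the comments into words, drop the uninformative ones and tabulate what is left. Here is a minimal base R sketch of that idea, using made-up comments and a short stop-word list of my own; it skips the accent transliteration done in the replication code.

```r
# Made-up comments, for illustration only
comments <- c("You are amazing, this cover is amazing!",
              "Voz linda, cara. Amazing cover!",
              "This is the best cover of this song")

# A short, hypothetical stop-word list (English and Portuguese)
stop_words <- c("you", "are", "this", "the", "cara", "is", "of")

# Lower-case, then split on anything that is not a letter
tokens <- unlist(strsplit(tolower(comments), "[^[:alpha:]]+"))

# Keep words with at least 3 characters that are not stop words
tokens <- tokens[nchar(tokens) >= 3 & !(tokens %in% stop_words)]

# Word frequencies, most frequent first
word_counts <- sort(table(tokens), decreasing = TRUE)
word_counts
```

The resulting names and counts are the kind of input that wordcloud() takes as its words and frequencies.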

Code

Before using the tuber package you need an ID and a password (an OAuth client ID and client secret) from the Google Developer Console. Click here for more information. If you are interested, the tubern package has some other tools for working with YouTube data, such as generating reports.

library(tuber)
library(tidyverse)
library(lubridate)
library(stringi)
library(wordcloud)
library(gridExtra)

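# = Works around some certificate problems on Linux = #
# Note: this turns off SSL peer verification for all httr requests in the session.
httr::set_config(httr::config(ssl_verifypeer = 0L))

# = Authentication = #
# Replace "ID" and "PASS" with your OAuth client ID and client secret.
yt_oauth("ID",
         "PASS", token = "")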

# = Download and prepare data = #

# = Channel stats = #
chstat = get_channel_stats("UCbZRdTukTCjFan4onn04sDA")

# = Videos = #
videos = yt_search(term="", type="video", channel_id = "UCbZRdTukTCjFan4onn04sDA")
videos = videos %>%
  mutate(date = as.Date(publishedAt)) %>%
  filter(date > "2016-01-01") %>%
  arrange(date)

# = Comments = #
comments = lapply(as.character(videos$video_id), function(x){
  get_comment_threads(c(video_id = x), max_results = 1000)
})

# = Prep the data = #
# = Video Stat Table = #
videostats = lapply(as.character(videos$video_id), function(x){
  get_stats(video_id = x)
})
videostats = do.call(rbind.data.frame, videostats)
videostats$title = videos$title
videostats$date = videos$date
videostats = select(videostats, date, title, viewCount, likeCount, dislikeCount, commentCount) %>%
  as_tibble() %>%
  mutate(viewCount = as.numeric(as.character(viewCount)),
         likeCount = as.numeric(as.character(likeCount)),
         dislikeCount = as.numeric(as.character(dislikeCount)),
         commentCount = as.numeric(as.character(commentCount)))

# = General Stat Table = #
genstat = data.frame(Channel="Dan Vasc", Subscriptions=chstat$statistics$subscriberCount,
                   Views = chstat$statistics$viewCount,
                   Videos = chstat$statistics$videoCount, Likes = sum(videostats$likeCount),
                   Dislikes = sum(videostats$dislikeCount), Comments = sum(videostats$commentCount))

# = videostats Plot = #
p1 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = likeCount))
p2 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = dislikeCount))
p3 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = commentCount))
grid.arrange(p1, p2, p3, ncol = 2)

# = Comments TS = #
comments_ts = lapply(comments, function(x){
  as.Date(x$publishedAt)
})
comments_ts = tibble(date = as.Date(Reduce(c, comments_ts))) %>%
  group_by(date) %>% count()
ggplot(data = comments_ts) + geom_line(aes(x = date, y = n)) +
  geom_smooth(aes(x = date, y = n), se = FALSE) + ggtitle("Comments by day")+
  geom_vline(xintercept = as.numeric(as.Date("2017-11-08")), linetype = 2,color = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2017-04-28")), linetype = 2,color = "red")

# = Comments by video (four most recent) = #
selected = (nrow(videostats) - 3):nrow(videostats)
top4 = videostats$title[selected]
top4comments = comments[selected]

p = list()
for(i in 1:4){
  df = top4comments[[i]]
  df$date = as.Date(df$publishedAt)
  df = df %>%
    arrange(date) %>%
    group_by(year(date), month(date), day(date)) %>%
    count()
  df$date = make_date(df$`year(date)`, df$`month(date)`,df$`day(date)`)
  p[[i]] = ggplot(data=df) + geom_line(aes(x = date, y = n)) + ggtitle(top4[i])
}
do.call(grid.arrange,p)

## = WordClouds = ##
comments_text = lapply(comments,function(x){
  as.character(x$textOriginal)
})
comments_text = tibble(text = Reduce(c, comments_text)) %>%
  mutate(text = stri_trans_general(tolower(text), "Latin-ASCII"))
remove = c("you","the","que","and","your","muito","this","that","are","for","cara",
         "from","very","like","have","voce","man","one","nao","com","with","mais",
         "was","can","uma","but","ficou","meu","really","seu","would","sua","more",
         "it's","it","is","all","i'm","mas","como","just","make","what","esse","how",
         "por","favor","sempre","time","esta","every","para","i've","tem","will",
         "you're","essa","not","faz","pelo","than","about","acho","isso",
         "way","also","aqui","been","out","say","should","when","did","mesmo",
         "minha","next","cha","pra","sei","sure","too","das","fazer","made",
         "quando","ver","cada","here","need","ter","don't","este","has","tambem",
         "una","want","ate","can't","could","dia","fiquei","num","seus","tinha","vez",
         "ainda","any","dos","even","get","must","other","sem","vai","agora","desde",
         "dessa","fez","many","most","tao","then","tudo","vou","ficaria","foi","pela",
         "see","teu","those","were")
words = tibble(word = Reduce(c, stri_extract_all_words(comments_text$text))) %>%
  group_by(word) %>% count() %>% arrange(desc(n)) %>% filter(nchar(word) >= 3) %>%
  filter(n > 10 & !(word %in% remove))

set.seed(3)
wordcloud(words$word, words$n, random.order = FALSE, random.color = TRUE,
          rot.per = 0.3, colors = 1:nrow(words))
