Using the tuber package to analyse a YouTube channel

[This article was first published on R – insightR, and kindly contributed to R-bloggers.]

By Gabriel Vasconcelos

I decided to have a quick look at the tuber package, which extracts YouTube data in R. My cousin is a singer (a hell of a good one) and he has a YouTube channel (dan vasc), which I strongly recommend, where he posts his covers. I will focus my analysis on his channel. The tuber package is very user-friendly: it downloads YouTube statistics on comments, views, likes and more straight into R using the YouTube API.

First, let us look at some general information on the channel (the code for replication is at the end of the text). The table below shows the number of subscribers, views, videos, etc. at the moment I downloaded the data (2017-12-12, 11:20 pm). If you run the code on your computer, the results may differ because the channel will have had more activity since then. Dan’s channel is getting close to 1 million views and he has 58 times more likes than dislikes. The channel averages about 13,000 views per video.


Channel   Subscriptions   Views  Videos  Likes  Dislikes  Comments
Dan Vasc           5127  743322      57   9008       155      1993
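
The two ratios quoted above follow directly from the summary row. Here is a quick check in R, assuming the row reads as 743322 views, 57 videos, 9008 likes and 155 dislikes (the variable names are only illustrative):

```r
# Channel totals as read from the summary row (snapshot of 2017-12-12)
views    <- 743322
videos   <- 57
likes    <- 9008
dislikes <- 155

# Likes per dislike and average views per video
like_dislike_ratio <- likes / dislikes
views_per_video    <- views / videos

print(round(like_dislike_ratio, 1))  # roughly 58 likes for every dislike
print(round(views_per_video))        # roughly 13000 views per video
```

Note that in the replication code the likes and dislikes are summed over the selected videos only, while the view count comes from the channel statistics, so the two ratios are not computed over exactly the same set of videos.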


We can also see some of the same statistics for each video. I selected only videos published after January 2016, which is when the channel became more active. The list has 29 videos. You can see that the channel became even more active in 2017, and in the last month it moved to weekly publications.


Date        Views  Likes  Dislikes  Comments  Title
2016-03-09  95288   1968        53       371  “Heart Of Steel” – MANOWAR cover
2016-05-09  13959    556         6        85  “The Sound Of Silence” – SIMON & GARFUNKEL / DISTURBED cover
2016-07-04   9390    375         6        70  One Man Choir – Handel’s Hallelujah
2016-08-16  19146    598        12        98  “Carry On” – MANOWAR cover
2016-09-12   2524    142         0        21  “You Are Loved (Don’t Give Up)” – JOSH GROBAN cover
2016-09-26   6584    310         4        58  “Hearts On Fire” – HAMMERFALL cover
2016-10-26  10335    354         5        69  “Dawn Of Victory” – RHAPSODY OF FIRE cover
2017-04-28   9560    396         5        89  “I Don’t Wanna Miss A Thing” – AEROSMITH cover
2017-05-09    906     99         1        40  State of affairs
2017-05-26   2862    160         4        39  “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (Japanese)
2017-05-26   3026    235         3        62  “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (Português)
2017-05-26   2682    108         2        14  “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (English)
2017-06-17    206     16         0         2  Promotional Live || Q&A and video games
2017-07-18   3368    303         2        47  “The Bard’s Song” – BLIND GUARDIAN cover (SPYGLASS INN project)
2017-07-23   6717    350        14        51  “Numb” – LINKIN PARK cover (R.I.P. CHESTER)
2017-08-04    515     69         1        23  4000 Subscribers and Second Channel
2017-08-10   6518    365         2       120  “Hello” – ADELE cover [ROCK VERSION]
2017-08-27   2174    133         5        28  “The Rains Of Castamere” (The Lannister Song) – GAME OF THRONES cover
2017-08-31  18251    642        10       172  “Africa” – TOTO cover
2017-09-24   2562    236         6        56  “Chop Suey!” – SYSTEM OF A DOWN cover
2017-10-09   1348    168         1        48  “An American Trilogy” – ELVIS PRESLEY cover
2017-11-08   2270    192         2        59  “Beauty And The Beast” – Main Theme cover | Feat. Alina Lesnik
2017-11-16   2589    339         3        95  “Bohemian Rhapsody” – QUEEN cover
2017-11-23   1857    209         2        42  “The Phantom Of The Opera” – NIGHTWISH/ANDREW LLOYD WEBBER cover | Feat. Dragica
2017-11-24   2202    207         2        56  “Back In Black” – AC/DC cover (RIP MALCOLM YOUNG)
2017-11-30   3002    204         1        62  “Immigrant Song” – LED ZEPPELIN cover
2017-12-07   1317    201         2        86  “Sweet Child O’ Mine” – GUNS N’ ROSES cover


Now that we have seen the data, let’s explore it to look for structure. The plots below show how likes, dislikes and comments are related to views. The positive relation is obvious. However, there is some degree of nonlinearity in likes and comments: the increment in likes and comments becomes smaller as views increase. Dislikes look more linear in views, but the number of dislikes is too small to be sure.

[Figure: likes, dislikes and comments plotted against views]
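
One rough way to put a number on this nonlinearity is a log-log regression, which is not part of the original analysis. The sketch below uses the (views, likes) pairs from the first eight rows of the per-video table, as I read them, and fits log(likes) on log(views). A slope between 0 and 1 means likes grow with views but less than proportionally.

```r
# (views, likes) pairs for the first eight videos, as read from the table above
views <- c(95288, 13959, 9390, 19146, 2524, 6584, 10335, 9560)
likes <- c(1968, 556, 375, 598, 142, 310, 354, 396)

# Fit log(likes) = a + b * log(views); b is the elasticity of likes to views
fit   <- lm(log(likes) ~ log(views))
slope <- unname(coef(fit)[2])

# A slope between 0 and 1 indicates diminishing increments in likes
print(round(slope, 2))
```

With only eight points this is an illustration rather than an estimate. It also includes the oldest video, which the replication code leaves out of the scatter plots.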

Another interesting piece of information is how comments are distributed over time for each video. I selected the four most recent videos and plotted the comment time series below. All videos have a lot of activity in the first days, but it drops off quickly a few days later. Subscribers probably see the videos first, and they are likely responsible for the intense activity at the beginning of each plot.

[Figure: comments per day for the four most recent videos]
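
The counting step behind these time series can be tried without API access. The sketch below uses base R and made-up timestamps in the ISO 8601 style of the `publishedAt` field. The data are hypothetical, and the replication code does the same job with dplyr and lubridate.

```r
# Hypothetical comment timestamps in the style of a `publishedAt` field
published_at <- c(
  "2017-12-07T18:02:11.000Z", "2017-12-07T19:45:30.000Z",
  "2017-12-07T23:10:05.000Z", "2017-12-08T08:15:42.000Z",
  "2017-12-08T21:03:17.000Z", "2017-12-10T12:00:00.000Z"
)

# Keep only the date part (first 10 characters) and count comments per day
comment_dates <- as.Date(substr(published_at, 1, 10))
daily <- as.data.frame(table(date = comment_dates), stringsAsFactors = FALSE)
daily$date <- as.Date(daily$date)
names(daily)[names(daily) == "Freq"] <- "n"

print(daily)
```

Days without any comment (2017-12-09 here) do not appear in the result, so a line plot connects the neighbouring days directly.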

The most important information might be how the channel grows over time. Dan’s channel had two important moments in 2017: it became much more active in April and it started publishing weekly in November. The plot below clearly shows that both strategies worked. I added two dashed lines to mark these two events. In April the number of comments increased a lot, and it increased even more in November.

[Figure: comments per day over time, with dashed lines marking the two events]
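
A simple numerical companion to the plot is the average number of daily comments in each of the three periods. The sketch below shows the mechanics on hypothetical counts (not the channel's real data), using the two event dates that the replication code marks with dashed lines.

```r
# Hypothetical daily comment counts around the two events
daily <- data.frame(
  date = as.Date(c("2017-03-01", "2017-04-10", "2017-05-05", "2017-08-20",
                   "2017-11-15", "2017-12-01")),
  n    = c(2, 4, 9, 11, 25, 31)
)

# Event dates used for the dashed lines in the replication code
events <- as.Date(c("2017-04-28", "2017-11-08"))

# Label each day by the period it falls in, then average the counts per period
daily$period <- cut(
  as.numeric(daily$date),
  breaks = c(-Inf, as.numeric(events), Inf),
  labels = c("before 2017-04-28", "2017-04-28 to 2017-11-07", "from 2017-11-08"),
  right  = FALSE
)
period_means <- tapply(daily$n, daily$period, mean)

print(period_means)
```

Because `right = FALSE`, each event date belongs to the period it opens.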

Finally, let’s have a look at what is in the comments using a word cloud (wordcloud package). I removed uninformative words such as “you”, “was”, “is” and “were”, in both English and Portuguese. The result is just below.

[Figure: word cloud of the comments]
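
The word counts that feed a word cloud can be sketched in base R on a few made-up comments. Both the comments and the short stopword list below are illustrative assumptions. The replication code uses stringi instead, which also handles accented Portuguese characters.

```r
# Hypothetical comments (the real ones come from the YouTube API)
comments_text <- c(
  "Amazing voice, amazing cover!",
  "This cover is amazing, you are great",
  "Great voice and great cover"
)

# Words considered uninformative; an illustrative subset only
stopwords <- c("you", "the", "and", "this", "that", "are", "for")

# Lower-case, split on anything that is not a letter, then drop short words
# and stopwords
words <- unlist(strsplit(tolower(comments_text), "[^a-z]+"))
words <- words[nchar(words) >= 3 & !(words %in% stopwords)]

# Frequency table sorted from most to least common
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

The resulting table can then be passed to the plotting function, for example `wordcloud(names(freq), as.vector(freq))`.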


Before using the tuber package you need an ID and a password (an OAuth client ID and client secret) from the Google Developer Console. If you are interested, the tubern package has other tools for working with YouTube data, such as generating reports.
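
To keep the ID and password out of the script, one option is to read them from environment variables before calling `yt_oauth()`. The sketch below is only a suggestion: the helper `read_youtube_credentials()` and the variable names `YOUTUBE_APP_ID` and `YOUTUBE_APP_SECRET` are my own assumptions, not part of tuber.

```r
# Hypothetical helper: read the OAuth client ID and secret from environment
# variables (the variable names are an assumption, not a tuber convention)
read_youtube_credentials <- function() {
  app_id     <- Sys.getenv("YOUTUBE_APP_ID")
  app_secret <- Sys.getenv("YOUTUBE_APP_SECRET")
  if (app_id == "" || app_secret == "") {
    stop("Set YOUTUBE_APP_ID and YOUTUBE_APP_SECRET before authenticating")
  }
  list(app_id = app_id, app_secret = app_secret)
}

# Authenticate only in an interactive session with tuber installed,
# because the OAuth flow needs the user to grant access in a browser
if (interactive() && requireNamespace("tuber", quietly = TRUE)) {
  creds <- read_youtube_credentials()
  tuber::yt_oauth(app_id = creds$app_id, app_secret = creds$app_secret)
}
```

The variables can be set in `~/.Renviron` or with `Sys.setenv()` at the start of a session.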


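# = Packages = #
library(tuber)      # YouTube API client
library(tidyverse)  # dplyr, tibble, ggplot2
library(lubridate)  # year(), month(), day(), make_date()
library(stringi)    # stri_trans_general(), stri_extract_all_words()
library(wordcloud)  # wordcloud()
library(gridExtra)  # grid.arrange()

# = Works around some certificate problems on Linux by disabling SSL peer verification = #
httr::set_config(httr::config(ssl_verifypeer = 0L))

# = Authentication = #
# Replace "ID" and "PASS" with your own credentials from the Google Developer Console
yt_oauth("ID",
         "PASS", token = "")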

# = Download and prepare data = #

# = Channel stats = #
chstat = get_channel_stats("UCbZRdTukTCjFan4onn04sDA")

# = Videos = #
videos = yt_search(term="", type="video", channel_id = "UCbZRdTukTCjFan4onn04sDA")
videos = videos %>%
  mutate(date = as.Date(publishedAt)) %>%
  filter(date > "2016-01-01") %>%
  arrange(date) # oldest first, so the most recent videos are the last rows

# = Comments = #
comments = lapply(as.character(videos$video_id), function(x){
  get_comment_threads(c(video_id = x), max_results = 1000)
})

# = Prep the data = #
# = Video Stat Table = #
videostats = lapply(as.character(videos$video_id), function(x){
  get_stats(video_id = x)
})
# Stack the per-video statistics into a single data frame
videostats = do.call(rbind.data.frame, videostats)
videostats$title = videos$title
videostats$date = videos$date
videostats = select(videostats, date, title, viewCount, likeCount, dislikeCount, commentCount) %>%
  as_tibble() %>%
  mutate(viewCount = as.numeric(as.character(viewCount)),
         likeCount = as.numeric(as.character(likeCount)),
         dislikeCount = as.numeric(as.character(dislikeCount)),
         commentCount = as.numeric(as.character(commentCount)))

# = General Stat Table = #
genstat = data.frame(Channel="Dan Vasc", Subscriptions=chstat$statistics$subscriberCount,
                   Views = chstat$statistics$viewCount,
                   Videos = chstat$statistics$videoCount, Likes = sum(videostats$likeCount),
                   Dislikes = sum(videostats$dislikeCount), Comments = sum(videostats$commentCount))

# = videostats Plot = #
p1 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = likeCount))
p2 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = dislikeCount))
p3 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = commentCount))
grid.arrange(p1, p2, p3, ncol = 2)

# = Comments TS = #
comments_ts = lapply(comments, function(x){
  as.Date(x$publishedAt) # date of each comment
})
comments_ts = tibble(date = as.Date(Reduce(c, comments_ts))) %>%
  group_by(date) %>% count()
ggplot(data = comments_ts) + geom_line(aes(x = date, y = n)) +
  geom_smooth(aes(x = date, y = n), se = FALSE) + ggtitle("Comments by day")+
  geom_vline(xintercept = as.numeric(as.Date("2017-11-08")), linetype = 2,color = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2017-04-28")), linetype = 2,color = "red")

# = Comments by video = #
selected = (nrow(videostats) - 3):nrow(videostats)
top4 = videostats$title[selected]
top4comments = comments[selected]

p = list()
for(i in 1:4){
  df = top4comments[[i]]
  df$date = as.Date(df$publishedAt)
  df = df %>%
    arrange(date) %>%
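    group_by(year(date), month(date), day(date)) %>%
    count() # comments per day, stored in column n
  df$date = make_date(df$`year(date)`, df$`month(date)`, df$`day(date)`)
  p[[i]] = ggplot(data = df) + geom_line(aes(x = date, y = n)) + ggtitle(top4[i])
}
do.call(grid.arrange, p) # draw the four plots in one grid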

## = WordClouds = ##
comments_text = lapply(comments,function(x){
  as.character(x$textOriginal) # raw text of each comment
})
comments_text = tibble(text = Reduce(c, comments_text)) %>%
  mutate(text = stri_trans_general(tolower(text), "Latin-ASCII"))
remove = c("you","the","que","and","your","muito","this","that","are","for","cara") # add further English and Portuguese stopwords as needed
words = tibble(word = Reduce(c, stri_extract_all_words(comments_text$text))) %>%
  group_by(word) %>% count() %>% arrange(desc(n)) %>% filter(nchar(word) >= 3) %>%
  filter(n > 10 & word %in% remove == FALSE) 

wordcloud(words$word, words$n, random.order = FALSE, random.color = TRUE,
          rot.per = 0.3, colors = 1:nrow(words))
