## New Season of Bojack Horseman – NO SPOILERS!

A new season of my favorite show, Bojack Horseman, just dropped on
Netflix, and I have absolutely zero time to sit down and watch it.
However, I somehow magically found the time to write this blog post
and do this analysis. What analysis, you ask? I’ll tell you in a
minute, I ain’t Horsin’ Around.

Bojack Horseman, in case you got stuck in 2013, is an animated
comedy-drama about a washed-up ’90s sitcom star, who is also a horse
(animals are anthropomorphized in the show’s universe). If you follow
Bojack, you know that it isn’t your everyday animated series. It’s deep,
cynical and lights up some of the darkest corners of damaged human
behavior. Moreover, it keeps getting better, or should I say – heavier.

The psychologist in me got triggered: if Bojack is indeed getting deeper
and deeper, perhaps this pattern could be observed in the show’s script?
I mean, would the showrunners use more abstract language, as opposed to
concrete language, as the show progresses? The data scientist in me
said “I’m on it”.

## Scrape Bojack Script and Language Analysis

First, I had to scrape Bojack’s script. I’ll do it with the rvest, xml2 and stringr packages. It’s actually my first time doing web scraping, so apologies for the non-elegant code.

library(xml2)
library(rvest)
library(stringr)

#data frame pre-allocation
m <- matrix(nrow = 5*10, ncol = 3)
colnames(m) <- c("season", "episode", "text")
m <- as.data.frame(m)

count <- 1
for (i in 1:5) {
  for (j in 1:10) {
    # build the episode code with zero-padded season and episode, e.g. s01e01 ... s05e10
    scrappedurl <- paste0(
      "https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=bojack-horseman-2014&episode=",
      sprintf("s%02de%02d", i, j)
    )
    # download and parse the episode page
    html.raw <- read_html(scrappedurl)
    n <- html_nodes(html.raw, "div#content_container")
    txt <- html_text(n)
    # strip carriage returns, newlines and tabs
    txt <- str_replace_all(txt, "[\r\n\t]", "")
    # keep the third piece of the split on "Episode Script"
    t <- str_split(txt, "Episode Script", simplify = TRUE)[3]
    m$season[count] <- i
    m$episode[count] <- j
    m$text[count] <- t
    count <- count + 1
  }
}

library(kableExtra)
#let's see the dataframe
kable(m) %>%
kable_styling() %>%
scroll_box(width = "100%", height = "500px")
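
To see what the cleaning step inside the loop does without hitting the website, here is a minimal sketch that runs the same calls on a toy HTML string. The markup in `toy_html` is made up for illustration and is only assumed to resemble the real page, in that the phrase "Episode Script" appears twice before the script text.

```r
library(xml2)
library(rvest)
library(stringr)

# Toy stand-in for one episode page (hypothetical markup, not the real site's HTML)
toy_html <- paste0(
  '<div id="content_container">',
  '<h1>Toy Show Episode Script</h1>',
  '<h2>Pilot Episode Script</h2>',
  '<p>Line one.\r\n\tLine two.</p>',
  '</div>'
)

page <- read_html(toy_html)                          # parse the literal HTML string
node <- html_nodes(page, "div#content_container")    # select the container div
txt  <- html_text(node)                              # extract all text inside it
txt  <- str_replace_all(txt, "[\r\n\t]", "")         # drop line breaks and tabs

# Split on the marker; simplify = TRUE returns a character matrix with one row
parts  <- str_split(txt, "Episode Script", simplify = TRUE)
script <- parts[3]                                   # third piece = the script text
print(script)
```

Note that replacing line breaks with an empty string glues neighboring lines together ("Line one.Line two."). If word boundaries matter for the later analysis, replacing with a single space instead might be a safer choice.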
