TV Shows on the “Big 3” Streaming Services

Posted on August 10, 2020 by Unknown in R bloggers

[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers].

2020 has been a tough year, and I’ve been doing my best to keep busy (and distracted from all the insanity – both at the personal and worldwide levels). Earlier this year, I took a course in machine learning techniques and have been working on applying those techniques to work datasets, as well as fun sets through Kaggle.com.

Today, I thought I’d share another dataset I discovered through Kaggle: TV shows available on one or more streaming services (Netflix, Hulu, Prime, and Disney+). There are lots of fun things we could do with this dataset. Let’s start with some basic visualization and summarization.

setwd("~/Dropbox")

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Shows <- read_csv("tv_shows.csv")

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   Title = col_character(),
##   Year = col_double(),
##   Age = col_character(),
##   IMDb = col_double(),
##   `Rotten Tomatoes` = col_character(),
##   Netflix = col_double(),
##   Hulu = col_double(),
##   `Prime Video` = col_double(),
##   `Disney+` = col_double(),
##   type = col_double()
## )

First, we can do some basic summaries, such as how many shows in the dataset are on each of the streaming services.

Counts <- Shows %>%
  summarise(Netflix = sum(Netflix),
            Hulu = sum(Hulu),
            Prime = sum(`Prime Video`),
            Disney = sum(`Disney+`)) %>%
  pivot_longer(cols = Netflix:Disney,
               names_to = "Service",
               values_to = "Count")

Counts %>%
  ggplot(aes(Service,Count)) +
  geom_col()
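To see what this summarise-then-pivot step produces, here is a minimal sketch on a made-up four-show table (the titles and 0/1 values in `toy_shows` are hypothetical, not from the Kaggle file). Because each service column is a 0/1 indicator, summing it counts the shows on that service.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per show, 1 = available on that service
toy_shows <- tibble(
  Title         = c("Show A", "Show B", "Show C", "Show D"),
  Netflix       = c(1, 0, 1, 1),
  Hulu          = c(0, 1, 1, 0),
  `Prime Video` = c(0, 0, 1, 1)
)

# Sum each indicator column, then reshape to one row per service
toy_counts <- toy_shows %>%
  summarise(Netflix = sum(Netflix),
            Hulu = sum(Hulu),
            Prime = sum(`Prime Video`)) %>%
  pivot_longer(cols = Netflix:Prime,
               names_to = "Service",
               values_to = "Count")

print(toy_counts)
```

The long format (one `Service` column, one `Count` column) is what lets `ggplot()` map the service to the x-axis and the count to the bar height.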

The biggest selling point of Disney+ is its movies, though the few TV shows it offers can't really be viewed elsewhere (e.g., The Mandalorian). For the sake of simplicity, we'll drop Disney+ and focus on the big 3 services for TV shows.
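The post simply ignores the Disney+ column from here on. If you would rather make the drop explicit, one possible approach is sketched below on hypothetical data (`toy_shows` and `big3_shows` are illustrative names): keep only shows on at least one of the three services, then remove the Disney+ indicator.

```r
library(dplyr)

# Hypothetical toy data with the same indicator columns as the real file
toy_shows <- tibble(
  Title         = c("Show A", "Show B", "Show C", "Show D"),
  Netflix       = c(1, 0, 0, 0),
  Hulu          = c(0, 1, 0, 0),
  `Prime Video` = c(0, 0, 0, 1),
  `Disney+`     = c(0, 0, 1, 1)
)

# Keep shows available on Netflix, Hulu, or Prime; drop the Disney+ column
big3_shows <- toy_shows %>%
  filter(Netflix == 1 | Hulu == 1 | `Prime Video` == 1) %>%
  select(-`Disney+`)

print(big3_shows)
```

Here "Show C" is removed because it is only on Disney+, while "Show D" stays because it is also on Prime.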

The dataset also contains an indicator of recommended age, which we can plot.

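# Use levels (not labels) so each rating keeps its own name;
# labels would rename the alphabetically sorted values instead
Shows <- Shows %>%
  mutate(Age = factor(Age,
                      levels = c("all",
                                 "7+",
                                 "13+",
                                 "16+",
                                 "18+"),
                      ordered = TRUE))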

Shows %>%
  ggplot(aes(Age)) +
  geom_bar()
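When building an ordered factor like this, it is worth knowing how `factor()` treats its `levels` and `labels` arguments. The sketch below uses a short hypothetical vector of ratings (`ratings` is not from the dataset) to show the difference.

```r
# Hypothetical ratings, in no particular order
ratings <- c("7+", "all", "16+", "13+", "18+", "7+")

# levels = sets the order of the existing values; each value keeps its name
by_levels <- factor(ratings,
                    levels = c("all", "7+", "13+", "16+", "18+"),
                    ordered = TRUE)

# labels = renames the levels after they have been sorted alphabetically
# ("13+", "16+", "18+", "7+", "all"), so each value ends up with a new name
by_labels <- factor(ratings,
                    labels = c("all", "7+", "13+", "16+", "18+"),
                    ordered = TRUE)

print(as.character(by_levels))
print(as.character(by_labels))
```

With `levels`, the printed values match the input. With `labels` alone, "7+" is relabelled "16+", "all" becomes "18+", and so on, so a quick `table()` comparison against the raw column is a cheap sanity check.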

Shows %>%
  group_by(Age) %>%
  summarise(Count = n(),
            Year_min = min(Year),
            Year_max = max(Year),
            Prime = sum(`Prime Video`)/2144,
            Netflix = sum(Netflix)/1931,
            Hulu = sum(Hulu)/1754)

## Warning: Factor `Age` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## # A tibble: 6 x 7
##   Age   Count Year_min Year_max    Prime Netflix   Hulu
##   <ord> <int>    <dbl>    <dbl>    <dbl>   <dbl>  <dbl>
## 1 all       4     1995     2003 0.000466 0.00155 0
## 2 7+     1018     1955     2020 0.0975   0.206   0.293
## 3 13+     750     1980     2020 0.0849   0.186   0.136
## 4 16+     848     1943     2020 0.104    0.155   0.208
## 5 18+     545     1932     2020 0.0896   0.0886  0.0906
## 6 <NA>   2446     1901     2020 0.623    0.363   0.272
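The divisors 2144, 1931, and 1754 above appear to be the total number of shows on Prime, Netflix, and Hulu (an assumption; the post doesn't say where they come from). A minimal sketch of the same idea without hard-coded totals, using hypothetical data (`toy_shows` and `toy_props` are illustrative names): inside a grouped `summarise()`, `sum(Netflix)` is the per-group count, while `sum(toy_shows$Netflix)` is the overall total.

```r
library(dplyr)

# Hypothetical toy data: age rating plus 0/1 service indicators
toy_shows <- tibble(
  Age     = c("7+", "7+", "16+", NA),
  Netflix = c(1, 1, 0, 1),
  Hulu    = c(0, 1, 1, 0)
)

# Share of each service's catalogue that falls in each age group
toy_props <- toy_shows %>%
  group_by(Age) %>%
  summarise(Count = n(),
            Netflix = sum(Netflix) / sum(toy_shows$Netflix),
            Hulu = sum(Hulu) / sum(toy_shows$Hulu))

print(toy_props)
```

Computed this way, each service column sums to 1 across the age groups (including the `NA` group), and the totals update automatically if the data change.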

YearOutliers <- Shows %>%
  filter(Year < 1940)

list(YearOutliers$Title)

## [[1]]
## [1] "Born To Explore"                    "The Three Stooges"
## [3] "The Little Rascals Classics"        "Space: The New Frontier"
## [5] "Gods & Monsters with Tony Robinson" "History of Westinghouse"
## [7] "Betty Boop"

Netflix <- Shows %>%
  filter(Netflix == 1) %>%
  select(IMDb, `Rotten Tomatoes`) %>%
  mutate(Service = "Netflix")

Hulu <- Shows %>%
  filter(Hulu == 1) %>%
  select(IMDb, `Rotten Tomatoes`) %>%
  mutate(Service = "Hulu")

Prime <- Shows %>%
  filter(`Prime Video` == 1) %>%
  select(IMDb, `Rotten Tomatoes`) %>%
  mutate(Service = "Prime")

BigThree <- rbind(Netflix, Hulu, Prime)

BigThree <- BigThree %>%
  mutate(RotTom = as.numeric(sub("%","",`Rotten Tomatoes`))/100)

BigThree %>%
  ggplot(aes(Service, IMDb)) +
  geom_boxplot()

## Warning: Removed 1194 rows containing non-finite values (stat_boxplot).
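The three filter-and-stack steps above can also be written as a single reshape. This is an alternative sketch on hypothetical data, not the post's own method (`toy_shows`, `toy_big3`, and the `Available` column are illustrative names): pivot the indicator columns to long format, keep the rows flagged 1, and convert the percent strings the same way as above. A show on several services appears once per service, just as it does with `rbind()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data shaped like the real file
toy_shows <- tibble(
  Title             = c("Show A", "Show B", "Show C"),
  IMDb              = c(8.1, 6.5, 7.0),
  `Rotten Tomatoes` = c("93%", NA, "80%"),
  Netflix           = c(1, 0, 1),
  Hulu              = c(1, 0, 1),
  `Prime Video`     = c(0, 1, 1)
)

toy_big3 <- toy_shows %>%
  # One row per show-service pair
  pivot_longer(cols = c(Netflix, Hulu, `Prime Video`),
               names_to = "Service",
               values_to = "Available") %>%
  # Keep only the services each show is actually on
  filter(Available == 1) %>%
  # "93%" -> 0.93; missing ratings stay NA
  mutate(RotTom = as.numeric(sub("%", "", `Rotten Tomatoes`)) / 100) %>%
  select(IMDb, `Rotten Tomatoes`, Service, RotTom)

print(toy_big3)
```

One difference to be aware of: the service label comes from the column name, so it reads "Prime Video" rather than "Prime".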

library(scales)

##
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
##
##     discard

## The following object is masked from 'package:readr':
##
##     col_factor

BigThree %>%
  ggplot(aes(Service, RotTom)) +
  geom_boxplot() +
  scale_y_continuous(labels = percent)

## Warning: Removed 4772 rows containing non-finite values (stat_boxplot).
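The warnings above show that many rows have no rating, so a numeric summary next to the boxplots can help. The sketch below uses hypothetical values (`toy_big3` and `toy_summary` are illustrative names, not results from the dataset) to count missing IMDb scores per service and take the median of the rest.

```r
library(dplyr)

# Hypothetical stacked data: one row per show-service pair
toy_big3 <- tibble(
  Service = c("Netflix", "Netflix", "Hulu", "Hulu", "Prime"),
  IMDb    = c(8.0, NA, 7.0, 6.0, NA)
)

toy_summary <- toy_big3 %>%
  group_by(Service) %>%
  summarise(Shows = n(),
            Missing_IMDb = sum(is.na(IMDb)),         # rows the boxplot would drop
            Median_IMDb = median(IMDb, na.rm = TRUE)) # median of the rated shows

print(toy_summary)
```

A service whose shows are all unrated (Prime in this toy example) gets `NA` for the median, which is a useful flag that its box in the plot rests on little or no data.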