Gender diversity in the TV series industry

Posted on August 10, 2018 by Caroline Barret in R bloggers | 0 Comments

[This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In a previous post, I studied gender diversity in the film industry, I did this by focusing on some key behind-the-camera roles and measuring the evolution of the gender diversity in the last decade. The conclusion was not great: women are under-represented, especially in the most important roles of directors and writers, as these key roles determine the way women are portrayed in front of the camera.

I was curious about the TV series industry too: as the TV series industry is faster paced than the movie industry, might they might be more open to women? I decided to have a look.

In this post, as in the film industry post, the behind-the-camera roles I studied were: directors, writers, producers, sound teams, music teams, art teams, makeup teams and costume teams.

The whole code to reproduce the following results is available on GitHub.

Data Frame Creation – Web Scraping

All the data I used was gathered from the IMDb website: I went through the 100 Most Popular TV Shows (according to the IMDb ratings), and gathered some useful information about these 100 series: I built a data frame which contains the titles of these series, their years of release and their IMDb episode links – the link where we can find all the episodes of a series.

# IMDb 100 most popular TV shows ------------------------------

url <- "https://www.imdb.com/chart/tvmeter?sort=us,desc&mode=simple&page=1"
page <- read_html(url)

serie_nodes <- html_nodes(page, '.titleColumn') %>%
  as_list()

# Series details
serie_name <- c()
serie_link <- c()
serie_year <- c()
for (i in seq_along(serie_nodes)){
  serie_name <- c(serie_name, serie_nodes[[i]]$a[[1]])
  serie_link <- c(serie_link, attr(serie_nodes[[i]]$a, "href"))
  serie_year <- c(serie_year, serie_nodes[[i]]$span[[1]])
}
serie_link <- paste0("http://www.imdb.com",serie_link)
serie_year <- gsub("[()]", "", serie_year)
serie_episodelist <- sapply(strsplit(serie_link, split='?', fixed=TRUE),
                            function(x) (x[1])) %>%
  paste0("episodes?ref_=tt_eps_yr_mr")


# Create dataframe ----------------------------------------------
top_series <- data.frame(serie_name, serie_year, serie_episodelist, stringsAsFactors = FALSE)


# series_year was the date of 1st release but we needed the years of release for all the episodes
# I did not manage to gather this information by doing some web scraping.
# I added it manually as it is available on the IMDb episodes links (column serie_episodelist)
top_series[20:30, ]
##                        serie_name       serie_year
## 20                         Legion             2017
## 21 A Series of Unfortunate Events       2017, 2018
## 22                       Timeless 2016, 2017, 2018
## 23                      Westworld       2016, 2018
## 24                      Luke Cage             2016
## 25                       MacGyver 2016, 2017, 2018
## 26                  Lethal Weapon 2016, 2017, 2018
## 27            Designated Survivor 2016, 2017, 2018
## 28                           Bull 2016, 2017, 2018
## 29                     This Is Us 2016, 2017, 2018
## 30                        Atlanta       2016, 2018
##                                                 serie_episodelist
## 20 http://www.imdb.com/title/tt5114356/episodes?ref_=tt_eps_yr_mr
## 21 http://www.imdb.com/title/tt4834206/episodes?ref_=tt_eps_yr_mr
## 22 http://www.imdb.com/title/tt5511582/episodes?ref_=tt_eps_yr_mr
## 23 http://www.imdb.com/title/tt0475784/episodes?ref_=tt_eps_yr_mr
## 24 http://www.imdb.com/title/tt3322314/episodes?ref_=tt_eps_yr_mr
## 25 http://www.imdb.com/title/tt1399045/episodes?ref_=tt_eps_yr_mr
## 26 http://www.imdb.com/title/tt5164196/episodes?ref_=tt_eps_yr_mr
## 27 http://www.imdb.com/title/tt5296406/episodes?ref_=tt_eps_yr_mr
## 28 http://www.imdb.com/title/tt5827228/episodes?ref_=tt_eps_yr_mr
## 29 http://www.imdb.com/title/tt5555260/episodes?ref_=tt_eps_yr_mr
## 30 http://www.imdb.com/title/tt4288182/episodes?ref_=tt_eps_yr_mr

The series_year column often contains several years. For example, for the series called “This is us”, it means that episodes have been released in 2016, 2017 and 2018. This column will allow me to split the episodes by year of release, and then visualise the gender diversity of the crew for each year.

List Creation – Web Scraping

At this stage, I just had some global information on the 100 series. The next step was to go through the IMDb links gathered in the column series_episodelist of my top_series data frame, which gives me access to all the series episodes split by year of release. I did some web scraping on these links and built a list which gathered:

the names of the 100 most popular TV shows
for each series, the different years of release
for each year, the names of the episodes which have been released
for each episode, the names of the people whose job was included in one of the categories I listed above (directors, writers, …, costume teams)

### Create series list

series_list <- list()

# FOCUS ON EACH SERIES -----------------------------------------------------------------
for (r in seq_len(nrow(top_series))) { 
  
  serie_name <- top_series[r, "serie_name"]
  print(serie_name)
  
  # Years of release for each serie
  list_serieyear <- as.list(strsplit(top_series[r, "serie_year"], split = ", ")[[1]]) 
  # List of IMDb links where we find all the episodes per year of release
  link_episodelist_peryear <- list() 
  
  episodes_list_peryear <- list()
  
  # FOCUS ON EACH YEAR OF REALEASE FOR THIS SERIE -------------------------------------
  for (u in seq_along(list_serieyear)){ 

    year <- list_serieyear[[u]]
    print(year)
    
    link_episodelist_yeari <- strsplit(top_series[r, "serie_episodelist"], split='?', fixed=TRUE)[[1]][1] %>%
      paste0("?year=", year, collapse = "")
    link_episodelist_peryear[[u]] <- link_episodelist_yeari
    
    # FOCUS ON EACH EPISODE FOR THIS YEAR OF RELEASE ----------------------------------
    for (l in seq_along(link_episodelist_peryear)){ 
      
      page <- read_html(link_episodelist_peryear[[l]]) 
      episodes_nodes <- html_nodes(page, '.info') %>%
        as_list()
      
      episode_name <- c()
      episode_link <- c()
      
      for (t in seq_along(episodes_nodes)){
        episode_name <- c(episode_name, episodes_nodes[[t]]$strong$a[[1]])
        episode_link <- c(episode_link, attr(episodes_nodes[[t]]$strong$a, "href"))
      }
      
      episode_link <- paste0("http://www.imdb.com",episode_link)
      episode_link <- sapply(strsplit(episode_link, split='?', fixed=TRUE), 
                             function(x) (x[1])) %>%
        paste0("fullcredits?ref_=tt_ql_1")
      
      episode_name <- sapply(episode_name, 
                             function(x) (gsub(pattern = "\\#", replacement = "", x)))  %>% # some names = "Episode #1.1"
        as.character()
      
      # GATHER THE NAME OF THE EPISODE, ITS YEAR OF RELEASE AND ITS FULL CREW LINK ----
      episodes_details_peryear <- data.frame(year = year,
                                             episode_name = episode_name,
                                             episode_link = episode_link,
                                             stringsAsFactors = FALSE)
    }
    
    # FOCUS ON EACH FULL CREW LINK ----------------------------------------------------
    for (e in seq_len(nrow(episodes_details_peryear))){
      
      print(episodes_details_peryear[e, "episode_link"])
      
      episode_page <- read_html(episodes_details_peryear[e, "episode_link"])
      episode_name <- episodes_details_peryear[e, "episode_name"]
      
      # GATHER ALL THE CREW NAMES FOR THIS EPISODE -------------------------------------
      episode_allcrew <- html_nodes(episode_page, '.name , .dataHeaderWithBorder') %>%
        html_text()
      episode_allcrew <- gsub("[\n]", "", episode_allcrew) %>%
        trimws() #Remove white spaces 
      
      # SPLIT ALL THE CREW NAMES BY CATEGORY -------------------------------------------
      episode_categories <- html_nodes(episode_page, '.dataHeaderWithBorder') %>%
        html_text()
      episode_categories <- gsub("[\n]", "", episode_categories) %>%
        trimws() #Remove white spaces
      
      ## MUSIC DEPT -----------------------------------------------------------------------
      episode_music <- c()
      for (i in 1:(length(episode_allcrew)-1)){
        if (grepl("Music by", episode_allcrew[i])){
          j <- 1
          while (! grepl(episode_allcrew[i], episode_categories[j])){
            j <- j+1
          }
          k <- i+1
          while (! grepl(episode_categories[j+1], episode_allcrew[k])){
            episode_music <- c(episode_music, episode_allcrew[k])
            k <- k+1
          }
        }
      }
      for (i in 1:(length(episode_allcrew)-1)){
        if (grepl("Music Department", episode_allcrew[i])){
          # Sometimes music dept is last category
          if (grepl ("Music Department", episode_categories[length(episode_categories)])){ 
            first <- i+1
            for (p in first:length(episode_allcrew)) {
              episode_music <- c(episode_music, episode_allcrew[p])
            }
          } else {
            j <- 1
            while (! grepl(episode_allcrew[i], episode_categories[j])){
              j <- j+1
            }
            k <- i+1
            while (! grepl(episode_categories[j+1], episode_allcrew[k])){
              episode_music <- c(episode_music, episode_allcrew[k])
              k <- k+1
            }
          }
        }
      }
      if (length(episode_music) == 0){
        episode_music <- c("")
      }
      
      ## IDEM FOR OTHER CATEGORIES ----------------------------------------------------------
      
      ## EPISODE_INFO CONTAINS THE EPISODE CREW NAMES ORDERED BY CATEGORY -------------------
      episode_info <- list()
      episode_info$directors <- episode_directors
      episode_info$writers <- episode_writers
      episode_info$producers <- episode_producers
      episode_info$sound <- episode_sound
      episode_info$music <- episode_music
      episode_info$art <- episode_art
      episode_info$makeup <- episode_makeup
      episode_info$costume <- episode_costume
      
      ## EPISODES_LIST_PER_YEAR GATHERS THE INFORMATION FOR EVERY EPISODE OF THE SERIE-------
      ## SPLIT BY YEAR OF RELEASE --------------------------------------------------------
      episodes_list_peryear[[year]][[episode_name]] <- episode_info
    }
    
    ## SERIES_LIST GATHERS THE INFORMATION FOR EVERY YEAR AND EVERY SERIE -------------------
    series_list[[serie_name]] <- episodes_list_peryear
  } 
}

Let’s have a look at the information gathered in series_list. Here are some of the names I collected:

## - Black Mirror, 2011
##  Episode: The National Anthem 
##  Director: Otto Bathurst
## - Black Mirror, 2017
##  Episode: Black Museum 
##  Director: Colm McCarthy
## - Game of Thrones, 2011
##  Episode: Winter Is Coming 
##  Music team: Ramin Djawadi, Evyen Klean, David Klotz, Robin Whittaker, Michael K. Bauer, Brandon Campbell, Stephen Coleman, Janet Lopez, Julie Pearce, Joe Rubel, Bobby Tahouri
## - Game of Thrones, 2017
##  Episode: Dragonstone 
##  Music team: Ramin Djawadi, Omer Benyamin, Evyen Klean, David Klotz, William Marriott, Douglas Parker, Stephen Coleman

What we can see is that for the same series the crew changes depending on the episode we consider.

Gender Determination

Now that I had all the names gathered in the series_list, I needed to determine the genders. I used the same package as in my previous post on the film industry: GenderizeR, which “uses genderize.io API to predict gender from first names”. More details on this package and the reasons why I decided to use it are available in my previous post.

With this R package, I was able to determine for each episode the number of males and females in each category of jobs:

the number of male directors,
the number of female directors,
the number of male producers,
the number of female producers,
the number of males in costume team,
the number of females in costume team.

Here is the code I wrote:

### Genderize our lists of names

# for each serie
for (s in seq_along(series_list) ){  
  print(names(series_list[s])) # print serie name
  
  # for each year
  for (y in seq_along(series_list[[s]])){ 
    print(names(series_list[[s]][y])) # print serie year
    
    # for each episode
    for (i in seq_along(series_list[[s]][[y]])){ 
      print(names(series_list[[s]][[y]][i])) # print serie episode
      
      # Genderize directors -----------------------------------------------------
      directors <- series_list[[s]][[y]][[i]]$directors
      
      if (directors == ""){
        directors_gender <- list()
        directors_gender$male <- 0
        directors_gender$female <- 0
        series_list[[s]][[y]][[i]]$directors_gender <- directors_gender
      }
      
      else{
        # Split the firstnames and the lastnames
        # Keep the firstnames
        directors <- strsplit(directors, " ")
        l <- c()
        for (j in seq_along(directors)){
          l <- c(l, directors[[j]][1])
        }
        
        directors <- l
        serie_directors_male <- 0
        serie_directors_female <- 0
        
        # Genderize every firstname and count the number of males and females 
        for (p in seq_along(directors)){
          directors_gender <- genderizeAPI(x = directors[p], apikey = "233b284134ae754d9fc56717fec4164e")
          gender <- directors_gender$response$gender
          if (length(gender)>0 && gender == "male"){
            serie_directors_male <- serie_directors_male + 1
          }
          if (length(gender)>0 && gender == "female"){
            serie_directors_female <- serie_directors_female + 1
          }
        }
        
        # Put the number of males and females in series_list
        directors_gender <- list()
        directors_gender$male <- serie_directors_male
        directors_gender$female <- serie_directors_female
        series_list[[s]][[y]][[i]]$directors_gender <- directors_gender
      }  
      
      # Same code for the 7 other categories -----------------------------------
      
      } 
    }
  }
}

Here are some examples of numbers of male and female I collected:

## Black Mirror, 2011
##  Episode: The National Anthem 
##  Number of male directors: 1 
##  Number of female directors: 0 
## 
## Black Mirror, 2017
##  Episode: Black Museum 
##  Number of male directors: 1 
##  Number of female directors: 0 
## 
## Game of Thrones, 2011
##  Episode: Winter Is Coming 
##  Number of male in music team: 8 
##  Number of female in music team: 3 
## 
## Game of Thrones, 2017
##  Episode: Dragonstone 
##  Number of male in music team: 7 
##  Number of female in music team: 0 
##

Percentages Calculation

With these numbers gathered in my list, I then calculated the percentages of women in each job category, for each year between 2007 and 2018. I gathered these figures in a data frame called percentages:

##    year directors  writers producers     sound    music      art   makeup
## 1  2018  22.69693 25.06514  27.87217 12.247212 23.25581 36.93275 73.10795
## 2  2017  20.51948 28.20016  27.28932 10.864631 25.46912 29.90641 71.41831
## 3  2016  17.13456 24.51189  27.93240 11.553444 25.03117 30.98003 71.74965
## 4  2015  16.14764 19.42845  26.43828 11.214310 22.16505 29.83354 69.50787
## 5  2014  18.38624 20.88644  27.59163 10.406150 22.21016 30.11341 69.97544
## 6  2013  14.94413 19.60432  28.15726 10.504896 23.29693 29.01968 69.01683
## 7  2012  15.60694 19.82235  29.66566 10.685681 21.45378 26.74160 67.47677
## 8  2011  13.95349 17.60722  26.73747 11.296882 17.11185 25.61805 64.81795
## 9  2010  15.95745 17.05882  27.38841 11.264644 16.51376 24.14815 65.33004
## 10 2009  16.49123 18.90496  28.79557  8.498350 21.72285 26.11128 68.15961
## 11 2008  17.87440 16.62088  29.05844  7.594264 18.74405 23.46251 68.39827
## 12 2007  21.15385 21.78771  30.12798  9.090909 19.23077 21.66124 63.03502
##     costume
## 1  77.24853
## 2  81.34648
## 3  79.35358
## 4  76.48649
## 5  76.62972
## 6  74.74791
## 7  77.35247
## 8  77.46315
## 9  77.67380
## 10 79.56332
## 11 80.53191
## 12 79.24720

Gender Diversity in 2017: TV Series Industry VS Film Industry

Based on this data frame, I created some bar plots to visualise the gender diversity of each job category for each year. Here is the code I wrote to create the bar plot for 2017, which compares the TV series industry to the film industry.

### Barplot 2017

# Data manipulation -------------------------------------------------------------

# Import our movies dataset
percentages_movies <- read.csv("percentages_movies.csv") 
percentages_movies <- percentages_movies[ , -1]

# Change column names for movie and serie dataframes
colnames(percentages_movies) <- c("year", "directors", "writers", "producers", "sound", "music", "art", "makeup", "costume")
colnames(percentages) <- c("year", "directors", "writers", "producers", "sound", "music", "art", "makeup", "costume")

# From wide to long dataframes
percentages_movies_long <- percentages_movies %>%
  gather(key = category, value = percentage, -year)
percentages_long <- percentages %>%
  gather(key = category, value = percentage, -year)

# Add a column to these dataframes: movie or film ?
percentages_movies_long$industry <- rep("Film industry", 88)
percentages_long$industry <- rep("Series industry", 96)

# Combine these 2 long dataframes
percentages_movies_series <- bind_rows(percentages_long, percentages_movies_long)

# Filter with year=2017
percentages_movies_series_2017 <- percentages_movies_series %>%
  filter(year == 2017)


# Data visualisation -------------------------------------------------------------

percentages_movies_series_2017$percentage <- as.numeric(format(percentages_movies_series_2017$percentage, 
                                                        digits = 2))

bar_2017 <- ggplot(percentages_movies_series_2017, aes(x = category,
                                                       y = percentage,
                                                       group = category,
                                                       fill = category)) +
  geom_bar(stat = "identity") +
  facet_wrap(~industry) +
  coord_flip() + # Horizontal bar plot
  geom_text(aes(label = percentage), hjust=-0.1, size=3) +
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.title.y=element_blank(),
        plot.title = element_text(hjust = 0.5), # center the title
        legend.title=element_blank()) +  
  labs(title = paste("Percentages of women in 2017"),
       x = "",
       y = "Percentages") +
  guides(fill = guide_legend(reverse=TRUE)) + # reverse the order of the legend
  scale_fill_manual(values = brewer.pal(8, "Spectral")) # palette used to fill the bars and legend boxs

I have built a simple shiny app which gives access to the bar plots for each year between 2007 and 2017.

Let’s analyse the graph of the year 2017. If we only focus on the TV series figures, we see that sound teams show the lowest female occupation, with less than 11%. It is followed by the role of director with 20.5%. Then, we can see that between 25% and 30% of the roles of writers, producers, music teams and art teams are taken by women. Thus, women are still under-represented in the TV series industry. However, even if series figures show little gender diversity in the above job categories, they are better than the film industry ones, especially for the key roles of directors, writors and producers, which are respectively 5.7, 3 and 1.2 times higher for the series industry than for the film industry. The last thing to notice is that as in the film industry, the series industry graph shows a representativeness gap between the above roles and the jobs of make-up artists and costume designers, among which more than 70% of the roles are taken by women.

Evolution of the Gender Diversity: TV Series Industry VS Film Industry

Let’s have a look at the evolution of the gender diversity in these two industries in the last decade.

### Evolution plot

# year as date
percentages_movies_series_ymd <- percentages_movies_series %>%
  subset(year != 2018)
percentages_movies_series_ymd$year <- ymd(percentages_movies_series_ymd$year, truncated = 2L) 

# Data visualisation
evolution <- ggplot(percentages_movies_series_ymd, aes(x = year,
                                                       y = percentage,
                                                       group = category,
                                                       colour = category)) +
  geom_line(size = 2) +
  facet_wrap(~industry) +
  theme(panel.grid.minor.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) + # center the title
  scale_x_date(date_breaks = "2 year", date_labels = "%Y") +
  scale_color_manual(values = brewer.pal(8, "Set1")) +
  labs(title = "Percentages of women from 2007 to 2017\n Film industry VS serie industry",
       x = "",
       y = "Percentages")

The first thing I noticed is that for both the film and series industries, the representation gap between the roles of make-up artists and costume designers and the other ones had not decreased since 2007.

The fact that the roles of directors, writers and producers are more open to women in the TV series industry than in the film one is easy to visualise with this graph, and we can see that it has been the case at least since 2007 (and probably before). Besides, since 2007 the series industry has been more diversified in terms of gender for all the categories I studied, except for the sound roles.

I also noticed that since 2010/2011, in the TV series industry, almost all the categories tend to be more diversified in terms of gender. The only exceptions are the roles of producers (percentages are generally decreasing slightly since 2007), sound teams (no improvement has been achieved since 2010) and costume teams (the trend has been positive only since 2013). Apart from that, there is a positive trend for the TV series industry, which is not the case for the film industry.

This trend is significant for some roles: writers, music teams, art teams and make-up teams percentages in the series industry have increased by 5 to 10% in the last decade. But if we look at the role of directors, the percentage of women has also increased by 5% since 2011, but the percentage reached in 2017 is essentially the same as the one reached in 2007, just as for the film industry. Let’s hope that the trend seen since 2011 for directors will continue.

Conclusion

This study has definitely shown that the TV series industry is more diversified in terms of gender than the film industry, especially for the key roles of directors and writers.

However even if the series percentages are better than the film ones, women are still under-represented in the TV series industry as the same regrettable analysis has been echoed: the only jobs which seem open to women are the stereotyped female jobs of make-up artists and costume designers. In all the other categories, the percentages of women in the series industry never reach more than 30%.

But contrary to the film industry, the TV series one is actually evolving in the right direction: since 2011, a positive trend has been happening for directors and writers. This evolution is encouraging for the future and suggests that powerful female characters, such as Daenerys Targaryen from Game of Thrones, are coming on TV screens.

To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Gender diversity in the TV series industry

Data Frame Creation – Web Scraping

List Creation – Web Scraping

Gender Determination

Percentages Calculation

Gender Diversity in 2017: TV Series Industry VS Film Industry

Evolution of the Gender Diversity: TV Series Industry VS Film Industry

Conclusion

Related

Data Frame Creation – Web Scraping

List Creation – Web Scraping

Gender Determination

Percentages Calculation

Gender Diversity in 2017: TV Series Industry VS Film Industry

Evolution of the Gender Diversity: TV Series Industry VS Film Industry

Conclusion

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)