Contributed by Emil Parikh. He is currently in the NYC Data Science Academy 12-week, full-time Data Science Bootcamp program taking place from January 9th to March 31st, 2017. This post is based on his second class project – Web Scraping.
Disciplines such as economics and politics have various indicators that measure the state of different aspects of their fields. Events around the country in the past few years have caused people to question the state of the US and to express surprise about “who this country is.” Given that, I am surprised there is no indicator that can tell us who we are and where we are going socially as a country. There is a collection of indicators that describe the social environment in terms of such things as poverty, obesity and suicide rates, but these largely describe outcomes and consequences rather than preferences and personality.
Spoiler Alert! A full solution to such a complicated task is beyond the scope of this project; a full solution would require multiple scraping projects and continued feedback from professionals in social psychology. I will address this again in the next steps section. Instead, I used this time to take a first step in building a social indicator by scraping and visualizing information about television shows.
While show titles could be found on both Wikipedia and IMDb, I needed to scrape them from Wikipedia in order to
- retrieve the Wikipedia URLs for the shows in order to get the network information and
- specify in IMDbPY which shows I wanted information for
Screenshots of two Wikipedia pages I scraped TV show titles and URLs from:
Screenshots of a Wikipedia show page from which I retrieved information:
For fields common to both Wikipedia and IMDb, such as genre and start/end date, I still retrieved the information from Wikipedia. Once the scraping was finished, I filled in any missing data by collecting the same information from IMDb, along with the IMDb rating and number of votes.
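The fill-in step can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function name `fill_missing`, the field names, and the sample values are all hypothetical. It keeps the Wikipedia value when one exists, falls back to IMDb otherwise, and always takes the IMDb-only fields (rating and votes) from IMDb.

```python
def fill_missing(wiki_record, imdb_record, imdb_only=("rating", "votes")):
    """Prefer Wikipedia values; fall back to IMDb where Wikipedia has none."""
    merged = dict(wiki_record)
    for field, value in imdb_record.items():
        # None, empty strings and empty lists count as missing on the Wikipedia side.
        if field in imdb_only or merged.get(field) in (None, "", []):
            merged[field] = value
    return merged


# Made-up records for one show (hypothetical field names and values).
wiki = {"title": "Example Show", "genre": "Comedy", "start_year": None, "network": "NBC"}
imdb = {"genre": "Comedy, Drama", "start_year": 1994, "rating": 8.1, "votes": 12000}

show = fill_missing(wiki, imdb)
print(show)
```

Here the genre stays as scraped from Wikipedia, while the missing start year and the rating and vote count come from the IMDb record.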
A sample of my Wikipedia TV show scraper:
View the code on Gist.
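For readers who cannot open the Gist, here is a rough sketch of how titles and URLs might be pulled from a list page. It is not the scraper from the Gist. It uses only the Python standard library, runs on a small inline HTML sample rather than a live page, and assumes that show titles appear as italicised links to `/wiki/...` pages. The class name `ShowLinkParser` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"


class ShowLinkParser(HTMLParser):
    """Collect (title, url) pairs from links that sit inside <i> tags."""

    def __init__(self):
        super().__init__()
        self.in_italic = False
        self.current_href = None
        self.current_text = []
        self.shows = []

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.in_italic = True
        elif tag == "a" and self.in_italic:
            href = dict(attrs).get("href") or ""
            if href.startswith("/wiki/"):
                self.current_href = href
                self.current_text = []

    def handle_data(self, data):
        # Only gather text while inside a qualifying link.
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            self.shows.append((title, urljoin(BASE_URL, self.current_href)))
            self.current_href = None
        elif tag == "i":
            self.in_italic = False


# Hypothetical markup standing in for a Wikipedia list-of-shows page.
sample_html = """
<ul>
  <li><i><a href="/wiki/Example_Show">Example Show</a></i> (<a href="/wiki/NBC">NBC</a>)</li>
  <li><i><a href="/wiki/Another_Show">Another Show</a></i></li>
</ul>
"""

parser = ShowLinkParser()
parser.feed(sample_html)
for title, url in parser.shows:
    print(title, url)
```

The network link in the first item is skipped because it is not italicised. A real list page would need its markup checked first, since the italic-link assumption may not hold everywhere.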
Using IMDbPY to get information about TV shows using show titles gathered from Wikipedia as the search term:
View the code on Gist.
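A rough sketch of such a lookup, again not the code from the Gist: `pick_series` and `fetch_show_info` are hypothetical helper names, and the choice to take the first search result whose kind is a TV series is an assumption about how ambiguous titles could be handled. `fetch_show_info` needs the IMDbPY package and network access, so the demonstration at the bottom only exercises the selection logic on stand-in dictionaries.

```python
SERIES_KINDS = {"tv series", "tv mini series"}


def pick_series(results):
    """Return the first search result that is a TV series, or None."""
    for result in results:
        if result.get("kind") in SERIES_KINDS:
            return result
    return None


def fetch_show_info(title):
    """Look up one show on IMDb by title (requires IMDbPY and a network connection)."""
    from imdb import IMDb  # imported lazily so pick_series works without IMDbPY

    ia = IMDb()
    show = pick_series(ia.search_movie(title))
    if show is None:
        return None
    ia.update(show)  # fetch the detailed record for the matched title
    return {
        "title": show.get("title"),
        "genres": show.get("genres"),
        "year": show.get("year"),
        "rating": show.get("rating"),
        "votes": show.get("votes"),
    }


# Stand-in search results (plain dicts, made-up titles) to show the selection step.
fake_results = [
    {"title": "Example Movie", "kind": "movie"},
    {"title": "Example Show", "kind": "tv series"},
]
print(pick_series(fake_results))
```

Filtering on the result kind matters because a title search can return films or video games that share a show's name.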
Visualization and Analysis
In the app, I have visualizations on
- count of new shows created
- median IMDb rating of new shows
- median number of years shows ran for
- total number of votes on IMDb
This information is displayed for each year from the 1940s until 2016 by genre and by network.
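The four measures above could be computed along these lines. This is a sketch under stated assumptions, not the app's code: the function name `summarize` and the field names are hypothetical, each show is assumed to have a single genre and network, and the sample rows are made up. Shows without an end year or rating are left out of the corresponding median.

```python
from collections import defaultdict
from statistics import median


def summarize(shows, group_key):
    """Aggregate shows by (start year, group), where group is e.g. genre or network."""
    groups = defaultdict(list)
    for show in shows:
        groups[(show["start_year"], show[group_key])].append(show)

    summary = {}
    for key, members in groups.items():
        ratings = [s["rating"] for s in members if s["rating"] is not None]
        years_run = [s["end_year"] - s["start_year"]
                     for s in members if s["end_year"] is not None]
        summary[key] = {
            "new_shows": len(members),
            "median_rating": median(ratings) if ratings else None,
            "median_years_run": median(years_run) if years_run else None,
            "total_votes": sum(s["votes"] or 0 for s in members),
        }
    return summary


# Made-up rows with hypothetical field names.
shows = [
    {"title": "A", "genre": "Comedy", "network": "X", "start_year": 1990,
     "end_year": 1995, "rating": 7.0, "votes": 100},
    {"title": "B", "genre": "Comedy", "network": "Y", "start_year": 1990,
     "end_year": 1993, "rating": 8.0, "votes": 300},
    {"title": "C", "genre": "Drama", "network": "X", "start_year": 1991,
     "end_year": None, "rating": None, "votes": None},
]

by_genre = summarize(shows, "genre")
by_network = summarize(shows, "network")
print(by_genre[(1990, "Comedy")])
```

Passing `"network"` instead of `"genre"` gives the per-network view with the same function.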
Screenshots of some of the visualizations:
Count of new shows by genre from 1940s to 2016:
Count of new shows by network from 1940s to 2016:
What we can take from the genre plots is that networks and show creators appear to believe that audiences want more comedies and reality shows (shows that tend to require less thinking). Dramas have not risen as sharply. While the number of shows created in these genres has risen consistently, the number of shows created by the major networks has been declining since the mid-1980s. I will need to look into this further.
TV show data alone is not enough to answer “who are we as a society?”, especially without viewership data. Some future steps I would take to build upon this project are:
- Scrape more lists of TV shows; it seems that the lists of TV shows I scraped may not have been thorough for 2015 and 2016.
- Obtain numbers on the audience side, such as viewership of shows/genres/networks, in order to get a better sense of audience preferences rather than just the creators’ and networks’ predictions of audience preferences.
- Compare various data (like genre and viewership) for traditional networks with the same data for streaming services such as Netflix, as this may give a sense of demographic contributors.
- Include movies, music, books, magazines, news, etc. in the analysis, since one medium alone will not capture society.
- Expand beyond entertainment. Include the trend in degrees and jobs.