
Seinfeld Characters – A Post About Nothing

[This article was first published on Stoltzman Consulting Data Analytics Blog - Stoltzman Consulting, and kindly contributed to R-bloggers.]

This post is dedicated to my mother – Seinfeld’s greatest fan. Seinfeld is a classic TV sitcom. It featured four main characters surrounded by relatively normal, everyday, run-of-the-mill scenarios. In the spirit of Seinfeld, this post will also “be about nothing.”

I used Python to create a web scraper that gathered scripts from various sites on the internet and loaded them into a local MySQL database.

Data From Local MySQL

Scraping the raw lines without parsing them was simple enough. Parsing before inserting into MySQL created some difficulties, so it made sense to parse the data once it was loaded into R. A sample of the raw data:

I parsed the raw data to extract the character name and the line spoken. I also removed some lines to clean up the data a bit.
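
A minimal sketch of this kind of parsing in base R. The raw sample is not reproduced here, so the "NAME: spoken text" format, the example lines, the regular expression, and the column names are all assumptions made for illustration, not the actual scraped data.

```r
# Hypothetical raw lines; the real scraped format may differ.
raw <- c(
  "JERRY: Do you know what this is all about?",
  "GEORGE: Are you through?",
  "(Jerry's apartment)",
  "ELAINE: Get out!",
  ""
)

# Assumed pattern: an upper-case name, a colon, then the spoken text.
pattern <- "^([A-Z][A-Z .']*):[[:space:]]*(.+)$"

# Keep only lines that look like dialogue; stage directions and blanks drop out.
is_dialogue <- grepl(pattern, raw)

lines_df <- data.frame(
  character = sub(pattern, "\\1", raw[is_dialogue]),
  line      = sub(pattern, "\\2", raw[is_dialogue]),
  stringsAsFactors = FALSE
)

print(lines_df)
```

Filtering on the pattern first is one simple way to handle the "removed lines" step: anything that does not parse as a character speaking is discarded before the columns are built.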

I used a shift function to create a new column which is the same as the character column but moved up by one row. This should help to show the conversation between two people. Inherently, this will be flawed because the beginnings and ends of scenes will run together. I assumed this wouldn’t materially affect the results, since these instances are likely evenly distributed across characters.
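
The shift idea can be sketched in base R as follows. The post does not say which shift function was used, so the `lead_by` helper, the toy data, and the `next_character` column name are hypothetical; `data.table::shift()` with `type = "lead"` behaves similarly.

```r
# Toy data standing in for the parsed script lines (hypothetical values).
lines_df <- data.frame(
  character = c("JERRY", "GEORGE", "JERRY", "ELAINE", "KRAMER"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: move a vector up by n rows, padding the end with NA.
lead_by <- function(x, n = 1L) {
  n <- min(n, length(x))
  c(tail(x, -n), rep(NA, n))
}

# The speaker of the following line, treated as the character being spoken to.
lines_df$next_character <- lead_by(lines_df$character, 1L)

print(lines_df)
```

The last row has no following line, so it gets `NA`. Scene boundaries are not detected here, which is the flaw described above: the last speaker of one scene is paired with the first speaker of the next.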

I created a list of characters with the most lines recorded.
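
As a rough illustration, counting lines per character takes only a few lines of base R. The vector below is made-up toy data, not the actual script data, and the cut-off of two characters is arbitrary.

```r
# Toy speaker column (hypothetical values, one entry per spoken line).
characters <- c("JERRY", "GEORGE", "JERRY", "ELAINE",
                "JERRY", "GEORGE", "KRAMER")

# Count lines per character and sort from most to fewest.
line_counts <- sort(table(characters), decreasing = TRUE)

# Keep the characters with the most lines.
top_characters <- head(line_counts, 2)

print(top_characters)
```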

Observations

I created a list of pairs of characters speaking to each other. This is directional data (so Jerry speaking to George is separate from George speaking to Jerry).
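
One way to build such directional counts, sketched in base R. The toy data and the `from`, `to`, and `n` column names are assumptions; each line's speaker is paired with the speaker of the next line, so the two directions are counted separately.

```r
# Toy speaker sequence (hypothetical values).
characters <- c("JERRY", "GEORGE", "JERRY", "GEORGE", "JERRY", "ELAINE")

# Pair each speaker with the speaker of the following line.
exchanges <- data.frame(
  from = head(characters, -1),
  to   = tail(characters, -1),
  stringsAsFactors = FALSE
)

# Count each ordered (from, to) pair; JERRY -> GEORGE and GEORGE -> JERRY
# are separate rows.
pair_counts <- aggregate(n ~ from + to, data = cbind(exchanges, n = 1L), FUN = sum)
pair_counts <- pair_counts[order(-pair_counts$n), ]

print(pair_counts)
```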

Observations

I used a shift function once again to see how the conversation flows two lines after. This will give a hint as to whether the conversation is between two characters or more. Again, this is directional data.

Combining three lines in a row created a nice view of groups that speak in order.
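
A small base R sketch of the three-in-a-row idea, using made-up data. Treating "the third speaker is the same as the first" as a sign of a two-person exchange is an interpretation added here for illustration, not something stated in the original analysis.

```r
# Toy speaker sequence (hypothetical values).
characters <- c("JERRY", "GEORGE", "JERRY", "ELAINE",
                "JERRY", "GEORGE", "JERRY")
n <- length(characters)

# Speakers of each line, the next line, and the line after that.
first  <- characters[1:(n - 2)]
second <- characters[2:(n - 1)]
third  <- characters[3:n]

# Count each ordered run of three speakers.
triples <- paste(first, second, third, sep = " > ")
triple_counts <- sort(table(triples), decreasing = TRUE)

# Assumed heuristic: if the third speaker equals the first, the exchange
# looks like a back-and-forth between two characters.
two_person <- first == third

print(triple_counts)
print(mean(two_person))
```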

Observations

The igraph library makes it possible to visualize how two vectors are related to each other. Nodes represent the characters; the edges are the lines (relationships) connecting them.
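
A minimal igraph sketch of this kind of plot. The edge list and weights below are invented for illustration and are not results from the analysis; the styling arguments are just one reasonable choice.

```r
library(igraph)

# Hypothetical directional pair counts; not the post's actual results.
edges <- data.frame(
  from   = c("JERRY", "GEORGE", "JERRY", "ELAINE", "KRAMER"),
  to     = c("GEORGE", "JERRY", "ELAINE", "JERRY", "JERRY"),
  weight = c(40, 38, 30, 29, 25)
)

# The first two columns define the edges; extra columns (here, weight)
# become edge attributes.
g <- graph_from_data_frame(edges, directed = TRUE)

# Draw heavier edges for pairs that exchange more lines.
plot(
  g,
  edge.width       = E(g)$weight / 10,
  edge.arrow.size  = 0.5,
  vertex.label.cex = 0.9
)
```

Keeping the graph directed preserves the distinction between who speaks and who is spoken to, so each pair of characters can have two edges with different weights.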

Observations

Relationship of Top Seinfeld Characters

Conclusion

Ultimately, the “show about nothing” didn’t contain many surprises. Further analysis of the seasons could perhaps reveal some additional insights. Sentiment analysis would be useful in determining the “tone” of episodes and characters. Decision trees based on lines or bi-grams could perhaps predict which character is speaking. Maybe there will be more to come…

Code used in this post is on my GitHub.
