Soap analytics: Text mining “Goede tijden slechte tijden” plot summaries….

[This article was first published on Longhow Lam's Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Sorry for the local nature of this blog post. I was watching Dutch television and zapping between channels the other day and I stumbled upon “Goede Tijden Slechte Tijden” (GTST). This is a Dutch soap series broadcast by RTL Nederland. I must confess, I was watching (had to watch) this years ago because my wife was watching it…… My gut feeling with these daily soap series is that missing a few months or even years does not matter. Once you’ve seen some GTST episodes you’ve seen them all, the story line is always very similar. Can we use some data science to test if this gut feeling makes sense? I am using R and SAS to investigate this and see if more interesting soap analytics can be derived.

Scraping episode plot summaries

First, data is needed. All GTST episode plot summaries are available, from the very first episode in October 1990.blogpic01

A great R package to scrape data from web sites is rvest by Hadley Wickham, I can advise anyone to learn this. Luckily, the website structure of the GTST plot summaries is not that difficult. With the following R code I was able to extract all episodes. First I wrote a function that extracts the plot summaries of one specific month.

getGTSTdata = function(httploc){
  tryCatch({
    gtst = html(httploc) %>%
     html_nodes(".mainarticle_body") %>%
     html_children()

    # The dates are in bold, the plot summaries are normal text
    texts = names(gtst[[1]]) == "text"
    datumsel = names(gtst[[1]]) == "b"

    episodeplot = gtst[[1]][texts] %>%
     sapply(html_text) %>%
     str_replace_all(pattern='n'," ")

    episodedate = gtst[[1]][datumsel] %>%
     sapply(html_text)

    # put data in a data frame and return as results
    return(data.frame(episodeplot, episodedate))
  },
    error = function(cond) {
      message(paste("URL does not seem to exist:", httploc))
      message("Here's the original error message:")
      message(cond)
      # Choose a return value in case of error
      return(NULL)
    }
  )
}

This function is then used inside a loop over all months to get all the episode summaries. Some months do not have episodes, and the corresponding link does not exist (actors are on summer holiday!). So I used the tryCatch function inside my function which will continue the loop if a link does not exist.

months = c("januari", "februari","maart","april","mei","juni","juli","augustus","september","oktober","november","december")
years = 1990:2012

GTST_Allplots = data.frame()

for (j in years){
  for(m in months){
    httploc = paste("http://www.rtl.nl/soaps/gtst/meerdijk/tvgids/", m, j, ".xml", sep="")
    out = getGTSTdata(httploc)

    if (!is.null(out)){
      out$datums = paste(out$datums, j)
      GTST_Allplots = rbind(GTST_Allplots,out)
    }
  }
}

A a result of the scraping, I got 4090 episode plot summaries. The real GTST fans know that there are more episodes, the more recent episode summaries (from 2013) are on a different site. For this analysis I did not bother to use them.

Text mining the plot summaries

The episode summaries are now imported in SAS Enterprise Guide, the following figure shows the first few rows of the data.

blogpic02

Click to enlarge

With SAS Text miner it is very easy and fast to analyze unstructured data, see my earlier blog post. What are the most important topics that can be found in all the episodes? Let’s generate them, I can use either the text topic node or the text cluster node in SAS. The difference is that with the topic node an episode can belong to more than one topic, while with the cluster node an episode belongs only to one cluster.

blogpic03I have selected 30 topics to be generated, you can experiment with different numbers of topics. The resulting topics can be found in the following figure.

blogpic04

GTST topics. click to enlarge

Looking at the different clusters found, it turns out that many topics are described by the names of the characters in the soap series. For example topic number 2 is described by terms like, “Anton“, “Bianca“, “Lucas“, ‘Maxime“, and “Sjoerd“, they occur often together in 418 episodes. And topic number 25 involves the terms “Ludo“, “Isabelle“, “Martine“, “Janine“. Here is a picture collage to show this in a more visual appealing way. I have only attached faces for six clusters, the other clusters are still the colored squares. The clusters that you see in the graph are based on the underlying distances between the clusters.

GTST_COLLAGE

click to enlarge

Zoom in on a topic

Now let’s say I am interested in the characters of topic 25 (the Ludo Isabelle story line). Can I discover sub-topics in this story line? So, I apply a filter on topic 25, only the episodes that belong to the Ludo Isabelle story line are selected and I generate a new set of topics (call them sub topics to distinguish them form the original topics) .

blogpic05

Text miner flow to investigate a specific topic

What are some of the subtopics of the Ludo Isabelle story line?

  • Getting money back
  • George, Annie Big panic, very worried
  • Writing farewell letter
  • Plan of Jack brings people in danger

As a data scientist I should now reach out or go back to the business and ask for validation. So I am reaching out now: Are there any real hardcore GTST fans out there that can explain:

  • why certain characters are grouped together?
  • why certain groups of characters are closer to each other than other groups?
  • recognize subtopics in Ludo Isabelle story lines?

Text profiling

To comeback to my gut feeling, can I see if the different story lines remain similar? I can use a the Text profile node in SAS Text miner to investigate this. The Text Profile node enables you to profile a target variable using terms found in the documents. For each level of a target variable, the node outputs a list of terms from the collection that characterize or describe that level. The approach uses a hierarchical Bayesian belief model to predict which terms are the most likely to describe the level. In order to avoid merely selecting the most common terms, prior probabilities are used to down-weight terms that are common in more than one level of the target variable.

The target variable can also be a date variable. In my scraped episode data I have the date of the episodes, there is a nice feature that allows the user to set the aggregation level. Let’s look at years as aggregation level.

blogpic06

Text profile node in SAS

The output consists of different tables and plots, two interesting plots are the Term Time series plot and the Target Similarity plot. The first one shows for a selected year the most important terms and how these terms evolve over the other years. Suppose we select 1991 then we get the following graph.

blogpic07

Click to enlarge. Terms over years

Myriam was an important term (character) in 1991, but we see that here role stopped in 1995. Jef was a slightly less important term, but his character continued for a very long time in the series. The similarity plot shows the similarities between the sets of episodes by the different years. The distances are calculated on the term beliefs of the chosen terms.

blogpic08

GTST similarities by year

We can see very strong similarities between years 2000 and 2001, two years that are also very similar are 1996 and 1997. Very isolated years are 2009 and 2012, maybe the writers tried a new story line that was unsuccessful and canceled it. Now let’s focus on one season of GTST, season 22 from September 2011 to May 2012, and look at similarities between months. They are shown in the following figure.

blogpic09

Season 22 of GTST: monthly similarities

It seems that the months September ’11, November ’11 and May ’12 are very similar but the rest of the months are quite separate.

Conclusion

The combination R / rvest and SAS Text miner is an ideal tool combination to quickly scrape (text) data from sites and easily get first insights into those texts. Applying this on Goede Tijden Slechte Tijden (GTST) episode summaries rejects my gut feeling that once you’ve seen some GTST episodes you’ve seen them all. It turns out that story lines between years, and story lines within one year can be quite dissimilar!

Without watching thousands of episodes I can quickly get an idea of which characters belong together, what the main topics are of a season, the rise and fall of characters in the series and how topics are changing over time. But keep in mind: If you are in a train and hear two nice lady’s talking about GTST, will this analysis be enough to have meaningful conversations with them?  …… I don’t think so!

Thanks @ErwinHuizenga and @josvandongen for reviewing.


To leave a comment for the author, please follow the link and comment on their blog: Longhow Lam's Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)