
## Background

This analysis came about because the dataset was the topic of the Stockholm R User Group (SRUG) Meetup Group Hackathon 2017 (and took me a while to get back to). The task was simply to do something interesting using the dataset.

Text data can be analysed using a variety of different methods: topic modelling, network analysis, sentiment analysis, word frequencies etc. One less commonly applied approach is that of readability, or in other words, linguistic complexity. The thought was that this might reveal an interesting dimension of the data that might be missed by other approaches. It also made for a nice case for demonstrating how readability scores can be applied.

Readability formulas were developed as early as the first half of the twentieth century, and were therefore originally calculated by hand. ‘True’ readability depends on all sorts of factors: the complexity of the ideas expressed, the logical coherence of the text, the words used, etc. What readability formulas measure is usually a function of the most easily quantifiable aspects of readability: words per sentence, syllables per word, etc. These quantities are then combined in a formula that weights the different components appropriately to arrive at a readability score. Many different readability scores exist, differing primarily in the weighting they give to one concept or another (e.g. sentence length vs word length), or in the way the different components of complexity are assessed (e.g. word length vs membership of an easy-word list).
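As a concrete example, the Flesch-Kincaid grade level combines just two such quantities. The helper below is my own illustrative sketch using the standard published coefficients (it is not part of any package used later); the reading-"age" variant used further down is roughly the grade plus five.

```r
# Flesch-Kincaid grade level: a weighted sum of average sentence length
# (words per sentence) and average word length (syllables per word).
fk_grade <- function(words, sentences, syllables) {
  0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
}

# A 100-word passage with 5 sentences and 150 syllables:
fk_grade(100, 5, 150)
## [1] 9.91  (roughly a ninth-to-tenth-grade reading level)
```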

As such, readability formulas tend to be rather crude tools for assessing readability. However, while these measures do not perfectly capture the true readability of a text, they can be especially informative when examining relative changes across large sets of texts. For example, some friends and I applied readability formulas to scientific abstracts as a hobby project, finding a very strong trend indicating that scientific writing has been growing increasingly complex. Another nice example of their application is an analysis of US State of the Union addresses, showing them becoming simpler over time.

The idea here was to apply readability formulas to TED talk transcripts, and to examine whether there have been any changes over time, as well as whether the linguistic complexity of the talks bears any relation to their popularity.

## The Data Set

The data set is a Kaggle data set available here. The description is as follows:

These datasets contain information about all audio-video recordings of TED Talks uploaded to the official TED.com website until September 21st, 2017. The TED main dataset contains information about all talks including number of views, number of comments, descriptions, speakers and titles. The TED transcripts dataset contains the transcripts for all talks available on TED.com.

## Setup

### Packages

library(tidyverse)
library(koRpus)
library(stringi)
library(viridis)

The readability package I’ll be using is called koRpus. While it has its quirks, and tends to be a little slower than some equivalent tools in Python, it is easy to use and offers a very comprehensive set of tools. First, we need to install the English language package using the commented code below, and then load it like a usual library.

# install.koRpus.lang("en")
library(koRpus.lang.en)

First, I read in the transcripts and the talk information, join the two, and throw out every talk with any missing data.

talks <- read_csv('../../static/data/20190321_ReadabilityTED/ted_main.csv')
## Parsed with column specification:
## cols(
##   description = col_character(),
##   duration = col_double(),
##   event = col_character(),
##   film_date = col_double(),
##   languages = col_double(),
##   main_speaker = col_character(),
##   name = col_character(),
##   num_speaker = col_double(),
##   published_date = col_double(),
##   ratings = col_character(),
##   related_talks = col_character(),
##   speaker_occupation = col_character(),
##   tags = col_character(),
##   title = col_character(),
##   url = col_character(),
##   views = col_double()
## )
transcripts <- read_csv('../../static/data/20190321_ReadabilityTED/transcripts.csv')
## Parsed with column specification:
## cols(
##   transcript = col_character(),
##   url = col_character()
## )
alldat <- full_join(talks, transcripts) %>%
  filter(complete.cases(.))
## Joining, by = "url"

It’s always a good idea to have a bit of a look at your data before running functions that take a long time. I sound smart saying that now; at the time I was less smart and, to my great chagrin, had to re-do some steps here and there. One discovers little surprises hidden in the data, such as these gems:

alldat$transcript[1171]
## [1] "(Music)(Applause)(Music)(Applause)(Music)(Applause)(Music)(Applause)" # Shortened this one a bit for brevity

str_sub(alldat$transcript[1176], start = 2742, 2816)
## [1] "\"sh\" in Spanish. (Laughter) And I thought that was worth sharing.(Applause)"

So, there seem to be various audience reactions included in the transcripts, wrapped in parentheses. Let’s make sure to remove these, but also save them for later in case they come in handy.

## Audience reactions

Here I first remove and save the bracketed audience reactions. I then use stringi to get word counts, so that we can drop the cases where there’s nothing, or not much, left after removing the bracketed text (like the first example above).

alldat <- alldat %>%
  mutate(bracketedthings = str_extract_all(transcript, pattern = "\\(\\w+\\)"),
         transcript = str_replace_all(transcript, "\\(\\w+\\)", " "),
         word_count = map_dbl(transcript, ~stri_stats_latex(.)[["Words"]]))
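As a quick sanity check on the word counter: `stri_stats_latex()` returns a named vector of text statistics, and the `Words` element is what we keep above.

```r
library(stringi)

# stri_stats_latex() is designed for LaTeX documents, but it works fine as a
# plain-text word counter here.
stri_stats_latex("This is a short test sentence.")[["Words"]]
## [1] 6
```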

Let’s examine these audience reactions a little further.
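The step that actually computes the readability scores and the laughter statistics used below isn't shown in the post; roughly, it would look something like the sketch here. The column names (`text_tt`, `FKA`, `lolzpermin`, `hasmusic`) are taken from their later use; the word-count cutoff, the assumption that `duration` is in seconds, and the slot access on the koRpus result are all mine, so treat this as a sketch rather than the original code.

```r
# Sketch (assumptions flagged above): flag talks containing music, count
# laughs per minute, then tokenize and score each transcript with koRpus.
alldat_read <- alldat %>%
  mutate(hasmusic = map_lgl(bracketedthings, ~any(.x == "(Music)")),
         lolz = map_dbl(bracketedthings, ~sum(.x == "(Laughter)")),
         lolzpermin = lolz / (duration / 60)) %>%  # duration assumed in seconds
  filter(word_count > 100)  # assumed cutoff to drop near-empty transcripts

# This is the slow part: tokenize each transcript, then pull the
# Flesch-Kincaid reading-age estimate out of the readability object.
alldat_read <- alldat_read %>%
  mutate(text_tt = map(transcript, ~koRpus::tokenize(.x, format = "obj", lang = "en")),
         FKA = map_dbl(text_tt, ~flesch.kincaid(.x)@Flesch.Kincaid$age))
```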

# Most Complex
tail(alldat_read$transcript, 1)
## [1] " I went down to St. James InfirmaryTo see my baby thereShe was lying on a long wooden tableSo cold, so still, so fairI went up to see the doctor\"She's very low,\" he saidI went back to see my babyGood God she's lying there deadI went down to old Joe's bar roomOn the corner of the squareThey were serving drinks as per usualAnd the usual crowd was thereTo my left stood Old Joe McKennedyHis eyes were bloodshot redHe turned to the crowd around himAnd these are the words he said\"Let her go, let her go, God bless herWherever she may beShe can search this whole wide world all overBut she'll never find another man like meShe can search this whole wide world all overAnd she'll never find another man like meWhen I die, please God, bury meIn my ten-dollar Stetson hatPut a twenty-dollar gold piece on my watch chainSo my friends know I died standing patGet six gamblers to carry my coffinAnd six choir girls to sing me a songStick a jazz band on my hearse wagonTo raise hell as I go alongNow that's the end of my storyLet's have another round of boozeAnd if anyone should ask youJust tell them I got the St. James Infirmary blues "

They’re both songs! This is something that will trip up readability scores: they’re made for full sentences. Songs don’t have the usual sentences, and, at least in the second case, they are considered as one long sentence. Let’s take a look.

alldat_read <- alldat_read %>%
  mutate(textstats = map(text_tt, ~koRpus::describe(.))) %>%
  mutate(sentences = map_dbl(textstats, 'sentences'),
         sentencelength = map_dbl(textstats, 'avg.sentc.length'))

ggplot(alldat_read, aes(x=sentencelength)) +
  geom_histogram(fill="grey", colour="black") +
  labs(title="Average Sentence Lengths")
## stat_bin() using bins = 30. Pick better value with binwidth.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
So let’s try to remove the bad cases to the right, choosing a limit of 40 words per sentence on average, as well as removing the talks that include music (using the variable we created earlier from the audience reactions), and see how everything looks again.

readdat <- alldat_read %>%
  select(-bracketedthings) %>%
  filter(sentencelength < 40, hasmusic == FALSE)

ggplot(readdat, aes(x=FKA)) +
  geom_histogram(fill="grey", colour="black") +
  labs(title="Muuuuch better!")
## stat_bin() using bins = 30. Pick better value with binwidth.

Now that’s a normal distribution if ever I saw one! This data looks pretty ripe for digging into now.

# Exploration

In order to look at trends over time, let’s first fix up the dates. The dates are UNIX timestamps, so I’ll first convert these to more normal dates.

readdat <- readdat %>%
  mutate(published_date = as.POSIXct(published_date, origin="1970-01-01"),
         published_date = as.Date(published_date))

## Views per day

As a crude indicator of interest in each video, I’ll calculate the number of views per day elapsed since the video was published, so as not to be biased by how long the video has been available to be viewed. The dataset describes videos on the TED.com website published before September 21st 2017, with the dataset created on September 25th, according to one of the comments on the page.

readdat <- readdat %>%
  mutate(days = as.Date("2017-09-25") - published_date,
         days = as.numeric(days),
         viewsperday = views/days)

ggplot(readdat, aes(x=viewsperday)) +
  geom_histogram(fill="grey", colour="black")
## stat_bin() using bins = 30. Pick better value with binwidth.

Looks like this might be more useful with a log transformation.

readdat <- readdat %>%
  mutate(log_vpd = log10(viewsperday))

ggplot(readdat, aes(x=log_vpd)) +
  geom_histogram(fill="grey", colour="black")
## stat_bin() using bins = 30. Pick better value with binwidth.

That looks much better! Let’s take a look at the trend over time.
ggplot(readdat, aes(x=published_date, y=log_vpd)) +
  geom_point() +
  labs(y="Log views per day")

I suspect that the date that the data was frozen is probably wrong, or something else is funny here. Maybe we can just use the raw views data instead, but I would then remove the datapoints from the last few months, which haven’t been around long enough to go viral.

readdat$too_recent <- readdat$published_date > as.Date("2017-04-01")

ggplot(readdat, aes(x=published_date, y=log10(views))) +
  geom_point(aes(colour=too_recent))

That looks OK. The views also appear to be affected by published date to a surprisingly small extent. Therefore I think it makes sense to just use the views figures, cutting out the most recent talks to avoid their bias.

readdat <- readdat %>%
  filter(published_date < as.Date("2017-04-01"))

## Changes over Time

As a first step, let’s take a look at whether there is any trend in readability over time.

corstrength <- cor(readdat$FKA, as.numeric(readdat$published_date))

readability_trend <- ggplot(readdat, aes(x=published_date, y=FKA)) +
  geom_point(aes(colour=log10(views))) +
  scale_colour_viridis(option = 'D', 'Log(Views)') +
  geom_smooth(colour="red") +
  geom_smooth(method="lm") +
  labs(title='Readability over Time',
       subtitle=paste0("Linguistic complexity has increased over time: R = ", round(corstrength, 2)),
       y='Flesch-Kincaid Readability (Age)',
       x='Published Date')

readability_trend
## geom_smooth() using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

This is actually rather stronger than I anticipated, and seems to be a pretty clear result: talks have become more complex over time. The linear trend is the same as we saw in the scientific literature. Interestingly, from the smooth model fit to the data, there appears to have been a peak in complexity around the beginning of 2014, which has perhaps since tapered off, but I couldn’t begin to speculate about what might have caused that, so it could well be nothing.
## Viewership

Let’s also take a look at how readability relates to views.

readability_popularity <- ggplot(readdat, aes(x=FKA, y=log10(views))) +
  geom_point() +
  labs(title='Readability Score and Views',
       subtitle="No strong relation, but notice the gap top right",
       x='Flesch-Kincaid Readability (Age)',
       y='Log(Views)')

readability_popularity

We can see here that there are many talks which are highly readable (left) and have many views, but very few which are complex and also popular. Let’s take a closer look at this. For this, I will divide the readability scores into deciles and compare the distributions of the views.

readdat <- readdat %>%
  mutate(read_percentile = cut(FKA, 10),
         read_percentile_num = as.numeric(read_percentile)*10,
         read_percentile_num_mid = read_percentile_num-5) %>%
  group_by(read_percentile_num) %>%
  mutate(meanRead = mean(FKA)) %>%
  ungroup()

readability_quantile <- ggplot(readdat, aes(x=read_percentile_num_mid, y=log10(views),
                                            fill=meanRead, group=read_percentile_num)) +
  geom_boxplot() +
  scale_fill_viridis('Mean \nReadability \nAge', direction = -1) +
  labs(title='Readability Percentile and Views',
       subtitle="Especially simple talks are most popular, \nand especially complex talks have uniformly few views",
       x='Readability Percentile (Simplest to Hardest)',
       y='Views (Log10)')

readability_quantile

## Topics

It would have been nice to separate the data into different topics or sections, but unfortunately that data isn’t quite so clear. What we do have is a set of tags. Let’s take a look at that data and see whether we can tell which topics are most complex and which are simplest.

readdat$tags[[1]]
## [1] "['collaboration', 'entertainment', 'humor', 'physics']"
topicdat <- readdat %>%
  mutate(tags = str_match_all(tags, pattern = "\\'(\\w+)\\'"),
         tags = map(tags, ~.x[,2]))
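As a quick check that the regex does what we want, we can run it on the example string shown above; column 2 of the match matrix holds the capture group.

```r
library(stringr)

# Pull the singly-quoted tag names out of the list-like string.
str_match_all("['collaboration', 'entertainment', 'humor', 'physics']", "'(\\w+)'")[[1]][,2]
## [1] "collaboration" "entertainment" "humor"         "physics"
```

Note that `\w+` won’t match tags containing spaces or hyphens, so those would be dropped; for the most common tags examined below this doesn’t matter much.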

Let’s see how many unique tags we’re dealing with.

tags <- c(unlist(topicdat$tags))
length(unique(tags))
## [1] 328

That looks like too many: this data will be too sparse. But let’s see what the most common ones are.

head( sort(table(tags), decreasing = T), 30)
## tags
##    technology       science       culture          TEDx        design
##           613           501           433           358           350
##      business        health    innovation entertainment       society
##           311           203           192           189           166
##        future       biology           art communication     economics
##           164           161           160           158           148
##         brain      medicine   environment collaboration    creativity
##           144           143           140           138           135
##      humanity      activism     education     invention     community
##           134           132           132           125           121
##       history      children    psychology      politics         women
##           120           112           110           109           108

These all have reasonable numbers of talks. Let’s compare them, keeping in mind that the same talk might belong to multiple categories, so there will be overlap.

topics <- names(head( sort(table(tags), decreasing = T), 30))

topic_read <- topicdat %>%
  select(FKA, tags, views)

selectbytopic <- function(topic, data) {
  filter(data, map_lgl(tags, ~topic %in% .x))
}

topic_readgroups <- map(topics, ~selectbytopic(.x, topic_read))
names(topic_readgroups) <- topics

topic_readgroups <- bind_rows(topic_readgroups, .id="Tag") %>%
  select(-tags) %>%
  group_by(Tag) %>%
  mutate(meanRead = mean(FKA),
         meanViews = mean(views)) %>%
  ungroup()

So let’s take a look at the readability of the different topics, looking first by views.

views_topics <- topic_readgroups %>%
  arrange(-meanViews) %>%
  mutate(Tag = fct_inorder(Tag)) %>%
  ggplot(aes(x=Tag, y=log10(views), fill=meanRead, group=Tag)) +
  geom_boxplot() +
  scale_fill_viridis('Mean \nReadability \nAge', direction = -1) +
  labs(title='Views by Tags',
       subtitle="Psychology and brain talks get more views, but the tags don't differ greatly otherwise",
       x='Tag',
       y='Views (Log10)') +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

views_topics

It’s interesting to see such a strong preference for psychology and brain talks.
We can also already see that the entertainment tag appears to be associated with more readable transcripts. But let’s take a look at the distributions.

readability_topics <- topic_readgroups %>%
  arrange(-meanRead) %>%
  mutate(Tag = fct_inorder(Tag)) %>%
  ggplot(aes(x=Tag, y=FKA, fill=log10(meanViews), group=Tag)) +
  geom_boxplot() +
  scale_fill_viridis('Mean \nViews \n(Log10)') +
  labs(title='Readability by Tags',
       subtitle="Politics are complex, entertainment is simple, but the others don't differ much",
       x='Tag',
       y='Flesch-Kincaid Readability Age') +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

readability_topics

That politics and economics should be most complex, and that education, children and entertainment should be simplest, makes intuitive sense. Despite being quite crude instruments, the readability formulas seem to do a pretty good job of capturing real differences in complexity. It’s also interesting to note that the topics of the talks don’t appear to be completely driving the differences in readability: brain talks are very popular but reasonably complicated, while entertainment talks are very simple but not massively popular.

## Audience Laughter

Next, let’s take a look at how funny the talks were compared to their complexity. We saved the laughs per minute earlier, so we can use that data, first filtering to the talks in which laughter was recorded.

lolpermin <- readdat %>%
  filter(lolzpermin > 0) %>%
  ggplot(aes(x=FKA, y=log10(lolzpermin), colour=log10(views))) +
  geom_point() +
  geom_smooth(method="lm") +
  labs(title='Readability and Laughs',
       subtitle='Simpler talks get more laughs',
       x='Flesch-Kincaid Readability (Age)',
       y='Lolz/minute (Log10)') +
  scale_colour_viridis(option = 'D', 'Log(Views)')

lolpermin

## Engagement

And engagement. Let’s make a crude marker of engagement by taking the log of the number of ratings per view.
First, we need to extract the number of ratings; then we can calculate engagement.

readdat$ratings[[1]]
## [1] "[{'id': 23, 'name': 'Jaw-dropping', 'count': 402}, {'id': 7, 'name': 'Funny', 'count': 1637}, {'id': 1, 'name': 'Beautiful', 'count': 59}, {'id': 22, 'name': 'Fascinating', 'count': 267}, {'id': 9, 'name': 'Ingenious', 'count': 116}, {'id': 21, 'name': 'Unconvincing', 'count': 15}, {'id': 10, 'name': 'Inspiring', 'count': 57}, {'id': 25, 'name': 'OK', 'count': 126}, {'id': 3, 'name': 'Courageous', 'count': 72}, {'id': 24, 'name': 'Persuasive', 'count': 14}, {'id': 26, 'name': 'Obnoxious', 'count': 52}, {'id': 11, 'name': 'Longwinded', 'count': 56}, {'id': 8, 'name': 'Informative', 'count': 9}, {'id': 2, 'name': 'Confusing', 'count': 6}]"
get_ratingcount <- function(ratingtext) {
  ratingcount <- stringr::str_match_all(ratingtext, "'count': (\\d*)")[[1]][,2]
  sum(as.numeric(ratingcount))
}
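A quick sanity check of the helper on a toy ratings string in the same format as the real data (the counts here are made up; the function is repeated so the snippet stands alone):

```r
library(stringr)

get_ratingcount <- function(ratingtext) {
  # Grab every 'count': N value and sum them.
  ratingcount <- str_match_all(ratingtext, "'count': (\\d*)")[[1]][,2]
  sum(as.numeric(ratingcount))
}

get_ratingcount("[{'id': 7, 'name': 'Funny', 'count': 1637}, {'id': 1, 'name': 'Beautiful', 'count': 59}]")
## [1] 1696
```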

readdat <- readdat %>%
  mutate(ratingcount = map_dbl(ratings, ~get_ratingcount(.x)),
         engagement = log10(ratingcount/views))

Right, now let’s see how it looks.

engagement <- readdat %>%
  ggplot(aes(x=FKA, y=engagement, colour=log10(views))) +
  geom_point() +
  geom_smooth(method="lm") +
  scale_colour_viridis('Log Views') +
  labs(subtitle='Simpler talks get more engagement',
       y='Engagement (Log(Ratings / Views))')

engagement

# Conclusions

In this analysis, much to my surprise I must admit, we found that linguistic complexity, measured using readability formulas, appears to be related to pretty much everything we looked at. Complexity has been changing over time, with talks growing more complex. We showed that talks with high complexity had universally low viewership, and that simpler talks are not only more viewed, but also get more laughs and more engagement. We further showed that politics and economics talks tend to be more complicated, while talks about education, children and entertainment tend to be simpler.

And, lastly, if you should ever be preparing to do a TED talk: keep it simple!