(This article was first published on **Edwin Chen's Blog » r**, and kindly contributed to R-bloggers)

# No “That” Left Behind?

I came across a post on Language Log last week giving some evidence that Obama tends to add *that* to the prepared version of his speeches.

For example, in a recent speech at George Washington University, the prepared speech was written as

It’s about the kind of future we want. It’s about the kind of country we believe in.

but Obama spoke the two sentences as

It’s about the kind of future *that* we want. It’s about the kind of country *that* we believe in.

(Amusingly, Liberman has the intuition that *that*-omission adds informality, while I have the opposite intuition.)

I wanted to get more data to test whether Obama really does add *that* to his speeches, and to see whether his frequency of *that*-addition depends on the audience (e.g., maybe more formal speeches get less *that*-addition than rallies). So I scraped the White House website for speeches.

# Data

The Speeches & Remarks section has transcripts of Obama’s speeches as he actually delivered them, so I pulled the text from the 13 most recent for an “as delivered” dataset.

The Weekly Address section, on the other hand, has the *as-prepared transcripts* of Obama’s speeches, so I used the 11 most recent as an “as prepared” dataset.

Using this data, we can test whether the frequency of *that* in Obama’s delivered speeches differs from the frequency of *that* in Obama’s prepared weekly addresses.

[Note, though, that I’m not distinguishing between the use of *that* to introduce a relative clause (which is what the Language Log post focuses on, and which I’m interested in) and the use of *that* for other purposes (e.g., as a demonstrative). A quick hand-check of both datasets suggested that almost all uses of *that* are for the former purpose, so hopefully this won’t matter too much.]
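The post doesn't show the counting code, but a minimal sketch of how the per-speech proportions below could be computed looks something like this (the tokenization rule — split on anything that isn't a letter or apostrophe — is my own assumption):

```r
# Sketch: proportion of the word "that" among all words in a transcript.
# Tokenization rule is an assumption; the original post doesn't specify one.
that_proportion = function(text) {
  words = tolower(unlist(strsplit(text, "[^A-Za-z']+")))
  words = words[words != ""]  # drop empty tokens from leading punctuation
  mean(words == "that")
}

that_proportion("It's about the kind of future that we want.")  # 1 of 9 words
```

Applied to each scraped transcript, this yields one proportion per speech, which is what the tables below report.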

# That-Frequencies

Here are the proportions of *that* in Obama’s delivered remarks:

```
delivered-remarks 2.59%
delivered-remarks 3.34%
delivered-remarks 2.35%
delivered-remarks 2.36%
delivered-remarks 1.98%
delivered-remarks 3.23%
delivered-remarks 3.27%
delivered-remarks 2.43%
delivered-remarks 2.29%
delivered-remarks 3.04%
delivered-remarks 1.81%
delivered-remarks 2.41%
delivered-remarks 2.40%
```

And here are the proportions in his prepared addresses:

```
prepared-addresses 1.92%
prepared-addresses 1.47%
prepared-addresses 1.74%
prepared-addresses 1.58%
prepared-addresses 0.88%
prepared-addresses 0.73%
prepared-addresses 1.40%
prepared-addresses 0.98%
prepared-addresses 2.11%
prepared-addresses 1.94%
prepared-addresses 1.87%
```

Just by eyeballing, it’s pretty evident that Obama’s delivered remarks have a higher proportion of *that*, and we have enough data that a formal hypothesis test probably isn’t necessary. But just for kicks, let’s do one anyway.

# Bayesian Confidence Intervals

Instead of going the standard frequentist route of performing a chi-square test or t-test, let’s go the Bayesian route instead.

## Beta, Bayes, Barack

Let’s recall how to calculate a Bayesian confidence interval (aka, a credible interval).

First, we use Bayes’ Theorem to calculate P(*that*-frequency in Obama’s delivered remarks is q | data):

P(q | data) ∝ P(data | q) · P(q)

**First term on the right**: If we pool the dataset together, so that the “as delivered” dataset has k occurrences of *that* out of n total words, then P(data | q) ∝ q^k (1 − q)^(n − k).

**Second term on the right**: If we place an uninformative Beta(1, 1) (i.e., uniform) prior on q, then our posterior distribution is P(q | data) ~ Beta(k + 1, n − k + 1).

Completely analogously, by placing an uninformative prior on the frequency r of *that* in Obama’s prepared addresses, we get a posterior distribution P(r | data) ~ Beta(k′ + 1, n′ − k′ + 1).
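As a quick numeric sanity check of this conjugate update (with toy counts, not the real data): the posterior mean (k + 1)/(n + 2) should match the mean of the Beta(k + 1, n − k + 1) density and sit close to the raw rate k/n.

```r
# Toy conjugate-update check: uniform Beta(1, 1) prior, k "that"s out of n words.
k = 30
n = 1000
posterior_mean = (k + 1) / (n + 2)  # mean of Beta(k + 1, n - k + 1)
posterior_mean                      # ~0.0309, close to the raw rate k/n = 0.03
```

For large n the +1 and +2 pseudo-counts from the uniform prior barely matter, which is why the posterior parameters below track the raw counts so closely.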

## Applied to our Datasets

Our delivered dataset had 1,239 instances of *that* out of 49,457 total words, and our prepared dataset had 112 instances of *that* out of 8,301 total words, so our posterior distributions are

- P(q | delivered data) ~ Beta(1240, 48219)
- P(r | prepared data) ~ Beta(113, 8190)

Here’s what these distributions look like, along with some R + ggplot2 code for generating them:

```
library(ggplot2)
x = seq(0, 1, by = 0.0001)
y_delivered = dbeta(x, 1240, 48219)
y_prepared = dbeta(x, 113, 8190)
qplot(x, y_delivered, geom = "line", main = "P(that-frequency in delivered data = q | delivered data) ~ Beta(1240, 48219)", xlab = "q", ylab = "density")
qplot(x, y_prepared, geom = "line", main = "P(that-frequency in prepared data = r | prepared data) ~ Beta(113, 8190)", xlab = "r", ylab = "density")
```

And together on the same plot:

```
d = data.frame(x = c(x, x), y = c(y_delivered, y_prepared), which = rep(c("delivered", "prepared"), each = length(x)))
qplot(x, y, colour = which, data = d, geom = "line", xlim = c(0, 0.04), ylab = "density")
```

As we can see, the distributions are pretty much entirely disjoint, confirming our earlier suspicions that there’s a distinct difference between the *that*-frequency of our two datasets.

## Confidence in the Difference

What we really want, though, is the probability distribution of the *difference* q − r of the two Beta-distributed frequencies, not the individual Beta distributions themselves. The difference of two Beta distributions doesn’t have a closed form, so we use a simulation to calculate the probabilities we need:

```
delivered_sim = rbeta(10000, 1240, 48219)
prepared_sim = rbeta(10000, 113, 8190)
diff = delivered_sim - prepared_sim
qplot(diff, geom = "density")
mean(diff) # 0.01147737
mean(diff < 0) # 0, our estimate of P(q - r < 0)
quantile(diff, c(0.025, 0.975)) # 0.008461888 0.014222934
```

We see that P(q − r < 0) is effectively 0, so we’re quite confident that q > r. Furthermore, we have E[q − r] ≈ 0.0115, and a 95% credible interval for q − r is (0.0085, 0.0142).
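As a cross-check on the simulation, P(q > r) can also be computed by one-dimensional numerical integration: P(q > r) = ∫ dbeta(x; 1240, 48219) · pbeta(x; 113, 8190) dx. The integration limits below just bracket the narrow region where the delivered posterior has essentially all of its mass:

```r
# P(q > r) = integral over x of [density of q at x] * [P(r < x)].
# Beta(1240, 48219) is concentrated around 0.025 (sd ~ 0.0007), so
# integrating over (0.02, 0.03) captures essentially all of its mass.
p_q_greater = integrate(
  function(x) dbeta(x, 1240, 48219) * pbeta(x, 113, 8190),
  lower = 0.02, upper = 0.03
)$value
p_q_greater  # essentially 1, agreeing with the simulation
```

The agreement with the Monte Carlo estimate is reassuring, and the integral avoids simulation noise entirely.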

# Hierarchical Models

In the analysis above, we pooled the documents in each dataset, treating all the delivered speeches as essentially one giant delivered speech and likewise for the prepared transcripts. We also ignored the fact that the two datasets had something in common, namely, that they both deal with Obama.

This was fine for our problem, but sometimes we don’t want to ignore these relationships. So instead, we could have built our model as follows:

- We can imagine that each of the delivered speeches has a slightly different *that*-frequency (due to, say, variations in the topic being discussed), but that these frequencies are related in some way. We can model this by saying that each individual delivered speech has an individual *that*-frequency q_i drawn from a common distribution, say q_i ~ Beta(α_d, β_d). This allows the *that*-frequencies of each delivered speech to differ, while still linking them with an overall structure.
- Similarly, we can model each prepared transcript as having an individual *that*-frequency r_j drawn from a separate common distribution, say r_j ~ Beta(α_p, β_p).
- Next, we might want to link the parameters of our Beta distributions (α_d, β_d, α_p, β_p), so we could model them as coming from common Gamma distributions, say one Gamma prior for the α’s and another for the β’s.

This gives us a more complex *hierarchical model*. I’ll leave it at that for now, but perhaps I’ll discuss hierarchical models some more in a future post.
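To make the generative story concrete, here is a simulation sketch of that hierarchical model for the delivered speeches. The Gamma hyperparameters are invented purely for illustration (chosen so the simulated frequencies land near the ~2.5% range seen above); actually fitting them to data would require MCMC or similar.

```r
# Generative sketch of the hierarchical model (hyperparameters are illustrative).
set.seed(1)
n_speeches = 13

# Hyperpriors linking the Beta parameters across speeches.
alpha_d = rgamma(1, shape = 5, rate = 0.1)     # E[alpha_d] = 50
beta_d  = rgamma(1, shape = 5, rate = 0.0025)  # E[beta_d]  = 2000

# Each delivered speech gets its own that-frequency from a common Beta...
q = rbeta(n_speeches, alpha_d, beta_d)

# ...and the observed that-counts are Binomial draws given each speech's length.
word_counts = round(runif(n_speeches, 2000, 6000))
that_counts = rbinom(n_speeches, size = word_counts, prob = q)

data.frame(words = word_counts, thats = that_counts, freq = that_counts / word_counts)
```

Running the model in the other direction — inferring the per-speech q_i and the shared hyperparameters from the observed counts — is exactly what packages like rstan or JAGS are built for.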
