# Tracking Social Issues and Topics in Presidential Speeches

October 22, 2015
(This article was first published on StatOfMind, and kindly contributed to R-bloggers)

## Scraping presidential transcripts

To begin, we must scrape the content of all presidential speeches recorded in American history. To do that, I'll rely on the very handy BeautifulSoup library, and eventually store all the data in a pandas dataframe that will be persisted to a pickle file.
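The scraping step can be sketched as follows. The post does not show the source site, so the URL handling and the `<h1>`/`<p>` selectors below are placeholders to be adapted to the transcript site's actual markup; only the BeautifulSoup-to-pandas-to-pickle flow is the point.

```python
import pickle

import pandas as pd
import requests
from bs4 import BeautifulSoup


def parse_speech(html):
    """Extract the title and raw text of one transcript page.

    The <h1>/<p> structure is a placeholder -- adjust the selectors
    to whatever markup the transcript site actually uses.
    """
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1").get_text(strip=True)
    text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return {"title": title, "text": text}


def scrape_all(urls, out_path="speeches.pkl"):
    """Fetch every transcript, build a dataframe, and persist it."""
    records = [parse_speech(requests.get(url).text) for url in urls]
    df = pd.DataFrame(records)
    with open(out_path, "wb") as f:  # persist for the later modeling steps
        pickle.dump(df, f)
    return df
```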

## Topic modeling and visualization

Now that the raw text of all presidential speeches in American history has been retrieved, we can proceed to light preprocessing before applying Latent Dirichlet Allocation.

In the following five cells, we tokenize each document (i.e. presidential speech), remove stopwords, compute the frequency of each token, and filter out all tokens that appear fewer than 10 times in the entire corpus of presidential speeches. Note that I used an ad-hoc threshold of 10; this is really a parameter worth experimenting with. Also, the amount of processing applied to each document is intentionally simplistic. Finally, we set up gensim-specific objects: a dictionary that maps words to integer ids, and a corpus that counts the occurrences of each distinct word, converts each word to its integer id, and returns the result as a sparse vector.

## Finding the Optimum Number of Topics

Now that the data is ready, we can run batch LDA (a reasonable choice given the small size of the dataset we are working with) to discover the main topics in our documents.

0 --- 0.025*states + 0.017*united + 0.017*shall + 0.014*state + 0.010*constitution + 0.009*president + 0.009*act + 0.009*congress + 0.008*laws + 0.007*law


1 --- 0.026*government + 0.007*states + 0.007*chilean + 0.007*men + 0.006*sailors + 0.006*united + 0.005*mr + 0.005*german + 0.005*police + 0.004*vessels


2 --- 0.012*world + 0.011*peace + 0.009*people + 0.008*america + 0.008*freedom + 0.007*soviet + 0.006*united + 0.006*new + 0.005*states + 0.005*nations

3 --- 0.050*president + 0.030*mr + 0.024*think + 0.008*secretary + 0.008*general + 0.008*people + 0.007*time + 0.007*viet + 0.007*going + 0.007*nam

4 --- 0.011*government + 0.007*people + 0.007*business + 0.006*country + 0.005*economic + 0.005*congress + 0.005*world + 0.005*federal + 0.005*tax + 0.005*public

5 --- 0.006*people + 0.004*government + 0.004*united + 0.004*states + 0.003*country + 0.003*public + 0.003*congress + 0.003*question + 0.003*going + 0.002*time

6 --- 0.013*states + 0.012*government + 0.009*united + 0.008*congress + 0.007*public + 0.005*country + 0.005*great + 0.005*year + 0.004*general + 0.004*people

7 --- 0.011*peace + 0.010*vietnam + 0.010*people + 0.009*war + 0.009*world + 0.008*united + 0.007*south + 0.007*american + 0.007*nations + 0.006*states

8 --- 0.009*world + 0.009*congress + 0.008*new + 0.008*year + 0.007*america + 0.006*people + 0.006*energy + 0.006*american + 0.006*nation + 0.005*government

9 --- 0.014*government + 0.012*people + 0.007*states + 0.007*union + 0.007*constitution + 0.006*great + 0.006*shall + 0.006*men + 0.006*country + 0.005*free

The display of inferred topics shown above does not lend itself very well to interpretation. Aside from the fact that you have to read through all of the topics, most people will interpret the main themes of each topic differently. This hits right at the core of my mixed feelings towards topic modeling: the ability to infer topics from a large set of documents is truly amazing, but I have always personally felt (and maybe that is just me) that the ensuing display of information was lacking. Indeed, I have found that the output of typical topic modeling techniques does not lend itself very well to visualization and, in the case of presentations to the uninitiated, interpretation.

However, I recently came across the LDAvis R library developed by Kenny Shirley and Carson Sievert, which, to paraphrase their words, is a D3.js interactive visualization designed to help you interpret the topics in a topic model fitted to a corpus of text using LDA. Here, we use the great Python port of the LDAvis library, pyLDAvis, available on GitHub at https://github.com/bmabey/pyLDAvis. Two attractive features of pyLDAvis are its ability to help interpret the topics extracted from a fitted LDA model, and the fact that it can be incorporated within an IPython notebook in nothing more than two lines of code!

## Tracking and visualizing topic propensity over time

Now that we have shown how results gathered from topic modeling methods such as LDA can be visualized in an intuitive way, we can move on to additional data analysis. In particular, it would be interesting to uncover the temporal variation of topics across American history. I would personally be very curious to find out whether topic modeling can reverse-engineer the major events in American history. In the next step, we produce a dataframe in which each row represents a speech and each of the 20 columns represents a topic. Each cell in the dataframe holds the probability that the given topic was assigned to that speech.

[(7, 0.10493554997876908), (10, 0.011621459517891617), (15, 0.86674743636700446)]

                              topic_0   topic_1   topic_2   topic_3   topic_4  \
lincoln|July 4, 1861         0.000011  0.000011  0.000011  0.011112  0.000011
buchanan|February 24, 1859   0.000033  0.000033  0.000033  0.203170  0.000033
reagan|November 11, 1988     0.000526  0.197846  0.000526  0.000526  0.000526
tyler|February 20, 1845      0.000114  0.000114  0.000114  0.000114  0.000114
eisenhower|January 17, 1961  0.000063  0.000063  0.000063  0.000063  0.000063

                              topic_5   topic_6   topic_7   topic_8   topic_9  \
lincoln|July 4, 1861         0.008504  0.000011  0.064621  0.051711  0.752269
buchanan|February 24, 1859   0.000033  0.000033  0.002136  0.249423  0.032370
reagan|November 11, 1988     0.453219  0.000526  0.000526  0.000526  0.000526
tyler|February 20, 1845      0.000114  0.000114  0.014633  0.361549  0.128588
eisenhower|January 17, 1961  0.615735  0.000063  0.000063  0.000063  0.000063

                             ...    topic_12  topic_13  topic_14  topic_15  \
lincoln|July 4, 1861         ...    0.000011  0.000011  0.000011  0.111633
buchanan|February 24, 1859   ...    0.000033  0.000033  0.006279  0.490955
reagan|November 11, 1988     ...    0.101381  0.239133  0.000526  0.000526
tyler|February 20, 1845      ...    0.000114  0.000114  0.000114  0.493404
eisenhower|January 17, 1961  ...    0.000063  0.074416  0.000063  0.000063

                             topic_16  topic_17  topic_18  topic_19  \
lincoln|July 4, 1861         0.000011  0.000011  0.000011  0.000011
buchanan|February 24, 1859   0.015235  0.000033  0.000033  0.000033
reagan|November 11, 1988     0.000526  0.000526  0.000526  0.000526
tyler|February 20, 1845      0.000114  0.000114  0.000114  0.000114
eisenhower|January 17, 1961  0.010800  0.000063  0.000063  0.000063

                              president   year
lincoln|July 4, 1861            lincoln   1861
buchanan|February 24, 1859     buchanan   1859
reagan|November 11, 1988         reagan   1988
tyler|February 20, 1845           tyler   1845
eisenhower|January 17, 1961  eisenhower   1961

[5 rows x 22 columns]


Finally, we can compute the normalized frequency of each topic by year and plot these as a time series using the dygraphs library.
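The aggregation itself can be done in pandas before handing off to the plotting library; a minimal sketch, assuming `df` is the document-topic dataframe built above:

```python
import pandas as pd


def topic_share_by_year(df):
    """Sum topic probabilities per year, then normalize each row to 1."""
    topic_cols = [c for c in df.columns if c.startswith("topic_")]
    yearly = df.groupby("year")[topic_cols].sum()
    # Divide each row by its total so topic shares per year sum to 1.
    return yearly.div(yearly.sum(axis=1), axis=0)
```

The resulting frame (one row per year, one column per topic) is exactly the shape a time-series plotting library expects.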

At this point, I'm going to do something that I am not very proud of and proceed to some nasty context switching. Although I played around with the charts library, I was not satisfied with the results and temporarily switched to R in order to leverage the dygraphs library. Thankfully, Jupyter notebooks have plenty of magics that make it easy to call R from the notebook itself!

## Clustering individual presidential speeches

We can also wrangle the data a little more in order to visualize how individual speeches cluster together. This time, we take the document-topic distributions and apply the t-SNE dimensionality reduction algorithm to map all speeches into two-dimensional space. Roughly speaking, t-SNE is considered useful because it tends to preserve the local structure of the data, so that neighboring (i.e. similar) speeches will hopefully be mapped to neighboring locations in two-dimensional space. Other well-known techniques such as k-means or MDS would likely be just as adequate for this exercise, but I've had good fortune with t-SNE in the past, so am (perhaps unwisely) sticking with it here.
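A sketch of the projection step using scikit-learn, assuming `doc_topics` is the (speeches x topics) matrix of probabilities from the dataframe above; the function name and `perplexity` default are illustrative.

```python
import pandas as pd
from sklearn.manifold import TSNE


def project_2d(doc_topics, labels, perplexity=30, random_state=0):
    """Map document-topic vectors to 2-D with t-SNE, keeping speech labels."""
    tsne = TSNE(n_components=2, perplexity=perplexity,
                random_state=random_state)
    coords = tsne.fit_transform(doc_topics)
    # Index by the "president|date" labels so points stay identifiable.
    return pd.DataFrame(coords, index=labels)
```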

t-SNE: 13 sec

                                     0         1
lincoln|July 4, 1861         12.246211 -4.594903
buchanan|February 24, 1859   13.982249 -1.675186
reagan|November 11, 1988     -7.665759  4.714818
tyler|February 20, 1845      11.953091  1.884652
eisenhower|January 17, 1961 -13.193183 -3.790267


We can now leverage the mpld3 library to display the t-SNE clusters inline. The interactive figure below shows the two-dimensional t-SNE coordinates of all 880 presidential speeches in American history. One of the challenges here was to generate distinct colors to map to the different presidents, and I don't think I did a particularly good job of it (the figure could probably benefit from a legend too, but I opted to spend my time adding tooltip functionality instead!)