Comparing subreddits, with Latent Semantic Analysis in R

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

FiveThirtyEight published a fascinating article this week about the subreddits that provided support to Donald Trump during his campaign, and continue to do so today. Reddit, for those not in the know, is a popular online social community organized into thousands of discussion topics, called subreddits (the names all begin with “r/”). Most of the subreddits are useful forums for interesting discussions by like-minded people, and some of them are toxic. (That toxicity extends to some of the names, which is reflected in some of the screenshots below; apologies in advance.) The article looks at various popular and notorious subreddits and finds those that are most similar to the main subreddit devoted to Donald Trump, and also to those devoted to the other main contenders in the 2016 campaign for president, Hillary Clinton and Bernie Sanders.


The underlying method used to compare subreddits for this purpose is quite ingenious. It's based on a concept you might call “subreddit algebra”: you can “add” two subreddits and find a third that reflects the intersection of the two. (One example they give: adding r/nba to r/minnesota gives you r/timberwolves, the subreddit for Minnesota's NBA team.) Then you can apply the same process to subtraction: if you remove all the posts like those in the mainstream r/politics site from those in r/The_Donald, you're left with posts that look like those in several toxic subreddits.
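To make the add-and-subtract idea concrete, here is a minimal sketch in base R. It assumes each subreddit has already been reduced to a numeric vector; the five vectors below are made-up toy numbers (not real Reddit data, and not the FiveThirtyEight code), and the helpers `cosine_sim()` and `nearest()` are hypothetical names introduced just for this illustration.

```r
# Toy, made-up feature vectors for five subreddits.
# The three columns are invented dimensions, not real Reddit data.
vecs <- rbind(
  nba          = c(0.9, 0.1, 0.0),
  minnesota    = c(0.1, 0.9, 0.0),
  timberwolves = c(0.7, 0.7, 0.1),
  politics     = c(0.0, 0.2, 0.9),
  cooking      = c(0.1, 0.1, 0.2)
)

# Cosine similarity between two numeric vectors
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a * a) * sum(b * b))

# Rank every row of `m` by cosine similarity to `query`,
# leaving out the rows named in `exclude`
nearest <- function(query, m, exclude = character()) {
  sims <- apply(m, 1, cosine_sim, b = query)
  sims <- sims[!(names(sims) %in% exclude)]
  sort(sims, decreasing = TRUE)
}

# "Addition": r/nba + r/minnesota
added <- vecs["nba", ] + vecs["minnesota", ]
rank_added <- nearest(added, vecs, exclude = c("nba", "minnesota"))

# "Subtraction": r/timberwolves - r/minnesota
subtracted <- vecs["timberwolves", ] - vecs["minnesota", ]
rank_subtracted <- nearest(subtracted, vecs,
                           exclude = c("timberwolves", "minnesota"))

print(round(rank_added, 3))
print(round(rank_subtracted, 3))
```

With these toy numbers, the closest match to the sum is timberwolves and the closest match to the difference is nba. The excluded rows are dropped because the inputs to a sum or difference tend to rank near the top themselves.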

The statistical technique used to identify posts that are “similar” to one another is Latent Semantic Analysis, and the article gives this nice illustration of using it to compare subreddits:


The analysis was performed in R, and the code is available on GitHub. The code makes heavy use of the lsa package for R, which provides a number of functions for performing latent semantic analysis. The triangular plot shown above, known as a ternary diagram, was created using the ggtern package.
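If you're curious what latent semantic analysis does under the hood, here is a rough sketch using only base R's `svd()` rather than the lsa package. It is not the FiveThirtyEight code: the term-by-document matrix is a made-up toy example, and `cosine_sim()` is a hypothetical helper. The idea is to keep only the top `k` singular dimensions and compare documents in that reduced space.

```r
# Toy term-by-document count matrix (made-up numbers, not Reddit data).
# Rows are terms, columns are documents.
tdm <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 3, 1,
    0, 0, 1, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("basketball", "playoffs", "election", "senate"),
    c("doc1", "doc2", "doc3", "doc4")
  )
)

k <- 2          # number of latent dimensions to keep
s <- svd(tdm)   # tdm = U %*% diag(d) %*% t(V)

# Document coordinates in the k-dimensional latent space: rows of V_k %*% D_k
doc_coords <- s$v[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k)
rownames(doc_coords) <- colnames(tdm)

# Cosine similarity between two numeric vectors
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a * a) * sum(b * b))

# Similarity of doc1 and doc2 on the raw counts: exactly 0.8 for this matrix
sim_raw <- cosine_sim(tdm[, "doc1"], tdm[, "doc2"])

# Similarities in the latent space
sim_latent <- cosine_sim(doc_coords["doc1", ], doc_coords["doc2", ])
sim_across <- cosine_sim(doc_coords["doc1", ], doc_coords["doc3", ])

print(round(c(raw = sim_raw, latent = sim_latent, across = sim_across), 3))
```

In this toy case the two sports documents have a cosine similarity of 0.8 on the raw counts, but land on essentially the same point in the two-dimensional latent space (similarity close to 1), while a sports document and a politics document stay close to 0. Real data is far messier, and choosing `k` is a judgment call.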

For the complete subreddit analysis, and the list of subreddits close to Donald Trump based on the analysis, check out the FiveThirtyEight article linked below.

FiveThirtyEight: Dissecting Trump's Most Rabid Following

