Stylometry: Identifying authors of texts using R

December 2, 2016

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Few people expect politicians to write every word they utter themselves; reliance on speechwriters and spokespersons is a long-established political practice. Still, it's interesting to know which statements are truly the politician's own words, and which are driven primarily by advisors or influencers.

Recently, David Robinson established a way of figuring out which tweets from Donald Trump's Twitter account came from him personally, as opposed to his campaign staff, which he verified by comparing the sentiment of tweets from Android vs iPhone devices. Now, Ali Arsalan Kazmi has used stylometric analysis to investigate the provenance of speeches by the Prime Minister of Pakistan.

By examining aspects of linguistic style (word/sentence length, frequency of word pairings, use of punctuation, etc.) in the speeches of Prime Minister Nawaz Sharif, Ali found suggestions of at least two authors (and possibly more) behind the speeches. This is particularly apparent in this consensus network of appearances of 4-character sequences in speeches, which divides them into two clusters (of possibly differing authorship).
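To make the idea of clustering on 4-character sequences concrete, here is a minimal sketch in base R. It is not Ali's script and does not build a consensus network: it swaps in plain hierarchical clustering on character 4-gram frequencies. The four "speeches" are made-up toy strings, and the helper name char_ngrams is hypothetical.

```r
# Toy corpus: four short, made-up texts (hypothetical data, not the real speeches).
speeches <- c(
  a1 = "the people of this nation deserve peace and the people will have peace",
  a2 = "the people of this land deserve progress and the people will have progress",
  b1 = "economic growth requires investment, investment requires stability, stability requires trust",
  b2 = "economic reform requires investment, investment requires patience, patience requires trust"
)

# Split a string into overlapping character n-grams (hypothetical helper).
char_ngrams <- function(text, n = 4) {
  text <- tolower(text)
  if (nchar(text) < n) return(character(0))
  starts <- seq_len(nchar(text) - n + 1)
  substring(text, starts, starts + n - 1)
}

# Relative frequency of each 4-gram within each text.
profiles <- lapply(speeches, function(s) {
  tab <- table(char_ngrams(s, n = 4))
  tab / sum(tab)
})

# Arrange the profiles in a text-by-4-gram matrix, with 0 for unseen 4-grams.
vocab <- sort(unique(unlist(lapply(profiles, names))))
freq <- t(sapply(profiles, function(p) {
  v <- setNames(numeric(length(vocab)), vocab)
  v[names(p)] <- as.numeric(p)
  v
}))

# Manhattan distance between profiles, then average-linkage clustering.
d <- dist(freq, method = "manhattan")
hc <- hclust(d, method = "average")

# Cut the tree into two groups, mirroring the "two clusters" reading above.
clusters <- cutree(hc, k = 2)
print(clusters)
```

On this toy input, the two "a" texts share most of their 4-grams with each other and almost none with the "b" texts, so they end up in separate groups. Real speeches are far noisier, which is one reason an analysis like Ali's relies on resampling and a consensus view rather than a single tree.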


Ali used R and several packages to perform the analysis. These included the openNLP package to extract attributes from the speech data, the stylo package for stylometric analysis, the fpc package for the clustering, and the igraph package to visualize the clusters. The complete R script used for the analysis is available on GitHub.

For an overview of the analysis, check out this slide presentation by Ali, and for the complete details take a look at the blog post linked below.

A Blog On Data Analytics: How many Authors does the Prime Minister have for his speeches: A Stylometric Analysis

