Jilia Silge and David Robinson are both dab hands at using R to analyze text, from tracking the happiness (or otherwise) of Jane Austen characters, to identifying whether Trump's tweets came from him or a staffer. If you too would like to be able to make statistical sense of masses of (possibly messy) text data, check out their book Tidy Tidy Text Mining with R, available free online and soon to be published by O'Reilly.
The book builds on the tidytext package (to which Gabriela De Queiroz also contributed) and describes how to handle and analyze text data. The “tidy text” of the title refers to a standardized way of handling text data, as a simple table with one term per row (where a “term” may be a word, collection of words, or sentence, depending on the application). Julia gave several examples of tidy text in her recent talk at the RStudio conference:
Once you have text data in this “tidy” format, you can apply a vast range of statistical tools to it, by assigning data values to the terms. For example, can use sentiment analysis tools to quantify terms by their emotional content, and analyze that. You can compare rates of term usage, such as between chapters or to compare authors, or simply create a word cloud of terms used. You coyld use topic modeling techniques, to classify a collection of documents into like kinds.
There are a wealth of data sources you can use to apply these techniques: documents, emails, text messages … anything with human-readable text. The book includes examples of analyzing works of literature (check out the janesustenr and guternbergr packages), downloading Tweets and Usenet posts, and even shows how to use metadata (in this case, from NASA) as the subject of a text analysis. But it's just as likely you have data of your own to try tidy text mining with, so check out Tidy Text Mining with R and to get started.