About the book
This book applies tidy data principles to text analysis. Its aim is to present tools that make many text mining tasks easier, more effective, and consistent with tools already in use. In particular, it presents the tidytext R package.
The authors of this beautiful exposition of methodology and coding are Julia Silge and David Robinson. Kudos to both of them. In particular, I’ve been following Julia’s blog posts for the last two years and using them as a reference to teach R in my courses.
Table of contents
List of chapters:
- The tidy text format
- Sentiment analysis with tidy data
- Analyzing word and document frequency: tf-idf
- Relationships between words: n-grams and correlations
- Converting to and from non-tidy formats
- Topic modeling
- Case study: comparing Twitter archives
- Case study: mining NASA metadata
- Case study: analyzing usenet text
Remarkable contributions of this book
In my opinion, chapter 5 is one of the best expositions of data structures in R. Using modern R packages such as tidytext, among others, the authors move between tidy and non-tidy data structures such as VCorpus, presenting a set of good practices in R along the way, and they include ggplot2 charts that make concepts such as sentiment analysis clear.
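To give a flavour of the kind of conversion chapter 5 deals with, here is a minimal sketch of a round trip between a tidy table and a document-term matrix. It is not taken from the book: the toy corpus and the column names (doc, text, word) are my own assumptions, and it assumes the dplyr, tidytext and tm packages are installed.

```r
library(dplyr)
library(tidytext)

# Toy corpus (hypothetical data, not from the book)
docs <- tibble(
  doc  = c("a", "b"),
  text = c("tidy data makes text mining easier",
           "text mining with tidy tools")
)

# Tidy format: one token per row, with a count per document and word
tidy_words <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc, word)

# Tidy -> non-tidy: cast to a DocumentTermMatrix (needs the tm package)
dtm <- tidy_words %>%
  cast_dtm(doc, word, n)

# Non-tidy -> tidy: back to one row per document-term pair
tidy_again <- tidy(dtm)
```

The same tidy() generic is what the book applies to other non-tidy objects, so the pattern above should carry over, although the details depend on the class of the object.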
If you often hear colleagues saying that R syntax is awkward, show this material to them. People who last used R five or more years ago will probably be amazed to see how the %>% operator is used here.
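As a small illustration of why the pipe makes code easier to read, the sketch below counts words twice, first with nested calls and then with %>%. The data are a made-up example of mine, not from the book, and the snippet assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data: one word per row
words <- tibble(word = c("tidy", "text", "tidy", "data", "text", "tidy"))

# Nested calls have to be read from the inside out
nested <- arrange(count(words, word), desc(n))

# With the pipe, the same steps read top to bottom
piped <- words %>%
  count(word) %>%
  arrange(desc(n))
```

Both versions return the same table; the piped one simply lists the steps in the order they happen.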
Text analysis requires working with a variety of tools, many of which have inputs and outputs that aren’t in a tidy form. What the authors present here is a noble and remarkable piece of work.