R is primarily a language for working with numbers, but we often need to work with text as well. Whether it’s formatting text for reports, or analyzing natural language data, R provides a number of facilities for working with character data. Handling Strings with R, a free (CC-BY-NC-SA) e-book by UC Berkeley’s Gaston Sanchez, provides an overview of the ways you can manipulate characters and strings with R.
There are many useful sections in the book; a few highlights include:
- C-style formatting: very useful for preparing tabular data for reports
- String manipulation with the stringr package: provides some welcome consistency in handling strings with R
- Regular expressions: the savior and/or curse for many data extraction problems
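To give a flavor of these three topics, here is a minimal sketch in R. It is not taken from the book; the data and the variable names (`items`, `prices`, `notes`) are made up for illustration, and the stringr step only runs if that package happens to be installed.

```r
# Illustrative data only; these names and values are assumptions for the sketch.
items  <- c("apple", "kiwi", "watermelon")
prices <- c(1.5, 0.25, 12)

# C-style formatting: left-justify the name in 12 characters,
# right-justify the price in 8 characters with 2 decimal places.
rows <- sprintf("%-12s%8.2f", items, prices)
cat(rows, sep = "\n")

# Regular expressions: pull the first run of digits out of each string.
# regmatches() silently drops elements that have no match.
notes   <- c("order 66 shipped", "no number here", "batch 2024-A")
matches <- regmatches(notes, regexpr("[0-9]+", notes))
print(matches)

# The same pattern used with stringr, if the package is installed.
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_detect(notes, "[0-9]+"))
}
```

Because `sprintf()` is vectorized, one call formats every row, and each row comes out 20 characters wide so the columns line up in a fixed-width font. Note that `matches` has only two elements, since the middle string contains no digits.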
Note that the book does not cover analysis of natural language data; for that, you might want to check out the CRAN Task View on Natural Language Processing or the book Text Mining with R: A Tidy Approach. It’s also sadly silent on character encoding in R, which often causes problems when dealing with text data, especially from international sources. Nonetheless, the book is a really useful overview of working with text in R, and has been updated extensively since the version published in 2014. You can read Handling Strings with R at the link below.
Gaston Sanchez: Handling Strings with R