Learn about handling character data in R with this free e-book

March 21, 2014

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Most people know R as a statistics/analytics language for analysis of quantitative data, and don't think of it as a tool for processing raw text. But R actually has some quite powerful facilities for processing character data. And as Gaston Sanchez learned, text manipulation is an important part of a modern data scientist's repertoire:

Many years ago I decided to apply for a job in a company that developed data mining applications for big retailers. I was invited for an on-site visit and I went through the typical series of interviews with the members of the analytics team. Everything was going smoothly and I was enjoying all the conversations. Then it came turn to meet the computer scientist. After briefly describing his role in the team he started asking me a bunch of technical questions and tests. Although I was able to answer those questions related with statistics and multivariate analysis, I had a really hard time trying to answer a series of questions related with string manipulations.

I will remember my interview with that guy as one of the most embarrassing moments of my life. That day, the first thing I did when I went back home was to open my laptop, launch R, and start reproducing the tests I failed to solve. It didn't take me that much to get the right answers. Unfortunately, it was too late and the harm was already done. Needless to say I wasn't offered the job. That shocking experience showed me that I was not prepared for manipulating character strings. I felt so bad that I promised myself to learn the basics of strings manipulation and text processing. "Handling and Processing Strings in R" is one of the derived results of that old promise.

Gaston's Creative Commons-licensed 112-page e-book, Handling and Processing Strings in R, is an excellent and comprehensive review of R's string handling capabilities. It covers R's basic string-handling capabilities (reading, converting, manipulating and formatting), and also devotes a chapter to the higher-level functions of Hadley Wickham's stringr package. The two chapters on regular expressions are a must-read for anyone who hasn't yet come to grips with the power of regexes for handling string-based data. There are a few practical examples at the end of the e-book (frequency counting, word clouds), but the book sticks mainly with the fundamentals and doesn't stray into semantic analysis. Highly recommended for anyone working with strings or character data in R.
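To give a flavor of the kind of operations the book covers, here is a minimal sketch in base R (no packages required). It is not taken from the e-book: the sample sentence and all variable names are made up for illustration, and it touches only on case conversion, regex replacement, splitting, frequency counting and formatting.

```r
# Illustrative only: the sentence and variable names are assumptions,
# not examples from the e-book.
text <- "The quick brown fox jumps over the lazy dog and the fox sleeps"

# Basic manipulation: case conversion and character count
lower <- tolower(text)
n_chars <- nchar(text)

# Regular expressions: replace every occurrence of "fox" with "cat"
swapped <- gsub("fox", "cat", lower)

# Split on runs of whitespace, then count word frequencies
words <- strsplit(lower, "[[:space:]]+")[[1]]
freq <- sort(table(words), decreasing = TRUE)

# Formatting: build a one-line summary with sprintf()
summary_line <- sprintf("%d characters, %d distinct words",
                        n_chars, length(freq))

print(head(freq, 3))
print(summary_line)
```

The stringr package wraps similar operations in a more consistent interface, which is what the book's stringr chapter explores.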

Gaston Sanchez: Handling and Processing Strings in R (via Sharon Machlis)
