tolower() – error catching unmappable characters

January 6, 2013

(This article was first published on minimalR.com » r-bloggers, and kindly contributed to R-bloggers)

The tolower() function throws an error when its input contains characters it cannot convert, which is a common occurrence when analysing social media data that includes emoticons.

Emoticons are those symbols that are commonly used on mobile phones but aren’t always recognised on all platforms.

For example, when converting tweets sent to @delta (Delta Air Lines) to lower case, I got the following error:

Error in tolower(text) :
invalid input '@ActualALove: First time I've seen a foot-rest in first class! Oh @Delta, how I love thee \ud83d\ude0a✈\ud83d\udc78 http://t.co/noKI9CiM' in 'utf8towcs'

When I looked up the actual tweet, it looked like this.

[Image: 20130106-194554.jpg]

The two Unicode characters that weren’t recognised were \ud83d\ude0a (SMILING FACE WITH SMILING EYES) and \ud83d\udc78 (PRINCESS).
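Each of those escapes is a UTF-16 surrogate pair, that is, two 16-bit units that together stand for one code point outside the Basic Multilingual Plane. The short sketch below shows the standard arithmetic for combining a pair; the function name surrogate_to_codepoint is made up for this illustration.

```r
# Combine a UTF-16 surrogate pair into a single Unicode code point.
# 'surrogate_to_codepoint' is a hypothetical helper name, not a base R function.
surrogate_to_codepoint <- function(high, low) {
  # High surrogates lie in D800..DBFF, low surrogates in DC00..DFFF.
  stopifnot(high >= 0xD800, high <= 0xDBFF, low >= 0xDC00, low <= 0xDFFF)
  0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
}

cp_smile    <- surrogate_to_codepoint(0xD83D, 0xDE0A)
cp_princess <- surrogate_to_codepoint(0xD83D, 0xDC78)

sprintf("U+%X", as.integer(cp_smile))     # "U+1F60A"
sprintf("U+%X", as.integer(cp_princess))  # "U+1F478"
```

U+1F60A and U+1F478 are the code points for SMILING FACE WITH SMILING EYES and PRINCESS respectively.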

Gaston Sanchez has posted a solution to this problem in his blog Data Analysis Visually Enforced. I’ve used the code and it works well. When I have time, I’ll extend it to replace the offending characters instead of returning NA for the entire string.
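As a rough sketch of both ideas (not necessarily the code from that post), the NA-returning behaviour can be reproduced with tryCatch(), and the replacement idea can be approximated with iconv() and its sub argument. The names try_tolower and safe_tolower are made up for illustration, and safe_tolower assumes the input is meant to be UTF-8. How iconv() treats invalid bytes can vary between platforms, so treat this as a starting point rather than a guaranteed fix.

```r
# Return NA for any element that tolower() cannot process,
# instead of stopping with an error. Hypothetical helper name.
try_tolower <- function(x) {
  vapply(x, function(s) {
    tryCatch(tolower(s), error = function(e) NA_character_)
  }, character(1), USE.NAMES = FALSE)
}

# Replace bytes that are not valid UTF-8 with 'sub' (dropped by default),
# then lower-case what remains. Assumes the input is intended to be UTF-8.
safe_tolower <- function(x, sub = "") {
  cleaned <- iconv(x, from = "UTF-8", to = "UTF-8", sub = sub)
  tolower(cleaned)
}

try_tolower(c("Hello", "WORLD"))  # "hello" "world"

# "\xff" is never valid in UTF-8, so it stands in for an unmappable character.
safe_tolower("Oh @Delta\xff")
```

Working element by element means one bad tweet yields a single NA rather than failing the whole vector, while the iconv() variant keeps the readable part of each tweet.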
