tolower() – error catching unmappable characters

[This article was first published on minimalR.com » r-bloggers, and kindly contributed to R-bloggers].

The tolower() function throws an error when its input contains byte sequences that cannot be converted from UTF-8 to wide characters, which is a common occurrence when analysing social media data that contains emoticons.

Emoticons are those symbols that are commonly used on mobile phones but aren’t always recognised on all platforms.

For example, when converting tweets addressed to @delta (Delta Air Lines) to lower case, I got the following error:

Error in tolower(text) :
invalid input '@ActualALove: First time I've seen a foot-rest in first class! Oh @Delta, how I love thee \ud83d\ude0a✈\ud83d\udc78 http://t.co/noKI9CiM' in 'utf8towcs'

When I looked up the actual tweet, it looked like this.

[Image: screenshot of the original tweet, 20130106-194554.jpg]

The two Unicode characters that weren’t recognised were \ud83d\ude0a (U+1F60A, SMILING FACE WITH SMILING EYES) and \ud83d\udc78 (U+1F478, PRINCESS). Each appears in the error message as a UTF-16 surrogate pair.
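To check which characters the escaped pairs stand for, the standard UTF-16 surrogate formula can be applied by hand. The snippet below is only an illustrative sketch; `surrogate_to_codepoint` is a hypothetical helper name, not a base R function and not something from the original analysis.

```r
# Hypothetical helper: combine a UTF-16 high/low surrogate pair
# into a single Unicode code point.
surrogate_to_codepoint <- function(high, low) {
  0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
}

smile    <- surrogate_to_codepoint(0xD83D, 0xDE0A)
princess <- surrogate_to_codepoint(0xD83D, 0xDC78)

# Format the code points in the usual U+XXXX notation.
sprintf("U+%X", as.integer(c(smile, princess)))
# "U+1F60A" "U+1F478"

# intToUtf8() turns a code point back into a character;
# whether it displays correctly depends on the platform and font.
intToUtf8(smile)
```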

Gaston Sanchez has posted a solution to this problem in his blog Data Analysis Visually Enforced. I’ve used the code and it works well. When I have time, I’ll extend it to replace the offending characters instead of returning NA for the entire string.
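The general idea can be sketched as follows. This is my own minimal illustration and not Gaston Sanchez's code: `safe_tolower` and `clean_tolower` are hypothetical names. The first wraps tolower() in tryCatch() and returns NA for any string that fails. The second is one possible way to replace the offending bytes instead, assuming the input is meant to be UTF-8 and that the platform's iconv() substitutes invalid bytes when the `sub` argument is given.

```r
# Hypothetical helper: lower-case each string on its own,
# returning NA for any string on which tolower() throws an error.
safe_tolower <- function(x) {
  vapply(x, function(s) {
    tryCatch(tolower(s), error = function(e) NA_character_)
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: substitute bytes that are not valid UTF-8
# (assumption: input is intended to be UTF-8), then lower-case.
clean_tolower <- function(x, sub = "") {
  cleaned <- iconv(x, from = "UTF-8", to = "UTF-8", sub = sub)
  tolower(cleaned)
}

# The second string contains a byte (\xe9) that is invalid in UTF-8.
tweets <- c("Hello WORLD", "caf\xe9 AU lait")

safe_tolower(tweets)   # in a UTF-8 locale the second element is expected to be NA
clean_tolower(tweets)  # the offending byte is dropped rather than the whole string
```

Processing one string at a time in `safe_tolower` means a single bad tweet does not cause the whole vector to fail, at the cost of being slower than a single vectorised call.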
