R and foreign characters

[This article was first published on Quantifying Memory, and kindly contributed to R-bloggers.]

Working with Russian characters can be mind-numbingly frustrating. This is true for R, as for other applications, so below I've written out my top five tricks for making Russian input work in R; I believe they should be transferable to most other languages.

Having forced any number of programs to accept Russian characters in the past, I have come to appreciate UTF-8 as the only sensible encoding for non-Latin scripts. R operates with UTF-8 as default, so using Russian or other foreign scripts should be straightforward, right?
Wrong. There is no end to the annoyance experienced when attempting to import data into R by appending
encoding = "utf-8"
to the end of every import call. Sometimes it will work, but rarely for both the characters displayed on screen and those output by R. So, annoyingly, characters that display as Russian in a data.frame will magically appear as gobbledygook when written to an output file, or even a plot. Infuriating. The solution is brutal in its simplicity: don't rely on R's UTF-8 handling to display characters for you; instead, start each session in the appropriate locale, using the line
Sys.setlocale("LC_CTYPE", "russian")
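If you would rather not leave the session in a Russian locale permanently, a minimal sketch along these lines may help. It assumes that "russian" is the right locale name on Windows (as above) and that something like "ru_RU.UTF-8" is installed on other systems; the variable names are mine.

```r
# Remember the current character-type locale so it can be restored later.
old_ctype <- Sys.getlocale("LC_CTYPE")

# "russian" is a Windows locale name; elsewhere the equivalent is usually
# something like "ru_RU.UTF-8" (assumption: that locale is installed).
target <- if (.Platform$OS.type == "windows") "russian" else "ru_RU.UTF-8"

# Sys.setlocale() returns "" (and warns) when the request cannot be honoured.
result <- suppressWarnings(Sys.setlocale("LC_CTYPE", target))
if (identical(result, "")) {
  message("Locale '", target, "' is not available; keeping ", old_ctype)
} else {
  message("LC_CTYPE is now ", result)
}

# ... work with Russian text here ...

# Put the original setting back when you are done.
invisible(Sys.setlocale("LC_CTYPE", old_ctype))
```

Checking the return value matters because a failed request only produces a warning, which is easy to miss at the top of a long script.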
Now that solves all the problems, right?
Almost. Often when scraping data or when inputting data (e.g. through Shiny apps), strings need to be formatted as UTF-8 as follows:
Encoding(annoyingMisbehavingString) <- "UTF-8"
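Be careful with this one, though. The assignment only changes how R labels the bytes of the string; it does not convert them. Applying it to text that is already displaying correctly, or whose bytes are not actually UTF-8, will not work well.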
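To see the difference between relabelling a string and converting it, here is a small sketch using only base R. It assumes your iconv build supports the "CP1251" (Windows Cyrillic) encoding; the variable names are made up for illustration.

```r
# "katyn" in Cyrillic, written with \u escapes so the example does not
# depend on how this script file itself is saved.
x_utf8 <- "\u043a\u0430\u0442\u044b\u043d\u044c"

# Simulate input that arrived as Windows-1251 bytes (one byte per letter).
x_cp1251 <- iconv(x_utf8, from = "UTF-8", to = "CP1251")
length(charToRaw(x_cp1251))

# Relabelling leaves the bytes untouched: they are still CP1251 bytes,
# so R now misreads them as (invalid) UTF-8.
relabelled <- x_cp1251
Encoding(relabelled) <- "UTF-8"
validUTF8(relabelled)

# Converting actually rewrites the bytes into UTF-8.
converted <- iconv(x_cp1251, from = "CP1251", to = "UTF-8")
validUTF8(converted)
identical(charToRaw(converted), charToRaw(x_utf8))
```

In other words, Encoding<- is the right tool when the bytes are already UTF-8 but R does not know it, while iconv() is what you want when the bytes themselves are in another encoding.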
Finally, if you ever want to save .R scripts with non-Latin characters in them, do so with care. When you reopen the files the strings will be scrambled, for some reason not quite clear to me. If you use the script as a source file, any command reliant on the non-Latin string (e.g. grep) will return errors or no hits. The solution is to load the script with a different function altogether:
eval(parse("iPolarCalc.R", encoding = "UTF-8"))
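Here is a self-contained sketch of that approach which you can run without any existing script. It writes a throwaway file to a temporary location (the file contents and the names script, env and pattern are hypothetical) and then loads it with eval(parse(...)).

```r
# Write a tiny script containing a Cyrillic string literal, as UTF-8 bytes.
# useBytes = TRUE writes the bytes as-is instead of re-encoding them.
script <- tempfile(fileext = ".R")
writeLines(enc2utf8('pattern <- "\u043a\u0430\u0442"'), script, useBytes = TRUE)

# Parse the file, telling R its string literals are UTF-8, then evaluate
# the result in a separate environment to keep the workspace tidy.
env <- new.env()
eval(parse(script, encoding = "UTF-8"), envir = env)

# The string read from the file can now be used for matching.
found <- grepl(env$pattern, "\u043a\u0430\u0442\u044b\u043d\u044c", fixed = TRUE)
found

unlink(script)
```

Evaluating into a fresh environment is only a convenience here; dropping the envir argument evaluates the script in the calling environment, as in the one-liner above.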
And that's about it. For Windows systems at least.

Update: 06/02/2013
Except encoding issues never really end. Enter the latest problem: displaying Cyrillic characters with knitr.

Knitr is great. It will take R code and combine it with markdown, allowing you to create ready-formatted webpages with calculations and graphics created on the fly from R. But it doesn't work properly with non-ASCII characters. The solution: don't use RStudio's built-in knit to HTML (ctrl-shift-h). Instead, save the .rmd file in your working directory, and run these lines:
library(knitr)     # provides knit()
library(markdown)  # provides markdownToHTML()
knit("test.rmd", encoding = "utf-8")
markdownToHTML("test.md", "test.html")
browseURL(paste("file://", file.path(getwd(), "test.html"), sep = ""))

Update: 21/11/2013

Here's my latest discovery: you know when you have foreign characters in a URL? Chances are you didn't notice, because most browsers can handle this. Paste this into your browser, and you will get search results for the Katyn massacre:

However, this is all smoke and mirrors: paste the same string into Notepad, and you will see this:

What does this have to do with R? Well, we need some way to convert the former to the latter if we want to access URLs containing foreign characters. To do that, use curlEscape() from the RCurl package:

> curlEscape("катынь")
[1] "%D0%BA%D0%B0%D1%82%D1%8B%D0%BD%D1%8C"
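If you would rather not install an extra package, base R's URLencode() from the utils package appears to give the same result for a UTF-8 string. This is a sketch rather than a recommendation, and the example.com address below is a placeholder, not a real search URL.

```r
# The same word, written with \u escapes so the string is UTF-8 regardless
# of how the script file is saved.
query <- enc2utf8("\u043a\u0430\u0442\u044b\u043d\u044c")

# reserved = TRUE also escapes characters such as "/" and "?", which is
# what you want for a single query value.
escaped <- utils::URLencode(query, reserved = TRUE)
escaped
# [1] "%D0%BA%D0%B0%D1%82%D1%8B%D0%BD%D1%8C"

# Hypothetical address, for illustration only.
url <- paste0("https://example.com/search?q=", escaped)
url
```

Note that URLencode() escapes whatever bytes the string holds, so the input needs to be UTF-8 (hence the enc2utf8() call) for the output to match what browsers send.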
