Escaping from character encoding hell in R on Windows

June 14, 2016
By Ista Zahn

(This article was first published on Ista Zahn (Posts about R), and kindly contributed to R-bloggers)

Note: the title of this post was inspired by this question on stackoverflow.

This section gives the basic facts and recommendations for importing files with arbitrary encoding on Windows. The issues described here by and large do not apply on Mac or Linux; they are specific to running R on Windows.

If you are on a deadline and just need to get the job done this section should be all you need. Additional background and discussion is presented in later sections.

To read a text file with non-ASCII encoding into R you should a) determine the encoding, b) read it in such a way that the information is re-encoded to UTF-8, and c) ignore the bug in the data.frame print method on Windows. Hopefully the encoding is specified in the documentation that accompanied your data. If not, you can guess the encoding using the stri_read_raw and stri_enc_detect functions in the stringi package. You can ensure that the information is re-encoded to UTF-8 by using the readr package.
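
For instance, here is a rough sketch of guessing the encoding of the SHIFT-JIS file used in the examples below; stri_enc_detect returns ranked candidates with confidence scores, so treat the result as a guess rather than a certainty:

library(stringi)
raw <- stri_read_raw("japanese_shiftjis.csv")  # read the raw bytes without decoding them
stri_enc_detect(raw)                           # ranked encoding guesses with confidence scores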

For example, I have two versions of a file containing numbers and Japanese characters: japanese_utf8.csv is encoded in UTF-8, and japanese_shiftjis.csv is encoded in SHIFT-JIS. We can read these files as follows on any platform (Windows, Linux, Mac):

library(readr)
options(stringsAsFactors = FALSE)
read_csv("japanese_utf8.csv",
	 locale = locale(encoding = "UTF-8"))
read_csv("japanese_shiftjis.csv",
	 locale = locale(encoding = "SHIFT-JIS"))
    No.         発行日 朝夕刊     面名 ページ
1 00001 2015年09月25日   週刊 週刊朝日    022
2 00002 2015年09月25日   週刊 週刊朝日    018
3 00003 2015年09月21日   朝刊   3総合    003
    No.         発行日 朝夕刊     面名 ページ
1 00001 2015年09月25日   週刊 週刊朝日    022
2 00002 2015年09月25日   週刊 週刊朝日    018
3 00003 2015年09月21日   朝刊   3総合    003

On Windows there is a bug in print.data.frame that causes data.frames with UTF-8 encoded columns to be displayed incorrectly in non-UTF-8 locales. Running the above example on Windows produces this:

    No. <U+767A><U+884C><U+65E5> <U+671D><U+5915><U+520A> <U+9762><U+540D> <U+30DA><U+30FC><U+30B8>
1 00001 2015<U+5E74>09<U+6708>25<U+65E5> <U+9031><U+520A> <U+9031><U+520A><U+671D><U+65E5>    022
2 00002 2015<U+5E74>09<U+6708>25<U+65E5> <U+9031><U+520A> <U+9031><U+520A><U+671D><U+65E5>    018
3 00003 2015<U+5E74>09<U+6708>21<U+65E5> <U+671D><U+520A>          3<U+7DCF><U+5408>    003

    No. <U+767A><U+884C><U+65E5> <U+671D><U+5915><U+520A> <U+9762><U+540D> <U+30DA><U+30FC><U+30B8>
1 00001 2015<U+5E74>09<U+6708>25<U+65E5> <U+9031><U+520A> <U+9031><U+520A><U+671D><U+65E5>    022
2 00002 2015<U+5E74>09<U+6708>25<U+65E5> <U+9031><U+520A> <U+9031><U+520A><U+671D><U+65E5>    018
3 00003 2015<U+5E74>09<U+6708>21<U+65E5> <U+671D><U+520A>          3<U+7DCF><U+5408>    003

which looks terrible but does not actually indicate a problem. The information is encoded correctly, but due to a long-standing bug it is displayed incorrectly. You can check that the values are correct by (ab)using print.listof, e.g.,

print.listof(read_csv("japanese_shiftjis.csv",
		      locale = locale(encoding = "SHIFT-JIS")))
No. :
[1] "00001" "00002" "00003"

発行日 :
[1] "2015年09月25日" "2015年09月25日" "2015年09月21日"

朝夕刊 :
[1] "週刊" "週刊" "朝刊"

面名 :
[1] "週刊朝日" "週刊朝日" "3総合"  

ページ :
[1] "022" "018" "003"

To recap:

  • Regardless of platform (Windows, Mac, Linux), use the readr package to read data into R. This will re-encode the contents of the file to UTF-8 for you (see the quick check after this list).
  • Make sure you specify the encoding using the locale argument as shown in the example above.
  • Ignore the ugly print.data.frame bug and use print.listof to check that your data was imported correctly.
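
If you want to convince yourself that readr really did re-encode the text, you can inspect the declared encoding of an imported character column; a quick check (column 2 is the date column from the example above):

x <- read_csv("japanese_shiftjis.csv",
              locale = locale(encoding = "SHIFT-JIS"))
Encoding(x[[2]])   # every element should be marked "UTF-8"
[1] "UTF-8" "UTF-8" "UTF-8"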

Those wishing for more details about this issue can read on.

What is the problem?

The problem is that the basic R functions for reading data from files and writing data to files do not work in any reasonable way on Windows. If you are struggling with this you are not alone! There are numerous questions on stackoverflow, blog posts (e.g., this one by Rolf Fredheim, and another by Huidong Tian), and anguished mailing list posts. Thinking of the person-hours wasted on this issue over the years almost brings a tear to my eye.

Let’s try it, using some simplified data from a project I worked on last year. For illustration I’ve created two files containing a mix of English letters, numbers, and Japanese characters. I saved one version with UTF-8 encoding, and another with SHIFT-JIS. On Linux we can read both files easily, provided only that we correctly specify the encoding if the file is not already encoded in UTF-8:

read.csv("japanese_utf8.csv")
  No.         発行日 朝夕刊     面名 ページ
1   1 2015年09月25日   週刊 週刊朝日     22
2   2 2015年09月25日   週刊 週刊朝日     18
3   3 2015年09月21日   朝刊   3総合      3
read.csv("japanese_shiftjis.csv", fileEncoding = "SHIFT-JIS")
  No.         発行日 朝夕刊     面名 ページ
1   1 2015年09月25日   週刊 週刊朝日     22
2   2 2015年09月25日   週刊 週刊朝日     18
3   3 2015年09月21日   朝刊   3総合      3

On Windows things are much more difficult. Using read.csv with the default options does not work because read.csv assumes that the encoding of the file matches the Windows locale settings:

read.csv("japanese_utf8.csv")
  No.         ç.ºè.Œæ.. æœ.å..å.Š       é..å.. ペãƒ.ã..
1   1 2015年09月25日    週刊 週刊朝日        22
2   2 2015年09月25日    週刊 週刊朝日        18
3   3 2015年09月21日    朝刊    3総合         3

Trying to tell R that the file is encoded in UTF-8 is not a general solution, because read.csv will then try to re-encode from UTF-8 to the native encoding, which may or may not work depending on the contents of the file. On my system, trying to read a UTF-8 encoded file containing Japanese characters with the fileEncoding argument falls flat on its face:

read.csv("japanese_utf8.csv", fileEncoding = "UTF-8")
[1] No. X  
<0 rows> (or 0-length row.names)
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  invalid input found on input connection 'japanese_utf8.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'japanese_utf8.csv'

Finally, we might try the encoding argument rather than fileEncoding. This simply marks the strings with the specified encoding:

read.csv("japanese_utf8.csv", encoding = "UTF-8")
read.csv("japanese_utf8.csv", encoding = "UTF-8")
  No.        X.U.767A..U.884C..U.65E5. X.U.671D..U.5915..U.520A.                X.U.9762..U.540D. X.U.30DA..U.30FC..U.30B8.
1   1 2015<U+5E74>09<U+6708>25<U+65E5>          <U+9031><U+520A>  <U+9031><U+520A><U+671D><U+65E5>                        22
2   2 2015<U+5E74>09<U+6708>25<U+65E5>          <U+9031><U+520A>  <U+9031><U+520A><U+671D><U+65E5>                        18
3   3 2015<U+5E74>09<U+6708>21<U+65E5>          <U+671D><U+520A>               3<U+7DCF><U+5408>                         3

This kind of works, though you wouldn’t know it from the output. As mentioned above, there is a bug in the print.data.frame function that prevents UTF-8 encoded text from displaying correctly. We can use another print method to see that the column values have been read in correctly:

print.listof(read.csv("japanese_utf8.csv", encoding = "UTF-8"))
No. :
[1] 1 2 3

X.U.767A..U.884C..U.65E5. :
[1] "2015?09?25?" "2015?09?25?" "2015?09?21?"

X.U.671D..U.5915..U.520A. :
[1] "??" "??" "??"

X.U.9762..U.540D. :
[1] "????" "????" "???"  

X.U.30DA..U.30FC..U.30B8. :
[1] 22 18  3

Unfortunately there are two problems with this: first, the names of the columns have not been correctly encoded, and second, this will only work if your input data is in UTF-8 in the first place. Trying to apply this strategy to our SHIFT-JIS encoded file will not work at all, because we cannot mark strings with arbitrary encodings, only with UTF-8[1]. Trying to mark the string as SHIFT-JIS will silently fail:

print.listof(read.csv("japanese_shiftjis.csv", encoding = "SHIFT-JIS"))
No. :
[1] 1 2 3

X...s.ú :
[1] "2015”N09ŒŽ25“ú" "2015”N09ŒŽ25“ú" "2015”N09ŒŽ21“ú"

X....Š. :
[1] "TŠ§" "TŠ§" "’©Š§"

X.Ê.. :
[1] "TŠ§’©“ú" "TŠ§’©“ú" "‚R‘‡"  

ƒy..ƒW :
[1] 22 18  3
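
As an aside, base R will only let you mark strings as latin1 or UTF-8 (or leave them unmarked); anything else has to be converted, for example with iconv. A minimal sketch, using the example string from ?Encoding:

x <- "fa\xE7ile"          # latin1 bytes
Encoding(x) <- "latin1"   # latin1 is one of the encodings R will mark directly
x <- enc2utf8(x)          # convert the bytes to UTF-8 and mark them as such
Encoding(x)
[1] "UTF-8"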

Ouch! Why is this so hard? Can we make it suck less?

Encoding in R

Basically R gives you two ways of handling character encoding: you can use the default encoding of your OS, or you can use UTF-8[1]. On OS X and Linux these options are often the same, since the default OS encoding is usually UTF-8; this is a great advantage, because just about everything can be represented in UTF-8. On Windows there is no such luck. On my Windows 7 machine the default is “Windows code page 1252”; many characters (such as Japanese) cannot be represented in code page 1252. If I want to work with Japanese text in R on Windows I have two options: change my locale to Japanese, or convert strings to UTF-8 and mark them as such.
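
If you are not sure what your own system's default is, you can check from within R (the values will of course differ from machine to machine); for example:

Sys.getlocale("LC_CTYPE")   # e.g. "English_United States.1252" on my Windows 7 machine
localeToCharset()           # the encoding R assumes for unmarked strings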

In some ways just changing your locale to one that can accommodate the data you are working with is the simplest approach. Again, on Mac and Linux the locale usually specifies UTF-8 encoding, so no changes are needed; things should just work as you would expect them to. On Windows we can change the locale to match the data we are working with using the Sys.setlocale function. This sometimes works well; for example, we can read our UTF-8 and SHIFT-JIS encoded data on Windows as follows:

Sys.setlocale("LC_ALL", "English_United States.932")
read.csv("japanese_shiftjis.csv")
read.csv("japanese_utf8.csv", fileEncoding = "UTF-8")
[1] "LC_COLLATE=English_United States.932;LC_CTYPE=English_United States.932;LC_MONETARY=English_United States.932;LC_NUMERIC=C;LC_TIME=English_United States.932"

  No.         発行日 朝夕刊     面名 ページ
1   1 2015年09月25日   週刊 週刊朝日     22
2   2 2015年09月25日   週刊 週刊朝日     18
3   3 2015年09月21日   朝刊   3総合      3

  No.         発行日 朝夕刊     面名 ページ
1   1 2015年09月25日   週刊 週刊朝日     22
2   2 2015年09月25日   週刊 週刊朝日     18
3   3 2015年09月21日   朝刊   3総合      3

This works fine until we want to read some other kind of text in the same R session, and then we are right back to the same old problem. Another issue with this method is that it does not work in RStudio unless the locale is set on startup; you cannot change the locale of a running session in RStudio.
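
If you do want a particular locale in RStudio, one way to set it on startup is from your .Rprofile; a quick sketch using the same code page as above:

## in ~/.Rprofile, which runs when R starts up
Sys.setlocale("LC_ALL", "English_United States.932")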

Because the Sys.setlocale method only works for a subset of languages in any given session, our best bet is to read and store everything in UTF-8 (and make sure it is marked as such). It is not convenient to do this using the read.table family of functions in R, but it is possible with some care:

x <- read.csv("japanese_shiftjis.csv", 
	      encoding = "UTF-8", 
	      check.names = FALSE # otherwise R will mangle the names
	      )
charcols <- !sapply(x, is.numeric)
x[charcols] <- lapply(x[charcols], iconv, from = "SHIFT-JIS", to = "UTF-8")
names(x) <- iconv(names(x), from = "SHIFT-JIS", to = "UTF-8")
print.listof(x)
No. :
[1] 1 2 3

発行日 :
[1] "2015年09月25日" "2015年09月25日" "2015年09月21日"

朝夕刊 :
[1] "週刊" "週刊" "朝刊"

面名 :
[1] "週刊朝日" "週刊朝日" "3総合"  

ページ :
[1] 22 18  3

OK, it works, but honestly that is too much work for something as simple as reading a .csv file into R. As suggested at the beginning of this post, a better strategy is to use the readr package, because it will do the conversion to UTF-8 for you:

print.listof(read_csv("arabic_utf-8.csv"), locale = locale(encoding = "UTF-8"))
print.listof(read_csv("japanese_utf8.csv"), locale = locale(encoding = "UTF-8"))
print.listof(read_csv("japanese_shiftjis.csv"), locale = locale(encoding = "SHIFT-JIS"))
X5 :
[1] "1895-01-02" "1895-01-07" "1895-01-16"
X8 :
[1] "????" "????" "????"
X12 :
[1] "?????" "?????" "?????"

No. :
[1] "00001" "00002" "00003"
発行日 :
[1] "2015年09月25日" "2015年09月25日" "2015年09月21日"
朝夕刊 :
[1] "週刊" "週刊" "朝刊"
面名 :
[1] "週刊朝日" "週刊朝日" "3総合"  
ページ :
[1] "022" "018" "003"


No. :
[1] "00001" "00002" "00003"
発行日 :
[1] "2015年09月25日" "2015年09月25日" "2015年09月21日"
朝夕刊 :
[1] "週刊" "週刊" "朝刊"
面名 :
[1] "週刊朝日" "週刊朝日" "3総合"  
ページ :
[1] "022" "018" "003"

Files

Here are the example data files and code needed to run the examples in this post.

Footnotes:

[1] We can also mark strings as encoded in latin1, but that is not useful if you take my advice and store everything in UTF-8.
