Adi Sarid (Tel Aviv university and Sarid Research Institute LTD.)
Background
A while back I participated in an R workshop at the annual convention of the Israeli Association for Statistics. I had the pleasure of talking with Tal Galili and Jonathan Rosenblatt, who indicated that a lot of Israeli R users run into difficulties using Hebrew with R. My firm opinion is that it’s best to keep everything in English, but sometimes you simply don’t have a choice. For example, I had to prepare a number of R Shiny dashboards for Hebrew-speaking clients, hence Hebrew was the way to go, in a kind of English-Hebrew “Mishmash” (mix). I have run into a lot of such difficulties in the past, so here are a few pointers to get you started when working in R with Hebrew. This post deals with reading and writing files which contain Hebrew characters. Note that there is also a bit to say about Shiny apps which contain Hebrew, about right-to-left layout in Shiny apps, and about Hebrew variable names. These work with some care, but I won’t cover them here. If you have any other questions you’d like to see answered, feel free to contact me.
Reading and writing files with Hebrew characters
R can read and write files in many formats. The common formats for small to medium data sets include comma separated values (*.csv) and Excel files (*.xlsx, *.xls). Each such read/write action relies on some kind of “encoding”. An encoding, in simple terms, is a definition of a character set which helps your operating system interpret and display each character as it should (לדוגמה, תווים בעברית, i.e., “for example, Hebrew characters”). There are a number of relevant character sets (encodings) where Hebrew is concerned:
- UTF-8
- ISO 8859-8
- Windows-1255
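To make the idea of an encoding concrete, here is a minimal base R sketch showing that the same Hebrew word is stored as different bytes under different encodings. The variable names are made up for illustration, and the sketch assumes that your platform's iconv supports ISO-8859-8 (it usually does, but this can vary between systems).

```r
# The word "shalom" in Hebrew, written with \u escapes so that the script
# does not depend on the encoding of the source file itself.
shalom <- "\u05e9\u05dc\u05d5\u05dd"

# The bytes of the word in UTF-8: two bytes per Hebrew letter.
utf8_bytes <- charToRaw(enc2utf8(shalom))

# The bytes of the same word in ISO-8859-8: one byte per Hebrew letter.
# Assumption: the platform's iconv knows the name "ISO-8859-8".
iso_bytes <- iconv(shalom, from = "UTF-8", to = "ISO-8859-8", toRaw = TRUE)[[1]]

print(utf8_bytes)  # d7 a9 d7 9c d7 95 d7 9d
print(iso_bytes)   # f9 ec e5 ed

# Decoding the ISO-8859-8 bytes with the matching encoding recovers the word.
decoded <- iconv(rawToChar(iso_bytes), from = "ISO-8859-8", to = "UTF-8")
print(decoded == shalom)
```

A reader that assumes the wrong encoding will interpret these bytes as different characters, which is where the gibberish in the examples of this post comes from.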
Using csv files with Hebrew characters
Here’s an example of something that can go wrong, and a possible solution. In this case I’ve prepared a csv file which is encoded with UTF-8. When using R’s standard read.csv function, this is what happens:
sample.data <- read.csv("http://www.sarid-ins.co.il/files/utf_encoded_sample.csv")
sample.data
##   ן...... X...... X............
## 1 ׳¨׳•׳ ׳™ 25 ׳—׳™׳₪׳”
## 2 ׳׳•׳˜׳™ 77 ׳”׳¨׳¦׳׳™׳”
## 3 ׳“׳ ׳™ 13 ׳×׳-׳׳‘׳™׳‘ ׳™׳₪׳•
## 4 ׳¨׳¢׳•׳× 30 ׳§׳¨׳™׳× ׳©׳׳•׳ ׳”
## 5 ׳“׳ ׳” 44 ׳‘׳™׳× ׳©׳׳

Oh boy, that’s probably not what the file’s author had in mind. Let’s try to instruct read.csv to use a different encoding.
sample.data <- read.csv("http://www.sarid-ins.co.il/files/utf_encoded_sample.csv", encoding = "UTF-8")
sample.data
##   X.U.FEFF.שם גיל מגורים
## 1 רוני 25 חיפה
## 2 מוטי 77 הרצליה
## 3 דני 13 תל-אביב יפו
## 4 רעות 30 קרית שמונה
## 5 דנה 44 בית שאן

A bit better, isn’t it? However, not perfect. We can read the Hebrew, but there is a weird thing in the header: “X.U.FEFF”. This is the file’s byte order mark (the character U+FEFF at the very start of the file) leaking into the first column name. A better way to read and write files (for much more than just encoding reasons: it is also quicker when reading large files) is the readr package, which is part of the tidyverse. On a side note, if you haven’t already, run install.packages("tidyverse"), it’s a must. It includes readr but a lot more goodies (read on). Now, for some tools you get with readr:
library(readr)
locale("he")
## <locale>
## Numbers:  123,456.78
## Formats:  %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days:   יום ראשון (יום א׳), יום שני (יום ב׳), יום שלישי (יום ג׳), יום
##         רביעי (יום ד׳), יום חמישי (יום ה׳), יום שישי (יום ו׳), יום
##         שבת (שבת)
## Months: ינואר (ינו׳), פברואר (פבר׳), מרץ (מרץ), אפריל (אפר׳), מאי (מאי),
##         יוני (יוני), יולי (יולי), אוגוסט (אוג׳), ספטמבר (ספט׳),
##         אוקטובר (אוק׳), נובמבר (נוב׳), דצמבר (דצמ׳)
## AM/PM:  לפנה״צ/אחה״צ
guess_encoding("http://www.sarid-ins.co.il/files/utf_encoded_sample.csv")
## # A tibble: 2 × 2
##   encoding confidence
##      <chr>      <dbl>
## 1    UTF-8       1.00
## 2   KOI8-R       0.98

First we used locale(), which knows the date format and default encoding for the language (UTF-8 in this case). On its own, locale() does nothing more than output the specs of the locale, but when used in conjunction with read_csv it tells read_csv everything it needs to know. Also note the use of guess_encoding, which reads the first “few” lines of a file (10,000 is the default) and helps us, well… guess the encoding of a file. You can see that readr is pretty confident we need UTF-8 here (and 98% confident we need KOI8-R, a Russian Cyrillic encoding, but the first option wins here…).
sample.data <- read_csv(file = "http://www.sarid-ins.co.il/files/utf_encoded_sample.csv",
                        locale = locale(date_names = "he", encoding = "UTF-8"))
## Parsed with column specification:
## cols(
##   שם = col_character(),
##   גיל = col_integer(),
##   מגורים = col_character()
## )
sample.data
## # A tibble: 5 × 3
##   שם גיל מגורים
##   <chr> <int> <chr>
## 1 רוני 25 חיפה
## 2 מוטי 77 הרצליה
## 3 דני 13 תל-אביב יפו
## 4 רעות 30 קרית שמונה
## 5 דנה 44 בית שאן

Awesome, isn’t it? Note that the resulting sample.data is a tibble and not a data.frame (read about tibbles). The readr package has tons of functions and features to help us with reading, writing, and controlling the encoding, so I definitely recommend it. By the way, try using read_csv without setting the locale parameter and see what happens.
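The examples above only read files, so here is a minimal sketch of the writing side as well. It assumes a small made-up data frame (the names people, plain_path and excel_path are mine, for illustration only) and uses readr's write_csv and write_excel_csv. Both write UTF-8; the second also adds a byte order mark at the start of the file, which is what Excel looks for when deciding whether a csv is UTF-8.

```r
library(readr)

# A small made-up data set. The Hebrew names are written with \u escapes
# so that the script does not depend on the encoding of the source file.
people <- data.frame(
  name = c("\u05e8\u05d5\u05e0\u05d9", "\u05d3\u05e0\u05d4"),
  age = c(25L, 44L),
  stringsAsFactors = FALSE
)

plain_path <- tempfile(fileext = ".csv")
excel_path <- tempfile(fileext = ".csv")

write_csv(people, plain_path)        # UTF-8, no byte order mark
write_excel_csv(people, excel_path)  # UTF-8, with a byte order mark

# The first three bytes of the Excel-friendly file are the UTF-8 byte order mark.
bom <- readBin(excel_path, what = "raw", n = 3)
print(bom)

# Both files read back with the Hebrew intact, and the byte order mark
# does not leak into the first column name.
from_plain <- suppressMessages(read_csv(plain_path, locale = locale(encoding = "UTF-8")))
from_excel <- suppressMessages(read_csv(excel_path, locale = locale(encoding = "UTF-8")))
print(names(from_excel))
```

If the file is going to be opened in Excel by someone else, write_excel_csv is probably the safer of the two; otherwise plain write_csv should be enough.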
What about files saved by Excel?
Excel files are not the best choice for storing datasets, but the format is extremely common for obvious reasons.
CSV files which were saved by Excel
In the past, I ran into a lot of difficulties trying to load CSV files which were saved by Excel into R. Excel seems to save them in either “Windows-1255” or “ISO-8859-8”, instead of “UTF-8”. The default read by read_csv might yield garbled characters instead of “שלום”. In other cases you might get a “multibyte error”. Just make sure you check the “Windows-1255” or “ISO-8859-8” encodings if the standard UTF-8 doesn’t work well (i.e., use read_csv(file, locale = locale(encoding = "ISO-8859-8"))).
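The trial and error described above can be sketched as a small loop. This is only an illustration, not part of readr: read_hebrew_csv and has_hebrew are hypothetical helpers of mine, and the temporary file is a made-up stand-in for a csv saved by Excel. The sketch also assumes that your platform's iconv supports the listed encodings.

```r
library(readr)

# Stand-in for a csv saved by Excel: made-up contents, written as ISO-8859-8 bytes.
csv_text <- "name,age\n\u05e9\u05dc\u05d5\u05dd,25\n"
csv_raw <- iconv(csv_text, from = "UTF-8", to = "ISO-8859-8", toRaw = TRUE)[[1]]
tmp <- tempfile(fileext = ".csv")
writeBin(csv_raw, tmp)

# Hypothetical helper: TRUE if the text is valid UTF-8 and contains at least
# one Hebrew letter (code points U+05D0 to U+05EA).
has_hebrew <- function(text) {
  text <- text[!is.na(text)]
  if (!all(validUTF8(text))) return(FALSE)
  codes <- utf8ToInt(paste(text, collapse = ""))
  any(codes >= 0x05D0 & codes <= 0x05EA)
}

# Hypothetical helper: try each candidate encoding in turn and return the
# first result in which Hebrew letters actually show up.
read_hebrew_csv <- function(path,
                            encodings = c("UTF-8", "ISO-8859-8", "Windows-1255")) {
  for (enc in encodings) {
    result <- tryCatch(
      suppressMessages(read_csv(path, locale = locale(encoding = enc))),
      error = function(e) NULL
    )
    if (is.null(result)) next
    text <- c(names(result), unlist(Filter(is.character, result), use.names = FALSE))
    if (has_hebrew(text)) return(list(encoding = enc, data = result))
  }
  NULL  # none of the candidate encodings produced Hebrew text
}

res <- read_hebrew_csv(tmp)
print(res$encoding)
print(res$data)
```

Checking for Hebrew letters is a crude test and will not tell “ISO-8859-8” and “Windows-1255” apart when a file only contains plain letters, so treat the reported encoding as a hint and still look at the data yourself.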
Reading directly from Excel
Also, if the original is in Excel, you might want to consider reading it directly from the Excel file (skipping CSVs entirely). There are a number of packages for reading Excel files, and I recommend using the readxl package: read_xlsx or read_xls will do the trick (depending on file format). You don’t even have to specify the encoding; if there are Hebrew characters they will be read as they should be.
Summary
For reading csv files with Hebrew characters, it’s very convenient to use readr. The package has a lot of utilities for language encoding and localization, like locale. If the original data is in Excel, you might want to try skipping the csv and reading the data directly from the Excel format using the readxl package. Sometimes reading files involves a lot of trial and error, but eventually it will work.
Don’t give up!