More powerful iconv in R

June 19, 2010
By

(This article was first published on BioStatMatt, and kindly contributed to R-bloggers)

The R function iconv converts between character string encodings, for example, from the locale dependent encoding to UTF-8:

> iconv("foo", to="UTF-8")
[1] "foo"

However, R has long-running trouble with embedded null characters ('\0') in strings. Hence, if we try to convert to an encoding that permits embedded null characters, iconv will fail:

> iconv("foo", to="UTF-16")
Error in iconv("foo", to = "UTF-16") : 
  embedded nul in string: '\xff\xfef\0o\0o\0'

The ‘embedded nul’ error is thrown by mkCharLenCE, after the real conversion is complete. The converted string exists in memory, though not in a form that R can currently represent as a STRSXP. Hence the error when passed to mkCharLenCE.

The issue of embedded null characters has been discussed previously on the R mailing lists (see this thread), but I don’t think this is the issue here. The point here is that the C implementation of iconv operates on binary data, not necessarily null terminated C strings. Hence, in order to fully utilize the iconv mechanism, the R-level iconv ought to accept and return objects that can handle arbitrary binary data, i.e.of type RAWSXP, in addition to character vectors.

To this end, I’ve written a small patch (13 lines w/o documentation) against the current R-devel sources (r52328) that allows the R-level iconv to accept an argument of type RAWSXP, in addition to character vectors. Now, when a raw object is passed to iconv, no character substitution is performed, the arguments sub and mark are ignored, and a raw object is returned. However, rather than returning NA (NA does not exist for RAWSXPs) when conversions are invalid or incomplete, a partially converted object is returned. The following patch doesn't touch any of the code associated with STRSXPs, nor affect the behavior of iconv when a character vector is passed.

Once compiled into R the new iconv will operate on raw vectors. Continuing with our example:

> bar <- iconv(charToRaw("foo"), to="UTF-16")
> bar
[1] ff fe 66 00 6f 00 6f 00
> rawToChar(iconv(bar, from="UTF-16"))
[1] "foo"

The patch code is listed below, and also available here R-devel-iconv-0.0.patch. P.S. Thanks to Tal Galili for recommending the GeSHi plugin for wordpress, it worked out nicely for the R and patch (lang="diff") code in this post, though I prefer the more subtle coloring in the patch code.

Index: src/library/base/R/New-Internal.R
===================================================================
--- src/library/base/R/New-Internal.R	(revision 52328)
+++ src/library/base/R/New-Internal.R	(working copy)
@@ -239,7 +239,7 @@
 
 iconv <- function(x, from = "", to = "", sub = NA, mark = TRUE)
 {
-    if(!is.character(x)) x <- as.character(x)
+    if(!is.character(x) && !is.raw(x)) x <- as.character(x)
     .Internal(iconv(x, from, to, as.character(sub), mark))
 }
 
Index: src/main/sysutils.c
===================================================================
--- src/main/sysutils.c	(revision 52328)
+++ src/main/sysutils.c	(working copy)
@@ -548,16 +548,17 @@
 	int mark;
 	const char *from, *to;
 	Rboolean isLatin1 = FALSE, isUTF8 = FALSE;
+	Rboolean isRawx = (TYPEOF(x) == RAWSXP);
 
-	if(TYPEOF(x) != STRSXP)
-	    error(_("'x' must be a character vector"));
+	if(TYPEOF(x) != STRSXP && !isRawx)
+	    error(_("'x' must be a character vector or raw"));
 	if(!isString(CADR(args)) || length(CADR(args)) != 1)
 	    error(_("invalid '%s' argument"), "from");
 	if(!isString(CADDR(args)) || length(CADDR(args)) != 1)
 	    error(_("invalid '%s' argument"), "to");
 	if(!isString(CADDDR(args)) || length(CADDDR(args)) != 1)
 	    error(_("invalid '%s' argument"), "sub");
-	if(STRING_ELT(CADDDR(args), 0) == NA_STRING) sub = NULL;
+	if(STRING_ELT(CADDDR(args), 0) == NA_STRING || isRawx) sub = NULL;
 	else sub = translateChar(STRING_ELT(CADDDR(args), 0));
 	mark = asLogical(CAD4R(args));
 	if(mark == NA_LOGICAL)
@@ -584,7 +585,7 @@
 	PROTECT(ans = duplicate(x));
 	R_AllocStringBuffer(0, &cbuff);  /* 0 -> default */
 	for(i = 0; i < LENGTH(x); i++) {
-	    si = STRING_ELT(x, i);
+	    si = isRawx ? x : STRING_ELT(x, i);
 	top_of_loop:
 	    inbuf = CHAR(si); inb = LENGTH(si);
 	    outbuf = cbuff.data; outb = cbuff.bufsize - 1;
@@ -622,7 +623,7 @@
 		goto next_char;
 	    }
 
-	    if(res != -1 && inb == 0) {
+	    if(res != -1 && inb == 0 && !isRawx) {
 		cetype_t ienc = CE_NATIVE;
 
 		nout = cbuff.bufsize - 1 - outb;
@@ -632,7 +633,13 @@
 		}
 		SET_STRING_ELT(ans, i, mkCharLenCE(cbuff.data, nout, ienc));
 	    }
-	    else SET_STRING_ELT(ans, i, NA_STRING);
+	    else if(!isRawx) SET_STRING_ELT(ans, i, NA_STRING);
+	    else {
+		nout = cbuff.bufsize - 1 - outb;
+		UNPROTECT(1);
+		PROTECT(ans = allocVector(RAWSXP, nout));
+		memcpy(RAW(ans), cbuff.data, nout);
+	    }
 	}
 	Riconv_close(obj);
 	R_FreeStringBuffer(&cbuff);

To leave a comment for the author, please follow the link and comment on his blog: BioStatMatt.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , ,

2 Responses to More powerful iconv in R

  1. Rbloggers (R bloggers website) on June 19, 2010 at 6:23 pm

    Twitter Comment


    More powerful iconv in R: The R function iconv converts between character string… [link to post] #rstats

    Posted using Chat Catcher

  2. znmeb (Ed Borasky) on June 19, 2010 at 6:56 pm

    Twitter Comment


    RT @Rarchive: More powerful iconv in R: The R function iconv converts between character string… [link to post] #rstats: … http://b

    Posted using Chat Catcher