More powerful iconv in R

[This article was first published on BioStatMatt, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The R function iconv converts between character string encodings, for example, from the locale dependent encoding to UTF-8:

> iconv("foo", to="UTF-8")
[1] "foo"

However, R has long-running trouble with embedded null characters ('\0') in strings. Hence, if we try to convert to an encoding that permits embedded null characters, iconv will fail:

> iconv("foo", to="UTF-16")
Error in iconv("foo", to = "UTF-16") : 
  embedded nul in string: '\xff\xfef\0o\0o\0'

The ‘embedded nul’ error is thrown by mkCharLenCE, after the real conversion is complete. The converted string exists in memory, though not in a form that R can currently represent as a STRSXP. Hence the error when passed to mkCharLenCE.

The issue of embedded null characters has been discussed previously on the R mailing lists (see this thread), but I don’t think this is the issue here. The point here is that the C implementation of iconv operates on binary data, not necessarily null terminated C strings. Hence, in order to fully utilize the iconv mechanism, the R-level iconv ought to accept and return objects that can handle arbitrary binary data, i.e.of type RAWSXP, in addition to character vectors.

To this end, I’ve written a small patch (13 lines w/o documentation) against the current R-devel sources (r52328) that allows the R-level iconv to accept an argument of type RAWSXP, in addition to character vectors. Now, when a raw object is passed to iconv, no character substitution is performed, the arguments sub and mark are ignored, and a raw object is returned. However, rather than returning NA (NA does not exist for RAWSXPs) when conversions are invalid or incomplete, a partially converted object is returned. The following patch doesn’t touch any of the code associated with STRSXPs, nor affect the behavior of iconv when a character vector is passed.

Once compiled into R the new iconv will operate on raw vectors. Continuing with our example:

> bar <- iconv(charToRaw("foo"), to="UTF-16")
> bar
[1] ff fe 66 00 6f 00 6f 00
> rawToChar(iconv(bar, from="UTF-16"))
[1] "foo"

The patch code is listed below, and also available here R-devel-iconv-0.0.patch. P.S. Thanks to Tal Galili for recommending the GeSHi plugin for wordpress, it worked out nicely for the R and patch (lang="diff") code in this post, though I prefer the more subtle coloring in the patch code.

Index: src/library/base/R/New-Internal.R
===================================================================
--- src/library/base/R/New-Internal.R	(revision 52328)
+++ src/library/base/R/New-Internal.R	(working copy)
@@ -239,7 +239,7 @@
 
 iconv <- function(x, from = "", to = "", sub = NA, mark = TRUE)
 {
-    if(!is.character(x)) x <- as.character(x)
+    if(!is.character(x) && !is.raw(x)) x <- as.character(x)
     .Internal(iconv(x, from, to, as.character(sub), mark))
 }
 
Index: src/main/sysutils.c
===================================================================
--- src/main/sysutils.c	(revision 52328)
+++ src/main/sysutils.c	(working copy)
@@ -548,16 +548,17 @@
 	int mark;
 	const char *from, *to;
 	Rboolean isLatin1 = FALSE, isUTF8 = FALSE;
+	Rboolean isRawx = (TYPEOF(x) == RAWSXP);
 
-	if(TYPEOF(x) != STRSXP)
-	    error(_("'x' must be a character vector"));
+	if(TYPEOF(x) != STRSXP && !isRawx)
+	    error(_("'x' must be a character vector or raw"));
 	if(!isString(CADR(args)) || length(CADR(args)) != 1)
 	    error(_("invalid '%s' argument"), "from");
 	if(!isString(CADDR(args)) || length(CADDR(args)) != 1)
 	    error(_("invalid '%s' argument"), "to");
 	if(!isString(CADDDR(args)) || length(CADDDR(args)) != 1)
 	    error(_("invalid '%s' argument"), "sub");
-	if(STRING_ELT(CADDDR(args), 0) == NA_STRING) sub = NULL;
+	if(STRING_ELT(CADDDR(args), 0) == NA_STRING || isRawx) sub = NULL;
 	else sub = translateChar(STRING_ELT(CADDDR(args), 0));
 	mark = asLogical(CAD4R(args));
 	if(mark == NA_LOGICAL)
@@ -584,7 +585,7 @@
 	PROTECT(ans = duplicate(x));
 	R_AllocStringBuffer(0, &cbuff);  /* 0 -> default */
 	for(i = 0; i < LENGTH(x); i++) {
-	    si = STRING_ELT(x, i);
+	    si = isRawx ? x : STRING_ELT(x, i);
 	top_of_loop:
 	    inbuf = CHAR(si); inb = LENGTH(si);
 	    outbuf = cbuff.data; outb = cbuff.bufsize - 1;
@@ -622,7 +623,7 @@
 		goto next_char;
 	    }
 
-	    if(res != -1 && inb == 0) {
+	    if(res != -1 && inb == 0 && !isRawx) {
 		cetype_t ienc = CE_NATIVE;
 
 		nout = cbuff.bufsize - 1 - outb;
@@ -632,7 +633,13 @@
 		}
 		SET_STRING_ELT(ans, i, mkCharLenCE(cbuff.data, nout, ienc));
 	    }
-	    else SET_STRING_ELT(ans, i, NA_STRING);
+	    else if(!isRawx) SET_STRING_ELT(ans, i, NA_STRING);
+	    else {
+		nout = cbuff.bufsize - 1 - outb;
+		UNPROTECT(1);
+		PROTECT(ans = allocVector(RAWSXP, nout));
+		memcpy(RAW(ans), cbuff.data, nout);
+	    }
 	}
 	Riconv_close(obj);
 	R_FreeStringBuffer(&cbuff);

To leave a comment for the author, please follow the link and comment on their blog: BioStatMatt.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)