Using iconv via the R exported header

[This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers.]

Introduction

Character encodings can be tricky and frustrating to deal with. Several newer languages such as Go or Julia default to UTF-8 on all platforms, which greatly eases the use of characters from languages other than English. With R we are not quite as lucky. UTF-8 mostly works as desired on “operating systems with an x”, but sadly there are still a lot of Windows users out there for whom native UTF-8 is not quite in reach. A very detailed discussion of the issues involved was provided last summer on the R Developers Blog.

More recently, another blog post presented a useful idea: rely on the (public) header R_ext/Riconv.h, which transparently passes calls on to the iconv library R itself uses. (Strictly speaking this is an optional feature; see capabilities("iconv") to check your build of R.)
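The calls used from R_ext/Riconv.h (Riconv_open, Riconv, Riconv_close) follow the same open, convert, close pattern as the POSIX iconv functions. As a minimal standalone sketch of that pattern, the following uses the system `<iconv.h>` directly instead of R's wrapper, an assumption made only so that it compiles outside of an R package. The helper name `to_utf8` is hypothetical, and the cast on the input pointer assumes a glibc-style `char**` signature.

```cpp
#include <iconv.h>     // system iconv; assumed available (stand-in for R_ext/Riconv.h)
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert 'input' from the encoding 'from' to UTF-8.
std::string to_utf8(const std::string& input, const std::string& from) {
    // Step 1: open a conversion descriptor (target encoding comes first).
    iconv_t cd = iconv_open("UTF-8", from.c_str());
    if (cd == (iconv_t) -1) {
        throw std::runtime_error("Can't convert from '" + from + "' to 'UTF-8'");
    }

    // Output buffer sized generously for single-byte source encodings
    // such as windows-1252 (assumption: at most 4 output bytes per input byte).
    std::vector<char> out(input.size() * 4 + 1);
    char* in_ptr = const_cast<char*>(input.data());
    char* out_ptr = out.data();
    size_t in_left = input.size();
    size_t out_left = out.size();

    // Step 2: convert; iconv advances the pointers and decrements the counters.
    size_t result = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);

    // Step 3: always release the descriptor.
    iconv_close(cd);

    if (result == (size_t) -1 || in_left != 0) {
        throw std::runtime_error("Failed to convert input to UTF-8");
    }
    return std::string(out.data(), out.size() - out_left);
}
```

Inside a package, one would presumably swap the three calls for their Riconv counterparts so that the conversion goes through whichever iconv R was built with.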

In order to test this, we wrapped up a little (GitHub-only) package RcppIconvExample so that we could toss it at the accessible Windows builders for tests (given that we do not have a physical Windows machine around). Together with an input file encoded in windows-1252 (taken from the uchardet CRAN package wrapping Mozilla’s uchardet library), we can then read and convert text in this ‘foreign’ encoding:

win1252file <- system.file("rawdata", "windows-1252.txt", package="RcppIconvExample")
win1252txt <- RcppIconvExample::read_file(win1252file, "windows-1252")
cat(win1252txt)

L’œuf de volaille est un produit agricole servant d'ingrédient entrant dans la
composition de nombreux plats, dans de nombreuses cultures gastronomiques du
monde.

Our implementation of read_file() follows. It refactors the two functions in the blog post into a single function with an optional encoding argument:

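// cf https://fishandwhistle.net/post/2021/using-rs-cross-platform-iconv-wrapper-from-cpp11
#include <Rcpp.h>
#include <R_ext/Riconv.h>       // Riconv_open(), Riconv(), Riconv_close()
#include <fstream>
#include <string>

// [[Rcpp::export]]
std::string read_file(std::string filename, std::string encoding = "") {
    const int len = 2048;       // size of the UTF-8 output buffer
    char buffer[len/2];         // input buffer: at most len/2 bytes are read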

    std::ifstream file;
    file.open(filename, std::ifstream::in | std::ifstream::binary);

    file.read(buffer, len/2);
    size_t n_read = file.gcount();
    file.close();

    if (encoding == "") {       // no encoding given so return 'as is'
        return std::string(buffer, n_read);
    }

    std::string str_source(buffer, n_read);

    void* iconv_handle = Riconv_open("UTF-8", encoding.c_str());
    if (iconv_handle == ((void*) -1)) {
        Rcpp::stop("Can't convert from '%s' to 'UTF-8'", encoding.c_str());
    }

    const char* in_buffer = str_source.c_str();
    char utf8_buffer[len];
    char* utf8_buffer_mut = utf8_buffer;
    size_t in_bytes_left = n_read;
    size_t out_bytes_left = len;

    size_t result = Riconv(iconv_handle, &in_buffer, &in_bytes_left, &utf8_buffer_mut, &out_bytes_left);
    Riconv_close(iconv_handle);

    if (result == ((size_t) -1) || (in_bytes_left != 0)) {
        Rcpp::stop("Failed to convert file contents to UTF-8");
    }

    return std::string(utf8_buffer, len - out_bytes_left);
}

The entire function body is plain C++ conforming to the basic C++98 standard. It calls the C API of R to access iconv if a conversion is requested, and relies on Rcpp for the convenience of automating the interface and translating strings to SEXP objects and back.
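Note that read_file() as shown reads at most len/2 = 1024 bytes and converts them in a single call. For longer input, one possible variation is to convert in chunks and keep going while iconv reports that the output buffer is full. The following is only a sketch of that idea, again written against the system `<iconv.h>` rather than R's wrapper so that it stands alone; the helper name `to_utf8_chunked` and the chunk size are hypothetical.

```cpp
#include <cerrno>
#include <iconv.h>     // system iconv; assumed available (stand-in for R_ext/Riconv.h)
#include <stdexcept>
#include <string>

// Hypothetical helper: convert input of any length to UTF-8, one chunk at a time.
std::string to_utf8_chunked(const std::string& input, const std::string& from) {
    iconv_t cd = iconv_open("UTF-8", from.c_str());
    if (cd == (iconv_t) -1) {
        throw std::runtime_error("Can't convert from '" + from + "' to 'UTF-8'");
    }

    std::string result;
    char chunk[256];                                  // fixed-size output chunk
    char* in_ptr = const_cast<char*>(input.data());
    size_t in_left = input.size();

    while (in_left > 0) {
        char* out_ptr = chunk;
        size_t out_left = sizeof(chunk);
        size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        int err = errno;                              // save before any other call
        result.append(chunk, sizeof(chunk) - out_left);
        // E2BIG only means the chunk filled up; any other error is fatal.
        if (rc == (size_t) -1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("Failed to convert input to UTF-8");
        }
    }
    iconv_close(cd);
    return result;
}
```

The design choice here is to treat E2BIG as "flush and continue" rather than as a failure, which removes the need to guess the output size up front.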
