Using R's cross-platform iconv wrapper from cpp11
In some recent adventures parsing text embedded within binary files, I came across the need to correctly interpret input bytes from a file representing a character string. As a developer, you want to write software that protects users from ever having to deal with an encoding issue! I do a fair amount of (and like the challenge of) parsing old file formats that assumed system encoding, and in some cases I need to write (/enjoy writing and learning about) the parser in compiled code. So how does iconv() work when you’re trying to do this in C++?
The first principle is: always return a character vector with elements encoded as (and correctly marked as) UTF-8. In particular, Jim Hester has written about this and given an excellent overview in a talk about cpp11. As somebody who now spends 37.5 hours a week in Windows for work, I appreciate this more than I did before. Basically, R’s default is to interpret the bytes of a string as system encoding, which for everything except Windows is UTF-8. In Windows, strings have to be correctly marked in order to be interpreted as UTF-8, and UTF-8 strings that are not marked will end up looking like this:
## [1] "Québec"
In the recent talk about cpp11 I linked above, Jim Hester alludes to the idea that you should only deal with non-UTF-8 encodings at the edges of your program. When reading user-supplied files, that’s where we’re at! I’ll dig in with an example. I’ve taken the liberty of writing the great province of Québec’s name ^[This is pretty close to exactly what inspired this: government-produced shapefiles generally have to support both French and English and require some guessing about what encoding the computer generating the file was using that day.] to disk so that we have some files to work with:
readr::read_file("qc.utf8.txt")
## [1] "Québec"
readr::read_file(
"qc.windows1252.txt",
locale = readr::locale(encoding = "windows-1252")
)
## [1] "Québec"
The problem is that if you don’t have a way to specify the input encoding, you get something like this:
readr::read_file("qc.windows1252.txt")
## [1] "Qu\xe9bec"
Or this:
readr::read_file(
"qc.utf8.txt",
locale = readr::locale(encoding = "windows-1252")
)
## [1] "Québec"
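To see where both kinds of garbled output come from at the byte level, here is a minimal C++ sketch. It assumes ISO-8859-1 (Latin-1) as a simplified stand-in for windows-1252, since Latin-1 bytes map directly onto the first 256 Unicode code points (the two encodings only differ for bytes 0x80 to 0x9F, and they agree on é). The name latin1_to_utf8() is a hypothetical helper for illustration, not something provided by R or cpp11.

```cpp
#include <string>

// Hypothetical helper: re-encode ISO-8859-1 (Latin-1) bytes as UTF-8.
// Latin-1 is a simplified stand-in for windows-1252 here; the two differ
// for bytes 0x80-0x9F, which this sketch does not treat specially.
std::string latin1_to_utf8(const std::string& input) {
  std::string output;
  output.reserve(input.size() * 2);  // a Latin-1 byte becomes at most 2 bytes

  for (char ch : input) {
    unsigned char byte = static_cast<unsigned char>(ch);
    if (byte < 0x80) {
      // ASCII bytes are identical in both encodings
      output.push_back(static_cast<char>(byte));
    } else {
      // Code points U+0080 to U+00FF take two bytes in UTF-8:
      // 110xxxxx 10xxxxxx
      output.push_back(static_cast<char>(0xC0 | (byte >> 6)));
      output.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
    }
  }

  return output;
}
```

Given the single byte 0xE9 that windows-1252 uses for é, this produces the two bytes 0xC3 0xA9 that UTF-8 expects. Feeding it bytes that are already UTF-8 converts 0xC3 and 0xA9 separately, as if they were Ã and ©, which is the doubly encoded output shown just above.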
To start off, I’ll illustrate one way to read in some bytes from a file in C++, using cpp11 to expose the function to R:
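#include <fstream>
#include <string>
#include "cpp11.hpp"
using namespace cpp11;
namespace writable = cpp11::writable;
[[cpp11::register]]
std::string demo_read_file(std::string filename) {
  // Read (at most) the first 1024 bytes of the file, without any translation
  char buffer[1024];
  std::ifstream file;
  file.open(filename, std::ifstream::binary);
  file.read(buffer, 1024);
  // gcount() is the number of bytes that were actually read
  size_t n_read = file.gcount();
  file.close();
  // Construct from pointer + length so embedded nul bytes are kept
  return std::string(buffer, n_read);
}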
This function only reads the first 1024 bytes of the file, but this is ok for our purposes. Giving it a go on our files, we get the following:
demo_read_file("qc.utf8.txt")
## [1] "Québec"
demo_read_file("qc.windows1252.txt")
## [1] "Qu\xe9bec"
Lo and behold we get the weird output I foretold! This is because cpp11 marks anything that comes out of C++ as UTF-8, and the second file was not UTF-8 encoded:
Encoding(demo_read_file("qc.utf8.txt"))
## [1] "UTF-8"
Encoding(demo_read_file("qc.windows1252.txt"))
## [1] "UTF-8"
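Since the UTF-8 mark is applied whether or not the bytes deserve it, one option is to check the bytes before they ever leave C++. The following is a minimal sketch of such a check; is_valid_utf8() is a hypothetical helper name (not part of cpp11 or R), and it only tests that the bytes are well-formed UTF-8, not that they mean what the file's author intended.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` is well-formed UTF-8.
bool is_valid_utf8(const std::string& bytes) {
  // Smallest code point that legitimately needs 1, 2, 3, or 4 bytes
  static const unsigned long min_code_point[] = {0x0, 0x80, 0x800, 0x10000};

  const size_t n = bytes.size();
  size_t i = 0;
  while (i < n) {
    unsigned char lead = static_cast<unsigned char>(bytes[i]);
    size_t n_continuation;
    unsigned long code_point;

    if (lead < 0x80) {
      n_continuation = 0;
      code_point = lead;
    } else if ((lead & 0xE0) == 0xC0) {
      n_continuation = 1;
      code_point = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
      n_continuation = 2;
      code_point = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
      n_continuation = 3;
      code_point = lead & 0x07;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }

    if (n - i - 1 < n_continuation) {
      return false;  // sequence is cut off by the end of the input
    }

    for (size_t k = 1; k <= n_continuation; k++) {
      unsigned char next = static_cast<unsigned char>(bytes[i + k]);
      if ((next & 0xC0) != 0x80) {
        return false;  // expected a 10xxxxxx continuation byte
      }
      code_point = (code_point << 6) | (next & 0x3F);
    }

    // Reject overlong encodings, UTF-16 surrogates, and values past U+10FFFF
    if (code_point < min_code_point[n_continuation]) return false;
    if (code_point >= 0xD800 && code_point <= 0xDFFF) return false;
    if (code_point > 0x10FFFF) return false;

    i += n_continuation + 1;
  }

  return true;
}
```

The windows-1252 bytes for Québec fail this check because 0xE9 looks like the start of a three-byte sequence but is followed by a plain ASCII b. In a cpp11 function, a failed check could be turned into an R error with stop() instead of returning a mislabelled string.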
Enter iconv! In particular, the Riconv() function which is built in to R and accessible in C or C++ via the R_ext/Riconv.h header. While you could attempt to link to a system library (that might have more encodings supported), using the built-in R one saves you from writing a configure script and the other complexities that arise when using a system library. Unfortunately, the headers don’t give us much to go on:
void * Riconv_open (const char* tocode, const char* fromcode);
size_t Riconv (void * cd, const char **inbuf, size_t *inbytesleft,
               char **outbuf, size_t *outbytesleft);
int Riconv_close (void * cd);
I am eternally grateful for this post, which has to be one of the only readable examples of using iconv() on the internet (my apologies if you have a good one that I missed). Basically, we need an output buffer that’s big enough to hold the re-encoded string and to very carefully pass it to Riconv() ^[The degree of carefulness I exude when calling functions from other libraries is directly proportional to the number of * and & symbols in the definition, and there are a lot of them here].
#include <fstream>
#include <string>
#include "cpp11.hpp"
using namespace cpp11;
namespace writable = cpp11::writable;
#include <R_ext/Riconv.h>
[[cpp11::register]]
std::string demo_read_file_enc(std::string filename, std::string encoding) {
  // Read (at most) the first 1024 bytes of the file, without any translation
  char buffer[1024];
  std::ifstream file;
  file.open(filename, std::ifstream::in | std::ifstream::binary);
  file.read(buffer, 1024);
  size_t n_read = file.gcount();
  file.close();
  std::string str_source(buffer, n_read);
  // Riconv_open() returns (void*) -1 if the conversion isn't supported
  void* iconv_handle = Riconv_open("UTF-8", encoding.c_str());
  if (iconv_handle == ((void*) -1)) {
    stop("Can't convert from '%s' to 'UTF-8'", encoding.c_str());
  }
  const char* in_buffer = str_source.c_str();
  char utf8_buffer[2048];
  // Riconv() moves this pointer, so it must be a copy and not utf8_buffer itself
  char* utf8_buffer_mut = utf8_buffer;
  size_t in_bytes_left = n_read;
  size_t out_bytes_left = 2048;
  size_t result = Riconv(
    iconv_handle,
    &in_buffer, &in_bytes_left,
    &utf8_buffer_mut, &out_bytes_left
  );
  // Close the handle before (possibly) throwing so that it is never leaked
  Riconv_close(iconv_handle);
  // (size_t) -1 signals an error; leftover input means an incomplete conversion
  if (result == ((size_t) -1) || (in_bytes_left != 0)) {
    stop("Failed to convert file contents to UTF-8");
  }
  // The number of bytes written is the capacity minus what is left
  return std::string(utf8_buffer, 2048 - out_bytes_left);
}
The thing I missed the first 7 times I tried this was that **outbuf needs to be a separate pointer from utf8_buffer because it gets modified by Riconv(). Actually, all four pointer arguments get updated: *inbuf and *outbuf are advanced past the bytes that were consumed and written, and *inbytesleft and *outbytesleft are decremented to match. Also, the error codes of (void*) -1 and (size_t) -1 I only got from reading the R source code for R’s iconv(). Let’s see if it worked!
demo_read_file_enc("qc.utf8.txt", "UTF-8")
## [1] "Québec"
demo_read_file_enc("qc.windows1252.txt", "windows-1252")
## [1] "Québec"
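To make the pointer bookkeeping concrete without needing R's headers, here is a sketch built around toy_convert(), a hypothetical stand-in that follows the same calling convention as Riconv() (minus the handle) but only knows how to turn ISO-8859-1 bytes into UTF-8. The wrapper convert_with_growing_buffer() is also a made-up name; it shows one way a caller could use the updated pointers and counters to retry with a larger output buffer instead of relying on a fixed size.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in using the same pointer-updating convention as
// Riconv() (without the handle): converts ISO-8859-1 bytes to UTF-8.
// Returns (size_t) -1 if the output fills up before the input is consumed.
size_t toy_convert(const char** inbuf, size_t* inbytesleft,
                   char** outbuf, size_t* outbytesleft) {
  while (*inbytesleft > 0) {
    unsigned char byte = static_cast<unsigned char>(**inbuf);
    size_t needed = (byte < 0x80) ? 1 : 2;
    if (*outbytesleft < needed) {
      return static_cast<size_t>(-1);
    }

    if (byte < 0x80) {
      *(*outbuf)++ = static_cast<char>(byte);
    } else {
      *(*outbuf)++ = static_cast<char>(0xC0 | (byte >> 6));
      *(*outbuf)++ = static_cast<char>(0x80 | (byte & 0x3F));
    }

    // Advance the caller's input pointer and update both counters
    (*inbuf)++;
    (*inbytesleft)--;
    *outbytesleft -= needed;
  }

  return 0;
}

// Hypothetical wrapper: keep converting, doubling the output buffer
// whenever the converter reports that it ran out of room.
std::string convert_with_growing_buffer(const std::string& input) {
  std::vector<char> output(input.size() + 1);
  const char* in_ptr = input.data();
  size_t in_bytes_left = input.size();
  size_t n_written = 0;

  while (true) {
    // Recompute the write position each time: resize() may move the storage
    char* out_ptr = output.data() + n_written;
    size_t out_bytes_left = output.size() - n_written;

    size_t result = toy_convert(&in_ptr, &in_bytes_left,
                                &out_ptr, &out_bytes_left);
    n_written = output.size() - out_bytes_left;

    if (result != static_cast<size_t>(-1)) {
      break;
    }

    // in_ptr and in_bytes_left already point at the unconverted remainder
    output.resize(output.size() * 2);
  }

  return std::string(output.data(), n_written);
}
```

The toy converter has only one way to fail. A real iconv() call also returns (size_t) -1 for invalid or truncated input, and POSIX iconv() distinguishes those cases through errno (E2BIG for a full output buffer), so a loop like this would need to check that before retrying.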
The demo_read_file_enc() example above won’t work for everything. In particular, I guessed that a buffer twice as big as the input would be enough to hold the output, which is not true in general: a single windows-1252 byte such as the euro sign (0x80) becomes three bytes in UTF-8 (0xE2 0x82 0xAC). Because the example did not call any C++ that could throw an exception between Riconv_open() and Riconv_close(), I didn’t need to do anything fancy to manage the lifecycle of that handle. When I implemented this to get the encoding right while reading DBF files, I used a C++ class with a deleter that called Riconv_close() to ensure it would not be forgotten about and leak memory.
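As a rough idea of what a class like that could look like (this is a sketch under assumptions, not the code from the DBF reader), the handle can be opened in a constructor and closed in the destructor so that it is released even when an exception is thrown. So that the sketch compiles outside of an R package, standin_open() and standin_close() are hypothetical placeholders with the same signatures as Riconv_open() and Riconv_close(); in a package you would include R_ext/Riconv.h and call the real functions instead. The class name IconvHandle is also made up.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins with the same signatures as Riconv_open() and
// Riconv_close(). They only count how many handles are currently open.
int n_live_handles = 0;

void* standin_open(const char* tocode, const char* fromcode) {
  (void) tocode;
  (void) fromcode;
  n_live_handles++;
  return new int(0);
}

int standin_close(void* cd) {
  delete static_cast<int*>(cd);
  n_live_handles--;
  return 0;
}

// Owns a conversion handle for exactly as long as the object is alive.
class IconvHandle {
public:
  IconvHandle(const std::string& to, const std::string& from)
      : handle_(standin_open(to.c_str(), from.c_str())) {
    // Same failure sentinel as the example above; if we throw here the
    // destructor does not run, which is fine because nothing was opened
    if (handle_ == ((void*) -1)) {
      throw std::runtime_error(
        "Can't convert from '" + from + "' to '" + to + "'");
    }
  }

  // Runs on normal exit and during exception unwinding alike
  ~IconvHandle() { standin_close(handle_); }

  // Copying would lead to the same handle being closed twice
  IconvHandle(const IconvHandle&) = delete;
  IconvHandle& operator=(const IconvHandle&) = delete;

  // Raw handle to pass as the first argument of the conversion call
  void* get() const { return handle_; }

private:
  void* handle_;
};
```

With something like this, the body of a conversion function can declare IconvHandle handle("UTF-8", encoding); once, pass handle.get() wherever the raw handle is needed, and throw from anywhere afterwards without a matching close call on every exit path.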