iconv_open: anyone having success with conversions between UTF-8, UTF-16, UTF-32?


Floh

Oct 12, 2013, 11:22:03 AM
to native-cli...@googlegroups.com
I'm banging my head against the wall here since I can't get iconv_open() to work (in current pepper_canary).

I need to convert from UTF-8 to UTF-32 (stored in a wchar_t string), UTF-32 to UTF-8, and UTF-16 to UTF-32. The UTF-16 to UTF-32 conversion is necessary because we have file formats which assume that wchar_t is 2 bytes (as on Windows).

On Linux, OSX and emscripten I can use the following iconv_open calls for this:

iconv_open("UTF-32LE", "UTF-8");
iconv_open("UTF-8", "UTF-32LE");
iconv_open("UTF-32LE", "UTF-16LE");

This doesn't work in PNaCl with newlib (I didn't test NaCl or glibc). Instead, the functions return -1, errno is set to 25 (ENOTTY), and strerror(errno) says "Not a character device".

Looking at usr/include/newlib.h I first thought that UTF-32 simply isn't supported, since only UTF-8, UTF-16, UCS-2 and UCS-4 are mentioned there.

So I basically tried all combinations of the following strings:

UCS-4, UCS-2, UCS-4LE, UCS-2LE, ucs-4, ucs-2 (with and without le), ucs_4, ucs_2 (with and without le), ucs4, ucs2, utf8, utf-8, utf_8, etc etc, all without success.

So... has anyone had success converting between UTF-8, UTF-16 and UTF-32 in PNaCl so far? If not, are there alternatives (apart from writing a converter myself, which I'm almost ready to do)?

Cheers & Thanks,
-Floh.

Floh

Oct 12, 2013, 12:08:40 PM
to native-cli...@googlegroups.com
PS: I don't think errno holds a meaningful value here; it probably isn't set by iconv_open at all. I moved the iconv_open call to another place, and there errno is 0 after the call, but the function still returns -1.
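For anyone who wants to check which encoding names a given build accepts, here is a minimal probing sketch. The helper name probe_iconv is made up for illustration. It relies on the POSIX convention that iconv_open() returns (iconv_t)-1 on failure; whether a particular newlib build also sets errno is exactly what is in doubt above, so the helper clears errno first and captures it right after the call.

```c
#include <errno.h>
#include <iconv.h>

/* Hypothetical helper: try to open a converter for the given pair of
   encoding names. Returns 0 on success. On failure it returns the errno
   value captured immediately after the call, or -1 if the library left
   errno untouched. */
int probe_iconv(const char *to, const char *from)
{
    errno = 0;                        /* so a stale value cannot mislead us */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1) {
        int saved = errno;            /* capture before any other call runs */
        return saved != 0 ? saved : -1;
    }
    iconv_close(cd);
    return 0;
}
```

Calling probe_iconv("UTF-32LE", "UTF-8") and friends in a loop over candidate names gives a quick table of what the library was built with.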

Victor Khimenko

Oct 12, 2013, 12:28:17 PM
to Native Client Discuss
On Sat, Oct 12, 2013 at 7:22 PM, Floh <flo...@gmail.com> wrote:
Looking at usr/include/newlib.h I first thought that UTF-32 simply isn't supported, since only UTF-8, UTF-16, UCS-2 and UCS-4 are mentioned there.

They are mentioned there, but they are not compiled in. Most users don't need iconv and/or are happy with a dumb iconv which only supports UTF-8 and nothing else. You'll need to recompile newlib to get anything besides UTF-8. Unfortunately I have no idea how to compile newlib for PNaCl.


Documentation for newlib's iconv can be found, surprisingly enough, in newlib/libc/iconv/iconv.tex. I don't know if it's available online (a quick search for "enable-newlib-iconv-from-encodings" "enable-newlib-iconv-to-encodings" returns dozens of copies of configure.in but not the info files with the documentation).

Floh

Oct 13, 2013, 7:56:26 AM
to native-cli...@googlegroups.com
Ah ok, thanks for the info. I think I'll try to come up with a generic UTF-8/16/32 converter, shouldn't be too hard.

Cheers,
-Floh.
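A rough idea of what the UTF-8 to UTF-32 half of such a converter could look like. This is only a sketch, not the code that ended up being used; the function name utf8_to_utf32 and the error convention (returning (size_t)-1) are assumptions made for the example. It writes code points in native byte order and rejects overlong forms, surrogates and values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode UTF-8 into UTF-32 code points (native byte order).
   Returns the number of code points written, or (size_t)-1 on malformed
   input or when dst is too small. */
size_t utf8_to_utf32(const unsigned char *src, size_t src_len,
                     uint32_t *dst, size_t dst_cap)
{
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    size_t i = 0, n = 0;

    while (i < src_len) {
        unsigned char b = src[i];
        uint32_t cp;
        size_t extra;   /* number of continuation bytes that follow */

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;                      /* invalid lead byte */

        if (extra > src_len - i - 1) return (size_t)-1;   /* truncated */

        for (size_t k = 1; k <= extra; ++k) {
            unsigned char c = src[i + k];
            if ((c & 0xC0) != 0x80) return (size_t)-1;    /* not 10xxxxxx */
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) return (size_t)-1;

        if (n >= dst_cap) return (size_t)-1;
        dst[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

No lookup tables are needed, only shifts and masks, so the code size stays small.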



JF Bastien

Oct 14, 2013, 12:48:06 PM
to native-cli...@googlegroups.com
I think it would make sense for us to support converting to/from UTF-{8,16,32}; I'm not sure about the other encodings (thoughts?). I'm building with this support in right now and will compare the size of final executables to make sure that it only goes up if iconv is used (and stays the same if it isn't). I'll open an issue and send a CL later today, and add some testing. If it all works out (it always does, right?) you should be able to get the canary SDK in a few days with that support.


--
You received this message because you are subscribed to the Google Groups "Native-Client-Discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to native-client-di...@googlegroups.com.
To post to this group, send email to native-cli...@googlegroups.com.
Visit this group at http://groups.google.com/group/native-client-discuss.
For more options, visit https://groups.google.com/groups/opt_out.

Floh

Oct 14, 2013, 2:50:25 PM
to native-cli...@googlegroups.com
Sounds good, this would be very helpful! These conversions should only add some code, no big conversion tables (for instance, see here: http://clang.llvm.org/doxygen/ConvertUTF_8c_source.html).

To give a quick overview of how we're using this: all our internal strings are represented as UTF-8. On some platforms we need to convert text input from UTF-16 or UTF-32 to UTF-8 (I think Windows is the only one where a wchar_t is 16 bits; on all others it is 32 bits). On the other end, we need to convert from UTF-8 to UTF-16 or UTF-32 on some platforms for calling system functions (for instance the ...W() functions on Windows), or where 3rd party libs expect wchar_t strings. Thankfully many 3rd party libs accept UTF-8 nowadays. Finally, we use wchar_t in our own text rendering to look up glyph texture coordinates. I think that covers it.

With this stuff we currently support all European languages, several Asian ones (Chinese, Korean, Thai), and even Arabic (which is special because it needs additional text processing for glyphs at the start and end of strings). We do need UTF-16 to UTF-32 even on platforms where wchar_t is 32 bits because some of our file formats have 16-bit wchar_t in them.
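To illustrate the file-format case, here is a minimal sketch of decoding 16-bit little-endian characters read from a file into 32-bit code points. It assumes the file data is UTF-16LE without a BOM; the function name utf16le_to_utf32 and the (size_t)-1 error convention are invented for the example. Reading the bytes individually keeps it independent of the host's wchar_t size and byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode little-endian UTF-16 bytes (no BOM) into native-order UTF-32.
   Returns the number of code points written, or (size_t)-1 on an odd byte
   count, an unpaired surrogate, or when dst is too small. */
size_t utf16le_to_utf32(const unsigned char *src, size_t src_bytes,
                        uint32_t *dst, size_t dst_cap)
{
    if (src_bytes % 2 != 0) return (size_t)-1;
    size_t units = src_bytes / 2, n = 0;

    for (size_t i = 0; i < units; ++i) {
        /* Assemble one 16-bit unit from two bytes, low byte first. */
        uint32_t u = (uint32_t)src[2 * i] | ((uint32_t)src[2 * i + 1] << 8);
        uint32_t cp = u;

        if (u >= 0xD800 && u <= 0xDBFF) {            /* high surrogate */
            if (i + 1 >= units) return (size_t)-1;   /* pair is cut off */
            uint32_t lo = (uint32_t)src[2 * i + 2] |
                          ((uint32_t)src[2 * i + 3] << 8);
            if (lo < 0xDC00 || lo > 0xDFFF) return (size_t)-1;
            cp = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
            ++i;                                     /* consumed two units */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return (size_t)-1;                       /* unpaired low surrogate */
        }

        if (n >= dst_cap) return (size_t)-1;
        dst[n++] = cp;
    }
    return n;
}
```

On a platform with a 32-bit wchar_t the resulting code points can be stored directly in a wchar_t string.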

I don't think language-specific code page conversions are needed anymore in web applications, but I might be wrong.

Thanks and Cheers,
-Floh.

JF Bastien

Oct 18, 2013, 12:36:50 PM
to native-cli...@googlegroups.com
It took more time than I expected, but iconv UTF-{8,16,32} support for PNaCl newlib and NaCl x86 newlib is now in. NaCl ARM newlib currently has some other issues standing in the way; that should be fixed soon.


The updated newlib should be available in canary SDK builds soon (once r12276 rolls into Chrome, and the SDK build rolls).



Floh

Oct 18, 2013, 3:44:14 PM
to native-cli...@googlegroups.com
Nice, thanks!
-Floh

Floh

Nov 13, 2013, 1:50:56 PM
to native-cli...@googlegroups.com
I finally got around to testing this. Something's still off:

A wchar_t is a 32-bit little-endian number in PNaCl (e.g. in a string literal L"wide string"), but the UTF-16 and UCS-4 (== UTF-32) conversion functions seem to accept and generate only big-endian numbers. As a workaround I had to endian-swap all 16-bit and 32-bit characters before and after calling the iconv routines.
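The swap workaround boils down to something like the sketch below. The function names swap16_buffer and swap32_buffer are made up for illustration; the idea is simply to reverse the bytes of every 16-bit or 32-bit unit in place before handing a buffer to iconv, and again on the result.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of every 16-bit unit in place. */
void swap16_buffer(uint16_t *buf, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        uint16_t v = buf[i];
        buf[i] = (uint16_t)((v >> 8) | (v << 8));
    }
}

/* Reverse the byte order of every 32-bit unit in place. */
void swap32_buffer(uint32_t *buf, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        uint32_t v = buf[i];
        buf[i] = (v >> 24) | ((v >> 8) & 0x0000FF00u) |
                 ((v << 8) & 0x00FF0000u) | (v << 24);
    }
}
```

Applying either function twice restores the original buffer, so the same routine serves for both directions.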

On PNaCl I have to use the following iconv_open() calls (others don't seem to be supported):

    convTable[UTF8toUTF32]  = iconv_open("UCS-4", "UTF-8");
    convTable[UTF32toUTF8]  = iconv_open("UTF-8", "UCS-4");
    convTable[UTF16toUTF32] = iconv_open("UCS-4", "UTF-16");

On all other (little-endian) POSIX-like platforms I'm using these calls (note the LE):

    convTable[UTF8toUTF32]  = iconv_open("UTF-32LE", "UTF-8");
    convTable[UTF32toUTF8]  = iconv_open("UTF-8", "UTF-32LE");
    convTable[UTF16toUTF32] = iconv_open("UTF-32LE", "UTF-16LE");
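For context, a converter opened this way is typically driven as in the following sketch. The wrapper name convert_buffer is hypothetical, and the encoding names are passed in by the caller, so the same code works with either set of iconv_open() strings. It assumes a platform where iconv() takes a char ** input pointer (as in glibc) and stateless encodings such as the UTF family, so no final flush call is made.

```c
#include <iconv.h>
#include <stddef.h>

/* One-shot conversion between two encodings named as for iconv_open().
   Returns the number of bytes written to dst, or (size_t)-1 if the
   converter cannot be opened, the input is invalid, or dst is too small. */
size_t convert_buffer(const char *to, const char *from,
                      const char *src, size_t src_bytes,
                      char *dst, size_t dst_bytes)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1) return (size_t)-1;

    /* iconv() advances these pointers and decrements the counters. */
    char *in = (char *)src;       /* the input bytes are only read */
    char *out = dst;
    size_t in_left = src_bytes;
    size_t out_left = dst_bytes;

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1 || in_left != 0) return (size_t)-1;
    return dst_bytes - out_left;
}
```

For example, convert_buffer("UTF-32LE", "UTF-8", text, len, out, sizeof out) would fill out with 4 bytes per code point on a library that knows the "UTF-32LE" name.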

BTW: Here's a text renderer demo which uses the UTF conversion code 


Cheers,
-Floh.


JF Bastien

Nov 13, 2013, 2:45:05 PM
to native-cli...@googlegroups.com
I thought the default (no LE/BE suffix) was the platform's default, and that LE/BE added BOMs?

I'll look into it some more. This internationalization thing is foreign to me, despite my name having that weird character in it ;-)
Thanks for reporting this issue.



Floh

Nov 13, 2013, 4:24:01 PM
to native-cli...@googlegroups.com
I'm not an expert on iconv either, but I seem to remember that I explicitly had to specify the LE formats on x86 Linux and OSX to make everything work as expected.

JF Bastien

Nov 13, 2013, 6:01:46 PM
to native-cli...@googlegroups.com, Andre Weissflog
OK, I dug into this a bit more and my initial understanding was flawed: when BE/LE is specified then there *shouldn't* be a BOM. UTF8 is correct since it doesn't have an endianness, UTF16 was LE with BOM, and UTF32 was BE without BOM. I changed things around so that we now accept UTF8 (unchanged), UTF16LE, and UCS4LE (newlib's implementation doesn't seem to have the UTF32LE alias to UCS4).
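Since the set of accepted names differs between iconv implementations, portable code can try several spellings in turn. The following is only a sketch of that idea; the helper name open_first_supported is made up, and which names a given library accepts (and whether two names really denote the same byte order and BOM behavior) is an assumption the caller has to verify.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: open a converter from `from` to the first target
   name in to_names that the library accepts. Returns (iconv_t)-1 if none
   of the names is supported. */
iconv_t open_first_supported(const char *const *to_names, size_t count,
                             const char *from)
{
    for (size_t i = 0; i < count; ++i) {
        iconv_t cd = iconv_open(to_names[i], from);
        if (cd != (iconv_t)-1) return cd;   /* first accepted name wins */
    }
    return (iconv_t)-1;
}
```

A caller might pass a list such as { "UTF-32LE", "UCS-4LE" } for the 32-bit little-endian target; the exact spelling newlib expects may differ from the one used here.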

I'll have to do a multi-step update to our newlib build and toolchain so that I don't break intermediate tests, I'll email back when it's all gone through.



JF Bastien

Nov 18, 2013, 5:38:27 PM
to native-cli...@googlegroups.com, Andre Weissflog
This should now be fixed for PNaCl as of https://codereview.chromium.org/72293002/
It'll be in a canary SDK in the near future.

Thanks for reporting the issue, let me know if it indeed works for you!