UTF-8 and wchar_t

Michal Nazarewicz

Mar 2, 2010, 8:28:50 AM

Hello everyone,

I am facing a situation where I need to handle UTF-8 input along with
input from standard input (i.e. locale-dependent multibyte). In the end,
after some computations, concatenations, etc., I need to output the result to
standard output (again locale-dependent multibyte).

What I want to do is convert both the UTF-8 input as well as data from
standard input to an array of wchar_t and then output it using wprintf()
(or one of the other "wide" functions).

It is my understanding, however, that an implementation may choose any
encoding for wchar_t, which means that I cannot simply assume it
stores Unicode code points.

On the other hand, is my understanding correct that if the
__STDC_ISO_10646__ macro is defined, then wchar_t in fact represents
Unicode code points? If so, I could check for that macro and signal
#error if it's not defined, right?
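A minimal sketch of the compile-time check described above. The helper wide_literals_are_code_points() is a hypothetical name added only for illustration; with the macro defined, wide character constants are expected to carry their ISO 10646 values.

```c
#include <wchar.h>

/* Refuse to build where the implementation does not promise that
 * wchar_t values are ISO 10646 (Unicode) code points. */
#ifndef __STDC_ISO_10646__
#error "wchar_t is not guaranteed to hold ISO 10646 (Unicode) code points"
#endif

/* Hypothetical helper: returns nonzero when two sample wide character
 * constants have the values of their Unicode code points. */
static int wide_literals_are_code_points(void)
{
    return L'A' == 0x41 && L'\u20AC' == 0x20AC;
}
```

The guard costs nothing at run time; the build simply stops on an implementation that does not define the macro.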

Also, what happens when I pass wprintf() a string containing a wide
character that has no representation in the current locale (e.g. some funky
Unicode character when the locale uses the ISO-8859-1 encoding)? Can
I somehow instruct the standard library to print, say,
a question mark in such situations, or do I have to handle such cases
myself?

--
Best regards, _ _
.o. | Liege of Serenly Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +--<mina86*tlen.pl>--<jid:mina86*jabber.org>--ooO--(_)--Ooo--

Mikko Rauhala

Mar 2, 2010, 10:03:56 AM

On Tue, 02 Mar 2010 14:28:50 +0100, Michal Nazarewicz <min...@tlen.pl> wrote:
> On the other hand, is my understanding correct that if the
> __STDC_ISO_10646__ macro is defined, then wchar_t in fact represents
> Unicode code points? If so, I could check for that macro and signal
> #error if it's not defined, right?

Yes, but note also that both 16-bit and 32-bit Unicode wchar_t
implementations are in use (the former on Windows and the latter
on at least some *nix systems). A 16-bit wchar_t cannot hold a code
point above U+FFFF in a single element.

(Don't know about wprintf() and unrepresentable characters offhand.)

--
Mikko Rauhala <m...@iki.fi> - http://www.iki.fi/mjr/blog/
The Finnish Pirate Party - http://piraattipuolue.fi/
World Transhumanist Association - http://transhumanism.org/
Singularity Institute - http://singinst.org/

Ersek, Laszlo

Mar 2, 2010, 1:16:20 PM

In article <87hboyc...@erwin.mina86.com>, Michal Nazarewicz <min...@tlen.pl> writes:
> Hello everyone,
>
> I am facing a situation where I need to handle UTF-8 input along with
> input from standard input (i.e. locale-dependent multibyte). In the end,
> after some computations, concatenations, etc., I need to output the result to
> standard output (again locale-dependent multibyte).
>
> What I want to do is convert both the UTF-8 input as well as data from
> standard input to an array of wchar_t and then output it using wprintf()
> (or one of the other "wide" functions).

Handle stdin and stdout as you intend, i.e. with setlocale() and the
implicit conversion provided by the <stdio.h> functions.

For the UTF-8 input coming from elsewhere: if you can stick with glibc,
just call

#include <iconv.h>

convdesc = iconv_open("WCHAR_T", "UTF-8");

http://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html

Otherwise, you'll have to switch at least the LC_CTYPE locale category
manually, and proceed with the separate input like with stdin.
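The iconv route above might look roughly like the sketch below. It assumes an iconv implementation that accepts the "WCHAR_T" encoding name (glibc does; others may not). The function name utf8_to_wide() and its buffer convention are hypothetical, not taken from the thread.

```c
#include <iconv.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts a NUL-terminated UTF-8 string into out
 * (capacity out_cap wide characters, terminator included).
 * Returns the number of wide characters stored, or (size_t)-1 on error. */
static size_t utf8_to_wide(const char *utf8, wchar_t *out, size_t out_cap)
{
    iconv_t cd;
    char *in = (char *)utf8;          /* iconv() takes char **, not const */
    char *dst = (char *)out;
    size_t in_left = strlen(utf8);
    size_t out_left, rc, n;

    if (out_cap == 0)
        return (size_t)-1;
    out_left = (out_cap - 1) * sizeof *out;   /* keep room for L'\0' */

    /* "WCHAR_T" as a target name is an assumption about the iconv in use. */
    cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &in, &in_left, &dst, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;            /* invalid input or buffer too small */

    n = (out_cap - 1) - out_left / sizeof *out;
    out[n] = L'\0';
    return n;
}
```

A caller would pass a buffer with at least as many elements as the input has bytes, plus one, since a UTF-8 string never decodes to more characters than it has bytes.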

Cheers,
lacos

Ersek, Laszlo

Mar 2, 2010, 2:20:43 PM

In article <87hboyc...@erwin.mina86.com>, Michal Nazarewicz <min...@tlen.pl> writes:

> Also, what happens when I pass wprintf() a string containing a wide
> character that has no representation in the current locale (e.g. some funky
> Unicode character when the locale uses the ISO-8859-1 encoding)?

wprintf() will return a negative value [and errno will be set to EILSEQ].

> Can I somehow instruct the standard library function to print, say,
> a question mark in such situations or do I have to handle such cases by
> myself?

On second thought, you might be better off converting the output
with iconv() too, from WCHAR_T to the codeset used by the current
locale.

http://www.opengroup.org/onlinepubs/007908775/xsh/iconv.html
----v----
If iconv() encounters a character in the input buffer that is valid, but
for which an identical character does not exist in the target codeset,
iconv() performs an implementation-dependent conversion on this
character.
----^----

(You would have to test this.)

You should be able to get the codeset used by the current locale by
calling

nl_langinfo(CODESET)

http://www.opengroup.org/onlinepubs/007908775/xsh/nl_langinfo.html
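As a hedged sketch of this output path: the function below converts a wide string to a named codeset and substitutes '?' whenever iconv() reports EILSEQ for a character. That error behaviour is what glibc does; the specification quoted above lets other implementations substitute a character of their own instead. The name wide_to_codeset() is hypothetical, and writing '?' directly assumes a stateless target encoding.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: converts ws to the given codeset, storing a
 * NUL-terminated result in out (capacity out_cap bytes).
 * Returns 0 on success, -1 on error. */
static int wide_to_codeset(const wchar_t *ws, const char *codeset,
                           char *out, size_t out_cap)
{
    iconv_t cd;
    char *in = (char *)ws;
    char *dst = out;
    size_t in_left = wcslen(ws) * sizeof *ws;
    size_t out_left;
    int rc = 0;

    if (out_cap == 0)
        return -1;
    out_left = out_cap - 1;               /* keep room for the NUL */

    /* "WCHAR_T" as a source name is an assumption about the iconv in use. */
    cd = iconv_open(codeset, "WCHAR_T");
    if (cd == (iconv_t)-1)
        return -1;

    while (in_left > 0) {
        if (iconv(cd, &in, &in_left, &dst, &out_left) != (size_t)-1)
            continue;                     /* all input consumed */
        if (errno != EILSEQ || out_left == 0) {
            rc = -1;                      /* buffer full or other failure */
            break;
        }
        *dst++ = '?';                     /* replace the offending character */
        --out_left;
        in += sizeof *ws;                 /* and skip it in the input */
        in_left -= sizeof *ws;
    }
    iconv_close(cd);
    *dst = '\0';
    return rc;
}
```

After setlocale(LC_CTYPE, ""), the value of nl_langinfo(CODESET) from <langinfo.h> could be passed as the codeset argument.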

(Sorry for being glibc/SUSv2-specific.)

Cheers,
lacos

Michal Nazarewicz

Mar 2, 2010, 3:42:07 PM

> In article <87hboyc...@erwin.mina86.com>,
> Michal Nazarewicz <min...@tlen.pl> writes:
>> Also, what happens when I pass wprintf() a string containing a wide
>> character that has no representation in the current locale (e.g. some funky
>> Unicode character when the locale uses the ISO-8859-1 encoding)?

la...@ludens.elte.hu (Ersek, Laszlo) writes:
> wprintf() will return a negative value [and errno will be set to EILSEQ].

>> Can I somehow instruct the standard library function to print, say,
>> a question mark in such situations or do I have to handle such cases by
>> myself?
>
> On second thought, you might be better off converting the output
> with iconv() too, from WCHAR_T to the codeset used by the current
> locale.

[...]

> (Sorry for being glibc/SUSv2-specific.)

Thanks for all the links and information. I have been considering
iconv() but didn't notice that it can do conversion to/from wchar_t as
well and that was my biggest concern. I'll be sure to look more into
it.

Being SUS-specific is not a big issue since perfect portability is not
my goal (i.e. what Mikko Rauhala confirmed earlier about
__STDC_ISO_10646__ (thanks, Mikko!) was enough for me, as I imagine
"major" implementations use Unicode as the internal representation of
wchar_t). However, depending on glibc may hurt me a bit, as my code won't
quite work on, say, BSD then.

Anyway, thanks for all the comments and links!

Nobody

Mar 2, 2010, 4:16:18 PM

On Tue, 02 Mar 2010 19:16:20 +0100, Ersek, Laszlo wrote:

> For the UTF-8 input coming from elsewhere: if you can stick with glibc,
> just call
>
> #include <iconv.h>
>
> convdesc = iconv_open("WCHAR_T", "UTF-8");

Or you can write your own UTF-8 encoder/decoder. Personally, I wouldn't
make iconv a requirement just for UTF-8. If you're going to be using iconv
anyhow, then you may as well use it for this as well.
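For a sense of scale, the encoding half of a hand-written converter can be sketched in a few lines. utf8_encode() is a hypothetical name; the sketch rejects surrogates and values above U+10FFFF, which UTF-8 does not permit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one code point as UTF-8 into out.
 * Returns the number of bytes written (1..4), or 0 if cp is not a
 * valid Unicode scalar value. */
static size_t utf8_encode(uint_least32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                 /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                         /* beyond the Unicode range */
}
```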

Nick

Mar 3, 2010, 2:38:45 AM

Nobody <nob...@nowhere.com> writes:

I found sqlite3_unicode a great help when doing that - lots of stuff to
start you off and no copyright issues:

You can find it at:
<URL:http://ioannis.mpsounds.net/blog/2009/01/11/sqlite3_unicode-updated-for-sqlite3-v367/>

(it's not that ioannis btw).
--
Online waterways route planner | http://canalplan.eu
Plan trips, see photos, check facilities | http://canalplan.org.uk

Mikko Rauhala

Mar 3, 2010, 5:04:31 AM

On Tue, 02 Mar 2010 21:42:07 +0100, Michal Nazarewicz <min...@tlen.pl> wrote:
> Thanks for all the links and information. I have been considering
> iconv() but didn't notice that it can do conversion to/from wchar_t as
> well and that was my biggest concern. I'll be sure to look more into
> it.

To clarify further, it's not necessarily able to do so. The GNU
implementation does support it, but more generally, available
iconv sources/targets are implementation-defined.

> however depending on glibc may hurt me a bit as my code won't
> quite work on, say, BSD then.

Indeed I'm not sure if WCHAR_T is available for iconv there.
You can probably use GNU libiconv (under LGPL) there too if you like,
though.

(Yeah, getting Unixy, sorry about that; if one continues further,
probably better to move to comp.unix.programming)

Ersek, Laszlo

Mar 3, 2010, 7:42:06 AM

In article <pan.2010.03.02....@nowhere.com>, Nobody <nob...@nowhere.com> writes:
> On Tue, 02 Mar 2010 19:16:20 +0100, Ersek, Laszlo wrote:
>
>> For the UTF-8 input coming from elsewhere: if you can stick with glibc,
>> just call
>>
>> #include <iconv.h>
>>
>> convdesc = iconv_open("WCHAR_T", "UTF-8");
>
> Or you can write your own UTF-8 encoder/decoder.

I would rather not try.

Invalid byte sequences (overlong forms) have security implications:
http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
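To illustrate the overlong-form problem: a decoder that only strips the marker bits will happily turn the two bytes 0xC0 0xAF into '/', a character that an earlier filter looking for the single byte 0x2F would have missed. The two function names below are hypothetical and the sketch covers two-byte sequences only.

```c
#include <stdint.h>

/* Naive two-byte decode: strips the marker bits and nothing else. */
static uint_least32_t naive_decode2(unsigned char b0, unsigned char b1)
{
    return ((uint_least32_t)(b0 & 0x1F) << 6) | (b1 & 0x3F);
}

/* A two-byte sequence is overlong when its value would have fit in
 * a single byte, i.e. when it is below 0x80. */
static int is_overlong2(unsigned char b0, unsigned char b1)
{
    return naive_decode2(b0, b1) < 0x80;
}
```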

--o--

I would have expected iconv() to support some kind of normalization too:

Combining characters (when already in UCS/wchar_t):
http://www.cl.cam.ac.uk/~mgk25/unicode.html#comb

Normalizing to NFC (when already in UCS/wchar_t -- "NFC is the preferred
form for Linux and WWW"):
http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf

But it seems this is not a reasonable expectation:

(2001) http://sources.redhat.com/ml/libc-alpha/2001-09/msg00170.html
(2004) http://sourceware.org/ml/libc-alpha/2004-01/msg00287.html
(2009) http://www.mail-archive.com/bug-co...@gnu.org/msg15501.html

Cheers,
lacos

Michal Nazarewicz

Mar 7, 2010, 6:51:51 AM

Michal Nazarewicz <min...@tlen.pl> writes:
> I am facing a situation where I need to handle UTF-8 input along with
> input from standard input (i.e. locale-dependent multibyte). In the end,
> after some computations, concatenations, etc., I need to output the result to
> standard output (again locale-dependent multibyte).

Thanks, everyone, for the comments. I managed to come up with code that seems
to work. I used the standard C approach with the requirement that
__STDC_ISO_10646__ is defined (i.e. wchar_t holds Unicode code points).

If anyone is interested in the code, here it is:

Initialisation:

#v+
static void initCodesets(void) {
    setlocale(LC_CTYPE, "");
}
#v-

Printing a wide-character string replacing invalid characters by '?':

#v+
static void _out(wchar_t ch) {
    char buf[MB_CUR_MAX];
    int ret;

retry:
    ret = wctomb(buf, ch);
    if (ret > 0) {
        printf("%.*s", ret, buf);
    } else if (ch != L'?') {
        ch = L'?';
        goto retry;
    } else {
        putchar('?');
    }
}

static void _outs(const wchar_t *str, size_t len) {
    wctomb(NULL, 0);  /* reset the conversion state */
    for (; len; --len, ++str) {
        _out(*str);
    }
}
#v-

Converting UTF-8 to wchar_t ignoring invalid or unsupported sequences:

#v+
static size_t appendUTF(size_t offset, const char *str, size_t len) {
    uint_least32_t val = 0;
    unsigned seq = 0;  /* continuation bytes still expected */
    wchar_t *out;

    ensureCapacity(offset + len);
    out = D.line + offset;

    /* http://en.wikipedia.org/wiki/UTF-8#Description */
    /* Invalid sequences are simply ignored. */
    while (len--) {
        unsigned char ch = *str++;

        if (unlikely(!ch)) {
            seq = 0;
        } else if (likely(ch < 0x80)) {
            seq = 0;
            *out++ = ch;
        } else if ((ch & 0xC0) == 0x80) {
            /* Continuation byte */
            if (unlikely(!seq)) continue;
            val = (val << 6) | (ch & 0x3f);
            if (!--seq && (val < 0xD800 || val > 0xDFFF)
             && val <= (uint_least32_t)WCHAR_MAX - WCHAR_MIN) {
                *out++ = val;  /* store the character and advance */
            }
        } else if (unlikely(ch == 0xC0 || ch == 0xC1 || ch >= 0xF5)) {
            seq = 0;
        } else if ((ch & 0xE0) == 0xC0) {
            seq = 1;
            val = ch & 0x1F;  /* low five bits of a two-byte lead */
        } else if ((ch & 0xF0) == 0xE0) {
            seq = 2;
            val = ch & 0xF;
        } else if ((ch & 0xF8) == 0xF0) {
            seq = 3;
            val = ch & 0x7;
        }
    }

    return out - D.line;
}
#v-

Converting multibyte string to a wide character string ignoring invalid
sequences:

#v+
static const wchar_t *wideFromMulti(const char *str) {
    size_t len, capacity, pos;
    wchar_t *buf = NULL;
    mbstate_t ps;

    pos = 0;
    len = strlen(str) + 1;  /* include the NUL so mbrtowc() can see it */
    memset(&ps, 0, sizeof ps);
    capacity = 16;
    goto realloc;

    for (;;) {
        size_t ret = mbrtowc(buf + pos, str, len, &ps);

        if (ret == (size_t)-1) {
            /* EILSEQ: state is undefined, reset it and skip one byte */
            memset(&ps, 0, sizeof ps);
            ++str;
            --len;
        } else if (ret == (size_t)-2) {
            /* Incomplete sequence at end of input, stop here */
            buf[pos] = L'\0';
            return buf;
        } else if (ret) {
            /* Consumed ret bytes */
            str += ret;
            len -= ret;
            if (++pos < capacity) continue;

            capacity *= 2;
    realloc:
            buf = realloc(buf, capacity * sizeof *buf);
            pdie_on(!buf, "realloc");
        } else {
            /* Got NUL; mbrtowc() has stored the terminator */
            return buf;
        }
    }
}
#v-

The whole code at:
<URL:http://github.com/mina86/tinyapps/blob/master/mpd-show.c>

christian.bau

Mar 7, 2010, 2:32:43 PM

That code isn't filtering out 3 and 4 byte sequences that could be
represented in a shorter way, which is one major cause of software
vulnerabilities, and it doesn't filter values > 0x10ffff if WCHAR_MAX
is large (like 32 bit). And I'm not quite sure why you compare against
WCHAR_MAX - WCHAR_MIN.
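A stricter decoder addressing both points could look like the following sketch. It is not the code from the post: utf8_decode_strict() is a hypothetical name, and it decodes one sequence at a time, rejecting overlong forms, surrogates and values above 0x10FFFF (which a lead byte of 0xF4 can still produce, e.g. F4 90 80 80).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one UTF-8 sequence from s, of which n
 * bytes are available. Stores the code point in *cp and returns the
 * number of bytes consumed (1..4), or 0 if the sequence is invalid
 * or truncated. */
static size_t utf8_decode_strict(const unsigned char *s, size_t n,
                                 uint_least32_t *cp)
{
    /* Smallest code point that legitimately needs len bytes. */
    static const uint_least32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint_least32_t val;
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)                { len = 1; val = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; val = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; val = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; val = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or 0xF8..0xFF */

    if (n < len)
        return 0;                   /* truncated input */
    for (i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        val = (val << 6) | (s[i] & 0x3F);
    }
    if (val < min_cp[len])
        return 0;                   /* overlong form */
    if (val >= 0xD800 && val <= 0xDFFF)
        return 0;                   /* UTF-16 surrogate */
    if (val > 0x10FFFF)
        return 0;                   /* outside the Unicode range */

    *cp = val;
    return len;
}
```

On a return value of 0 a caller would typically skip one byte and try again, which matches the "ignore invalid sequences" policy of the posted code.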

Ersek, Laszlo

Mar 7, 2010, 2:35:08 PM

In article <87sk8ci...@erwin.mina86.com>,
Michal Nazarewicz <min...@tlen.pl> writes:

> static void _out(wchar_t ch) {
>     char buf[MB_CUR_MAX];
>     int ret;
>
> retry:
>     ret = wctomb(buf, ch);
>     if (ret > 0) {
>         printf("%.*s", ret, buf);
>     } else if (ch != L'?') {
>         ch = L'?';
>         goto retry;
>     } else {
>         putchar('?');
>     }
> }

L'?' can always be represented in the execution character set. '?' is a
member of both the source and the execution basic character set (C99
5.2.1.2p1, 6.4.4.4p11, 7.17p2). Thus if wctomb() fails above,

ch != L'?'

will always hold. What's more, you don't need to convert L'?' with an
explicit call to wctomb(), since it will store a single '?' character
anyway. The members of the basic character sets are locale-independent.

Hence, the middle section and the label are superfluous.
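With the retry branch removed, the conversion step reduces to something like this sketch. It writes into a caller-supplied buffer rather than calling printf(), purely so that the result is easy to inspect; wide_char_to_bytes() is a hypothetical name.

```c
#include <limits.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts ch to its multibyte form in buf, which
 * must have room for MB_LEN_MAX bytes. If ch has no representation in
 * the current locale, a single '?' is stored instead.
 * Returns the number of bytes stored. */
static size_t wide_char_to_bytes(wchar_t ch, char *buf)
{
    int ret = wctomb(buf, ch);

    if (ret > 0)
        return (size_t)ret;
    buf[0] = '?';   /* '?' is in the basic execution character set */
    return 1;
}
```

In the "C" locale that a program starts in, an ASCII character comes back unchanged while something like U+20AC should come back as '?'.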

(I didn't look at other parts of your code.)

Cheers,
lacos

Michal Nazarewicz

Mar 8, 2010, 6:14:12 PM

"christian.bau" <christ...@cbau.wanadoo.co.uk> writes:

> That code isn't filtering out 3 and 4 byte sequences that could be
> represented in a shorter way,

It's not really an issue for me (I think). The string is not
interpreted in any way so security is not a concern. After converting
to a string of wide characters it's printed to standard output using
wctomb() so everything should be fine.

Still, I'll add a comment pointing that out.

> and it doesn't filter values > 0x10ffff if WCHAR_MAX is large (like 32
> bit).

It does that by treating bytes >= 0xF5 in the input UTF-8 string as
invalid. Refer to
<URL:http://en.wikipedia.org/wiki/UTF-8#Description>.

> And I'm not quite sure why you compare against WCHAR_MAX - WCHAR_MIN.

wchar_t can be signed so (WCHAR_MAX - WCHAR_MIN + 1) is the number of
values it can represent therefore if value <= (WCHAR_MAX - WCHAR_MIN) it
can be represented in wchar_t.
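An alternative reading of the range question, offered only as a sketch: code points are never negative, so when wchar_t is signed the negative half of its range is of no use for storing them, and comparing against WCHAR_MAX alone may be the simpler test. code_point_fits_wchar() is a hypothetical name.

```c
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper: nonzero if cp can be stored in a wchar_t as a
 * non-negative value. Both sides are widened to uintmax_t so the
 * comparison is safe whether wchar_t is signed or unsigned. */
static int code_point_fits_wchar(uint_least32_t cp)
{
    return (uintmax_t)cp <= (uintmax_t)WCHAR_MAX;
}
```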

Michal Nazarewicz

Mar 8, 2010, 6:15:57 PM

> In article <87sk8ci...@erwin.mina86.com>,
> Michal Nazarewicz <min...@tlen.pl> writes:
>> static void _out(wchar_t ch) {
>>     char buf[MB_CUR_MAX];
>>     int ret;
>>
>> retry:
>>     ret = wctomb(buf, ch);
>>     if (ret > 0) {
>>         printf("%.*s", ret, buf);
>>     } else if (ch != L'?') {
>>         ch = L'?';
>>         goto retry;
>>     } else {
>>         putchar('?');
>>     }
>> }

la...@ludens.elte.hu (Ersek, Laszlo) writes:
> L'?' can always be represented in the execution character set. [...]

> Hence, the middle section and the label are superfluous.

It's my first time using wchar_t, so I wasn't quite sure how things work
in the "wide world". Thanks for pointing that out.
