I am facing a situation where I need to handle UTF-8 input along with
input from standard input (i.e. locale-dependent multibyte). In the end,
after some computations, concatenations, etc., I need to output it to
standard output (again locale-dependent multibyte).
What I want to do is convert both the UTF-8 input as well as data from
standard input to an array of wchar_t and then output it using wprintf()
(or one of the other "wide" functions).
It is my understanding, however, that an implementation may choose any
encoding for wchar_t, which means that I cannot simply assume it
stores Unicode code points.
On the other hand, is my understanding correct that if the
__STDC_ISO_10646__ macro is defined then wchar_t does in fact represent
Unicode code points? If so, I could check for that macro and signal
#error if it's not defined, right?
Also, what happens when I pass wprintf() a string which contains a wide
character which has no representation in the current locale (e.g. some funky
Unicode character where the locale is set to the ISO-8859-1 encoding)? Can
I somehow instruct the standard library function to print, say,
a question mark in such situations or do I have to handle such cases by
myself?
--
Best regards, _ _
.o. | Liege of Serenly Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +--<mina86*tlen.pl>--<jid:mina86*jabber.org>--ooO--(_)--Ooo--
Yes, but note also that both 16-bit and 32-bit Unicode wchar_t
implementations are in use (the former on Windows and the latter
on some *nix systems at least).
(Don't know about wprintf() and unrepresentable characters offhand.)
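
As a rough sketch of the compile-time check discussed above, the following
refuses to build when __STDC_ISO_10646__ is missing. The helper name
wchar_holds_all_unicode is hypothetical (it does not come from this thread);
it reflects the caveat that a 16-bit wchar_t cannot hold every code point in
a single element.

```c
#include <wchar.h>   /* WCHAR_MAX */

#if !defined(__STDC_ISO_10646__)
#error "wchar_t is not guaranteed to hold ISO 10646 (Unicode) values here"
#endif

/* Hypothetical helper: returns 1 when a single wchar_t can hold any code
 * point up to U+10FFFF, and 0 otherwise (e.g. with a 16-bit wchar_t). */
static int wchar_holds_all_unicode(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

A program that needs characters outside the Basic Multilingual Plane could
call the helper once at start-up and bail out if it returns 0.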
--
Mikko Rauhala <m...@iki.fi> - http://www.iki.fi/mjr/blog/
The Finnish Pirate Party - http://piraattipuolue.fi/
World Transhumanist Association - http://transhumanism.org/
Singularity Institute - http://singinst.org/
Handle stdin and stdout like you intend, i.e. with setlocale() and the
implicit conversion provided by <stdio.h> functions.
For the UTF-8 input coming from elsewhere: if you can stick with glibc,
just call
#include <iconv.h>
convdesc = iconv_open("WCHAR_T", "UTF-8");
http://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html
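
A minimal sketch of how that descriptor might be used, assuming an iconv
implementation (such as glibc's) that accepts the "WCHAR_T" name. The
function name utf8_to_wide and its buffer convention are made up for
illustration only.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts the NUL-terminated UTF-8 string `in` into
 * `out`, which has room for out_cap wide characters.  Returns the number of
 * wide characters written (no terminator is added), or (size_t)-1 on any
 * failure. */
static size_t utf8_to_wide(const char *in, wchar_t *out, size_t out_cap)
{
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;          /* "WCHAR_T" not supported by this iconv */

    char *src = (char *)in;         /* iconv() takes char **, input is only read */
    size_t src_left = strlen(in);
    char *dst = (char *)out;
    size_t dst_left = out_cap * sizeof *out;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;          /* EILSEQ, EINVAL or E2BIG */

    return out_cap - dst_left / sizeof *out;
}
```

Opening and closing the descriptor on every call keeps the sketch short; real
code would probably open it once and reuse it.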
Otherwise, you'll have to switch at least the LC_CTYPE locale category
manually, and proceed with the separate input like with stdin.
Cheers,
lacos
> Also, what happens when I pass wprintf() a string which contains a wide
> character which has no representation in the current locale (e.g. some funky
> Unicode character where the locale is set to the ISO-8859-1 encoding)?
wprintf() will return a negative value [and errno will be set to EILSEQ].
> Can I somehow instruct the standard library function to print, say,
> a question mark in such situations or do I have to handle such cases by
> myself?
On second thought, you might be better off if you converted the output
with iconv() too, from WCHAR_T to the codeset used by the current
locale.
http://www.opengroup.org/onlinepubs/007908775/xsh/iconv.html
----v----
If iconv() encounters a character in the input buffer that is valid, but
for which an identical character does not exist in the target codeset,
iconv() performs an implementation-dependent conversion on this
character.
----^----
(You would have to test this.)
You should be able to get the codeset used by the current locale by
calling
nl_langinfo(CODESET)
http://www.opengroup.org/onlinepubs/007908775/xsh/nl_langinfo.html
(Sorry for being glibc/SUSv2-specific.)
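
Because the behaviour for unrepresentable characters is
implementation-dependent, one option is to handle the failure case
explicitly. The sketch below assumes an iconv that accepts "WCHAR_T" and
that reports unconvertible characters with EILSEQ; implementations that
substitute a character of their own will simply never take that branch. The
name wide_to_locale is hypothetical.

```c
#include <errno.h>
#include <iconv.h>
#include <langinfo.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: converts n wide characters to the codeset of the
 * current locale, writing at most out_cap bytes to out.  Wide characters
 * rejected with EILSEQ are replaced by '?'.  Returns the number of bytes
 * written, or (size_t)-1 on failure.  Flushing the shift state of stateful
 * encodings is omitted for brevity. */
static size_t wide_to_locale(const wchar_t *in, size_t n,
                             char *out, size_t out_cap)
{
    iconv_t cd = iconv_open(nl_langinfo(CODESET), "WCHAR_T");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in, *dst = out;
    size_t src_left = n * sizeof *in, dst_left = out_cap;

    while (src_left > 0) {
        if (iconv(cd, &src, &src_left, &dst, &dst_left) != (size_t)-1)
            break;                      /* everything converted */
        if (errno != EILSEQ || dst_left == 0) {
            iconv_close(cd);
            return (size_t)-1;          /* out of space or unexpected error */
        }
        *dst++ = '?';                   /* substitute, then skip one wchar_t */
        --dst_left;
        src += sizeof *in;
        src_left -= sizeof *in;
    }
    iconv_close(cd);
    return out_cap - dst_left;
}
```

Call setlocale(LC_CTYPE, "") beforehand so that nl_langinfo(CODESET) reports
the user's codeset rather than that of the "C" locale.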
Cheers,
lacos
la...@ludens.elte.hu (Ersek, Laszlo) writes:
> wprintf() will return a negative value [and errno will be set to EILSEQ].
>> Can I somehow instruct the standard library function to print, say,
>> a question mark in such situations or do I have to handle such cases by
>> myself?
>
> On second thought, you might be better off if you converted the output
> with iconv() too, from WCHAR_T to the codeset used by the current
> locale.
[...]
> (Sorry for being glibc/SUSv2-specific.)
Thanks for all the links and information. I had been considering
iconv() but hadn't noticed that it can convert to and from wchar_t as
well, which was my biggest concern. I'll be sure to look into it
more.
Being SUS-specific is not a big issue since perfect portability is not
my goal (i.e. what Mikko Rauhala confirmed earlier about
__STDC_ISO_10646__ (thanks Mikko!) was enough for me, as I imagine
"major" implementations use Unicode as the internal representation of
wchar_t); however, depending on glibc may hurt me a bit as my code won't
quite work on, say, BSD then.
Anyway, thanks for all the comments and links!
> For the UTF-8 input coming from elsewhere: if you can stick with glibc,
> just call
>
> #include <iconv.h>
>
> convdesc = iconv_open("WCHAR_T", "UTF-8");
Or you can write your own UTF-8 encoder/decoder. Personally, I wouldn't
make iconv a requirement just for UTF-8. If you're going to be using iconv
anyhow, then you may as well use it for this as well.
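
For a sense of scale, the encoding direction fits in a few lines. This is
only an illustrative sketch; utf8_encode is a made-up name, not a function
from any library mentioned in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes code point cp as UTF-8 into out[] and
 * returns the number of bytes written (1 to 4), or 0 if cp is a surrogate
 * or lies above U+10FFFF. */
static size_t utf8_encode(uint_least32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}
```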
I found sqlite3_unicode a great help when doing that - lots of stuff to
start you off and no copyright issues:
You can find it at:
<URL:http://ioannis.mpsounds.net/blog/2009/01/11/sqlite3_unicode-updated-for-sqlite3-v367/>
(it's not that ioannis btw).
--
Online waterways route planner | http://canalplan.eu
Plan trips, see photos, check facilities | http://canalplan.org.uk
To clarify further, it's not necessarily able to do so. The GNU
implementation does support it, but more generally, available
iconv sources/targets are implementation-defined.
> however depending on glibc may hurt me a bit as my code won't
> quite work on, say, BSD then.
Indeed I'm not sure if WCHAR_T is available for iconv there.
You can probably use GNU libiconv (under LGPL) there too if you like,
though.
(Yeah, getting Unixy, sorry about that; if one continues further,
probably better to move to comp.unix.programming)
I would rather not try.
Invalid byte sequences (overlong forms) have security implications:
http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
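
To illustrate what a hand-written decoder has to reject, here is a sketch of
a strict single-character decoder. It is an assumption about what "strict"
should mean (no overlong forms, no surrogates, nothing above U+10FFFF), and
utf8_decode_strict is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one UTF-8 sequence from s[0..len-1] into *cp.
 * Returns the number of bytes consumed (1 to 4), or 0 if the sequence is
 * truncated, malformed, overlong, a surrogate, or above U+10FFFF. */
static size_t utf8_decode_strict(const unsigned char *s, size_t len,
                                 uint_least32_t *cp)
{
    /* Smallest code point that legitimately needs a sequence of this length. */
    static const uint_least32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint_least32_t val;
    size_t need, i;

    if (len == 0)
        return 0;
    if (s[0] < 0x80)                { val = s[0];        need = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { val = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { val = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { val = s[0] & 0x07; need = 4; }
    else
        return 0;                   /* stray continuation or invalid lead byte */

    if (len < need)
        return 0;                   /* truncated sequence */
    for (i = 1; i < need; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        val = (val << 6) | (s[i] & 0x3F);
    }
    if (val < min_for_len[need])
        return 0;                   /* overlong form */
    if (val >= 0xD800 && val <= 0xDFFF)
        return 0;                   /* surrogate */
    if (val > 0x10FFFF)
        return 0;                   /* beyond the Unicode range */
    *cp = val;
    return need;
}
```

The overlong check is the part with security relevance: without it,
"\xC0\xAF" would decode to '/' and could slip past a filter that only looks
for the single-byte form.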
--o--
I would have expected iconv() to support some kind of normalization too:
Combining characters (when already in UCS/wchar_t):
http://www.cl.cam.ac.uk/~mgk25/unicode.html#comb
Normalizing to NFC (when already in UCS/wchar_t -- "NFC is the preferred
form for Linux and WWW"):
http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf
But it seems this is not a reasonable expectation:
(2001) http://sources.redhat.com/ml/libc-alpha/2001-09/msg00170.html
(2004) http://sourceware.org/ml/libc-alpha/2004-01/msg00287.html
(2009) http://www.mail-archive.com/bug-co...@gnu.org/msg15501.html
Cheers,
lacos
Thanks everyone for the comments. I managed to come up with code that seems
to work. I used the standard C approach with the requirement that
__STDC_ISO_10646__ is defined (i.e. wchar_t holds Unicode code points).
If anyone is interested in the code, here it is:
Initialisation:
#v+
static void initCodesets(void) {
    setlocale(LC_CTYPE, "");
}
#v-
Printing a wide-character string replacing invalid characters by '?':
#v+
static void _out(wchar_t ch) {
    char buf[MB_CUR_MAX];
    int ret;

retry:
    ret = wctomb(buf, ch);
    if (ret > 0) {
        printf("%.*s", ret, buf);
    } else if (ch != L'?') {
        ch = L'?';
        goto retry;
    } else {
        putchar('?');
    }
}

static void _outs(const wchar_t *str, size_t len) {
    wctomb(NULL, 0);    /* reset the shift state */
    for (; len; --len, ++str) {
        _out(*str);
    }
}
#v-
Converting UTF-8 to wchar_t ignoring invalid or unsupported sequences:
#v+
static size_t appendUTF(size_t offset, const char *str, size_t len) {
    uint_least32_t val = 0;
    unsigned seq = 0;   /* continuation bytes still expected */
    wchar_t *out;

    ensureCapacity(offset + len);
    out = D.line + offset;

    /* http://en.wikipedia.org/wiki/UTF-8#Description */
    /* Invalid sequences are simply ignored. */
    while (len--) {
        unsigned char ch = *str++;
        if (unlikely(!ch)) {
            seq = 0;
        } else if (likely(ch < 0x80)) {
            seq = 0;
            *out++ = ch;
        } else if ((ch & 0xC0) == 0x80) {
            /* Continuation byte */
            if (unlikely(!seq)) continue;
            val = (val << 6) | (ch & 0x3f);
            if (!--seq && (val < 0xD800 || val > 0xDFFF)
             && val <= (uint_least32_t)WCHAR_MAX - WCHAR_MIN) {
                *out++ = val;   /* sequence complete: store and advance */
            }
        } else if (unlikely(ch == 0xC0 || ch == 0xC1 || ch >= 0xF5)) {
            seq = 0;
        } else if ((ch & 0xE0) == 0xC0) {
            /* 110xxxxx: two-byte sequence, five payload bits */
            seq = 1;
            val = ch & 0x1F;
        } else if ((ch & 0xF0) == 0xE0) {
            /* 1110xxxx: three-byte sequence, four payload bits */
            seq = 2;
            val = ch & 0xF;
        } else if ((ch & 0xF8) == 0xF0) {
            /* 11110xxx: four-byte sequence, three payload bits */
            seq = 3;
            val = ch & 0x7;
        }
    }

    return out - D.line;
}
#v-
Converting multibyte string to a wide character string ignoring invalid
sequences:
#v+
static const wchar_t *wideFromMulti(const char *str) {
    size_t len, capacity, pos;
    wchar_t *buf = NULL;
    mbstate_t ps;

    pos = 0;
    /* Include the terminating NUL so that mbrtowc() can return 0 */
    len = strlen(str) + 1;
    memset(&ps, 0, sizeof ps);
    capacity = 16;
    goto realloc;

    for (;;) {
        size_t ret = mbrtowc(buf + pos, str, len, &ps);
        if (ret == (size_t)-1) {
            /* EILSEQ, try skipping single byte; the conversion
               state is unspecified now, so reset it */
            ++str;
            --len;
            memset(&ps, 0, sizeof ps);
        } else if (ret == (size_t)-2) {
            /* Input ended inside a sequence, terminate here */
            buf[pos] = L'\0';
            return buf;
        } else if (ret) {
            /* Consumed ret bytes */
            str += ret;
            len -= ret;
            if (++pos < capacity) continue;
            capacity *= 2;
realloc:
            buf = realloc(buf, capacity * sizeof *buf);
            pdie_on(!buf, "malloc");
        } else {
            /* Got NUL */
            return buf;
        }
    }
}
#v-
The whole code at:
<URL:http://github.com/mina86/tinyapps/blob/master/mpd-show.c>
> static void _out(wchar_t ch) {
> char buf[MB_CUR_MAX];
> int ret;
>
> retry:
> ret = wctomb(buf, ch);
> if (ret > 0) {
> printf("%.*s", ret, buf);
> } else if (ch != L'?') {
> ch = L'?';
> goto retry;
> } else {
> putchar('?');
> }
> }
L'?' can always be represented in the execution character set. '?' is a
member of both the source and the execution basic character set (C99
5.2.1.2p1, 6.4.4.4p11, 7.17p2). Thus if wctomb() fails above,
ch != L'?'
will always hold. What's more, you don't need to convert L'?' with an
explicit call to wctomb(), since it will store a single '?' character
anyway. The members of the basic character sets are locale-independent.
Hence, the middle section and the label are superfluous.
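
Following that reasoning, the fallback can be reduced to a single branch. The
sketch below is a variation rather than the original function: it writes into
a caller-supplied buffer instead of printing, uses MB_LEN_MAX instead of a
variable-length array, and the name wide_char_to_bytes is hypothetical.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stdlib.h>   /* wctomb */
#include <wchar.h>

/* Hypothetical helper: stores the multibyte form of ch in buf and returns
 * its length in bytes.  If ch cannot be represented in the current locale,
 * stores a single '?' instead, which never needs converting because it
 * belongs to the basic execution character set. */
static int wide_char_to_bytes(wchar_t ch, char buf[MB_LEN_MAX])
{
    int ret = wctomb(buf, ch);
    if (ret > 0)
        return ret;
    buf[0] = '?';
    return 1;
}
```

The caller can then emit the bytes with fwrite(buf, 1, len, stdout).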
(I didn't look at other parts of your code.)
Cheers,
lacos
> That code isn't filtering out 3 and 4 byte sequences that could be
> represented in a shorter way,
It's not really an issue for me (I think). The string is not
interpreted in any way so security is not a concern. After converting
to a string of wide characters it's printed to standard output using
wctomb() so everything should be fine.
Still, I'll add a comment pointing that out.
> and it doesn't filter values > 0x10ffff if WCHAR_MAX is large (like 32
> bit).
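It does most of that by treating bytes >= 0xF5 in the input UTF-8 string as
invalid (a 0xF4 lead byte followed by a continuation byte >= 0x90 can
still yield a value above 0x10FFFF). Refer to
<URL:http://en.wikipedia.org/wiki/UTF-8#Description>.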
> And I'm not quite sure why you compare against WCHAR_MAX - WCHAR_MIN.
wchar_t can be signed, so (WCHAR_MAX - WCHAR_MIN + 1) is the number of
values it can represent; therefore, if value <= (WCHAR_MAX - WCHAR_MIN), it
can be represented in wchar_t.
la...@ludens.elte.hu (Ersek, Laszlo) writes:
> L'?' can always be represented in the execution character set. [...]
> Hence, the middle section and the label are superfluous.
It's my first time using wchar_t so I wasn't quite sure how things work
in the "wide world". Thanks for pointing that out.