UTF-16 as the internal encoding

Christian Kruse

Jul 21, 2010, 8:13:08 PM

to cforu...@googlegroups.com

Hi there.

I'm referring to our IRC discussion. We decided to use UTF-16 as the
internal charset. Well, after thinking about it again I'm unsure whether
this is a good idea. UTF-16 support in C is de facto nonexistent. You
can't even use string literals encoded in UTF-16 in C, which somewhat
sucks. Why couldn't we use UTF-8? It is at least backwards compatible
and can be used with every C compiler. I can understand that you don't
really want to work with UTF-8 encoded strings by hand, e.g. taking
substrings of them or something like that. Me neither.

But we (I?) could write a small library implementing the most
important functions, e.g. substr(), strlen(), strcasecmp(), etc.
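
To make the idea concrete, here is a minimal sketch of what two of these functions could look like for UTF-8. The names u8_strlen(), u8_offset() and u8_substr() are hypothetical and not taken from any existing library. The code counts code points by skipping continuation bytes (those of the form 10xxxxxx) and assumes the input is already valid UTF-8; it performs no validation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: number of code points in a NUL-terminated
 * UTF-8 string. Assumes valid UTF-8; every byte that is not a
 * continuation byte (10xxxxxx) starts a new code point. */
static size_t u8_strlen(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: pointer to the first byte of code point number
 * `index`, or to the terminating NUL if the string is shorter. */
static const char *u8_offset(const char *s, size_t index)
{
    assert(s != NULL);
    while (*s != '\0') {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            if (index == 0)
                break;
            index--;
        }
        s++;
    }
    return s;
}

/* Hypothetical helper: newly allocated copy of up to `count` code
 * points starting at code point `start`. The caller frees the result.
 * Returns NULL if the allocation fails. */
static char *u8_substr(const char *s, size_t start, size_t count)
{
    const char *begin = u8_offset(s, start);
    const char *end = u8_offset(begin, count);
    size_t bytes = (size_t)(end - begin);
    char *out = malloc(bytes + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, begin, bytes);
    out[bytes] = '\0';
    return out;
}
```

With these, u8_strlen() on the 7-byte string "Grüße" gives 5, and u8_substr(s, 2, 2) gives "üß" without splitting a multi-byte sequence. strcasecmp() is left out on purpose: case-insensitive comparison needs Unicode case folding tables, which is considerably more work than index arithmetic.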

What do you think?

Greetings,
CK

Alexander Nitsch

Jul 22, 2010, 3:58:31 PM

to cforu...@googlegroups.com

Hi.

> We decided to use UTF-16 as the internal charset. Well, after thinking
> about it again I'm unsure whether this is a good idea. UTF-16 support in
> C is de facto nonexistent.

But it is ICU's standard encoding. I thought that was the whole point.

> You can't even use string literals encoded in UTF-16 in C

That's true. The question is: Do we need to? ICU provides macros for
Unicode string literals. They only work for what they call "invariant
characters" (Latin letters, digits, and some punctuation) though ...

> Why couldn't we use UTF-8? It is at least backwards compatible and can
> be used with every C compiler.

I'd say: drop backward compatibility if a new (hopefully better) solution
in this rewrite requires it. I wouldn't stick to UTF-8 only to keep
backward compatibility -- which is not the only reason for your
suggestion, I know, but you get the point.

What else would make UTF-8 the better choice?

> I can understand that you don't really want to work with UTF-8 encoded
> strings by hand, e.g. taking substrings of them or something like that.
> Me neither.

Yes, no doubt about that. We definitely need some lib for Unicode
handling, otherwise it will be the same mess as in the current version
again.

> But we (I?) could write a small library implementing the most important
> functions, e.g. substr(), strlen(), strcasecmp(), etc.

Personally, I'd rather give ICU (and UTF-16) a shot, simply because much
of the work that is Unicode handling has _already_ been invested and
produced this working tool.

--

Alex

Christian Seiler

Jul 22, 2010, 5:55:18 PM

to cforu...@googlegroups.com

Hi,

> That's true. The question is: Do we need to? ICU provides macros for
> Unicode string literals. They only work for what they call "invariant
> characters" (Latin letters, digits, and some punctuation) though ...

But that's all we need for string literals in the source code: string
literals in C source only contain standard ASCII characters, which are
handled by these ICU macros.
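
The reason ASCII-only literals are cheap to support can be sketched in plain C. This is not ICU's macro or its implementation; ascii_to_utf16() is a hypothetical helper, and the sketch assumes an ASCII-compatible execution character set. For code points up to 0x7F the UTF-16 code unit has the same numeric value as the ASCII byte, so the conversion is a simple widening copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ICU: widen an ASCII-only C string
 * into 16-bit UTF-16 code units, including a terminating 0 unit.
 * `cap` is the capacity of `dst` in code units.
 * Returns the number of units written (excluding the terminator), or
 * (size_t)-1 if a non-ASCII byte is found or the buffer is too small. */
static size_t ascii_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    size_t i = 0;
    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return (size_t)-1;
    for (; src[i] != '\0'; i++) {
        unsigned char c = (unsigned char)src[i];
        /* Reject anything outside ASCII and keep room for the terminator. */
        if (c > 0x7F || i + 1 >= cap)
            return (size_t)-1;
        dst[i] = c; /* ASCII value equals the UTF-16 code unit value */
    }
    dst[i] = 0;
    return i;
}
```

Anything beyond ASCII (umlauts, for instance) is rejected here, which mirrors the restriction described above: such text would have to come from somewhere other than a literal in the C source.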

> Personally, I'd rather give ICU (and UTF-16) a shot, simply because much
> of the work that is Unicode handling has _already_ been invested and
> produced this working tool.