
Should C++0x contain a distinct type for UTF-8?


Martin B.

Aug 22, 2010, 4:15:05 PM
Hi, I posted this question to c.std.c++ a few weeks back but failed
to get any useful response.
So I thought I'd try a second time:

Should C++0x contain a distinct type for UTF-8?

Current draft N3092 specifies:
+ char16_t* for UTF-16
+ char32_t* for UTF-32
+ char* for execution narrow-character set
+ wchar_t* for execution wide-character set
+ unsigned char*, possibly for raw data buffers etc.

a) Wouldn't it make sense to have a char8_t where char8_t arrays would
hold UTF-8 character sequences exclusively?

b) What is the rationale for not including it?

cheers,
Martin

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Goran Pusic

Aug 23, 2010, 4:23:18 AM
On Aug 22, 10:15 pm, "Martin B." <0xCDCDC...@gmx.at> wrote:
> Hi I have posted this question to c.std.c++ a few weeks back but failed
> to get any useful response.
> So I thought I'd try a second time:
>
> Should C++0x contain a distinct type for UTF-8?
>
> Current draft N3092 specifies:
> + char16_t* for UTF-16
> + char32_t* for UTF-32
> + char* for execution narrow-character set
> + wchar_t* for execution wide-character set
> + unsigned char*, possibly for raw data buffers etc.
>
> a) Wouldn't it make sense to have a char8_t where char8_t arrays would
> hold UTF-8 character sequences exclusively?
>
> b) What is the rationale for not including it?

I would _guess_ that char8_t would be equivalent to char, at which
point you'd lose any use - either people would cast from one to
the other freely (bad), or it would really be typedef'd from char
(bad, again).

In my mind, wchar_t is quite OK to represent Unicode-encoded data on
the given platform, as wchar_t is supposed to represent platform-
specific encoded text. So if the platform uses UTF-8, wchar_t should be
used to represent UTF-8 text. Yes, that means that there are
surrogates ( a lot of them :-) ), and that "one datum != one unicode
code point". But that is already the case when the platform uses UTF-8 or
UTF-16 (a __vast__ majority of all platforms, and UTF-32 is mighty
wasteful for any platform), so there's not much harm done, really.

The problem, in my mind, is that gcc picks a 32-bit datum for systems
that actually use utf-8 (e.g. linux). Why? On Windows, encoding is
UTF-16, and gcc's wchar_t is 16-bit. It would have been quite OK if
wchar_t meant UTF-8 under e.g. linux.

I would say that the gcc people came around to this realization too late,
when wchar_t was already in use, and so gcc on Windows could employ the
"correct" logic, while the others are stuck with UTF-32 for legacy reasons.

So, in my mind, it boils down to this: char* _is_ UTF-8 already, use
of wchar_t is somewhat broken, and the standard is fine :-).

Goran.

Martin B.

Aug 23, 2010, 1:57:30 PM
Goran Pusic wrote:
> On Aug 22, 10:15 pm, "Martin B." <0xCDCDC...@gmx.at> wrote:
>> Hi I have posted this question to c.std.c++ a few weeks back but failed
>> to get any useful response.
>> So I thought I'd try a second time:
>>
>> Should C++0x contain a distinct type for UTF-8?
>>
>> Current draft N3092 specifies:
>> + char16_t* for UTF-16
>> + char32_t* for UTF-32
>> + char* for execution narrow-character set
>> + wchar_t* for execution wide-character set
>> + unsigned char*, possibly for raw data buffers etc.
>>
>> a) Wouldn't it make sense to have a char8_t where char8_t arrays would
>> hold UTF-8 character sequences exclusively?
>>
>> b) What is the rationale for not including it?
>
> I would _guess_ that char8_t would be equivalent to char, at which

I hope it would not. char8_t would be as different from char as e.g.
int16_t is different from char16_t or as different as wchar_t would be
from char16_t. [*]

> point you'd lose any use - either people would cast from one to
> another freely (bad), either it would really be typedef-ed from char
> (bad, again).
>

There would be a choice and if the standard mandates a distinct built-in
type then a typedef doesn't cut it. (or does it?)

> In my mind, wchar_t is quite OK to represent Unicode-encoded data on
> the given platform, as wchar_t is supposed to represent platform-
> specific encoded text. So if platform uses UTF-8, wchar_t should be

> used to represent UTF-8 text. Yes, that means that there are [...]

I thought the whole point of char16_t/char32_t (and possibly char8_t)
was to get rid of platform-dependency and know beforehand what kind of
encoding to expect!

So in light of your reply the question really seems to be: If char16_t
(UTF-16) and char32_t (UTF-32) are deemed valuable to have around, why
was char8_t (UTF-8) not included in C++0x and should it be included to
facilitate type-safety for UTF-8 encoded strings?

cheers,
Martin

[*] Note: Visual Studio 2010 currently implements char16_t as a typedef
to wchar_t. As far as I know this would be non-conforming as char16_t is
supposed to be a distinct type from wchar_t.
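
A minimal sketch of why a typedef wouldn't cut it for overloading
(assuming a conforming compiler where char16_t really is a distinct
built-in type):

#include <iostream>

// Two different overloads only because char16_t is a distinct type.
// If char16_t were merely a typedef for wchar_t (as in the VS2010 case
// above), the second declaration would just redeclare the first.
void print(const wchar_t*)  { std::cout << "wide execution encoding\n"; }
void print(const char16_t*) { std::cout << "UTF-16\n"; }

int main()
{
    print(L"hello");   // picks the wchar_t overload
    print(u"hello");   // picks the char16_t overload
}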

Joshua Maurice

Aug 23, 2010, 8:02:34 PM
On Aug 23, 1:23 am, Goran Pusic <gor...@cse-semaphore.com> wrote:
> In my mind, wchar_t is quite OK to represent Unicode-encoded data on
> the given platform, as wchar_t is supposed to represent platform-
> specific encoded text. So if platform uses UTF-8, wchar_t should be
> used to represent UTF-8 text. Yes, that means that there are
> surrogates ( a lot of them :-) ), and that "one datum!=one unicode
> code point".

Technically, the term "surrogate pair" refers only to UTF-16 when
encoding Unicode code points beyond the BMP (Basic Multilingual
Plane), aka those requiring 2 UTF-16 encoding units. The problem of
"surrogate pairs", or more generally variable width encoding schemes,
is not the only problem. For correct manipulation of (modern) Unicode
strings, you also need to deal with grapheme clusters: a sequence of
encoded Unicode code points which represents a single "user-perceived
character". Example: the Unicode code point Latin small letter e followed
by the Unicode code point combining acute accent, aka é. This is 1
"user-perceived character", represented by 2 Unicode code points,
encoded with 3 encoding units in UTF-8. (Technically, I typed out the
precomposed form only because I don't know offhand how to use my
current input devices to do the combining character form. I am a
stupid English speaker.)
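
For illustration, a small sketch of the byte counts involved (the byte
values follow from the UTF-8 definition):

#include <cstdio>
#include <cstring>

int main()
{
    // Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE
    //   -> 1 code point, 2 UTF-8 code units.
    const char precomposed[] = "\xC3\xA9";

    // Decomposed form: U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT
    //   -> 2 code points, 1 + 2 = 3 UTF-8 code units,
    //      still one "user-perceived character" (one grapheme cluster).
    const char decomposed[] = "\x65\xCC\x81";

    std::printf("precomposed: %u bytes, decomposed: %u bytes\n",
                (unsigned)std::strlen(precomposed),
                (unsigned)std::strlen(decomposed));
}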

> But that is already the case when platform uses UTF-8 and
> UTF-16 (a __vast__ majority of all platforms, and UTF-32 is might
> wasteful for any platform), so there's not much harm done, really.
>
> The problem, in my mind, is that gcc picks a 32-bit datum for systems
> that actually use utf-8 (e.g. linux). Why? On Windows, encoding is
> UTF-16, and gcc's wchar_t is 16-bit. It would have been quite OK if
> wchar_t meant UTF-8 under e.g. linux.
>
> I would say that gcc people came around to this realization too late,
> when wchar_t was already in use, and so, gcc on windows could employ
> "correct" logic, the other being stuck with UTF-32 for legacy reasons.
>
> So, in my mind, it boils down to this: char* _is_ UTF-8 already, use
> of wchar_t is somewhat broken, and standard is fine :-).

For the record, here's a sample program I whipped up and ran on some
systems available to me.

// Prints CHAR_BIT (bits per byte) and sizeof(wchar_t) in bytes.
#include <limits.h>
#include <iostream>
using namespace std;
int main() { cout << CHAR_BIT << " " << sizeof(wchar_t) << endl; }

Windows, under most compilers, as we all know, has output: 8 2
AIX 5.2
with xlC_r version 6
output: 8 2
HP-UX <hostname> B.11.23 U ia64 0622264057 unlimited-user license
with aCC: HP aC++/ANSI C B3910B A.06.05 [Jul 25 2005]
output: 8 4
Linux <hostname> 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST 2008
x86_64 x86_64 x86_64 GNU/Linux
with gcc version 4.1.2
output: 8 4
SunOS 5.8
CC: Sun WorkShop 6 update 2 C++ 5.3 2001/05/15
output: 8 4

I think you're making the claim that most kernels work in terms of
UTF8 *or* UTF16. I'm not sure offhand, but this sounds reasonable.

However, I think that you are mistaken, and that you are not familiar
with the facts of history. AFAIK, this is the basic course of events.
We started with ASCII, an 8 bit encoding of basic Latin characters.
Eventually, other people besides English speakers wanted to use
computers, so they started getting their own encodings which were not
"compatible". So, around 1990, the first Unicode standard came out
which contained a listing of code points for each letter and an
encoding for these code points, a constant width 16 bit encoding known
as UCS-2. It was called "wide char" to distinguish it from the normal
8 bit char. C90 was standardized around the same time, so they added a
wide char, wchar_t. However, Unicode shortly thereafter decided that
16 bits was not enough to give a unique code point to all existing
letters, and thus the encodings UTF-8, UTF-16, and UTF-32 were born.
Sadly, a lot of people still persist in the belief that Unicode
means UCS-2. It doesn't help that several popular languages jumped on
board immediately with UCS-2, including C to a small degree and Java
to a very large degree. So, due to various compatibility concerns and
simple ignorance, we're left in the muddled state we are in now.

What does this mean? If you're fine with using an encoding which might
break when you compile the program with one locale and run it on
another locale for some systems, and you're fine with using UTF-16 on
some OSs, and UTF-32 on other OSs, then use C++ wide strings ala
L"..." and wchar_t. It's based on the now discontinued and
insufficient UCS-2, and no one has bothered to clean it up and make it
portably useful, so I much prefer to avoid it in its entirety.

What's the conclusion? It depends on what you mean when you say you
want to "handle" Unicode.

If you just want to take some Unicode data from point A and put it to
point B, then you could use char arrays for all that matters; you're
not interpreting the data.

If you want to take some strings from point A and put it to point B,
possibly changing the encoding, then you can still use char arrays;
you're not interpreting the data in any meaningful way and are
probably using some library to do the translations.

There's also "printing to screen" or dealing with printing Unicode
characters to the terminal in a portable manner. Good luck with this
one. There's also full GUI support which is entirely out of my scope
of experience.

Then there's collation, sorting and equivalent comparisons. Here, you
need some full blown Unicode support library like ICU to do this
correctly, and the new standardized types are marginally useful at
best to accomplish this; they are in no way a replacement for ICU.
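
For a sense of what that looks like in practice, a minimal sketch assuming
ICU4C's C++ API (icu::Collator, icu::UnicodeString) is available and linked;
error handling kept to a minimum:

#include <unicode/coll.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <iostream>

int main()
{
    UErrorCode status = U_ZERO_ERROR;

    // Locale-aware comparison: the collation rules live in the library's
    // data, not in the character type used for the strings.
    icu::Collator* coll =
        icu::Collator::createInstance(icu::Locale("de", "DE"), status);
    if (U_FAILURE(status)) return 1;

    coll->setStrength(icu::Collator::SECONDARY);   // ignore case differences

    UCollationResult r = coll->compare(
        icu::UnicodeString::fromUTF8("Strasse"),
        icu::UnicodeString::fromUTF8("STRASSE"),
        status);
    std::cout << (r == UCOL_EQUAL ? "equal" : "not equal")
              << " under the de_DE collator\n";

    delete coll;
}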

Finally, there's full blown string manipulation, such as concatenating
and substringing. For that, the indexing units are usually in terms of
grapheme clusters, not Unicode code points and not encoding units, and
again you need a full blown Unicode library like ICU to do this
correctly.

Goran Pusic

Aug 24, 2010, 4:56:02 PM
On Aug 24, 2:02 am, Joshua Maurice <joshuamaur...@gmail.com> wrote:
> On Aug 23, 1:23 am, Goran Pusic <gor...@cse-semaphore.com> wrote:
>
> > In my mind, wchar_t is quite OK to represent Unicode-encoded data on
> > the given platform, as wchar_t is supposed to represent platform-
> > specific encoded text. So if platform uses UTF-8, wchar_t should be
> > used to represent UTF-8 text. Yes, that means that there are
> > surrogates ( a lot of them :-) ), and that "one datum!=one unicode
> > code point".
>
> Technically, the term "surrogate pair" refers only to UTF-16 when
> encoding Unicode code points beyond the BMP (Basic Multilingual
> Plane), aka those requiring 2 UTF-16 encoding units. The problem of
> "surrogate pairs", or more generally variable width encoding schemes,
> is not the only problem. For correct manipulation of (modern) Unicode
> strings, you also need to deal with grapheme clusters, a sequence of
> encoded Unicode code points which represent a single "user pierced
> character". Example: Unicode code point Latin Letter e followed by
> Unicode code point combining character accent acute, aka é. This is 1
> "user pierced character", represented by 2 Unicode code points,
> encoded with 3 encoding units in UTF-8. (Technically, I typed out the
> precomposed form only because I don't know offhand how to use my
> current input devices to do the combining character form. I am a
> stupid English speaker.)

Heh, thanks for that explanation, I didn't know what grapheme clusters
were - until now! To summarize, it's a way to represent some Unicode
characters as "clusters" of two or more distinct Unicode code
points, but the user sees that as one character - is this correct?

> I think you're making the claim that most kernels work in terms of
> UTF8 *or* UTF16. I'm not sure offhand, but this sounds reasonable.

Yes, although it's a bold claim; IIRC, one usually compiles Linux with
UTF-8, and Windows is on UTF-16 (was UCS-2).

> However, I think that you are mistaken, and that you are not familiar
> with the facts of history. AFAIK, this is the basic course of events.
> We started with ASCII, an 8 bit encoding of basic Latin characters.
> Eventually, other people besides English speakers wanted to use
> computers, so they started getting their own encodings which were not
> "compatible". So, around 1990, the first Unicode standard came out
> which contained a listing of code points for each letter and an
> encoding for these code points, a constant width 16 bit encoding known
> as UCS-2. It was called "wide char" to distinguish it from the normal
> 8 bit char. C90 was standardized around the same time, so they added a
> wide char, wchar_t. However, Unicode shortly thereafter decided that
> 16 bits was not enough to give a unique code point to all existing
> letters, and thus the encodings UTF-8, UTF-16, and UTF-32 were born.
> Sadly, a lot of people are still persisting in the belief that Unicode
> means UCS-2. It doesn't help that several popular languages jumped on
> board immediately with UCS-2, including C to a small degree and Java
> to a very large degree. So, due to various compatibility concerns and
> simple ignorance, we're left in this muddled state which we are now.

Well, yes, that's pretty much my understanding, too.

> What does this mean? If you're fine with using an encoding which might
> break when you compile the program with one locale and run it on
> another locale for some systems, and you're fine with using UTF-16 on
> some OSs, and UTF-32 on other OSs, then use C++ wide strings ala
> L"..." and wchar_t. It's based on the now discontinued and
> insufficient UCS-2, and no one has bothered to clean it up and make it
> portably useful, so I much prefer to avoid it in its entirety.

But wchar_t is normally fine, at least under Windows, because the system
actually uses UTF-16. Are you saying that e.g. your AIX there uses
UCS-2, and not UTF-16? I'd be surprised, at least for a current
version of it (no idea how old 5.2 is).

That's kinda why I am saying "just use wchar_t (if it actually
reflected the underlying system encoding, which is not the case for gcc on
Linux)".

> What's the conclusion? It depends on what you mean when you say you
> want to "handle" Unicode.
>
> If you just want to take some Unicode data from point A and put it to
> point B, then you could use char arrays for all that matters; you're
> not interpreting the data.
>
> If you want to take some strings from point A and put it to point B,
> possibly changing the encoding, then you can still use char arrays;
> you're not interpreting the data in any meaningful way and are
> probably using some library to do the translations.
>
> There's also "printing to screen" or dealing with printing Unicode
> characters to the terminal in a portable manner. Good luck with this
> one. There's also full GUI support which is entirely out of my scope
> of experience.
>
> Then there's collation, sorting and equivalent comparisons. Here, you
> need some full blown Unicode support library like ICU to do this
> correctly, and the new standardized types are marginally useful at
> best to accomplish this; they are in no way a replacement for ICU.
>
> Finally, there's full blown string manipulation, such as concatenating
> and substringing. For that, the indexing units are usually in terms of
> grapheme clusters, not Unicode code points and not encoding units, and
> again you need a full blown Unicode library like ICU to do this
> correctly.

+1 for that.

Goran.

Pete Becker

Aug 24, 2010, 4:59:27 PM
On 2010-08-23 14:02:34 -0400, Joshua Maurice said:

> We started with ASCII, an 8 bit encoding of basic Latin characters.

At the risk of being hyper-technical, ASCII is a 7-bit encoding. There
are many 8-bit extensions, all of which honor the ASCII encoding for
code points less than 128. In particular, there's a family of ISO
encodings with names like ISO 8859-1, that provide more or less
coherent sets of characters with code points of 128 and above.

Yes, it's a jungle out there.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com) Author of "The
Standard C++ Library Extensions: a Tutorial and Reference"
(www.petebecker.com/tr1book)

Jens Schmidt

Aug 24, 2010, 4:58:39 PM
Joshua Maurice wrote:

> However, I think that you are mistaken, and that you are not familiar
> with the facts of history. AFAIK, this is the basic course of events.
> We started with ASCII, an 8 bit encoding of basic Latin characters.

ASCII is just 7 bits.

> Eventually, other people besides English speakers wanted to use
> computers, so they started getting their own encodings which were not
> "compatible".

Some of them still used 7 bits, like the several variants of ISO 646.
Others made the jump to 8 bits, like ISO 8859, preserving compatibility
with ASCII (but not with each other).

> So, around 1990, the first Unicode standard came out
> which contained a listing of code points for each letter and an
> encoding for these code points, a constant width 16 bit encoding known
> as UCS-2. It was called "wide char" to distinguish it from the normal
> 8 bit char. C90 was standardized around the same time, so they added a
> wide char, wchar_t. However, Unicode shortly thereafter decided that
> 16 bits was not enough to give a unique code point to all existing
> letters, and thus the encodings UTF-8, UTF-16, and UTF-32 were born.
> Sadly, a lot of people are still persisting in the belief that Unicode
> means UCS-2. It doesn't help that several popular languages jumped on
> board immediately with UCS-2, including C to a small degree and Java
> to a very large degree. So, due to various compatibility concerns and
> simple ignorance, we're left in this muddled state which we are now.
>
> What does this mean? If you're fine with using an encoding which might
> break when you compile the program with one locale and run it on
> another locale for some systems, and you're fine with using UTF-16 on
> some OSs, and UTF-32 on other OSs, then use C++ wide strings ala
> L"..." and wchar_t. It's based on the now discontinued and
> insufficient UCS-2, and no one has bothered to clean it up and make it
> portably useful, so I much prefer to avoid it in its entirety.

Type wchar_t is based on UTF-16 on some systems only, mainly MS Windows.
Others, those with 4 byte wchar_t, use UTF-32. UTF-32 happens to be an
encoding for UCS-4 based on identity.
--
Greetings,
Jens Schmidt

Krisztian

Aug 24, 2010, 4:56:44 PM

> b) What is the rationale for not including it?

1) I guess Unicode is too complex in itself to "force" every compiler
vendor to implement it.
Unicode's complexity reflects the complexity of human languages.

2) Some applications do not even require the full Unicode range to be
handled correctly.
I have worked on a server application that used Unicode in the
protocol. The underlying mechanism needed to be fast and safe.
The "string" operations were very limited, like comparison or simple
string storage.
Basically only European languages were meant to be supported, and it
was tested only with English and German. So wchar_t simply did the job.

Jens Schmidt

Aug 24, 2010, 5:02:31 PM
Goran Pusic wrote:

> In my mind, wchar_t is quite OK to represent Unicode-encoded data on
> the given platform, as wchar_t is supposed to represent platform-
> specific encoded text. So if platform uses UTF-8, wchar_t should be
> used to represent UTF-8 text. Yes, that means that there are
> surrogates ( a lot of them :-) ), and that "one datum!=one unicode
> code point". But that is already the case when platform uses UTF-8 and
> UTF-16 (a __vast__ majority of all platforms, and UTF-32 is might
> wasteful for any platform), so there's not much harm done, really.

You missed the reason for having the code conversion functions in the
locale facets. A typical system has more than one encoding for Unicode.
One of them is for storage and communication. Its most important
features are compactness and unambiguousness: UTF-8.
The other is for processing. Its most important feature is time
efficiency both for access and for modification. Here some systems
decided on UTF-16 (mostly those which were developed when Unicode
was UCS-2), some on UTF-32 (later, when Unicode expanded to UCS-4).

A lot of the complexity in the conversion systems stems from the
possibility of other non-Unicode (ISO 646, ISO 8859, EUC, Shift-Jis,
Big5, ...) encodings both internally and externally or even each of
UTF-8, UTF-16, and UTF-32 for all purposes.

BTW, UTF-32 is probably the fastest encoding for processing Unicode data.
Many modern architectures are quite slow at bitfield access and gain
their full potential at 32 or even 64 bits per access only. So UTF-32
is not a waste, but a speed/space tradeoff in the speed direction.

> The problem, in my mind, is that gcc picks a 32-bit datum for systems
> that actually use utf-8 (e.g. linux). Why? On Windows, encoding is
> UTF-16, and gcc's wchar_t is 16-bit. It would have been quite OK if
> wchar_t meant UTF-8 under e.g. linux.

Again, "system" is more than the OS. Very few use UTF-8 for everything.
Also on Windows, UTF-16 is not the universal encoding, but just the
internal one for programs. Lots of data is stored as UTF-8.
One difference in comparison to Linux is that the file names and other
system routines use the internal encoding on Windows and the external
encoding on Linux.

> I would say that gcc people came around to this realization too late,
> when wchar_t was already in use, and so, gcc on windows could employ
> "correct" logic, the other being stuck with UTF-32 for legacy reasons.

Other way round. Windows was doomed to continue with UTF-16 by the
data type for the system interface, while on other systems the UTF-8
interface provided the flexibility to go to the "correct" UTF-32.

> So, in my mind, it boils down to this: char* _is_ UTF-8 already, use
> of wchar_t is somewhat broken, and standard is fine :-).

My view: char* is whatever the locale specifies. On my system that is
UTF-8, but other possibilities exist. Use of wchar_t is severely broken,
because one has to have system dependent code variants for UTF-16 and
UTF-32 systems. With char16_t and char32_t this will not get better:
Now even the data type name is system dependent. At least the variant
is documented. One pro for Windows: now it is possible to use standard
types for UTF-8 (char * for file contents), UTF-16 (char16_t for system
interfaces), and UTF-32 (char32_t for everything else) at the same time.

Is this too controversial or can I get a little consent?
--
Greetings,
Jens Schmidt

John Shaw

Aug 24, 2010, 7:56:17 PM
On Aug 22, 3:15 pm, "Martin B." <0xCDCDC...@gmx.at> wrote:
> a) Wouldn't it make sense to have a char8_t where char8_t arrays would
> hold UTF-8 character sequences exclusively?

I have written a UTF encoding/decoding template library and included
char8_t for consistency. The char16_t and char32_t have been
recommended as Unicode type extensions for years.

1) ISO/IEC JTC1 SC22 WG14 N1040. (2003-11-07)
2) The Unicode Standard, Version 5.0 (ISBN 0-321-48091-0)

Having a distinct type for char8_t would have been nice, for the same
reasons as having them for the others. Not to hold UTF-8 character
sequences exclusively, though - that requirement would be counterintuitive
and counterproductive.

What bothers me is that I see no standard way, at compile time, to
determine if the new types have been implemented. Therefore, I had to
rename my types (utf_char16_t, etc.) and provide other workarounds, as
required for wchar_t support, due to ambiguities caused by non-
compliant typedefs of the new types.
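
Roughly along these lines (a sketch; utf_char16_t/utf_char32_t mirror the
renamed types mentioned above, and MYLIB_HAS_UNICODE_CHAR_TYPES is an
assumed, project-defined switch, not a standard feature-test macro):

#include <stdint.h>

// There is no standard compile-time way to detect whether the compiler
// provides the new built-in types, so the library has to be told
// explicitly via a configuration macro.
#if defined(MYLIB_HAS_UNICODE_CHAR_TYPES)
typedef char16_t utf_char16_t;
typedef char32_t utf_char32_t;
#else
typedef uint_least16_t utf_char16_t;   // wide enough for a UTF-16 code unit
typedef uint_least32_t utf_char32_t;   // wide enough for a UTF-32 code unit
#endif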

> b) What is the rationale for not including it?

I would guess that since char works for both at present, unless you
are using extended ASCII, they felt there was no need.

Seungbeom Kim

Aug 24, 2010, 7:55:17 PM
On 2010-08-22 13:15, Martin B. wrote:
>
> Should C++0x contain a distinct type for UTF-8?
>
> Current draft N3092 specifies:
> + char16_t* for UTF-16
> + char32_t* for UTF-32
> + char* for execution narrow-character set
> + wchar_t* for execution wide-character set
> + unsigned char*, possibly for raw data buffers etc.
>
> a) Wouldn't it make sense to have a char8_t where char8_t arrays would
> hold UTF-8 character sequences exclusively?

I guess so, just as char16_t and char32_t do for UTF-16 and UTF-32.

At least, char8_t could be made an unsigned integer type! (That is,
a distinct type with the same representation as uint_least8_t.)
Having to cast to unsigned char for any serious byte handling remains
one of my biggest pet peeves.
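
For instance (a sketch of the kind of cast this forces today):

#include <cctype>

// With plain char, a byte with the high bit set is negative on most
// platforms, and passing a negative value (other than EOF) to the
// <cctype> functions is undefined behaviour -- hence the cast.
bool is_ascii_space(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}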

> b) What is the rationale for not including it?

Probably because that's what the C committee did[N1040], I guess.
C has had a tendency to introduce new character types via typedefs,
such as wchar_t, char16_t, and char32_t (hence the suffix "_t"),
which works well for C because it doesn't have overloading anyway.
And char16_t and char32_t were meant primarily to provide clearly
defined widths for the types and to allow string literals thereof,
none of which a separate char8_t was necessary for.
[N1040] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1040.pdf

Things are different in C++: it introduces new character types as
distinct types, and it supports overloading. So I believe C++ could
benefit from a separate char8_t type. However, it doesn't seem to
have been done, and I do not know whether the introduction of char8_t
has ever been discussed in one of the technical papers, or whether WG14's
N1040 was adopted with just as much "translation" as necessary.

--
Seungbeom Kim

Joshua Maurice

Aug 24, 2010, 8:04:48 PM
On Aug 24, 2:02 pm, Jens Schmidt <Jens.Schmidt...@gmx.de> wrote:
> BTW, UTF-32 is probably the fastest encoding for processing Unicode data.
> Many modern architectures are quite slow at bitfield access and gain
> their full potential at 32 or even 64 bits per access only. So UTF-32
> is not a waste, but a speed/space tradeoff in the speed direction.

This is interesting to me. I've always been told that UTF-8 or UTF-16
result in faster processing than UTF-32. Sure, it may take a couple
more instructions to manipulate, but the smaller memory footprint will
result in fewer cache misses, virtual page misses, etc., which results
in faster wallclock execution time. I would guess that the memory
concerns would dwarf the couple of additional instructions on any real
world desktop. However, I don't have that much real world experience
in this regard. Can anyone else comment?


--

Joshua Maurice

Aug 24, 2010, 8:05:07 PM

> To summarize, it's a way to represent some Unicode characters as
> "clusters" of two or more distinct Unicode code points, but the user
> sees that as one character - is this correct?

Yes. That is correct. See
http://unicode.org/reports/tr29/tr29-6.html#Grapheme_Cluster_Boundaries

Note that while there are a lot of precomposed forms for the European
languages for a large portion of the grapheme clusters, IIRC the
Unicode code points for some East Asian languages (was it Thai?) do
not have precomposed forms for a majority of the grapheme clusters.
Why did Unicode assign code points this way? I don't know. Maybe the
total number of combinations was too large to give each combination
its own Unicode code point?

Jens Schmidt

Aug 25, 2010, 1:28:47 PM
Joshua Maurice wrote:

> Note that while there are a lot precomposed forms for the European
> languages for a large portion of the grapheme clusters, IIRC the
> Unicode code points for some east Asian languages (was it Thai?) do
> not have precomposed forms for a majority of the grapheme clusters.
> Why did Unicode assign code points this way? I don't know. Maybe the
> total number of combinations was too large to give each combination
> its own Unicode code point?

There are some explanations on the Unicode site for "what gets its own
code point". In short, precomposed forms are not welcome and are avoided
where possible.
For European languages this was not possible, because another more
important rule guarantees exactly one Unicode code point for each
code point in any other pre-existing standard. Those other standards
(ISO 8859, but also others) brought lots of precomposed forms with
them.
--
Greetings,
Jens Schmidt

Goran Pusic

Aug 25, 2010, 1:27:26 PM
On Aug 24, 11:02 pm, Jens Schmidt <Jens.Schmidt...@gmx.de> wrote:
> Goran Pusic wrote:
> > In my mind, wchar_t is quite OK to represent Unicode-encoded data on
> > the given platform, as wchar_t is supposed to represent platform-
> > specific encoded text. So if platform uses UTF-8, wchar_t should be
> > used to represent UTF-8 text. Yes, that means that there are
> > surrogates ( a lot of them :-) ), and that "one datum!=one unicode
> > code point". But that is already the case when platform uses UTF-8 and
> > UTF-16 (a __vast__ majority of all platforms, and UTF-32 is might
> > wasteful for any platform), so there's not much harm done, really.
>
> You missed the reason for having the code conversion functions in the
> locale facets. A typical system has more than one encoding for Unicode.
> One of them is for storage and communication. Its most important
> features are compactness and unambiguousness: UTF-8.

You mean e.g. "one encoding is best for for storage and communication,
and it's UTF-8", am I reading you correctly? If so, yes, I can live
with that idea, although, how is any other encoding ( UTF-16/32, 7,
even :-) ) less unambiguous? Due to endiannes? Yes, fair enough.

> The other is for processing. Its most important feature is time
> efficiency both for access and for modification. Here some systems
> decided on UTF-16 (mostly those, which where developed when Unicode
> was UCS-2), some on UTF-32 (later, when Unicode expanded to UCS-4).
>
> A lot of the complexity in the conversion systems stems from the
> possibility of other non-Unicode (ISO 646, ISO 8859, EUC, Shift-Jis,
> Big5, ...) encodings both internally and externally or even each of
> UTF-8, UTF-16, and UTF-32 for all purposes.
>
> BTW, UTF-32 is probably the fastest encoding for processing Unicode data.
> Many modern architectures are quite slow at bitfield access and gain
> their full potential at 32 or even 64 bits per access only. So UTF-32
> is not a waste, but a speed/space tradeoff in the speed direction.

Ah, that sounds true.

> > The problem, in my mind, is that gcc picks a 32-bit datum for systems
> > that actually use utf-8 (e.g. linux). Why? On Windows, encoding is
> > UTF-16, and gcc's wchar_t is 16-bit. It would have been quite OK if
> > wchar_t meant UTF-8 under e.g. linux.
>
> Again, "system" is more than the OS. Very few use UTF-8 for everything.
> Also on Windows, UTF-16 is not the universal encoding, but just the
> internal one for programs. Lots of data is stored as UTF-8.

You mean "programs under windows use UTF-8 to store data", or...? If
so, I contend that this design decision is wrong.

You see, using UTF-8 under Windows simply means going back and forth
between UTF-8 and 16 whenever you need to do anything with the system.
That's busywork, why bother? And even when text data has to go out of
Windows, marking it as UTF-16LE still works, at least until receiving
system understands UTF-16LE, which it should, it's standard, after
all. (Of course, receiving system should convert it upon reception to
whatever it sees fit).
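
The back-and-forth being described looks roughly like this (a sketch using
the Win32 conversion functions; error handling omitted):

#include <windows.h>
#include <string>

// UTF-8 (the storage choice) -> UTF-16 (what the system calls want).
std::wstring utf8_to_utf16(const std::string& in)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, in.c_str(), -1, NULL, 0);
    std::wstring out(len > 0 ? len : 1, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, in.c_str(), -1, &out[0], len);
    out.resize(len > 0 ? len - 1 : 0);   // drop the embedded terminator
    return out;
}

// ...and back to UTF-8 before the data leaves the program again.
std::string utf16_to_utf8(const std::wstring& in)
{
    int len = WideCharToMultiByte(CP_UTF8, 0, in.c_str(), -1, NULL, 0, NULL, NULL);
    std::string out(len > 0 ? len : 1, '\0');
    WideCharToMultiByte(CP_UTF8, 0, in.c_str(), -1, &out[0], len, NULL, NULL);
    out.resize(len > 0 ? len - 1 : 0);
    return out;
}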

__Only__ if "sending data out" requires data in specific encoding,
because receiver requires said encoding (for whatever reason), should
program divert from OS's encoding. I believe such situations are rare,
because if text really is Unicode and it goes to another system,
chances are that other system will have a comprehensive Unicode
support. If not, a specific encoding is needed anyhow and chances are,
it's not UTF-8.

In a way, I believe that one should work with the OS, and that applies
to text encoding, too. If there's a lot of Unicode text processing,
then a library should be used, one that recognizes the system and
works with it. Does ICU do that? I don't know, I am not much into text
processing, my code cares only about correctly receiving, transporting
and displaying it. This probably taints my view, too.

> One difference in comparison to Linux is that the file names and other
> system routines use the internal encoding on Windows and the external
> encoding on Linux.
>
> > I would say that gcc people came around to this realization too late,
> > when wchar_t was already in use, and so, gcc on windows could employ
> > "correct" logic, the other being stuck with UTF-32 for legacy reasons.
>
> Other way round. Windows was doomed to continue with UTF-16 by the
> data type for the system interface, while on other systems the UTF-8
> interface provided the flexibility to go to the "correct" UTF-32.
>
> > So, in my mind, it boils down to this: char* _is_ UTF-8 already, use
> > of wchar_t is somewhat broken, and standard is fine :-).
>
> My view: char* is whatever the locale specifies. On my system that is
> UTF-8, but other possibilities exist.

I think that's not a good stance, because locale is more than text
representation, even if we only consider text. E.g. depending on
locale, collation changes, but text in UTF-8 does not.

> Use of wchar_t is severely broken,
> because one has to have system dependent code variants for UTF-16 and
> UTF-32 systems.

OTOH, wchar_t would be OK when encoding of stored data is known up-
front (that's the case anyhow) and when any text manipulation routines
do not require a specific encoding (that's not the case, except
perhaps for ICU). In that light, if gcc did what I think it should
do :-), wchar_t would be fine for all things Unicode.

> Is this too controversial or can I get a little consent?

+1 for both :-)

Goran.


--

Martin B.

Aug 26, 2010, 3:47:02 PM

Everyone on a western Windows is already more or less using Win-1252 by
default, so I'd say you always run into trouble even with an
English-only application.

cheers,
Martin

Martin B.

Aug 26, 2010, 3:47:27 PM
On 24.08.2010 23:02, Jens Schmidt wrote:
>[...]

> My view: char* is whatever the locale specifies.
> [...] Use of wchar_t is severely broken,

> because one has to have system dependent code variants for UTF-16 and
> UTF-32 systems. With char16_t and char32_t this will not get better:
> Now even the data type name is system dependent. At least the variant
> is documented.

I think you do not *need* system-dependent code variants for UTF-16 and
UTF-32. You only need those if you want to run your program on systems
that use different conventions. But with wchar_t this difference is
outside the scope of what the compiler can check.
Here char16_t/char32_t will be better than wchar_t because the compiler can
actually tell you that you're missing one variant.

cheers,
Martin

--

Martin B.

Aug 26, 2010, 11:05:01 PM
On 25.08.2010 01:55, Seungbeom Kim wrote:
> On 2010-08-22 13:15, Martin B. wrote:
>>
>> Should C++0x contain a distinct type for UTF-8?
>>
>> Current draft N3092 specifies:
>> + char16_t* for UTF-16
>> + char32_t* for UTF-32
>> [...]

>>
>> a) Wouldn't it make sense to have a char8_t where char8_t arrays would
>> hold UTF-8 character sequences exclusively?
>
> I guess so, just as char16_t and char32_t do for UTF-16 and UTF-32.
> [...]

>
>> b) What is the rationale for not including it?
>
> Probably because that's what the C committee did[N1040], I guess.
> [...]

> And char16_t and char32_t were meant primarily to provide clearly
> defined widths for the types and to allow string literals thereof,
> none of which a separate char8_t was necessary for.
> [N1040] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1040.pdf
>
> Things are different in C++: it introduces new character types as
> distinct types, and it supports overloading. So I believe C++ could
> benefit from a separate char8_t type. However, it doesn't seem to
> have been done and I do not know whether introduction of char8_t
> has ever been discussed in one of the technical papers, or WG14's
> N1040 was adopted with just as much "translation" as necessary.
>

Interestingly, the C paper apparently only specifies string literals for
UTF-16 and UTF-32, u"" and U"".

The C++ standard adds u8"" but for some (for me) very weird reason fails
to add a distinct char8_t type.

Does anyone know a specific reason for this? Was it an active decision
based on some rationale or just an oversight?

cheers,
Martin

--

Daniel Krügler

Aug 27, 2010, 7:28:33 AM
On 27 Aug., 05:05, "Martin B." <0xCDCDC...@gmx.at> wrote:
> Interestingly, the C paper apparently only specifies string literals for
> UTF-16 and UTF-32 u"" and U"".

Note that the most recent draft of C1x

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1494.pdf

also provides the u8 character prefix and maps it to
unsigned char.

Greetings from Bremen,

Daniel Krügler

Martin B.

Aug 27, 2010, 4:56:07 PM
On 27.08.2010 13:28, Daniel Krügler wrote:
> On 27 Aug., 05:05, "Martin B."<0xCDCDC...@gmx.at> wrote:
>> Interestingly, the C paper apparently only specifies string literals for
>> UTF-16 and UTF-32 u"" and U"".
>
> Note that the most recent draft of C1x
>
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1494.pdf
>
> also provides the u8 character prefix and maps it to
> unsigned char.
>

Well. I think that mapping u8 literals to unsigned char instead of char
*would* be a *great* improvement, but looking at the document you linked:

N1494, p71, §6.4.5, Item 6:
[...] For UTF-8 string literals, the array elements have type char, and
are initialized with the characters of the multibyte character sequence,
as encoded in UTF-8. [...]

Seems to be the same as in C++ (N3092).

cheers,
Martin

Seungbeom Kim

Aug 27, 2010, 11:18:00 PM
On 2010-08-27 04:28, Daniel Krügler wrote:
> On 27 Aug., 05:05, "Martin B." <0xCDCDC...@gmx.at> wrote:
>> Interestingly, the C paper apparently only specifies string literals for
>> UTF-16 and UTF-32 u"" and U"".
>
> Note that the most recent draft of C1x
>
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1494.pdf
>
> also provides the u8 character prefix and maps it to
> unsigned char.

Unsigned char sounds better than plain char, and I thought, why doesn't
C++ also do that? ... But, 6.4.5/6 of n1494 linked above says:

"For UTF−8 string literals, the array elements have type char, [...]."

So it's not unsigned char here, either. Or did I miss something?

--
Seungbeom Kim

Daniel Krügler

Aug 28, 2010, 4:45:54 PM
On 28 Aug., 05:18, Seungbeom Kim <musip...@bawi.org> wrote:
> On 2010-08-27 04:28, Daniel Krügler wrote:
>
> > On 27 Aug., 05:05, "Martin B." <0xCDCDC...@gmx.at> wrote:
> >> Interestingly, the C paper apparently only specifies string literals for
> >> UTF-16 and UTF-32 u"" and U"".
>
> > Note that the most recent draft of C1x
>
> >http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1494.pdf
>
> > also provides the u8 character prefix and maps it to
> > unsigned char.
>
> Unsigned char sounds better than plain char, and I thought, why doesn't
> C++ also do that? ... But, 6.4.5/6 of n1494 linked above says:
>
> "For UTF-8 string literals, the array elements have type char, [...]."

>
> So it's not unsigned char here, either. Or did I miss something?

I apologize for the confusion: it was too late when I
looked into the C draft. I did know that they have
taken basically the same approach as C++ and searched
in the draft for the proper location where the reference
was given. During the search I also stopped briefly
at 6.4.4.4/9, where the completely different reference
to unsigned char burnt itself into some brain ROM before I
later ended up at the place where the proper reference to
the u8 prefix was given.

Greetings from Bremen,

Daniel Krügler


--

Miles Bader

Aug 28, 2010, 4:41:43 PM
Joshua Maurice <joshua...@gmail.com> writes:
>> BTW, UTF-32 is probably the fastest encoding for processing Unicode data.
>> Many modern architectures are quite slow at bitfield access and gain
>> their full potential at 32 or even 64 bits per access only. So UTF-32
>> is not a waste, but a speed/space tradeoff in the speed direction.
>
> This is interesting to me. I've always been told that UTF-8 or UTF-16
> result in faster processing than UTF-32. Sure, it may take a couple
> more instructions to manipulate, but the smaller memory footprint will
> result in less cache missses, virtual page misses, etc., which results
> in faster wallclock execution time. I would guess that the memory
> concerns would dwarf the couple additional instructions on any real
> world desktop. However, I don't have that much real world experience
> though in this regard. Can anyone else comment?

I seem to recall that at least at some point, there was a recommendation
that people use UTF-8 only for storage, and then use something like UCS4
for internal processing. [I don't really know the reasoning, other than
perhaps a very simplistic "oh single code points are easier!"]

However, in my experience, that recommendation is typically ignored
in practice, and many, many, programs very successfully use UTF-8 for
internal text processing as well.

[I think if I was going to write a text editor or something, I'd probably
use UTF-8 internally too.]

-Miles

--
Opposition, n. In politics the party that prevents the Government from running
amok by hamstringing it.

Walter Bright

Aug 28, 2010, 5:59:12 PM
Jens Schmidt wrote:
> BTW, UTF-32 is probably the fastest encoding for processing Unicode data.
> Many modern architectures are quite slow at bitfield access and gain
> their full potential at 32 or even 64 bits per access only. So UTF-32
> is not a waste, but a speed/space tradeoff in the speed direction.

Are you sure about that? My experience is that UTF-8 is faster because:

1. nearly all the data is ASCII anyway, so the code path taken when the high
bit is set is rarely exercised

2. UTF-32 consumes 4x the memory, which means if you're dealing with a lot of
data you're consuming (and paging and cache missing) memory at a fantastic rate
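
A sketch of point 1 -- the rarely taken branch -- as a simplified decoder
(no validation of malformed sequences):

#include <cstddef>

// Decode the code point starting at s[i] and advance i.  For mostly-ASCII
// data the single-byte branch is taken almost every time, so the cost of
// the variable-width encoding is one well-predicted branch.
unsigned long next_code_point(const unsigned char* s, std::size_t& i)
{
    unsigned char b = s[i++];
    if (b < 0x80)                    // 1 byte: the common ASCII fast path
        return b;
    if ((b & 0xE0) == 0xC0)          // 2 bytes
        return ((b & 0x1FUL) << 6) | (s[i++] & 0x3F);
    if ((b & 0xF0) == 0xE0) {        // 3 bytes
        unsigned long cp = (b & 0x0FUL) << 12;
        cp |= (unsigned long)(s[i++] & 0x3F) << 6;
        cp |= s[i++] & 0x3F;
        return cp;
    }
    unsigned long cp = (b & 0x07UL) << 18;   // 4 bytes
    cp |= (unsigned long)(s[i++] & 0x3F) << 12;
    cp |= (unsigned long)(s[i++] & 0x3F) << 6;
    cp |= s[i++] & 0x3F;
    return cp;
}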

--

nm...@cam.ac.uk

Aug 29, 2010, 11:42:42 AM
In article <87pqx3o...@catnip.gol.com>, Miles Bader <mi...@gnu.org> wrote:
>
> I seem to recall that at least at some point, there was a recommendation
> that people use UTF-8 only for storage, and then use something like UCS4
> for internal processing. [I don't really know the reasoning, other than
> perhaps a very simplistic "oh single code points are easier!"]

I haven't been close to this area for many years, but I am pretty
certain that it's because the rules for such things as sequence
equivalence in UTF-8 are so evil that it is almost inconceivable
that most programs will get them right. This was one of the main
objections to Unicode by the ISO people, back during the duel.

> However, in my experience, that recommendation is typically ignored
> in practice, and many, many, programs very successfully use UTF-8 for
> internal text processing as well.

For a limited meaning of "very successfully"! I do not use such
programs very often, and almost invariably notice a misbehaviour
that is likely to be in this area when I do.


Regards,
Nick Maclaren.

--

Joshua Maurice

Aug 29, 2010, 11:41:03 AM
On Aug 28, 1:41 pm, Miles Bader <mi...@gnu.org> wrote:

> Joshua Maurice <joshuamaur...@gmail.com> writes:
>>> BTW, UTF-32 is probably the fastest encoding for processing Unicode data.
>>> Many modern architectures are quite slow at bitfield access and gain
>>> their full potential at 32 or even 64 bits per access only. So UTF-32
>>> is not a waste, but a speed/space tradeoff in the speed direction.
>
>> This is interesting to me. I've always been told that UTF-8 or UTF-16
>> result in faster processing than UTF-32. Sure, it may take a couple
>> more instructions to manipulate, but the smaller memory footprint will
>> result in less cache missses, virtual page misses, etc., which results
>> in faster wallclock execution time. I would guess that the memory
>> concerns would dwarf the couple additional instructions on any real
>> world desktop. However, I don't have that much real world experience
>> though in this regard. Can anyone else comment?
>
> I seem to recall that at least at some point, there was a recommendation
> that people use UTF-8 only for storage, and then use something like UCS4
> for internal processing. [I don't really know the reasoning, other than
> perhaps a very simplistic "oh single code points are easier!"]

As I tried to emphasize, this is false AFAIK. To repeat my list above of
possible string usages:

- Blind transport of data from point A to point B.
In this case, the encoding does not matter.

- Changing encoding.
I don't see how using UTF-32 over UTF-8 makes this any easier as
you're probably using a library to do the translation.

- User Interface.
This is above my level of expertise. However, I would expect that
again the "ease of manipulation" of UTF-32 does not exist.

- Collation
Again you're using a library if you're doing this correctly, so there
is no "ease of manipulation" with UTF-32 (apart from whatever the
library accepts; IIRC, ICU still uses UTF-16 as the only encoding on
which it does collation, and UTF-16 is a required intermediary
encoding for translations from arbitrary encodings to arbitrary
encodings).

- Substringing and concatenating
Presumably this is what they mean when they say that UTF-32 is easier
than UTF-8. I suspect that people who say this are "silly English
speakers" who do not know what a grapheme cluster is. They think that
the unit of choice for substringing is Unicode code points, though I
strongly suspect that it's probably grapheme clusters for actual real
world scenarios.


--

Jens Schmidt

Aug 29, 2010, 11:48:18 AM
Walter Bright wrote:

> Jens Schmidt wrote:
>> BTW, UTF-32 is probably the fastest encoding for processing Unicode data.
>> Many modern architectures are quite slow at bitfield access and gain
>> their full potential at 32 or even 64 bits per access only. So UTF-32
>> is not a waste, but a speed/space tradeoff in the speed direction.
>
> Are you sure about that?

I have no idea. Probably deciding between the various formats on speed
grounds is currently just premature optimisation.

Actually, I now consider all variants about equal. One should use whatever
format is easiest to understand and handle. Only if this results in a
bottleneck, verified by measuring, should one have to think about it.

> My experience is that UTF-8 is faster because:
> 1. nearly all the data is ASCII anyway, the code path if the high bit is
> set is rarely taken

Only if you are in America, Western Europe, or Australia. In large parts of
the world (Asia, Eastern Europe) nearly all characters are multibyte in
UTF-8.

> 2. UTF-32 consumes 4 x the memory, which means if you're dealing with a
> lot of data you're consuming (and paging and cache missing) memory at a
> fantastic rate

Is character data really that relevant in most programs? Many C++ Libraries
use the short string optimisation in their string representation. They do
this, because lots of strings are actually short. For those the handling
overhead should dwarf any size increase of 4× some small memory¹.

[¹]² A real "times" character just to introduce some non-ASCII here.
[²] ... and some more for the footnote symbols. :-)
--
Greetings,
Jens Schmidt

Miles Bader

Aug 30, 2010, 12:31:58 AM
nm...@cam.ac.uk writes:
> I haven't been close to this area for many years, but I am pretty
> certain that it's because the rules for such things as sequence
> equivalence in UTF-8 are so evil that it is almost inconceivable
> that most programs will get them right.

I'm not sure what you mean by "sequence equivalence." If you mean
equivalent UTF-8 encodings for the same unicode character, why would
that be a problem, if you control the encoder (or validate your input)?

The really annoying stuff in dealing with unicode seem to be at a level
above the encoding, e.g. with sequences of unicode codepoints (combining
characters etc etc), and of course using ucs4 does not help with such
things.

-Miles

--
Come now, if we were really planning to harm you, would we be waiting here,
beside the path, in the very darkest part of the forest?

Walter Bright

Aug 30, 2010, 12:42:46 AM
Jens Schmidt wrote:

> Walter Bright wrote:
>> My experience is that UTF-8 is faster because:
>> 1. nearly all the data is ASCII anyway, the code path if the high bit is
>> set is rarely taken
>
> Only if you are in America, Western Europe, or Australia. In large parts of
> the world (Asia, Eastern Europe) nearly all characters are multibyte in
> UTF-8.

What's flying around in the user's computer is not necessarily the same as what
the user actually is reading.


>> 2. UTF-32 consumes 4 x the memory, which means if you're dealing with a
>> lot of data you're consuming (and paging and cache missing) memory at a
>> fantastic rate
>
> Is character data really that relevant in most programs? Many C++ Libraries
> use the short string optimisation in their string representation. They do
> this, because lots of strings are actually short. For those the handling
> overhead should dwarf any size increase of 4× some small memory¹.

If your program does anything with the internet, it deals with a *lot* of strings!


--

nm...@cam.ac.uk

Aug 30, 2010, 10:33:50 AM
In article <874oedm...@catnip.gol.com>, Miles Bader <mi...@gnu.org> wrote:
>
>> I haven't been close to this area for many years, but I am pretty
>> certain that it's because the rules for such things as sequence
>> equivalence in UTF-8 are so evil that it is almost inconceivable
>> that most programs will get them right.
>
> I'm not sure what you mean by "sequence equivalence." If you mean
> equivalent UTF-8 encodings for the same unicode character, why would
> that be a problem, if you control the encoder (or validate your input)?

It's a form of lexical analysis for a language with a context-
dependent grammar. And not even to the extent of C (and hence C++),
but a much worse case. So even a simple string comparator has to
either embed parsing in its code, or call the parser, or convert
the format to UCS-4 before doing the job.


Regards,
Nick Maclaren.

--

Martin B.

Aug 30, 2010, 10:35:59 AM
On 28.08.2010 22:45, Daniel Krügler wrote:
> On 28 Aug., 05:18, Seungbeom Kim<musip...@bawi.org> wrote:
>> On 2010-08-27 04:28, Daniel Krügler wrote:
>>
>>> On 27 Aug., 05:05, "Martin B."<0xCDCDC...@gmx.at> wrote:
>>>> Interestingly, the C paper apparently only specifies string literals for
>>>> UTF-16 and UTF-32 u"" and U"".
>>
>>> Note that the most recent draft of C1x
>>
>>> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1494.pdf
>>
>>> also provides the u8 character prefix and maps it to
>>> unsigned char.
>>
>> Unsigned char sounds better than plain char, and I thought, why doesn't
>> C++ also do that? ... But, 6.4.5/6 of n1494 linked above says:
>>
>> "For UTF-8 string literals, the array elements have type char, [...]."
>>
>> So it's not unsigned char here, either. Or did I miss something?
>
> I apologize for the confusion: It was too late when I
> looked into the C-draft: [...]
>

Using unsigned char instead of char would still be better. Not as good as a dedicated char8_t, but still an improvement.

I would like to take the opportunity to repeat my initial questions here:

a) Wouldn't it make sense to have a char8_t where char8_t arrays would
hold UTF-8 character sequences exclusively?

b) What is the rationale for not including it?

It seems they apply equally to the C std draft and to the C++ FCD.

cheers,
Martin

Bo Persson

Aug 30, 2010, 7:58:41 PM
Martin B. wrote:
>
> I would like to take the opportunity to repeat my initial questions
> here:
> a) Wouldn't it make sense to have a char8_t where char8_t arrays
> would hold UTF-8 character sequences exclusively?

Maybe, but how would we assure that?

>
> b) What is the rationale for not including it?

Just a guess that already having three 8-bit character types, some of
which are identical, doesn't encourage introducing a fourth. To be
consistent with the type system, wouldn't a char8_t have to be
convertible to char (or unsigned char, or both) anyway?


Bo Persson

Martin B.

Sep 13, 2010, 6:29:20 PM
On 31.08.2010 01:58, Bo Persson wrote:
> Martin B. wrote:
>>
>> I would like to take the opportunity to repeat my initial questions
>> here:
>> a) Wouldn't it make sense to have a char8_t where char8_t arrays
>> would hold UTF-8 character sequences exclusively?
>
> Maybe, but how would we assure that?
>

Assure what? That char8_t arrays hold only valid UTF-8 strings? We don't
need to assure that. Same as we don't need to assure valid UTF-16
strings in char16_t arrays.
BUT if there were a separate UTF-8 type then the user couldn't
accidentally mix UTF-8 string literals with plain-char string literals!

>>
>> b) What is the rationale for not including it?
>
> Just a guess that already having three 8-bit character types, some of
> which are identical, doesn't encourage introducing a fourth. To be
> consistent with the type system, wouldn't a char8_t have to be
> convertible to char (or unsigned char, or both) anyway?
>

Three?

Anyway - I'm not concerned about the convertibility of single char
values. You could not directly convert strings (char8_t* / char* /
uchar*) to each other, and _the_compiler_ could tell the difference between
UTF-8 string literals and char* string literals!
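
A sketch of the problem under the FCD rules, where a u8 literal is just an
array of plain char (the char8_t overload shown is hypothetical):

#include <iostream>

void consume(const char*) { std::cout << "got a char string\n"; }
// void consume(const char8_t*);   // hypothetical: would need a distinct char8_t

int main()
{
    consume("Grüße");     // bytes in the (implementation-defined) execution charset
    consume(u8"Grüße");   // bytes guaranteed to be UTF-8, yet the same type,
                          // so it lands in the very same overload
}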

cheers,
Martin

Miles Bader

Sep 13, 2010, 6:32:29 PM
nm...@cam.ac.uk writes:
>> I'm not sure what you mean by "sequence equivalence." If you mean
>> equivalent UTF-8 encodings for the same unicode character, why would
>> that be a problem, if you control the encoder (or validate your input)?
>
> It's a form of lexical analysis for a language with a context-
> dependent grammar. And not even to the extent of C (and hence C++),
> but a much worse case. So even a simple string comparator has to
> either embed parsing in its code, or call the parser, or convert
> the format to UCS-4 before doing the job.

Not sure what you're trying to claim, but AFAICS, there are two common
sorts of tasks:

(1) "bulk" tasks, e.g. copying, or ignoring-unicode-issues comparison
which can generally done "raw", i.e. on raw bytes.

These can be done similarly on either UTF-8 or UCS4. In either
case, you treat the strings as sequences of bytes with certain
constraints (in either case, pointers should be probably be
aligned on character boundaries, though "realigning" is not
hard). "Decoding" is not generally necessary.

(2) "Codepoint aware" tasks. These typically involve actually
decoding each character. This is more complicated when using
UTF-8, but not significantly so (UTF-8 is a very simple
encoding), because UTF-8 is a very simple multibyte encoding.

For either case (UCS4 or UTF-8) you aren't going to just read
external files into your strings without re-encoding/validating them,
but it's pretty much equivalently complex for either.

Of course, UTF-8 has the traditional problems associated with multibyte
encodings -- e.g., you can't use direct indexing by character, and it's more
difficult to estimate memory allocation -- and UCS4 has the
traditional bloat associated with fixed-length encodings.
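
The "realigning" mentioned under (1) above comes down to skipping
continuation bytes; a minimal sketch:

// Move p back to the first byte of the UTF-8 sequence it points into.
// Every continuation byte has the bit pattern 10xxxxxx.
const char* align_to_code_point(const char* begin, const char* p)
{
    while (p > begin &&
           (static_cast<unsigned char>(*p) & 0xC0) == 0x80)
        --p;
    return p;
}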

Which of those is a bigger issue seems to depend a lot on exactly
which tasks the application is going to perform, and how it is
designed.

So if you believe there are _additional_ problems with UTF-8 that make
it a poorer choice generally (as you've been implying), can you state
them more clearly?

-Miles

--
=====
(^o^;
(()))
*This is the cute octopus virus, please copy it into your sig so it can spread.

Jens Schmidt

Sep 13, 2010, 7:00:54 PM
Walter Bright wrote:

> Jens Schmidt wrote:
>> Only if you are in America, Western Europe, or Australia. In large parts
>> of the world (Asia, Eastern Europe) nearly all characters are multibyte
>> in UTF-8.
>
> What's flying around in the user's computer is not necessarily the same
> as what the user actually is reading.

Quite right. But are these strings still strings or parsed syntax trees?

> If your program does anything with the internet, it deals with a *lot*
> of strings!

Actually, on the internet a lot of strings are handled case-insensitively
(DNS, HTML). Also other properties of strings and characters have to be
assessed. For security, mixing of characters from various blocks must be
suppressed. A Greek Β is not a Latin B and a Cyrillic Р is not a Latin P.

All these checks and conversions are much easier (faster?) if performed on
UTF-32 or UTF-16.

Did anybody ever code all three variants and measure them on various
architectures?
--
Greetings,
Jens Schmidt
