
unicode and string


Fulvio Esposito

Jan 5, 2012, 12:05:14 PM
Hi all,
I was recently studying Unicode and internationalization, and
some questions came to mind about C++ strings.

Correct me if I'm wrong. std::string and std::wstring member
functions simply don't work for UTF-8 or UTF-16 encoded
strings because they wrongly assume "code_point == code_unit"
(for example, length() returns the number of code units in
the sequence and not the number of code points in the Unicode
string, operator[] cannot return the code point if it's
represented by two or more code units, etc.).
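
For illustration, a minimal sketch of what I mean (the literal below
is just the two-byte UTF-8 encoding of U+00E9, "é"):

#include <string>
#include <cassert>

int main()
{
    std::string s = "\xC3\xA9";  // one code point, two UTF-8 code units
    assert(s.length() == 2);     // length() counts code units (bytes), not code points
    char c = s[0];               // operator[] yields a single byte (0xC3), not the code point
    (void)c;
}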

So in the end, what's the best strategy to handle Unicode
strings in C++? Many suggest using UTF-8 and std::string,
many others UTF-16 and std::wstring (but on Linux wchar_t
is often 32 bits wide :S); ICU uses UTF-16 by default but
has its own UnicodeString.

As a use case, imagine a GUI Toolkit, what should be the
type of the Text property for a TextBox?

Fulvio Esposito


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

cpisz

Jan 5, 2012, 4:17:02 PM
On Jan 5, 11:05 am, Fulvio Esposito <esposito.ful...@gmail.com> wrote:
> Hi all,
> I was recently studying unicode and internationalization and
> some questions come to my mind about C++ string.
>
> Correct me if I'm wrong. std::string and std::wstring member
> functions simply don't work for UTF-8 or UTF-16 encoded
> strings 'cause they wrongly assume "code_point==code_unit"
> (for example length() returns the length of the sequence
> and not the size of the unicode string, operator[] could
> not return the code point if it's represented by two or
> more code units, etc.).
>
> So in the end, what's the best strategy to handle unicode
> strings in C++? Many suggest to use UTF-8 and std::string,
> many others UTF-16 and std::wstring (but on linux wchar_t
> are often 32-bit wide :S), ICU uses UTF-16 by default but
> has its own UnicodeString.
>
> As a use case, imagine a GUI Toolkit, what should be the
> type of the Text property for a TextBox?

I'm going through the same kind of thing. In a Windows environment all
the literals are 'UTF-16', but more accurately it is the subset of
UTF-16 that fits into 2 bytes. I have services that talk across the
network, some talking UTF-8, some talking UTF-16BE, some talking
UTF-16LE, some talking ANSI C strings. It is a royal pain in my arse,
especially when the data goes bad somewhere and I have to figure out
whether the bytes obtained are indeed valid in the expected encoding.

The best advice I can give for now is to treat the data as bytes to
and from files, across sockets, etc. Use standard streams otherwise,
and keep in mind that you can only represent those characters that
fit into 2 bytes.

C++11 is supposed to change this with better locales and facets, I
believe.

Martin B.

Jan 5, 2012, 4:17:10 PM
On 05.01.2012 18:05, Fulvio Esposito wrote:
> Hi all,
> I was recently studying unicode and internationalization and
> some questions come to my mind about C++ string.
>
> (...)
> So in the end, what's the best strategy to handle unicode
> strings in C++? ...
>
> As a use case, imagine a GUI Toolkit, what should be the
> type of the Text property for a TextBox?
>

For this use case, I'd use *the string class that the GUI toolkit
provides*. If it doesn't provide one, it should document what kind of
"strings" it expects and you're left to choose.

cheers,
Martin

Fulvio Esposito

Jan 6, 2012, 12:28:06 AM
> For this use case, I'd use *the string class that
> the GUI toolkit provides*. If it doesn't provide one,
> it should document what kind of "strings" it expects
> and you're left to choose.
>

I was thinking about writing a GUI Toolkit, and how to
tackle localization problems. If I write the toolkit
and want to use only the standard library, how can I
best deal with Unicode strings? What's a reasonable
way to handle it?

Regards,
Fulvio

Goran

Jan 6, 2012, 3:54:53 PM
On Jan 5, 6:05 pm, Fulvio Esposito <esposito.ful...@gmail.com> wrote:
> Hi all,
> I was recently studying unicode and internationalization and
> some questions come to my mind about C++ string.
>
> Correct me if I'm wrong. std::string and std::wstring member
> functions simply don't work for UTF-8 or UTF-16 encoded
> strings 'cause they wrongly assume "code_point==code_unit"
> (for example length() returns the length of the sequence
> and not the size of the unicode string, operator[] could
> not return the code point if it's represented by two or
> more code units, etc.).
>
> So in the end, what's the best strategy to handle unicode
> strings in C++? Many suggest to use UTF-8 and std::string,
> many others UTF-16 and std::wstring (but on linux wchar_t
> are often 32-bit wide :S), ICU uses UTF-16 by default but
> has its own UnicodeString.
>
> As a use case, imagine a GUI Toolkit, what should be the
> type of the Text property for a TextBox?

I think you need to put aside the idea that a Unicode string
needs to have one Unicode code point per element, because this isn't
the case on either Unix variants or Windows. It isn't the case
because neither uses the Unicode encoding that would allow it
(UTF-32). If you need per-code-point processing, go from whatever
representation you get from the system to e.g. ICU.

Finally, cpisz is wrong: Windows does know UTF-16 (what he describes
is UCS-2, which was abandoned in Windows about a decade ago).

As for "best strategy", I would go for "play with the system", that
is, use system's native encoding as much as possible, and, for cross-
system storage or transport, use encoding that you "main" system knows
best, and convert when on another system. So for example, if your main
system is Linux, go for UTF-8 for storage and transport. But never,
ever, forget to convert that to UTF-16 when using your stuff to
interact with Windows. That would typically mean using
MultiByteToWideChar with CP_UTF8 to get your wstring.
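
For concreteness, a minimal sketch of that conversion (Windows only;
error handling omitted, and the helper name is just made up here):

#include <windows.h>
#include <string>

std::wstring utf8_to_wide(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call asks for the required length, second call converts.
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8.data(), (int)utf8.size(), 0, 0);
    std::wstring wide(n, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), (int)utf8.size(), &wide[0], n);
    return wide;
}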

Goran.

Alf P. Steinbach

Jan 6, 2012, 3:57:27 PM
On 05.01.2012 18:05, Fulvio Esposito wrote:
>
[snip]
> So in the end, what's the best strategy to handle unicode
> strings in C++? Many suggest to use UTF-8 and std::string,
> many others UTF-16 and std::wstring (but on linux wchar_t
> are often 32-bit wide :S), ICU uses UTF-16 by default but
> has its own UnicodeString.

It's a bit more complex.

On a typical modern Linux `char` means UTF-8 and `wchar_t` means
UTF-32, ...

while in Windows `char` definitely means Windows ANSI (which is a
locale-specific encoding, defined by the GetACP API function) and
`wchar_t` definitely means UTF-16 or, in consoles, the UCS-2 subset.

The Windows meanings of the built-in types are at odds with C++98, e.g.
the arguments to `main`, and they're at odds with C++11, e.g. `u8`
literals, which produce entirely the wrong type in Windows.

So, with the built-in types as victims of some apparent political war,
and therefore ungood, IMHO the only reasonable thing to do, starting at
the fundamental level, is to define a new basic encoding value type, one
that is type-wise different from `char` and `wchar_t`.

One hurdle is then to make such a new encoding value type work with
std::basic_string, which is desirable.

If you define the type as e.g.

struct EncodingValue { char value; };

then, while in practice the size of that beast will be very suitable, as
soon as you define a constructor you will likely run into problems with
std::basic_string implementations, since for the short string
optimization the implementation may put such in a union, and if you
don't define a constructor then you can't support existing code that
does things like char_type( intValue ), for a generic char_type.

C++11 provides a way out, namely the based enum,

enum EncodingValue: char {};

Then you can both support existing char_type( intValue ) constructs, and
std::basic_string implementations with dirty unions inside.
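
A minimal sketch, just to make the shape of the idea concrete (the
names here are placeholders, not a worked-out design):

enum EncodingValue : char {};   // C++11 "based" enum: a distinct type with the size of char

static_assert(sizeof(EncodingValue) == sizeof(char),
              "fits wherever a char fits, e.g. inside a short-string-optimization union");

int main()
{
    EncodingValue v = EncodingValue(0x41);   // char_type( intValue ) style construction still works
    char raw = static_cast<char>(v);         // and it converts back to the underlying char
    (void)raw;
}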

However, given this beast that Possibly Can Do The Job(TM), the question
now becomes what the job really is. Personally on the most basic level I
want such a type that is defined in a system-specific manner, like
`char` in Linux and like `wchar_t` in Windows. But an alternative is
such a type that is defined like `Uint32` everywhere.

Both have trade-offs.


> As a use case, imagine a GUI Toolkit, what should be the
> type of the Text property for a TextBox?

Oh, I think definitely some type based on a custom encoding value type
as discussed above. But then? The question of what the Job is, is
difficult, and has perhaps many possible answers...

Cheers & hth.,

- Alf

Jean-Marc Bourguet

Jan 6, 2012, 4:04:34 PM
Fulvio Esposito <esposit...@gmail.com> writes:

> Correct me if I'm wrong. std::string and std::wstring member functions
> simply don't work for UTF-8 or UTF-16 encoded strings 'cause they
> wrongly assume "code_point==code_unit" (for example length() returns
> the length of the sequence and not the size of the unicode string,
> operator[] could not return the code point if it's represented by two
> or more code units, etc.).

I think you are wrong for string.

The character handling model in C and C++ is:

* char is used for what Unicode TR17 calls a compound CES whose CEFs can
be variable width (the C and C++ standardese terms indicating this are
"multibyte characters" and "shift states");

* wchar_t is indeed used for a simple CES associated with a CEF of
fixed width one (one code unit per character).

Those two CES are locale dependent.

So with an adequate locale, having UTF-8 data in std::string seems the
correct thing.

C11 and C++11 add UTF-8 string literals (u8" " with type char), UTF-16
string and character literals (u" ", u' ' char16_t) and UTF-32 string
and character literals (U" ", U' ', char32_t). There are some more
things available in the library, but the support is pretty minimal.
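
A quick sketch of those literal forms (assuming a conforming C++11
compiler):

// u8"", u"" and U"" literals and their element types in C++11.
const char*     s8  = u8"caf\u00E9";   // UTF-8 encoded, element type is plain char
const char16_t* s16 = u"caf\u00E9";    // UTF-16 encoded, element type char16_t
const char32_t* s32 = U"caf\u00E9";    // UTF-32 encoded, element type char32_t

int main() { (void)s8; (void)s16; (void)s32; }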

> So in the end, what's the best strategy to handle unicode strings in
> C++? Many suggest to use UTF-8 and std::string,

This corresponds to the intended use with a Unicode locale.

> many others UTF-16 and std::wstring (but on linux wchar_t are often
> 32-bit wide :S),

A locale could meaningfully use UCS-2 with a 16-bit wchar_t, or UTF-32
with a wchar_t of more than 21 bits.

> As a use case, imagine a GUI Toolkit, what should be the type of the
> Text property for a TextBox?

For the interface, I'd tend to use the locale mechanism appropriately,
i.e. accepting and returning std::string and std::wstring in the narrow
and wide encoding of the global locale (or a user-defined one, but I
don't think it's worth the pain).

Internally I'd convert to a Unicode representation (with a fast path
without conversion for Unicode locales, which are in common use
nowadays). To do the conversion to Unicode, you'll have to rely on
implementation dependence, as there is no C++ way to get access to the
CES used by a given locale (for instance, POSIX has nl_langinfo, which
gives that information for the C locale mechanism) nor to convert it
to Unicode (C11 and C++11 have mbrtoc16 and mbrtoc32, but there is no
interface with the C++ locale mechanism and I've no idea how widely
available they are -- they come from an earlier TR on the subject,
which could have helped their availability --; POSIX has iconv for the
conversion).
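
For illustration only, a rough sketch of the POSIX/iconv route
(assuming a glibc-style iconv, a little-endian host, and that
setlocale(LC_ALL, "") has already been called; error handling
omitted):

#include <iconv.h>
#include <langinfo.h>
#include <string>
#include <vector>

std::u32string narrow_to_utf32(const std::string& in)
{
    // Convert from the current locale's narrow encoding to UTF-32.
    iconv_t cd = iconv_open("UTF-32LE", nl_langinfo(CODESET));
    std::vector<char32_t> out(in.size() + 1);
    char* src = const_cast<char*>(in.data());
    char* dst = reinterpret_cast<char*>(out.data());
    size_t src_left = in.size();
    size_t dst_left = out.size() * sizeof(char32_t);
    iconv(cd, &src, &src_left, &dst, &dst_left);   // error handling omitted
    iconv_close(cd);
    size_t bytes_written = out.size() * sizeof(char32_t) - dst_left;
    return std::u32string(out.data(), bytes_written / sizeof(char32_t));
}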

Yours,

--
Jean-Marc

Thiago Adams

Jan 6, 2012, 4:10:10 PM
> I was thinking about writing a GUI Toolkit, and how to
> tackle localization problems. If I write the toolkit
> and I wanna use only standard library, how can I deal
> best with Unicode strings? What's a resonably way to
> handle it?

Use wstring; it will work.

You may have problems if you have to save or transmit strings; in that
case you have to choose an encoding.

The sample below shows how to save and load a std::wstring using UTF-8
encoding.


#include <fstream>
#include <codecvt>
#include <string>
using namespace std;

int main()
{
    // writing
    {
        std::locale ulocale(locale(), new codecvt_utf8<wchar_t>);
        std::wofstream ofs("test.txt");
        ofs.imbue(ulocale);
        ofs << L"maçã"; // apple in Portuguese
    }

    // reading
    {
        std::locale ulocale(locale(), new codecvt_utf8<wchar_t>);
        std::wifstream ifs("test.txt");
        ifs.imbue(ulocale);
        std::wstring ws;
        std::getline(ifs, ws);
    }
}


---
http://www.thradams.com/

cpp4ever

Jan 6, 2012, 8:25:04 PM
On 05/01/12 17:05, Fulvio Esposito wrote:
> Hi all,
> I was recently studying unicode and internationalization and
> some questions come to my mind about C++ string.
>
> Correct me if I'm wrong. std::string and std::wstring member
> functions simply don't work for UTF-8 or UTF-16 encoded
> strings 'cause they wrongly assume "code_point==code_unit"
> (for example length() returns the length of the sequence
> and not the size of the unicode string, operator[] could
> not return the code point if it's represented by two or
> more code units, etc.).
>
> So in the end, what's the best strategy to handle unicode
> strings in C++? Many suggest to use UTF-8 and std::string,
> many others UTF-16 and std::wstring (but on linux wchar_t
> are often 32-bit wide :S), ICU uses UTF-16 by default but
> has its own UnicodeString.
>
> As a use case, imagine a GUI Toolkit, what should be the
> type of the Text property for a TextBox?
>
> Fulvio Esposito
>
>

Hmmmm, not something I'd like to try to do, but then I use the Qt GUI
toolkit, which already provides Unicode string handling via QString.
Rest assured your concerns are well justified; I have experienced
problems with Japanese characters when Unicode was not maintained for
all strings.

regards

cpp4ever

Martin B.

Jan 12, 2012, 2:50:18 PM
On 06.01.2012 21:57, Alf P. Steinbach wrote:
> The Windows meanings of the built-in types are at odds with C++98, e.g.
> the arguments to `main`, and they're at odds with C++11, e.g. `u8`
> literals, which produce entirely the wrong type in Windows.

I'm confused. Which VS version has u8 string literals implemented?
And in what way is it broken?

cheers,
Martin

Alf P. Steinbach

Jan 12, 2012, 6:06:00 PM
On 12.01.2012 20:50, Martin B. wrote:
> On 06.01.2012 21:57, Alf P. Steinbach wrote:
>> The Windows meanings of the built-in types are at odds with C++98, e.g.
>> the arguments to `main`, and they're at odds with C++11, e.g. `u8`
>> literals, which produce entirely the wrong type in Windows.
>
> I'm confused. Which VS version has u8 string literals implemented?
> And in what way is it broken?

AFAIK no current version of Visual C++ implements the new C++11 string
literal prefixes (up to and including the technical preview of MSVC
11), although the new C++11 types are there.

MinGW g++ 4.6.1 for Windows does, however, implement the u8 prefix.

C++11 §2.14.5/7 defines the type of a u8 string literal as an array of
`char`. That works nicely for the *nix world, where `char` now by
default means UTF-8 encoding. In Windows, however, `char` means
Windows ANSI encoding (i.e., that's the execution character set for
MSVC).

That means that the C++ type checking does not prevent you from ending
up with gobbledygook, treating a UTF-8 encoded string as Windows ANSI.
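
A tiny illustration of that gap (hypothetical, assuming a compiler
that implements u8 literals, such as MinGW g++ 4.6.1): both pointers
below have exactly the same C++ type, so the compiler cannot tell the
encodings apart.

const char* ansi = "bl\u00E5b\u00E6r";    // narrow literal: execution character set bytes
const char* utf8 = u8"bl\u00E5b\u00E6r";  // u8 literal: UTF-8 bytes, yet the same type
// Any function taking const char* accepts either; the encoding mismatch is invisible.

int main() { (void)ansi; (void)utf8; }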

One might say that the `char` type was inadvertently overloaded with
too many meanings (default single-byte character set encoding value
type, byte, UTF-8 value type), but given that the problems with
overloaded `char` meanings are well known, the addition of an extra
meaning that will only surface as problematic in Windows smells a bit
of politics to me -- and if so, it probably means: difficult to fix...

Cheers,

- Alf

Jean-Marc Bourguet

Jan 13, 2012, 9:25:06 PM
"Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:

> C++11 §2.14.5/7 defines the type of an u8 string literal as an array of
> `char`. That works nicely for the *nix world, where `char` now by default
> means UTF-8 encoding. In Windows, however, `char` means Windows ANSI
> encoding (e.g., that's the execution character set for MSVC).

char in C++ means encoded in a multibyte, possibly stateful encoding
which depends on the locale. In the "C" locale it is often just ASCII
(7 bits).

> That means that the C++ type checking does not prevent you ending up with
> gobbledygook, treating an UTF-8 encoded string as Windows ANSI.

What is ANSI? Code page 1250? 1251? 1252? 1253? Something else (Shift
JIS, for instance)? One of those, dependent on the Windows version and
configuration? I fear it is the latter.

> One might say that the `char` type was inadvertently too much overloaded
> with meanings (default single-byte character set encoding value type,
> byte, UTF-8 value type), but given that the problems with overloaded
> `char` meanings are well known the addition of an extra meaning that will
> only surface as problematic in Windows, smells a bit of politics to me --
> and if so it probably means: difficult to fix...

In the standard model (from C89 and Amd1 in 94/95) there was never an
intention to make the encoding part of the type. Just two encodings
per locale: a stateful multibyte one in char, a one-unit-per-character
one in wchar_t. And the precise encoding choice has been dependent on
the locale for as long as there has been encoding support in C and C++
(you can set things up so that wchar_t in some locales is related to
one of the EUC encodings -- those are variable length, but IIRC just
prepending 0 bytes to make them fixed width will work -- and in others
to UTF-32, for instance). The encoding used for literals has always
been implementation dependent.

IMHO, the problem isn't choosing between alternate models better suited to
the current days, it is finding one which provides a path of transition
from the current one without impacting those who depend on it.

Yours,

--
Jean-Marc

Alf P. Steinbach

Jan 14, 2012, 4:47:13 PM
On 14.01.2012 03:25, Jean-Marc Bourguet wrote:
> "Alf P. Steinbach"<alf.p.stein...@gmail.com> writes:
>
>> C++11 §2.14.5/7 defines the type of an u8 string literal as an array of
>> `char`. That works nicely for the *nix world, where `char` now by default
>> means UTF-8 encoding. In Windows, however, `char` means Windows ANSI
>> encoding (e.g., that's the execution character set for MSVC).
>
> char in C++ means encoded in a multibyte statefull encoding which depend on
> the locale. In "C" locale it is often just ASCII (7 bits).

Yes, you can say that the problem resides with the standard not
reflecting and catering to actual practice.


>> That means that the C++ type checking does not prevent you ending up with
>> gobbledygook, treating an UTF-8 encoded string as Windows ANSI.
>
> What is ANSI? Code page 1250? 1251? 1252? 1253? Something else (Shift JIS
> for instance)? One of those dependent on the Windows version and
> configuration? I fear it is the later one.

Right you are: it's a locale dependent encoding.

As a result, with Visual C++ having that encoding as its C++ narrow
character execution character set, the executable that you get when
you build with the Visual C++ compiler depends on the configured
locale, i.e. the same source code produces different binary
executables.

Oh, that was just trivia, but it serves to illustrate that this is
truly a mess, not just in the C++ standard.


>> One might say that the `char` type was inadvertently too much overloaded
>> with meanings (default single-byte character set encoding value type,
>> byte, UTF-8 value type), but given that the problems with overloaded
>> `char` meanings are well known the addition of an extra meaning that will
>> only surface as problematic in Windows, smells a bit of politics to me --
>> and if so it probably means: difficult to fix...
>
> In the standard model (from C89 and Amd1 in 94/95) there was never an
> intention to have the encoding part of the type. Just two encodings per
> locale, a statefull multibyte one in char, a one unit per character one in
> wchar_t. And the precise encoding choice has been dependent on the locale
> for as long as there was encoding support in C and C++ (you can setup
> things so that wchar_t in some locales is related to one of the EUC
> encodings -- those are variable length but IIRC just prepending 0 bytes to
> make them fixed width will work --, in other UTF-32 for instance). The
> encoding used for litterals has always been implementation dependant.

The encoding for `char` literals is implementation dependent, yes,
because the C++ execution character set is implementation dependent.

But you could rely on the encoding of `char` literals being the C++
execution character set.

Now, with C++11, you cannot rely on that.


> IMHO, the problem isn't choosing between alternate models better suited to
> the current days, it is finding one which provides a path of transition
> from the current one without impacting those who depend on it.

That sentence sounds as if finding a better way would be somehow
difficult. Well, that's meaningless and highly misleading: the C++11
standard does employ a better way for the other new prefixes, just not
for "u8", which will probably cause a bit of trouble for Windows
programmers. The sentence above also sounds as if finding a better way
is somehow in conflict with finding something better suited to the
present day, and that too is meaningless and highly misleading.


Cheers & hth.,

- Alf



Martin B.

Jan 14, 2012, 4:57:33 PM
On 13.01.2012 00:06, Alf P. Steinbach wrote:
> On 12.01.2012 20:50, Martin B. wrote:
>> On 06.01.2012 21:57, Alf P. Steinbach wrote:
>>> The Windows meanings of the built-in types are at odds with C++98, e.g.
>>> the arguments to `main`, and they're at odds with C++11, e.g. `u8`
>>> literals, which produce entirely the wrong type in Windows.
>>
>> I'm confused. Which VS version has u8 string literals implemented?
>> And in what way is it broken?
>
> AFAIK no current version of Visual C++ implements the new C++11 string
> literal prefixes (up to and including the technical preview of MSVC 11),
> although the new C++11 types are there.
>
> MinGW g++ 4.6.1 for Windows does, however, implement the u8 prefix.
>
> C++11 §2.14.5/7 defines the type of an u8 string literal as an array of
> `char`. That works nicely for the *nix world, where `char` now by
> default means UTF-8 encoding. In Windows, however, `char` means Windows
> ANSI encoding (e.g., that's the execution character set for MSVC).
>
> That means that the C++ type checking does not prevent you ending up
> with gobbledygook, treating an UTF-8 encoded string as Windows ANSI.
>

Ah yes. Now I remember. I actually started a thread a while back
regarding this:

http://groups.google.com/group/comp.lang.c++.moderated/browse_thread/thread/e0206cac5b8c1417/152a0e2f7a0dd8ed

and

http://groups.google.com/group/comp.std.c++/browse_thread/thread/24c6cc6ae3713c94/1b0e33fbf0f8120c

Personally I feel it was a *very* bad decision not to have a distinct
UTF-8 character (literal) type. (I mean, we have char16_t and char32_t,
why the hell not char8_t and be done with it!)

cheers,
Martin

Miles Bader

Jan 15, 2012, 8:13:25 AM
"Martin B." <0xCDC...@gmx.at> writes:
> Personally I feel it was a *very* bad decision not to have a distinct
> UTF-8 character (literal) type. (I mean, we have char16_t and char32_t,
> why the hell not char8_t and be done with it!)

Presumably the issue is that in sane (non-MS) implementations, utf-8
literals work perfectly well with existing char-based infrastructure,
and an increasingly large number of interfaces simply assume all char*
strings are encoded using utf-8, and they didn't want the giant ball
of hair that would come with a really distinct type.

[Granted, MS-style wide-strings result in a similar giant ball of hair
("hey, can you duplicate all yer interfaces and datatypes, only with
wchar_t? ... hey, now how about with char16_t? ... er, hey, .."), but
if anything that serves as a _warning_...]

Maybe they could have had some sort of automatic promotion
(automagically converting char8_t* to char*) and made it all work
out, I dunno, but given the widespread use of utf-8 in char* strings
and the potential for snowballing complexity, this may be the most
practical route-of-least-effort.

-Miles

--
Come now, if we were really planning to harm you, would we be waiting here,
beside the path, in the very darkest part of the forest?

Jean-Marc Bourguet

Jan 15, 2012, 8:17:00 AM
"Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:

> On 14.01.2012 03:25, Jean-Marc Bourguet wrote:
>> "Alf P. Steinbach"<alf.p.stein...@gmail.com> writes:
>>
>>> C++11 §2.14.5/7 defines the type of an u8 string literal as an array of
>>> `char`. That works nicely for the *nix world, where `char` now by default
>>> means UTF-8 encoding. In Windows, however, `char` means Windows ANSI
>>> encoding (e.g., that's the execution character set for MSVC).
>>
>> char in C++ means encoded in a multibyte statefull encoding which depend on
>> the locale. In "C" locale it is often just ASCII (7 bits).
>
> Yes, you can say that the problem resides with the standard not
> reflecting and catering for the in-practice.

AFAIK, what was standardized in C89 and Amd1 was the existing practice
at the time. And I'd not be surprised if the practice is continued by
those who started it.

> Oh, that was just trivia, but it serves to illustrate that this is truly
> a mess, not just in the C++ standard.

The mess in character encoding issues started in the 19th century at
the latest.

> The encoding for `char` literals is implementation dependent yes, because
> the C++ execution character set is implementation dependent.
>
> But you could rely on the encoding for `char` literals being the C++
> execution character set.
>
> Now with C++11 you can not rely on that.

"The execution character set" has a content which is locale dependant, in
other words there is one of them per locale. If one want to use the
Unicode terminology (instead of the, IMHO misguiding, standard one): there
is one ACR per locale with two CES:

- a narrow one using char, which is a variable width (multibyte) and
compound (notion of shift state) onem

- a wide one using wchar_t, which is a simple one of fixed width one.

You have some constraints on the encoding used by these two CES for the
characters in the basic execution character set, but they don't say
anything about what isn't there.

>> IMHO, the problem isn't choosing between alternate models better suited to
>> the current days, it is finding one which provides a path of transition
>> from the current one without impacting those who depend on it.
>
> That sentence sounds as if finding a better way would be somehow
> difficult.

I'll try to be clearer. Finding a better model is easy. Finding a
better model for which there is an easy path of transition for those
who use the current model isn't. Deciding to have a hard transition
for those who use the current model isn't either (there was enough
opposition to prevent the removal of the trigraphs, and that was far
more consensual).

Yours,

--
Jean-Marc

Alf P. Steinbach

Jan 15, 2012, 6:02:34 PM
On 15.01.2012 14:17, Jean-Marc Bourguet wrote:
> "Alf P. Steinbach"<alf.p.stein...@gmail.com> writes:
>
[snipped low s/n part]
>
>>> IMHO, the problem isn't choosing between alternate models better suited to
>>> the current days, it is finding one which provides a path of transition
>>> from the current one without impacting those who depend on it.
>>
>> That sentence sounds as if finding a better way would be somehow
>> difficult.
>
> I'll try to be clearer. Finding a better model is easy. Finding a better
> model for which there is an easy path of transition for those who use the
> current model isn't.

I'm sorry but that is wrong and highly misleading.

C++11 does use a better model for the other new prefixes.

Using that model for "u8" would not make any transition harder.

On the contrary, not using that model is, to the extent that lack of
type checking is a problem, a problem for a transition to C++11 on
Windows.

Contrary to your claim, finding a better model was trivial: it was
already there, adopted for the similar prefixes.


> Deciding to have a hard transition for those who use
> the current model isn't either

I can't make heads or tails of that, sorry.


Cheers & hth.,

- Alf



Martin B.

Jan 15, 2012, 6:14:07 PM
On 15.01.2012 14:13, Miles Bader wrote:
> "Martin B."<0xCDC...@gmx.at> writes:
>> Personally I feel it was a *very* bad decision not to have a distinct
>> UTF-8 character (literal) type. (I mean, we have char16_t and char32_t,
>> why the hell not char8_t and be done with it!)
>
> Presumably the issue is that in sane (non-MS) implementations, utf-8
> literals work perfectly well with existing char-based infrastructure,
> and an increasingly large number of interfaces simply assume all char*
> strings are encoded using utf-8,

Examples! Numbers! :-)

>
> [Granted, MS-style wide-strings result in a similar giant ball of hair
> but if anything that serves as a _warning_...]
>
> Maybe they could have ...
> made it all work
> out, I dunno, but given the widespread use of utf-8 in char* strings

Examples! Numbers! :-)

> and the potential for snowballing complexity, this may be the most
> practical route-of-least-effort.
>

I still fail to see the point.

Since you seem to imply that char* == utf-8 is very widespread, let me
throw in two statements:

+ libxml2 -- as far as I can tell a widespread XML library -- uses
*unsigned* char as its utf-8 datatype, and specifically *not* char.

+ I strongly believe that *most* Windows C++ applications that use
narrow char (`char`) do *not* use it as utf-8. Indeed, I strongly
believe that there are a bazillion Windows C++ apps out there for
which the situation `char == utf-8` is completely broken:

++ Most Windows apps that (still) use narrow char to interface with the
narrow Windows API versions would not use char with utf-8.

++ Last I checked, very many programs on Windows that write text files
for some purpose do so in a narrow 8-bit encoding on a western Windows
(some variation of the ISO Latin encoding). I think we can assume
these programs use char for those strings, and it's not utf-8 either.


To sum up, and to phrase it a bit more strongly:

The fact that C++11 introduces char16_t and char32_t but no char8_t is
crappy.

The fact that u8"" literals map to `char` of all things is rather
horrible! I, personally, would be better served if they mapped to
`unsigned char`, but I can imagine that this could create problems
elsewhere.


cheers,
Martin



James K. Lowden

Jan 19, 2012, 5:56:03 PM
On Sun, 15 Jan 2012 05:13:25 -0800 (PST)
Miles Bader <mi...@gnu.org> wrote:

> utf-8
> literals work perfectly well with existing char-based infrastructure,
> and an increasingly large number of interfaces simply assume all char*
> strings are encoded using utf-8,

Well, not "perfectly", right? char* strings and std::string lack
character semantics when used with utf-8. That is,

std::string::operator[]

returns a byte, and no operator returns a character.

> and they didn't want the giant ball
> of hair that would come with a really distinct type.

Why is a distinct type of character a hairball? With a basic type such
as

template <typename C>
class code_point
{
    enum encoding_t { ... } encoding;
    C value;
};

a class std::encoded_string could be derived from basic_string with
code_point as the first template argument. encoded_string then has
character semantics, and every code_point carries enough information to
map it to another encoding.

We talk about environments and strings having an encoding, but really,
by definition, each character is encoded. ISTM representing that
reality in a class is a very basic OO choice, not a hairball at all.

--jkl

