On 21.11.2012 21:41, Zhihao Yuan wrote:
> On Friday, November 16, 2012 3:00:02 AM UTC-6, Martin B. wrote:
>> So, is there any library out there that does this? Does anyone use
>> this? Would it make sense to just define overloaded output
>> operators so that cout would also accept wide strings?
>
> To define overloads can only fix the stream library. Actually we
> can go further, by eliminating the gaps among the four different
> kinds of string. Beman Dawes has an elegant proposal, ``String
> Interoperation Library''
> <http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3398.html>,
> for this.
For anyone not wishing to read the whole thing, I think one point
alone deserves explicit quoting:
(http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3398.html#comp-UTF-8)
> Explicit UTF-8 encoded types char8_t and u8string
>
> Specifies a character type and a string type that are unambiguously
> UTF-8 encoded.
>
> UTF-8 is the most important, and often the only, byte-sized
> character encoding required by many internationalized
> applications. Yet it is the only one of the critical Unicode
> encodings (UTF-8, UTF-16, UTF-32) that does not have its own C++
> character type. This causes endless technical problems, such as the
> inability to overload on a UTF-8 character type, for those who want
> to write portable code. It causes developers who otherwise think
> highly of C++ to believe the standards committee is stuck in the
> distant past when dinosaurs roamed the earth.
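To make the overloading point concrete, here is a minimal sketch. The names Encoding and encoding_of are made up for illustration and do not come from N3398. It only shows that UTF-16, UTF-32 and wide strings each select their own overload, while UTF-8 bytes stored in a char array cannot be told apart from any other narrow string:

```cpp
// Hypothetical tag type and function, for illustration only.
enum class Encoding { Narrow, Wide, Utf16, Utf32 };

// One overload per character type the language currently distinguishes.
constexpr Encoding encoding_of(const char*)     { return Encoding::Narrow; }
constexpr Encoding encoding_of(const wchar_t*)  { return Encoding::Wide; }
constexpr Encoding encoding_of(const char16_t*) { return Encoding::Utf16; }
constexpr Encoding encoding_of(const char32_t*) { return Encoding::Utf32; }

// UTF-16 and UTF-32 literals pick their own overloads.
static_assert(encoding_of(u"\u00E9") == Encoding::Utf16, "UTF-16 is distinguishable");
static_assert(encoding_of(U"\u00E9") == Encoding::Utf32, "UTF-32 is distinguishable");

// The UTF-8 bytes for U+00E9, held in a plain char array, land in the
// same overload as any other narrow string. There is nothing to overload on.
static_assert(encoding_of("\xC3\xA9") == Encoding::Narrow, "UTF-8 bytes look like narrow text");
static_assert(encoding_of("plain")    == Encoding::Narrow, "ordinary narrow text");
```

With a dedicated UTF-8 character type, a fifth overload could be added and the last two cases would no longer collide.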
To which I might add a quote of a thread I started in 2010:
(https://groups.google.com/forum/?fromgroups=#!topic/comp.lang.c++.moderated/4CBsrFuMFBc)
> From: Seungbeom Kim <musip...@bawi.org>
> Newsgroups: comp.lang.c++.moderated
> Subject: Re: Should C++0x contain a distinct type for UTF-8?
> Date: Tue, 24 Aug 2010 17:55:17 CST
> On 2010-08-22 13:15, Martin B. wrote:
> >
> > Should C++0x contain a distinct type for UTF-8?
> >
> > Current draft N3092 specifies:
> > + char16_t* for UTF-16
> > + char32_t* for UTF-32
> > + char* for execution narrow-character set
> > + wchar_t* for execution wide-character set
> > + unsigned char*, possibly for raw data buffers etc.
> >
> > a) Wouldn't it make sense to have a char8_t where char8_t arrays
> > would hold UTF-8 character sequences exclusively?
>
> I guess so, just as char16_t and char32_t do for UTF-16 and UTF-32.
>
> At least, char8_t could be made an unsigned integer type! (That is,
> a distinct type with the same representation as uint_least8_t.)
> Having to cast to unsigned char for any serious byte handling
> remains to be one of my biggest pet peeves.
>
> > b) What is the rationale for not including it?
>
> Probably because that's what the C committee did[N1040], I guess.
> C has had a tendency to introduce new character types via typedefs,
> such as wchar_t, char16_t, and char32_t (hence the suffix "_t"),
> which works well for C because it doesn't have overloading anyway.
> (...)
>
> Things are different in C++: it introduces new character types as
> distinct types, and it supports overloading. So I believe C++ could
> benefit from a separate char8_t type. However, it doesn't seem to
> have been done and I do not know whether introduction of char8_t
> has ever been discussed in one of the technical papers, or WG14's
> N1040 was adopted with just as much "translation" as necessary.
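As a rough illustration of the typedef-versus-distinct-type point, the sketch below checks that C++ treats char16_t as its own type rather than an alias, and then fakes a distinct unsigned 8-bit type with a scoped enum. The name utf8_unit and the function is_utf8_unit are hypothetical stand-ins of mine, not anything from N1040 or a committee paper:

```cpp
#include <cstdint>
#include <type_traits>

// In C++, char16_t is a distinct type, not an alias for uint_least16_t,
// even though both share the same representation.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value,
              "char16_t is not a typedef in C++");

// Hypothetical stand-in for a char8_t-like type: distinct from
// unsigned char, yet with the same size and an unsigned underlying type.
enum class utf8_unit : unsigned char {};

static_assert(!std::is_same<utf8_unit, unsigned char>::value,
              "utf8_unit is a type of its own");
static_assert(sizeof(utf8_unit) == sizeof(unsigned char),
              "same size as unsigned char");
static_assert(std::is_unsigned<std::underlying_type<utf8_unit>::type>::value,
              "underlying type is unsigned");

// Because the type is distinct, it can take part in overload resolution.
constexpr bool is_utf8_unit(utf8_unit) { return true; }
constexpr bool is_utf8_unit(char)      { return false; }

static_assert(is_utf8_unit(static_cast<utf8_unit>(0xC3)), "UTF-8 code unit");
static_assert(!is_utf8_unit('a'), "plain narrow character");
```

A scoped enum is only a workaround here: it has no literal syntax and no implicit conversions, which is part of why a built-in type would be preferable.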
So I might ask whether there is a current proposal to add a char8_t?
cheers,
Martin