
Are there any ideas out there for character-type independent streams (for std C++)?


Martin Ba

Nov 16, 2012, 3:55:18 AM
I'll keep this short:

I remember reading a statement that I found to make sense that was along
the lines of:

Quote Unknown:
> ... cout and wcout don't make sense. You should
> have one (output) stream and just be able to
> pass it both narrow and wide (strings) like:
>
> wstring aWcharString( ... );
> string aCharString( ... );
> char16_t const* aUTF16String = u"...";
> unified_out << aWcharString << aCharString << aUTF16String;
>

So, is there any library out there that does this? Does anyone use this?
Would it make sense to just define overloaded output operators so that
cout would also accept wide strings?

cheers,
Martin


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Francis Glassborow

Nov 21, 2012, 2:56:37 PM
On 16/11/2012 08:55, Martin Ba wrote:
> I'll keep this short:
>
> I remember reading a statement that I found to make sense that was along
> the lines of:
>
> Quote Unknown:
>> ... cout and wcout don't make sense. You should
>> have one (output) stream and just be able to
>> pass it both narrow and wide (strings) like:

I think that is missing the point. It isn't about the source but the
destination. Consider:

#include <iostream>
using namespace std;

int main() {
    int i{0};
    cout << i << endl;  // 1
    wcout << i << endl; // 2
    return 0;
}

Line 1 generates output differently from line 2 (or at least can do
so). This is possibly more noticeable when using file streams, where
the output needs to be in a form that can be recovered with a
corresponding input function/object.

There is no inherent reason that cout could not handle a wstring, but it
would do so differently from the way that wcout would handle it.
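
A small sketch of the difference, using string streams instead of cout and wcout so the result can be inspected. The helper names `narrow_bytes` and `wide_bytes` are made up for illustration. This only shows the in-memory representation; what actually reaches a file or terminal also depends on the codecvt facet of the stream's locale.

```cpp
#include <cstddef>
#include <sstream>

// Bytes occupied by the characters a narrow stream produces for `value`.
inline std::size_t narrow_bytes(int value) {
    std::ostringstream os;
    os << value;
    return os.str().size() * sizeof(char);
}

// Bytes occupied by the characters a wide stream produces for the same value.
inline std::size_t wide_bytes(int value) {
    std::wostringstream os;
    os << value;
    return os.str().size() * sizeof(wchar_t);
}
```

On a platform where `sizeof(wchar_t)` is greater than 1, the same integer therefore occupies more storage in the wide stream's buffer than in the narrow one.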

Francis

Zhihao Yuan

Nov 21, 2012, 3:41:06 PM
On Friday, November 16, 2012 3:00:02 AM UTC-6, Martin B. wrote:
> So, is there any library out there that does this? Does anyone use this?
> Would it make sense to just define overloaded output operators so that
> cout would also accept wide strings?

Defining overloads can only fix the stream library. Actually, we can
go further, by eliminating the gaps among the four different kinds of
string. Beman Dawes has an elegant proposal, ``String Interoperation
Library''
<http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3398.html>,
for this.

Ulrich Eckhardt

Nov 21, 2012, 6:35:46 PM
On 16.11.2012 09:55, Martin Ba wrote:
> I remember reading a statement that I found to make sense that was along
> the lines of:
>
> Quote Unknown:
>> ... cout and wcout don't make sense. You should
>> have one (output) stream and just be able to
>> pass it both narrow and wide (strings) like:
>>
>> wstring aWcharString( ... );
>> string aCharString( ... );
>> char16_t const* aUTF16String = u"...";
>> unified_out << aWcharString << aCharString << aUTF16String;
>>
>
> So, is there any library out there that does this? Does anyone use this?
> Would it make sense to just define overloaded output operators so that
> cout would also accept wide strings?

You can write strings of type char, unsigned char and signed char to
wide-character streams. That said, I don't know if you can write
std::string to wide-character streams, but I wouldn't mind adding an
overload.
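
A short sketch of the first point, assuming the default "C" locale. The helper name `widen_via_stream` is made up. The standard library provides an `operator<<` that inserts a `const char*` into a wide stream by widening each character; the `operator<<` for `std::basic_string` requires matching character types, so a `std::string` has to go through `c_str()`.

```cpp
#include <sstream>
#include <string>

// Insert a narrow C string into a wide stream; each char is widened
// through the stream's locale before it is stored.
inline std::wstring widen_via_stream(const char* narrow) {
    std::wostringstream os;
    os << narrow;
    return os.str();
}

// A std::string is routed through c_str() to reach the same overload.
inline std::wstring widen_via_stream(const std::string& narrow) {
    return widen_via_stream(narrow.c_str());
}
```

This is only safe for characters that widen one-to-one, such as plain ASCII. A UTF-8 multibyte sequence would be widened byte by byte, which is not a conversion to UTF-16 or UTF-32.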

The problem is that sometimes, std::string actually contains UTF-8,
while in other cases it contains ASCII or various codepages. Since the
string doesn't know its encoding (neither the type nor the actual
content define it), it requires guessing. While wchar_t isn't much
better off, there are effectively only two encodings for it, UTF-16 and
UTF-32, that are easily distinguished via the target platform at compile
time.
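
One way the compile-time distinction could be sketched is to look at the range of wchar_t. This is a heuristic resting on an assumption: the standard does not mandate any particular encoding for wchar_t, so the function name and the labels it returns are only illustrative.

```cpp
#include <cwchar>   // WCHAR_MAX
#include <string>

// True if a single wchar_t can hold every Unicode code point (up to U+10FFFF).
constexpr bool wchar_holds_any_code_point = WCHAR_MAX >= 0x10FFFF;

// Heuristic guess at the encoding used for wide strings on this platform.
inline std::string likely_wchar_encoding() {
    return wchar_holds_any_code_point ? "UTF-32" : "UTF-16";
}
```

Typical desktop platforms fall into one of the two branches, but a 16-bit wchar_t says nothing about whether surrogate pairs are actually handled.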

Uli

Martin Ba

Nov 22, 2012, 2:24:09 PM
On 21.11.2012 21:41, Zhihao Yuan wrote:
> On Friday, November 16, 2012 3:00:02 AM UTC-6, Martin B. wrote:
>> So, is there any library out there that does this? Does anyone use
>> this? Would it make sense to just define overloaded output
>> operators so that cout would also accept wide strings?
>
> To define overloads can only fix the stream library. Actually we
> can go further, by eliminating the gaps among the four different
> kinds of string. Beman Dawes has an elegant proposal, ``String
> Interoperation Library''
> <http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3398.html>,
> for this.

For anyone not wishing to read the whole thing, I think one point
alone deserves explicit quoting:

(
http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3398.html#comp-UTF-8

)

> Explicit UTF-8 encoded types char8_t and u8string
>
> Specifies a character type and a string type that are unambiguously
> UTF-8 encoded.
>
> UTF-8 is the most important, and often the only, byte -sized
> character encoding required by many internationalized
> applications. Yet it is the only one of the critical Unicode
> encodings (UTF-8, UTF-16, UTF-32) that does not have its own C++
> character type. This causes endless technical problems, such as the
> inability to overload on a UTF-8 character type, for those who want
> to write portable code. It causes developers who otherwise think
> highly of C++ to believe the standards committee is stuck in the
> distant past when dinosaurs roamed the earth.

To which I might add a quote of a thread I started in 2010:

(
https://groups.google.com/forum/?fromgroups=#!topic/comp.lang.c++.moderated/4CBsrFuMFBc

)

From: Seungbeom Kim <musip...@bawi.org>
> Newsgroups: comp.lang.c++.moderated
> Subject: Re: Should C++0x contain a distinct type for UTF-8?
> Date: Tue, 24 Aug 2010 17:55:17 CST
> On 2010-08-22 13:15, Martin B. wrote:
> >
> > Should C++0x contain a distinct type for UTF-8?
> >
> > Current draft N3092 specifies:
> > + char16_t* for UTF-16
> > + char32_t* for UTF-32
> > + char* for execution narrow-character set
> > + wchar_t* for execution wide-character set
> > + unsigned char*, possibly for raw data buffers etc.
> >
> > a) Wouldn't it make sense to have a char8_t where char8_t arrays
> > would hold UTF-8 character sequences exclusively?
>
> I guess so, just as char16_t and char32_t do for UTF-16 and UTF-32.
>
> At least, char8_t could be made an unsigned integer type! (That is,
> a distinct type with the same representation as uint_least8_t.)
> Having to cast to unsigned char for any serious byte handling
> remains to be one of my biggest pet peeves.
>
> > b) What is the rationale for not including it?
>
> Probably because that's what the C committee did[N1040], I guess.
> C has had a tendency to introduce new character types via typedefs,
> such as wchar_t, char16_t, and char32_t (hence the suffix "_t"),
> which works well for C because it doesn't have overloading anyway.
> (...)
>
> Things are different in C++: it introduces new character types as
> distinct types, and it supports overloading. So I believe C++ could
> benefit from a separate char8_t type. However, it doesn't seem to
> have been done and I do not know whether introduction of char8_t
> has ever been discussed in one of the technical papers, or WG14's
> N1040 was adopted with just as much "translation" as necessary.
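
The cast mentioned in the quoted message can be illustrated with a small sketch. The helper names are hypothetical. Because plain char may be signed, bytes of a UTF-8 sequence above 0x7F can appear as negative values, so the usual workaround is to go through unsigned char before doing any bit tests.

```cpp
// Numeric value of a byte stored in a plain char, independent of char's signedness.
inline unsigned byte_value(char c) {
    return static_cast<unsigned char>(c);
}

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
inline bool is_utf8_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0u) == 0x80u;
}
```

With an unsigned code unit type, these casts would not be needed.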

So I might ask whether there is a current proposal to add a char8_t?

cheers,
Martin

Bo Persson

Nov 23, 2012, 3:59:39 PM
No, I don't think so.

We already have three 8-bit char types, some of which CAN be used with a
UTF-8 encoding. The committee was very reluctant to add a fourth 8-bit
char type, possibly identical to one or more of the old ones on some
platforms.

It's already confusing enough.


Bo Persson

Francis Glassborow

Nov 23, 2012, 9:15:34 PM
>
> No, I don't think so.
>
> We already have three 8-bit char types, some of which CAN be used with a
> UTF-8 encoding. The committee was very reluctant to add a fourth 8-bit
> char type, possibly identical to one or more of the old ones on some
> platforms.
>
> It's already confusing enough.


Even more so because C++ does not have any 8-bit types :) Granted the
three char types only require 8 bits, but they can be (and on some
platforms are) wider.

The basic question is whether we need a pure 8-bit type (one that is
required to be only 8 bits wide).

Francis

Daniel Krügler

Nov 24, 2012, 1:48:11 PM
On 24.11.2012 03:15, Francis Glassborow wrote:
>> We already have three 8-bit char types, some of which CAN be used with a
>> UTF-8 encoding. The committee was very reluctant to add a fourth 8-bit
>> char type, possibly identical to one or more of the old ones on some
>> platforms.
>>
>> It's already confusing enough.
>
> Even more so because C++ does not have any 8-bit types :) Granted the
> three char types only require 8 bits, but they can be (and on some
> platforms are) wider.

I *think* that Bo actually meant: three char types with at least 8-bit
width.

> The basic question is whether we need a pure 8-bit type (one that is
> required to be only 8 bits wide).

I really see no reason for that. It would break several compilers that
we support (having CHAR_BIT of 32).
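
For reference, a sketch of how code can check what the implementation actually provides. The standard only guarantees that CHAR_BIT is at least 8. The constant name `char_is_exactly_8_bits` is made up for this example.

```cpp
#include <climits>  // CHAR_BIT
#include <limits>

// Guaranteed by the standard: a char has at least 8 bits.
static_assert(CHAR_BIT >= 8, "char must have at least 8 bits");

// unsigned char uses every bit of its storage as a value bit.
static_assert(std::numeric_limits<unsigned char>::digits == CHAR_BIT,
              "unsigned char has no padding bits");

// True only on platforms where a char is exactly one octet.
constexpr bool char_is_exactly_8_bits = (CHAR_BIT == 8);
```

Code that really needs octets can test `char_is_exactly_8_bits` and refuse to compile otherwise, instead of requiring a new type from the language.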

- Daniel

Richard Damon

Nov 24, 2012, 6:58:24 PM
On 11/23/12 9:15 PM, Francis Glassborow wrote:
>>
>> No, I don't think so.
>>
>> We already have three 8-bit char types, some of which CAN be used with a
>> UTF-8 encoding. The committee was very reluctant to add a fourth 8-bit
>> char type, possibly identical to one or more of the old ones on some
>> platforms.
>>
>> It's already confusing enough.
>
>
> Even more so because C++ does not have any 8-bit types :) Granted the
> three char types only require 8 bits, but they can be (and on some
> platforms are) wider.
>
> The basic question is whether we need a pure 8-bit type (one that is
> required to be only 8 bits wide).
>
> Francis
>
>

That would be uint8_t (or int8_t) from <stdint.h>
If it is available, it MUST be exactly 8 bits wide.

There might be some use in defining a type that MUST be Unicode, but
does the standard really want to force all implementations to use
Unicode? Should it also define a type that must be ASCII or EBCDIC?

What might be more useful is a variation of typedef that creates a new
type for overloading, so you could do something like

typedef uint8_t new char8_t;

and then overload functions on both types.
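
The `typedef ... new` syntax above is hypothetical and not valid C++. Something close to it can be sketched today with a scoped enumeration that has a fixed underlying type, which gives a distinct type with the same representation. The names `utf8_unit` and `describe` are made up, and the sketch assumes the optional type `std::uint8_t` is available.

```cpp
#include <cstdint>
#include <string>

// Distinct type with the same size and representation as std::uint8_t.
// It takes part in overload resolution separately from std::uint8_t.
enum class utf8_unit : std::uint8_t {};

static_assert(sizeof(utf8_unit) == sizeof(std::uint8_t),
              "same representation as the underlying type");

// Two overloads that the compiler can tell apart.
inline std::string describe(std::uint8_t) { return "plain uint8_t"; }
inline std::string describe(utf8_unit)    { return "UTF-8 code unit"; }
```

The drawback compared with a real language feature is that arithmetic and literals need explicit casts, for example `static_cast<utf8_unit>(0x41)`.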

Then the community could develop this sort of library, which could
either become one of the "other standards" (like POSIX) or eventually
be adopted into the language as existing practice.

Seungbeom Kim

Nov 25, 2012, 5:20:44 AM

On 2012-11-23 12:59, Bo Persson wrote:
> Martin Ba wrote on 2012-11-22 20:24:
>>
>> So I might ask whether there is a current proposal to add a
>> char8_t?
>
> No, I don't think so.
>
> We already have three 8-bit char types, some of which CAN be used
> with a UTF-8 encoding. The committee was very reluctant to add a
> fourth 8-bit char type, possibly identical to one or more of the old
> ones on some platforms.
>
> It's already confusing enough.

Why is it worthwhile to add char16_t and char32_t, while it's only
confusing for char8_t? If we can live without char8_t, can't we
equally live without char16_t and char32_t, using uint_least{16,32}_t
instead?

I know that the change started in C, not including char8_t, and C++
merely adopted it with some slight changes; is it *the* reason? In
other words, could it have been different if it started within C++?

--
Seungbeom Kim

Martin B.

Nov 25, 2012, 3:15:48 PM

On 23.11.2012 21:59, Bo Persson wrote:
> Martin Ba wrote on 2012-11-22 20:24:
>> On 21.11.2012 21:41, Zhihao Yuan wrote:
>> > >
>> > > a) Wouldn't it make sense to have a char8_t where char8_t
>> > > arrays would hold UTF-8 character sequences exclusively?
>> ....
>>
>> So I might ask whether there is a current proposal to add a
>> char8_t?
>>
> No, I don't think so.
>
> We already have three 8-bit char types, some of which CAN be used
> with a UTF-8 encoding. The committee was very reluctant to add a
> fourth 8-bit char type, possibly identical to one or more of the old
> ones on some platforms.
>
> It's already confusing enough.
>

It's confusing, no doubt.

But the UTF-8 support as it stands is a mess in my opinion:
* u8 literals map to *the same* type as "normal" character
literals
* therefore, there is no way to distinguish between the default
encoding and explicit UTF-8
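
The first bullet can be checked at compile time. This is a sketch with made-up names (`u8_element`, `u8_literal_is_plain_char`, `has_char8_t`). It also tests the feature-test macro `__cpp_char8_t`, because C++20 later gave u8 literals a distinct char8_t element type, and the result differs there.

```cpp
#include <type_traits>

// Element type of a u8 string literal, with reference and const stripped.
using u8_element =
    std::remove_cv<std::remove_reference<decltype(u8"x"[0])>::type>::type;

// True when u8 literals are made of plain char, as in C++11 through C++17.
constexpr bool u8_literal_is_plain_char =
    std::is_same<u8_element, char>::value;

// True when the compiler provides the C++20 char8_t type.
#ifdef __cpp_char8_t
constexpr bool has_char8_t = true;
#else
constexpr bool has_char8_t = false;
#endif
```

When `u8_literal_is_plain_char` is true, a function overloaded on `const char*` cannot tell `"text"` and `u8"text"` apart, which is exactly the problem described in the bullets.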

cheers,
Martin


--
Good C++ code is better than good C code, but
bad C++ can be much, much worse than bad C code.

Frank Birbacher

Nov 26, 2012, 5:17:32 PM
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi!

On 22.11.12 00:35, Ulrich Eckhardt wrote:
> While wchar_t isn't off much better, there are effectively only two
> encodings for it, UTF-16 and UTF-32, that are easily distinguished
> via the target platform at compile time.

Well, could also be UCS-2. Mixing it up with UTF-16 is not good if you
really use the character range.

While we are at it: doesn't the standard assume for std::basic_string
that one element (value_type, charT) represents exactly one character
and size() gives the length? Are multibyte/multiword encodings
actually permitted formally?

Frank
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org
Comment: keyserver x-hkp://pool.sks-keyservers.net

iEYEARECAAYFAlCz1+oACgkQhAOUmAZhnmoNegCeJ9ktOCHPTQJ2QGLkdrCKVT4D
6c8Ani8+kTybSdo3ugb6W+EuoIte7j48
=v4PU
-----END PGP SIGNATURE-----

Zhihao Yuan

Nov 27, 2012, 11:49:10 PM
On Monday, November 26, 2012 4:20:06 PM UTC-6, Frank Birbacher wrote:
> While we are at it: doesn't the standard assume for std::basic_string
> that one element (value_type, charT) represents exactly one character
> and size() gives the length? Are multibyte/multiword encodings
> actually permitted formally?

Multibyte encoding is allowed but not ``supported''.

2.14.5/15:
... Note: The size of a char16_t string literal is the total number
of code units, not the number of characters ...
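
A sketch of what that note means in practice. The function name `count_code_points` is made up, and it assumes the input is well-formed UTF-16. `size()` reports code units, so a character outside the Basic Multilingual Plane counts twice.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-16 string by skipping low (trailing) surrogates,
// which only continue a code point started by the preceding unit.
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) ++n;
    }
    return n;
}
```

For `std::u16string s = u"a\U0001F600";`, `s.size()` is 3 while `count_code_points(s)` is 2. The same text as a `std::u32string` has size 2.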

The problem with wchar_t is that its range is limited by the locale.
So if we need a portable string with full Unicode support, char32_t
should be used instead.

Mathias Gaunard

Nov 28, 2012, 12:03:42 AM
On 26 nov, 23:20, Frank Birbacher <bloodymir.c...@gmx.net> wrote:

> While we are at it: doesn't the standard assume for std::basic_string
> that one element (value_type, charT) represents exactly one character

What does it mean to make that assumption?
What is a character?