
What is the status of C++ regarding Unicode?


Robbie Hatley

Mar 27, 2015, 6:21:02 AM

I've been away from doing C++ programming for a while, so I've
lost track of the latest developments. What is the current
status of C++ regarding Unicode? I notice that in my compiler
(g++ version 4.9.2 launched from Bash, on Cygwin, on Windows 8.1)
if I write a program like THIS:

// parrot-test.cpp
#include <iostream>
#include <string>
using std::cin;
using std::cout;
using std::endl;
using std::string;
int main(void) {
    string Text;
    cin >> Text;
    cout << "Length of Text = " << Text.size() << endl;
    cout << "Content of Text = " << Text << endl;
    return 0;
}

If I type the following as input:
麦藁雪、富士川町、山梨県

I get the following results:


Aragorn@Ketch
/rhe/src/test
%parrot-test
麦藁雪、富士川町、山梨県
Length of Text = 36
Content of Text = 麦藁雪、富士川町、山梨県

Aragorn@Ketch
/rhe/src/test
%


Which surprised me. I expected that the program would either
crash (unhandled exception), or print gibberish. But it actually
correctly printed "麦藁雪、富士川町、山梨県" (Mugiwara Yuki, Fujikawa-cho,
Yamanashi-ken).

However, the size is wrong. That's 12 characters, but C++ reports
36. Apparently it's counting bytes, not characters. So what's
happening there? C++ chops the utf8 codepoints into bytes and
stores each byte in a "char", then just parrots those bytes back
on printing? If so, doesn't that make it hard to do things such
as work with substrings, or do searches and substitutions?

Are there ways to search for non-ASCII strings -- say, "富士"
("Fuji") -- in a C++ std::string? Perhaps by specifying utf8
instead of ASCII as the base character type for std::string?
Can that be done?

And I suppose that non-ASCII characters aren't going to be allowed
as identifiers in C++ any time soon? I note that the following won't
even compile:

// parrot-test-2.cpp
#include <iostream>
#include <string>
using std::cin;
using std::cout;
using std::endl;
using std::string;
int main(void) {
    string 麦藁雪;
    cin >> 麦藁雪;
    cout << "Length of 麦藁雪 = " << 麦藁雪.size() << endl;
    cout << "Content of 麦藁雪 = " << 麦藁雪 << endl;
    return 0;
}




--
Cheers,
Robbie Hatley
Midway City, CA, USA
perl -le 'print "\154o\156e\167o\154f\100w\145ll\56c\157m"'
http://www.well.com/user/lonewolf/
https://www.facebook.com/robbie.hatley

fefe

Mar 27, 2015, 6:54:39 AM
On Friday, March 27, 2015 at 6:21:02 PM UTC+8, Robbie Hatley wrote:
> However, the size is wrong. That's 12 characters, but C++ reports
> 36. Apparently it's counting bytes, not characters. So what's
> happening there? C++ chops the utf8 codepoints into bytes and
> stores each byte in a "char", then just parrots those bytes back
> on printing? If so, doesn't that make it hard to do things such
> as work with substrings, or do searches and substitutions?
You're right: what's stored in the string is a series of bytes. If you want to store characters, you will have to use wstring, u16string, or u32string. A single char is too small to hold most Unicode code points.

Conversion from a byte string (UTF-8 or any other encoding) to Unicode is done by codecvt in C++, but much of the behaviour depends on the actual implementation.
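
For example, a minimal sketch of the C++11 codecvt route (note that <codecvt> library support was still missing from some toolchains of this era, libstdc++ in g++ 4.9 among them):

// utf8-to-utf32.cpp
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::string utf8 = u8"麦藁雪、富士川町、山梨県";    // 36 bytes of UTF-8
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string utf32 = conv.from_bytes(utf8);       // one element per code point
    std::cout << utf8.size() << " bytes -> "
              << utf32.size() << " code points" << std::endl; // 36 -> 12
    return 0;
}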

As to operations on std::string with non-ASCII characters, it depends on the encoding used. For UTF-8, taking a substring may cause problems if you split the string at an arbitrary byte position, but searches and substitutions should work OK. This would not be true for other encodings.

> Are there ways to search for non-ASCII strings -- say, "富士"
> ("Fuji") -- in a C++ std::string? Perhaps by specifying utf8
> instead of ASCII as the base character type for std::string?
> Can that be done?
You cannot specify the encoding for a string; it is just a series of bytes. However, if everything in the environment (source file, system environment variables, etc.) is using UTF-8, string.find("富士"); will just work as expected. Just remember the result is not the character position, but the byte position. Things would get complicated if another encoding were used.
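
For example (a sketch that assumes the source file, the execution character set and the terminal all agree on UTF-8, as in the Cygwin setup above):

// find-fuji.cpp
#include <iostream>
#include <string>

int main() {
    std::string text = u8"麦藁雪、富士川町、山梨県";
    std::string::size_type pos = text.find(u8"富士");
    if (pos != std::string::npos)
        std::cout << "found at byte offset " << pos << std::endl;
    // prints 12: the byte offset, not the character position (4)
    return 0;
}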

> And I suppose that non-ASCII characters aren't going to be allowed
> as identifiers in C++ any time soon? I note that the following won't
> even compile:
C++ never forbids using non-ASCII characters in identifiers; it is just optional, and some implementations (such as gcc) didn't implement the feature.

Chris Vine

Mar 27, 2015, 11:16:13 AM
On Fri, 27 Mar 2015 03:20:52 -0700
Robbie Hatley <see.m...@for.my.address> wrote:
>
> I've been away from doing C++ programming for a while, so I've
> lost track of the latest developments. What is the current
> status of C++ regarding Unicode? I notice that in my compiler
> (g++ version 4.9.2 launched from Bash, on Cygwin, on Windows 8.1)
> if I write a program like THIS:
[snip]

There was quite an extensive discussion of this a month or so ago, so
rather than rehash it, here it is:
https://groups.google.com/forum/#!topic/comp.lang.c++/OklSqVoisyY

Unicode support is much as it was in C++ except that in C++11/14 you can
now specify unicode string literals, there are char16_t and char32_t
types and corresponding string types, and unicode code conversion
facets are available. Common misconceptions about unicode, in
particular about what is a "character" (unicode refers to code units
and code points), what unicode normalization involves, how glyphs and
graphemes relate to unicode and so forth, of course remain.
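
For example (a short sketch; it assumes the source file is saved as UTF-8 so the compiler sees the intended code points):

// literals.cpp -- the C++11 unicode literal forms
#include <iostream>
#include <string>

int main() {
    std::string    s8  = u8"富士"; // UTF-8: 3 bytes per character here
    std::u16string s16 = u"富士";  // UTF-16 code units
    std::u32string s32 = U"富士";  // UTF-32: one element per code point
    std::cout << s8.size() << ' ' << s16.size() << ' ' << s32.size()
              << std::endl; // prints "6 2 2"
    return 0;
}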

Chris

Jorgen Grahn

Mar 28, 2015, 4:57:34 AM
On Fri, 2015-03-27, fefe wrote:
> On Friday, March 27, 2015 at 6:21:02 PM UTC+8, Robbie Hatley wrote:
>> However, the size is wrong. That's 12 characters, but C++ reports
>> 36. Apparently it's counting bytes, not characters. So what's
>> happening there? C++ chops the utf8 codepoints into bytes and
>> stores each byte in a "char", then just parrots those bytes back
>> on printing? If so, doesn't that make it hard to do things such
>> as work with substrings, or do searches and substitutions?

> You're right: what's stored in the string is a series of bytes. If
> you want to store characters, you will have to use wstring,
> u16string, or u32string. A single char is too small to hold most
> Unicode code points.
>
>
> Conversion from a byte string (UTF-8 or any other encoding) to Unicode

Surely utf8 /is/ unicode?

The link someone posted last time around: http://utf8everywhere.org/

(Didn't read the rest. I'm rather ignorant of i18n issues.)

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Richard Damon

Mar 28, 2015, 11:38:08 AM
On 3/27/15 6:20 AM, Robbie Hatley wrote:
>
> I've been away from doing C++ programming for a while, so I've
> lost track of the latest developments. What is the current
> status of C++ regarding Unicode? I notice that in my compiler
> (g++ version 4.9.2 launched from Bash, on Cygwin, on Windows 8.1)
> if I write a program like THIS:
>

C++ does not require that "characters", in general, use a "Unicode"
encoding. The standard does provide ways to define strings that do have
Unicode encoding, but the normal string functions do not have to process
them as you might expect.

On many systems, Unicode will "work", or can be made to work, but for
some systems it won't.

It would be nice if the standard said that the "C" locale (or some other
standard defined locale like "Unicode") was Unicode compliant, but they
haven't.

One big feature of Unicode (in UTF-8 encoding) was that, with a few
basic rules, a program can process a Unicode data stream as just a
string of bytes. The biggest of those rules is that the program should
only manipulate strings at known "character" boundaries; the UTF-8
encoding carefully makes sure that no character "piece" looks like a
normal ASCII character, and in general that no Unicode codepoint looks
like a piece of another codepoint. Thus you can parse a string into
"lines" based on newlines, or "words" based on spaces (or other listed
punctuation), and just pass the Unicode through.
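
For instance, a sketch of byte-level word splitting that never corrupts well-formed UTF-8 (safe because the ASCII bytes 0x00-0x7F never occur inside a multi-byte UTF-8 sequence):

// split-utf8.cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Splitting on an ASCII delimiter is safe at the byte level: no UTF-8
// continuation byte ever equals an ASCII byte.
std::vector<std::string> split(const std::string& s, char delim) {
    std::vector<std::string> parts;
    std::istringstream in(s);
    std::string field;
    while (std::getline(in, field, delim))
        parts.push_back(field);
    return parts;
}

int main() {
    for (const std::string& word : split(u8"麦藁雪 富士川町 山梨県", ' '))
        std::cout << word << '\n'; // each word is still intact UTF-8
    return 0;
}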

Now, if you actually want to count "characters" in Unicode, you need to
be fully Unicode aware, as this is a complicated attribute. With a bit
of smarts you can count "codepoints", as these are fairly simple to
define. But while codepoints are sometimes called "characters", they
really aren't (at least if you want to consider a character as a glyph
that is printed): some are just control codes (so produce no visible
output), and some are combining marks that affect an adjacent codepoint
in forming the final "character" for output as a glyph.
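
Counting codepoints really is the simple part; in well-formed UTF-8 it just means skipping continuation bytes. A sketch (this counts codepoints, deliberately not graphemes):

// count-codepoints.cpp
#include <cstddef>
#include <iostream>
#include <string>

// UTF-8 continuation bytes have the form 10xxxxxx; every other byte
// starts a new codepoint. Assumes well-formed UTF-8 input.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}

int main() {
    std::string s = u8"麦藁雪、富士川町、山梨県";
    std::cout << s.size() << " bytes, "                   // 36
              << count_code_points(s) << " codepoints\n"; // 12
    return 0;
}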

Vir Campestris

Mar 28, 2015, 6:07:40 PM
On 28/03/2015 08:57, Jorgen Grahn wrote:
> Surely utf8 /is/ unicode?

Oh no, not at all.

Unicode is a wide character set; there are more than 256 values.

Now most of us use characters that are only in the first 256; so a
byte-wide set works fine. On the other hand in Chinese there are way
more characters than that. UTF-8 is a way of encoding the characters
that is efficient for us westerners. It isn't bad for China either, as a
lot of the characters can be compressed down to 2 bytes - which means
UTF-8 is no bigger than 16-bit Unicode (UCS-2?). Windows uses 16-bit
characters for its native APIs. But to meet all the characters - and
this includes some musical ones - you have to have more than 16 bits.
UTF-8 does this - but takes up to 4 bytes per character. And ++ on a
char string doesn't work any more - you have to know you're using UTF-8
and take special measures to get whole characters.

AIUI Unix has used UTF-8 since the year dot, and hence Linux since
birth. DOS had national variants :( - which is why the Japanese (used
to?) use the yen currency character for backslash in path separators.

Andy

Nobody

Mar 28, 2015, 7:07:33 PM
On Sat, 28 Mar 2015 22:07:05 +0000, Vir Campestris wrote:

> AIUI Unix has used UTF-8 since the year dot,

Unix is older than either Unicode or UTF-8.

The kernel doesn't use *any* encoding. In general, its interfaces use byte
sequences without attempting to assign any meaning to the bytes,
except that the null byte is commonly used as a string terminator and
'\x2f' (the slash character in ASCII) is used as a directory separator in
pathnames.

[One exception is that Linux has a (non-standard) termios flag (IUTF8)
which causes the VERASE character (typically \x08 or \x7f) to delete the
last UTF-8 character from the input buffer rather than the last byte.]

User-space functions which need to know the encoding (e.g. <ctype.h>
functions, mbstowcs, etc) default to US-ASCII; this can be changed by
setlocale(), but exactly which locales are supported (and whether any of
them use UTF-8) is implementation-dependent.

Even here, only encodings which are substantially compatible with US-ASCII
can be used. Characters which are part of important protocols (e.g. \0 as
the terminator, '/' as the directory separator, ':' as the path separator,
'=' for separating environment variables from their values, etc) must have
the same encoding as US-ASCII and their corresponding bytes cannot appear
anywhere within a multi-byte character. This precludes the use of e.g.
EBCDIC or UTF-16 as a locale encoding.

In short, Unix has used US-ASCII since the year dot, with everything else
having been tacked on afterwards.

The extent to which a library or interface uses UTF-8 tends to be directly
proportional to its proximity to a) open-source projects (which are less
concerned about legacy compatibility) and b) the internet (which typically
wants some degree of internationalisation at minimal cost).

Chris Vine

Mar 28, 2015, 7:49:31 PM
On Sat, 28 Mar 2015 22:07:05 +0000
Vir Campestris <vir.cam...@invalid.invalid> wrote:
> On 28/03/2015 08:57, Jorgen Grahn wrote:
> > Surely utf8 /is/ unicode?
>
> Oh no, not at all.

Oh yes.

> Unicode is a wide character set

That is wrong. Unicode is a defined set of code points. There are a
number of encodings which unicode recognizes for the purposes of
encoding those code points, namely UTF-7, UTF-8, UTF-16 and UTF-32.
Because the latter in fact is a one to one encoding with code points,
it is identical to UCS-4.

> Now most of us use characters that are only in the first 256; so a
> byte-wide set works fine. On the other hand in Chinese there are way
> more characters than that. UTF-8 is a way of encoding the characters
> that is efficient for us westerners. It isn't bad for China either, as
> a lot of the characters can be compressed down to 2 bytes - which
> means UTF-8 is no bigger than 16-bit Unicode (UCS-2?).

UTF-16 is the 16 bit encoding for unicode. UCS-2 only supports the
basic multilingual plane, whereas UTF-16 can represent all unicode code
points (and accordingly is a variable length encoding as it uses
surrogate pairs of 16 bit code units). UTF-16 occupies more space than
UTF-8 for the average european script. It is reputed to occupy slightly
less for the average Japanese script.
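
The surrogate arithmetic is simple enough to show in a few lines; a quick sketch (the example code point, U+1D11E MUSICAL SYMBOL G CLEF, is one of those outside the basic multilingual plane):

// surrogates.cpp
#include <cstdio>

// Encode a code point above U+FFFF as a UTF-16 surrogate pair, which
// is exactly why UTF-16 is a variable length encoding.
void to_surrogates(char32_t cp, char16_t& hi, char16_t& lo) {
    char32_t v = cp - 0x10000;                        // 20 bits remain
    hi = static_cast<char16_t>(0xD800 + (v >> 10));   // high surrogate
    lo = static_cast<char16_t>(0xDC00 + (v & 0x3FF)); // low surrogate
}

int main() {
    char16_t hi, lo;
    to_surrogates(U'\U0001D11E', hi, lo);
    std::printf("U+1D11E -> 0x%04X 0x%04X\n",
                static_cast<unsigned>(hi), static_cast<unsigned>(lo));
    // prints: U+1D11E -> 0xD834 0xDD1E
    return 0;
}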

> Windows uses 16-bit characters for its native APIs. But to meet all
> the characters - and this includes some musical ones - you have to
> have more than 16 bits. UTF-8 does this - but takes up to 4 bytes per
> character. And ++ on a char string doesn't work any more - you have
> to know you're using UTF-8 and take special measures to get whole
> characters.

Many "characters" (given the meaning most people think it has) require
more than two unicode code points in normalized non-precomposed form.
Some "characters" are not representable in precomposed form. Such
representations require more than one UTF-32 code unit, more than two
UTF-16 code units and can require more than four UTF-8 code units.

Because UTF-16 is a variable length encoding, your '++' does not work
(for your meaning of "work") with UTF-16 either. Because of combining
characters, nor does UTF-32 if by "character" you mean a grapheme,
which is what most people think it means (namely, what they see as a
"character" in their terminal).

> AIUI Unix has used UTF-8 since the year dot, and hence Linux since
> birth. DOS had national variants :( - which is why the Japanese (used
> to?) use the yen currency character for backslash in path seperators.

No. For narrow encodings, unix used to be as incoherent as microsoft
code pages for its narrow codesets. ISO-8859 was common for
non-cyrillic european scripts, KOI8-R for cyrillic, and EUC ("Extended
Unix Code") for CJK scripts. JIS and Shift-JIS were also in use for
Japanese scripts and GB 2312 for Chinese scripts.

Chris

Richard Damon

Mar 28, 2015, 8:55:31 PM
On 3/28/15 7:07 PM, Nobody wrote:
> On Sat, 28 Mar 2015 22:07:05 +0000, Vir Campestris wrote:
>
>> AIUI Unix has used UTF-8 since the year dot,
>
> Unix is older than either Unicode or UTF-8. ... In short, Unix has
> used US-ASCII since the year dot, with everything else having been
> tacked on afterwards.
>
> The extent to which a library or interface uses UTF-8 tends to be
> directly proportional to its proximity to a) open-source projects
> (which are less concerned about legacy compatibility) and b) the
> internet (which typically wants some degree of internationalisation
> at minimal cost).

One of the keys to Unicode (via UTF-8 encoding) working in *nix
environments is that the designers of Unicode made a special effort to
make it fairly transparent to most programs. The first 128 characters
exactly match the ASCII definitions, so an ASCII file and a Unicode
UTF-8 file of the same content are identical. They were also careful
that no code-point looks like a piece of another code-point, which makes
string searching generally "work". This means that, in general, a
program that just manipulates strings at points found by searching for
characters/strings will tend to "just work" with UTF-8 text. This
describes most of the operations in the *nix kernel.

Programs that need to break down a string into "characters" (like an
editor) need to be much more aware of things like Unicode.

Nobody

Mar 29, 2015, 8:14:06 AM
On Sat, 28 Mar 2015 23:49:10 +0000, Chris Vine wrote:

>> Unicode is a wide character set
>
> That is wrong. Unicode is a defined set of code points.

Which is roughly the correct meaning of "character set" (as opposed to
"encoding", which is what some people mean when they say "character set").

It's "wide" insofar as their are more than 256 code points.

> No. For narrow encodings, unix used to be as incoherent as microsoft
> code pages for its narrow codesets.

I wouldn't go quite that far.

And to the extent that it's true, it hasn't really changed. UTF-8 is
something a lot of people *wish* was standard, but isn't. UTF-8 itself is
*a* standard, but from POSIX' perspective, it's just one of the many
encodings which may or may not be supported by a given platform.

In short, for all its advantages, UTF-8 isn't magically immune to

http://xkcd.com/927/

Öö Tiib

Mar 29, 2015, 12:30:19 PM
On Sunday, 29 March 2015 15:14:06 UTC+3, Nobody wrote:
>
> And to the extent that it's true, it hasn't really changed. UTF-8 is
> something a lot of people *wish* was standard, but isn't. UTF-8 itself is
> *a* standard, but from POSIX' perspective, it's just one of the many
> encodings which may or may not be supported by a given platform.
>
> In short, for all its advantages, UTF-8 isn't magically immune to
>
> http://xkcd.com/927/

Yes, but that is not the case with the C, C++ or POSIX standards. Those
keep string contents totally implementation-defined because of legacy
systems that may have 9-bit bytes or EBCDIC encoding or the like.
However, since even such (more imaginary than real) systems have to
deal with data formats consisting of 8-bit octets and UTF-8 text
(the vast majority of the data we have on our planet is that), it is
clearly a futile trend.

Chris Vine

Mar 29, 2015, 12:50:25 PM
On Sun, 29 Mar 2015 13:14:04 +0100
Nobody <nob...@nowhere.invalid> wrote:
> On Sat, 28 Mar 2015 23:49:10 +0000, Chris Vine wrote:
>
> >> Unicode is a wide character set
> >
> > That is wrong. Unicode is a defined set of code points.
>
> Which is roughly the correct meaning of "character set" (as opposed to
> "encoding", which is what some people mean when they say "character
> set").
>
> It's "wide" insofar as their are more than 256 code points.

First, to say that unicode comprises UTF-32 and that other encodings,
including UTF-8, are not "unicode" (which was the suggestion to which I
was responding) is out-and-out wrong. You do not help anyone reading
this newsgroup to suggest otherwise.

Secondly, unicode is universal (as the name suggests). It is a type
error to say that unicode is "wide", and the fact that there are
0x10FFFF usable code points within the range unicode employs is
irrelevant to this. There are two narrow and two wide encodings for
unicode, if by "wide" you mean greater than 8 bits and by "narrow" you
mean 8 bits or less.

(As an aside, if by "wide" you are trying to bring in some association
with wchar_t, then you would be wrong to do so: on unix-like systems
wchar_t is normally a 32 bit type, therefore leaving the 16-bit unicode
encoding on this measure as "narrow" or unclassifiable on unix and
"wide" on windows.)

There are enough misconceptions about unicode floating around without
creating more.

Chris

Vir Campestris

Mar 29, 2015, 4:09:37 PM
On 29/03/2015 17:50, Chris Vine wrote:
> (As an aside, if by "wide" you are trying to bring in some association
> with wchar_t, then you would be wrong to do so: on unix-like systems
> wchar_t is normally a 32 bit type, therefore leaving the 16-bit unicode
> encoding on this measure as "narrow" or unclassifiable on unix and
> "wide" on windows.)
>
Correct me if I'm wrong, but I think Windows just represents the first
64k code points in its 16-bit characters. With no way of representing
the rest.

> There are enough misconceptions about unicode floating around without
> creating more.

The point at which I pricked up my ears was "utf-8 == unicode". Which it
isn't - it's a representation.

And back to the question I asked a few days ago - if you want to open a
file whose name is not US-ASCII is there a way to do it without using a
compression system of some sort in current STL?

Andy

Paavo Helde

Mar 29, 2015, 5:03:50 PM
Vir Campestris <vir.cam...@invalid.invalid> wrote in
news:psedndCBW6l4xoXI...@brightview.co.uk:

> On 29/03/2015 17:50, Chris Vine wrote:
>> (As an aside, if by "wide" you are trying to bring in some association
>> with wchar_t, then you would be wrong to do so: on unix-like systems
>> wchar_t is normally a 32 bit type, therefore leaving the 16-bit unicode
>> encoding on this measure as "narrow" or unclassifiable on unix and
>> "wide" on windows.)
>>
> Correct me if I'm wrong, but I think Windows just represents the first
> 64k code points in its 16-bit characters. With no way of representing
> the rest.

No, at some point (more than 15 years ago, I believe) the Windows
encoding was redefined to be UTF-16 instead of UCS-2.

>
>> There are enough misconceptions about unicode floating around without
>> creating more.
>
> The point I pricked up my ears was "utf-8 == unicode". Which it isn't -
> it's a representation.

Yes, UTF-8 is a representation of Unicode. Which means it is Unicode
(i.e. it has the property of being able to represent all Unicode codepoints).

>
> And back to the question I asked a few days ago - if you want to open a
> file whose name is not US-ASCII is there a way to do it without using a
> compression system of some sort in current STL?

Not sure what a "compression system" is, but of course on every platform
the C++ implementations generally take care that it is possible to open
files with all valid filenames for that platform. Unfortunately this is
still very platform-specific. On Windows, for example, you might need to
use functions like _wfopen_s() or CreateFileW().
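
For example, a Windows-only sketch (the wide-filename fstream constructor is a Microsoft extension, not standard C++, and "富士.txt" is just a hypothetical name):

// wide-open.cpp (Windows/MSVC)
#include <fstream>

int main() {
    // MSVC's fstreams accept a wchar_t* filename, so the UTF-16 name
    // reaches the filesystem without a narrow-encoding round trip.
    std::ifstream in(L"富士.txt");
    // On a typical Linux system you would instead pass the UTF-8 bytes
    // to the standard narrow constructor: std::ifstream in("富士.txt");
    return in.good() ? 0 : 1;
}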

Cheers
Paavo

Richard Damon

Mar 29, 2015, 5:10:13 PM
On 3/29/15 4:09 PM, Vir Campestris wrote:
> On 29/03/2015 17:50, Chris Vine wrote:
>> (As an aside, if by "wide" you are trying to bring in some association
>> with wchar_t, then you would be wrong to do so: on unix-like systems
>> wchar_t is normally a 32 bit type, therefore leaving the 16-bit unicode
>> encoding on this measure as "narrow" or unclassifiable on unix and
>> "wide" on windows.)
>>
> Correct me if I'm wrong, but I think Windows just represents the first
> 64k code points in its 16-bit characters. With no way of representing
> the rest.

The current Microsoft documentation describes using UTF-16, so I think
they mean to allow surrogate pairs to get to the full range of Unicode
codepoints. This doesn't say how much of the system will have trouble
with them.

>
>> There are enough misconceptions about unicode floating around without
>> creating more.
>
> The point I pricked up my ears was "utf-8 == unicode". Which it isn't -
> it's a representation.
>

As are utf-16, ucs-2, utf-32, and ucs-4. You have to use some form of
"representation" to store ANY data.

> And back to the question I asked a few days ago - if you want to open a
> file whose name is not US-ASCII is there a way to do it without using a
> compression system of some sort in current STL?
>
> Andy

It will inherently be implementation (or other standard) dependent. For
*nix, it will depend on the system's language setting. If it is using
utf-8, then just send the file name as UTF-8. If it is configured to use
a national code page, you send a string encoded in the national code page.

Windows, I believe, will always store the filename as UTF-16 (so you
don't have language interpretation issues with filenames), but the "narrow"
open function will interpret the character string according to the
defined locale, as the function widens it.

Juha Nieminen

Mar 30, 2015, 5:57:30 AM
Robbie Hatley <see.m...@for.my.address> wrote:
> Which surprised me. I expected that the program would either
> crash (unhandled exception)

Why would it crash? You gave it some bytes and it echoed them back.
What exactly would make it crash? The fact that the bytes just happened
to be in a form specified by some UTF-8 specification makes no
difference. They are just bytes.

If you want to interpret the input as UTF-8 in your program, you need
to use some library for that. std::string itself is not enough for
that kind of operation.

--- news://freenews.netfront.net/ - complaints: ne...@netfront.net ---

Vir Campestris

Mar 30, 2015, 4:42:58 PM
On 29/03/2015 22:03, Paavo Helde wrote:
>> And back to the question I asked a few days ago - if you want to open a
>> file whose name is not US-ASCII is there a way to do it without using a
>> compression system of some sort in current STL?
> Not sure what is a "compression system", but of course on every platform
> the C++ implementations generally take care that it is possible to open
> files with all valid filenames for this platform. Unfortunately this is
> still very platform-specific. On Windows for example you might need to
> use functions like _wfopen_s() or CreateFileW().

I think that means no.

By a compression system I meant something like UTF-8. Perhaps not the
best choice of words - sorry.

Last time I needed to know this the STL functions all took narrow
strings. That's OK if you know what character set the OS is using, and
the filename can be represented in it.

But there was no STL way to say "What's my char set" or "set my char
set" - so you're down to implementation dependent stuff.

Andy

Paavo Helde

Mar 30, 2015, 5:46:01 PM
Vir Campestris <vir.cam...@invalid.invalid> wrote in
news:L_-dnYfwIKnVKITI...@brightview.co.uk:

> On 29/03/2015 22:03, Paavo Helde wrote:
>>> And back to the question I asked a few days ago - if you want to
>>> open a
>>> >file whose name is not US-ASCII is there a way to do it without
>>> >using a compression system of some sort in current STL?
>> Not sure what is a "compression system", but of course on every
>> platform the C++ implementations generally take care that it is
>> possible to open files with all valid filenames for this platform.
>> Unfortunately this is still very platform-specific. On Windows for
>> example you might need to use functions like _wfopen_s() or
>> CreateFileW().
>
> I think that means no.

Literally, it does not mean no, because you can have a narrow codepage in
Windows like Windows-1257 and have filenames with s-caron etc. as C++
narrow-string filenames which are definitely not US-ASCII.

But I think this is not what you asked. I agree this all could work much
better. I think the root reason is that most OSes, as well as C and
C++, predate the Unicode standard.

>
> By a compression system I meant something like UTF-8. Perhaps not the
> best choice of words - sorry.
>
> Last time I needed to know this the STL functions all took narrow
> strings. That's OK if you know what character set the OS is using, and
> the filename can be represented in it.

There are only a handful (one?) of STL functions which take filenames. And
to be honest, I have never needed the fstream part of STL in professional
work. This is probably because most files I deal with are binary, not
some formatted text, plus fstream does not support file mapping anyway.
So while the current situation is sad, it's all still mostly academic, at
least for me.

Cheers
Paavo

Chris Vine

Mar 30, 2015, 7:33:07 PM
On Mon, 30 Mar 2015 21:42:47 +0100
Vir Campestris <vir.cam...@invalid.invalid> wrote:
[snip]
> But there was no STL way to say "What's my char set" or "set my char
> set" - so you're down to implementation dependent stuff.

"What's my char set" and "set my char set" by themselves are
meaningless in the context of file names. There is nothing to say that
the filesystem narrow encoding is the same as a particular machine's
locale narrow encoding, and this is self-evident in the case of network
file systems: for these you take what you are given. A sane
cross-platform distributed system probably restricts itself to ASCII.
In practice, that works fine.

The Portable Filename Character Set for POSIX (paragraph 3.276 of the
SUS) is even smaller than that, but that is a minimum requirement. In
practice any modern unix will support practically any narrow encoding
(including ISO-8859 and UTF-8) because it views them as just another
stream of bytes terminated by the NUL character and using '/' as the
directory separator, although the space (' ') is best avoided on
unix-likes because lots of user software isn't designed to handle it.

On windows you have other issues arising from the need to map the
narrow character set to the wide character set.

Chris

Richard Damon

Mar 30, 2015, 9:42:44 PM
My feeling is: why would a program embed a non-basic-character-set
filename in itself anyway, and need to avoid implementation-dependent
stuff? Almost always, if I need to access a filename like that, it has
come from somewhere else. If I need to hard-code a filename, I will use
just the basic alphanumerics so it is safe.

Assuming you CAN use such a file name at all is a big assumption, and
makes your program implementation-dependent to begin with.

Note that you CAN get information about the current language by using
setlocale and passing a null pointer for the locale argument, but the
encoding of the result is implementation-dependent.
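
A minimal sketch of that query (passing a null pointer to setlocale only reads the current locale; the string it returns has no standard format):

// locale-query.cpp
#include <clocale>
#include <cstdio>

int main() {
    std::setlocale(LC_ALL, "");   // adopt the locale from the environment
    // A null pointer means "query without changing anything":
    const char* name = std::setlocale(LC_ALL, nullptr);
    std::printf("current locale: %s\n", name ? name : "(unknown)");
    return 0;
}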

Robbie Hatley

Apr 12, 2015, 8:20:26 PM

On 3/29/2015 5:14 AM, Nobody wrote:


> In short, for all its advantages, UTF-8 isn't magically immune to
>
> http://xkcd.com/927/
>

Harrumpf. Amusing, but in the Internet and Perl communities, pretty
far from the truth. The following is tongue-in-cheek, but still
kinda accurate in those communities:

DEFINITIONS:
"Character Set" = utf8
"Character Encoding" = utf8
"Unicode" = utf8
"Character" = utf8 entity (visual)
"Grapheme" = utf8 entity (visual, w/o adornments)
"Grapheme Cluster" = utf8 entity (visual, with adornments)
"Code Point" = utf8 entity or subentity (numerical)

Obviously in other sub-fields of computer science this is much less
true. But I hope that eventually all the other versions of Unicode
go extinct (or get relegated to special purposes) so that for the
most part "Unicode" = "utf8".

And hopefully before too long, C and C++ will do as Perl has done
and incorporate utf8 handling so that it becomes the standard for
handling text. (Or, at least, make it easy to switch between
ASCII and utf8.) And "Unicode Collate" should be part of the
Standard Library for sure.

Also, it would be nice to be able to say the following in
C++ source code:

double 富士川町年齢 = 58.836; // median citizen age in Fujikawa
int 富士川町猫 = 4378;  // number of cats in Fujikawa

You can already do that in some programming languages.
Just, not in C or C++. Yet.

Richard Damon

Apr 12, 2015, 9:46:11 PM
Note that it is explicitly permitted for an implementation to allow
such code. The implementation can use Unicode (any of utf8, utf16, or
ucs-4) as its character set, and "identifier-nondigit" (which
identifiers can be made from) explicitly lists "other implementation
defined characters", which can cover much of the unicode character
space.

Yes, you make your program dependent on implementation-defined
behavior, but in some cases that is acceptable.

Robbie Hatley

Apr 13, 2015, 4:43:47 AM

On 3/28/2015 8:37 AM, Richard Damon wrote:

> ...Now, if you actually want to count "Characters" in Unicode,
> you need to be fully Unicode aware, as this is a complicated
> attribute...

So, what you're saying is, a C++ program is unlikely to realize
(without a *LOT* of help from the programmer) that this string:

position

is actually just 8 characters, even though it's 590 bytes. :D
Though I must say, that's a pretty egregious abuse of the
concept of "Grapheme Clusters".

Robbie Hatley

Apr 13, 2015, 5:24:57 AM

On 3/30/2015 2:57 AM, Juha Nieminen wrote:

> Robbie Hatley <see.m...@for.my.address> wrote:
> > Which surprised me. I expected that the program would either
> > crash (unhandled exception)
>
> Why would it crash? You gave it some bytes and it echoed them back.
> What exactly would make it crash? The fact that the bytes just happened
> to be in a form specified by some UTF-8 specification makes no
> difference. They are just bytes.

I'm beginning to appreciate that. Which is kinda cool, because
it means that some (but not all) operations with utf8 can be
done in C++ exactly as if you were using ASCII or ISO-8859-1.

> If you want to interpret the input as UTF-8 in your program, you need
> to use some library for that. std::string itself is not enough for
> that kind of operation.

I'm guessing that sometimes one could get away with that.

But sometimes not.

Let's run the following test. I have a C++ program that sorts and
dedups the lines of text in a text file. (Useful for lists of
things such as names, though would make garbage of normal text.)
The program looks like this (simplified for brevity):

int main (int Thyme, char *Sage[])
{
    // Make a list of strings called "Text":
    static std::list<std::string> Text;

    // Read input from cin to Text:
    ReadInput(Text);

    // Sort the list of strings:
    Text.sort();

    // Remove duplicate lines from Text:
    Text.unique();

    // Write output from Text to cout:
    WriteList(Text);

    return 0;
}

(where the functions mentioned do what their names say)

Let's give it this input:

Kate Onthetimeline
Chasity Ahmad
Collin Tierney
Ed Gooz
Frederic Moseley Jr.
John Froex
Amanda Alatti
Emirjon Fishta
Lisa Lauchstedt
Mary Elizabeth Blackley
Zackry Wallace-Bell
İrfan Qureyş
Padraic O'Driscoll
Nékoé Mīkûriá
Arthur Vullamparthi
Ragu PG
Nathan Gutierrez
Sifokl AlSifokli
Tammy Houghtaling
Yuri Aleksei Carrión Belliard

And what I get back is:

Amanda Alatti
Arthur Vullamparthi
Chasity Ahmad
Collin Tierney
Ed Gooz
Emirjon Fishta
Frederic Moseley Jr.
John Froex
Kate Onthetimeline
Lisa Lauchstedt
Mary Elizabeth Blackley
Nathan Gutierrez
Nékoé Mīkûriá
Padraic O'Driscoll
Ragu PG
Sifokl AlSifokli
Tammy Houghtaling
Yuri Aleksei Carrión Belliard
Zackry Wallace-Bell
İrfan Qureyş

Which is, for the most part, exactly the sorting you expect, except
that "İrfan Qureyş" has been moved to the end due to the single
dot above the "I". Unless you got lucky and the "İ" was in
its "fully decomposed" encoding ("I" plus the dot as a combining mark),
that is likely to happen to any line starting with a letter
using a diacritical mark. Also, lines starting with capital letters
would sort before lines starting with small letters.

Which is why I think C++ needs a "UnicodeCollate" function in its
std lib.


(Disclaimer: The names listed are random names for software testing
only; any resemblance to actual people, living or dead, is purely
coincidental.)

Ben Bacarisse

Apr 13, 2015, 7:56:58 AM
Robbie Hatley <see.m...@for.my.address> writes:

> On 3/30/2015 2:57 AM, Juha Nieminen wrote:
<snip>
Did you try with

Text.sort(std::locale(""));

maybe with a suitable string there if your environment does not define
the locale correctly for that input? This won't address all Unicode
encoding and collating issues by any means, but it's a first step that
might be enough for simple programs.
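
A minimal sketch of that suggestion (std::locale has an operator() that compares strings through its collate facet, so it can serve directly as a comparator; the "" locale comes from the environment, so the exact order is platform-dependent):

// collate-sort.cpp
#include <iostream>
#include <list>
#include <locale>
#include <string>

int main() {
    std::list<std::string> text = {
        "Zackry Wallace-Bell", "İrfan Qureyş", "Amanda Alatti"
    };
    text.sort(std::locale(""));  // collate-aware, locale-dependent order
    for (const std::string& line : text)
        std::cout << line << '\n';
    return 0;
}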

--
Ben.