
STL, UTF8, and CodeCvt


Philip

Mar 4, 2007, 10:44:21 AM
I am using MS VC++ 7.1 (Visual Studio 2003) and working in a fully
Unicode application targeted for Far Eastern language support.

The STL stream constructors and open functions require the filename be
provided as a narrow (char) string. Most operating systems now
support Unicode paths and filenames as UTF-8 strings, which can be
represented as char strings.

I would like to pass UTF-8 strings to the STL stream constructors and
open functions, in order to support Far Eastern language filenames.

The STL standard requires std::codecvt functions to support conversion
to and from Unicode UTF-16 wchar_t and MBCS char.

However, I cannot find equivalent functions for conversion from UTF-16
to UTF-8.

Are there UTF-16/UTF-8 conversions that are already part of or being
considered for inclusion in the STL standard?

Is there an Intel/Windows based STL implementation which currently
provides UTF-16/UTF-8 conversion?

More generally, STL currently focuses on char/wchar_t types and refers
to them more or less consistently as narrow versus wide type (see
std::ios::widen()).

Does the STL standard currently encompass any terminology to
accommodate UTF-8 (a sort of combination of narrow and wide) or is the
standards committee considering anything along these lines?

Lastly, the UTF-16/UTF-8 conversions are well-known and relatively
simple so I have considered writing my own std::codecvt
specialization.

However, there is no clearly delineated type which could stand for
UTF-8. I believe unsigned char is taken up (in Windows and VC++
anyway) for MBCS, and typedefs are not real types, only aliases and
thus do not allow a separate specialization.

Is there another technique or a language enhancement on the horizon
which would address this specialization limitation?

Philip


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Eugene Gershnik

Mar 4, 2007, 1:53:37 PM
Philip wrote:
> I am using MS VC++ 7.1 (Visual Studio 2003) and working in a fully
> Unicode application targeted for Far Eastern language support.
>
> The STL stream constructors and open functions require the filename be
> provided as a narrow (char) string. Most operating systems now
> support Unicode paths

Maybe. Windows and Unices do anyway.

> and filenames as UTF-8 strings, which can be
> represented as char strings.

Not the operating systems you are targeting. To date Windows and VC do not
support UTF-8 locales and therefore UTF-8 filenames. See
http://blogs.msdn.com/michkap/archive/2006/10/11/816996.aspx and
http://blogs.msdn.com/michkap/archive/2006/03/13/550191.aspx
for more details.

> I would like to pass UTF-8 strings to the STL stream constructors and
> open functions, in order to support Far Eastern language filenames.

You need to
- Upgrade to VC 8 (aka 2005)
- Convert UTF-8 to UTF-16, which is the wchar_t encoding on Windows, and use
the resulting filename in basic_fstream::open. The standard library coming
with VC 8 allows wchar_t filenames.
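
The conversion step itself can be sketched portably. On Windows one would normally just call MultiByteToWideChar with CP_UTF8, but a minimal hand-rolled converter (an illustration only: it assumes well-formed input and does no validation) looks roughly like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Minimal UTF-8 -> UTF-16 converter. Assumes well-formed input; a
// production version must validate sequences and reject overlong forms.
std::wstring utf8_to_utf16(const std::string& s)
{
    std::wstring out;
    for (std::size_t i = 0; i < s.size(); ) {
        unsigned char b = s[i];
        std::uint32_t cp;
        std::size_t len;
        if      (b < 0x80)           { cp = b;        len = 1; } // ASCII
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else                         { cp = b & 0x07; len = 4; }
        for (std::size_t j = 1; j < len; ++j)       // fold in the
            cp = (cp << 6) | (s[i + j] & 0x3F);     // continuation bytes
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<wchar_t>(cp));
        } else {                                    // needs a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 | (cp & 0x3FF)));
        }
    }
    return out;
}
```

The result of utf8_to_utf16(name) can then be handed to basic_fstream::open in the VC 8 library.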

> The STL standard requires std::codecvt functions to support conversion
> to and from Unicode UTF-16 wchar_t and MBCS char.

Standard wchar_t does not have to be Unicode, and when it is Unicode it does
not have to be UTF-16. Similarly, char does not have to store any kind of
MBCS encoding. On Windows what you say happens to be true, but the C++
standard is not for Windows only.

> However, I cannot find equivalent functions for conversion from UTF-16
> to UTF-8.
>
> Are there UTF-16/UTF-8 conversions that are already part of or being
> considered for inclusion in the STL standard?

Some standardisation expert will surely answer this. Whatever the standards
committee's plans are, though, they are not going to help you now.

> Is there an Intel/Windows based STL implementation which currently
> provides UTF-16/UTF-8 conversion?

The commercial version of Dinkumware's library might. Check their website.

> More generally, STL currently focuses on char/wchar_t types and refers
> to them more or less consistently as narrow versus wide type (see
> std::ios::widen()).

Yep, that's what they are. Not every OS out there supports Unicode and even
those that do sometimes use something else for wchar_t (e.g. Solaris).

> Does the STL standard currently encompass any terminology to
> accommodate UTF-8 (a sort of combination of narrow and wide) or is the
> standards committee considering anything along these lines?

I am not qualified to comment on the standard, but UTF-8 is not a
"combination of narrow and wide". It is precisely one more kind of narrow
character encoding. There is no fundamental difference between, say,
Shift-JIS and UTF-8: both encode a given logical character as a sequence of
one or more byte-sized units.

> Lastly, the UTF-16/UTF-8 conversions are well-known and relatively
> simple so I have considered writing my own std::codecvt
> specialization.

What you need is a UTF-8 locale, but VC doesn't support them.

> However, there is no clearly delineated type which could stand for
> UTF-8.

There is. It is called char.

--
Eugene

P.J. Plauger

Mar 4, 2007, 2:44:14 PM
"Philip" <Mont...@Hotmail.com> wrote in message
news:1173013138.6...@t69g2000cwt.googlegroups.com...

>I am using MS VC++ 7.1 (Visual Studio 2003) and working in a fully
> Unicode application targeted for Far Eastern language support.
>
> The STL stream constructors and open functions require the filename be
> provided as a narrow (char) string. Most operating systems now
> support Unicode paths and filenames as UTF-8 strings, which can be
> represented as char strings.
>
> I would like to pass UTF-8 strings to the STL stream constructors and
> open functions, in order to support Far Eastern language filenames.

Our upgrade library for VC++ accepts wchar_t strings as filenames,
so you can use UTF-16 names directly. UTF-8 names require a translation
step, but we also provide the translator.

> The STL standard requires std::codecvt functions to support conversion
> to and from Unicode UTF-16 wchar_t and MBCS char.
>
> However, I cannot find equivalent functions for conversion from UTF-16
> to UTF-8.

See:

http://www.dinkumware.com/manuals/?manual=compleat&page=wstring.html

which is also part of our upgrade library.

> Are there UTF-16/UTF-8 conversions that are already part of or being
> considered for inclusion in the STL standard?

The wstring header described above has been proposed for the
next version of the C++ Standard.

> Is there an Intel/Windows based STL implementation which currently
> provides UTF-16/UTF-8 conversion?

Just our upgrade library, AFAIK.

> More generally, STL currently focuses on char/wchar_t types and refers
> to them more or less consistently as narrow versus wide type (see
> std::ios::widen()).
>
> Does the STL standard currently encompass any terminology to
> accommodate UTF-8 (a sort of combination of narrow and wide) or is the
> standards committee considering anything along these lines?

The committee is considering several approaches, but hasn't settled
on a given one yet.

> Lastly, the UTF-16/UTF-8 conversions are well-known and relatively
> simple so I have considered writing my own std::codecvt
> specialization.
>
> However, there is no clearly delineated type which could stand for
> UTF-8. I believe unsigned char is taken up (in Windows and VC++
> anyway) for MBCS, and typedefs are not real types, only aliases and
> thus do not allow a separate specialization.

You can use any of the char types for UTF-8. No need to store
UTF-8 in its own type. But see below.

> Is there another technique or a language enhancement on the horizon
> which would address this specialization limitation?

That too is being discussed in the C++ committee.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

Lourens Veen

Mar 4, 2007, 4:55:18 PM
Eugene Gershnik wrote:

> Philip wrote:
>
>> Does the STL standard currently encompass any terminology to
>> accommodate UTF-8 (a sort of combination of narrow and wide) or is
>> the standards committee considering anything along these lines?
>
> I am not qualified to comment on the standard but UTF-8 is not a
> "combination of narrow and wide". It is precisely one more kind of
> narrow character. There is no fundamental difference between
> Shift-JIS for example and UTF-8. Both encode a given logical
> character as a sequence of one or more byte-sized units.

But there _is_ a fundamental difference between encodings that use the
same size code for each character and encodings that use variable
length codes. I think of a UTF-8 string as a wide (UCS-4 or UTF-32)
string stored in a compressed format. That seems to be more natural
than looking at it as a narrow (ASCII) string where some of the
characters are really only partial characters.

If a UTF-8 string is to be stored in an array (or std::vector, or
std::basic_string) of char, then it should be possible to store a
UTF-8 character in a char. Which it isn't.

Lourens



Eugene Gershnik

Mar 5, 2007, 1:55:33 AM
Lourens Veen wrote:
> Eugene Gershnik wrote:
>
>> Philip wrote:
>>
>>> Does the STL standard currently encompass any terminology to
>>> accommodate UTF-8 (a sort of combination of narrow and wide) or is
>>> the standards committee considering anything along these lines?
>>
>> I am not qualified to comment on the standard but UTF-8 is not a
>> "combination of narrow and wide". It is precisely one more kind of
>> narrow character. There is no fundamental difference between
>> Shift-JIS for example and UTF-8. Both encode a given logical
>> character as a sequence of one or more byte-sized units.
>
> But there _is_ a fundamental difference between encodings that use the
> same size code for each character and encodings that use variable
> length codes.

Definitely. Note, however, that variable-length encodings have been around
for a long time, as anybody who has had to deal with Far Eastern languages
knows. There is nothing new in this idea.

> I think of a UTF-8 string as a wide (UCS-4 or UTF-32)
> string stored in a compressed format.

Which is precisely the same as any other "MBCS" encodings people have been
using for a long time.

> That seems to be more natural
> than looking at it as a narrow (ASCII) string where some of the
> characters are really only partial characters.

Natural is in the eye of the beholder. If you expect a C++ char to stand for
a logical character then you definitely have a problem. But that is a bad
idea regardless of Unicode, UTF-8, etc.

> If a UTF-8 string is to be stored in an array (or std::vector, or
> std::basic_string) of char, then it should be possible to store a
> UTF-8 character in a char. Which it isn't.

Sorry, but this is a non sequitur. You can store a UTF-8 string in an array
of chars, and that doesn't imply that a single char should have any definite
meaning. A char is just a small integer as far as the C and C++ languages
are concerned. Any meaning you attach to it is up to you.

--
Eugene

Sebastian Redl

Mar 5, 2007, 1:54:55 AM

On Sun, 4 Mar 2007, Lourens Veen wrote:

> Eugene Gershnik wrote:
>
> But there _is_ a fundamental difference between encodings that use the
> same size code for each character and encodings that use variable
> length codes. I think of a UTF-8 string as a wide (UCS-4 or UTF-32)
> string stored in a compressed format. That seems to be more natural
> than looking at it as a narrow (ASCII) string where some of the
> characters are really only partial characters.

Perhaps it is more natural, but it is not how the C++ standard uses the
terms. To the C++ standard, narrow encodings are those that have char as
their minimal unit, while wide encodings have wchar_t as their minimal
unit. The minimal unit of UTF-8 is char, so it is a narrow encoding.

> If a UTF-8 string is to be stored in an array (or std::vector, or
> std::basic_string) of char, then it should be possible to store a
> UTF-8 character in a char. Which it isn't.

I don't understand this reasoning. Sounds like a non-sequitur to me.

Sebastian Redl

Ulrich Eckhardt

Mar 5, 2007, 5:37:11 AM
Philip wrote:
> I am using MS VC++ 7.1 (Visual Studio 2003) and working in a fully
> Unicode application targeted for Far Eastern language support.
>
> The STL stream constructors

The STL doesn't include any streams; it consists only of containers,
iterators and algorithms. The C++ streams come instead from the IOStreams
library. Strictly speaking, neither is standard C++ as such, though the C++
standard was heavily influenced by both; in fact they were mostly
incorporated. When you refer to the "STL standard", I assume you mean the
C++ standard.

> and open functions require the filename be provided as a narrow (char)
> string. Most operating systems now support Unicode paths and filenames
> as UTF-8 strings, which can be represented as char strings.
>
> I would like to pass UTF-8 strings to the STL stream constructors and
> open functions, in order to support Far Eastern language filenames.

You can't. As pointed out, many OSes don't support UTF-8, and the string
passed to the ctor has implementation-defined meaning anyway. IMHO the
best way around this is to simply create a function

void open_file( ofstream& out, std::string const& utf_8_filename);

which is then implemented in a platform/compiler-dependent way. Typically,
this would delegate to the normal ofstream::open, or use the generally
present way to create an fstream from a FILE* or other platform-dependent
facilities.
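
A sketch of such a wrapper (the Windows branch assumes a UTF-8 to UTF-16 helper, declared but not shown here, and VC 8's wchar_t open overload; on Unix, filenames are opaque byte strings and UTF-8 passes straight through):

```cpp
#include <fstream>
#include <string>

#ifdef _WIN32
// Hypothetical helper, not shown: converts UTF-8 to the UTF-16 that
// Windows' wchar_t file APIs expect.
std::wstring utf8_to_utf16(std::string const& utf8);
#endif

// Opens 'out' from a UTF-8 encoded filename, hiding the
// platform-dependent part behind one function.
void open_file(std::ofstream& out, std::string const& utf_8_filename)
{
#ifdef _WIN32
    // VC 8's library provides an open() overload taking wchar_t*.
    out.open(utf8_to_utf16(utf_8_filename).c_str());
#else
    // Most Unix filesystems treat filenames as byte strings, so a
    // UTF-8 name can be passed through unchanged.
    out.open(utf_8_filename.c_str());
#endif
}
```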

> The STL standard requires std::codecvt functions to support conversion
> to and from Unicode UTF-16 wchar_t and MBCS char.

No it doesn't. Nothing requires any particular meaning or interpretation for
wchar_t or char. Further, std::codecvt is notoriously bad at handling
encodings with multiple internal elements per character, like UTF-8, UTF-16
and, considering combining glyphs like accents, Unicode in general. The
latter point in particular is often simply ignored, sometimes leading to
subtle problems.

> However, there is no clearly delineated type which could stand for
> UTF-8. I believe unsigned char is taken up (in Windows and VC++
> anyway) for MBCS, and typedefs are not real types, only aliases and
> thus do not allow a separate specialization.

Again: the encoding is a convention rather than anything that is mandated.
That means that 'char' is well suited to holding UTF-8 data, just
as 'unsigned char' would be.

Uli

--
Sator Laser GmbH
Geschäftsführer: Ronald Boers Steuernummer: 02/858/00757
Amtsgericht Hamburg HR B62 932 USt-Id.Nr.: DE183047360

James Kanze

Mar 5, 2007, 5:39:47 AM
Lourens Veen wrote:
> Eugene Gershnik wrote:

> > Philip wrote:

> >> Does the STL standard currently encompass any terminology to
> >> accommodate UTF-8 (a sort of combination of narrow and wide) or is
> >> the standards committee considering anything along these lines?

> > I am not qualified to comment on the standard but UTF-8 is not a
> > "combination of narrow and wide". It is precisely one more kind of
> > narrow character. There is no fundamental difference between
> > Shift-JIS for example and UTF-8. Both encode a given logical
> > character as a sequence of one or more byte-sized units.

> But there _is_ a fundamental difference between encodings that use the
> same size code for each character and encodings that use variable
> length codes. I think of a UTF-8 string as a wide (UCS-4 or UTF-32)
> string stored in a compressed format. That seems to be more natural
> than looking at it as a narrow (ASCII) string where some of the
> characters are really only partial characters.

Natural or not, it doesn't correspond to reality. UTF-8 is
a multibyte encoding; multibyte encodings have been around for
years (at least 40), and were officially recognized by the
first version of the C standard.

It's sometimes confusing, of course, because a lot of the C
functions which deal with characters (e.g. the functions in
<ctype.h>) don't really recognize this fact. Chalk it up to
historical reasons.

> If a UTF-8 string is to be stored in an array (or std::vector, or
> std::basic_string) of char, then it should be possible to store a
> UTF-8 character in a char. Which it isn't.

I don't see the relationship. Nor how UTF-8 is different from
any other multibyte character set. std::vector is just an array
of whatever's, with no semantics associated with what it
contains; std::string is not really much more, either (except
that the whatever's have to be PODs). UTF-8 defines its
encoding format as a sequence of bytes, so any container which
can contain a sequence of bytes (char's, in C++ parlance) is
appropriate. None have specific support for UTF-8 encoding, but
then, none have specific support for US ASCII encoding either.

--
James Kanze (GABI Software) email:james...@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Pete Becker

Mar 5, 2007, 9:54:23 AM
Eugene Gershnik wrote:

> Lourens Veen wrote:
>
>> I think of a UTF-8 string as a wide (UCS-4 or UTF-32)
>> string stored in a compressed format.
>
> Which is precisely the same as any other "MBCS" encodings people have been
> using for a long time.
>

Not quite. With UTF-8 you can always tell from the value of a byte
whether it is part of a multi-byte character. Other encodings don't have
this property, making it much more difficult to move around (especially
backwards) in a string.
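
Backwards movement, for instance, reduces to skipping continuation bytes, which in UTF-8 always match the bit pattern 10xxxxxx. A sketch (assuming pos > 0 and a well-formed string):

```cpp
#include <cstddef>

// Moves from an offset just past a character back to the start of the
// previous UTF-8 character: skip continuation bytes (10xxxxxx); the
// first byte that is not one is the lead byte. Assumes pos > 0 and
// well-formed UTF-8.
std::size_t prev_char(const char* s, std::size_t pos)
{
    do {
        --pos;
    } while (pos > 0 && (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80);
    return pos;
}
```

In a legacy double-byte encoding like Shift-JIS this is impossible without rescanning from the start of the string, because a trail byte's value can also be a valid lead byte.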

--

-- Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com)
Author of "The Standard C++ Library Extensions: a Tutorial and
Reference." (www.petebecker.com/tr1book)

Eugene Gershnik

Mar 5, 2007, 4:54:21 PM
On Mar 5, 6:54 am, Pete Becker <p...@versatilecoding.com> wrote:
> Eugene Gershnik wrote:
> > Lourens Veen wrote:
>
> >> I think of a UTF-8 string as a wide (UCS-4 or UTF-32)
> >> string stored in a compressed format.
>
> > Which is precisely the same as any other "MBCS" encodings people have been
> > using for a long time.
>
> Not quite. With UTF-8 you can always tell from the value of a byte
> whether it is part of a multi-byte character. Other encodings don't have
> this property, making it much more difficult to move around (especially
> backwards) in a string.

True. Another great feature is that UTF-8 is backward compatible with
ASCII as far as search operations are concerned. That is, strchr() or
manual iteration will work as long as you search for something within
the ASCII range. This can make a lot of English-only code work with
UTF-8 where it would fail with any other multibyte encoding. (How many
times have you seen people splitting a path by looking for '/' or '\\'?)
About the only dangerous operation on a UTF-8 string is splitting it
at an arbitrary position. For example, code that naively truncates a
string at 20 chars and adds "..." for display will break.
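
A boundary-safe truncation is easy to write, though. A sketch (it avoids splitting a multi-byte sequence, but note it knows nothing about combining marks, so it can still strip an accent off the last character):

```cpp
#include <cstddef>
#include <string>

// Truncates a UTF-8 string to at most max_bytes without cutting a
// multi-byte sequence in half: if the cut lands on a continuation
// byte (10xxxxxx), back up to the preceding character boundary.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes)
{
    if (s.size() <= max_bytes)
        return s;
    std::size_t cut = max_bytes;
    while (cut > 0 &&
           (static_cast<unsigned char>(s[cut]) & 0xC0) == 0x80)
        --cut;
    return s.substr(0, cut);
}
```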

In any event the important thing is that code that is multibyte
correct will work absolutely fine with UTF-8.

--
Eugene



Clark Cox

Mar 6, 2007, 1:53:45 AM
On 2007-03-05 07:54:21 -0800, "Eugene Gershnik" <gers...@hotmail.com> said:

> On Mar 5, 6:54 am, Pete Becker <p...@versatilecoding.com> wrote:
>> Eugene Gershnik wrote:
>>> Lourens Veen wrote:
>>
>>>> I think of a UTF-8 string as a wide (UCS-4 or UTF-32)
>>>> string stored in a compressed format.
>>
>>> Which is precisely the same as any other "MBCS" encodings people have been
>>> using for a long time.
>>
>> Not quite. With UTF-8 you can always tell from the value of a byte
>> whether it is part of a multi-byte character. Other encodings don't have
>> this property, making it much more difficult to move around (especially
>> backwards) in a string.
>
> True. Another great feature is that UTF-8 is backward compatible with
> ASCII as far as search operations are concerned. That is strchr() or
> manual iteration will work as long as you search for something within
> the ASCII range.

Not entirely true. If I search for the character 'e' in the string
"acuté", it is as likely that the character will be found as that it
won't. When encoding the above string in UTF-8, there are
two possibilities (due to decomposition):

"acute\xCC\x81" ('e' will be found at index 4)
"acut\xC3\xA9" ('e' will not be found)
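
Spelled out as byte strings, with std::string::find playing the role of strchr, the two forms behave like this:

```cpp
#include <string>

// The same displayed text ("acute" with an acute accent on the final
// letter) in its two Unicode forms:
const std::string decomposed  = "acute\xCC\x81"; // 'e' (U+0065) + U+0301
const std::string precomposed = "acut\xC3\xA9";  // one code point, U+00E9

// A naive byte search for 'e' treats them differently:
//   decomposed.find('e')  finds the 'e' at index 4
//   precomposed.find('e') finds nothing (std::string::npos)
```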

> This can make a lot of English only code work with
> UTF-8 where it would fail with any other multibyte encoding. (How many
> times have you seen people splitting a path by looking at '/' or '\\'?)
> About the only dangerous operation with UTF-8 string is splitting it
> at arbitrary location. For example code that naively truncates a
> string at 20 chars and adds "..." for display will break.
>
> In any event the important thing is that code that is multibyte
> correct will work absolutely fine with UTF-8.


--
Clark S. Cox III
clar...@gmail.com

Lourens Veen

Mar 6, 2007, 1:52:28 AM
James Kanze wrote:

Just out of interest, how did the designers of C++ end up designing a
std::string that doesn't work with multibyte encodings, then?

>> If a UTF-8 string is to be stored in an array (or std::vector, or
>> std::basic_string) of char, then it should be possible to store a
>> UTF-8 character in a char. Which it isn't.
>
> I don't see the relationship. Nor how UTF-8 is different from
> any other multibyte character set. std::vector is just an array
> of whatever's, with no semantics associated with what it
> contains; std::string is not really much more, either (except
> that the whatever's have to be PODs). UTF-8 defines its
> encoding format as a sequence of bytes, so any container which
> can contain a sequence of bytes (char's, in C++ parlance) is
> appropriate. None have specific support for UTF-8 encoding, but
> then, none have specific support for US ASCII encoding either.

Sorry, I didn't put that right, and the way I wrote it it does indeed
not make any sense. Here's another try.

A string is a sequence of characters. These characters can be chosen
from a very limited set (everything that can be encoded in 7-bit
ASCII) or a very extensive set (unicode). A class represents a string
if its interface is based on this model.

A sequence of (assume 8-bit) chars models a sequence of characters if
you use ASCII as the representation. A sequence of chars does not
model a sequence of characters if you use UTF-8, although a sequence
of characters can be stored in a sequence of bytes using e.g. UTF-8.

Indeed, std::string will not give you the length of a string in
characters if you use it to store a UTF-8 encoded string, and it
won't give you the character at position n if you use operator[](n).
So, std::string is not a string at all, it is a sequence of chars. It
just happens to be a string too if you use a character encoding that
uses a single char for each character, such as ASCII or one of the
ISO-8859 variants.

And yet, according to the C++ standard (according to Eugene; I'm not
very familiar with it), UTF-8 is just another narrow encoding. Well,
if it were, you'd think that std::string would still model a string
when used with it!

Lourens

James Kanze

Mar 6, 2007, 5:16:47 AM
Pete Becker wrote:
> Eugene Gershnik wrote:
> > Lourens Veen wrote:

> >> I think of a UTF-8 string as a wide (UCS-4 or UTF-32)
> >> string stored in a compressed format.

> > Which is precisely the same as any other "MBCS" encodings
> > people have been using for a long time.

> Not quite. With UTF-8 you can always tell from the value of a
> byte whether it is part of a multi-byte character. Other
> encodings don't have this property, making it much more
> difficult to move around (especially backwards) in a string.

I think a lot of other multi-byte encodings do have this
feature. What UTF-8 has that I've not seen elsewhere is the
possibility to identify in addition whether a given byte is the
first byte of a sequence, or one of the following bytes. This
makes operations like counting the number of characters very
simple (just count the bytes where (*p & 0xC0) != 0x80), and
allows guaranteed resynchronization without looking outside the
character.
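
The counting trick mentioned above, as a sketch:

```cpp
#include <cstddef>
#include <string>

// Counts code points in a well-formed UTF-8 string: every byte that
// is not a continuation byte (10xxxxxx) starts a new character, so
// counting bytes where (b & 0xC0) != 0x80 counts the characters.
std::size_t utf8_length(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}
```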

--
James Kanze (GABI Software) email:james...@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Lourens Veen

Mar 6, 2007, 5:11:40 AM
Lourens Veen wrote:
> A string is a sequence of characters. These characters can be chosen
> from a very limited set (everything that can be encoded in 7-bit
> ASCII) or a very extensive set (unicode).

And now it sounds exclusive. What I meant is that they're chosen from
some set, sometimes a smaller one, sometimes a larger one, but the
point is that there is some abstract set of characters. Whether that
is unicode, or the set that can be encoded in 7-bit ASCII, or
something else doesn't really matter.

Pete Becker

Mar 6, 2007, 11:49:57 AM
Clark Cox wrote:
> On 2007-03-05 07:54:21 -0800, "Eugene Gershnik" <gers...@hotmail.com>
> said:
>
>>
>> True. Another great feature is that UTF-8 is backward compatible with
>> ASCII as far as search operations are concerned. That is strchr() or
>> manual iteration will work as long as you search for something within
>> the ASCII range.
>
> Not entirely true. If I search for the character 'e' in the string
> "acuté", it is equally possible that the character will be found as it
> is that it won't. When encoding the above string in UTF-8, there are
> two possibilities (due to decomposition):
>

Just a small clarification: that's a consequence of Unicode, not
specifically UTF-8. The same thing occurs with any encoding of Unicode
characters. There are two different ways of writing that final letter.
It can be written with the single code point 0x00E9 (LATIN SMALL LETTER E
WITH ACUTE), or as two code points, 0x0065 (LATIN
SMALL LETTER E) followed by 0x0301 (COMBINING ACUTE ACCENT).

--

-- Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com)
Author of "The Standard C++ Library Extensions: a Tutorial and
Reference." (www.petebecker.com/tr1book)

P.J. Plauger

Mar 6, 2007, 11:56:40 AM
"Lourens Veen" <lou...@rainbowdesert.net> wrote in message
news:d3008$45ec8ce6$8259a2fa$18...@news1.tudelft.nl...

For the same reason we didn't make any special provision for Roman
numerals. std::string works quite nicely with UTF-8 text -- you can
store it, search it, copy it about, etc. in many useful ways. The
class is not aware of the inner structure of UTF-8, however, any more
than it is aware of the inner structure of sentences of text. Or
Roman numerals, for that matter.

You mean it's not a string as *you'd* define it in this particular
context. And yet millions of programmers have used it as a "string"
of some sort that meets their needs.

> It
> just happens to be a string too if you use a character encoding that
> uses a single char for each character, such as ASCII or one of the
> ISO-8859 variants.

Or if you use it as a sequence of bytes, or ....

> And yet, according to the C++ standard (according to Eugene, I'm not
> very familiar with it), UTF-8 is just another narrow encoding. Well
> if it was, you'd think that std::string would still model a string if
> used with it!

And how is it to know that the contents of the string *this time*
are UTF-8, as opposed to JIS, Shift-JIS, EUC, UTF-16LE, etc. etc.?
These are all encodings for character sequences that have been widely
used over the past decade or so.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


James Kanze

Mar 6, 2007, 3:07:52 PM


What do you mean by: doesn't work? I use std::string for my
UTF-8 sequences.

A more valid question might be why C++ has no real data type
which is an abstraction for text. There is nothing more than a
collection of ad hoc functions (some of which can only be used
on C-style arrays or std::vector<char>, but not on std::string),
and they only more or less work.

I suspect that the reason is that even as late as 1998 (or, for
that matter, today), we don't really know what is needed.

> >> If a UTF-8 string is to be stored in an array (or std::vector, or
> >> std::basic_string) of char, then it should be possible to store a
> >> UTF-8 character in a char. Which it isn't.

> > I don't see the relationship. Nor how UTF-8 is different from
> > any other multibyte character set. std::vector is just an array
> > of whatever's, with no semantics associated with what it
> > contains; std::string is not really much more, either (except
> > that the whatever's have to be PODs). UTF-8 defines its
> > encoding format as a sequence of bytes, so any container which
> > can contain a sequence of bytes (char's, in C++ parlance) is
> > appropriate. None have specific support for UTF-8 encoding, but
> > then, none have specific support for US ASCII encoding either.

> Sorry, I didn't put that right, and the way I wrote it it does indeed
> not make any sense. Here's another try.

> A string is a sequence of characters. These characters can be chosen
> from a very limited set (everything that can be encoded in 7-bit
> ASCII) or a very extensive set (unicode). A class represents a string
> if its interface is based on this model.

That's one definition. (I'd qualify it as a text string.) And
standard C++ has no class which represents a text string, according
to this definition. Neither does C, nor Java. (Or a
lot of other languages, most of which I don't know.)

As I said above, I'm not even really certain that we know what
such a class should look like, even today.

> A sequence of (assume 8-bit) chars models a sequence of characters if
> you use ASCII as the representation.

It can. Provided you interpret it as such in your code.

> A sequence of chars does not
> model a sequence of characters if you use UTF-8, although a sequence
> of characters can be stored in a sequence of bytes using e.g. UTF-8.

It can. Provided you interpret it as such in your code.

std::vector<char> and std::basic_string<char> are simply
containers of char. What you do with the contents is up to you.

C++ does give you very little support for multibyte characters.
Things like ctype<char> don't work with them, for example. But
I don't expect that to change soon, because I don't see any real
consensus with regards to what is required. (If there is a
consensus today, which I doubt, it is more that you shouldn't be
using multibyte characters internally anyway---UTF-8 gets
translated into UTF-32 at the IO interface level.)
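
That boundary translation is small enough to sketch (assuming well-formed input; a real decoder must also reject invalid and overlong sequences):

```cpp
#include <cstdint>

// Decodes one UTF-8 sequence starting at *s into a UTF-32 code point,
// advancing s past it. Assumes well-formed input, no validation.
std::uint32_t decode_one(const unsigned char*& s)
{
    std::uint32_t cp = *s++;
    int extra = (cp < 0x80) ? 0          // 0xxxxxxx: ASCII, no trail bytes
              : (cp < 0xE0) ? 1          // 110xxxxx: 2-byte sequence
              : (cp < 0xF0) ? 2          // 1110xxxx: 3-byte sequence
              :               3;         // 11110xxx: 4-byte sequence
    static const std::uint32_t lead_mask[] = { 0x7F, 0x1F, 0x0F, 0x07 };
    cp &= lead_mask[extra];              // keep the payload bits of the lead
    while (extra-- > 0)                  // fold in 6 bits per trail byte
        cp = (cp << 6) | (*s++ & 0x3F);
    return cp;
}
```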

> Indeed, std::string will not give you the length of a string in
> characters if you use it to store a UTF-8 encoded string, and it
> won't give you the character at position n if you use operator[](n).

More importantly: std::string will not do anything with
characters. It is a container of char.

> So, std::string is not a string at all, it is a sequence of chars.

It is not a text string. I see no problem with calling the
sequence a string, but as you say, it is not a string of
characters, but rather a string of char (or bytes, if you
prefer). Just as std::basic_string<double> is a string of
doubles. (I'd say that this is almost implicit from the moment
string is a template.)

> It
> just happens to be a string too if you use a character encoding that
> uses a single char for each character, such as ASCII or one of the
> ISO-8859 variants.

Even then, it's only a text string if you, as a user, interpret
it as such.

> And yet, according to the C++ standard (according to Eugene, I'm not
> very familiar with it), UTF-8 is just another narrow encoding. Well
> if it was, you'd think that std::string would still model a string if
> used with it!

UTF-8 is just another encoding. Not even necessarily supported
by a C++ implementation.

--
James Kanze (GABI Software) email:james...@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

--

Eugene Gershnik
Mar 6, 2007, 5:16:46 PM
On Mar 5, 10:53 pm, Clark Cox <clarkc...@gmail.com> wrote:
> If I search for the character 'e' in the string
> "acuté", it is equally possible that the character will be found as it
> is that it won't. When encoding the above string in UTF-8, there are
> two possibilities (due to decomposition):
>
> "acute\xCC\x81" ('e' will be found at index 4)
> "acut\xC3\xA9" ('e' will not be found)

This is a general Unicode problem that has nothing to do with UTF-8.
Even if you use single unit encoding such as UTF-32 this problem will
still be there.

In general there is no easy answer to this. There are two independent
issues here: whether the string "acuté" contains 'e', and how to make
string operations deterministic. The first one depends on context, and
it is up to the application to decide. As a practical example, Windows
Active Directory will allow a user called acuté to log on typing
acute. On the other hand, you probably don't want this behavior from a
dictionary ;-)
The second problem is usually solved by storing all strings in the
same normal form within the application code and converting on I/O.

--
Eugene

--

Clark Cox
Mar 6, 2007, 5:13:45 PM
On 2007-03-06 02:49:57 -0800, Pete Becker <pe...@versatilecoding.com> said:

> Clark Cox wrote:
>> On 2007-03-05 07:54:21 -0800, "Eugene Gershnik" <gers...@hotmail.com>
>> said:
>>
>>>
>>> True. Another great feature is that UTF-8 is backward compatible with
>>> ASCII as far as search operations are concerned. That is strchr() or
>> manual iteration will work as long as you search for something within
>>> the ASCII range.
>>
>> Not entirely true. If I search for the character 'e' in the string
>> "acuté", it is equally possible that the character will be found as it
>> is that it won't. When encoding the above string in UTF-8, there are
>> two possibilities (due to decomposition):
>>
>
> Just a small clarification: that's a consequence of Unicode, not
> specifically UTF-8.

Yes, but UTF-8 is, by definition, an encoding of Unicode.

> The same thing occurs with any encoding of Unicode
> characters. There are two different ways of writing that final letter.
> It can be written with a single code point 0x00E1 (LATIN SMALL LETTER A
> WITH ACUTE), and it can be written as two code points, 0x0061 (LATIN
> SMALL LETTER A) followed by 0x0301 (COMBINING ACUTE ACCENT).

That is exactly my point. The claim that searching for ASCII characters
within a UTF-8 sequence with strchr will consistently work is clearly
false.

--
Clark S. Cox III
clar...@gmail.com

Philip
Mar 6, 2007, 7:58:47 PM
PJ,

Thanks very much for the informative reply. It was exactly what I was
looking for.

I have some follow-up questions about standards committee noodlings...

On Mar 4, 2:44 pm, "P.J. Plauger" <p...@dinkumware.com> wrote:

Re: wstring

> The wstring header described above has been proposed for the
> next version of the C++ Standard.

It's a good-looking class. Does it have a likelihood of getting into
C++0x?

Re: Narrow versus wide nomenclature

> The committee is considering several approaches, but hasn't settled
> on a given one yet.

Can you give me an idea of or point me to the proposals under
consideration?

Re: typealias versus typedef

> That too is being discussed in the C++ committee.

Again, can you give me an idea of or point me to the proposals under
consideration?

Re: New Question

When and how will the standard support Unicode files on disk with
BOMs, etc.?

Many thanks for the cogent and utterly on-topic reply.

To others:

I am a little loose about STL versus the standard library and matters
like that, so forgive the inaccuracies in my original post.

I still stick with my original idea that the widen and narrow function
nomenclature in the iostream classes leaves no room for UTF-8 and
perhaps should be expanded (I see a great future for UTF-8). After
all, that nomenclature is designed (I believe) to match the
differentiation between "external" and "internal" representations (see
Josuttis 14.4, p. 720), which with the common advent of Unicode files on
disk is perhaps out-of-date.

So how about strungout/stringout for UTF-8? <grin>

Philip

Lourens Veen
Mar 6, 2007, 8:11:19 PM
P.J. Plauger wrote:

> "Lourens Veen" <lou...@rainbowdesert.net> wrote in message
> news:d3008$45ec8ce6$8259a2fa$18...@news1.tudelft.nl...
>

[...removed, answered in my reply to James' post...]

>> And yet, according to the C++ standard (according to Eugene, I'm
>> not very familiar with it), UTF-8 is just another narrow encoding.
>> Well if it was, you'd think that std::string would still model a
>> string if used with it!
>
> And how is it to know that the contents of the string *this time*
> are UTF-8, as opposed to JIS, Shift-JIS, EUC, UTF-16LE, etc. etc.?
> These are all encodings for character sequences that have been
> widely used over the past decade or so.

It could keep track of the encoding. Of course, it would then also
have to be able to convert between encodings, join strings with
different encodings, and so on. It wouldn't just be a string class,
it would be a text management subsystem. But having a text management
subsystem is okay. If you have iconv, a standard IO stream library,
and standard string (and perhaps rope?) classes, then they might as
well be an integrated whole.

Lourens

Pete Becker
Mar 6, 2007, 8:18:50 PM
{ Topic drift: I feel we're too deep into the details of Unicode with
little C++ content. Could we try more to stay on topic? Thanks. -mod/sk}


Clark Cox wrote:
> On 2007-03-06 02:49:57 -0800, Pete Becker <pe...@versatilecoding.com> said:
>
>> Clark Cox wrote:
>>> On 2007-03-05 07:54:21 -0800, "Eugene Gershnik" <gers...@hotmail.com>
>>> said:
>>>
>>>>
>>>> True. Another great feature is that UTF-8 is backward compatible with
>>>> ASCII as far as search operations are concerned. That is strchr() or
>>>> manual iteration will work as long as you search for something within
>>>> the ASCII range.
>>>
>>> Not entirely true. If I search for the character 'e' in the string
>>> "acuté", it is equally possible that the character will be found as it
>>> is that it won't. When encoding the above string in UTF-8, there are
>>> two possibilities (due to decomposition):
>>>
>>
>> Just a small clarification: that's a consequence of Unicode, not
>> specifically UTF-8.
>
> Yes, but UTF-8 is, by definition, an encoding of Unicode.
>

Nevertheless, the problem you're talking about is in Unicode, not in UTF-8.

>> The same thing occurs with any encoding of Unicode
>> characters. There are two different ways of writing that final letter.
>> It can be written with a single code point 0x00E1 (LATIN SMALL LETTER A
>> WITH ACUTE), and it can be written as two code points, 0x0061 (LATIN
>> SMALL LETTER A) followed by 0x0301 (COMBINING ACUTE ACCENT).
>
> That is exactly my point. The claim that searching for ASCII characters
> within a UTF-8 sequence with strchr will consistently work is clearly
> false.
>

Your paraphrase makes a broader claim than the original statement did.

--

-- Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com)
Author of "The Standard C++ Library Extensions: a Tutorial and
Reference." (www.petebecker.com/tr1book)

Lourens Veen
Mar 7, 2007, 11:53:54 AM
James Kanze wrote:
>
> A more valid question might be why C++ has no real data type
> which is an abstraction for text. Nothing more than a
> collection of ad hoc functions (some of which can only be used
> on C style arrays or std::vector<char>, but not on std::string),
> which only more or less work.
>
> I suspect that the reason is that even as late as 1998 (or, for
> that matter, today), we don't really know what is needed.

Okay, that is what I am talking about then. I'm saying that
std::string does not do what I'd expect it to do, you're saying that
that is because it was never designed to do that, and that I
misunderstand the purpose of std::string. Fair enough, we're looking
at the same thing from different perspectives.

I do have some ideas in this direction. I have an MSc thesis to finish
first, but after that (in a few weeks) I'll put a prototype string
library together and post it here to show my ideas. And then you and
the others can explain to me all the stuff that will be wrong with it
and what a fool I was to think that I could do it, and then we'll all
have learnt something :). Or at least I will have :).

Lourens

P.J. Plauger
Mar 7, 2007, 5:40:55 PM
"Philip" <Mont...@Hotmail.com> wrote in message
news:1173210839.3...@64g2000cwx.googlegroups.com...

> PJ,
>
> Thanks very much for the informative reply. It was exactly what I was
> looking for.
>
> I have some follow-up questions about standards committee noodlings...
>
> On Mar 4, 2:44 pm, "P.J. Plauger" <p...@dinkumware.com> wrote:
>
> Re: wstring
>
>> The wstring header described above has been proposed for the
>> next version of the C++ Standard.
>
> It's a good-looking class. Does it have a likelihood of getting into
> C++0x?

I think so, yes.

> Re: Narrow versus wide nomenclature
>
>> The committee is considering several approaches, but hasn't settled
>> on a given one yet.
>
> Can you give me an idea of or point me to the proposals under
> consideration?

I don't know how much of the mailings is visible to civilians,
but you should find at least some information about work in
progress at:

http://www.open-std.org/jtc1/sc22/wg21/

> Re: typealias versus typedef
>
>> That too is being discussed in the C++ committee.
>
> Again, can you give me an idea of or point me to the proposals under
> consideration?

As before.

> Re: New Question
>
> When and how will the standard support Unicode files on disks with
> BOMs etc.

We have codecvt facets that handle a host of Unicode encodings,
with optional BOMs. It's too soon to tell whether something
like that will be mandated, but we stand ready with good
descriptions if the committee is interested.

> Many thanks for the cogent and utterly on-topic reply
>
> To others:
>
> I am a little loose about STL versus the standard library and matters
> like that so forgive the inaccuracies in my original post.
>
> I still stick with my original idea that the widen and narrow function
> nomenclature in the io-stream classes leaves no room for UTF-8 and
> perhaps should be expanded (I see a great future for UTF-8). After
> all that nomenclature is designed (I believe) to match the
> differentiation between "external" and "internal" representations (see
> Josuttis 14.4 p 720), which with the common advent of Unicode files on
> disk is perhaps out-of-date.
>
> So how about strungout/stringout for UTF-8? <grin>

C and C++ really deal with three types of character sequences:

1) single byte
2) multibyte
3) wide characters

There's much endemic confusion about what each of these mean, and
how they interact.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

--
