[boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]


Matus Chochlik

Jan 19, 2011, 5:33:02 AM
to bo...@lists.boost.org
The string-encoding-related discussion boils down
for me to the following: what will string handling
in C++ look like in the (maybe not immediate) future?

*Scenario A:*

We will pick a widely accepted char-based encoding
that can handle all the writing scripts and alphabets
we can think of, has enough reserved space for
future additions or is easily extensible, and use it
with std::string, which will become the one and only
text string 'container' class.

All the wstrings, wxString, Qstrings, utf8strings, etc. will
be abandoned. All the APIs using ANSI or UCS-2 will
be slowly phased out with the help of convenience
classes like ansi_str_t and ucs2_t that will be made
obsolete and finally dropped (after the transition).

*Scenario B:*

We will add yet another string class named utf8_t to the
already crowded set named above. Then:

library a: will stick to the ANSI encodings with std::strings.
It has worked in the past, it will work in the future, right?

library b[oost]: will use utf8_t instead and provide the (seamless
and straightforward) conversions between utf8_t and std::string
and std::wstring. Some (many, but not all) others will follow

library c: will use std::strings with utf-8
...
library [.]n[et]: will use String class
...
library q[t]: will use Qstrings
...
library w[xWidgets]: will use wxStrings and wxChar*
library wi[napi]: will use TCHAR*
...
library z: will use const char* in an encoding agnostic way

Now an application using libraries [a..z] will become
the developer's nightmare. What string should he use for
the class members and constructor parameters? What to do
when the conversions do not work so seamlessly?

Also, half of the CPU time assigned to running that
application will be wasted on useless string transcoding.
And half of the memory will be occupied with useless
transcoding-related code and data.

*Scenario C:*

This is basically the status quo; a mix of the above.
A sad and unsatisfactory state of things.

*Consequences of A:*

- Interface breaking changes, which will require some fixing
in the library client code and some work in the libraries
themselves. These should be made as painless as possible
with *temporary* utilities or convenience classes that would,
for example, handle the transcoding from UTF-8 to UCS-2/UTF-16
in WINAPI and be no-ops on most POSIX systems.

- Silent introduction of bugs for those who still use std::string
for ANSI CP####. This is worse than the above and will require
some public-relations work on the part of Boost to make it
clear that using std::strings with ANSI may be an error since
Boost version x.y.z.

- We should finally accept the notion that one byte, word,
or dword != one character, that there are code points
and there are characters, that both of them can have
variable-length encodings, and devise tools to handle them
as such conveniently.

- Once we overcome the troubled period of transition
everything will be just great. No headaches related to
file encoding detection and transcoding.

Think about what will happen after we accept IPv6
and drop IPv4. The process will be painful, but
after it is done there will be no more NAT and co.,
and the whole network infrastructure will be simplified.

*Consequences of B:*

- No fixing of existing interfaces, which IMO means
no, or only very slow, movement to a single encoding.

- Creating another string class which, let us face it,
not everybody will accept, even with Boost's influence,
unless it becomes standard.

- We will abandon std::string and be stuck with utf8_t
which I *personally* already dislike :)

- People will probably start to use other programming
languages (although this may be FUD)

*Consequences of C:*

Here pick all the negatives of the above :)

*Note on the encoding to be used*

The best candidate for the widely-accepted and
extensible encoding vaguely mentioned above is IMO
UTF-8.

- It has been given a lot of thought

- It is an already widely accepted standard

- It is char-based so no need to switch
to std::basic_string<whatever_char_t>

- It is extensible, so once we have done the painful
transition we will not have to do it again. Currently
UTF-8 uses 1-4 (or 1-6) byte sequences to encode code
points, but the scheme is transparently extensible
to 1-N bytes (unlike UCS-X, and I'm not sure about
UTF-16/32).
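
For reference, a minimal sketch of that lead-byte scheme (my
illustration; it covers only the standard 1-4 byte range and, to
stay short, skips the range/surrogate checks real code needs):

    #include <cstdint>
    #include <string>

    // Encode one Unicode code point as UTF-8: the number of leading
    // 1-bits in the first byte gives the sequence length, and every
    // continuation byte has the form 10xxxxxx.
    std::string encode_utf8(std::uint32_t cp)
    {
        std::string out;
        if (cp < 0x80) {                        // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                // 4 bytes: 11110xxx and three trailers
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return out;
    }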

So,
[dark-sarcasm]
even if we dig out the Stargate or join the United
Federation of Planets and Captain Kirk, every time
he returns home, brings a truckload of new writing
scripts to support, UTF-8 will be able to handle it.

just my 0.02 strips of gold pressed latinum :)
[/dark-sarcasm]


Best regards,

Matus
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Alexander Lamaison

Jan 19, 2011, 7:16:59 AM
to bo...@lists.boost.org
On Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote:

> The string-encoding-related discussion boils down
> for me to the following: what will string handling
> in C++ look like in the (maybe not immediate) future.
>
> *Scenario A:*
>

[..]


> All the wstrings, wxString, Qstrings, utf8strings, etc. will
> be abandoned. All the APIs using ANSI or UCS-2 will
> be slowly phased out with the help of convenience
> classes like ansi_str_t and ucs2_t that will be made
> obsolete and finally dropped (after the transition).

This is simply not going to happen. How could MS even go about doing this
in Windows? It would make every single piece of Windows software
incompatible with the next version!

Alex

--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Matus Chochlik

Jan 19, 2011, 7:25:41 AM
to bo...@lists.boost.org
On Wed, Jan 19, 2011 at 1:16 PM, Alexander Lamaison <aw...@doc.ic.ac.uk> wrote:
> On Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote:
>
>> The string-encoding-related discussion boils down
>> for me to the following: what will string handling
>> in C++ look like in the (maybe not immediate) future.
>>
>> *Scenario A:*
>>
> [..]
>> All the wstrings, wxString, Qstrings, utf8strings, etc. will
>> be abandoned. All the APIs using ANSI or UCS-2 will
>> be slowly phased out with the help of convenience
>> classes like ansi_str_t and ucs2_t that will be made
>> obsolete and finally dropped (after the transition).
>
> This is simply not going to happen.  How could MS even go about doing this
> in Windows?  It would make every single piece of Windows software
> incompatible with the next version!

This is where the convenience classes would be used.
For Windows it may take a while to get rid of them,
and maybe a converter from std::string to the WINAPI
string will have to exist for a long time.

IMO even Microsoft is finally realizing that the dual
interface is crap and they will have to do something
about it. Many new additions to WINAPI already
use only WCHAR* and do not provide the ANSI version.

What this means for Boost is that we would be using
std::string with UTF-8, and when using WINAPI as a backend
(in Filesystem, Interprocess, etc.) we should, as Artyom
already suggested, use only the "wide-char interface".

Matus

Chad Nelson

Jan 19, 2011, 8:39:28 AM
to bo...@lists.boost.org
On Wed, 19 Jan 2011 11:33:02 +0100
Matus Chochlik <choc...@gmail.com> wrote:

> The string-encoding-related discussion boils down for me to the
> following: what will string handling in C++ look like in the
> (maybe not immediate) future.
>
> *Scenario A:*
>

> We will pick a widely-accepted char-based encoding [...] and use that


> with std::strings which will become the one and only text string
> 'container' class.
>
> All the wstrings, wxString, Qstrings, utf8strings, etc. will be
> abandoned. All the APIs using ANSI or UCS-2 will be slowly phased out
> with the help of convenience classes like ansi_str_t and ucs2_t that
> will be made obsolete and finally dropped (after the transition).

Sounds like a little slice of heaven to me. Though you'll still have
the pesky problem of having to verify that the UTF-8 code is valid all
the time. More on that below.

> *Scenario B:*
>
> We will add yet another string class named utf8_t to the already

> crowded set named above. [...] Now an application using libraries


> [a..z] will become the developer's nightmare. What string should he
> use for the class members and constructor parameters? What to do when
> the conversions do not work so seamlessly?

How is that different from what we've got today, except that the utf*_t
classes will make converting to and from different string types, and
validating the UTF code, a little easier and more automatic?

> Also half of the cpu time assigned to running that application will
> be wasted on useless string transcoding. And half of the memory will
> be occupied with useless transcoding-related code and data.

I think that's a bit of an exaggeration. :-) As more libraries move to
the assumption that std::string == UTF-8, the need (and code) for
transcoding will silently vanish. Eventually, utf8_t will just be a
statement by the programmer that the data contained within is
guaranteed to be valid UTF-8, enforced by the class -- something that
would require at minimum an extra call if using std::string, one that
could be forgotten and open up the program to exploits.
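
A minimal sketch of the structural check such a class would enforce
(my illustration; a real validator must also reject overlong forms,
surrogates, and code points above U+10FFFF):

    #include <cstddef>
    #include <string>

    // Check lead/continuation byte structure of a UTF-8 string.
    bool is_valid_utf8(const std::string& s)
    {
        for (std::size_t i = 0; i < s.size(); ) {
            unsigned char b = s[i];
            std::size_t len;
            if      (b < 0x80)           len = 1;  // ASCII
            else if ((b & 0xE0) == 0xC0) len = 2;  // 110xxxxx
            else if ((b & 0xF0) == 0xE0) len = 3;  // 1110xxxx
            else if ((b & 0xF8) == 0xF0) len = 4;  // 11110xxx
            else return false;                     // stray continuation byte
            if (i + len > s.size()) return false;  // truncated sequence
            for (std::size_t j = 1; j < len; ++j)  // each must be 10xxxxxx
                if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80)
                    return false;
            i += len;
        }
        return true;
    }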

> *Scenario C:*
>
> This is basically the status quo; a mix of the above. A sad and
> unsatisfactory state of things.

Agreed.

> *Consequences of A:*
>
> [...] - Once we overcome the troubled period of transition everything


> will be just great. No headaches related to file encoding detection
> and transcoding.

It's the getting-there part that I'm concerned about.

> Think about what will happen after we accept IPV6 and drop IPV4. The
> process will be painful but after it is done, there will be no more
> NAT, and co. and the whole network infrastructure will be simplified.

That's a problem I've been watching carefully for many years now, and I
don't see that happening. ISPs will switch to IPv6 (because they have
to), and make it possible for their customers to stay on IPv4, so their
customers *will* stay on IPv4 because it's cheaper. And if they stay
with IPv4, there won't be any impetus for consumer electronics
companies to make their equipment IPv6-compatible because consumers
won't care about it. Without consumer demand, it won't get done for
years, maybe a decade or more.

That's what I see happening with std::string and UTF-8 as well.

> *Consequences of B:*
>
> - No fixing of existing interface which IMO means no or very slow
> moving on to a single encoding.

Which, as stated above, I believe will happen anyway.

> - Creating another string class, which, let us face it, not everybody
> will accept even with the Boost influence unless it becomes standard.

That's the beauty of it -- not everybody *needs* to accept it. Just the
people who write code that isn't encoding-agnostic. Boost.FileSystem
might provide a utf16_t overload for Windows, for instance, so that it
can automatically convert strings in other UTF types. But I see no
reason it would lose the existing interface.

> - We will abandon std::string and be stuck with utf8_t which I
> *personally* already dislike :)

Any technical reason why, other than what you've already written?

> - People will probably start to use other programming languages
> (although this may be FUD)

I hate to point this out, but people are *already* using other
programming languages. :-) C++ isn't new or sexy, and has some
pain-points (though many of the most egregious ones will be solved with
C++0x). Unicode handling is one of them, and in my opinion, the utf*_t
types will only ease that.

> *Note on the encoding to be used*
>
> The best candidate for the widely-accepted and extensible encoding

> vaguely mentioned above is IMO UTF-8. [...]

Apparently a growing number of people agree, as do I.

> - It is extensible, so once we have done the painful transition we
> will not have to do it again. Currently utf-8 uses 1-4 (or 1-6) byte
> sequences to encode code points, but the scheme is transparently
> extensible to 1-N bytes (unlike UCS-X and i'm not sure about

> UTF-16/32). [...]

UTF-16 can't be extended any further than its current definition, not
without a major reinterpretation. UTF-32 (and UTF-8) could go up to
0xFFFFFFFF codepoints, but the standards bodies involved have agreed
that they'll never be extended past the current UTF-16 limitations.
Of course, that's subject to change if the circumstances change,
but nobody foresees such a change right now.
--
Chad Nelson
Oak Circle Software, Inc.


Chad Nelson

Jan 19, 2011, 9:00:17 AM
to bo...@lists.boost.org
On Wed, 19 Jan 2011 12:16:59 +0000
Alexander Lamaison <aw...@doc.ic.ac.uk> wrote:

> On Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote:
> [..]
>> All the wstrings, wxString, Qstrings, utf8strings, etc. will
>> be abandoned. All the APIs using ANSI or UCS-2 will
>> be slowly phased out with the help of convenience
>> classes like ansi_str_t and ucs2_t that will be made
>> obsolete and finally dropped (after the transition).
>
> This is simply not going to happen. How could MS even go about doing
> this in Windows? It would make every single piece of Windows software
> incompatible with the next version!

That has never stopped them before -- see Windows 2.0 -> 3.0, Windows
3.x -> Windows 95 (only partial compatibility), various versions of
WinCE/Windows
Mobile/whatever-marketingspeak-name-they're-using-this-year... ;-)

But you're right, they'll probably stick with UTF-16, despite its
problems.


Ian Emmons

Jan 19, 2011, 9:06:52 AM
to bo...@lists.boost.org
On Jan 19, 2011, at 7:16 AM, Alexander Lamaison wrote:
> On Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote:
>> The string-encoding-related discussion boils down
>>> for me to the following: what will string handling
>> in C++ look like in the (maybe not immediate) future.
>>
>> *Scenario A:*
>>
> [..]
>> All the wstrings, wxString, Qstrings, utf8strings, etc. will
>> be abandoned. All the APIs using ANSI or UCS-2 will
>> be slowly phased out with the help of convenience
>> classes like ansi_str_t and ucs2_t that will be made
>> obsolete and finally dropped (after the transition).
>
> This is simply not going to happen. How could MS even go about doing this
>> in Windows? It would make every single piece of Windows software
> incompatible with the next version!

There is a straightforward way for Microsoft to migrate Windows to this future: if they add UTF-8 support to their narrow character interface (I am avoiding calling it ANSI due to the negative connotations that has) and add narrow character APIs for all wide character APIs that lack a narrow counterpart, then I believe we could treat POSIX and Windows identically from an encoding point of view. Then Microsoft would be free to deprecate their wide character interface over an extended period of time, if they so chose.

Ian Emmons

Matus Chochlik

Jan 19, 2011, 9:08:06 AM
to bo...@lists.boost.org
On Wed, Jan 19, 2011 at 2:39 PM, Chad Nelson
<chad.thec...@gmail.com> wrote:
> On Wed, 19 Jan 2011 11:33:02 +0100
> Matus Chochlik <choc...@gmail.com> wrote:
>
>>
>> *Scenario A:*

>
> Sounds like a little slice of heaven to me. Though you'll still have
> the pesky problem of having to verify that the UTF-8 code is valid all
> the time. More on that below.

I am a believer ;) and when people realize that UTF-8 is the way to
go, the pesky problems will vanish. Believe me, the problems are worse
today with ANSI.

Today I have to check/detect the encoding of input files created
by users on different Windows machines and do the conversions.
And checking whether data is valid UTF-8 is IMO an easier task.

Most people here use Windows-1252, which is not so different
from ASCII, so even if something gets garbled it can be rescued.
I can't imagine what it is like in countries that have to deal with
Semitic languages, Chinese/Japanese/Korean ideograms, etc.

>
>> *Scenario B:*


>>
>
> How is that different from what we've got today, except that the utf*_t
> classes will make converting to and from different string types, and
> validating the UTF code, a little easier and more automatic?

Exactly, and I think that we agree that the current status is far from
ideal. The automatic conversions would (probably) be OK but
introducing yet another string class is not.

>
>> Also half of the cpu time assigned to running that application will
>> be wasted on useless string transcoding. And half of the memory will
>> be occupied with useless transcoding-related code and data.
>
> I think that's a bit of an exaggeration. :-) As more libraries move to

Yes, sorry I could not resist :)

> the assumption that std::string == UTF-8, the need (and code) for
> transcoding will silently vanish. Eventually, utf8_t will just be a
> statement by the programmer that the data contained within is
> guaranteed to be valid UTF-8, enforced by the class -- something that
> would require at minimum an extra call if using std::string, one that
> could be forgotten and open up the program to exploits.

Yes, but why not enforce it "organizationally", with the power
and influence Boost has? Again, I know that it would break a lot
of stuff, but are all those people who now use std::string really ready
to change all their code to use utf8_t instead? Which will involve
more work? I'm convinced that it will be the latter, but I can be wrong.

And many people already *do* use std::string for UTF-8 and are
doing the "right" (sorry :)) thing; by introducing utf8_t we are "punishing"
them, because we want them to change their code for the sake of
people who still dwell on ANSI. IMO we should do the opposite.

>> [...] - Once we overcome the troubled period of transition everything
>> will be just great. No headaches related to file encoding detection
>> and transcoding.
>
> It's the getting-there part that I'm concerned about.

Me too, but again, many other people have already pointed out
that a large portion of the code is completely encoding-agnostic,
so there would be no impact if we stayed with std::string. There
would be, if we added utf8_t.

>
>> Think about what will happen after we accept IPV6 and drop IPV4. The
>> process will be painful but after it is done, there will be no more
>> NAT, and co. and the whole network infrastructure will be simplified.
>
> That's a problem I've been watching carefully for many years now, and I
> don't see that happening. ISPs will switch to IPv6 (because they have
> to), and make it possible for their customers to stay on IPv4, so their
> customers *will* stay on IPv4 because it's cheaper. And if they stay
> with IPv4, there won't be any impetus for consumer electronics
> companies to make their equipment IPv6-compatible because consumers
> won't care about it. Without consumer demand, it won't get done for
> years, maybe a decade or more.
>
> That's what I see happening with std::string and UTF-8 as well.

Yes, people (me included) are resistant to big changes, even for the better.
But I've learned that I should always consider the long-term impact.

>> - Creating another string class, which, let us face it, not everybody
>> will accept even with the Boost influence unless it becomes standard.
>
> That's the beauty of it -- not everybody *needs* to accept it. Just the
> people who write code that isn't encoding-agnostic. Boost.FileSystem
> might provide a utf16_t overload for Windows, for instance, so that it
> can automatically convert strings in other UTF types. But I see no
> reason it would lose the existing interface.

So you suggest that, for example, in the STL there would be,
besides the existing fstream and wfstream, also
a third "ufstream". I think that we actually should be reducing
the interface, not expanding it (yes, I hear it ... "breaking changes!" :)).

>
>> - We will abandon std::string and be stuck with utf8_t which I
>> *personally* already dislike :)
>
> Any technical reason why, other than what you've already written?

Besides the ugly name and the fact that it is a new class? No :)

> I hate to point this out, but people are *already* using other
> programming languages. :-) C++ isn't new or sexy, and has some
> pain-points (though many of the most egregious ones will be solved with
> C++0x). Unicode handling is one of them, and in my opinion, the utf*_t
> types will only ease that.

And the solution is long overdue. Creating utf8_t just puts
the problem off; it does not really solve it.

Edward Diener

Jan 19, 2011, 9:58:13 AM
to bo...@lists.boost.org
On 1/19/2011 9:08 AM, Matus Chochlik wrote:
> On Wed, Jan 19, 2011 at 2:39 PM, Chad Nelson
> <chad.thec...@gmail.com> wrote:
>> On Wed, 19 Jan 2011 11:33:02 +0100
>> Matus Chochlik<choc...@gmail.com> wrote:
>>
>>>
>>> *Scenario A:*
>>
>> Sounds like a little slice of heaven to me. Though you'll still have
>> the pesky problem of having to verify that the UTF-8 code is valid all
>> the time. More on that below.
>
> I am a believer ;) and when people realize that UTF-8 is the way to
> go, the pesky problems will vanish. Believe me, the problems are worse
> today with ANSI.

I do not believe that UTF-8 is the way to go. In fact I know it is not,
except perhaps for the very near future for some programmers (Linux
advocates).

Inevitably a Unicode standard will be adopted where every character of
every language is represented by a single fixed-length number of
bits. Nobody will care any longer that this fixed-length set of bits
"wastes space", as so many people today are hysterically fixated on.
Whether UTF-32 can do this now I do not know, but this
world where a character in some language on earth is represented by some
arcane multi-byte encoding will end. If UTF-32 cannot do it then UTF-nn
inevitably will.

I do not think that shoving UTF-8 down everybody's throats is the best
solution even now; I think a good set of classes to convert between
encoding standards is much better.

Matus Chochlik

Jan 19, 2011, 10:13:04 AM
to bo...@lists.boost.org
>
> I do not believe that UTF-8 is the way to go. In fact I know it is not,
> except perhaps for the very near future for some programmers ( Linux
> advocates ).

:-) Just for the record, I'm not a Linux advocate any more than I'm
a Windows advocate. I use both ... I'm writing this on a Windows machine.
What I would like is for the whole encoding madness/dysfunction (including
but not limited to the dual TCHAR/whatever-char-based interfaces) to stop.
Everywhere.

>
> Inevitably a Unicode standard will be adopted where every character of every
> language is represented by a single fixed-length number of bits. Nobody
> will care any longer that this fixed-length set of bits "wastes space", as
> so many people today are hysterically fixated on. Whether UTF-32 can
> do this now I do not know, but this world where a character in some
> language on earth is represented by some arcane multi-byte encoding will
> end. If UTF-32 cannot do it then UTF-nn inevitably will.

And then the HUGE codebase written in C/C++ that currently
uses char would be reimplemented using some utfNN_char_t?
Sorry, but I don't see that happening.

Best,

Matus

Alexander Lamaison

Jan 19, 2011, 10:22:34 AM
to bo...@lists.boost.org
On Wed, 19 Jan 2011 09:00:17 -0500, Chad Nelson wrote:

> On Wed, 19 Jan 2011 12:16:59 +0000
> Alexander Lamaison <aw...@doc.ic.ac.uk> wrote:
>
>> On Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote:
>> [..]
>>> All the wstrings, wxString, Qstrings, utf8strings, etc. will
>>> be abandoned. All the APIs using ANSI or UCS-2 will
>>> be slowly phased out with the help of convenience
>>> classes like ansi_str_t and ucs2_t that will be made
>>> obsolete and finally dropped (after the transition).
>>
>> This is simply not going to happen. How could MS even go about doing
>> this in Windows? It would make every single piece of Windows software
>> incompatible with the next version!
>
> That has never stopped them before -- see Windows 2.0 -> 3.0, Windows
> 3.x -> Windows 95 (only partial compatibility), various versions of
> WinCE/Windows
> Mobile/whatever-marketingspeak-name-they're-using-this-year... ;-)

I'm not convinced you're right about this. You only have to read The Old
New Thing to see some of the remarkable (insane?) things MS do to retain
backwards compatibility. I believe only the 64-bit versions of Windows
Vista/7 ditch 16-bit program compatibility - so you should be able to crack
out those Windows 3 programs on Windows 7 x86 and watch them run! :D

Alexander Lamaison

Jan 19, 2011, 10:25:23 AM
to bo...@lists.boost.org
On Wed, 19 Jan 2011 09:06:52 -0500, Ian Emmons wrote:

> On Jan 19, 2011, at 7:16 AM, Alexander Lamaison wrote:
>> On Wed, 19 Jan 2011 11:33:02 +0100, Matus Chochlik wrote:
>>> The string-encoding-related discussion boils down
>>> for me to the following: what will string handling
>>> in C++ look like in the (maybe not immediate) future.
>>>
>>> *Scenario A:*
>>>
>> [..]
>>> All the wstrings, wxString, Qstrings, utf8strings, etc. will
>>> be abandoned. All the APIs using ANSI or UCS-2 will
>>> be slowly phased out with the help of convenience
>>> classes like ansi_str_t and ucs2_t that will be made
>>> obsolete and finally dropped (after the transition).
>>
>> This is simply not going to happen. How could MS even go about doing this
>> in Windows? It would make every single piece of Windows software
>> incompatible with the next version!
>
> There is a straightforward way for Microsoft to migrate Windows to this
> future: If they add UTF-8 support to their narrow character interface
> (I am avoiding calling it ANSI due to the negative connotations that
> has) and add narrow character APIs for all wide character APIs that lack
> a narrow counterpart, then I believe we could treat POSIX and Windows
> identically from an encoding point of view.

It would break any programs currently using the narrow API with any
'exotic' codepage (i.e. pretty much anything except 7-bit ASCII). That
said, perhaps it's worth it.

Alex

--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)


Alexander Lamaison

Jan 19, 2011, 10:34:02 AM
to bo...@lists.boost.org
On Wed, 19 Jan 2011 16:13:04 +0100, Matus Chochlik wrote:

>>
>> I do not believe that UTF-8 is the way to go. In fact I know it is not,
>> except perhaps for the very near future for some programmers ( Linux
>> advocates ).
>
> :-) Just for the record, I'm not a Linux advocate any more than I'm
> a Windows advocate. I use both .. I'm writing this on a windows machine.
> What I would like is the whole encoding madness/dysfunction (including
> but not limited to the dual TCHAR/whateverchar-based interfaces) to stop.
> Everywhere.

Even if I bought the UTF-8ed-Boost idea, what would we do about the STL
implementation on Windows which expects local-codepage narrow strings? Are
we hoping MS etc. change these to match? Because otherwise we'll be
converting between narrow encodings for the rest of eternity.

Alex

--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)


Matus Chochlik

Jan 19, 2011, 10:44:29 AM
to bo...@lists.boost.org
On Wed, Jan 19, 2011 at 4:34 PM, Alexander Lamaison <aw...@doc.ic.ac.uk> wrote:
> On Wed, 19 Jan 2011 16:13:04 +0100, Matus Chochlik wrote:
>
> Even if I bought the UTF-8ed-Boost idea, what would we do about the STL
> implementation on Windows which expects local-codepage narrow strings?  Are
> we hoping MS etc. change these to match?  Because otherwise we'll be
> converting between narrow encodings for the rest of eternity.

Actually, this is the biggest problem I see with the whole transition,
and it also concerns other systems. But AFAIK POSIX OSes
are moving to UTF-8, so Windows is the only one where this
is a real issue.

But is it possible that Windows does the same thing that
POSIX did? Some time ago on Unix the sk_SK locale came
with the ISO-8859-2 encoding. Since then the default has become
sk_SK.UTF-8 with UTF-8 encoding. Is there any major
obstacle that would prevent Microsoft from doing this?

>
> Alex

Dave Abrahams

Jan 19, 2011, 11:26:50 AM
to bo...@lists.boost.org
At Wed, 19 Jan 2011 11:33:02 +0100,

Matus Chochlik wrote:

*Scenario D:* We try for scenario A. and people still use Qstrings, wxStrings, etc.

*Scenario E:* We add another string class and everyone adopts it

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com

Alexander Lamaison

Jan 19, 2011, 11:29:12 AM
to bo...@lists.boost.org
On Wed, 19 Jan 2011 16:44:29 +0100, Matus Chochlik wrote:
>
> > On Wed, Jan 19, 2011 at 4:34 PM, Alexander Lamaison <aw...@doc.ic.ac.uk> wrote:
> > On Wed, 19 Jan 2011 16:13:04 +0100, Matus Chochlik wrote:
> >
> > Even if I bought the UTF-8ed-Boost idea, what would we do about the STL
> > implementation on Windows which expects local-codepage narrow strings?  Are
> > we hoping MS etc. change these to match?  Because otherwise we'll be
> > converting between narrow encodings for the rest of eternity.
>
> Actually, this is the biggest problem I see with the whole transition,
> and it also concerns other systems. But AFAIK POSIX OSes
> are moving to UTF-8, so Windows is the only one where this
> is a real issue.

'Only'? :P While it may be one OS, it probably has more code written for
it than all the rest combined.

> But is it possible that Windows does the same thing that
> POSIX did? Some time ago on Unix the sk_SK locale came
> with the ISO-8859-2 encoding. Since then the default has become
> sk_SK.UTF-8 with UTF-8 encoding. Is there any major
> obstacle that would prevent Microsoft from doing this?

I know some Microsoft guys hang out here and may be listening to this
(Stephan L, are you about?). Do you guys have any input on this UTF-8
issue?

Alex

--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)


Matus Chochlik

Jan 19, 2011, 11:34:27 AM
to bo...@lists.boost.org
On Wed, Jan 19, 2011 at 5:26 PM, Dave Abrahams <da...@boostpro.com> wrote:
> At Wed, 19 Jan 2011 11:33:02 +0100,
> Matus Chochlik wrote:
>
> *Scenario D:* We try for scenario A. and people still use Qstrings, wxStrings, etc.

'I think maybe you underestimate our influence.' :)

>
> *Scenario E:* We add another string class and everyone adopts it

OK, I admit that this is possible. But let me ask: how did the C world
make the transition without abandoning char?

BR,

Matus

Peter Dimov

Jan 19, 2011, 11:30:43 AM
to bo...@lists.boost.org
Alexander Lamaison wrote:
> > There is a straightforward way for Microsoft to migrate Windows to this
> > future: If they add UTF-8 support to their narrow character interface
> > (I am avoiding calling it ANSI due to the negative connotations that
> > has) and add narrow character APIs for all wide character APIs that lack
> > a narrow counterpart, then I believe we could treat POSIX and Windows
> > identically from an encoding point of view.
>
> It would break any programs using the narrow API currently that use any
> 'exotic' codepage (i.e. pretty much anything except 7-bit ascii).

It will only break programs that depend on a specific code page. Programs
that use the narrow API but do not require a specific code page (or a single
byte code page - the exact opposite of exotic) will work fine - they'll
simply see an ANSI code page of 65001. It will still cause a fair amount of
breakage, of course, but in principle, the transition path is obvious and
straightforward.

Peter Dimov

Jan 19, 2011, 11:33:39 AM
to bo...@lists.boost.org
Edward Diener wrote:
> Inevitably a Unicode standard will be adopted where every character of
> every language will be represented by a single fixed length number of
> bits.

This was the prevailing thinking once. First this number of bits was 16,
an incorrect assumption that claimed Microsoft and Java as victims; then it
became 21 (or 22?). Eventually, people realized that this will never happen
even if we allocate 32 bits per character, so here we are.

Mathias Gaunard

Jan 19, 2011, 11:51:30 AM
to bo...@lists.boost.org

*Scenario D:*

Use ranges; don't care whether it's std::string, whatever_string, etc.
This also allows maximum efficiency, with lazy concatenation,
transformations, conversions, filtering, etc.

My Unicode library works with arbitrary ranges, and you can adapt a
range in one encoding into a range in another encoding.
This can be used to lazily perform encoding conversion as the range is
iterated; such conversions may even be pipelined.

Alexander Lamaison

Jan 19, 2011, 11:56:57 AM
to bo...@lists.boost.org
On Wed, 19 Jan 2011 17:34:27 +0100, Matus Chochlik wrote:

> On Wed, Jan 19, 2011 at 5:26 PM, Dave Abrahams <da...@boostpro.com> wrote:
>> At Wed, 19 Jan 2011 11:33:02 +0100,
>> Matus Chochlik wrote:
>>
>> *Scenario D:* We try for scenario A. and people still use Qstrings, wxStrings, etc.
>
> 'I think maybe you underestimate our influence.' :)
>
>>
>> *Scenario E:* We add another string class and everyone adopts it
>
> Ok I admit that this is possible. But let me ask: How did the C world
> make the transition without abandoning char?

They made the transition? I must have missed this.

The Windows API _is_ C and has all the problems we've been talking about.

Alex

--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)


Matus Chochlik

Jan 19, 2011, 12:10:06 PM
to bo...@lists.boost.org
>
> They made the transition?  I must have missed this.
>
> The Windows API _is_ C and has all the problems we've been talking about.
>
OK, besides the Microsoft C world :)

Peter Dimov

Jan 19, 2011, 12:09:48 PM
to bo...@lists.boost.org
Dave Abrahams wrote:

> *Scenario D:* We try for scenario A. and people still use Qstrings,
> wxStrings, etc.
>
> *Scenario E:* We add another string class and everyone adopts it

The problem with using a Unicode string, be it QString or utf8_string, to
represent paths is that it forces you to pick an encoding under POSIX. When
the OS gives you a file name as char*, to store it in your Unicode string,
you have to interpret it. Then, to give it back to the OS, you have to
de-interpret it. This forces you to choose between two evils: you can opt to
use a single byte encoding such as ISO-8859-1, which gives you perfect
round-trip, but leads to the problem that people can enter a Cyrillic file
name in your Unicode-enabled GUI and see something odd happen on disk, even
when their shell is configured as UTF-8 and can show Cyrillic names. Or, you
can choose to use UTF-8, in which case the OS can give you a name which you
can't decode properly, because it's invalid UTF-8.

There is no single good answer to this, of course; even if you go with my
recommended approach as treating paths as byte sequences unless and until
you need to display them (in which case you treat them as UTF-8), there'll
still be paths that won't show up properly on the screen. But the program
will be able to work with them, even if they are undisplayable.

To give a simple example:

int my_main( int ac, char const* av[] )
{
    my_fopen( av[1] );
}

Since files can have arbitrary byte sequences as names under POSIX (Mac OS X
excluded), if my_fopen insists on taking valid UTF-8, it will refuse to open
the file.
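
A minimal sketch of the "bytes until display" policy described above
(illustrative only; is_valid_utf8() is the structural checker sketched
earlier in the thread, and the byte-by-byte fallback is deliberately
naive):

    #include <cstddef>
    #include <string>

    // Keep the path as raw bytes; only when rendering do we need
    // UTF-8, substituting U+FFFD for anything that doesn't decode.
    std::string display_name(const std::string& raw_path)
    {
        if (is_valid_utf8(raw_path))
            return raw_path;                      // displayable as-is
        std::string out;
        for (std::size_t i = 0; i < raw_path.size(); ++i) {
            unsigned char b = raw_path[i];
            if (b < 0x80) out += static_cast<char>(b);
            else          out += "\xEF\xBF\xBD";  // U+FFFD replacement char
        }
        return out;
    }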

Edward Diener

Jan 19, 2011, 12:18:48 PM
to bo...@lists.boost.org
On 1/19/2011 11:33 AM, Peter Dimov wrote:
> Edward Diener wrote:
>> Inevitably a Unicode standard will be adopted where every character of
>> every language will be represented by a single fixed length number of
>> bits.
>
> This was the prevailing thinking once. First this number of bits was 16,
> an incorrect assumption that claimed Microsoft and Java as victims; then
> it became 21 (or 22?). Eventually, people realized that this will never
> happen even if we allocate 32 bits per character, so here we are.

"Eventually, people realized..." . This is just rhetoric, where "people"
is just whatever your own opinion is.

I do not understand the technical reason for it never happening. Are
human "alphabets" proliferating so fast that we cannot fit the notion
of a character in any alphabet into a fixed-size character? In that
case neither are we ever going to have multi-byte characters
representing all of the possible characters in any language. But it is
absurd to believe that. "Eventually people realized that making a
fixed-size character representing every character in every language
was doable, and they just did it." That sounds fairly logical to me,
aside from the practicality of getting diverse people from different
nationalities/character-sets to agree on things.

Of course you can argue that having a variable number of bytes
representing each possible character in any language is better than
having a single fixed-size character, and I am willing to listen to that
technical argument. But from a programming point of view, aside from the
"waste of space" issue, it does seem to me that having a fixed-size
character has the obvious advantage of being able to access a character
via some offset in the character array, and that all the algorithms for
finding/inserting/deleting/changing characters become much easier and
quicker with a fixed-size character, as well as displaying and inputting.

Alexander Lamaison

Jan 19, 2011, 12:21:25 PM
to bo...@lists.boost.org
On Wed, 19 Jan 2011 18:10:06 +0100, Matus Chochlik wrote:
>
> >>
> > They made the transition?  I must have missed this.
> >
> > The Windows API _is_ C and has all the problems we've been talking about.
> >
> OK, besides the Microsoft C world :)

By changing the OS-default encoding to assume char* string was UTF-8. Same
as for C++. This whole issue is about how to accommodate OSes that don't
make that assumption.

Alex


--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)


Matus Chochlik

Jan 19, 2011, 12:41:22 PM
to bo...@lists.boost.org
On Wed, Jan 19, 2011 at 6:21 PM, Alexander Lamaison <aw...@doc.ic.ac.uk> wrote:
>> On Wed, 19 Jan 2011 18:10:06 +0100, Matus Chochlik wrote:
>>
>> >>
>> > They made the transition?  I must have missed this.
>> >
>> > The Windows API _is_ C and has all the problems we've been talking about.
>> >
>> OK, besides Microsoft C world :)
>
> By changing the OS-default encoding to assume char* string was UTF-8.  Same
> as for C++.  This whole issue is about how to accommodate OSes that don't
> make that assumption.

Agreed, again if Microsoft could move by default to UTF-8
for the various locales instead of using the current encodings
then this whole discussion would be moot.

For the time being we would need to do something like this
even if a complete transcoding is not possible:

std::string filepath(get_path_in_utf8());
std::fstream file(utf8_to_locale_encoding(filepath));

everywhere the implementation (STL, etc.) expects native
encoding. This is the ugliest part of the whole transition.
Boost could hide this completely by using the wide-char
interfaces and doing CreateFileW(utf8_to_winapi_wide(filepath), ...).

It also could be an opportunity for alternate
implementations of STL which would handle it transparently.
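
The utf8_to_winapi_wide() helper named above is hypothetical; a sketch
of how it could be written on top of the real Win32 MultiByteToWideChar
call:

    #include <windows.h>
    #include <stdexcept>
    #include <string>

    // Convert UTF-8 to the UTF-16 wchar_t strings that WINAPI expects.
    std::wstring utf8_to_winapi_wide(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();
        // First call: compute the required length in wide characters.
        int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      utf8.data(),
                                      static_cast<int>(utf8.size()), 0, 0);
        if (len == 0) throw std::runtime_error("invalid UTF-8");
        std::wstring wide(len, L'\0');
        // Second call: perform the actual conversion.
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8.data(),
                            static_cast<int>(utf8.size()), &wide[0], len);
        return wide;
    }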

Matus

Peter Dimov

Jan 19, 2011, 12:42:25 PM
to bo...@lists.boost.org
Alexander Lamaison wrote:

> By changing the OS-default encoding to assume char* string was UTF-8.

You keep talking about "OS-default encoding", but there's no such thing.
POSIX operating systems do not assume anything about the encoding of char*
(*). You have a global locale (**) in C/C++ programs, and the user can
control it via environment variables (unless the program changes it), but
the OS itself does not.

(*) Except Mac OS X, which requires UTF-8 for paths.
(**) Actually, you have two global locales - C and C++, not necessarily in
sync with each other.

Artyom

Jan 19, 2011, 12:50:50 PM
to bo...@lists.boost.org
> From: Alexander Lamaison <aw...@doc.ic.ac.uk>

>
> On Wed, 19 Jan 2011 16:13:04 +0100, Matus Chochlik wrote:
>
> >>
> >> I do not believe that UTF-8 is the way to go. In fact I know it is not,
> >> except perhaps for the very near future for some programmers ( Linux
> >> advocates ).
> >
> > :-) Just for the record, I'm not a Linux advocate any more than I'm
> > a Windows advocate. I use both .. I'm writing this on a windows machine.
> > What I would like is the whole encoding madness/dysfunction (including
> > but not limited to the dual TCHAR/whateverchar-based interfaces) to stop.
> > Everywhere.
>
> Even if I bought the UTF-8ed-Boost idea, what would we do about the STL
> implementation on Windows which expects local-codepage narrow strings? Are
> we hoping MS etc. change these to match? Because otherwise we'll be
> converting between narrow encodings for the rest of eternity.
>
> Alex


First of all, today there **is** a problem and STL code can't open
a file; try to open "שלום-سلام-pease.txt" under Windows using
GCC's std::fstream... You can't.

I assume it happens with some other compilers as well.

There **is** a problem; ignoring it would not help us.

How can we address the STL problem and UTF-8? Simply.

Provide:

boost::basic_fstream
boost::fopen
boost::freopen
boost::remove
boost::rename

These use the same std::* classes under POSIX platforms
and UTF-8-aware implementations for Windows.

Take a look at this code:

http://art-blog.no-ip.info/files/nowide.zip

This is the code I use for my projects; it implements
what I'm talking about - simple, easy to use, straightforward.

Also don't forget two things:

1. Microsoft deprecated the ANSI API and does not recommend
using it.

If the only OS that gives us most of the encoding headaches
has deprecated its ANSI API, I don't think that Boost should
continue supporting it.

2. All the world has already moved to Unicode; Microsoft
did this as well.

They did it in their incompatible-with-the-rest-of-the-world
way... But still, they did it too - so we can either continue
ignoring the fact that UTF-8 is the ultimate encoding,
or go forward.

Artyom

Peter Dimov

Jan 19, 2011, 12:52:05 PM
to bo...@lists.boost.org
Edward Diener wrote:
> On 1/19/2011 11:33 AM, Peter Dimov wrote:
> > Edward Diener wrote:
> >> Inevitably a Unicode standard will be adopted where every character of
> >> every language will be represented by a single fixed length number of
> >> bits.
> >
> > This was the prevailing thinking once. First this number of bits was 16,
> > an incorrect assumption that claimed Microsoft and Java as victims; then
> > it became 21 (or 22?). Eventually, people realized that this will never
> > happen even if we allocate 32 bits per character, so here we are.
>
> "Eventually, people realized..." . This is just rhetoric, where "people"
> is just whatever your own opinion is.
>
> I do not understand the technical reason for it never happening.

I'm not sure that I do, either. Nevertheless, people at the Unicode
consortium have been working on that for... 20 years now? What technical
obstacle that currently blocks their progress do you foresee disappearing in
the future? Occam says that variable width characters are simply a better
match for the problem domain, even when character width in bits is not a
problem.

Peter Dimov

Jan 19, 2011, 1:01:15 PM
to bo...@lists.boost.org
Artyom wrote:

> How can we address STL problem and UTF-8? Simply?
>
> Provide:
>
> boost::basic_fstream
> boost::fopen
> boost::freopen
> boost::remove
> boost::rename
>
> Which are using same std::* classes under Posix platform
> and UTF-8 aware implementations for Windows.

This is basically what I do as well: wrappers that on Windows translate
UTF-8 into UTF-16 and call the corresponding _w* function.
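
A minimal sketch of such a wrapper (illustrative names, not the actual
Boost or nowide API):

    #include <cstdio>

    #ifdef _WIN32
    #include <windows.h>
    #endif

    // Open a file given a UTF-8 name and mode: on Windows, convert
    // both arguments to UTF-16 and call _wfopen; elsewhere, pass the
    // bytes straight through to fopen.
    std::FILE* utf8_fopen(const char* name, const char* mode)
    {
    #ifdef _WIN32
        wchar_t wname[MAX_PATH];
        wchar_t wmode[16];
        if (!MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                 name, -1, wname, MAX_PATH) ||
            !MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                 mode, -1, wmode, 16))
            return 0;                  // invalid UTF-8 or name too long
        return _wfopen(wname, wmode);
    #else
        return std::fopen(name, mode); // POSIX: the name is already bytes
    #endif
    }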

Robert Ramey

Jan 19, 2011, 1:27:17 PM
to bo...@lists.boost.org
Peter Dimov wrote:
> Edward Diener wrote:
>> Inevitably a Unicode standard will be adopted where every character
>> of every language will be represented by a single fixed length
>> number of bits.
>
> This was the prevailing thinking once. First this number of bits was
> 16, an incorrect assumption that claimed Microsoft and Java as victims;
> then it became 21 (or 22?). Eventually, people realized that this
> will never happen even if we allocate 32 bits per character, so here
> we are.

Well put!!!

This is the problem of trying to impose a view upon the future. No
one really sees far enough ahead to do that. The best we can do
is to allow proposals to be implemented so they can then be
sorted out by "software evolution". It's the age-old argument
of central planning - intelligent design - etc. vs. market capitalism,
evolution, etc. Admittedly, it goes against the grain of control
freaks like us, but we have to live with it. And things do get
better. Imagine if somehow 32-bit Unicode had been imposed
upon us!

This is the essence of my argument that the only way forward
is to propose and implement a better way forward and then
try to sell it.

Robert Ramey

Dave Abrahams

Jan 19, 2011, 1:35:51 PM
to bo...@lists.boost.org
At Wed, 19 Jan 2011 09:58:13 -0500,

Edward Diener wrote:
>
> I do not think that shoving UTF-8 down everybody's throats is the best
> solution even now, I think a good set of classes to convert between
> encoding standards is much better.

Can we please tone down the rhetoric here?

I could say, "I do not think that shoving a set of classes to convert
between encoding standards down everybody's throats is the best
solution..." but I don't think it would help anyone understand the
issues better.

Is there any harm in exploring the alternatives here in a calm and
rational way? If we do that, and the approaches you oppose are truly
inferior, that fact will become clear to everyone, I think.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com


Dave Abrahams

Jan 19, 2011, 1:39:44 PM
to bo...@lists.boost.org
At Wed, 19 Jan 2011 17:34:27 +0100,

Matus Chochlik wrote:
>
> On Wed, Jan 19, 2011 at 5:26 PM, Dave Abrahams <da...@boostpro.com> wrote:
> > At Wed, 19 Jan 2011 11:33:02 +0100,
> > Matus Chochlik wrote:
> >
> > *Scenario D:* We try for scenario A. and people still use Qstrings, wxStrings, etc.
>
> 'I think maybe you underestimate our influence.' :)

Our influence, if we introduce new library components, is very great,
because they're on a de-facto fast track to standardization, and an
improved string library is exactly the sort of thing that would be
adopted upstream. If we simply agree to a programming convention,
that will have some impact, but much less.

> > *Scenario E:* We add another string class and everyone adopts it
>
> Ok I admit that this is possible. But let me ask: How did the C world
> make the transition without abandoning char?

The transition from what to what?

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com


Dave Abrahams

Jan 19, 2011, 1:43:55 PM
to bo...@lists.boost.org
At Wed, 19 Jan 2011 19:09:48 +0200,

Peter Dimov wrote:
>
> Dave Abrahams wrote:
>
> > *Scenario D:* We try for scenario A. and people still use Qstrings,
> > wxStrings, etc.
> >
> > *Scenario E:* We add another string class and everyone adopts it
>
> The problem with using a Unicode string, be it QString or
> utf8_string, to represent paths is that it forces you to pick an
> encoding under POSIX. When the OS gives you a file name as char*, to
> store it in your Unicode string, you have to interpret it. Then, to
> give it back to the OS, you have to de-interpret it.

Nonono; if you don't want to choose an encoding, you store it as a
raw_string (a.k.a. std::string, for example)!

The whole point is to separate by type the things we know how to
interpret from the things we don't.

Please tell me if I'm missing something that's still important below
after my explanation above. I only skimmed because it mostly seemed
to be based on a misinterpretation of my proposal.

> This forces you to choose between two evils: you can opt to use a
> single byte encoding such as ISO-8859-1, which gives you perfect
> round-trip, but leads to the problem that people can enter a
> Cyrillic file name in your Unicode-enabled GUI and see something odd
> happen on disk, even when their shell is configured as UTF-8 and can
> show Cyrillic names. Or, you can choose to use UTF-8, in which case
> the OS can give you a name which you can't decode properly, because
> it's invalid UTF-8.
>
> There is no single good answer to this, of course; even if you go with
> my recommended approach as treating paths as byte sequences unless and
> until you need to display them (in which case you treat them as
> UTF-8), there'll still be paths that won't show up properly on the
> screen. But the program will be able to work with them, even if they
> are undisplayable.
>
> To give a simple example:
>
> int my_main( int ac, char const* av[] )
> {
> my_fopen( av[1] );
> }
>
> Since files can have arbitrary byte sequences as names under POSIX
> (Mac OS X excluded), if my_fopen insists on taking valid UTF-8, it
> will refuse to open the file.

--

Dave Abrahams
BoostPro Computing
http://www.boostpro.com


Alexander Lamaison

Jan 19, 2011, 1:54:31 PM
to bo...@lists.boost.org

Hmmmm ... I'm starting to come round to your std::string == UTF-8
point of view.

The one thing that would still annoy me is that std::string's interface was
clearly designed for single-byte == single-character/codepoint/whatever
operation. I don't suppose anyone will be adding
.begin_character()/.end_character() methods to std::string any time soon.
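
That stepping can at least be bolted on externally; a minimal sketch
(mine, stepping by code point, which is still not the same as a
user-perceived character, and assuming the string is valid UTF-8):

    #include <cstddef>
    #include <string>

    // Advance an index over a UTF-8 std::string by one whole code
    // point rather than one byte, using the lead byte's bit pattern.
    std::size_t next_code_point(const std::string& s, std::size_t i)
    {
        unsigned char b = s[i];
        if (b < 0x80)                return i + 1;  // ASCII
        else if ((b & 0xE0) == 0xC0) return i + 2;  // 110xxxxx
        else if ((b & 0xF0) == 0xE0) return i + 3;  // 1110xxxx
        else                         return i + 4;  // 11110xxx
    }

    // Usage: count code points (not bytes) in a valid UTF-8 string.
    std::size_t code_point_count(const std::string& s)
    {
        std::size_t n = 0;
        for (std::size_t i = 0; i < s.size(); i = next_code_point(s, i))
            ++n;
        return n;
    }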

Alex

--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)


Matus Chochlik

Jan 19, 2011, 2:03:59 PM
to bo...@lists.boost.org
On Wed, Jan 19, 2011 at 7:39 PM, Dave Abrahams <da...@boostpro.com> wrote:
>
> Our influence, if we introduce new library components, is very great,
> because they're on a de-facto fast track to standardization, and an
> improved string library is exactly the sort of thing that would be
> adopted upstream.  If we simply agree to a programming convention,
> that will have some impact, but much less.

OK, I see. But is there any chance that the standard itself would be updated
so that it would first recommend using UTF-8 with C++ strings?
After some period of time all other encodings would be deprecated
and using them would cause undefined behavior. Could Boost
be the driving force here?

I really see all the obstacles that prevent us from just switching
to UTF-8, but adding a new string class will not help, for the same
reasons adding wstring did not help.
As I already said elsewhere, I think that this is a problem that has
to be solved "organizationally".

>
>> > *Scenario E:* We add another string class and everyone adopts it
>>
>> Ok I admit that this is possible. But let me ask: How did the C world
>> make the transition without abandoning char?
>
> The transition from what to what?

I meant that, for example, on POSIX OSes the POSIX C API
did not have to be changed or extended by a new set of functions
doing the same things but using a new character type when they
switched from the old encodings to UTF-8.

To compare two strings you can still use strcmp and not utf8strcmp;
to collate strings you use strcoll and not utf8strcoll, etc.

I must admit that the previous statement is an oversimplification and that
things also rely on the C/C++ locale, etc.



--
________________
::matus_chochlik

Matus Chochlik

Jan 19, 2011, 2:15:30 PM
to bo...@lists.boost.org
On Wed, Jan 19, 2011 at 7:54 PM, Alexander Lamaison <aw...@doc.ic.ac.uk> wrote:
>>
>> Agreed, again if Microsoft could move by default to UTF-8
>> for the various locales instead of using the current encodings
>> then this whole discussion would be moot.
>>
>> For the time being we would need to do something like this
>> even if a complete transcoding is not possible:
>>
>> std::string filepath(get_path_in_utf8());
>> std::fstream file(utf8_to_locale_encoding(filepath));
>>
>> everywhere the implementation (STL, etc.)  expects native
>> encoding. This is the ugliest part of the whole transition.
>> Boost could hide this completely by using the wide-char
>> interfaces and doing CreateFileW(utf8_to_winapi_wide(filepath), ...).
>>
>> It also could be an opportunity for alternate
>> implementations of STL which would handle it transparently.
>
> Hmmmm ... I'm starting to come round to your std::string == UTF-8 point-of
> view.
>
> The one thing that would still annoy me is that std::string's interface was
> clearly designed for single-byte == single-character/codepoint/whatever
> operation.  I don't suppose anyone will be adding
> .begin_character()/.end_character() methods to std::string any time soon.

This is where the (Boost.)Locale and (Boost.)Unicode libraries could provide
insight into how to extend the std::string interface, or be the testbed for
new additions to the standard library related to string manipulation.
(Provided the standard adopts UTF-8 as a native encoding. Or does it already?)

Matus

Yakov Galka

Jan 19, 2011, 2:20:44 PM
to bo...@lists.boost.org
On Wed, Jan 19, 2011 at 18:33, Peter Dimov <pdi...@pdimov.com> wrote:

> This was the prevailing thinking once. First this number of bits was 16,
> an incorrect assumption that claimed Microsoft and Java as victims; then it
> became 21 (or 22?). Eventually, people realized that this will never happen
> even if we allocate 32 bits per character, so here we are.


This is one more advantage of UTF-8 over UTF-16 and UTF-32: UTF-8 bit
patterns can be extended indefinitely, even for 256-bit code points. :-)

Chad Nelson

Jan 19, 2011, 2:25:52 PM
to bo...@lists.boost.org
On Wed, 19 Jan 2011 09:06:52 -0500
Ian Emmons <iem...@bbn.com> wrote:

>> This is simply not going to happen. How could MS even go about
>> doing this in Windows? It would make every single piece of Windows
>> software incompatible with the next version!


>
> There is a straightforward way for Microsoft to migrate Windows to
> this future: If they add UTF-8 support to their narrow character

> interface [...] then I believe we could treat POSIX and Windows
> identically from an encoding point of view. Then Microsoft would be
> free to deprecate their wide character interface over an extended period
> of time, if they so chose.

And if the developers at Microsoft controlled the company, this would
probably already be underway, if not completed. :-) But Microsoft is
controlled by management, which answers to investors, who want them to
squeeze as much money out of their customers as they can.
Interoperability that makes programmers able to take a Windows program
and easily port it to some other OS is a very *bad* thing, from their
point of view -- anything that locks people into using Windows is far
preferable.

Market forces might coerce them into allowing it someday, as they
have coerced them into making Internet Explorer more standards-compliant,
but they'll fight it tooth and nail.
--
Chad Nelson
Oak Circle Software, Inc.


Dave Abrahams

Jan 19, 2011, 2:28:09 PM
to bo...@lists.boost.org
At Wed, 19 Jan 2011 20:03:59 +0100,

Matus Chochlik wrote:
>
> On Wed, Jan 19, 2011 at 7:39 PM, Dave Abrahams <da...@boostpro.com> wrote:
> >
> > Our influence, if we introduce new library components, is very great,
> > because they're on a de-facto fast track to standardization, and an
> > improved string library is exactly the sort of thing that would be
> > adopted upstream.  If we simply agree to a programming convention,
> > that will have some impact, but much less.
>
> OK, I see. But, is there any chance that the standard itself would
> be updated so that it first would recommend to use UTF-8 with C++
> strings.

Well, never say "never," but... never. Such recommendations are not
part of the standard's mission. It doesn't do things like that.

> After some period of time all other encodings would be deprecated

By whom?

> and using them would cause undefined behavior. Could Boost be the
> driving force here?

This doesn't seem like a very plausible scenario to me, based on my
experience. Of course, others may disagree.

> I really see all the obstacles that prevent us from just switching
> to UTF-8, but adding a new string class will not help for the same
> reasons adding wstring did not help.

I don't see the parallel at all. wstring is just another container of
bytes, for all practical purposes. It doesn't imply any particular
encoding, and does nothing to segregate the encoded from the raw.

> As I already said elsewhere I think that this is a problem that has
> to be solved "organizationally".

Perhaps. The type system is one of our organizational tools, and
Boost has an impact insofar as it produces components that people use,
so if we aren't able to produce some flagship library components that
help with the solution, we have little traction.

> >> > *Scenario E:* We add another string class and everyone adopts it
> >>
> >> Ok I admit that this is possible. But let me ask: How did the C world
> >> make the transition without abandoning char?
> >
> > The transition from what to what?
>
> I meant that for example on POSIX OSes the POSIX C-API
> did not have to be changed or extended by a new set of functions
> doing the same things, but using a new character type, when they
> switched from the old encodings to UTF-8.

...and people still have the problem that they lose track of what's
"raw" and what's encoded as utf-8.

> To compare two strings you still can use strcmp and not utf8strcmp,
> to collate strings you use strcoll and not utf8strcoll, etc.

Yeah... but surely POSIX's strcmp only tells you whether the two
strings have the same sequence of code points, not whether they have
the same characters, right? And if you inadvertently compare a "raw"
string with an equivalent utf-8-encoded string, what happens?
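
Exactly what happens can be seen with canonically equivalent strings
(standard Unicode data, shown only to make the point concrete): they
render identically but compare unequal byte by byte:

#include <cstdio>
#include <cstring>

int main()
{
    // U+00E9 LATIN SMALL LETTER E WITH ACUTE, precomposed (NFC form)
    const char nfc[] = "\xC3\xA9";
    // U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT (NFD form)
    const char nfd[] = "e\xCC\x81";

    // Both display as an accented e, but strcmp sees different bytes;
    // only strcoll in a suitable locale (or a Unicode-aware collator)
    // could treat them as equal.
    std::printf("%d\n", std::strcmp(nfc, nfd)); // nonzero
    return 0;
}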

Chad Nelson

unread,
Jan 19, 2011, 2:30:07 PM1/19/11
to bo...@lists.boost.org
On Wed, 19 Jan 2011 15:22:34 +0000
Alexander Lamaison <aw...@doc.ic.ac.uk> wrote:

>>> This is simply not going to happen. How could MS even go about
>>> doing this in Windows? It would make every single piece of Windows
>>> software incompatible with the next version!
>>

>> That has never stopped them before -- see Windows 2.0 -> 3.0, Windows
>> 3.x -> Windows 95 (only partial compatibility), various versions of
>> WinCE/Windows
>> Mobile/whatever-marketingspeak-name-they're-using-this-year... ;-)
>
> I'm not convinced you're right about this. You only have to read The
> Old New Thing to see some of the remarkable (insane?) things MS do to
> retain backwards compatibility. I believe only the 64-bit versions of
> Windows Vista/7 ditch 16-bit program compatibility - so you should be
> able to crack out those Windows 3 programs on Windows 7 x86 and watch
> them run! :D

Yes, that answer was meant tongue-in-cheek. Microsoft got the
backward-compatibility religion (for the desktop, at least) around the
time they introduced Windows 95, because the only way to convince
people to buy it was to show them that their old programs would
continue to run. A few years ago they seemed to start drifting away
from that again, but they seem to have rediscovered the need for it.


Alexander Lamaison

unread,
Jan 19, 2011, 2:42:44 PM1/19/11
to bo...@lists.boost.org
On Wed, 19 Jan 2011 19:42:25 +0200, Peter Dimov wrote:

> Alexander Lamaison wrote:
>
>> By changing the OS-default encoding to assume char* string was UTF-8.
>
> You keep talking about "OS-default encoding", but there's no such thing.
> POSIX operating systems do not assume anything about the encoding of char*

I was under the impression that Linux changed from interpreting char* as
being in a multitude of different encodings to being in UTF-8 by default.

Alex

--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)


Chad Nelson

unread,
Jan 19, 2011, 2:50:25 PM1/19/11
to bo...@lists.boost.org
On Wed, 19 Jan 2011 15:08:06 +0100
Matus Chochlik <choc...@gmail.com> wrote:

>> How is that different from what we've got today, except that the
>> utf*_t classes will make converting to and from different string
>> types, and validating the UTF code, a little easier and more
>> automatic?
>
> Exactly, and I think that we agree that the current status is far from
> ideal. The automatic conversions would (probably) be OK but
> introducing yet another string class is not.

Do you see another way to provide those conversions, and automatic
verification of proper UTF coding? (Automatic verification is a very
good thing; without it, someone won't use it or will forget to, and will
open their programs up to exploitation.)

>> the assumption that std::string == UTF-8, the need (and code) for
>> transcoding will silently vanish. Eventually, utf8_t will just be a
>> statement by the programmer that the data contained within is
>> guaranteed to be valid UTF-8, enforced by the class -- something that
>> would require at minimum an extra call if using std::string, one that
>> could be forgotten and open up the program to exploits.
>
> Yes, but why not enforce it "organizationally" with the power
> and influence Boost has? Again, I know that it would break a lot
> of stuff, but really, are all those people who now use std::string
> ready to change all their code to use utf8_t instead? Which will
> involve more work? I'm convinced that it will be the latter, but I
> may be wrong.

If Boost comes out with a version that breaks existing programs,
companies just won't upgrade to it. I can keep one of the companies
mine works with upgrading, because the group that I work with is
the only one there using C++ and they listen to me, but most companies
have a lot more invested in the existing system. Believe me, any
breaking changes have to be eased in over many versions -- the "boiling
a frog" approach. :-)

> And many people already *do* use std::string for UTF-8 and are doing
> the "right" (sorry :)) thing; by introducing utf8_t we are
> "punishing" them, because we want them, for the sake of people who
> still dwell on ANSI, to change their code. IMO we should do the
> opposite.

If they're already using UTF-8 strings, then we provide something like
BOOST_ALL_STD_STRINGS_ARE_UTF8 that they can define. The utf*_t classes
configure themselves to accept std::strings as UTF-8-encoded, and any
changes are completely transparent to those people. No punishment
involved.

For everyone else, we introduce the utf*_t API alongside the
std::string one, for those classes and functions that are not
encoding-agnostic. The std::string one can be deprecated in future
versions if the library author desires. Again, no punishment involved.
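
A sketch of how that switch might work (the macro name is taken from the
paragraph above; the class shape is hypothetical):

#include <stdexcept>
#include <string>

class utf8_t
{
public:
#ifdef BOOST_ALL_STD_STRINGS_ARE_UTF8
    // std::strings are trusted to already hold UTF-8:
    // adoption is implicit and cheap.
    utf8_t(const std::string& s) : data_(s) {}
#else
    // Otherwise the conversion is explicit and validating.
    explicit utf8_t(const std::string& s) : data_(s) { validate(); }
#endif

    const std::string& bytes() const { return data_; }

private:
    void validate() const
    {
        // Walk data_ and throw std::runtime_error on malformed
        // UTF-8 (checking logic omitted in this sketch).
    }
    std::string data_;
};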

>>> [...] - Once we overcome the troubled period of transition
>>> everything will be just great. No headaches related to file
>>> encoding detection and transcoding.
>>
>> It's the getting-there part that I'm concerned about.
>
> Me too, but again many other people already pointed out
> that a large portion of the code is completely encoding agnostic
> so there would be no impact if we stayed with std::string. There
> would be, if we add utf8_t.

Those portions of the code that are encoding-agnostic can continue
using std::string, and nothing changes. It's only the functions that
need to know the encoding that would change, and that change can be
gradual.

>>> Think about what will happen after we accept IPV6 and drop IPV4. The
>>> process will be painful but after it is done, there will be no more
>>> NAT, and co. and the whole network infrastructure will be
>>> simplified.
>>
>> That's a problem I've been watching carefully for many years now,
>> and I don't see that happening. [...]
>
> Yes, people (me included) are resistant to big changes even for the
> better. But I've learned that I should always consider the long-term
> impact.

As have I. :-) I think the design I'm proposing is low-impact enough
that people will adopt it. Slowly, but they will.

>> That's the beauty of it -- not everybody *needs* to accept it. Just
>> the people who write code that isn't encoding-agnostic.
>> Boost.FileSystem might provide a utf16_t overload for Windows, for
>> instance, so that it can automatically convert strings in other UTF
>> types. But I see no reason it would lose the existing interface.
>
> So you suggest that for example in the STL there would be (for
> example) besides the existing fstream and wfstream also a third
> "ufstream". I think that we actually should be reducing the interface
> not expanding it (yes I hear it ... "breaking changes!" :)).

I don't expect that the utf*_t classes will make it into the standard.
They definitely won't make it into the now-misnamed C++0x standard, and
it'll likely be another ten years before another one is hashed out --
by then, the UTF-8 conversion should be complete, so there will be no
need for it, except possibly to confirm that a string isn't malformed.

>>> - We will abandon std::string and be stuck with utf8_t which I
>>> *personally* already dislike :)
>>
>> Any technical reason why, other than what you've already written?
>
> Besides the ugly name and that is a new class ? No :)

If you can think of a more-acceptable-but-still-descriptive name for
it, I'm all ears. :-)

>> I hate to point this out, but people are *already* using other
>> programming languages. :-) C++ isn't new or sexy, and has some
>> pain-points (though many of the most egregious ones will be solved
>> with C++0x). Unicode handling is one of them, and in my opinion, the
>> utf*_t types will only ease that.
>
> And the solution is long overdue. And creating utf8_t is just putting
> the problem away, not solving it really.

I see it as merely easing the transition.


Chad Nelson

unread,
Jan 19, 2011, 2:54:27 PM1/19/11
to bo...@lists.boost.org
On Wed, 19 Jan 2011 09:58:13 -0500
Edward Diener <eldi...@tropicsoft.com> wrote:

>> I am a believer ;) and when people realize that UTF-8 is the way to
>> go, the pesky problems will vanish. Believe me today with ANSI


>
> I do not believe that UTF-8 is the way to go. In fact I know it is
> not, except perhaps for the very near future for some programmers
> ( Linux advocates ).
>

> Inevitably a Unicode standard will be adopted where every character
> of every language will be represented by a single fixed-length number

> of bits. [...]

I'm no Unicode expert, but the reason this hasn't happened might be
combinatorial explosion. In which case it might never happen. But I
could well be wrong. And I hope I am, the design you outline is
something I'd love to see.


Peter Dimov

unread,
Jan 19, 2011, 3:18:21 PM1/19/11
to bo...@lists.boost.org
Alexander Lamaison wrote:
> I was under the impression that Linux changed from interpreting char* as
> being in a multitude of different encodings to being in UTF-8 by default.

Well, it probably depends on what part of Linux we're talking to, but most
of the functions do not interpret char* as being in any encoding, nor do
they have a default. They just treat it as a byte sequence.

Peter Dimov

unread,
Jan 19, 2011, 3:15:10 PM1/19/11
to bo...@lists.boost.org
Dave Abrahams wrote:
> At Wed, 19 Jan 2011 19:09:48 +0200,
> Peter Dimov wrote:
> > The problem with using an Unicode string, be it QString or
> > utf8_string, to represent paths is that it forces you to pick an
> > encoding under POSIX. When the OS gives you a file name as char*, to
> > store it in your Unicode string, you have to interpret it. Then, to
> > give it back to the OS, you have to de-interpret it.
>
> Nonono; if you don't want to choose an encoding, you store it as a
> raw_string, (a.k.a. std::string, for example)!

OK. You're designing a portable library that talks to the OS. It has the
following functions:

T get_path( ... );
void process_path( T );

What do you use for T? string or utf8_string?

Peter Dimov

unread,
Jan 19, 2011, 3:25:05 PM1/19/11
to bo...@lists.boost.org
Dave Abrahams wrote:
> At Wed, 19 Jan 2011 09:58:13 -0500,
> Edward Diener wrote:
> >
> > I do not think that shoving UTF-8 down everybody's throats is the best
> > solution even now; I think a good set of classes to convert between
> > encoding standards is much better.
>
> Can we please tone down the rhetoric here?

It's OK. :-)

Either way, not shoving _something_ down everybody's throats is not an
option if you need to create a library that talks to the OS. You have to
pick some type, and if you pick string or char*, you have to make a decision
how to interpret it.

Dave Abrahams

unread,
Jan 19, 2011, 3:34:29 PM1/19/11
to bo...@lists.boost.org
At Wed, 19 Jan 2011 22:15:10 +0200,

Peter Dimov wrote:
>
> Dave Abrahams wrote:
> > At Wed, 19 Jan 2011 19:09:48 +0200,
> > Peter Dimov wrote:
> > > The problem with using an Unicode string, be it QString or
> > > utf8_string, to represent paths is that it forces you to pick an
> > > encoding under POSIX. When the OS gives you a file name as char*, to
> > > store it in your Unicode string, you have to interpret it. Then, to
> > > give it back to the OS, you have to de-interpret it.
> >
> > Nonono; if you don't want to choose an encoding, you store it as a
> > raw_string, (a.k.a. std::string, for example)!
>
> OK. You're designing a portable library that talks to the OS. It has
> the following functions:
>
> T get_path( ... );
> void process_path( T );
>
> What do you use for T? string or utf8_string?

I'm even less of an expert on encodings at the OS boundary than I am
an expert on encodings in general, but I'll take a shot at this
one.

OK, according to all the experts (like you), we should be trafficking
in UTF-8 everywhere, so I guess I'd say T is utf8_string (well, T is
boost::filesystem::path, but that begs the same questions, ultimately).

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com


Stewart, Robert

unread,
Jan 19, 2011, 3:42:03 PM1/19/11
to bo...@lists.boost.org
Dave Abrahams wrote:
> Peter Dimov wrote:
> > Dave Abrahams wrote:
>
> > > Nonono; if you don't want to choose an encoding, you store it as a
> > > raw_string, (a.k.a. std::string, for example)!
> >
> > OK. You're designing a portable library that talks to the OS. It has
> > the following functions:
> >
> > T get_path( ... );
> > void process_path( T );
> >
> > What do you use for T? string or utf8_string?
>
> OK, according to all the experts (like you), we should be trafficking
> in UTF-8 everywhere, so I guess I'd say T is utf8_string (well, T is
> boost::filesystem::path, but that begs the same questions,
> ultimately).

I think it depends upon get_path() and process_path(). If get_path() returns an OS byte sequence and process_path() uses its argument for OS calls, then T is std::string as both functions are encoding agnostic. If process_path() is to do character-based processing, then it probably needs a UTF8 string and so it might expect a utf8_string which, presumably, would have a converting constructor from std::string assumed to have unknown encoding. (I have no idea whether it is possible to determine the encoding from the byte sequence, but I suppose it is.)

In a system which assumes UTF8 encoding in std::strings, utf8_string might be a typedef for std::string, and the only concern is that all sources of such strings be UTF8.

_____
Rob Stewart robert....@sig.com
Software Engineer, Core Software using std::disclaimer;
Susquehanna International Group, LLP http://www.sig.com


Artyom

unread,
Jan 19, 2011, 3:50:17 PM1/19/11
to bo...@lists.boost.org
> From: Dave Abrahams <da...@boostpro.com>

>
> Matus Chochlik wrote:
> >
> > On Wed, Jan 19, 2011 at 5:26 PM, Dave Abrahams <da...@boostpro.com> wrote:
> > > Matus Chochlik wrote:
> > >
> > > *Scenario D:* We try for scenario A. and people still use Qstrings,
>wxStrings, etc.
> >
> > 'I think maybe you underestimate our influence.' :)
>
> Our influence, if we introduce new library components, is very great,
> because they're on a de-facto fast track to standardization, and an
> improved string library is exactly the sort of thing that would be
> adopted upstream. If we simply agree to a programming convention,
> that will have some impact, but much less.
>

Dave,

Most existing projects and frameworks have decided
on a single, useful encoding:

- C++:

+ Qt UTF-16 using QString
+ GtkMM UTF-8 using ustring
+ MFC UTF-16 using CString /when compiled in "Unicode mode"
+ ICU UTF-16 using UnicodeString

- C:
+ Gtk UTF-8 string

- Java: UTF-16 String
- C#: UTF-16 string
- Vala: UTF-8 String/using "char *"

And so on...

If you take a look at all the C++ frameworks, they
all have a way to convert their string to std::string
and back.

C++ hasn't picked yet, but C++ has a string,
and a very good one. And every existing
project has an interface to it.

The problem is we haven't decided on its encoding.

Yes, we can't make the standard say std::string is UTF-8,
but we can say other things.

Just as the standard deprecated auto_ptr (which I think is a crime, but
that is another story), it should deprecate all non-Unicode-aware
uses of std::string and say the default is UTF-8.

It already has u8"שלום", which creates a UTF-8
string using "char *"; the only remaining thing
is to adopt it.

All frameworks decided how they use Unicode and what
string they use.

Boost can and **should** decide - we use Unicode - and
we use UTF-8 as all frameworks did.

Decide and cut it. As Boost had decided not to
use tabs in source code or use BSL for all its
code base.

This would do only good.

Sometimes it is bad to support every bad decision
that was made.

As many Boost developers and users enjoy the
fact that Boost is in constant evolution, we
can evolve and decide:

On Windows char*/std::string etc. is UTF-8;
if you don't agree, don't use Boost.

Artyom

Robert Ramey

unread,
Jan 19, 2011, 3:56:07 PM1/19/11
to bo...@lists.boost.org
Peter Dimov wrote:
> Alexander Lamaison wrote:
>> I was under the impression that Linux changed from interpreting
>> char* as being in a multitude of different encodings to being in
>> UTF-8 by default.
>
> Well, it probably depends on what part of Linux we're talking to, but
> most of the functions do not interpret char* as being in any encoding,
> neither do they have a default. They just treat it as a byte sequence.

hmmm - that's what I always considered std::string to be. There's
no notion of locale in there.

I'm still not seeing why we can't continue to consider std::string
just a sequence of bytes with some extra sauce...

...and make a new class utf8_string, derived from it, which includes
a code point iterator and a function to return a utf8 "character or code
point or whatever it is".

I just can't see anything wrong with this. It doesn't redefine the
semantics (formal, intuitive, common usage) of std::string; utf8_string would
let one use the special unicode sauces when needed. And it could
be implicitly converted to std::string when passed as a function
argument. Finally, given the history of this, I don't believe utf8 is the
"end of the road". It still leaves open the possibility of the next
greatest thing - whatever that turns out to be. To summarize (a sketch
follows):

std::string - a sequence of bytes
utf8_string - a sequence of "code points" implemented in terms of
std::string (or at least convertible to std::string)
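
A minimal sketch of that layering, under the assumption that
construction is where validation happens (names hypothetical):

#include <string>

// A sequence of code points layered over a sequence of bytes.
// Implicit conversion back to std::string keeps old interfaces usable.
class utf8_string
{
public:
    explicit utf8_string(const std::string& bytes) : bytes_(bytes)
    { /* validate bytes_ as well-formed UTF-8 here (omitted) */ }

    operator const std::string&() const { return bytes_; }

    // The "special unicode sauces" would go here, e.g.:
    // code_point_iterator begin() const;
    // code_point_iterator end() const;

private:
    std::string bytes_;
};

void takes_bytes(const std::string&); // any existing interface

void demo(const utf8_string& s)
{
    takes_bytes(s); // converts implicitly, as suggested above
}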

Robert Ramey

Peter Dimov

unread,
Jan 19, 2011, 4:02:02 PM1/19/11
to bo...@lists.boost.org
Dave Abrahams wrote:
...

> > OK. You're designing a portable library that talks to the OS. It has
> > the following functions:
> >
> > T get_path( ... );
> > void process_path( T );
> >
> > What do you use for T? string or utf8_string?
>
> I'm even less of an expert on encodings at the OS boundary than I am
> an expert on encodings in general, but I'll take a shot at this
> one.
>
> OK, according to all the experts (like you), we should be trafficking
> in UTF-8 everywhere, so I guess I'd say T is utf8_string (well, T is
> boost::filesystem::path, but that begs the same questions, ultimately).

My answer is different. T is std::string, and:

- on POSIX OSes, this string is taken directly from the OS and given
directly to the OS, without any conversion;

- on Windows, this string is UTF-8 and is converted to UTF-16 before being
given to the OS, and converted from UTF-16 after being received from it.
This conversion should tolerate broken UTF-16 because the OS does so as
well.
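
On Windows that boundary conversion would presumably sit on top of the
Win32 conversion functions; a rough sketch of one direction (note that
the default flags, unlike MB_ERR_INVALID_CHARS, substitute replacement
characters rather than failing, which matches the tolerance described
above):

#include <string>
#include <windows.h>

// UTF-8 -> UTF-16 at the OS boundary (Windows only).
std::wstring to_utf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int n = ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), 0, 0);
    if (n == 0) return std::wstring();
    std::wstring out(n, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                          static_cast<int>(utf8.size()), &out[0], n);
    return out;
}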

Dave Abrahams

unread,
Jan 19, 2011, 4:04:06 PM1/19/11
to bo...@lists.boost.org
At Wed, 19 Jan 2011 12:50:17 -0800 (PST),

Artyom wrote:
>
> Most of existing projects and frameworks had decided
> about 1 single and useful encoding:
>
> - C++:
>
> + Qt UTF-16 using QString
> + GtkMM UTF-8 using ustring
> + MFC UTF-16 using CString /when compiled in "Unicode mode"
> + ICU UTF-16 using UnicodeString
>
> - C:
> + Gtk UTF-8 string
>
> - Java: UTF-16 String
> - C#: UTF-16 string
> - Vala: UTF-8 String/using "char *"
>
> And so on...
>
> If you take a look at all the C++ frameworks, they
> all have a way to convert their string to std::string
> and back.
>
> C++ hasn't picked yet, but C++ has a string,
> and a very good one.

I guess whether std::string is a good design could be considered a
matter of opinion.

> And every existing project has an interface to it.
>
> The problem is we haven't decided on its encoding.
>
> Yes, we can't make the standard say std::string is UTF-8,
> but we can say other things.

Like what?

> Just as the standard deprecated auto_ptr (which I think is a crime, but
> that is another story), it should deprecate all non-Unicode-aware
> uses of std::string and say the default is UTF-8.

The standard can't deprecate usage patterns, just language (which
includes the standard library) features.

> Boost can and **should** decide - we use Unicode - and
> we use UTF-8 as all frameworks did.

Except for all the UTF-16 frameworks you cited above?

> Decide and cut it.

Cut what?

> As Boost had decided not to use tabs in source code or use BSL for
> all its code base.

Those were easy to do without breaking code.

> This would do only good.
>
> Sometimes it is bad to support every bad decision
> that was made.

No argument there.

> As many Boost developers and users enjoy the
> fact that Boost is in constant evolution, we
> can evolve and decide:
>
> On Windows char*/std::string etc. is UTF-8;
> if you don't agree, don't use Boost.

Yes we can. It would break code, I'm pretty sure. I am not opposed
to breaking code when the benefits are worth it, but in this case I am
not yet convinced that there isn't an equally-good alternative that
doesn't break code. We're still exploring those alternatives.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com


Dave Abrahams

unread,
Jan 19, 2011, 4:32:54 PM1/19/11
to bo...@lists.boost.org
At Wed, 19 Jan 2011 23:02:02 +0200,

Peter Dimov wrote:
>
> Dave Abrahams wrote:
> ...
> > > OK. You're designing a portable library that talks to the OS. It has
> > > the following functions:
> > >
> > > T get_path( ... );
> > > void process_path( T );
> > >
> > > What do you use for T? string or utf8_string?
> >
> > I'm even less of an expert on encodings at the OS boundary than I am
> > an expert on encodings in general, but I'll take a shot at this
> > one.
> >
> > OK, according to all the experts (like you), we should be trafficking
> > in UTF-8 everywhere, so I guess I'd say T is utf8_string (well, T is
> > boost::filesystem::path, but that begs the same questions, ultimately).
>
> My answer is different. T is std::string, and:
>
> - on POSIX OSes, this string is taken directly from the OS and given
> directly to the OS, without any conversion;
>
> - on Windows, this string is UTF-8 and is converted to UTF-16 before
> being given to the OS, and converted from UTF-16 after being received
> from it. This conversion should tolerate broken UTF-16 because the OS
> does so as well.

A fine answer if:

a. you think the interface to std::string is a good one for posterity,
and

b. every other std::string that might be used along with your portable
library is guaranteed to be utf-8 encoded.

But I don't agree with a), and the interface to std::string makes a
future where b) holds look highly unlikely to me.

I prefer to have semantic constraints/invariants like "this is UTF-8
encoded" represented in the type system and enforced by public library
interfaces. I'm arguing for a future like that.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com


Peter Dimov

unread,
Jan 19, 2011, 5:07:18 PM1/19/11
to bo...@lists.boost.org
Dave Abrahams wrote:
> At Wed, 19 Jan 2011 23:02:02 +0200,
> Peter Dimov wrote:
> > My answer is different. T is std::string, and:
> >
> > - on POSIX OSes, this string is taken directly from the OS and given
> > directly to the OS, without any conversion;
> >
> > - on Windows, this string is UTF-8 and is converted to UTF-16 before
> > being given to the OS, and converted from UTF-16 after being received
> > from it. This conversion should tolerate broken UTF-16 because the OS
> > does so as well.

...

> I prefer to have semantic constraints/invariants like "this is UTF-8
> encoded" represented in the type system and enforced by public library
> interfaces. I'm arguing for a future like that.

But the semantics I outlined above only have this constraint under Windows.

Mathias Gaunard

unread,
Jan 19, 2011, 5:41:03 PM1/19/11
to bo...@lists.boost.org
On 19/01/2011 22:02, Peter Dimov wrote:

> - on Windows, this string is UTF-8 and is converted to UTF-16 before
> being given to the OS, and converted from UTF-16 after being received
> from it. This conversion should tolerate broken UTF-16 because the OS
> does so as well.

I see no need to tolerate bad practices if they cause obvious problems.

Brent Spillner

unread,
Jan 19, 2011, 6:25:34 PM1/19/11
to bo...@lists.boost.org
On 1/19/2011 11:33 AM, Peter Dimov wrote:
> This was the prevailing thinking once. First this number of bits was 16 (an
> incorrect assumption that claimed Microsoft and Java as victims), then
> it became 21 (or 22?). Eventually, people realized that this will never
> happen even if we allocate 32 bits per character, so here we are.

The OED lists ~600,000 words, so 32 bits is enough space to provide a
fully pictographic alphabet for over 7,000 languages as rich as English,
with room for a few line-drawing characters left over. Surely that's enough?

Peter Dimov

unread,
Jan 19, 2011, 6:50:30 PM1/19/11
to bo...@lists.boost.org
Mathias Gaunard wrote:
> On 19/01/2011 22:02, Peter Dimov wrote:
>
> > - on Windows, this string is UTF-8 and is converted to UTF-16 before
> > being given to the OS, and converted from UTF-16 after being received
> > from it. This conversion should tolerate broken UTF-16 because the OS
> > does so as well.
>
> I see no need to tolerate bad practices if they cause obvious problems.

It is possible to create a file whose name is not a valid UTF-16 sequence on
Windows, so the library ought to be able to work with it. You could go
either way in this specific case though, since such names are extremely rare
in practice.

Patrick Horgan

unread,
Jan 19, 2011, 8:23:44 PM1/19/11
to bo...@lists.boost.org
On 01/19/2011 02:33 AM, Matus Chochlik wrote:
> ... elision by patrick ...
> - It is extensible, so once we have done the painful
> transition we will not have to do it again. Currently
> utf-8 uses 1-4 (or 1-6) byte sequences to encode code
The 5 and 6 byte sequences are from early versions of utf-8 and have
known negative security implications. You should never use them in your
encoding, nor should you ever accept them as valid utf-8. The entire
Unicode code space (0 to 0x10FFFF) is encodable in 4-byte,
standard-compliant utf-8. Please see RFC 3629, "UTF-8, a transformation
format of ISO 10646", F. Yergeau, November 2003 (also STD 63). Also see
Table 3-7, Well-Formed UTF-8 Byte Sequences, in version 5.2 of the
Unicode Standard. I can't emphasize this enough. There have been real,
serious problems that cost people money, from following the older naive
spec.
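
In code, the lead-byte part of such a check might look like this (a
sketch following Table 3-7; continuation-byte and overlong checks for
the 3- and 4-byte forms are omitted):

// Length of the sequence started by a UTF-8 lead byte, or 0 if the
// byte may not start a sequence under RFC 3629.
int utf8_sequence_length(unsigned char b)
{
    if (b < 0x80) return 1; // 0xxxxxxx
    if (b < 0xC2) return 0; // continuation byte, or 0xC0/0xC1
                            // (always-overlong two-byte forms)
    if (b < 0xE0) return 2; // 110xxxxx
    if (b < 0xF0) return 3; // 1110xxxx
    if (b < 0xF5) return 4; // 11110xxx, up to U+10FFFF
    return 0;               // 0xF5-0xFF: beyond U+10FFFF, including
                            // the old 5- and 6-byte lead bytes
}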

> to 1-N bytes (unlike UCS-X and I'm not sure about
> UTF-16/32).
If you extended it, then it would not be utf-8, which is an encoding of UCS.
> So,
> [dark-sarcasm]
> even if we dig out the stargate or join the United
> Federation of Planets and captain Kirk, every time
> he returns home, brings a truckload of new writing
> scripts to support, UTF-8 will be able to handle it.
Well, most of the code space of UCS is still unused. There's plenty of
room. Over a million code points (0 to 0x10FFFF) is a lot.
> just my 0.02 strips of gold pressed latinum :)
> [/dark-sarcasm]
>
>
> Best regards,
>
> Matus

Edward Diener

unread,
Jan 19, 2011, 8:43:09 PM1/19/11
to bo...@lists.boost.org
On 1/19/2011 6:25 PM, Brent Spillner wrote:
> On 1/19/2011 11:33 AM, Peter Dimov wrote:
>> This was the prevailing thinking once. First this number of bits was 16 (an
>> incorrect assumption that claimed Microsoft and Java as victims), then
>> it became 21 (or 22?). Eventually, people realized that this will never
>> happen even if we allocate 32 bits per character, so here we are.
>
> The OED lists ~600,000 words, so 32 bits is enough space to provide a
> fully pictographic alphabet for over 7,000 languages as rich as English,
> with room for a few line-drawing characters left over. Surely that's enough?

It is technically enough. In fact Unicode only uses code points
in the range 0 to 0x10FFFF, and a UTF-32 value will therefore not exceed
0x10FFFF. So in fact UTF-32 can easily handle all of the code points in
Unicode.

But Unicode has the idea of an abstract character, which may be
represented by more than one code point. Whether an abstract character
is always considered a single character, or an amalgam of a single
character (code point) and various formatting/graphical code points,
is probably debatable. But if one assumes that an abstract character is
a single "character" in some encoding, then the way that Unicode has
mapped out abstract characters allows for that "character" to be larger
than what will fit into a single UTF-32 code unit.
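
A concrete case (standard Unicode data): even in UTF-32, one
user-perceived character can take several code units:

// An accented e as a combining sequence: one abstract character,
// two UTF-32 code units.
// U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT
const unsigned long e_acute[2] = { 0x0065, 0x0301 };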

Beman Dawes

unread,
Jan 19, 2011, 9:22:17 PM1/19/11
to boost
On Wed, Jan 19, 2011 at 2:42 PM, Alexander Lamaison <aw...@doc.ic.ac.uk>wrote:

> On Wed, 19 Jan 2011 19:42:25 +0200, Peter Dimov wrote:
>
> > Alexander Lamaison wrote:
> >
> >> By changing the OS-default encoding to assume char* string was UTF-8.
> >
> > You keep talking about "OS-default encoding", but there's no such thing.
> > POSIX operating systems do not assume anything about the encoding of
> char*
>
> I was under the impression that Linux changed from interpreting char* as
> being in a multitude of different encodings to being in UTF-8 by default.
>

Peter is correct, with a slight editorial clarification: "POSIX operating
systems [API's] do not assume anything about the encoding of char*".

When I was designing Boost Filesystem V3, several of the POSIX liaison folks
were kind enough to confirm this with me in person. Some of the shell
utilities do have notions of encoding, but not the API's.

Linux is a "POSIX-like" operating system, but there are places where it
deviates from the POSIX spec. So just because Linux does something a certain
way doesn't necessarily mean that POSIX specifies it that way.

--Beman

Patrick Horgan

unread,
Jan 19, 2011, 9:55:51 PM1/19/11
to bo...@lists.boost.org
On 01/19/2011 06:58 AM, Edward Diener wrote:
> ... elision by patrick...

>
> I do not believe that UTF-8 is the way to go. In fact I know it is
> not, except perhaps for the very near future for some programmers (
> Linux advocates ).
>
> Inevitably a Unicode standard will be adopted where every character of
> every language will be represented by a single fixed-length number of
> bits. Nobody will care any longer that this fixed length set of bits
> "wastes space", as so many people today hysterically are fixated on.
> Whether or not UTF-32 can do this now or not I do not know but this
> world where a character in some language on earth is represented by
> some arcane multi-byte encoding will end. If UTF-32 can not do it then
> UTF-nn inevitably will.
UTF-32 is the only UCS fixed-width encoding.

UTF-16 can encode most of the Basic Multilingual Plane in fixed width.
That's most of the characters in the world. If you know your problem
domain, and know that you are in the first code plane, then you can use
UTF-16 as a fixed-width encoding. If you know that you have to be able
to handle any UCS character, then you can't. Currently 107,296 of the
characters in UCS are defined, out of a total code space of 1,114,112
(0 to 0x10FFFF).

>
>
> I do not think that shoving UTF-8 down everybody's throats is the best
> solution even now; I think a good set of classes to convert between
> encoding standards is much better.

I agree with you. Nobody should shove any one solution down anyone's
throat. Instead, I wish that more people would understand the
trade-offs of different encodings and when each might be more desirable
instead of saying, "Oh, we can never do that." or "Oh, we must always
do that." The best thing is to understand your problem domain, and what
the implications of that domain are in each of the possible encodings.

The truth is that the web and XML apps all use Unicode, as do more and
more applications. Nobody considers doing new international
applications with anything other than Unicode. That means that you need
to know about the three encodings, UTF-8, UTF-16 and UTF-32, and their
trade-offs. If you're on a fast, lightly loaded machine with lots of
memory, there could be real advantages to UTF-32. If you're running on
a hand-held device with limited memory, UTF-8 could be a real winner.
That's a simplistic view of a complex decision, but if you're doing the
design for something you should educate yourself and make the complex
decision with forethought.

You can get your own copy of the Unicode 5.2 standard as a zipped pdf
file at http://www.unicode.org/versions/Unicode5.2.0/UnicodeStandard-5.2.zip

The 6.0 standard is being worked on as we speak.

Patrick

Dave Abrahams

unread,
Jan 19, 2011, 11:13:45 PM1/19/11
to bo...@lists.boost.org
At Thu, 20 Jan 2011 00:07:18 +0200,

Peter Dimov wrote:
>
> Dave Abrahams wrote:
> > At Wed, 19 Jan 2011 23:02:02 +0200,
> > Peter Dimov wrote:
> > > My answer is different. T is std::string, and:
> > >
> > > - on POSIX OSes, this string is taken directly from the OS and given
> > > directly to the OS, without any conversion;
> > >
> > > - on Windows, this string is UTF-8 and is converted to UTF-16 before
> > > being given to the OS, and converted from UTF-16 after being received
> > > from it. This conversion should tolerate broken UTF-16 because the OS
> > > does so as well.
>
> ...
>
> > I prefer to have semantic constraints/invariants like "this is UTF-8
> > encoded" represented in the type system and enforced by public library
> > interfaces. I'm arguing for a future like that.
>
> But the semantics I outlined above only have this constraint under
> Windows.

Sorry, I don't understand what you're saying here.

But let me say a little more about my point; maybe that will help. If
I get a std::string from "somewhere", I don't know what encoding it's
in, if any. The abstraction presented by std::string is essentially
"sequence of individually addressable and mutable chars that by
convention represents text in some unspecified way." It has lots of
interface that is aimed at manipulating the raw sequence of chars, and
none that helps with an interpretation of those chars.

IIUC, you're talking about changing the abstraction presented by
std::string to "sequence of individually addressable and mutable chars
that by convention represents text encoded as utf-8."

I would prefer to be handling something that presents the abstraction
"character string." I'm not sure exactly what that looks like, but
I'm pretty sure the "individually addressable and mutable chars" part
should go. I'd like to see an interface that prevents corrupting the
underlying data such that it no longer represents a valid sequence of
characters (or at least makes it highly unlikely that such corruption
could happen accidentally). Furthermore, there are lots of string-y
things I'd want to do that aren't provided—or aren't provided well—by
std::string, e.g. if (s1.starts_with(s2)) {...}

Does this make more sense?
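
For what it's worth, a rough sketch of the shape such an abstraction
might take (entirely hypothetical):

#include <string>

// Hypothetical text type: chars are not individually addressable or
// mutable, construction validates, string-y queries come built in.
class text
{
public:
    explicit text(const std::string& utf8_bytes) : bytes_(utf8_bytes)
    { /* reject malformed UTF-8 here (omitted) */ }

    // Because UTF-8 is self-synchronizing, a byte-wise prefix test is
    // also a code-point-wise one for valid data.
    bool starts_with(const text& p) const
    { return bytes_.compare(0, p.bytes_.size(), p.bytes_) == 0; }

    const std::string& utf8() const { return bytes_; } // read-only view

private:
    std::string bytes_; // invariant: always valid UTF-8
};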

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com


Patrick Horgan

unread,
Jan 19, 2011, 11:15:55 PM1/19/11
to bo...@lists.boost.org
On 01/19/2011 07:34 AM, Alexander Lamaison wrote:

> On Wed, 19 Jan 2011 16:13:04 +0100, Matus Chochlik wrote:
>
>>> I do not believe that UTF-8 is the way to go. In fact I know it is not,
>>> except perhaps for the very near future for some programmers ( Linux
>>> advocates ).
>> :-) Just for the record, I'm not a Linux advocate any more then I'm
>> a Windows advocate. I use both .. I'm writing this on a windows machine.
>> What I would like is the whole encoding madness/dysfunction (including
>> but not limited to the dual TCHAR/whateverchar-based interfaces) to stop.
>> Everywhere.
> Even if I bought the UTF-8ed-Boost idea, what would we do about the STL
> implementation on Windows which expects local-codepage narrow strings? Are
> we hoping MS etc. change these to match? Because otherwise we'll be
> converting between narrow encodings for the rest of eternity.
That's the reality already. As long as people use local narrow
encodings we will be converting between them. If your code runs on
Windows in Korea or in Spain, you'll get local-codepage narrow strings
that are incompatible. At least if there were a utf8_string type, or
utf16_string type, or utf32_string type, with documentation about how
to implement templated conversions to them (code conversion facets),
someone could write a library to use them, and everyone using all of
these different local encodings would know what to do to use the
library. The way it is today it's much more difficult to figure out how
to write a generic library that accepts text from a user. What does a
char* or a std::basic_string<char> imply about encoding? Who knows what
you'll get. A local 8-bit code page? Shift-JIS? UTF-8? EUC? This is
just saying that, hey, here's one way to deal with this issue.

This sort of scheme lets the Windows STL implementation exist, but says,
here's what you need to do so that I know how to treat the text you pass
to me as an argument. If it's in a local code page you need to convert
it to what I want. With validating string types that support the three
UCS encodings you can trust that the data is validly encoded, although
all the normal issues about whether the content is meaningful to you
still exist.

If you use normal code conversion facets as specified for C++ locales,
for conversion from local code pages to your strings, then you can
leverage existing work. Why reinvent the wheel?

Patrick

Dave Abrahams

unread,
Jan 19, 2011, 11:18:05 PM1/19/11
to bo...@lists.boost.org
At Wed, 19 Jan 2011 23:25:34 +0000,

Brent Spillner wrote:
>
> On 1/19/2011 11:33 AM, Peter Dimov wrote:
> > This was the prevailing thinking once. First this number of bits was 16 (an
> > incorrect assumption that claimed Microsoft and Java as victims), then
> > it became 21 (or 22?). Eventually, people realized that this will never
> > happen even if we allocate 32 bits per character, so here we are.
>
> The OED lists ~600,000 words, so 32 bits is enough space to provide a
> fully pictographic alphabet for over 7,000 languages as rich as English,
> with room for a few line-drawing characters left over. Surely that's enough?

Even if it's theoretically possible, the best standards organization
the world has come up with for addressing these issues was unable to
produce a standard that did it. As far as I'm concerned, Boost is
stuck with the results of the Unicode Consortium until some better
standards body comes along, and the likelihood of anyone generating
the will to overturn their results as the dominant paradigm is so low
as to render that possibility unworthy of attention. Certainly, doing
it ourselves is out-of-scope for Boost.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com


Dean Michael Berris

unread,
Jan 19, 2011, 11:29:01 PM1/19/11
to bo...@lists.boost.org
On Thu, Jan 20, 2011 at 12:13 PM, Dave Abrahams <da...@boostpro.com> wrote:
>
> I would prefer to be handling something that presents the abstraction
> "character string."  I'm not sure exactly what that looks like, but
> I'm pretty sure the "individually addressable and mutable chars" part
> should go.  I'd like to see an interface that prevents corrupting the
> underlying data such that it no longer represents a valid sequence of
> characters (or at least makes it highly unlikely that such corruption
> could happen accidentally).  Furthermore, there are lots of string-y
> things I'd want to do that aren't provided—or aren't provided well—by
> std::string, e.g. if (s1.starts_with(s2)) {...}
>
> Does this make more sense?
>

This discussion is interesting for a lot of reasons. However, I think
it's time to address the root cause of the problem with strings in
C++: that the way we think of strings right now is broken. Everything
follows from this basic problem.

It's time to call a spade a spade: std::string is not as well thought
out as everybody might seem to think. I think since we've had
something like 20 years to think about this problem, it's time to
consider revolution instead of evolution.

Immutable strings with lazy operations seem to be the most effective
way to deal with strings from a design/implementation perspective.
Encoding is just a matter of rendering, or in fusion/mpl parlance is a
view of the data.

String mutation is a concurrency hindrance, encourages bad programming
practice, and is generally an overrated feature that makes designing
efficient strings still revert to pointers and value twiddling. In
this day and age with all the idioms in C++ we already know, we should
really be thinking about changing the way people think about strings.

Of course it shouldn't be as drastic as wiping out std::string from
the face of all programs -- but something that allows for taking data
from an std::string and becoming immutable, allowing lazy operations
on it, and overall making a crazy efficient string implementation
should be the goal first before we think about dealing with encodings
and what not. Maybe it's time someone formalizes a string calculus and
implements a string type that's worthy of being called a modern
string.
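
As a toy sketch of the lazy-immutable idea (a rope-like node and nothing
more, using Boost smart pointers):

#include <string>
#include <boost/make_shared.hpp>
#include <boost/shared_ptr.hpp>

// Toy immutable string: operator+ records the concatenation lazily
// instead of copying; the flat form is only produced on demand.
class istring
{
public:
    explicit istring(const std::string& s)
        : leaf_(boost::make_shared<std::string>(s)) {}

    friend istring operator+(const istring& a, const istring& b)
    {
        istring r;
        r.left_  = boost::make_shared<istring>(a);
        r.right_ = boost::make_shared<istring>(b);
        return r;
    }

    std::string str() const // force evaluation
    { return leaf_ ? *leaf_ : left_->str() + right_->str(); }

private:
    istring() {}
    boost::shared_ptr<const std::string> leaf_;
    boost::shared_ptr<const istring> left_, right_;
};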

Going-back-to-just-watching'ly yours,

--
Dean Michael Berris
about.me/deanberris

Patrick Horgan

unread,
Jan 19, 2011, 11:29:58 PM1/19/11
to bo...@lists.boost.org
On 01/19/2011 08:33 AM, Peter Dimov wrote:

> Edward Diener wrote:
>> Inevitably a Unicode standard will be adopted where every character
>> of every language will be represented by a single fixed-length number
>> of bits.
>
> This was the prevailing thinking once. First this number of bits was
> 16 (an incorrect assumption that claimed Microsoft and Java as victims),
> then it became 21 (or 22?). Eventually, people realized that this will
> never happen even if we allocate 32 bits per character, so here we are.
At 32 bits we can encode all current languages, all extinct languages,
Klingon, and still have most of the space empty. You might want to read
the Unicode spec which talks clearly about this. If you just read
through the end of Chapter 6 you'll have a great overall understanding
of Unicode. It's available as a compressed pdf file at:
http://www.unicode.org/versions/Unicode5.2.0/UnicodeStandard-5.2.zip

Patrick

Patrick Horgan

unread,
Jan 19, 2011, 11:38:43 PM1/19/11
to bo...@lists.boost.org
On 01/19/2011 08:51 AM, Mathias Gaunard wrote:
> ... elision by patrick ...
>>
>
>
> *Scenario D:*
>
> Use Ranges, don't care whether it's std::string, whatever_string, etc.
> This also allows maximum efficiency, with lazy concatenation,
> transformations, conversion, filtering etc.
>
> My Unicode library works with arbitrary ranges, and you can adapt a
> range in an encoding into a range in another encoding.
> This can be used to lazily perform encoding conversion as the range is
> iterated; such conversions may even be pipelined.
Sounds interesting. Of course ranges could be used with strings of
whatever sort. Is the intelligence about the encoding in the ranges?
As you iterate a range, does it move byte by byte or character by
character? Does it deal with compositions? Is it available to read?

Peter Dimov

unread,
Jan 19, 2011, 11:43:48 PM1/19/11
to bo...@lists.boost.org
Dave Abrahams wrote:
> IIUC, you're talking about changing the abstraction presented by
> std::string to "sequence of individually addressable and mutable chars
> that by convention represents text encoded as utf-8."

Something like that. string is just char[] with value semantics. It doesn't
necessarily hold a valid UTF-8 sequence.

> I would prefer to be handling something that presents the abstraction
> "character string." I'm not sure exactly what that looks like, but
> I'm pretty sure the "individually addressable and mutable chars" part
> should go. I'd like to see an interface that prevents corrupting the
> underlying data such that it no longer represents a valid sequence of
> characters (or at least makes it highly unlikely that such corruption
> could happen accidentally). Furthermore, there are lots of string-y
> things I'd want to do that aren't provided—or aren't provided well—by
> std::string, e.g. if (s1.starts_with(s2)) {...}
>
> Does this make more sense?

It makes sense in the abstract. But there is no way to protect against
corruption without also setting an invariant that the sequence is not
corrupted (represents valid UTF-8), and I don't usually need such a string
in the interfaces we're discussing, although it can certainly be useful on
its own. The interfaces that talk to the OS need to be able to carry
arbitrary char sequences (in the POSIX case). Even an interface that
displays the string, one that by necessity must interpret it as UTF-8,
should preferably handle invalid UTF-8 and display some placeholders instead
of the invalid subsequence - it's better for the user to see parts of the
string than nothing at all. It's even worse to abort the whole operation
with an invalid_utf8 exception.

I don't particularly like string's mutable chars, but they don't mutate
themselves without my telling them to, so things tend to work out fine. :-)
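
That placeholder policy is easy to sketch: replace rather than throw,
reusing a lead-byte classifier like the one shown earlier in the thread
(the helper below is assumed, not a real library function):

#include <string>

int utf8_sequence_length(unsigned char b); // as sketched earlier

// Replace each invalid byte with U+FFFD REPLACEMENT CHARACTER
// (EF BF BD in UTF-8) instead of aborting with an exception.
std::string sanitize_utf8(const std::string& in)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); )
    {
        int len = utf8_sequence_length(static_cast<unsigned char>(in[i]));
        bool ok = len > 0 && i + len <= in.size();
        for (int k = 1; ok && k < len; ++k) // continuation bytes?
            ok = (static_cast<unsigned char>(in[i + k]) & 0xC0) == 0x80;
        if (ok) { out.append(in, i, len); i += len; }
        else    { out += "\xEF\xBF\xBD"; ++i; }
    }
    return out;
}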

Patrick Horgan

unread,
Jan 20, 2011, 12:09:45 AM1/20/11
to bo...@lists.boost.org
On 01/19/2011 09:50 AM, Artyom wrote:
> ... elision by patrick ...
> Take a look on this code:
>
> http://art-blog.no-ip.info/files/nowide.zip
Building the tests failed on my Linux box with gcc 4.6. Is it supposed
to work here?


2/2 Testing: test_streambuf
2/2 Test: test_streambuf
Command: "/usr/local/downloads/tmp/nowide/test_streambuf"
Directory: /usr/local/downloads/tmp/nowide
"test_streambuf" start time: Jan 19 20:58 PST
Output:
----------------------------------------------------------
Testing input device
Testing input device, small buffer
Testing output device
Testing output device, small buffer size
Testing output device, reset
Testing seek fault
Testing tell fault
Testing random access device
Error /usr/local/downloads/tmp/nowide/test/test_streambuf.cpp:231
int(io.tellg())==4
<end of output>
Test time = 0.01 sec
----------------------------------------------------------
Test Failed.
"test_streambuf" end time: Jan 19 20:58 PST
"test_streambuf" time elapsed: 00:00:00
----------------------------------------------------------

End testing: Jan 19 20:58 PST

> This is the code I use for my projects that implements
> what I'm talking about - simple, easy to use, straightforward.
>
> Also don't forget two things:
>
> 1. Microsoft deprecated the ANSI API and does not recommend
> using it.
>
> If the only OS that gives us most of the encoding headaches
> has deprecated the ANSI API, I don't think that Boost should
> continue supporting it.
>
> 2. The whole world has already moved to Unicode; Microsoft
> did this as well.
>
> They did it in their incompatible-with-rest-of-the-world
> way... But still they did it too - so we can either continue
> ignoring the fact that UTF-8 is the ultimate encoding,
> or go forward.
>
> Artyom




Patrick Horgan

unread,
Jan 20, 2011, 12:16:45 AM1/20/11
to bo...@lists.boost.org
On 01/19/2011 09:52 AM, Peter Dimov wrote:
> ... elision by patrick ...
>
> I'm not sure that I do, either. Nevertheless, people at the Unicode
> consortium have been working on that for... 20 years now? What
> technical obstacle that currently blocks their progress do you foresee
> disappearing in the future? Occam says that variable width characters
> are simply a better match for the problem domain, even when character
> width in bits is not a problem.
You lost me with the last line. I'm thinking you're talking about
Occam's razor, which says that we should prefer simpler explanations for
things except when a more complicated explanation does a better job of
explaining the facts. I'm completely lost as to how that would apply to
choosing a variable-width encoding over a fixed-width encoding.

Patrick

Artyom

unread,
Jan 20, 2011, 12:39:48 AM1/20/11
to bo...@lists.boost.org
>
> > Boost can and **should** decide - we use Unicode - and
> > we use UTF-8 as all frameworks did.
>
> Except for all the UTF-16 frameworks you cited above?
>

Short reminder:

http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful

- Qt supports UTF-16 only from version 4; before that, Qt 3 supported only UCS-2!
(And it wasn't long ago.)
- Java supports UTF-16 from 1.5; before that, UCS-2
- Windows somehow supports UTF-16 starting from XP
- MS SQL Server does not support UTF-16 yet (only UCS-2)

I can continue...

UTF-16 is a "historical mistake": some (long)
time ago Unicode was supposed to be 16-bit,
and in those days a 16-bit character was very reasonable, but
it didn't work out - so UTF-16 was invented.

No modern project should pick it, as it gives more problems
than it solves.

Not to mention that until char16_t is supported by
all compilers, it will be hard to support it in C++.

(and no, wchar_t is not good for UTF-16)

Just a small point before we think of picking
UTF-16.

My $0.02

Artyom

Patrick Horgan

unread,
Jan 20, 2011, 2:20:31 AM1/20/11
to bo...@lists.boost.org
On 01/19/2011 11:15 AM, Matus Chochlik wrote:
> ...elision by patrick...
> This is where the (Boost.)Locale and (Boost.)Unicode libraries could provide
> insight into how to extend the std::string interface or be the testbed for
> new additions to the standard library related to string manipulation.
> (Provided, the standard adopts UTF-8 as a native encoding. Or does it already ?)
In recent C++ specs you can write u8"a string to be considered encoded
in utf-8". If a wide-char string literal is next to a u8 string literal
they don't concatenate; it's an error. There's of course all the locale
stuff, including codecvt_utf8. A byte must be large enough to hold an
8-bit utf-8 code unit. That's all that's in the C++ spec so far.
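
For example, with those pieces together (assuming a C++0x-conforming
library; note that codecvt_utf8<wchar_t> only covers UCS-2 on platforms
with a 16-bit wchar_t):

#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // u8 literal: the bytes of s are UTF-8 regardless of the
    // execution character set.
    std::string s = u8"gr\u00FC\u00DF dich";

    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    std::wstring w = conv.from_bytes(s);  // UTF-8 -> wide
    std::string back = conv.to_bytes(w);  // wide -> UTF-8
    return 0;
}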

Patrick

Patrick Horgan

unread,
Jan 20, 2011, 3:05:47 AM1/20/11
to bo...@lists.boost.org
On 01/19/2011 11:54 AM, Chad Nelson wrote:
> On Wed, 19 Jan 2011 09:58:13 -0500
> Edward Diener<eldi...@tropicsoft.com> wrote:
>
>>> I am a believer ;) and when people realize that UTF-8 is the way to
>>> go, the pesky problems will vanish. Believe me today with ANSI

>> I do not believe that UTF-8 is the way to go. In fact I know it is
>> not, except perhaps for the very near future for some programmers
>> ( Linux advocates ).
>>
>> Inevitably a Unicode standard will be adopted where every character
>> of every language will be represented by a single fixed-length number
>> of bits. [...]
> I'm no Unicode expert, but the reason this hasn't happened might be
> combinatorial explosion. In which case it might never happen. But I
> could well be wrong. And I hope I am, the design you outline is
> something I'd love to see.
It's already here and has been for a long time. That's just UCS encoded
as UTF-32. UCS isn't a new thing. They started on the standard in the
late 80s and the standard was first copyrighted in 1991. They've come a
long way. All the common languages and many of the uncommon languages
are supported. Already many dead languages are supported. Languages
with support added in 5.1 and 5.2 were Cham, Kayah Li, Lepcha, Ol
Chiki, Rejang, Saurashtra, Sundanese, Vai, Bamum, Javanese, Lisu, Meetei
Mayek, Samaritan, Tai Tham, and Tai Viet.

bernardH

unread,
Jan 20, 2011, 3:41:17 AM1/20/11
to bo...@lists.boost.org
Dave Abrahams <dave <at> boostpro.com> writes:

>
> At Wed, 19 Jan 2011 23:25:34 +0000,
> Brent Spillner wrote:
> >
> > On 1/19/2011 11:33 AM, Peter Dimov wrote:
> > > This was the prevailing thinking once. First this number of bits was 16 (an
> > > incorrect assumption that claimed Microsoft and Java as victims), then
> > > it became 21 (or 22?). Eventually, people realized that this will never
> > > happen even if we allocate 32 bits per character, so here we are.
> >
> > The OED lists ~600,000 words, so 32 bits is enough space to provide a
> > fully pictographic alphabet for over 7,000 languages as rich as English,
> > with room for a few line-drawing characters left over. Surely that's enough?
>
> Even if it's theoretically possible, the best standards organization
> the world has come up with for addressing these issues was unable to
> produce a standard that did it.

I must confess a lack of knowledge wrt encodings, but my understanding
is that strings are sequences of some raw data (without semantics),
code points and glyphs.

Current/Upcoming std::string , std::u16string and std::u32string
would be the raw data containers, with char*, char16_t* and
char32_t* as random iterators.

I believe that wrt encoding, one size does not fit all because
of the domain/architecture specific tradeoffs between memory
consumption and random access speed.
(However, maybe two sizes fit all, namely utf-8 for compact
representation and utf-32 for random access).

So my uninformed wish would be for something along these lines
(disregarding constness issues for the moment):

namespace std {
namespace unicode
{

template<typename CharT> struct code_points {
    typedef /* implementation defined */ iterator;

    explicit code_points(std::basic_string<CharT>& s_) : s(s_) {}

    iterator begin();
    iterator end();
    // ...
    std::basic_string<CharT>& s;
};

// convenience function
template<typename CharT>
code_points<CharT> as_code_points(std::basic_string<CharT>& s)
{ return code_points<CharT>(s); }

} // namespace unicode
} // namespace std
code_points<> would be specialized to
provide a random access code_points<char32_t>::iterator,
while code_points<char>::iterator would be a forward iterator.

Algorithms processing sequences of code points could
be specialized to take advantage of random access when available.

template<typename CharT> struct glyphs{}; would also be provided
but no random access could be provided (utf-64 anyone ? :) )

Note that the usual idiom of

for ( ; b != e; ++b)
{ process(*b); }

would not be as efficient as possible for variable-length
encodings of code points (e.g. utf-8), because process
certainly performs the same operations as ++b to retrieve the
whole code point, so we should prefer

while (b != e)
{ b = process(b); }

The problem is that I don't have the knowledge to know if
processing code points (instead of glyphs) is truly relevant
in practice. If it is, I believe that something along the lines of
my proposal would:
1°) leverage existing std::basic_string<>,
2°) empower the end-user to select the memory consumption
/ algorithmic complexity tradeoff when processing code points.

What do others think of this?

Best Regards,

Bernard

Matus Chochlik

unread,
Jan 20, 2011, 3:45:45 AM1/20/11
to bo...@lists.boost.org
On Wed, Jan 19, 2011 at 8:28 PM, Dave Abrahams <da...@boostpro.com> wrote:
>>
>> OK, I see. But, is there any chance that the standard itself would
>> be updated so that it first would recommend to use UTF-8 with C++
>> strings.
>
> Well, never say "never," but... never.  Such recommendations are not
> part of the standard's mission.  It doesn't do things like that.

My view of what the standardization committee is willing to do may be
naive, but generally I don't see why this could not be done. Other major
languages (Java, C#, etc.) picked a single "standard" encoding, and
in those languages you treat text with other encodings as a special case.

If C++ recommended the use of UTF-8, this would probably kickstart
the OS and compiler vendors to follow, or at least to fix their
implementations of the standard library and the OS APIs to accept
UTF-8 by default (if we agree that this is a good idea).

>
>> After some period of time all other encodings would be deprecated
>
> By whom?

By the same committee that made the recommendation in the first place.

>
>> I really see all the obstacles that prevent us from just switching
>> to UTF-8, but adding a new string class will not help for the same
>> reasons adding wstring did not help.
>
> I don't see the parallel at all.  wstring is just another container of
> bytes, for all practical purposes.  It doesn't imply any particular
> encoding, and does nothing to segregate the encoded from the raw.

Maybe wstring is not officially UTF-16 or UTF-32 or UCS, but
on most platforms it is at least treated as "the Unicode string",
vague as that term is. What I am afraid of is
that just like the use of wchar_t and wstring spawned the dual
interface used by Winapi and followed by many others (including
myself in the past), introducing a third (semi-)standard string class
will spawn a "ternary" interface (but I may be wrong or mixing the
order of the events mentioned above).


>
>> As I already said elsewhere I think that this is a problem that has
>> to be solved "organizationally".
>
> Perhaps.  The type system is one of our organizational tools, and
> Boost has an impact insofar as it produces components that people use,
> so if we aren't able to produce some flagship library components that
> help with the solution, we have little traction.

I believe in strong typing, but... OK, for the sake of argument, where
do we imagine utf8_t (or whatever its name will be) will be used, and
what is our long-term plan for std::string?

If I design a library or an application, should I use utf8_t everywhere?
As the type of class member variables and of the parameters of functions
and constructors? Or should I stick to std::string (or perhaps wstring)
for maximum compatibility with the rest of the world?

>
>> >> > *Scenario E:* We add another string class and everyone adopts it
>> >>
>> I meant that for example on POSIX OSes the POSIX C-API
>> did not have to be changed or extended by a new set of functions
>> doing the same things, but using a new character type, when they
>> switched from the old encodings to UTF-8.
>
> ...and people still have the problem that they lose track of what's
> "raw" and what's encoded as utf-8.

Yes, but in the end, they will get used to it. There are many dangerous
things in C++ that you should not do (like dereferencing a null or
dangling pointer, doing C-pointer arithmetic in the presence of
inheritance, etc.), and mixing UTF-8 and other encodings would be one
of them. It is a breaking change, but it would not be the first one in
C++'s history.

>
>> To compare two strings you still can use strcmp and not utf8strcmp;
>> to collate strings you use strcoll and not utf8strcoll, etc.
>
> Yeah... but surely POSIX's strcmp only tells you whether the two
> strings have the same sequence of code points, not whether they have
> the same characters, right?  And if you inadvertently compare a "raw"
> string with an equivalent utf-8-encoded string, what happens?

Undefined behavior, your application segfaults, aborts, silently fails...
(what happens if you dereference a dangling pointer ?)

BR,

Matus

Matus Chochlik

unread,
Jan 20, 2011, 3:59:51 AM1/20/11
to bo...@lists.boost.org
On Wed, Jan 19, 2011 at 8:50 PM, Chad Nelson
<chad.thec...@gmail.com> wrote:
>
> Do you see another way to provide those conversions, and automatic
> verification of proper UTF coding? (Automatic verification is a very
> good thing, without it someone won't use it or will forget to, and open
> up their programs to exploitation.)

Yes: implementing it in std::string in some future standard.

>
> If Boost comes out with a version that breaks existing programs,
> companies just won't upgrade to it. I can keep one of the companies
> that mine works with upgrading, because the group that I work with is
> the only one there using C++ and they listen to me, but most companies
> have a lot more invested in the existing system. Believe me, any
> breaking changes have to be eased in over many versions -- the "boiling
> a frog" approach. :-)

Of course this is a valid point, and what we should do is evaluate the
potential damage. There have been breaking changes in Boost before, and
the end-users eventually accepted them (even while complaining loudly).
Boost is a cutting-edge library; such changes should be avoided if
possible, but they should not be avoided completely. This would require
a lot of PR and announcing the changes well in advance.

>
> If they're already using UTF-8 strings, then we provide something like
> BOOST_ALL_STD_STRINGS_ARE_UTF8 that they can define. The utf*_t classes
> configure themselves to accept std::strings as UTF-8-encoded, and any
> changes are completely transparent to those people. No punishment
> involved.

OK this could work.

>
> For everyone else, we introduce the utf*_t API alongside the
> std::string one, for those classes and functions that are not
> encoding-agnostic. The std::string one can be deprecated in future
> versions if the library author desires. Again, no punishment involved.
>
>
> I don't expect that the utf*_t classes will make it into the standard.
> They definitely won't make it into the now-misnamed C++0x standard, and
> it'll likely be another ten years before another one is hashed out --
> by then, the UTF-8 conversion should be complete, so there will be no
> need for it, except possibly to confirm that a string isn't malformed.
>
>>
>> Besides the ugly name and that is a new class ? No :)
>
> If you can think of a more-acceptable-but-still-descriptive name for
> it, I'm all ears. :-)

I have an idea: what about boost::string, which could possibly become
the next std::string in the future.

>> And the solution is long overdue. And creating utf8_t is just putting
>> the problem away, not solving it really.
>
> I see it as merely easing the transition.

OK, if the long term plan is:

1) design and implement boost::string using UTF-8 doing all the things
like code-point iteration, character iteration, convenience stuff like
starts-with, ends-with, replace, trim, etc., etc. with as much backward
compatibility with std::string as possible without hindering progress

2) try really hard to push it to the standard

then I'm on board with that.
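
For reference, Boost.StringAlgorithm already provides byte-oriented
versions of several of those conveniences on plain std::string; the
boost::string sketched above would need UTF-8-aware reimplementations
of them. A quick illustration of the existing byte-level facilities:

#include <boost/algorithm/string.hpp>
#include <cassert>
#include <string>

int main()
{
    std::string s = "   utf8_t or boost::string?   ";

    // Byte-level trim/prefix/suffix tests; fine for ASCII delimiters,
    // unaware of Unicode whitespace or combining sequences.
    std::string t = boost::algorithm::trim_copy(s);
    assert(boost::algorithm::starts_with(t, "utf8_t"));
    assert(boost::algorithm::ends_with(t, "?"));

    boost::algorithm::replace_all(t, "utf8_t", "boost::string");
    assert(t == "boost::string or boost::string?");
}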

Patrick Horgan

unread,
Jan 20, 2011, 4:00:34 AM1/20/11
to bo...@lists.boost.org
On 01/19/2011 12:56 PM, Robert Ramey wrote:
> ... elision by patrick ...
> std::string - a sequence of bytes
> utf8_string - a sequence of "code points" implemented in terms of
> std::string.
With the ability to specify a conversion facet to convert from your
local encoding to utf-8. The string would still validate the utf-8
received from the conversion facet.

What do you do about things that can validly be represented by one
code point, or by a base character with one or more combining
characters? For example, Ü can be represented by U+00DC, a capital U
with diaeresis, or by the two code points U+0055 U+0308, a U followed
by a combining diaeresis. Ü <- that one is the base-plus-combining
form, and the previous one is the single precomposed character. The
spec says that these must be considered absolutely equivalent. Will our
utf8_string class always choose one representation over another?
Certainly, to make choices like this you'd need the character database
from Unicode.
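
To see the two representations concretely, here is a small
self-contained C++ sketch; the string contents are spelled as UTF-8
byte escapes, and nothing here depends on the proposed utf8_string
class:

#include <cstddef>
#include <cstdio>
#include <string>

int main()
{
    // Both spell the grapheme Ü, with different code point sequences.
    std::string precomposed = "\xC3\x9C";    // U+00DC
    std::string decomposed  = "U\xCC\x88";   // U+0055 U+0308

    // Byte-wise (and therefore code-point-wise) they are not equal...
    std::printf("equal bytes: %s\n",
                precomposed == decomposed ? "yes" : "no");  // "no"

    // ...yet canonical equivalence says they are the same text, which
    // is why comparison needs normalization rather than memcmp.
    for (std::size_t i = 0; i < precomposed.size(); ++i)
        std::printf("%02X ", (unsigned char)precomposed[i]);
    std::printf("/ ");
    for (std::size_t i = 0; i < decomposed.size(); ++i)
        std::printf("%02X ", (unsigned char)decomposed[i]);
    std::printf("\n");   // C3 9C / 55 CC 88
}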

So, if you're iterating over the utf8_string with an iterator iter, what
type does *iter return? It could _consume_ a lot of bytes.

Is it a char32_t with the character in it, or is it another utf8_string
with only one character in it? I'd say char32_t, because that can hold
anything in UCS.

So then what about *iter = thechar? What type or types can thechar be?

char32_t, char16_t, wchar_t, char, unsigned char, int, int32_t, a
utf8_string with only one "character" to be copied in, or a utf8_string
of which we just take the first character?

I'd probably use char32_t in both those cases.

Food for thought. I agree; I'd like to see it derived from
std::string so you can pass it to things that expect a std::string and
don't care so much about encoding.

Patrick

Lassi Tuura

unread,
Jan 20, 2011, 4:17:45 AM1/20/11
to bo...@lists.boost.org
Hi,

> The OED lists ~600,000 words, so 32 bits is enough space to provide a
> fully pictographic alphabet for over 7,000 languages as rich as English,
> with room for a few line-drawing characters left over. Surely that's enough?

It could be. Depends on what problems you are trying to solve.

Languages in the world operate in many interestingly different ways. Enabling computers to input, store, display, typeset, hyphenate, search, spell check, render to speech, and perform other multi-lingual text tasks sometimes involves rules more complex than those used for English.

The Unicode consortium (unicode.org) provides lots of excellent material on these issues, including FAQs. If you are genuinely interested in solving text processing issues for the entire world, I highly recommend a visit over there.

Not all software needs to care about those problems. For a library, one needs to decide which set of tasks and languages to support. If the target is all text processing tasks for the entire world, one may end up with strange-seeming ideas, like a variable number of code units per character, or random access to strings being a lower priority.

Then there are constraints. Coming across as having unnecessarily doubled the app's memory use might earn library designers a seriously bad reputation. Refusing to, say, display all files in a directory may get users upset, even if the filenames aren't valid by some standard or another.

Of course when there are other goals - perhaps software needs to handle any text but treats it as an opaque blob, or perhaps the author values the beauty of internal design more than supporting languages in far-flung corners of the world, or the app is such that butchering the names of 50% of the world's population will have no dire consequences - one will likely end up with a different design.

To give you a taste of some of the complex issues, here are a few quotes from the South Asian Scripts FAQ, http://www.unicode.org/versions/Unicode5.0.0/ch09.pdf:

> The writing systems that employ Devanagari and other Indic scripts constitute abugidas -- a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. [...] Devanagari characters, like characters from many other scripts, can combine or change shape depending on their context. [...] Additionally, a few Devanagari characters cause a change in the order of the displayed characters. [...] Some Devanagari consonant letters have alternative presentation forms whose choice depends on neighboring consonants. [...] Devanagari has a collection of nonspacing dependent vowel signs that may appear above or below a consonant letter, as well as spacing dependent vowel signs that may occur to the right or to the left of a consonant letter or consonant cluster. [...] If
the superscript mark RAsup is to be applied to a dead consonant that is subsequently replaced by its half-consonant form, then the mark is positioned so that it applies to the form that serves as the base of the consonant cluster. [...]


You might want to also read:

http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html
http://blog.mozilla.com/dmandelin/2008/02/14/wtf-16/

Regards,
Lassi

Patrick Horgan

unread,
Jan 20, 2011, 4:36:22 AM1/20/11
to bo...@lists.boost.org
On 01/19/2011 09:39 PM, Artyom wrote:
>> ... elision by patrick ...
> Short reminder:
>
> http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
Dueling reminder:
http://www.joelonsoftware.com/articles/Unicode.html

Patrick

Artyom

unread,
Jan 20, 2011, 4:40:16 AM1/20/11
to bo...@lists.boost.org
> From: Patrick Horgan <phor...@gmail.com>

>
> On 01/19/2011 09:50 AM, Artyom wrote:
> > ... elision by patrick ...
> > Take a look on this code:
> >
> > http://art-blog.no-ip.info/files/nowide.zip
> building test failed on my linux box with gcc 4.6. Is it supposed to
> work here?
>

Actually, if it fails then there is either a bug in g++ 4.6
or, even more likely, a bug in my test :-)

Because under Linux it is just something like

namespace nowide {
    using std::fstream;
}
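
On Windows, by contrast, the wrapper has real work to do. A rough
sketch of the idea only, not the library's actual implementation (note
that the fstream constructor taking wchar_t* is an MSVC extension):

#include <windows.h>
#include <fstream>
#include <string>

namespace nowide {

// UTF-8 -> UTF-16 via the Win32 API.
inline std::wstring widen(const char* utf8)
{
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, 0, 0);
    if (n <= 0) return std::wstring();
    std::wstring w(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, &w[0], n);
    w.resize(n - 1);   // drop the terminating null the API copied
    return w;
}

class fstream : public std::fstream {
public:
    explicit fstream(const char* utf8_name,
                     std::ios_base::openmode mode =
                         std::ios_base::in | std::ios_base::out)
        : std::fstream(widen(utf8_name).c_str(), mode) {}
};

} // namespace nowide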

> [snip]


>
> ----------------------------------------------------------
> Test Failed.
> "test_streambuf" end time: Jan 19 20:58 PST
> "test_streambuf" time elapsed: 00:00:00
> ----------------------------------------------------------
>
> End testing: Jan 19 20:58 PST

This is quite old code; the new one can be found
as part of Booster, the Boost-like part of CppCMS.

http://cppcms.svn.sourceforge.net/viewvc/cppcms/framework/trunk/booster/booster/nowide/

http://cppcms.svn.sourceforge.net/viewvc/cppcms/framework/trunk/booster/lib/nowide/


I've fixed a few bugs since then. If there is interest I can
extract the code once again; in any case, the new code is tested
on many platforms and compilers:

http://art-blog.no-ip.info/files/nightly-build-report.html

Artyom

Artyom

unread,
Jan 20, 2011, 5:21:15 AM1/20/11
to bo...@lists.boost.org
> From: Patrick Horgan <phor...@gmail.com>

> On 01/19/2011 09:39 PM, Artyom wrote:
> >> ... elision by patrick ...
> > Short reminder:
> >
> >
>http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful
> Dueling reminder:
> http://www.joelonsoftware.com/articles/Unicode.html
>
> Patrick
>

I know this article well.

The problem is that it is outdated, has many errors, and is
written from Microsoft's view of Unicode,
which apparently led to all the TCHAR crap.

In any case I wouldn't give it to anybody
as a good reference for Unicode.

Artyom

Mathias Gaunard

unread,
Jan 20, 2011, 7:49:08 AM1/20/11
to bo...@lists.boost.org
On 20/01/2011 05:38, Patrick Horgan wrote:
> On 01/19/2011 08:51 AM, Mathias Gaunard wrote:
>> My Unicode library works with arbitrary ranges, and you can adapt a
>> range in an encoding into a range in another encoding.
>> This can be used to lazily perform encoding conversion as the range is
>> iterated; such conversions may even be pipelined.
> Sounds interesting. Of course ranges could be used with strings of
> whatever sort. Is the intelligence about the encoding in the ranges?

I've chosen not to attach encoding information to ranges, as this could
make my Unicode library quite intrusive.

It's a design by contract; your input ranges must satisfy certain
criteria, such as encoding, depending on the function you choose to
call. If the criteria are not satisfied, you either get undefined
behaviour or an exception, depending on the version of the function you
choose to call.

> As you iterate a range, does it move byte by byte or character by
> character?

You can adapt a range of code units into a range of code points or into
a range of ranges of code points (combining character sequences,
graphemes, words, sentences, etc.)

> does it deal with compositions?

It can.

My library doesn't really have string algorithms; it's up to you to make
sure you call those algorithms using the correct adapters.

For example, to search for a substring in a string, both of which being
in UTF-8, and taking into account combining characters, there are
different strategies:
- Decode both to UTF-32, normalize them, segment them as combining
character sequences, and perform a substring search on that.
- Decode both to UTF-32, normalize them, re-encode them both in UTF-8,
perform a substring search at the byte level, and ignore matches that do
not lie on the utf8_combining_boundary (checks whether we're at a UTF-8
code point boundary, decodes to UTF-32, checks whether we're at a
combining character boundary).

You might want to avoid the normalization step because you know your
data is already normalized.
The second one is likely to be quite a bit faster than the former,
because you spend most of the time working on chars in actual memory,
which can be optimized quite aggressively.

Both approaches are doable directly in a couple of lines by combining
Boost.StringAlgo and my Unicode library in various ways; and all
conversions can happen lazily or not, as one wishes.
Boost.StringAlgo isn't however that good (it only provides naive O(n*m)
algorithms, doesn't support right-to-left search well, and certainly is
unable to vectorize the cases where the range is made of built-in types
contiguous in memory), so eventually it might have to be replaced.
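
As a rough skeleton of the first strategy, minus normalization (which
is omitted here) and with a toy decoder that assumes well-formed UTF-8
and does no validation:

#include <algorithm>
#include <iostream>
#include <string>

// Minimal UTF-8 -> UTF-32 decoder: no validation, no normalization.
// A real library would reject bad continuation bytes, overlong forms, etc.
std::u32string decode_utf8(const std::string& s)
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (std::size_t j = 1; j < len; ++j)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + j]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

int main()
{
    // "\xC3\xA9" is U+00E9 ("é") in UTF-8.
    std::u32string haystack = decode_utf8("a caf\xC3\xA9 au lait");
    std::u32string needle   = decode_utf8("caf\xC3\xA9");

    bool found = std::search(haystack.begin(), haystack.end(),
                             needle.begin(), needle.end()) != haystack.end();
    std::cout << (found ? "found" : "not found") << '\n';
}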


> Is it available to read?

Somewhat dated docs are at <http://mathias.gaunard.com/unicode/doc/html/>
A presentation is planned for Boostcon 2011, and a submission for review
before that.

Mathias Gaunard

unread,
Jan 20, 2011, 7:53:02 AM1/20/11
to bo...@lists.boost.org
On 20/01/2011 02:23, Patrick Horgan wrote:
> The entire
> unicode code space (all 2^31 codes)

You mean all (2^20 + 2^16) codes.
Unicode doesn't reserve more codes than that.

Mathias Gaunard

unread,
Jan 20, 2011, 8:18:51 AM1/20/11
to bo...@lists.boost.org
On 20/01/2011 09:41, bernardH wrote:
> Dave Abrahams<dave<at> boostpro.com> writes:
>
>>
>> At Wed, 19 Jan 2011 23:25:34 +0000,
>> Brent Spillner wrote:
>>>
>>> On 1/19/2011 11:33 AM, Peter Dimov wrote:
>>>> This was the prevailing thinking once. First this number of bits was 16
>>>> (an incorrect assumption that claimed Microsoft and Java as victims), then
>>>> it became 21 (or 22?). Eventually, people realized that this will never
>>>> happen even if we allocate 32 bits per character, so here we are.
>>>
>>> The OED lists ~600,000 words, so 32 bits is enough space to provide a
>>> fully pictographic alphabet for over 7,000 languages as rich as English,
>>> with room for a few line-drawing characters left over. Surely that's enough?
>>
>> Even if it's theoretically possible, the best standards organization
>> the world has come up with for addressing these issues was unable to
>> produce a standard that did it.
>
> I must confess a lack of knowledge with respect to encodings, but my
> understanding is that strings are sequences of some raw data (without
> semantics), code points and glyphs.

The difference between graphemes and glyphs is the main reason for the
complications of dealing with text on computers.

A grapheme is the unit of natural text, while glyphs are the units used
for its graphical representation.

Different glyphs can represent the same grapheme (this is usually
considered a typeface difference, albeit some typefaces support multiple
glyphs for the same grapheme).

A grapheme can be represented by several glyphs (mostly diacritics).

A single glyph can represent several graphemes, as with ligatures,
albeit some consider a ligature a typeface quirk and not really a
single glyph, since a glyph should be at most one grapheme.

Unicode mostly tries to encode graphemes (it doesn't encode all
variations of 'a' for example, nor all graphic variations of CJK
characters), but due to historical reasons, the whole thing is quite a mess.

A code point is therefore an element in the Unicode mapping, whose
semantics depend on what that element actually is. It can be a ligature,
a diacritic, a code that is semantically equivalent to another but not
necessarily functionally equivalent, etc.

The UTF-X encodings then describe how code points are
represented as sequences of X-bit code units.
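
To make the code unit / code point distinction concrete, here is a
small C++0x-era illustration (u"" and U"" are the new char16_t and
char32_t string literal forms):

#include <iostream>
#include <string>

int main()
{
    // One code point, U+00E9 ("é"), in the three UTF encodings.
    std::string    u8  = "\xC3\xA9";   // UTF-8: two 8-bit code units
    std::u16string u16 = u"\u00E9";    // UTF-16: one 16-bit code unit
    std::u32string u32 = U"\u00E9";    // UTF-32: one 32-bit code unit

    // size() counts code units, not code points or characters.
    std::cout << u8.size() << ' ' << u16.size() << ' '
              << u32.size() << '\n';   // prints: 2 1 1
}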

Chad Nelson

unread,
Jan 20, 2011, 9:21:15 AM1/20/11
to bo...@lists.boost.org
On Thu, 20 Jan 2011 00:05:47 -0800
Patrick Horgan <phor...@gmail.com> wrote:

>>> Inevitably a Unicode standard will be adopted where every character
>>> of every language will be represented by a single fixed length
>>> number of bits. [...]
>>
>> I'm no Unicode expert, but the reason this hasn't happened might be
>> combinatorial explosion. In which case it might never happen. But I
>> could well be wrong. And I hope I am, the design you outline is
>> something I'd love to see.
>
> It's already here and has been for a long time. That's just UCS

> encoded as UTF-32. [...]

The problem, in my uninformed view of it, is the idea of combining
characters. Any time you can have a single character that requires more
than one code-point, you can't assume that a fixed number of bits will
be able to represent every character.

I may be wrong, and I hope I am. If a character is guaranteed never to
consist of more than X code-points, it would be simple to offer a
fixed-width character type, even if the width is huge by comparison to
the eight-bit char type. But from what I've seen, I don't think that's
the case.
--
Chad Nelson
Oak Circle Software, Inc.


Artyom

unread,
Jan 20, 2011, 9:30:48 AM1/20/11
to bo...@lists.boost.org
> I may be wrong, and I hope I am. If a character is guaranteed never to
> consist of more than X code-points,
> it would be simple to offer a
> fixed-width character type, even if the width is huge by comparison to
> the eight-bit char type. But from what I've seen, I don't think that's
> the case.

I assume there is some limit, but who knows what it is?

Even in Hebrew (the language I speak) you can easily create
a letter with 4 code points:

- the base shin, the shin/sin mark, a vowel, and a dagesh
- and I can also add some biblical marks (I think there may be two or
three of them)

And Hebrew is a relatively simple one.

Now I have no idea what happens in other languages, or what
happens with the code points that are going to be added in future
Unicode releases.

So I would suggest not assuming that there is a certain limit.

Artyom

Chad Nelson

unread,
Jan 20, 2011, 9:33:02 AM1/20/11
to bo...@lists.boost.org
On Thu, 20 Jan 2011 09:59:51 +0100
Matus Chochlik <choc...@gmail.com> wrote:

>> Do you see another way to provide those conversions, and automatic
>> verification of proper UTF coding? (Automatic verification is a very
>> good thing, without it someone won't use it or will forget to, and
>> open up their programs to exploitation.)
>
> Yes, implementing it into std::string in some future standard.

'Fraid that's a little beyond my current level of programming skill. ;-)

>>> Besides the ugly name and that is a new class ? No :)
>>
>> If you can think of a more-acceptable-but-still-descriptive name for
>> it, I'm all ears. :-)
>
> I have an idea: what about boost::string, which could possibly become
> the next std::string in the future.

And string16 and string32? We'll have to support UTF-32, as the
single-codepoint-per-element type, and UTF-16 (distasteful though it
may be) is needed for Windows.

Or are you suggesting the utf* types in addition to the boost::string
type? If so, I believe the idea has merit.

>>> And the solution is long overdue. And creating utf8_t is just
>>> putting the problem away, not solving it really.
>>
>> I see it as merely easing the transition.
>
> OK, if the long term plan is:
>
> 1) design and implement boost::string using UTF-8 doing all the things
> like code-point iteration, character iteration, convenience stuff like
> starts-with, ends-with, replace, trim, etc., etc. with as much
> backward compatibility with std::string as possible without hindering
> progress
>
> 2) try really hard to push it to the standard
>
> then I'm on board with that.

Some of those could be problematic (I've run across references implying
that 0x20 isn't the universal word-separation character, so trim would
at least need some extra parameters), but for the most part, I'd agree
with it.


Artyom

unread,
Jan 20, 2011, 9:46:55 AM1/20/11
to bo...@lists.boost.org
> >
> > OK, if the long term plan is:
> >
> > 1) design and implement boost::string using UTF-8 doing all the things
> > like code-point iteration, character iteration, convenience stuff like
> > starts-with, ends-with, replace, trim, etc., etc. with as much
> > backward compatibility with std::string as possible without hindering
> > progress
> >
> > 2) try really hard to push it to the standard
> >
> > then I'm on board with that.
>
> Some of those could be problematic (I've run across references implying
> that 0x20 isn't the universal word-separation character, so trim would
> at least need some extra parameters), but for the most part, I'd agree
> with it.

And it is also locale dependent.

Unicode defines three kinds of text segment: grapheme clusters, words,
and sentences.

http://unicode.org/reports/tr29

There are also line break boundaries defined:

http://www.unicode.org/reports/tr14/

Most of these are locale dependent, as they require the use of
dictionaries.

So unless you want to carry locale information in the string,
I don't think it is good to put these into the string itself.
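
For a sense of what locale-aware segmentation looks like in practice,
ICU already exposes it outside the string type. A minimal sketch
printing word-boundary positions (ICU's C++ API, not Boost):

#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <iostream>

int main()
{
    UErrorCode status = U_ZERO_ERROR;
    icu::UnicodeString text =
        icu::UnicodeString::fromUTF8("Words are split per locale rules.");

    // Word-boundary iterator for a specific locale; some locales
    // (e.g. Thai) need dictionary-based segmentation internally.
    icu::BreakIterator* bi =
        icu::BreakIterator::createWordInstance(icu::Locale::getUS(), status);
    if (U_FAILURE(status)) return 1;

    bi->setText(text);
    for (int32_t pos = bi->first(); pos != icu::BreakIterator::DONE;
         pos = bi->next())
        std::cout << pos << ' ';   // boundary offsets in UTF-16 code units
    std::cout << '\n';

    delete bi;
}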

Artyom

Chad Nelson

unread,
Jan 20, 2011, 9:47:56 AM1/20/11
to bo...@lists.boost.org
On Thu, 20 Jan 2011 06:30:48 -0800 (PST)
Artyom <arty...@yahoo.com> wrote:

>> I may be wrong, and I hope I am. If a character is guaranteed never
>> to consist of more than X code-points, it would be simple to offer a
>> fixed-width character type, even if the width is huge by comparison
>> to the eight-bit char type. But from what I've seen, I don't think
>> that's the case.
>
> I assume there is some limit, but who knows what it is?
>
> Even in Hebrew (the language I speak) you can easily create

> a letter with 4 code points: [...] And Hebrew is a relatively simple
> one. [...] So I would suggest not assuming that there is a certain
> limit.

<sigh> Yes, that's what I pretty much expected.

signature.asc

Alexander Lamaison

unread,
Jan 20, 2011, 10:43:52 AM1/20/11
to bo...@lists.boost.org
On Thu, 20 Jan 2011 09:33:02 -0500, Chad Nelson wrote:

> On Thu, 20 Jan 2011 09:59:51 +0100
> Matus Chochlik <choc...@gmail.com> wrote:
>
>>> Do you see another way to provide those conversions, and automatic
>>> verification of proper UTF coding? (Automatic verification is a very
>>> good thing, without it someone won't use it or will forget to, and
>>> open up their programs to exploitation.)
>>
>> Yes, implementing it into std::string in some future standard.
>
> 'Fraid that's a little beyond my current level of programming skill. ;-)
>
>>>> Besides the ugly name and that is a new class ? No :)
>>>
>>> If you can think of a more-acceptable-but-still-descriptive name for
>>> it, I'm all ears. :-)
>>
>> I have an idea: what about boost::string, which could possibly become
>> the next std::string in the future.
>
> And string16 and string32? We'll have to support UTF-32, as the
> single-codepoint-per-element type, and UTF-16 (distasteful though it
> may be) is needed for Windows.

I imagine you wouldn't have UTF-16 and UTF-32 strings being passed about
as a matter of course. For instance, a UTF-16 string should only be
created just before making a Windows API call.

If this is the case, it makes sense to give the common case (the UTF-8
string) a nice name like boost::string, while the others, used only in
special situations, get something less snappy like boost::u16string and
boost::u32string.

Alex


--
Easy SFTP for Windows Explorer (http://www.swish-sftp.org)

Ivan Le Lann

unread,
Jan 20, 2011, 2:32:35 AM1/20/11
to bo...@lists.boost.org
Artyom wrote:

> If you take a look at all C++ frameworks, they
> all have a way to convert their string to std::string
> and back.
>
> C++ hasn't picked yet, but C++ has a string,
> and a very good one.

Please allow me to question this last statement.
I'm struggling to follow this thread, but there is one thing that
emerges from this effort: encoded strings are dangerous beasts you don't
want to touch; you should pass them as-is or use an expert library to
analyze or modify them.

Do you think that std::string and its fairly open interface reflect
this good practice?
As a user, I would like to see encoded strings as something a little bit
more opaque.

Ivan.

Matus Chochlik

unread,
Jan 20, 2011, 11:52:58 AM1/20/11
to bo...@lists.boost.org
On Thu, Jan 20, 2011 at 3:33 PM, Chad Nelson
<chad.thec...@gmail.com> wrote:
>
>>>> Besides the ugly name and that is a new class ? No :)
>>>
>>> If you can think of a more-acceptable-but-still-descriptive name for
>>> it, I'm all ears. :-)
>>
>> I have an idea: what about boost::string, which could possibly become
>> the next std::string in the future.
>
> And string16 and string32? We'll have to support UTF-32, as the
> single-codepoint-per-element type, and UTF-16 (distasteful though it
> may be) is needed for Windows.
>
> Or are you suggesting the utf* types in addition to the boost::string
> type? If so, I believe the idea has merit.

If boost::string uses utf-8 by default, and at some point in the distant
future I will be able to do sed 's/boost::string/std::string/g' on all
my sources without (completely) breaking them, then we can have
string16, string32, string_ucs2, string_ucs4, etc. for all I care :-).

I am not against alternative string representations and encodings,
but I would like to finally see a string class whose contents I can,
for example, write to a file on a Windows machine with cp1250 and open
on Linux with utf-8 without doing explicit transcoding; which allows
true code-point and character iteration; which supports the essential
algorithms (it is open for debate which ones); and which I can use as
the type of my function parameters and member variables, etc.

>>
>> OK, if the long term plan is:
>>
>> 1) design and implement boost::string using UTF-8 doing all the things
>> like code-point iteration, character iteration, convenience stuff like
>> starts-with, ends-with, replace, trim, etc., etc. with as much
>> backward compatibility with std::string as possible without hindering
>> progress
>>
>> 2) try really hard to push it to the standard
>>
>> then I'm on board with that.
>
> Some of those could be problematic (I've run across references implying
> that 0x20 isn't the universal word-separation character, so trim would
> at least need some extra parameters), but for the most part, I'd agree
> with it.

This is *exactly* why I would like to see them in a standard string
(or string manipulation library), designed and implemented by true
experts and not reinvented by an "expert" like me :)

Jens Finkhäuser

unread,
Jan 20, 2011, 12:33:21 PM1/20/11
to bo...@lists.boost.org
On Wed, Jan 19, 2011 at 07:42:44PM +0000, Alexander Lamaison wrote:
> I was under the impression that Linux changed from interpreting char* as
> being in a multitude of different encodings to being in UTF-8 by default.
What you might be thinking of is that most modern Linux distributions set
the default locale to include UTF-8 encoding (usually en_US.UTF-8).

--
1.21 Jiggabytes of memory should be enough for anybody.

Bo Persson

unread,
Jan 20, 2011, 1:07:23 PM1/20/11
to bo...@lists.boost.org
Matus Chochlik wrote:
> On Wed, Jan 19, 2011 at 8:28 PM, Dave Abrahams <da...@boostpro.com>
> wrote:
>>>
>>> OK, I see. But, is there any chance that the standard itself would
>>> be updated so that it first would recommend to use UTF-8 with C++
>>> strings.
>>
>> Well, never say "never," but... never. Such recommendations are not
>> part of the standard's mission. It doesn't do things like that.
>
> My view of what the standardization committee is willing to do may be
> naive, but generally I don't see why this could not be done. Other
> major languages (Java, C#, etc.) picked a single "standard" encoding
> and in those languages you treat text with other encodings as a
> special case.

These are not good examples, as they are single-company,
single-platform languages. C++ is supposed to be much more than that.

>
> If C++ recommended the use of UTF-8, this would probably kickstart
> the OS and compiler vendors to follow, or at least to fix their
> implementations of the standard library and the OS APIs to accept
> UTF-8 by default (if we agree that this is a good idea).
>

To name one set of obstacles: IBM, z/OS, EBCDIC.

That will work when all the legacy Cobol code has been converted to
C++, but not before then. :-)

Latest estimate was 200+ billion lines left to process.


Bo Persson

Sergey Cheban

unread,
Jan 20, 2011, 1:22:15 PM1/20/11
to bo...@lists.boost.org
19.01.2011 18:34, Alexander Lamaison wrote:

> Even if I bought the UTF-8ed-Boost idea, what would we do about the STL
> implementation on Windows which expects local-codepage narrow strings?
> Are we hoping MS etc. change these to match? Because otherwise we'll be
> converting between narrow encodings for the rest of eternity.
The problems with MSVC and multilingual filenames are not Boost-related.
Even the following code doesn't work correctly:

#include <stdio.h>

int main(int argc, char *argv[])
{
    printf("%s", argv[1]);
    return 0;
}

>1.exe asdfфыва
asdfЇ√тр

As you can see, the Cyrillic characters are broken (this is an ANSI vs.
OEM issue and is not related to Unicode at all).

Please note that the Cygwin compiler/libc has no such problems, because
it uses UTF-8 (by default, at least). Its fopen() uses UTF-8 for
filenames, too.

So, we may choose one of the following:

1. Wait until MS fixes the problem on their side. For now, Windows
users may use the short filenames (i.e. GetShortPathName()) for
multilingual filenames.

2. Provide a char* interface that will allow Windows developers to
work with multilingual filenames.

3. Provide a WCHAR* interface specially for Windows developers and
allow them to write non-portable code. Leave the char* interface
unusable for Windows/MSVC and wait until MS fixes it on their side.

4. Create an almost-portable wchar_t* interface.

5. Create our own type (boost::native_t or boost::utf8_t) and conversion
routines for it. Please note that independent libraries will NEVER use
foreign non-standard types.

I think only the 2nd and 3rd options are realistic.

--
Best regards,
Sergey Cheban

Sergey Cheban

unread,
Jan 20, 2011, 1:59:49 PM1/20/11
to bo...@lists.boost.org
20.01.2011 17:30, Artyom wrote:

> Even in Hebrew (the language I speak) you can easily create
> a letter with 4 code points:
>
> - shin-basic, shin/sin mark, vovel, dagesh
> - Now I can also add some biblical marks (I think there may be two or
> three of them)
>
> And Hebrew is relatively simple one.

Even in English, you combine letters to write text, and there are
some kerning-related issues, too. But I see no problem here.

Best regards,
Sergey Cheban.

Sergey Cheban

unread,
Jan 20, 2011, 2:54:12 PM1/20/11
to bo...@lists.boost.org
20.01.2011 2:50, Peter Dimov wrote:

> It is possible to create a file whose name is not a valid UTF-16
> sequence on Windows, so the library ought to be able to work with it.
> You could go either way in this specific case though, since such names
> are extremely rare in practice.
On Windows, it is possible to create filenames with \0 in the
middle, but I don't think that Boost should support them.

Beman Dawes

unread,
Jan 20, 2011, 5:15:17 PM1/20/11
to bo...@lists.boost.org
On Thu, Jan 20, 2011 at 1:22 PM, Sergey Cheban <s.ch...@drweb.com> wrote:


> The problems with MSVC and multilingual filenames are not Boost-related.
> Even the following code doesn't work correctly:
>
> #include <stdio.h>
>
> int main(int argc, char *argv[])
> {
>     printf("%s", argv[1]);
>     return 0;
> }
>
> >1.exe asdfфыва
> asdfЇ√тр
>

You lost me. That example has nothing to do with filenames.


>
> As you can see, the cyrillic characters are broken (this is an ANSI vs OEM
> issue and is not related to the unicode at all).
>
> Please note that the cygwin compiler/libc has no such problems because it
> uses utf-8 (by default, at least). The fopen() uses the utf-8 for filenames,
> too.
>
> So, we may choose one of the following:
>
> 1. Wait until MS fixes the problem on their side. For now, the windows
> users may use the short filenames (i.e. GetShortPathName() ) for the
> multilingual filenames.
>
> 2. Provide a char * interface that will allow the windows developers to
> work with multilingual filenames.
>
> 3. Provide WCHAR * interface specially for the windows developers and allow
> them to write the non-portable code. Leave the char * interface unusable for
> windows/msvc and wait until MS fixes it on their side.
>
> 4. Create the almost-portable wchar_t * interface.
>
> 5. Create our own type (boost::native_t or boost::utf8_t) and conversion
> routines for it. Please note that independent libraries will NEVER use
> foreign non-standard types.
>
> I think only 2nd and 3rd options are realistic.
>

Why not just use Boost.Filesystem V3 for dealing with files and filenames?

You can work with char strings in the encoding of your choice, including
utf-8 encoding. You can use wchar_t strings in utf-16 encoding. If your
compiler supports C++0x char16_t and char32_t, you will also be able to
use strings based on those as C++0x support matures. Class
boost::filesystem::path provides a single non-template class that works fine
with all of those types and encodings. Your code can be written to be
reasonably portable too, particularly if all you are concerned with is
either Windows systems or POSIX-like systems that use utf-8 for filenames.
If you want wider portability, you would have to avoid narrow strings so
that on POSIX-like systems the wide strings could be converted to whatever
narrow encoding the system uses.
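
A small sketch of that flexibility (the file name is made up, and the
wide output may need an appropriately configured locale to display):

#include <boost/filesystem.hpp>
#include <iostream>

namespace fs = boost::filesystem;

int main()
{
    // Narrow string: interpreted in the system's narrow encoding
    // (UTF-8 on typical POSIX systems). "\xC3\xA9" is "é" in UTF-8.
    fs::path p1("data/caf\xC3\xA9.txt");

    // Wide string: UTF-16 on Windows, UTF-32 on most POSIX systems.
    fs::path p2(L"data/caf\u00E9.txt");

    // Either representation can be extracted regardless of how the
    // path was constructed; conversion happens as needed.
    std::cout << p1.string() << '\n';
    std::wcout << p2.wstring() << L'\n';
}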

--Beman


Patrick Horgan

unread,
Jan 20, 2011, 5:53:39 PM1/20/11
to bo...@lists.boost.org
On 01/20/2011 06:33 AM, Chad Nelson wrote:
> ... elision by patrick ...
> And string16 and string32? We'll have to support UTF-32, as the
> single-codepoint-per-element type, and UTF-16 (distasteful though it
> may be) is needed for Windows.
std::basic_string is already templated on the character type, character
traits, and allocator, no? So if you used basic_string with char16_t or
char32_t (real types in the current C++ drafts) you would get those
strings already. You just wouldn't know anything about the encoding.
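
For instance (the C++0x typedefs mentioned in the comment are part of
the draft standard; the literals say nothing about normalization or
character boundaries):

#include <string>

int main()
{
    // C++0x provides std::u16string / std::u32string as typedefs
    // for exactly these; either way you get storage for code units,
    // but no knowledge of the encoding's rules.
    std::basic_string<char16_t> s16 = u"16-bit code units";
    std::basic_string<char32_t> s32 = U"32-bit code units";
    return s16.empty() || s32.empty() ? 1 : 0;
}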

> Or are you suggesting the utf* types in addition to the boost::string
> type? If so, I believe the idea has merit.
Yay! +1.72

Patrick

Ian Emmons

unread,
Jan 20, 2011, 5:58:35 PM1/20/11
to bo...@lists.boost.org

On Jan 19, 2011, at 11:30 AM, Peter Dimov wrote:
> Alexander Lamaison wrote:
>> > There is a straightforward way for Microsoft to migrate Windows to this
>> > future: If they add UTF-8 support to their narrow character interface
>> > (I am avoiding calling it ANSI due to the negative connotations that
>> > has) and add narrow character APIs for all wide character APIs that lack
>> > a narrow counterpart, then I believe we could treat POSIX and Windows
>> > identically from an encoding point of view.
>>
>> It would break any programs using the narrow API currently that use any
>> 'exotic' codepage (i.e. pretty much anything except 7-bit ascii).
>
> It will only break programs that depend on a specific code page. Programs that use the narrow API but do not require a specific code page (or a single byte code page - the exact opposite of exotic) will work fine - they'll simply see an ANSI code page of 65001. It will still cause a fair amount of breakage, of course, but in principle, the transition path is obvious and straightforward.

What I intended here (but forgot to say explicitly -- sorry) was that Microsoft could allow a process (or thread) to set its local character set to UTF-8. Then all existing code that pays attention to the narrow representation would find that it is UTF-8 and deal correctly with it.

Naturally, this migration would take time -- but Microsoft has done that before. They successfully transitioned a large developer base off 16-bit Windows and onto 32-bit Windows (and, incidentally, introduced the wide character API at the same time).
