
RFC: String iterator invalidation rules to be considered a defect?


Kimon Hoffmann

Nov 24, 2008, 5:02:29 PM
Hi all,

while developing a few generic algorithms that operate on arbitrary
"Forward Containers" I recently had a few issues with the iterator
invalidation rules defined for the basic_string family of types.
Consider the following simple example:

-------------------- Snip --------------------
#include <string>
#include <iostream>

template<typename ForwardContainer>
inline typename ForwardContainer::const_iterator
findSomething(ForwardContainer const& container) {
    // For the sake of simplicity ...
    return container.end();
}

int main(int, char**) {
    std::string someString = "abc";
    std::string aCopy = someString;
    std::string::const_iterator it = findSomething(aCopy);
    if (it != aCopy.end()) {
        std::cerr << "Unexpected." << std::endl;
    }
    return 0;
}

-------------------- Snip --------------------

While for every normal model of ForwardContainer the if condition would
evaluate to false, for strings it, more or less unexpectedly, may
evaluate to true because of the following passage of the standard (C++
98 final draft):

21.3 - Template class basic_string [lib.basic.string]
-5- References, pointers, and iterators referring to the elements of a
basic_string sequence may be invalidated by the following uses of that
basic_string object:
* [...]
* Subsequent to any of the above uses except the forms of insert() and
erase() which return iterators, the first call to non-const member
functions operator[](), at(), begin(), rbegin(), end(), or rend().

An example implementation exhibiting this behavior is libstdc++, which
uses this rule to provide lazy copying of a string's contents.
While I totally understand the desire to grant library implementations
as much freedom as possible when it comes to providing optimized
implementations, I think this particular behavior may lead to subtle,
hard to find errors in the context of generic programming and should
thus be reconsidered. But since I'm no expert on this subject I'd really
like to hear a few opinions. Have you also encountered this problem? Is
there some clever way to safely work around it?

For the moment I've helped myself with a helper function like the following:

-------------------- Snip --------------------

template<typename ForwardContainer>
inline ForwardContainer& force_copy(ForwardContainer& container) {
    // Can easily be optimized away for implementations that have eager
    // copies and simple iterators.
    typename ForwardContainer::iterator dummy = container.begin();
    return container;
}

-------------------- Snip --------------------

This way the function call in the above example becomes:

std::string::const_iterator it = findSomething(force_copy(aCopy));

This has worked fine for me, but I still fear that in dark places of my
extensively generic library I might still run into this sort of problem
without having realized it just yet.

Best regards
Kimon

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Daniel Krügler

Nov 25, 2008, 11:16:20 AM
On 24 Nov., 23:02, Kimon Hoffmann <kimon.hoffm...@usenet.cnntp.org>
wrote:
> [..] I recently had a few issues with the iterator

Right, your observed behavior is consistent with the currently
valid standard ISO/IEC 14882:2003(E), which supports implementations of
std::string using reference-counting.

> An example implementation exhibiting this behavior is libstdc++, which
> uses this rule to provide lazy copying of a string's contents.
> While I totally understand the desire to grant library implementations
> as much freedom as possible when it comes to providing optimized
> implementations, I think this particular behavior may lead to subtle,
> hard to find errors in the context of generic programming and should
> thus be reconsidered. But since I'm no expert on this subject I'd really
> like to hear a few opinions. Have you also encountered this problem? Is
> there some clever way to safely work around it?
>
> For the moment I've helped myself with a helper function like the
> following:
>
> -------------------- Snip --------------------
>
> template<typename ForwardContainer>
> inline ForwardContainer& force_copy(ForwardContainer& container) {
> // Can easily be optimized away for implementations that have eager
> // copies and simple iterators
> typename ForwardContainer::iterator dummy = container.begin();
> return container;
>
> }
>
> -------------------- Snip --------------------
>
> This way the function call in the above example becomes:
>
> std::string::const_iterator it = findSomething(force_copy(aCopy));
>
> This has worked fine for me, but I still fear that in dark places of my
> extensively generic library I might still run into this sort of problem
> without having realized it just yet.

Precisely because of this unexpected behavior, the most recent draft
has effectively banned reference-counted std::string implementations.
The new wording in N2798, [string.require]/4 is:

"References, pointers, and iterators referring to the elements of a
basic_string sequence may be invalidated by the following uses of
that basic_string object:

— as an argument to any standard library function taking a reference
to non-const basic_string as an argument. [footnote]
— Calling non-const member functions, except operator[], at, front,
back, begin, rbegin, end, and rend."

Greetings from Bremen,

Daniel Krügler

Lance Diduck

Nov 25, 2008, 11:57:32 AM
On Nov 24, 5:02 pm, Kimon Hoffmann <kimon.hoffm...@usenet.cnntp.org>
wrote:

> Have you also encountered this problem? Is
> there some clever way to safely work around it?
The problem exists because the standard explicitly allows for Copy-
on-Write (COW) implementations of std::basic_string. So any access to
a modifying operation (i.e. a non-const operation) forces a copy of
the string internals, thereby invalidating any previous iterators into
the string.
However, this behaviour does not mean that the container is not a
forward container. The "forwardness" is a property of the container's
iterators, and not of the container itself.
The "forward" property does not consider the lifetime of the iterator.

To appreciate the difference, consider the history of the STL. The STL
was first an actual language itself, an APL derivative called
"Tecton." Then it was ported to Scheme as a library. Scheme has GC.
Then it was ported to Ada (which nominally has GC) and then to C++. (The
most singular fact -- STL predated C++ itself!!!)
Also, STL --until it was adopted for inclusion in C++-- did not have
any containers!! ALL containers (except for a few Ada implementations,
which had a rudimentary vector) were user-defined. Furthermore, STL was
a late addition to C++98, and containers were added (the point of
adopting the STL was to have a foundation for specifying containers.
Several containers were specified, but that was not the goal of STL).
As time passed, things peculiar to C++ were specified for containers,
like allocators, exception guarantees, lifetimes, and the like. But
the focus was not the containers, but rather the iterators they
provided and their interrelation with the algorithms. To get a good
sense of the C++ library *before* the STL, pick up a cheap copy of "The
Draft C++ Standard" by Plauger et al.
Then consider what "generic programming" is. "Generic" was a term used
to differentiate it from the methods behind the buzzword of the time, "OO".
What generic programming does is select the most performant algorithms at
compile time, rather than opt for ease of deployability, which is the
cornerstone of OO. How this came to be in C++ is that OO generalized
types, whereas Generic generalized expressions. In C++, by generalizing
expressions, it was easier to build libraries that could choose the
most performant algorithms for the particular types used in an
expression. This is hard to do in OO, which resolves the actual type
at runtime.

Now we get into the history of basic_string. This existed long before
the STL was around, and STL concepts were just tacked onto it. But
basic_string is more a creature of interactions with IOStreams than of
the STL containers. There is virtually no hope of unravelling it, and
most commentators hope to just pin it down and move on to more useful
constructs rather than try to improve it. (For example, basic_string is
useless for Unicode.)

So what is the point of all this? You are asking a little too much
from your approach. Especially from basic_string, which simply has too
many things asked of it to be genericized.
Lance

Maxim Yegorushkin

Nov 25, 2008, 4:57:56 PM
On Nov 24, 10:02 pm, Kimon Hoffmann <kimon.hoffm...@usenet.cnntp.org>
wrote:

Besides this quote, there is no requirement, if I remember correctly,
that const_iterators be comparable to iterators, although it often
works.

So, my guess is that the root of your problem is that you call a non-
const version of std::string::end() to compare its result against a
const_iterator. In other words, your usage of begin() and end() is
mismatched.

Thus, your code can be fixed by making sure you call the matching
versions of begin/end():

std::string::const_iterator it = findSomething(aCopy);

if (it != const_cast<std::string const&>(aCopy).end()) {
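
For illustration only, the same idea without spelling the cast out each
time could be wrapped in a tiny helper (an untested sketch, not anything
from the standard library):

template<typename T>
inline T const& as_const_ref(T const& t) { return t; }

// usage:
//   std::string::const_iterator it = findSomething(aCopy);
//   if (it != as_const_ref(aCopy).end()) { ... }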

--
Max

Kimon Hoffmann

Nov 25, 2008, 4:56:43 PM
Kimon Hoffmann wrote:
>
> While for every normal model of ForwardContainer the if condition would
> evaluate to false, for strings it, more or less unexpectedly, may
> evaluate to true because of the following passage of the standard (C++
> 98 final draft):
>
> [...]
>

Pardon me, I accidentally switched "true" and "false" here. I guess no
posting of mine goes without at least one error ;).

Martin T.

Nov 26, 2008, 2:16:37 PM
Lance Diduck wrote:
> (...)

> constructs than try to improve it. (for example, basic_string is
> useless for Unicode).
>

While I generally try to avoid nit-picking on parenthesized comments, I
would really like to know what your definition of "useless for unicode"
is. :)
Because, if basic_string is "useless for unicode", then I think every
other language level string in every language I've ever used is too.

cheers,
Martin

Joshua...@gmail.com

Nov 27, 2008, 11:31:52 AM
On Nov 26, 11:16 am, "Martin T." <0xCDCDC...@gmx.at> wrote:
> Lance Diduck wrote:
> > (...)
> > constructs than try to improve it. (for example, basic_string is
> > useless for Unicode).
>
> While I generally try to avoid nit-picking on parenthesized comments, I
> would really like to know what your definition of "useless for unicode"
> is. :)

Well, it's not useless for unicode, but mostly useless. The three main
encodings of unicode are UTF8, UTF16, and UTF32, each of which can hold
any unicode string. Only UTF32 stores the characters with a constant
width encoding, where each character gets exactly the same allocation
size (where character here means the encoding of a unicode code point,
not a C++ char). UTF8 and UTF16 map characters to encodings of
different lengths. basic_string is premised upon constant
width encoding, and thus cannot hold UTF8 and UTF16 strings.
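
A minimal sketch of what I mean (the bytes of the non-ASCII character
are spelled out as hex escapes, so no particular source encoding is
assumed):

#include <iostream>
#include <string>

int main() {
    // Two characters, 'a' and 'ä', but the UTF8 encoding of 'ä' takes two
    // bytes, so size() reports three char units rather than two characters.
    std::string utf8 = "a\xC3\xA4";
    std::cout << utf8.size() << std::endl;  // prints 3
    return 0;
}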


> Because, if basic_string is "useless for unicode", then I think every
> other language level string in every language I've ever used is too.

For comparison, the Java String is not simply a wrapper over an array
of a character type like std::string. (I know std::string doesn't actually have to
be implemented this way, but they have to behave as if they were
implemented this way, mostly.) Java Strings actually store the string
in a slightly modified UTF8 encoding (meaning an offset access cannot
be implemented as a simple pointer addition, among other things).


Now, in C++, if we had a type guaranteed to be able to hold 32 bit
values (wchar_t is not that), then basic_string<int32_t> could hold
UTF32 strings. However, this is horribly space inefficient for nearly
all languages. Most people use UTF8 and UTF16 to save space (and
consequently generally save time as well due to locality, etc.)
(though sometimes UTF32 is used to allow transformations which assume
constant width encoding; this is not meant to be a total examination
of all issues), and that is why Lance wrote that basic_string is
useless for unicode.

Lance Diduck

Nov 28, 2008, 10:39:34 AM
To add to basic_string's misery:
1. It is possible for various Unicode byte sequences to actually
represent the same string -- this occurs often in languages with lots
of diacritical marks. In order to have a meaningful operator== you
can't just compare the underlying bytes. The sequence first has to be
"normalized." IIRC JavaScript strings use Normal Form 3.
2. OK, so we externally normalized our byte sequences. Now we are going
to put these into a set. Of course, we expect these to sort according
to the collation rules for that normalization. string::operator< can
only compare whether the byte values are less than each other. The
standard does provide some help, using locale::operator() as the
comparator, so it is possible to do this if you are an expert in
locales and Unicode and your admins enabled this locale on your
machine (see the short sketch after this list).
3. How long is my string? Oops -- no help there either. string can
only return the number of bytes used, and not the number of "Unicode
characters." In all encodings (except for UTF32) multiple bytes may be
needed to encode a single Unicode character. In the mid nineties, you
could get away (as Java did) with using 16-bit units for strings and
having a one-to-one mapping. But even they (as of Java 5, I believe)
have expanded support.
4. Iterate character by character? Only with UTF32.
5. Transcode from one encoding to the next? No help there either (as
anyone who has tried to use Xerces-C can attest, this is a nightmare).
6. Make a char_traits that does Unicode? Ha! The fundamental problem
is that C/C++ views strings as "dumb arrays of smart characters"
whereas Unicode (and virtually every other modern language) views
strings as "smart arrays of dumb characters." There is a world of
difference between these approaches. Character encodings always
consider their context (i.e. the string that they are contained in),
whereas the C/C++ approach has traditionally been "each character can
be classified on its own." This is true of ASCII encodings, but hardly
true anywhere else.
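
As a rough illustration of point 2 (the sketch mentioned above), here is
sorting through a locale's collate facet. The locale name is only an
assumption, and whether it is installed depends entirely on your machine:

#include <algorithm>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> words;
    words.push_back("zebra");
    words.push_back("Banane");
    words.push_back("apple");

    try {
        // std::locale has an operator() that compares two basic_strings
        // through its std::collate facet, so the locale object itself can
        // serve as the comparator.
        std::locale german("de_DE.UTF-8");  // assumed to be installed
        std::sort(words.begin(), words.end(), german);
    } catch (std::runtime_error&) {
        // Named locale not available: fall back to byte-wise operator<.
        std::sort(words.begin(), words.end());
    }

    for (std::size_t i = 0; i != words.size(); ++i)
        std::cout << words[i] << '\n';
    return 0;
}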


I think we have beat up on poor string enough by now.

Lance

Brendan

Nov 28, 2008, 3:50:25 PM

> Now, in C++, if we had a type guaranteed to be able to hold 32 bit
> values, (wchar is not that), then basic_string<int_32> could hold
> UTF32 strings. However, this is horribly space inefficient for nearly
> all languages. Most people use UTF8 and UTF16 to save space (and

I'm not sure that really accomplishes that goal. If you have a lot of
foreign text with high code points, UTF8 would actually take more
space.

I'm not an expert on this, but my understanding is that the reason people
have standardized on UTF8 is that it's possible for legacy ASCII-based
applications to process. Also, in general, there's no guarantee that an
integer of any given length can hold a code point. Win32 Windows apps
originally used UTF16 for unicode, which bit those guys in the ass
when unicode came up with more code points than can be represented in
16 bits. Thus, people are skeptical of fixed-width encodings now.

> consequently generally save time as well due to locality, etc.)
> (though sometimes UTF32 is used to allow transformations which assume
> constant width encoding. This is not meant to be a total examination
> of all issues), and that is why Lance wrote that basic_string is
> useless for unicode.

In practice I see people use strings as a dynamic array of bytes that
can hold text, binary data, whatever. The fact is that with ascii and
UTF8, which are the common formats on unix systems, 90% of operations
can be done with a std::string. You can't get easy access to the code
point values, or index the nth code point, but you can search an
arbitrary UTF8 std::string, since as a property of the UTF8 encoding one
encoded character is never a substring of another encoded character. In
practice a lot of algorithms work fine this way, and you can always
define a code point iterator over a standard string (a sketch of one
follows below).
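
For what it's worth, a rough sketch of such a code point iteration over
a plain std::string (it does no validation of malformed input, which a
real decoder would need):

#include <iostream>
#include <string>

// Decode the UTF8 sequence starting at index i of s, advance i past it, and
// return the code point. Malformed input is not checked.
unsigned long nextCodePoint(std::string const& s, std::string::size_type& i) {
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) return c;                             // single byte (ASCII)
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;  // number of trailing bytes
    unsigned long cp = c & (0x3F >> extra);             // payload bits of lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

int main() {
    std::string utf8 = "a\xC3\xA4\xE2\x82\xAC";  // 'a', 'a'-umlaut, euro sign
    for (std::string::size_type i = 0; i < utf8.size(); )
        std::cout << "U+" << std::hex << nextCodePoint(utf8, i) << '\n';
    return 0;
}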

Like I was saying, the only thing that doesn't work is code point
indexing, which works in linear time on std::string's UTF8
representation, whereas if you stored the actual code points in an
array (say in a 32 bit int) you could index them in constant time. The
problem here is that converting a UTF8 representation to code points
is a linear time operation, which is usually unnecessary. So if we had
unencoded "unicode" strings, by which I mean an array of code points,
instead of UTF8 strings, we'd be wasting some time encoding and
unencoding them which is rarely necessary.

This isn't that big of a deal, although if you think about it it's a
little weird that we are in the position where it's proper to use a
std::string for binary data (a UTF8 encoded string) and a
std::vector<uint32> for what might be considered actual "string
data" (code points). However, in practice people often don't think
about the difference between the encoding and the code point, because
people usually think in terms of ascii where there isn't that much
difference.

Anyway, the upshot is that UTF8 means that std::string is perfectly
useful for unicode, if maybe not optimal in all situations.

Brendan

Maxim Yegorushkin

Nov 28, 2008, 3:51:24 PM
On Nov 26, 7:16 pm, "Martin T." <0xCDCDC...@gmx.at> wrote:
> Lance Diduck wrote:
> > (...)
> > constructs than try to improve it. (for example, basic_string is
> > useless for Unicode).
>
> While I generally try to avoid nit-picking on parenthesized comments, I
> would really like to know what your definition of "useless for unicode"
> is. :)
> Because, if basic_string is "useless for unicode", then I think every
> other language level string in every language I've ever used is too.

Are you sure about every other language?

Every Perl string is a UTF-8 string. Python strings can be both ASCII
and Unicode.

--
Max

Martin T.

Nov 28, 2008, 3:52:36 PM
Joshua...@gmail.com wrote:
> On Nov 26, 11:16 am, "Martin T." <0xCDCDC...@gmx.at> wrote:
>> Lance Diduck wrote:
>>> (...)
>>> constructs than try to improve it. (for example, basic_string is
>>> useless for Unicode).
>> While I generally try to avoid nit-picking on parenthesized comments, I
>> would really like to know what your definition of "useless for unicode"
>> is. :)
>
> Well, it's not useless for unicode, but mostly useless. The three main
> encoding of unicode are UTF8, UTF16, and UTF32, each of which can hold
> any unicode string. Only UTF32 stores the characters with a constant
> width encoding, where each character gets exact the same allocation
> size (where character is an encoding for a unicode code point, not a C+
> + char). UTF8 and UTF16 use encodings which map characters to
> different length encodings. basic_string is premised upon constant
> width encoding, and thus cannot hold UTF8 and UTF16 strings.
>

Ah yes. I always forget that UTF16 is "multibyte" as well.
Which makes me wonder ... how many UNICODE programs on Windows would
still work correctly if they encountered characters that don't fit into
a single UTF-16 code unit :)

>
>> Because, if basic_string is "useless for unicode", then I think every
>> other language level string in every language I've ever used is too.
>
> For comparison, the Java String is not simply a wrapper over an array
> of type like std::string. (I know std::string doesn't actually have to
> be implemented this way, but they have to behave as if they were
> implemented this way, mostly.) Java Strings actually store the string
> in a slightly modified UTF8 encoding (meaning an offset access cannot
> be implemented as a simple pointer addition, among other things).
>

Which makes me really wonder why we don't have a std::ustring yet (like
the one from Glib, e.g.).
I mean, C++0x goes and adds "unicode support" but fails to add a class
that can handle it. (Or am I mistaken?)


cheers,
Martin

Pete Becker

Nov 29, 2008, 3:19:49 PM
On 2008-11-28 09:52:36 -0500, "Martin T." <0xCDC...@gmx.at> said:

>
> Which makes me really wonder why we don't have a std::ustring yet (like
> the one from Glib, e.g.).
> I mean, C++0x goes and adds "unicode support" but fails to add a class
> that can handle it. (Or am I mistaken?)
>

Read about u16string and u32string.

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com)
Author of "The Standard C++ Library Extensions: a Tutorial and Reference"
(www.petebecker.com/tr1book)

Daniel Krügler

Nov 29, 2008, 3:21:04 PM
> Which makes me really wonder why we don't have a std::ustring yet (like
> the one from Glib, e.g.).
> I mean, C++0x goes and adds "unicode support" but fails to add a class
> that can handle it. (Or am I mistaken?)

You are mistaken. The recent draft N2798 provides new character types
char16_t and char32_t, see [lex.ccon]/2:

"A character literal that begins with the letter u, such as u'y', is
a character literal of type char16_t. The value of a char16_t literal
containing a single c-char is equal to its ISO 10646 code point
value,
provided that the code point is representable with a single 16-bit
code unit.[..]
A character literal that begins with the letter U, such as U'z', is
a character literal of type char32_t. The value of a char32_t literal
containing a single c-char is equal to its ISO 10646 code point value.
[..]"

and there are corresponding std::basic_string typedefs:

namespace std {
    typedef basic_string<char16_t> u16string;
    typedef basic_string<char32_t> u32string;
}

Furthermore, the library adds some code conversion facets, such
as codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16, and string
conversion utilities, like wstring_convert and
wbuffer_convert.
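
Untested, but going by the draft wording, a round trip between UTF-8
and UTF-32 could look roughly like this:

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
    // wstring_convert couples a codecvt facet with a pair of string types;
    // here it converts UTF-8 bytes in a std::string to char32_t code points.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string u32 = conv.from_bytes("h\xC3\xA9llo");  // "hello" with an e-acute
    std::string back = conv.to_bytes(u32);
    std::cout << back << " has " << u32.size() << " code points\n";
    return 0;
}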

Greetings from Bremen,

Daniel Krügler


Joshua...@gmail.com

Nov 30, 2008, 6:26:14 PM
On Nov 29, 12:19 pm, Pete Becker <p...@versatilecoding.com>
and
On Nov 29, 12:21 pm, Daniel Krügler <daniel.krueg...@googlemail.com>
talked about std::u16string and std::u32string. It seems that
u16string is UCS-2, not UTF-16, and u32string is UTF-32 (and UCS-4, if
it existed).

As for the new character literals, they are still stored with a
constant-width encoding for each unicode code point, correct?

Then this makes me sad, as it does little to actually address the
problem of unicode. We do get UTF-32 strings, but we will not get the
more common and more useful UTF-8 and UTF-16 encodings, either as
string literals or as standard library string classes.

Joshua...@gmail.com

Nov 30, 2008, 6:25:48 PM
On Nov 28, 12:50 pm, Brendan <catph...@catphive.net> wrote:
> > Now, in C++, if we had a type guaranteed to be able to hold 32 bit
> > values, (wchar is not that), then basic_string<int_32> could hold
> > UTF32 strings. However, this is horribly space inefficient for nearly
> > all languages. Most people use UTF8 and UTF16 to save space (and
>
> I'm not sure that really accomplishes that goal. If you have a lot of
> foreign text with high code points, UTF8 would actually take more
> space.

As I said, UTF16 will take less space than UTF32 for all currently
spoken languages. UTF8 will also take less space than UTF32. However,
UTF8 may take more or less space than UTF16, depending on the
particular language.

> I'm not an expert on this, but my understanding is the reason people
> have standardized on UTF8 is that it's possible for legacy ASCII based
> applications to process.

A very Western-centric argument. I'm not sure this is actually the
case either. For example, Windows NT uses UTF16 as its native encoding
(as you bring up in your next quote).

> Also, in general, there's no guarantee that
> any length integer can hold a code point. Win32 windows apps
> originally used UTF16 for unicode, which bit those guys in the ass
> when unicode came up with more code points than can be represented in
> 16 bits. Thus, people are skeptical of fixed width encodings now.

They used UCS-2, not UTF-16. UTF-16 is a multi-byte encoding scheme
covering all current unicode code points, whereas UCS-2 is a constant
width encoding scheme covering only the BMP. Also, a 32 bit integer
can hold all currently mapped unicode code points, and short of some
insane drive to invent absurd amounts of fictitious languages, or
fictitious languages of absurd sizes, and then give them unicode code
points, 32 bits is more than enough to last humanity for all of its
existence.

> > consequently generally save time as well due to locality, etc.)
> > (though sometimes UTF32 is used to allow transformations which assume
> > constant width encoding. This is not meant to be a total examination
> > of all issues), and that is why Lance wrote that basic_string is
> > useless for unicode.
>
> In practice I see people use strings as a dynamic array of bytes that
> can hold text, binary data, whatever. The fact is that with ascii and
> UTF8, which are the common formats on unix systems, 90% of operations
> done with a std::string. You can't get easy access to the code point
> values, or index the nth code point, but you can search of a arbitrary
> UTF8 std::string since as a property of UTF8 encoding one encoded
> character is never a substring of another encoded character. In
> practice a lot of algorithms work fine this way, and you can always
> define a code point iterator over a standard string.

Except no.

I assume most algorithms will want to do some sort of comparison or
transformation.

**Collation
As Lance mentioned, there's the collation problem, aka sorting, aka
comparisons, including equality comparisons. Sorting rules are
specific to culture and language. For example, German has 2 different
sorting rules for the same alphabet, one for most uses and one for
phone books. Thus, depending on context, you would sort the exact
same byte representations in different orders. This is just a small
taste.

There's also the problem of combining characters, like diacritic marks
and accents. The unit for sorting generally isn't unicode code points.
Two different byte strings may represent the same string. For example,
take a letter with two accents on it. The accents may each be
represented by their own unicode code point following the base letter.
The accents could be in either order. The glyph could also be
represented by a single code point which captures the base character
and the accents. (Basically, normalization.)

Even then, the sorting rules employed by real languages today aren't
simple lexicographic sorts, adding even more pain.
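
A tiny illustration, spelling the two forms out as UTF8 hex escapes
(precomposed U+00E9 versus 'e' followed by the combining acute U+0301):

#include <iostream>
#include <string>

int main() {
    std::string precomposed = "\xC3\xA9";   // U+00E9
    std::string decomposed  = "e\xCC\x81";  // U+0065 U+0301

    // Both denote the same character, but operator== only compares bytes,
    // so without normalization they compare unequal.
    std::cout << (precomposed == decomposed ? "equal" : "not equal") << std::endl;
    return 0;
}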

**Transformations.
When you want to transform some strings, you generally want to work in
terms of glyphs / characters. Except that there isn't a simple mapping
from unicode code points to glyphs (as talked about above), and there
isn't a naive way to go from 8bit values to unicode code points.
There's combining characters, which means if you split the string at
the wrong point, you might have a string starting with a combining
character, which I would guess makes no sense. You can't just say
"give me the first 8 glyphs / characters" by taking an offset because
of the multi-byte encoding of UTF8, and because of combining
characters.

> Like I was saying, the only thing that doesn't work is code point
> indexing, which works in linear time on std::string's UTF8
> representation,

That and collation is basically everything I think you could do on a
string, besides reading it in and printing it out, both of which are
again outside the scope of basic_string. (Input is notoriously complex
for eastern languages. Even then, input is specific to the particular
language. The only other alternative is typing in hex codes or
something for the unicode code point values. Also, the Han unification
has made using a single font for unicode effectively impossible.)

> whereas if you stored the actual code points in an
> array (say in a 32 bit int) you could index them in constant time.

You could access unicode code points in constant time, not "glyphs" /
"characters".

> The
> problem here is that converting a UTF8 representation to code points
> is a linear time operation, which is usually unnecessary. So if we had
> unencoded "unicode" strings, by which I mean an array of code points,
> instead of UTF8 strings, we'd be wasting some time encoding and
> unencoding them which is rarely necessary.

Linear in the length of the string, yes. Is something like this
necessary? Yes, if you want to actually do a correct transformation on
real unicode data.

> This isn't that big of a deal, although if you think about it it's a
> little weird that we are in the position where it's proper to use a
> std::string for binary data (a UTF8 encoded string) and a
> std::vector<uint32> for what might be considered actual "string
> data" (code points).

> However, in practice people often don't think
> about the difference between the encoding and the code point, because
> people usually think in terms of ascii where there isn't that much
> difference.

Which is the source of the problem. What we assume about encodings due
to our familiarity with ASCII does not extrapolate well to real
unicode. English is a wonderfully simple script compared to other real
spoken languages (for the purposes of this discussion).

> Anyway, the upshot is that UTF8 means that std::string is perfectly
> useful for unicode, if maybe not optimal in all situations.

No.
