Committee feedback on N3572


DeadMG

Apr 20, 2013, 8:21:20 AM
to std-pr...@isocpp.org
The LEWG liked the approach of offering new generic Unicode algorithms. They did not like the encoded_string class, and especially did not like the flexibility of its encoding and the fact that it exposed the encoding to the user. The LEWG's recommendation was to split the string and the algorithms into two papers, force the user into using one implementation-defined encoding, and simply provide conversion to and from the existing string mechanisms. Then it will be easy to pass the algorithms separately through the Committee.

I intend to add to this thread with a draft of both revisions later.

Olaf van der Spek

Apr 20, 2013, 9:38:22 AM
to std-pr...@isocpp.org
> Committee feedback on N3572

It would be handy to include the title of the paper in the subject.

Michał Dominiak

Apr 20, 2013, 10:08:35 AM
to std-pr...@isocpp.org
Wait a second, are you telling us that they want Unicode strings to have one specific encoding over which the user has no control at all, and that writing an application using, say, both UTF-8 and UTF-32 would not be possible? If I understood what you wrote correctly, that would render the entire proposal quite useless, and there would be no point in working on it.

If I misunderstood something, what would that "forcing the user into using one implementation-defined encoding" mean?

DeadMG

Apr 20, 2013, 11:31:56 AM
to std-pr...@isocpp.org
Wait a second, are you telling us that they want Unicode strings to have one specific encoding over which the user has no control at all, and that writing an application using, say, both UTF-8 and UTF-32 would not be possible?

Give or take. The approach is that other interfaces deal in arbitrary encodings, and at the interface boundaries, you convert to/from the implementation-defined encoding. You, specifically, would not be able to choose or even observe (the interface is carefully designed for this) the encoding. This is quite similar to what other languages already do, except that the encoding is implementation-defined rather than fixed by the specification.

Frankly, I intend to give the Committee what they want.

Jeffrey Yasskin

Apr 20, 2013, 12:05:05 PM
to std-pr...@isocpp.org
The idea we wanted for a data type was that users could use a single
type to represent unicode characters, rather than a separate type for
each external encoding. This asserts that it's a good idea to
transcode data as it enters or exits the system and use a specific
encoding inside the system, rather than propagating variable encodings
throughout. There are a couple options for implementing that:

1) Take Python 3's approach of representing strings in UTF-32, with an
optimization for strings with no code points above 255 (store in 1
byte each) and another optimization for strings with no code points
above 65535 (store in 2 bytes each). Optionally, if the unicode string
is a cord/rope type, then each segment can have the optimization
applied independently.
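
For illustration, a minimal sketch of the storage-width decision in option (1); the names here are invented and are not an interface from any paper:

// Sketch only: pick the narrowest fixed-width storage that can hold every
// code point in the string, as Python 3 does internally.
#include <cstddef>
#include <string>

inline std::size_t element_width(const std::u32string& s) {
    char32_t max_cp = 0;
    for (char32_t cp : s)
        if (cp > max_cp) max_cp = cp;
    if (max_cp <= 0xFF)   return 1;  // Latin-1 range: one byte per code point
    if (max_cp <= 0xFFFF) return 2;  // BMP only: two bytes per code point
    return 4;                        // otherwise: four bytes per code point
}

A cord/rope implementation could apply the same decision per segment.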

2) Store the string in UTF-8 or UTF-16 depending on the platform (or
another encoding for less common platforms), and provide, say,
"array_view<const char> as_utf8(std::vector<char>& storage);" and
"array_view<const char16_t> as_utf16(std::vector<char16_t>& storage);"
accessors: these would copy the data if it's stored in the other
encoding, or return a reference to the internal storage if it's
already in the desired encoding. Implementations would be free to
define other accessors, but I suspect these are all the standard
needs.
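
To make option (2) concrete, here is one possible shape of such an accessor, assuming (purely for the sketch) an implementation that happens to store UTF-16 internally, and using a plain pointer/length pair in place of array_view; none of this is proposed wording:

#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>
#include <utility>
#include <vector>

// Hypothetical and simplified. A real implementation storing UTF-8 would
// return a view of its own buffer from as_utf8() and transcode in
// as_utf16() instead.
class unistring {
public:
    explicit unistring(std::u16string utf16) : data_(std::move(utf16)) {}

    // Hands out UTF-8. Because this sketch stores UTF-16, it always
    // transcodes into 'scratch' and returns a pointer/length over it.
    std::pair<const char*, std::size_t> as_utf8(std::vector<char>& scratch) const {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
        std::string utf8 = conv.to_bytes(data_);
        scratch.assign(utf8.begin(), utf8.end());
        return { scratch.data(), scratch.size() };
    }

private:
    std::u16string data_;
};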

Option 1 has the benefit of allowing random-access iterators, or at
least indexing, which the ICU folks I spoke to thought would be
useful. Option 2 has the benefit of allowing some, maybe most,
external interactions without copying.


Regardless, I expect it'll be easier to get an algorithms library
through than a string type, especially for algorithms that are
justified by appearing in both a Unicode Report and ICU. Also be sure
to synchronize with other existing papers, for example
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3398.html,
which is a response to the comments at
http://wiki.edg.com/twiki/bin/view/Wg21portland2012/LibraryWorkingGroup#Afternoon_AN2.

Another thing the ICU folks I spoke to said was that it would be
useful for efficiency to allow users to pass a maximum code point to
each algorithm. Some implementations can run much faster if they know
their whole input is under U+100 than if they have to handle
everything up to U+10FFFF. Many users won't have a maximum code point
to pass, but, for example, the string class mentioned in option (1)
has to store it anyway, and can benefit from the ability to pass it.
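
As a sketch of what such a hint could look like on one algorithm (names invented here, nothing from the paper or from ICU):

#include <string>

bool nfc_quick_check(const std::u32string& s);    // full table-driven check, not shown

// Hypothetical: an is-normalized query that accepts the caller's known
// maximum code point. Below U+0300 there are no combining marks and
// nothing can compose or reorder, so the answer is trivially "yes".
inline bool is_nfc(const std::u32string& s, char32_t max_code_point = 0x10FFFF) {
    if (max_code_point < 0x0300)
        return true;                              // fast path enabled by the hint
    return nfc_quick_check(s);
}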

HTH,
Jeffrey

Nicol Bolas

Apr 20, 2013, 12:54:34 PM
to std-pr...@isocpp.org
On Saturday, April 20, 2013 9:05:05 AM UTC-7, Jeffrey Yasskin wrote:
The idea we wanted for a data type was that users could use a single
type to represent unicode characters, rather than a separate type for
each external encoding. This asserts that it's a good idea to
transcode data as it enters or exits the system and use a specific
encoding inside the system, rather than propagating variable encodings
throughout.

By this logic, we shouldn't have `u8`, `u` and `U` prefixes either. We should just have had a single `u` prefix that converts the literal into some Unicode representation, depending on the platform.

It is a good idea to "transcode data as it enters or exits the system," but only as part of the use of a known encoding of string data. If 95% of my data coming into the system is UTF-8, I shouldn't have to transcode it to UTF-16 or whatever this string type wants to use. I should be using UTF-8 internally, because that's what most of my data is. Having a dedicated string type that can use UTF-8 is a big part of that. Without a specific type for that, I have no recourse other than `vector<unsigned char>` if I actually want a UTF-8 string.

I'm guessing the counter-argument will be that you could just do `using utf8_string = vector<unsigned char>;`. But that doesn't work, because it's still a `vector<unsigned char>`. It will behave no differently than any other `vector<unsigned char>`.

If we had support for strong typedefs, then I might be OK with that. But otherwise, we need an actual type which can be different from `vector`, so that it doesn't accidentally participate in overload resolution with it. We need a type that I can pass to iostreams and have it understand what I'm doing.

We need a real type for strings of a known, specific encoding.

I don't mind having a single type for paths, because paths are a very specialized case. They're not just arbitrary strings; they're strings with a purpose. That purpose being interfacing with the host filesystem. Therefore, the encoding will be whatever is most efficient for the platform. That's fine.

But that's the only platform-facing interface that C++ deals with. There is no reason to take away from users the knowledge of how a string is encoded.

Really, what's the point of a Unicode string type if you don't even know how it's encoded? What can you do with it? You can't give it to some API to use, because right now, every existing C++ API that uses Unicode in any way expects a specific encoding of Unicode, or provides several options for encodings. So all of those APIs are either completely unusable or you have to copy and convert the string type.

Why should I waste performance doing a pointless copy, when I gave the Unicode string UTF-8 encoding, and the API I want to hand it to uses UTF-8 encoding?

This kind of string is locked to whatever encoding it uses internally. It's a needless performance hole for anyone who doesn't already use that encoding.

C++ is not Python. Stop trying to turn it into a low-rent version of Python. We don't use C++ because it's easy; we use it because it is powerful. We shouldn't throw away power just to allow slightly easier usage. We don't need a one-size-fits-all Unicode string. Give us choices.

It's sad that the C++ standards committee of 2013 doesn't see the simple wisdom in doing what the C++ standards committee of 1998 did in having `basic_string` be a template based on a character type.
 
There are a couple options for implementing that:

1) Take Python 3's approach of representing strings in UTF-32, with an
optimization for strings with no code points above 255 (store in 1
byte each) and another optimization for strings with no code points
above 65535 (store in 2 bytes each). Optionally, if the unicode string
is a cord/rope type, then each segment can have the optimization
applied independently.

2) Store the string in UTF-8 or UTF-16 depending on the platform (or
another encoding for less common platforms), and provide, say,
"array_view<const char> as_utf8(std::vector<char>& storage);" and
"array_view<const char16_t> as_utf16(std::vector<char16_t>& storage);"
accessors: these would copy the data if it's stored in the other
encoding, or return a reference to the internal storage if it's
already in the desired encoding. Implementations would be free to
define other accessors, but I suspect these are all the standard
needs.

Option 1 has the benefit of allowing random-access iterators, or at
least indexing, which the ICU folks I spoke to thought would be
useful. Option 2 has the benefit of allowing some, maybe most,
external interactions without copying.

These two options represent two entirely different classes with entirely different internal data representations and entirely different performance characteristics. We shouldn't allow any one class to be able to be implemented in such a widely varying way.

It'd be like if we had a single map type which could be a hashtable, a sorted vector, or a tree. How could anyone use it and know what they're getting? Why wouldn't you want separate classes with separate implementations of mapping, for separate circumstances?

Jeffrey Yasskin

Apr 20, 2013, 12:57:22 PM
to std-pr...@isocpp.org
That's why the algorithms library should come first.

Nicol Bolas

Apr 20, 2013, 1:11:57 PM
to std-pr...@isocpp.org

No, it shouldn't.

Algorithms are a vital tool for actually doing stuff with Unicode text. But they aren't everything. Unicode algorithms without the string type are like STL algorithms without the STL containers: a fine and useful idea certainly, but there's clearly something missing.

Right now, what we have are dozens of string types littered across dozens of different products, all taking their own specific encodings of Unicode. If we don't provide a string type that could replace all of those, then there's no possibility for that type to ever actually do so. Yes, it's highly unlikely that it would. But it certainly won't happen if we don't provide one.

We need a type that encapsulates the rules of Unicode. We need a type that can concatenate, subdivide, and do all of the other things we need for Unicode strings. We should not be encouraging the continued mass of string types by providing all of the tools to make one, but then not actually making that type.

That's why I would prefer that this proposal not be divided. Keeping it whole effectively holds the useful algorithms hostage, forcing the committee either to not have the algorithms or to actually put in the work to get a solid Unicode string type together. Dividing it makes it easy to pass one while letting the other languish.

Algorithms are important, yes. But so is an actual Unicode string type. They are both necessary and essential parts of a solid Unicode system.

This notion that the committee seems to have of just getting some of the way to the goal is the easiest way to fail to achieve that goal. If you only take half-steps, you'll never get where you're going.

Daniel Krügler

Apr 20, 2013, 3:12:24 PM
to std-pr...@isocpp.org
2013/4/20 Nicol Bolas <jmck...@gmail.com>:
> On Saturday, April 20, 2013 9:05:05 AM UTC-7, Jeffrey Yasskin wrote:
>>
>> The idea we wanted for a data type was that users could use a single
>> type to represent unicode characters, rather than a separate type for
>> each external encoding. This asserts that it's a good idea to
>> transcode data as it enters or exits the system and use a specific
>> encoding inside the system, rather than propagating variable encodings
>> throughout.
>
> By this logic, we shouldn't have `u8`, `u` and `U` prefixes either. We
> should just have had a single `u` type that would convert it into some
> Unicode representation, depending on the platform.

I agree that this would have been the most natural thing if the new
character types had been there from the beginning. But many (most?)
code bases use quite different containers for such character types,
e.g. wchar_t or short, depending on the purpose. In addition, you
often have to respect the API of a useful third-party library which
expects some such character-like type, and it would be quite annoying
if I had to pay the conversion costs just because the standard
restricts me to a single type. This does not mean that user code is
encouraged to use types other than the new ones.

> It is a good idea to "transcode data as it enters or exits the system," but
> only as part of the use of a known encoding of string data. If 95% of my
> data coming into the system is UTF-8, I shouldn't have to transcode it to
> UTF-16 or whatever that this string type wants to use. I should be using
> UTF-8 internally, because that's what most of my data is. Having a dedicated
> string type that can use UTF-8 is a big part of that. Without having a
> specific type for that, I have no recourse other than `vector<unsigned
> char>` if I actually want a UTF-8 string.

Sure. And I would recommend using the intended types where you need
them. But this does not mean that the library itself should accept
only the new character types. That would IMO strongly reduce the
acceptance of these functions. Usually the library and the language
don't try to enforce a particular idiom, so as to keep the
functionality of broader interest.

I think it makes very good sense to start with algorithms here and
then (possibly) consider a stronger character type, if there is some
convincing desire for it.

> If we had support for strong typedefs, then I might be OK with that.

I don't think that we should make such library decisions *depend* on
some core language feature. I would express it the other way around
and say: *if* strong typedefs existed, the need for a specific
encoding-aware type would decrease considerably.

> We need a real type for strings of a known, specific encoding.

I'm not denying this, but I also don't see that the two decisions
depend on each other.

> It's sad that the C++ standards committee of 2013 doesn't see the simple
> wisdom in doing what the C++ standards committee of 1998 did in having
> `basic_string` be a template based on a character type.

Please get this right: I don't think that there is a fundamental
"no interest in this" position; it is just that the current interest
is much stronger in an algorithm library. The committee usually
prefers to start with an often-asked-for subset of useful
functionality, because it is always some risk and a lot of work to
integrate things into the library. The initial step often makes it
obvious that natural interactions with other parts of the library
exist or would be very desirable, which would cause further API
adaptations here and there.

- Daniel

Nicol Bolas

Apr 20, 2013, 6:23:09 PM
to std-pr...@isocpp.org


On Saturday, April 20, 2013 12:12:24 PM UTC-7, Daniel Krügler wrote:
2013/4/20 Nicol Bolas <jmck...@gmail.com>:
> On Saturday, April 20, 2013 9:05:05 AM UTC-7, Jeffrey Yasskin wrote:
>>
>> The idea we wanted for a data type was that users could use a single
>> type to represent unicode characters, rather than a separate type for
>> each external encoding. This asserts that it's a good idea to
>> transcode data as it enters or exits the system and use a specific
>> encoding inside the system, rather than propagating variable encodings
>> throughout.
>
> By this logic, we shouldn't have `u8`, `u` and `U` prefixes either. We
> should just have had a single `u` type that would convert it into some
> Unicode representation, depending on the platform.

I agree that this would have been the most natural thing if the new
character types had been there from the beginning.

There seems to be a misunderstanding. I was posting an absurdity to hold the idea of a "one string fits all" solution up to ridicule by analogy.

Using a platform-defined encoding for general strings is a bad idea, whether it's in a string literal or a string class.

The idea that we shouldn't be able to declare literals in whatever Unicode encoding we desire, and should instead just accept some platform-specific default, is just... wrong. It's not natural and it's highly unnecessary; it kills performance, because the actual result depends entirely on the platform, and the user has no way to change it. Switching from one platform to another can degrade performance through a lot of pointless re-encoding.

DeadMG

Apr 20, 2013, 6:24:19 PM
to std-pr...@isocpp.org
I think that there could be a middle ground to be found here. I was reviewing the original paper, and the only place the encoding parameter was actually used in the interface was to specify the return value for C string interoperation. If, instead, I changed that so you could request a C string of any encoding type from any encoding (so, for example, c_str() was a template), as is supported by the original traits design, then that would make a polymorphic encoding possible, and also the stored encoding would be non-observable, except perhaps in the complexity of requesting a C string.
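
Roughly, the shape of that idea in code (a hypothetical sketch with invented names, not wording from the paper):

#include <string>

struct utf8  { using code_unit = char;     };
struct utf16 { using code_unit = char16_t; };

class encoded_string {
public:
    // The caller names the encoding it wants; the stored encoding stays an
    // implementation detail. If they happen to match, this can be a cheap
    // copy; otherwise it transcodes, which is the complexity difference
    // mentioned above.
    template <class Encoding>
    std::basic_string<typename Encoding::code_unit> c_str() const {
        return convert_to<Encoding>();   // hypothetical private helper, not shown
    }

private:
    template <class Encoding>
    std::basic_string<typename Encoding::code_unit> convert_to() const;

    std::string storage_;                // some implementation-chosen encoding
};

// usage: auto u16 = s.c_str<utf16>();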

Mikhail Semenov

Apr 21, 2013, 8:55:00 AM
to std-pr...@isocpp.org
 
I think there are several issues here:
(1) platform encoding;
(2) application encoding;
(3) external file encoding.
 
Depending on the application's needs, the application encoding should mainly provide the following representations:
-- char for ASCII (UTF-8 for codes <= 127);
-- UTF-16 for Unicode (95%);
-- UTF-32 for Unicode (rare cases).
 
UTF-8 is not practical for internal representation for characters with codes over 127: string comparison won't work.
The library should cope with conversions between all these encodings: platform <-> application, file <-> application.
 
If I were writing full Unicode support I would use UTF-32 for the application encoding, although it is a bit excessive.

DeadMG

Apr 21, 2013, 9:30:46 AM
to std-pr...@isocpp.org
Unicode string comparison works fine with UTF-8. You cannot use basic_string::operator== or strcmp on *any* Unicode encoding unless you're effectively only storing ASCII. Even then, I'm not sure it's really valid.
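
A concrete example (UTF-8 bytes shown, but the same holds for the code units of any encoding):

#include <cassert>
#include <string>

int main() {
    // Two canonically equivalent spellings of "café": precomposed U+00E9
    // versus 'e' followed by combining acute U+0301. Code-unit comparison
    // says they differ; only a normalizing (or collation-based) comparison
    // treats them as equal.
    std::string precomposed = "caf\xC3\xA9";    // ... 0xC3 0xA9
    std::string decomposed  = "cafe\xCC\x81";   // ... 'e' 0xCC 0x81
    assert(precomposed != decomposed);          // operator== / strcmp: wrong tool
    return 0;
}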

Mikhail Semenov

Apr 21, 2013, 9:54:41 AM
to std-pr...@isocpp.org
(1) In an application, surely you'd like to see one string element per Unicode character, not several (unless you don't care and pass to the system APIs).
(2) As for comparison, of course there are language-specific issues, especially with accented characters (as in French) and some letters (like "yo" in Russian), but at least comparison
by Unicode code point should work, even if it is not perfect; there are culture-specific issues.
(3) There is also probably an issue between the program text encoding (which is often UTF-8 and can be UTF-16) and the application encoding.
 
 
 

DeadMG

Apr 21, 2013, 10:21:19 AM
to std-pr...@isocpp.org
comparison by Unicode code point should work, even if it is not perfect; there are culture-specific issues.

It doesn't. The only correct comparison is to use a Unicode-specific normalizing comparison operation. You cannot do *any* comparison with un-normalized data. 

DeadMG

Apr 21, 2013, 12:02:26 PM
to std-pr...@isocpp.org
I have attached a draft of a new revision. I think that this new version should address at least some of the concerns about the previous variant. I have looked at Beman's paper n3398, and this paper almost entirely supersedes that and will address all of the issues once basic_string is adapted to feature encoded_string compatibility. I have also tuned the algorithms interface and removed the case_insensitive stuff.
unicode.html

Nicol Bolas

Apr 21, 2013, 7:21:09 PM
to std-pr...@isocpp.org
On Saturday, April 20, 2013 3:24:19 PM UTC-7, DeadMG wrote:
I think that there could be a middle ground to be found here. I was reviewing the original paper, and the only place the encoding parameter was actually used in the interface was to specify the return value for C string interoperation. If, instead, I changed that so you could request a C string of any encoding type from any encoding (so, for example, c_str() was a template), as is supported by the original traits design, then that would make a polymorphic encoding possible, and also the stored encoding would be non-observable, except perhaps in the complexity of requesting a C string.

That's not a middle ground; that's not a compromise.

My position is not that I should be able to fetch a sequence of characters in an arbitrary encoding. My position is that I should have complete control over the encoding of the string.

If I'm using UTF-8 strings, I should not have to copy that string just to pass it to a C API that takes a `const char*` of UTF-8 characters. The only way I can guarantee that is if my Unicode string type is in UTF-8. And if I don't have control over the encoding, then the string type is worthless to me.

There's no middle ground here. One side says, "I want the encoding to be implementation-defined." The other side says, "I want direct control over the encoding." You can't provide both without using type erasure and other needlessly performance-damaging techniques.

And that's the problem with your attempted compromise proposal. There's no way to implement that without type erasure. And there's no way to do that without making every iterator access and every other operation slower than it needs to be.

Also, in your proposal, you forgot to add constructors that can take encodings. Not every `const char*` is a narrow encoded string. And your proposal doesn't represent a string that can work with any encoding. If I want to work with UTF-EBCDIC or GB18030 or whatever, why shouldn't I?

Lastly, I wouldn't suggest calling the function to get the internal data `c_str`. That has certain expectations about not copying data. It should probably be `str`, since that function (at least in `std::stringstream::str`) is expected to copy data.

Jeffrey Yasskin

Apr 22, 2013, 8:02:24 PM
to std-pr...@isocpp.org, DeadMG
Thanks for the update!

* The paper would be easier to read if it were divided into sections.
* I still suggest writing separate papers on encoded_string and the
processing algorithms. I'll skip encoded_string for now.
* If this paper is intended to supersede N3398, the paper should say
which of Beman's interfaces are covered by which of your interfaces.

* In the free comparison functions:
* You should link to the relevant standard. I assume that's
http://www.unicode.org/reports/tr10/? What's the interface of the
matching ICU functions?
* "The comparison is performed at L3 or greater" means that it's
implementation-defined which level is actually used? Why is that the
right decision?
* How is the locale argument used?
* If I have an implementation that can compare UTF-8 directly, faster
than converting to code points and comparing those, how do I use it to
implement this interface?
* IIRC, it's possible to convert unicode strings into sort keys for a
particular collation order, and then compare those keys byte-wise,
which can dramatically speed up sorting. Is that supported in your
interface? If not, is it just a V2 feature, or do you think it's
unnecessary?

* In the boundary finders: What's the meaning of the return type? How
do these compare to the equivalent ICU algorithms?
* normalize: Same question about UTF-8 input and output. What's the
return value of the range version?
* encoding_convert says that "The source encoding is that indicated by
the Iterator's value_type." There are more encodings than that. It
might make sense to handle normalization as part of conversion, since
both need to happen on entry to the system. This also needs to deal
with an output encoding.
* validate() is underspecified, and codepoints are probably the wrong
level to call it. For example, utf-8 can be invalid, and the iterators
need to catch that.
* Where do codepoint_properties come from? Link each one to part of
the Unicode standard. Why is a reference to a big struct the right
interface?

* What sort of data size is needed to implement this? Can the data be
shared with an ICU installation? Can implementations for constrained
environments omit chunks of data at the programmer's discretion?


Btw, which Shift-JIS characters were you saying didn't exist in Unicode?

Martinho Fernandes

Apr 23, 2013, 8:53:44 AM
to std-pr...@isocpp.org, DeadMG
On Tue, Apr 23, 2013 at 2:02 AM, Jeffrey Yasskin <jyas...@google.com> wrote:
<snip>
 * "The comparison is performed at L3 or greater" means that it's
implementation-defined which level is actually used? Why is that the
right decision?

There isn't much of a decision to be made here anyway. The only thing that can be effectively required of an implementation is a minimum level. If an implementation using Ln is conforming, any implementation that uses Ln+1 is conforming as well. That means the "or greater" part in the text is actually redundant, since requiring L3 does not forbid L4 implementations (and there is no reason that I can think of to forbid them).

Choosing L3 as the minimum requirement stems from the conformance requirement C2 in UTS#10: "A conformant implementation shall support at least three levels of collation." I discussed this with the author when he was drafting the original proposal in January, and the intent is that op< should support the highest level of collation available to the implementation, i.e. it provides the strictest sorting order available. Lower levels of collation (less strict orders) could be provided using the generic algorithms.

<snip>
* What sort of data size is needed to implement this?

A few megabytes (~6-8 MB in my experiment) to support all non-tailored (i.e. locale-independent) algorithms. Locale data requires more and the exact amount depends on the number of locales supported.
 
Can the data be
shared with an ICU installation?

I see no reason an implementation could not use the ICU data and the CLDR if so desired and available for the target platform.

Can implementations for constrained
environments omit chunks of data at the programmer's discretion?

Maybe. I expect that all locale-specific data can be omitted, or maybe all but one or two locales. Omitting more data would require making some support for some algorithms optional.

<snip>

Mit freundlichen Grüßen,

Martinho

Martinho Fernandes

Apr 23, 2013, 9:34:43 AM
to std-pr...@isocpp.org, DeadMG
On Tue, Apr 23, 2013 at 2:53 PM, Martinho Fernandes <martinho....@gmail.com> wrote:
On Tue, Apr 23, 2013 at 2:02 AM, Jeffrey Yasskin <jyas...@google.com> wrote:
<snip>
 * "The comparison is performed at L3 or greater" means that it's
implementation-defined which level is actually used? Why is that the
right decision?

There isn't much of a decision to be made here anyway. The only thing that can be effectively required of an implementation is a minimum level. If an implementation using Ln is conforming, any implementation that uses Ln+1 is conforming as well. That means the "or greater" part in the text is actually redundant, since requiring L3 does not forbid L4 implementations (and there is no reason that I can think of to forbid them).

Wait, there is a difference indeed...

The results from sorting according to Ln+1 are always also sorted according to Ln. However, C++ uses op< to treat non-comparability as equivalence in some places. An L4 implementation would yield a different notion of "equivalence" in those cases. I don't know how important that is though.

Mit freundlichen Grüßen,

Martinho


FrankHB1989

Apr 24, 2013, 5:02:59 AM
to std-pr...@isocpp.org


On Sunday, April 21, 2013 at 12:54:34 AM UTC+8, Nicol Bolas wrote:


C++ is not Python. Stop trying to turn it into a low-rent version of Python. We don't use C++ because it's easy; we use it because it is powerful. We shouldn't throw away power just to allow slightly easier usage. We don't need a one-size-fits-all Unicode string. Give us choices.


I think we eventually need both an encoding-aware string and a non-encoding-aware string. The latter is not only for convenience, but also for the confidence of "not caring which encoding is used" in some contexts. Throwing either one away and forcing users to use the other seems less powerful. Give us choices.


Nicol Bolas

Apr 24, 2013, 6:46:35 AM
to std-pr...@isocpp.org

If you don't care what encoding a string uses, then you can just use an encoding-aware string anyway. All of them should be convertible between each other (though it should require explicit conversion). And they should all be buildable from raw data (an iterator range and the encoding of that range). So all you need to do is pick one and you're fine.

I don't see why anyone would need a string that is explicitly unaware of its encoding. What does that gain you?

Martinho Fernandes

Apr 24, 2013, 6:52:06 AM
to std-pr...@isocpp.org
On Wed, Apr 24, 2013 at 12:46 PM, Nicol Bolas <jmck...@gmail.com> wrote:

I don't see why anyone would need a string that is explicitly unaware of its encoding. What does that gain you?

Don't we have that as std::basic_string already, anyway?


Mit freundlichen Grüßen,

Martinho


Ville Voutilainen

Apr 24, 2013, 7:00:16 AM
to std-pr...@isocpp.org
On 24 April 2013 13:52, Martinho Fernandes <martinho....@gmail.com> wrote:
On Wed, Apr 24, 2013 at 12:46 PM, Nicol Bolas <jmck...@gmail.com> wrote:

I don't see why anyone would need a string that is explicitly unaware of its encoding. What does that gain you?

Don't we have that as std::basic_string already, anyway?



We do. And the reason we need it is that it can be used for conveying the bits across module boundaries
without caring what encodings the strings in the various modules use, which is sometimes useful. Kind of like
having an ip_address type that can hold either an ip4_address or an ip6_address...

Nicol Bolas

Apr 24, 2013, 7:15:12 AM
to std-pr...@isocpp.org

I'm not sure I understand the analogy. An `any_ip_address` class wouldn't be about crossing "module boundaries". It would be about being able to access an Internet resource from either one or the other address, as needed. The choice of IPv4 or IPv6 is sometimes not up to the application at all; it comes from what site the user wants to access. If they enter an IPv4 address, you need to be able to use it like an IPv4 address, and likewise for IPv6.

This matters because there is no direct conversion from IPv4 to IPv6. Mapping an IPv4 address to IPv6 is something that can cause problems, so in many cases, it's best to just access IPv4 addresses through IPv4, rather than through IPv6.

That is not the case for Unicode. All full Unicode encodings are cross-convertible, with no loss of information. So no particular encoding is functionally better or worse.

If a module is giving you a Unicode string in an encoding you don't care about, and you likewise don't care to store that string in any particular encoding, then what exactly does it matter if you pick a specific encoding? What have you lost?

The only thing I can think of is that you lose the possibility of avoiding copying the string. If your chosen encoding and the module's don't match, then a conversion is needed. Whereas a non-denominational string would be polymorphic, and thus be whatever encoding it was created with. Moving such a string around is therefore possible with some assurance of not cross-converting.

But if we're talking about "conveying the bits across module boundaries" (where "bits" doesn't mean "C++ object", so we're talking serialization of some form), then you really do need to know what those bits are and how they're encoded. A non-denominational string isn't going to help there, since both the source and the destination need to agree. That means an explicit protocol needs to be established.

And if "conveying the bits across module boundaries" isn't talking about serialization, what's wrong with just passing C++ types? You know what string type the module uses, so just use the string type it uses, and everyone's fine.

Ville Voutilainen

Apr 24, 2013, 7:35:23 AM
to std-pr...@isocpp.org
On 24 April 2013 14:15, Nicol Bolas <jmck...@gmail.com> wrote:
And if "conveying the bits across module boundaries" isn't talking about serialization, what's wrong with just passing C++ types? You know what string type the module uses, so just use the string type it uses, and everyone's fine.




The issue is that there may be multiple modules using different encodings, and the mediating module wants
to use a single common type. That's the analogy with an any_address as well. It should become obvious
if you try it out: the mediating part will have an explosion in the number of types it needs to deal with,
which is not the case if it can use a common type.

Nicol Bolas

Apr 24, 2013, 9:01:26 AM
to std-pr...@isocpp.org

How? This is what the mediating module would look like:

unicode_string<utf8> str = module_a::get_some_string(...);
module_b::use_some_string(..., str);

Whatever Unicode encoding `module_a::get_some_string` returns, `str` will always be UTF-8 encoded. It will simply transcode the return value. Whatever Unicode encoding `module_b::use_some_string` takes, `str` will be transcoded into it as needed.

So where exactly is the "explosion in the amount of types" that you're concerned about?

Ville Voutilainen

Apr 24, 2013, 9:06:52 AM
to std-pr...@isocpp.org
On 24 April 2013 16:01, Nicol Bolas <jmck...@gmail.com> wrote:
How? This is what the mediating module would look like:

unicode_string<utf8> str = module_a::get_some_string(...);
module_b::use_some_string(..., str);


I don't think that's what it would look like. It's using unicode_string<utf8> there, potentially unicode_string<something_else>
elsewhere, which is the explosion of types I mentioned.
 
Whatever Unicode encoding `module_a::get_some_string` returns, `str` will always be UTF-8 encoded. It will simply transcode the return value. Whatever Unicode encoding `module_b::use_some_string` takes, `str` will be transcoded into it as needed.

That would assume that the transcoding cost is ok for the mediating part. I don't think that's the general case.


Klaim - Joël Lamotte

Apr 24, 2013, 9:37:03 AM
to std-pr...@isocpp.org

On Wed, Apr 24, 2013 at 3:06 PM, Ville Voutilainen <ville.vo...@gmail.com> wrote:
That would assume that the transcoding cost is ok for the mediating part. I don't think that's the general case.

I'm failing to see how the transcoding cost can be avoided if two modules force the user to work with specific and different encodings.

Joel Lamotte

Nicol Bolas

Apr 24, 2013, 9:38:57 AM
to std-pr...@isocpp.org
On Wednesday, April 24, 2013 6:06:52 AM UTC-7, Ville Voutilainen wrote:
On 24 April 2013 16:01, Nicol Bolas <jmck...@gmail.com> wrote:
How? This is what the mediating module would look like:

unicode_string<utf8> str = module_a::get_some_string(...);
module_b::use_some_string(..., str);


I don't think that's what it would look like. It's using unicode_string<utf8> there, potentially unicode_string<something_else>
elsewhere, which is the explosion of types I mentioned.

I'm still not clear on the problem. They're all inter-compatible; so what if they use UTF8 in some places and UTF16 in others? It won't break anything; they'll just get degraded performance due to user error.

Why should the standard be responsible for people who can't settle on a convention?

Whatever Unicode encoding `module_a::get_some_string` returns, `str` will always be UTF-8 encoded. It will simply transcode the return value. Whatever Unicode encoding `module_b::use_some_string` takes, `str` will be transcoded into it as needed.

That would assume that the transcoding cost is ok for the mediating part. I don't think that's the general case.

The only way for transcoding to be avoided is if none of the modules between the producing module and the consuming one do anything with the string that requires a specific encoding. It must treat the string as a sequence of codepoints that are properly Unicode formatted and arranged. So the consuming module can't write it to a file, to a stream (not without some serious upgrades to iostream to start taking codepoint sequences), send it across the internet, or any number of other processes that need the actual encoding.

There are quite a few operations that don't do any of those. But all of the user-facing APIs will need a specific encoding. That's why applications tend to just pick an encoding and stick with it. They pick whatever their user-facing APIs use and just go with that.

The general rubric for C++ is (and should be): you accept whatever, convert it ASAP into your standard encoding, do any manipulation in that encoding, and then convert it if some specific API needs a different encoding. This is how it must be, because the entire C++ world is not going to suddenly switch to our new Unicode string. There will still be many APIs that only accept a specific encoding.

I don't see the need for the standard to support an alternate way of using Unicode strings.

As for the performance issue, I don't see how you can make a performance-based case for `any_unicode_string` at all. `any_unicode_string` will have significantly degraded access performance, since it will have to use type erasure to store and access the actual data.

Remember: all of the truly useful stuff to do with Unicode strings in C++ comes from iterator-based algorithms, not members of the string class itself. And `any_unicode_string` will have to use type-erased iterators; every `++` or `*` operation will have to go through type erasure, thus degrading performance. On every use of these, you will effectively get the overhead of a type-erased call.
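
To make that cost concrete, a rough sketch of the kind of iterator such a type tends to end up with (hypothetical and heavily simplified):

#include <memory>

class any_cp_iterator {
public:
    struct concept_t {                       // one indirection layer per operation
        virtual ~concept_t() = default;
        virtual char32_t deref() const = 0;
        virtual void increment() = 0;
    };

    char32_t operator*() const { return impl_->deref(); }                // indirect call per '*'
    any_cp_iterator& operator++() { impl_->increment(); return *this; }  // and per '++'

private:
    std::unique_ptr<concept_t> impl_;        // typically also a heap allocation per iterator
};

Compare that with a concrete unicode_string<utf8>::iterator, where '++' and '*' are a handful of inlineable byte operations.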

Take the example code I gave before. Even if it does two separate transcode sequences to get the string from Module A to B, that's still likely to be a win performance-wise over `any_unicode_string` if Module B does multiple passes over the data. So if Module B is doing actual work with the string, you win performance-wise.

I suppose you could make versions of the algorithms that are members of the string, in which case the type-erasure part happens once for the operation. Or that the algorithms could be specialized on `any_unicode_string::iterator`.

Ville Voutilainen

Apr 24, 2013, 10:03:55 AM
to std-pr...@isocpp.org
The point isn't avoiding the cost,  but being able to choose where it's paid.

Ville Voutilainen

Apr 24, 2013, 10:09:45 AM
to std-pr...@isocpp.org
On 24 April 2013 16:38, Nicol Bolas <jmck...@gmail.com> wrote:
On Wednesday, April 24, 2013 6:06:52 AM UTC-7, Ville Voutilainen wrote:
I don't think that's what it would look like. It's using unicode_string<utf8> there, potentially unicode_string<something_else>
elsewhere, which is the explosion of types I mentioned.

I'm still not clear on the problem. They're all inter-compatible; so what if they use UTF8 in some places and UTF16 in others? It won't break anything; they'll just get degraded performance due to user error.

Yes, "they'll". The question is who's "them", and where.
 

Why should the standard be responsible for people who can't settle on a convention?

In order to be useful?

The only way for transcoding to be avoided is if none of the modules between the producing module and the consuming one do anything with the string that requires a specific encoding. It must treat the string as a sequence of codepoints that are properly Unicode formatted and arranged. So the consuming module can't write it to a file, to a stream (not without some serious upgrades to iostream to start taking codepoint sequences), send it across the internet, or any number of other processes that need the actual encoding.

Mostly correct, although the write-to-file/stream aren't quite that clear-cut.
 

The general rubric for C++ is (and should be): you accept whatever, convert it ASAP into your standard encoding, do any manipulation in that encoding, and then convert it if some specific API needs a different encoding. This is how it must be, because the entire C++ world is not going to suddenly switch to our new Unicode string. There will still be many APIs that only accept a specific encoding.

I don't see the need for the standard to support an alternate way of using Unicode strings.

The "accept whatever" includes accepting a general unicode type. Do we suddenly agree completely? Note that
it's just an idea, we might not end up having such a general type.
 

As for the performance issue, I don't see how you can make a performance-based case for `any_unicode_string` at all. `any_unicode_string` will have significantly degraded access performance, since it will have to use type erasure to store and access the actual data.

You're reaching too far and too early into implementation details if you think it *has to* use erasure.

 

Nicol Bolas

Apr 24, 2013, 12:23:53 PM
to std-pr...@isocpp.org
On Wednesday, April 24, 2013 7:09:45 AM UTC-7, Ville Voutilainen wrote:
On 24 April 2013 16:38, Nicol Bolas <jmck...@gmail.com> wrote:
On Wednesday, April 24, 2013 6:06:52 AM UTC-7, Ville Voutilainen wrote:
I don't think that's what it would look like. It's using unicode_string<utf8> there, potentially unicode_string<something_else>
elsewhere, which is the explosion of types I mentioned.

I'm still not clear on the problem. They're all inter-compatible; so what if they use UTF8 in some places and UTF16 in others? It won't break anything; they'll just get degraded performance due to user error.

Yes, "they'll". The question is who's "them", and where.

I'm not sure what you're getting at here. "Where" will be when they transcode strings. Transcoding can't be hidden in this system, because it's right there in the type system. Any time you try to copy/move a `unicode_string<A>` into a `unicode_string<B>`, you get a transcode. This is not difficult to track down.

As for who "them" is, it would be people who can't keep conventions straight, or who can't use a typedef properly. IE: not very many C++ programmers.

Why should the standard be responsible for people who can't settle on a convention?

In order to be useful?

By that logic, we should also have garbage collection. Because it's "useful".

C++ simply doesn't do things this way. It doesn't tend to have types that could cover anything of a general category, with substantial implementation differences based on construction. And when it tries that, it generally works out poorly (see iostreams and its needlessly awful performance).

The general rubric for C++ is (and should be): you accept whatever, convert it ASAP into your standard encoding, do any manipulation in that encoding, and then convert it if some specific API needs a different encoding. This is how it must be, because the entire C++ world is not going to suddenly switch to our new Unicode string. There will still be many APIs that only accept a specific encoding.

I don't see the need for the standard to support an alternate way of using Unicode strings.

The "accept whatever" includes accepting a general unicode type. Do we suddenly agree completely?

By "accept whatever", I mean "I call an API that returns a Unicode string of a particular encoding." I write my code against "whatever" encoding that the API uses, transcoding it to my required, internal encoding. It's not "a specific API could return any arbitrary Unicode encoding," which is what you're asking for.

As for the performance issue, I don't see how you can make a performance-based case for `any_unicode_string` at all. `any_unicode_string` will have significantly degraded access performance, since it will have to use type erasure to store and access the actual data.

You're reaching too far and too early into implementation details if you think it *has to* use erasure.

Whether it's type erasure or something else, this iterator access is not going to be a simple pointer access. Each call to `++` or `*` is going to have to do a lot more work than a specific encoder's iterator. It's going to have to figure out which encoding the type actually is, then call an appropriate function based on that.

And type erasure is the only way to make this work for arbitrary, potentially user-defined encodings. Again, there's no reason I shouldn't be able to use UTF-EBCDIC or whatever, so long as it is a proper Unicode encoding.

Ville Voutilainen

Apr 24, 2013, 12:55:43 PM
to std-pr...@isocpp.org
On 24 April 2013 19:23, Nicol Bolas <jmck...@gmail.com> wrote:

I'm still not clear on the problem. They're all inter-compatible; so what if they use UTF8 in some places and UTF16 in others? It won't break anything; they'll just get degraded performance due to user error.

Yes, "they'll". The question is who's "them", and where.

I'm not sure what you're getting at here. "Where" will be when they transcode strings. Transcoding can't be hidden in this system,

Precisely. And mediating layers don't need/want to do that. The transcoding can be done in places where and when it needs
to be done.
 

Why should the standard be responsible for people who can't settle on a convention?

In order to be useful?

By that logic, we should also have garbage collection. Because it's "useful".

That's completely beside the point. Having a common type for multiple different encoded strings has
nothing to do with things like garbage collection.
 

C++ simply doesn't do things this way. It doesn't tend to have types that could cover anything of a general category, with

Oh, like std::exception and exception_ptr? shared_ptr? function? The forthcoming polymorphic allocators?

 

You're reaching too far and too early into implementation details if you think it *has to* use erasure.

Whether it's type erasure or something else, this iterator access is not going to be a simple pointer access. Each call to `++` or `*` is going to have to do a lot more work than a specific encoder's iterator. It's going to have to figure out which encoding the type actually is, then call an appropriate function based on that.

That's not how it needs to be done. And such a common type doesn't necessarily need to do encoding-specific
traversal at all; it may well be sufficient for such a type to allow blasting the raw bits into various sinks, or
to just convey the string between different subsystems.

Nicol Bolas

Apr 24, 2013, 2:08:57 PM
to std-pr...@isocpp.org

OK, clearly before this conversation can proceed any further, there needs to be some definition of what exactly we're talking about.

When I hear the word "string", I generally think of an object that contains an ordered sequence of characters, which can be accessed in some way and probably have basic sequencing operations performed on them. If it's mutable, insertion, deletion, and such can be used. If it's not mutable, then there should probably be some APIs that do copy-insertion/deletion (creating a new string that is the result of inserting/deleting).

If all you're talking about is some memory object which cannot be useful until it is transferred into some other object, that's not a "string" by any definition I'm aware of. That's not even an iterator range.

So what exactly are you arguing we should have? A string or something else?

Furthermore, what good is "blasting the raw bits into various sinks"? Ignoring the fact that there's no such thing as a "sink", the class is designed so that the user explicitly doesn't know the actual encoding of the data. So the "raw bits" themselves are completely meaningless to anyone. And let's not forget endian conversion issues on top of that.

I cannot imagine what use it would be to take a string of an unknown encoding and send its "raw bits" somewhere. At least, what use that would be compared to taking an actual, specific encoding.

Ville Voutilainen

Apr 24, 2013, 2:15:11 PM
to std-pr...@isocpp.org
On 24 April 2013 21:08, Nicol Bolas <jmck...@gmail.com> wrote:
If all you're talking about is some memory object which cannot be useful until it is transferred into some other object, that's not a "string" by any definition I'm aware of. That's not even an iterator range.

I'm talking about a type from which you can do a conversion to a more specific type.
 
Furthermore, what good is "blasting the raw bits into various sinks"? Ignoring the fact that there's no such thing as a "sink", the

You never dump data into debug files? Into error streams? You never memcpy anything anywhere? You never
send raw data with out-of-band information about the encoding so that it can be decoded on the receiving
side?

I find your trouble of understanding the uses for such a type odd. But I don't need to convince you that
such types are useful.

Jeffrey Yasskin

Apr 24, 2013, 2:41:14 PM
to std-pr...@isocpp.org
I'd classify the options into two general categories:

1) A unicode string class that presents its contents as a sequence of
code points, without exposing its clients to the sequence of bytes
that underlie these code points. This could be the python-style object
I've been suggesting or could be an object that presents a
bidirectional iterator that converts on the fly.

2) An "encoded" string class that presents its contents as a sequence
of bytes along with a description of the encoding that should be used
to interpret those bytes, probably along with an iterator that can
convert from each encoding.
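
In rough code, the two shapes look something like this (hypothetical declarations only, invented for illustration):

#include <cstddef>

// (1) A code-point view; the underlying bytes are not part of the interface.
class unicode_text {
public:
    class const_iterator;                  // bidirectional, yields char32_t
    const_iterator begin() const;
    const_iterator end() const;
};

// (2) A byte sequence plus a description of how to interpret it.
enum class encoding { utf8, utf16le, utf16be, utf32le, utf32be };

class encoded_text {
public:
    const unsigned char* data() const;
    std::size_t size() const;              // in bytes
    encoding enc() const;                  // tells the reader how to decode data()
};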

Neither of these is wrong, but we only want to standardize one, and
it's not totally obvious which is better. (If it's totally obvious to
you, that probably means you're not considering enough viewpoints.)

What *is* pretty obvious is that we need ways to convert byte
sequences from one encoding to another (what
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3398.html
addresses) and ways to run the various Unicode algorithms over both
byte and codepoint sequences. (We need byte sequence support even if
we eventually pick class (1) in order to support users who want to
stick another encoding into a vector<char>.) I'm hoping we can get the
algorithms into TS2 before we need to firmly decide on the less
obvious questions.

Jeffrey

Tony V E

Apr 24, 2013, 3:41:21 PM
to std-pr...@isocpp.org
On Wed, Apr 24, 2013 at 2:41 PM, Jeffrey Yasskin <jyas...@google.com> wrote:

I'd classify the options into two general categories:

1) A unicode string class that presents its contents as a sequence of
code points, without exposing its clients to the sequence of bytes
that underlie these code points. This could be the python-style object
I've been suggesting or could be an object that presents a
bidirectional iterator that converts on the fly.

2) An "encoded" string class that presents its contents as a sequence
of bytes along with a description of the encoding that should be used
to interpret those bytes, probably along with an iterator that can
convert from each encoding.

Neither of these is wrong, but we only want to standardize one, and
it's not totally obvious which is better. (If it's totally obvious to
you, that probably means you're not considering enough viewpoints.)


Let me attempt to claim (_somewhat_ devil's advocate) that we want class 1, with implementation via UTF8, thus getting a specific case of 2 as well. I.e. not just option 1 that may or may not be UTF8, but define that it must be UTF8 so that you can rely on the bytes if you want or need to.

Reasons:

- UTF8 can work with things like strcpy(), so lots of code just works (although "just works" can sometimes be considered harmful if it wasn't expected)
- UTF8 is size efficient
- UTF8 is not *too* iterator inefficient, as you never need to go more than a few bytes left or right to find the start of a code point (i.e. you don't need to go to the beginning of the string, and you can tell if a byte is in the middle of a code point or not; a short sketch of this appears at the end of this message). Of course, with an iterator, you should never be in the middle of a codepoint anyhow.

Downsides
 - Windows uses UTF16.  That's Windows' fault.  UTF16 is the worst of both worlds (still requires multibyte sequences, yet takes up too much space).
 
I'd be OK with functions that convert to other encodings, but I think UTF8 should be the default and the focus.
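
The resynchronization property from the third point above, in code:

#include <cstddef>
#include <string>

// UTF-8 continuation bytes always match 10xxxxxx, so from any byte offset
// you can find the start of the enclosing code point by stepping back at
// most three bytes; you never have to rescan from the start of the string.
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

inline std::size_t start_of_code_point(const std::string& utf8, std::size_t i) {
    while (i > 0 && is_continuation(static_cast<unsigned char>(utf8[i])))
        --i;
    return i;
}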

Tony

DeadMG

Apr 24, 2013, 3:47:12 PM
to std-pr...@isocpp.org
It's far more than just Windows. It's Java, .NET, and every Windows-focused application. You cannot just dump the other encodings. This is not Windows-specific; it's simple compatibility. If you have an existing UTF-16 application that interoperates with a bunch of other UTF-16 applications, there's absolutely no reason whatsoever to go to UTF-8. Your other points apply equally to UTF-16 in the relevant ecosystems. There is nothing special about UTF-8.

Ville Voutilainen

Apr 24, 2013, 3:57:21 PM
to std-pr...@isocpp.org
On 24 April 2013 22:47, DeadMG <wolfei...@gmail.com> wrote:
It's far more than just Windows. It's Java, .NET, and every Windows-focused application. You cannot just dump the other encodings. It is nothing Windows-specific, it's simple compatibility. If you have an existing UTF-16 application that interoperates


Sure, but do remember that not standardizing them isn't the same as dumping them. It's unlikely that
we'll ever standardize libicu, so we need to consider what can be reasonably done.

DeadMG

Apr 24, 2013, 3:59:49 PM
to std-pr...@isocpp.org
Standardizing a Unicode string as UTF-8 and then only ever using that in new Standard interfaces would be dumping the other encodings.

I can see the argument for a Pythonic mystery encoding, maybe. But there's no way I'd prevent any implementer from setting that encoding to UTF-16 on Windows and Jeffrey would want to be able to set his to UTF-32 with some storage magic and so on and so forth. Option 1 vs Option 2 is a debate, but "UTF-8 everywhere" is not even a question. I will never propose such a thing.

Tony V E

Apr 24, 2013, 4:03:01 PM
to std-pr...@isocpp.org
You only need the other encodings on the edges of your app.

Think long term. Some day down the road Windows and Java won't exist, and/or they'll have seen the error of their ways and converted to UTF8.



DeadMG

Apr 24, 2013, 4:31:22 PM
to std-pr...@isocpp.org
Since I have clearly stated that I am not going to propose that, if you want to, then work on your own proposal. In either case, kindly stop wasting my time.

Jeffrey Yasskin

Apr 24, 2013, 4:31:25 PM
to std-pr...@isocpp.org
On Wed, Apr 24, 2013 at 12:59 PM, DeadMG <wolfei...@gmail.com> wrote:
> Standardizing a Unicode string as UTF-8 and then only ever using that in new
> Standard interfaces would be dumping the other encodings.
>
> I can see the argument for a Pythonic mystery encoding, maybe. But there's
> no way I'd prevent any implementer from setting that encoding to UTF-16 on
> Windows and Jeffrey would want to be able to set his to UTF-32 with some
> storage magic and so on and so forth.

(Disclaimer: I haven't checked with our ICU folks or the other C++
folks at Google, so the following is just an educated guess.)

FWIW, I don't think we would want to be able to set the unistring's
internal encoding to UTF-32 if that wasn't the default. The reason to
use a Python3-style encoding would be to allow random access, but that
has to be part of the interface to be useful. I think we'd prefer to
live with a UTF-8 string and no random access rather than using a
non-standard random access interface.

Don't interpret that as an argument _for_ the "UTF-8 everywhere"
option (on which I'm neutral); I'm just trying not to be an example
against it. :)

Zhihao Yuan

unread,
Apr 24, 2013, 4:48:06 PM4/24/13
to std-pr...@isocpp.org
On Wed, Apr 24, 2013 at 3:41 PM, Tony V E <tvan...@gmail.com> wrote:
> - UTF8 is size efficient

You joke. UTF-8 uses 3 bytes to encode Asian characters, while any Asian
language-specific encoding needs only 2 bytes. More interestingly, GB18030,
as a full Unicode implementation, can encode any CJK character in 2
bytes. UTF-8 sucks.

> - UTF8 is not *too* iterator inefficient as you never need to go more than a
> few bytes left or right to find the start of a code point (ie you don't need
> to go to the beginning of the string, and you can tell if a byte is in the
> middle of a code point or not). Of course, with an iterator, you should
> never be in the middle of a codepoint anyhow.

AFAIK, it's the slowest.

> Downsides
> - Windows uses UTF16. That's Windows' fault. UTF16 is the worst of both
> worlds (still requires multibyte sequences, yet takes up too much space).

UTF-16 balances the space usage, and it's very fast. To mix the concept
of bytes and string is C's big fault.

> I'd be OK with functions that convert to other encodings, but I think UTF8
> should be the default and the focus.

Absolutely no.

--
Zhihao Yuan, ID lichray
The best way to predict the future is to invent it.
___________________________________________________
4BSD -- http://4bsd.biz/

Nicol Bolas

unread,
Apr 24, 2013, 11:08:33 PM4/24/13
to std-pr...@isocpp.org
On Wednesday, April 24, 2013 1:03:01 PM UTC-7, Tony V E wrote:
You only need the other encodings on the edges of your app.

Think long term.  Some day down the road Windows and Java won't exist, and/or they'll have seen the error of their ways and converted to UTF8.

... what? That's what you're banking on? That Windows and Java will vanish into the aether in some 10+ years down the line? (Neither of them is going to revamp their APIs just because some people prefer UTF-8.)

We shouldn't make decisions based on events that might eventually happen. We should make decisions based on good knowledge.

Also, Windows/Java aren't the only people who use UTF-16. Qt does too in its QString class.

I'm personally in favor of UTF-8 over all other encodings. However, that is simply not realistic.

Nicol Bolas

unread,
Apr 24, 2013, 11:11:31 PM4/24/13
to std-pr...@isocpp.org


On Wednesday, April 24, 2013 1:48:06 PM UTC-7, Zhihao Yuan wrote:
On Wed, Apr 24, 2013 at 3:41 PM, Tony V E <tvan...@gmail.com> wrote:
> - UTF8 is size efficient

You joke.  UTF-8 use 3 bytes to encode Asian characters, while any Asian
language-specific encoding needs only 2 bytes.  More interestingly, GB18030,
as a full Unicode implementation, can encode any CJK characters in 2
bytes. UTF-8 sucks.

Some would disagree on that point. I'll let those arguments (and the data they use to support it) speak for themselves.

Note that this doesn't mean we shouldn't support UTF-16 just as fully as we do UTF-8. Choice of Unicode encoding is the right of every C++ programmer.

Even if they choose the wrong one ;)

Nicol Bolas

unread,
Apr 24, 2013, 11:41:24 PM4/24/13
to std-pr...@isocpp.org

#2 is not what people actually need from such a string. What people need is a string that does all of the following:

1: Has an explicit encoding, such that you can hand it a block of data in that encoding and no transcoding will take place. This also means that I can get (const) access to the string's data as an array of code-units, for passing to legacy APIs. The encoding should be flexible, so that users can provide their own encodings for things that we don't provide them for (much like allocators).

2: Guarantees the encoding. No operations on this string will cause its data to be encoded wrongly. Any attempt to pass improperly encoded data will throw an exception.

3: Guarantees Unicode. The Unicode spec has rules about what codepoints can appear where. The string should abide by those rules and fail at any operation that would violate them.

4: Work as a proper codepoint range. I should not have to copy my string (again) or fumble about with out-of-class iterators. This means all of our algorithms will work on them naturally. All forward-facing iterators will be codepoint iterators; you don't get (direct) iterator access to the codeunits, nor do you get operator[].

5: Transcoding support. It can take arbitrarily encoded data and convert it to its given encoding.

In short, it should be a sequence of Unicode codepoints, where the encoding is directly exposed to the user, so that they can more easily interface with other APIs that don't use this string type. And that's the main reason why we need the encoding to be directly exposed: because only the user of the type knows what encoding their eventual destination uses. Therefore, only the user of the object can know whether they want to match it or not.

If we need some kind of generic `Unicode codepoint range` class that could work with any encoding transparently, we can have that. But it would not own the actual storage; it would be like a `string_ref/view`.

The actual storage should always have an actual, forward-facing encoding.

Neither of these is wrong, but we only want to standardize one, and
it's not totally obvious which is better. (If it's totally obvious to
you, that probably means you're not considering enough viewpoints.)

Or it means that we're looking at how Unicode works in the real world of C++. This "any_unicode_string" has not been written into any C++-facing library that supports Unicode (that I'm aware of. Python's string type is not C++-facing, though obviously C++ Python modules can use it). Whereas we have Qt's QString, MFC's CString, wxWidgets' wxString, ICU's UnicodeString, and many other Unicode strings. All of which use a specific Unicode encoding. None of the major libraries out there have adopted a class anything like your Option #1.

The only upgrade from all of those types (besides getting rid of the stupid stuff in some of them) that we're asking for is the ability to template the type on a Unicode encoding, just as we allow the character type of `basic_string` to be a template parameter.

So I would say that it is obvious which is better: the kind that is in use in millions of lines of actual C++ code. Not the kind that only exists in Python.

Standard practice has weighed in on this issue. Why should we go against standard practice for Unicode string types?
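
As a rough sketch (the names and signatures here are purely illustrative, not taken from N3572 or any existing library), the kind of fixed-encoding string described in points 1-5 above might look like:

#include <cstddef>

// Illustrative skeleton only. "Encoding" is any policy type supplying the
// nested names used below (e.g. a hypothetical utf8 or utf16 policy).
template <typename Encoding>
class encoded_string {
public:
    using code_unit = typename Encoding::code_unit;           // e.g. char for UTF-8
    using iterator  = typename Encoding::codepoint_iterator;  // decodes on the fly

    // (1)/(5): build from raw code units, validating/transcoding as required;
    // (2)/(3): fails on data that is not validly encoded Unicode.
    encoded_string(const code_unit* units, std::size_t count);

    // (4): forward-facing iteration is over codepoints, never code units.
    iterator begin() const;
    iterator end() const;

    // (1): read-only access to the underlying code units for legacy APIs.
    const code_unit* data() const;
    std::size_t size_units() const;

    // (2)/(3): mutation takes codepoints and re-encodes, rejecting invalid input.
    void insert(iterator where, char32_t codepoint);
};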

DeadMG

unread,
Apr 25, 2013, 1:33:36 AM4/25/13
to std-pr...@isocpp.org
What I would also add is that the original paper is slightly defective. I had intended for the encoding to not show up in the interface at all, but it did for c_str(). I should have changed that so that you could request a C-string of any encoding from an encoded_string of any encoding (only guarantee O(1) for encoding matches). Then, if you do

    auto str = f();
    // use str

then the encoding of str is irrelevant unless you need to know.
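
For instance, the corrected interface might have looked something like this (hypothetical spelling, not the paper's wording):

    encoded_string<utf16> str = f();
    const char16_t* a = str.c_str<utf16>(); // matches the internal encoding: O(1), no copy
    const char*     b = str.c_str<utf8>();  // mismatch: may transcode into an internal buffer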

Lawrence Crowl

unread,
Apr 25, 2013, 5:10:04 PM4/25/13
to std-pr...@isocpp.org
On 4/24/13, Tony V E <tvan...@gmail.com> wrote:
It "just works" by a combination of design and accident. However,
strlen fails to return the right data. Repurposing functions
because of accidents is not the path to clear code.

> - UTF8 is size efficient

Efficiency depends on your corpus. UTF8 is most space efficient
for Latin scripts. For some European or Middle Eastern scripts,
UTF8 and UTF16 are space equivalent. For East Asian scripts,
UTF16 is most efficient.

On systems with 12-bit (e.g. PDP-8) or 24-bit words, UTF12 is most
space efficient.

For scripts outside the basic plane, UTF16 and UTF32 are
space-equivalent, but UTF32 is more time efficient. Likewise,
UTF12 and UTF24 are space equivalent, but UTF24 is more time
efficient for South and East Asian scripts.

> - UTF8 is not *too* iterator inefficient as you never need to go
> more than a few bytes left or right to find the start of a code
> point (ie you don't need to go to the beginning of the string, and
> you can tell if a byte is in the middle of a code point or not).
> Of course, with an iterator, you should never be in the middle
> of a codepoint anyhow.

UTF32 has the fastest iterator performance. It can matter, because
it is decision-less, which makes it viable for use in vector units.
UTF16 is somewhat harder. UTF8 is much harder.
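
To make that concrete: decoding one code point from UTF-8 branches on the lead byte, while the UTF-32 equivalent is a plain load. A minimal sketch (validation of overlong forms, surrogates and truncation omitted):

#include <cstdint>

// Decode the code point starting at p and advance p past it.
std::uint32_t next_codepoint_utf8(const unsigned char*& p) {
    std::uint32_t lead = *p;
    if (lead < 0x80) { p += 1; return lead; }                            // 1 byte
    if ((lead & 0xE0) == 0xC0) {                                         // 2 bytes
        std::uint32_t cp = ((lead & 0x1F) << 6) | (p[1] & 0x3F);
        p += 2; return cp;
    }
    if ((lead & 0xF0) == 0xE0) {                                         // 3 bytes
        std::uint32_t cp = ((lead & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
        p += 3; return cp;
    }
    std::uint32_t cp = ((lead & 0x07) << 18) | ((p[1] & 0x3F) << 12)     // 4 bytes
                     | ((p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    p += 4; return cp;
}

// UTF-32: no branches at all.
std::uint32_t next_codepoint_utf32(const std::uint32_t*& p) { return *p++; }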

> Downsides
> - Windows uses UTF16. That's Windows' fault. UTF16 is the worst
> of both worlds (still requires multibyte sequences, yet takes up
> too much space).

You're forgetting that UTF8 requires more validation. There are
lots of byte sequences that do not map to code points, so what do
you do with them?

> I'd be OK with functions that convert to other encodings, but I
> think UTF8 should be the default and the focus.

It seems to me that striving for one type is not likely to work,
given disparate uses and existing legacy files. I also think we
are likely to need 'unvalidated UTF8' and 'validated UTF8' types
in the mix as well.

Even so, we need a 'vocabulary' type, and I think it should adapt
its representation to the needs of its content. Doing so would
probably result in the least overall pain.

--
Lawrence Crowl

corn...@google.com

unread,
Apr 26, 2013, 5:45:21 AM4/26/13
to std-pr...@isocpp.org


On Thursday, April 25, 2013 11:10:04 PM UTC+2, Lawrence Crowl wrote:
On 4/24/13, Tony V E <tvan...@gmail.com> wrote:
> - UTF8 is not *too* iterator inefficient as you never need to go
> more than a few bytes left or right to find the start of a code
> point (ie you don't need to go to the beginning of the string, and
> you can tell if a byte is in the middle of a code point or not).
> Of course, with an iterator, you should never be in the middle
> of a codepoint anyhow.

UTF32 has the fastest iterator performance.  It can matter, because
it is decision-less, which makes it viable for use in vector units.
UTF16 is somewhat harder.  UTF8 is much harder.

On the other hand, for UTF8 and western scripts, you can fit 4 times as much text into the L1 cache. That may be quite a significant gain.

Sebastian

FrankHB1989

unread,
Apr 26, 2013, 3:35:32 PM4/26/13
to std-pr...@isocpp.org


On Wednesday, April 24, 2013 at 6:46:35 PM UTC+8, Nicol Bolas wrote:
If you don't care what encoding a string uses, then you can just use an encoding-aware string anyway. All of them should be inter-convertible between each other (though it should require explicit conversion). And they should all be buildable from raw data (an iterator range and the encoding of that range). So all you need to do is pick one and you're fine.
 Yes I can. And I have to.
 

I don't see why anyone would need a string that is explicitly unaware of its encoding. What does that gain you?
Firstly and conceptually, the encoding should not be exposed through the interface of a "pure" string, namely a sequence of characters. If the encoding is mandated in your mind, you are actually talking about a sequence of code points, not merely a string.
Secondly, indeterminate encoding can lead to better optimization of transcoding for string operands with different encodings. The implementation should know better about the performance of transcoding algorithms using specific intermediate encoding than users in most cases, and can perform transcoding only when it is really necessary.

P.S. std::basic_string is still too strict to be the proper abstraction. Do we have something like traits?

FrankHB1989

unread,
Apr 26, 2013, 3:48:05 PM4/26/13
to std-pr...@isocpp.org


On Thursday, April 25, 2013 at 4:48:06 AM UTC+8, Zhihao Yuan wrote:
UTF-16 balances the space usage, and it's very fast.  To mix the concept
of bytes and string is C's big fault.


UTF-16 is variable-length. In general, it is fast only if you throw away surrogate pairs, i.e. code points outside the BMP. (And thus it becomes UCS-2.)


Mikhail Semenov

unread,
Apr 28, 2013, 3:15:08 PM4/28/13
to std-pr...@isocpp.org
I agree with Lawrence on that: UTF32 is more efficient for representing general Unicode characters.
I think the issue here is that it is difficult to resolve the following two issues:
(1) to select a preferable encoding (for reading from a file, system representation and exchange);
(2) to select a common string format for internal representation (arrays of characters that we can easily compare).
 
The reason for the second point is that Unicode itself proposes 4 different normalization forms (http://unicode.org/reports/tr15/#Examples):
NFD, NFC, NFKD and NFKC. I personally prefer NFKC, but then you lose ligatures and other character types.
Point (2) is to create strings for easy access of elements and comparison. The comparison is an issue: even French words have special way of comparison based on accented characters.
Other languages have their own specific ways of comparing words and there may be more than one way of doing so. I think this issue can be left.
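For instance (a small illustration; the byte values follow directly from the UTF-8 and normalization definitions), the same visible character "é" is one code point in NFC but two in NFD, so naive element-wise comparison fails across forms:

#include <cassert>
#include <string>

int main() {
    std::string nfc = u8"\u00E9";   // NFC: precomposed U+00E9, two UTF-8 bytes (0xC3 0xA9)
    std::string nfd = u8"e\u0301";  // NFD: 'e' + COMBINING ACUTE ACCENT, three bytes (0x65 0xCC 0x81)

    assert(nfc != nfd);       // different code points, different bytes, same visible text
    assert(nfc.size() == 2);
    assert(nfd.size() == 3);
}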
 
My suggestion would be to have two basic forms of representation:
(1) encoded strings;
(2) simple strings of characters (char8, char16 and char32).
 An implementation should provide
(a) some forms of encoding for encoded strings;
(b) some conversions between encoded strings and simple strings (of char8, char16 and char32);
(c) in addition to standard comparison of simple strings (like arrays of elements), there should be conversion routines for
various languages.
 
The user should be able to use these encodings, conversions and comparisons, and should be able to provide their own.
 
There is also GB18030 Standard (for Chinese characters), which is different from Unicode.

Mikhail.

Nicol Bolas

unread,
Apr 28, 2013, 9:13:36 PM4/28/13
to std-pr...@isocpp.org


On Sunday, April 28, 2013 12:15:08 PM UTC-7, Mikhail Semenov wrote:
I agree with Lawrence on that: UTF32 is more efficient for representing general Unicode characters.
I think the issue here is that it is difficult to resolve the following two issues:
(1) to select a preferable encoding (for reading from a file, system representation and exchange);
(2) to select a common string format for internal representation (arrays of characters that we can easily compare).
 
The reason for the second point is that the Unicode itself propose 4 different types of representation http://unicode.org/reports/tr15/#Examples:
NFD, NFC, NFKD and NFKC. I personally prefer NFKC, but then you lose ligatures and other character types.
Point (2) is to create strings for easy access of elements and comparison. The comparison is an issue: even French words have special way of comparison based on accented characters.
Other languages have their own specific ways of comparing words and there may be more than one way of doing so. I think this issue can be left.
 
My suggesting would be to have two basic forms of representation:
(1) encoded strings;
(2) simple strings of characters (char8, char16 and char32).

If we have "encoded strings" (presumably allowing for arbitrary encodings), why would we need "simple strings"? Isn't `basic_string` a "simple string"?

 An implementation should provide
(a) some forms of encoding for encoded strings;
(b) some conversions between encoded strings and simple strings (of char8, char16 and char32);
(c) in addition to standard comparison of simple strings (like arrays of elements), there should be conversion routines for
various languages.

What kind of conversions are you talking about? We already have Unicode normalization via algorithms. So it's not clear what kind of language-based conversions you're looking for.

Mikhail Semenov

unread,
Apr 29, 2013, 4:20:50 AM4/29/13
to std-pr...@isocpp.org
Nicol,
 
I thought I made it clear. For example, elements of UTF-8 are bytes: each element does not represent a character (unless it is an ASCII string). You can convert it to a
string of char32 so that each character really represents one Unicode character. On the other hand, if you are only interested in the main coding plane, a string of char16 will be enough. And if you only use European languages, a string of char8 will be fine. In UTF-8, on the other hand, each Unicode character can be coded by 1, 2, 3, ... bytes.
 
 
In .NET, Microsoft uses 2-byte characters because in most applications it's enough to use only the main Unicode plane, which covers most characters of most languages.
 
You cannot use UTF-8 strings to easily manipulate, for example, Chinese characters: each character is represented by several bytes in UTF-8.
 
Mikhail.

Giovanni Piero Deretta

unread,
Apr 29, 2013, 5:41:13 AM4/29/13
to std-pr...@isocpp.org

On Monday, April 29, 2013 9:20:50 AM UTC+1, Mikhail Semenov wrote:
Nicol,
 
I thought I made it clear. For example, elements of UTF-8 are bytes: each element does not represent a character (unless it is an ASCII string) you can convert it to
string of char32 so that each character really represent one Unicode character.

None of the Unicode encodings maps a code unit to a character. At most (with UTF-32) you can map a code unit to a code point. But a code point, because of combining characters, is still not necessarily what would be considered a character (whose definition is often application specific).

 
On the other hand, if you are only interested in the main coding plane: string of char16 will be enough. And if you only use European languages: string of char8 will be fine. In UTF-8 on the other hand, each Unicode character can be coded by 1, 2 ,3 ... bytes.

So what? As long as you are restricting yourself to a subset of Unicode, a string of bytes is enough to represent ASCII. And with UTF-16 even European characters can be represented with multiple code units, for example when using combining accents.
 
 
 
In .NET, Microsoft uses 2-byte characters because in most applications it's enough to use only the main Unicode plane, which covers most characters of most languages.

.NET uses full UTF-16 and certainly doesn't assume only the basic plane. Some functions may assume it, but they are marked as such.
 
 
Yo cannot use UTF-8 strings, for example, to easily mainipulate, for example, Chinese charcaters: each character is represented by several bytes in UTF-8.

For most string manipulations you would use high-level algorithms anyway, so, really, character-level access is often not necessary. And when you need it, for example for parsing protocols (which are invariably specified as using a byte-level encoding, usually UTF-8), you can still do many character-level operations in UTF-8 because many interesting Unicode codepoints (' ', '\r', '\n') map to a single code unit.

-- gpd

Martinho Fernandes

unread,
Apr 29, 2013, 5:58:07 AM4/29/13
to std-pr...@isocpp.org
On Sun, Apr 28, 2013 at 9:09 PM, Mikhail Semenov <mikhailse...@gmail.com> wrote:
 

The reason for the second point is that the Unicode itself propose 4 different types of representation http://unicode.org/reports/tr15/#Examples:
NFD, NFC, NFKD and NFKC. I personally prefer NFKC, but then you lose ligatures and other character types.

No, no, no, no, and no. WTF. Compatibility normalization is a destructive process! Suggesting that as what encoded_string uses is too limiting.

And for that matter, NFD is also destructive (U+387 GREEK ANO TELEIA mistakenly decomposes to U+00B7 MIDDLE DOT); and since the first step of NFC is the same as NFD, that makes it destructive as well. I would prefer not having any automatic normalization performed.

Different normal forms lend themselves to different use cases. I say let the user choose.



Point (2) is to create strings for easy access of elements and comparison. The comparison is an issue: even French words have special way of comparison based on accented characters.
Other languages have their own specific ways of comparing words and there may be more than one way of doing so. I think this issue can be left.

That would be done with locales, I would expect.

There is also GB18030 Standard (for Chinese characters), which is different from Unicode.

In what way is it different? Unicode defines a character set, and as far as I know, GB18030 is yet another encoding form for that character set.

Martinho

Mikhail Semenov

unread,
Apr 29, 2013, 7:58:33 AM4/29/13
to std-pr...@isocpp.org
Sorry, Gentlemen.
 
I think no-one is listening to what I am saying.
Speaking of Unicode: yes the user can choose.
This is the point. You've got to convert from an encoded sequence to an array.
The conversion is the user's choice.
Encoded sequence -> array (which is a string of char8, char16 or char32).
 
When you convert to an array you may choose NFD, NFC, NFKD or NFKC, or just ASCII, or whatever you like.
 
This conversion can be provided by the implementation or by the user.
When we obtain this string (array) we can get access to single characters (code points), whatever they are.
It's the user's choice what elements to use: char8, char16 or char32.
But the point is that each element has a fixed size!
 
Now, after processing, you can convert back or to another encoding:
string -> encoded sequence.
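
(For what it's worth, C++11 already provides roughly this shape of conversion, encoded byte sequence to a fixed-width array and back, via <codecvt>; a minimal example:)

#include <codecvt>
#include <locale>
#include <string>

int main() {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;

    std::string    utf8      = u8"\u4E2D\u6587";      // the encoded sequence
    std::u32string fixed     = conv.from_bytes(utf8); // one element per code point
    std::string    roundtrip = conv.to_bytes(fixed);  // back to the encoded form
    // fixed.size() == 2: each char32_t element holds exactly one code point.
}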
 
Mikhail.


 

Ville Voutilainen

unread,
Apr 29, 2013, 8:03:25 AM4/29/13
to std-pr...@isocpp.org
On 29 April 2013 14:58, Mikhail Semenov <mikhailse...@gmail.com> wrote:
Sorry, Gentlemen.
I think no-one is listening to what I am saying.

Did you read what signore Deretta wrote? If not, go read it again, it should be illuminating.

Mikhail Semenov

unread,
Apr 29, 2013, 8:28:21 AM4/29/13
to std-pr...@isocpp.org
GB18030 has more code points (1,587,600) than Unicode, but not all of them are used.

On 29 April 2013 10:58, Martinho Fernandes <martinho....@gmail.com> wrote:

Mikhail Semenov

unread,
Apr 29, 2013, 9:53:22 AM4/29/13
to std-pr...@isocpp.org
Ville,
 
I read it again. But I disagree with high-level manipulation of characters without using arrays. I would hate to manipulate, for instance, strings in Chinese,
using UTF-8 encoded strings directly; the same applies to Russian. I need one element per code point.
UTF-8 is very good for files, but not for string manipulation (unless, of course, you use ASCII <128).
 
Regards,
Mikhail.


 

--

Martinho Fernandes

unread,
Apr 29, 2013, 10:13:30 AM4/29/13
to std-pr...@isocpp.org
On Mon, Apr 29, 2013 at 3:53 PM, Mikhail Semenov <mikhailse...@gmail.com> wrote:
Ville,
 
I read it again. But I disagree with high-level manipulation of characters, not using arrays. I would hate to manipulate, for instance, strings in Chinese,
using directly UTF-8 encoded strings; the same applies to Russian. I need one element per code point.
UTF-8 is very good for files, but not for string manipulation (unless, of course, use use ASCII <128).
 
Regards,
Mikhail.

From what I gathered, while there are some disagreements about how such a thing should be achieved, I believe most people in this discussion agree with one point: no one should have to manipulate UTF-8/UTF-16/UTF-32/ASCII/Windows-1252/GB18030/Big-5/whatever directly as code units except in very rare and very special circumstances. Most of the uses for getting such raw access to data involve interoperation, either with legacy code or with external systems.

That said, I don't know what kind of manipulations you are concerned with here; lack of a common ground with respect to that may be the source of some misunderstandings.

Nicol Bolas

unread,
Apr 29, 2013, 11:21:08 AM4/29/13
to std-pr...@isocpp.org
On Monday, April 29, 2013 1:20:50 AM UTC-7, Mikhail Semenov wrote:
Nicol,
 
I thought I made it clear. For example, elements of UTF-8 are bytes: each element does not represent a character (unless it is an ASCII string) you can convert it to
string of char32 so that each character really represent one Unicode character. On the other hand, if you are only interested in the main coding plane: string of char16 will be enough. And if you only use European languages: string of char8 will be fine. In UTF-8 on the other hand, each Unicode character can be coded by 1, 2 ,3 ... bytes.
 
 
In .NET, Microsoft uses 2-byte characters because in most applications it's enough to use only the main Unicode plane, which covers most characters of most languages.
 
Yo cannot use UTF-8 strings, for example, to easily mainipulate, for example, Chinese charcaters: each character is represented by several bytes in UTF-8.

Yes, that's why we want a string class that makes it easy to manipulate arbitrary codepoint sequences in an arbitrary, specified encoding. The whole point is to have a string class, with an explicit encoding parameter, which allows you to manipulate it as a codepoint sequence, while still having basic access to the encoded data as an array of code units.

We already have the basic tools to be able to do that: specialized iterators for various encodings, which output codepoints, where ++ and -- will move along the encoded array properly. All we need is to aggregate these into a storage object, template that object on an encoding type (which is what provides the iterators), and add some basic operations.

At which point, I can use a UTF-8 string just as easily as I can a UTF-32 in any Unicode operation, from searching for a codepoint sequence, to normalizing it, to anything.
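
As a usage-level sketch (the types and encoding tags are purely illustrative, not an existing library):

#include <algorithm>

// Given, hypothetically, a : encoded_string<utf8> and b : encoded_string<utf32>,
// both iterate by code point, so the identical call works for either storage:
auto ita = std::find(a.begin(), a.end(), U'\u4E2D');
auto itb = std::find(b.begin(), b.end(), U'\u4E2D');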

Tony V E

unread,
May 1, 2013, 7:59:59 PM5/1/13
to std-pr...@isocpp.org

--



Do we want the option of an encoding that changes at runtime (ie per string, or even as the string changes)?

I can see

string<encoding_dontcare> str = str_from_somewhere;
f1(f2(f3(f4(str))));

 
I don't want "encoding_dontcare" to mean that on Windows it is UTF16, and Linux UTF8, I want "dontcare" to mean whatever is given to it.  In that way, if each function along the way doesn't care, there is a chance that no re-encoding ever happens.  Whatever encoding str_from_somewhere was, that is the encoding (internally) returned from f1.

ie I think we want both encoding_platform and encoding_flexible (as a still not good, but better name).

Or do I need to write all my functions as templates?



Nicol Bolas

unread,
May 1, 2013, 9:30:20 PM5/1/13
to std-pr...@isocpp.org

There's no (good) way to implement "encoding_flexible" as a template parameter to a string type that expects a specific, fixed encoding. Not without creating a whole new specialization of that type that has a different interface, which is really little different from just creating a new class type.

Personally, I don't like `encoding_platform` at all, as this assumes that the platform-specific encoding is: 1) a good idea to use and 2) somehow specific to that platform.

The problem with a string type that can handle any encoding (even user-defined ones) is that such a string would necessarily have to use type-erasure to access the data in that string. Iterators for such a type will be slower to use because of the overhead.

In short, you're trading potentially less transcoding for always slower use of the string.

I'm not saying that we shouldn't have an `any_encoded_string`. I'm saying that we need to also have a `fixed_encoded_string` (with a forward-facing encoding that can be user-provided), and the two types cannot be the same type.
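
Roughly speaking, the overhead being traded for looks like this (sketch only, illustrative names):

// Once the encoding is erased, every step of iteration goes through an
// indirect call; a fixed-encoding string can instead inline the decode loop.
struct codepoint_cursor {
    virtual ~codepoint_cursor() = default;
    virtual bool     done() const = 0;
    virtual char32_t current() const = 0;   // virtual call per code point read
    virtual void     advance() = 0;         // virtual call per step
};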

Lawrence Crowl

unread,
May 7, 2013, 2:50:21 PM5/7/13
to std-pr...@isocpp.org
On 5/1/13, Nicol Bolas <jmck...@gmail.com> wrote:
> On Wednesday, May 1, 2013 4:59:59 PM UTC-7, Tony V E wrote:
> > I don't want "encoding_dontcare" to mean that on Windows it
> > is UTF16, and Linux UTF8, I want "dontcare" to mean whatever
> > is given to it. In that way, if each function along the way
> > doesn't care, there is a chance that no re-encoding ever happens.
> > Whatever encoding str_from_somewhere was, that is the encoding
> > (internally) returned from f1. ie I think we want both
> > encoding_platform and encoding_flexible (as a still not good,
> > but better name).

If you mean use the given encoding, you should infer the type
from the object given to it. In the case of variables, use auto.
In the case of functions, use a template parameter.

> > Or do I need to write all my functions as templates?

If you want flexibility without run-time overhead, yes.
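
In other words, something along these lines (sketch only, reusing the hypothetical encoded_string from earlier in the thread):

// A function template that neither knows nor changes the encoding: whatever
// encoding the argument uses is the encoding of the result.
template <typename Encoding>
encoded_string<Encoding> f4(const encoded_string<Encoding>& s) {
    // ... operate on s through its codepoint-level interface ...
    return s;
}

// At the call site, auto preserves whatever encoding was deduced:
//   auto str = f1(f2(f3(f4(str_from_somewhere))));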

> There's no (good) way to implement "encoding_flexible" as a
> template parameter to a string type that expects a specific,
> fixed encoding. Not without creating a whole new specialization of
> that type that has a different interface, which is really little
> different from just creating a new class type.
>
> Personally, I don't like `encoding_platform` at all, as this
> assumes that the platform-specific encoding is: 1) a good idea
> to use and 2) somehow specific to that platform.
>
> The problem with a string type that can handle *any* encoding (even
> user-defined ones) is that such a string would necessarily have
> to use type-erasure to access the data in that string. Iterators
> for such a type will be slower to user because of the overhead.
>
> In short, you're trading potentially less transcoding for always
> slower *use* of the string.

An intermediate approach is to use function templates as above,
and then permit implicit transcoding conversions where necessary.

> I'm not saying that we shouldn't have an `any_encoded_string`.
> I'm saying that we need to also have a `fixed_encoded_string`
> (with a forward-facing encoding that can be user-provided),
> and the two types cannot be the same type.

They can be different specializations of the same template though.

--
Lawrence Crowl

Mikhail Semenov

unread,
May 7, 2013, 3:50:57 PM5/7/13
to std-pr...@isocpp.org
I think there should be a base class for encoding
template <class EncodingElement, class CharType>
class encoding
{
public:       
    virtual std::basic_string<EncodingElement> encode(std::basic_string<CharType> str) = 0;
    virtual std::basic_string<CharType> decode(std::basic_string<EncodingElement> str) = 0;   
};
Then particular encoding classes can be implemented:
class encoding_utf8_char32: public encoding<char, char32_t>
{
...
};
class encoding_utf8_char16: public encoding<char, char16_t>
{
...
};
class encoding_utf16_char32: public encoding<char16_t, char32_t>
{
...
};
class encoding_GB18030_char32: public encoding<char, char32_t>
{
...
};
Inside the program the encoded strings should be decoded when necessary.
Such approach makes it possible to use various encodings in one program.
The system one will be just one of the encodings.

Mikhail Semenov

unread,
May 7, 2013, 3:54:29 PM5/7/13
to std-pr...@isocpp.org

I think there should be a base class for encoding

template <class EncodingElement, class CharType>
class encoding
{
public:       

    virtual std::basic_string<EncodingElement> encode(const std::basic_string<CharType>& str) = 0;
    virtual std::basic_string<CharType> decode(const std::basic_string<EncodingElement>& str) = 0;   
};

Then particular encoding classes can be implemented:

class encoding_utf8_char32: public encoding<char, char32_t>
{
...
};

class encoding_utf8_char16: public encoding<char, char16_t>
{
...
};

class encoding_utf16_char32: public encoding<char16_t, char32_t>
{
...
};

class encoding_GB18030_char32: public encoding<char, char32_t>
{
...
};

Inside the program the encoded strings should be decoded when necessary.
Such approach makes it possible to use various encodings in one program.

 

Nicol Bolas

unread,
May 7, 2013, 11:06:31 PM5/7/13
to std-pr...@isocpp.org

Why? They need different interfaces; in particular, the fixed-encoded string needs a function to return an array of code-units, so that you can use them with C APIs that take that encoding. That's not really possible with the any-encoded string, because the type it returns could be anything, rather than a single, fixed type. The any-encoded string probably should also have APIs that will internally convert the string to a specific encoding (still using the any-encoded API), without doing a copy to a new string object.

Again, I point to `vector<bool>`; specializations that have different APIs should not be specializations.

Mikhail Semenov

unread,
May 8, 2013, 3:48:37 AM5/8/13
to std-pr...@isocpp.org
Do we really need a fits-all encoding, or shall we deal with typical cases used to cover most languages? Besides, there is a case for the end-of-line as well:
you can easily identify it by one encoded element (depending on the size of the encoded element: 1, 2 or 4 bytes) with the same code (0x10).
That makes it easier to split the initial text into lines.
 
I don't think it is worth covering various "packed" encodings and those used for encryption.

--

Mikhail Semenov

unread,
May 8, 2013, 3:49:48 AM5/8/13
to std-pr...@isocpp.org
Sorry, I meant 0xA for the end-of-line.

Nicol Bolas

unread,
May 8, 2013, 6:57:06 AM5/8/13
to std-pr...@isocpp.org
On Wednesday, May 8, 2013 12:48:37 AM UTC-7, Mikhail Semenov wrote:
Do we really need fits all encoding, or shall we deal with typical cases used to cover most languages?

Considering that an encoding is just a specialized set of iterators, a few basic functions, and a couple of typedefs, I see no reason why we should explicitly limit this string type to only certain encodings. If the user wants to use UTF-7 as an encoding, we shouldn't prevent them from being able to do so with the fixed encoding string type. This would allow them to more easily utilize the transcoding and other machinery that such a string will have.

Except for the most specialized needs or legacy code, nobody should have a reason to use some other string type for a sequence of Unicode codepoints.

DeadMG

unread,
May 8, 2013, 7:48:25 AM5/8/13
to std-pr...@isocpp.org
On Wednesday, May 8, 2013 8:48:37 AM UTC+1, Mikhail Semenov wrote:
Do we really need fits all encoding, or shall we deal with typical cases used to cover most languages? Besides, there is a case for the end-of-line as well:
you can easily ideintify it by one encoded element (depending on the size of the encoded element: 1 , 2 or 4 bytes) with the same code (0x10).
That makes it easier to split the initial text into lines.

No, no, it doesn't. I seriously have to question how much you know what we are even talking about here. The Unicode Standard provides a line-break algorithm and they provide it for a reason, and that reason is "Split on "\n"" doesn't work. 

Mikhail Semenov

unread,
May 8, 2013, 8:08:43 AM5/8/13
to std-pr...@isocpp.org
There are several issues here:
(1) Yes, we should allow UTF-7, UTF-8, UTF-16 (in both endiannesses), UTF-32, GB18030 (GBK is a subset of this);
     but they have one thing in common: the end-of-line can be easily identified without decoding.
(2) It is much easier to consider different types for an "encoded element" and "string char"; for example, in UTF-8, the "encoded element" is a char (byte),
but the decoded string will be a string of char, char16_t or char32_t depending on the requirement. It is not convenient to deal with a UTF-8 string
as a string of char if you want to use other languages (Greek, Chinese, etc.). It's easier to convert from a string of char to a string of char16_t and deal with
 a string type. Potentially, and I stress potentially, it is possible to create a whole class that deals with the encoding, but then it will be a new string class;
if you don't provide proper conversion to an array it will be inefficient: imagine if you've got a several-page document and you'd like to replace 2-byte elements
with 3-byte ones (say, you use UTF-8); it will be very, very inefficient.
(3) It is possible to create encode() and decode() functions that allow move, which makes it possible to embrace both worlds (for UTF-7 you'll just pass the string through without any changes). 
 


 

Mikhail Semenov

unread,
May 8, 2013, 8:21:49 AM5/8/13
to std-pr...@isocpp.org
I am not speaking about how to do line-breaking of text without end-of-lines, but the fact that for most encodings available (not all of them), the end-of-line can be easily identified, but you need to know what encoding is used in the text in question.

 

--

Nicol Bolas

unread,
May 8, 2013, 8:28:06 AM5/8/13
to std-pr...@isocpp.org


On Wednesday, May 8, 2013 5:21:49 AM UTC-7, Mikhail Semenov wrote:
On 8 May 2013 12:48, DeadMG <wolfei...@gmail.com> wrote:
On Wednesday, May 8, 2013 8:48:37 AM UTC+1, Mikhail Semenov wrote:
Do we really need fits all encoding, or shall we deal with typical cases used to cover most languages? Besides, there is a case for the end-of-line as well:
you can easily ideintify it by one encoded element (depending on the size of the encoded element: 1 , 2 or 4 bytes) with the same code (0x10).
That makes it easier to split the initial text into lines.

No, no, it doesn't. I seriously have to question how much you know what we are even talking about here. The Unicode Standard provides a line-break algorithm and they provide it for a reason, and that reason is "Split on "\n"" doesn't work. 



I am not speaking about how to do line-breaking of text without end-of-lines, but the fact that for most encodings avaliable (not all of them), the end-of-line can be easily identified, but you need to know what encoding is used in the text in question.


Um, yes. To understand any string of text, you need to know what encoding it is in. That includes the EOL character, but it also includes every other character. So why are you singling out EOL as something special?

 

Nicol Bolas

unread,
May 8, 2013, 8:31:41 AM5/8/13
to std-pr...@isocpp.org
On Wednesday, May 8, 2013 5:08:43 AM UTC-7, Mikhail Semenov wrote:
There are several issues here:
(1) Yes, we should allow UTF-7, UTF-8, UTF-16 (including 2 endians), UTF-32, GB81030 (GBK is a subset of this);
     but they have oen thing in common that the end-of-line can be easily identified without decoding.

What does the "end-of-line" character have to do with anything? Who cares about how easy or not easy it is to identify the EOL character?

(2) It is much easier to consider different types for an "encoded element" and "string char"; for example, in UTF-8, the "encoded element" is a char (byte),
but decoded string will be a string of char, char16_t or char32_t depending on the requirement.

Um, no it won't.

A Unicode encoding specifies a mapping between a sequence of code units (where each code unit is some particular bit-depth, as specified by the encoding) and a sequence of codepoints, where each codepoint is a Unicode codepoint of 21-bits in size (which can be stored in larger types for convenience).

A UTF-8-encoded sequence of code units can be decoded to a sequence of codepoints, which can then be re-encoded into a sequence of code units of some other Unicode encoding. But a Unicode encoding can only be decoded into a sequence of codepoints. Not UTF-16, UTF-7, or any other encoding. Just codepoints.
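
For a concrete instance of that mapping: the single code point U+4E2D takes three code units in UTF-8 but one in UTF-16 and one in UTF-32 (C++11 literals):

#include <string>

std::string    s8  = u8"\u4E2D";  // 3 code units: 0xE4 0xB8 0xAD
std::u16string s16 = u"\u4E2D";   // 1 code unit:  0x4E2D
std::u32string s32 = U"\u4E2D";   // 1 code unit:  0x00004E2D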

It is not convenient to deal with a UTF-8 string
as a string of char

Nobody's suggesting that a UTF-8 string be treated "as a string of char", regardless of what language it is. Well, outside of using C APIs that only take `char*`s. We're suggesting that any Unicode encoding be treated as a series of codepoints, with operations like insertion, removal, and so forth on codepoint-based boundaries and iterators.

Mikhail Semenov

unread,
May 8, 2013, 8:49:13 AM5/8/13
to std-pr...@isocpp.org
OK. The point I was making is that if you have UTF-32, UTF-16, GB18030 or UTF8 (UTF-7 if you like), you can easily find end-of-lines in the text and then decode it line by line. That's all. You don't have to do it, if you've got a solid block of text without eols. If you've got a Visual C++, Intel or a GCC compiler, you will easily find eols without decoding the text.
 
Of course, you may invent an encoding where you have to decode all the previous characters before you hit the eol: it's easy to do so.
 
To be honest, I don't like reference to C: it's looking backwards.
 

 
---------- Forwarded message ----------
From: Nicol Bolas <jmck...@gmail.com>
Date: 8 May 2013 13:31
Subject: Re: [std-proposals] Re: Committee feedback on N3572
To: std-pr...@isocpp.org


Martinho Fernandes

unread,
May 8, 2013, 9:22:05 AM5/8/13
to std-pr...@isocpp.org
On Wed, May 8, 2013 at 2:49 PM, Mikhail Semenov <mikhailse...@gmail.com> wrote:
you can easily find end-of-lines in the text and then decode it line by line.

Unless you plan on decoding only randomly accessed lines, I see little benefit in not having to decode the text so that you can decode it immediately afterwards.

To be honest, I don't like reference to C: it's looking backwards.

You may pretend that reality does not exist all you want. It won't make it disappear.

Mikhail Semenov

unread,
May 8, 2013, 9:49:12 AM5/8/13
to std-pr...@isocpp.org
I apologise for my brusque comment.
 
The reality is that if somebody is using UTF-16 or UTF-32, it's just easier to use them as they are with char16_t and char32_t and probably without any decoding.
In this case, why should I be talking about a string of char? That's all. I think a lot of people are speaking about UTF-8, which obviously is a string of char (or an array
of bytes, if you wish).
 
 

--

Martinho Fernandes

unread,
May 8, 2013, 10:38:38 AM5/8/13
to std-pr...@isocpp.org
On Wed, May 8, 2013 at 3:49 PM, Mikhail Semenov <mikhailse...@gmail.com> wrote:
The reality is that if somebody is using UTF-16 or UTF-32, it's just easier to use them as they are with char16_t and char32_t and probably without any decoding.

And I believe the burden of proof is on you that it is easier to manipulate such strings for some reason. I hope your reasoning does not involve pretending UTF-16 is UCS-2.
 
In this case, why should I be talking about a string of char? That's all. I think a lot of people a speaking about UTF-8, which obviously is a string of char (or an array
of bytes, if you wish).

As I said before, from what I gather, a lot of people here are speaking about strings that abstract the encoding away. With their ideal interface the user does not see the encoding getting in their way: all such strings, regardless of encoding, provide the same interface that treats code points, not 8-bit bytes, not 16-bit words, not 32-bit words, as the basic unit of text.

In normal usage of such strings there is no encoding. This is in the same vein of using primitive types like int: when you use int there is no endianness, there is no two's complement; there are only numbers. The language gives you operations that are completely agnostic of the underlying representation. It doesn't make sense to ask whether + for ints is little endian or big endian: it operates on numbers, not ordered sequences of bytes.

Sometimes, particularly when crossing interoperation boundaries, it is important to go beyond the numbers, and have some control over the representation, like when you send numbers across the network with things like htonl() and ntohl().

Nicol (please, correct me if I am wrong) wants to have the same ability for handling text: normal operations on such ideal strings are such that talking about encoding when related to them does not even make sense; and yet you don't discard the possibility of picking specific representations for crossing boundaries.
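
In code, that might look something like the following (every name here is hypothetical, only to show where the representation is allowed to matter):

// Boundaries: the representation is visible and chosen explicitly.
text t = text::from_utf8(bytes_from_the_network);
// Interior: operations are on code points; no encoding in sight.
t += other_text;
// Boundary again: encode for a UTF-16-expecting API.
std::u16string wire = t.to_utf16();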

You appear to keep insisting on processing strings based on their raw code unit form. If that is all you want I don't even know why you are wasting your time here. You don't need anything new from C++ if you want to deal with code unit sequences directly. std::string, std::u16string, and std::u32string (or maybe std::wstring as well if you are into that) are pretty much that: sequence containers of code units.

Zhihao Yuan

unread,
May 8, 2013, 11:06:36 AM5/8/13
to std-pr...@isocpp.org
On Wed, May 8, 2013 at 10:38 AM, Martinho Fernandes
<martinho....@gmail.com> wrote:
> In normal usage of such strings there is no encoding. This is in the same
> vein of using primitive types like int: when you use int there is no
> endianness, there is no two's complement; there are only numbers. The
> language gives you operations that are completely agnostic of the underlying
> representation. It doesn't make sense to ask whether + for ints is little
> endian or big endian: it operates on numbers, not ordered sequences of
> bytes.

Yes, that might be the final answer. We need a class type, namely 'unicode'
or whatever. Its representation is totally implementation-defined. A library
can choose UTF-8, UTF-16, UTF-32, GB18030, UTF-EBCDIC, homemade, anything.
But when you do `s[n]`, you get an object of type 'codepoint'.

--
Zhihao Yuan, ID lichray
The best way to predict the future is to invent it.
___________________________________________________
4BSD -- http://4bsd.biz/

Mikhail Semenov

unread,
May 8, 2013, 11:09:01 AM5/8/13
to std-pr...@isocpp.org
I think you are mistaking me for someone else.  I just want a good interface for converting between various encodings and to be able to deal with various files that use them.
I actually do not like "behind the scenes" code: it can interfere with the transfer of data. Let the user decide when to encode and decode strings.


 

Martinho Fernandes

unread,
May 8, 2013, 11:42:09 AM5/8/13
to std-pr...@isocpp.org
On Wed, May 8, 2013 at 5:06 PM, Zhihao Yuan <lic...@gmail.com> wrote:
On Wed, May 8, 2013 at 10:38 AM, Martinho Fernandes
<martinho....@gmail.com> wrote:
> In normal usage of such strings there is no encoding. This is in the same
> vein of using primitive types like int: when you use int there is no
> endianness, there is no two's complement; there are only numbers. The
> language gives you operations that are completely agnostic of the underlying
> representation. It doesn't make sense to ask whether + for ints is little
> endian or big endian: it operates on numbers, not ordered sequences of
> bytes.

Yes, that might be the final answer.  We need a class type, namely 'unicode'
or whatever.  Its representation is totally implementation-defined.

Yes, and the original point of contention here was about that "totally implementation-defined" bit, which the committee seemed to prefer.

I don't agree with it. It might have been a good choice in a greenfield environment, but I think the existing ecosystem is too fractured to make that the best option. I agree with Nicol that we should allow the user to decide what underlying representation will be the cheapest for their purposes. If I need to interop with environments that expect ENCODINGX all the time, I would appreciate having the option of not paying any price for transcoding on those boundaries. (This goes back to the "don't pay for what you don't use" mantra.)

And FWIW, I don't understand why you cannot have both if you really want to. Consider the following.

template <typename Encoding>
class generic_unicode_string;

using implementation_defined_unicode_string = generic_unicode_string<implementation_defined_encoding>;

What drawbacks would this approach have?

Mikhail Semenov

unread,
May 8, 2013, 12:14:09 PM5/8/13
to std-pr...@isocpp.org
When I mentioned the following encoding classes, I meant that some of them can be defined as the implementation-defined encoding:
template <class EncodingElement, class CharType>
class encoding
{
public:
    virtual std::basic_string<EncodingElement> encode(const std::basic_string<CharType>& str) = 0;
    virtual std::basic_string<CharType> decode(const std::basic_string<EncodingElement>& str) = 0;
};
Then particular encoding classes can be implemented:
class encoding_utf8_char32: public encoding<char, char32_t>
{
...
};
class encoding_utf8_char16: public encoding<char, char16_t>
{
...
};
class encoding_utf16_char32: public encoding<char16_t, char32_t>
{
...
};
class encoding_GB18030_char32: public encoding<char, char32_t>
{
...
};
 
etc...

The only point is that there can be several of them. For example, UTF-8 can be converted to char, char16_t or char32_t depending on what the user prefers.
 

--


Mikhail Semenov

unread,
May 8, 2013, 12:19:20 PM5/8/13
to std-pr...@isocpp.org
When I mentioned the following classes, I meant that some of them could be treated as implementation-defined conversions.
template <class EncodingElement, class CharType>
class encoding
{
public:
    virtual std::basic_string<EncodingElement> encode(const std::basic_string<CharType>& str) = 0;
    virtual std::basic_string<CharType> decode(const std::basic_string<EncodingElement>& str) = 0;
};
Then particular encoding classes can be implemented:
class encoding_utf8_char32: public encoding<char, char32_t>
{
...
};
class encoding_utf8_char16: public encoding<char, char16_t>
{
...
};
class encoding_utf16_char32: public encoding<char16_t, char32_t>
{
...
};
class encoding_GB18030_char32: public encoding<char, char32_t>
{
...
};
 
 
For example, if UTF-8 is the implementation-defined encoding, then at least the following three conversions should exist: to a string of char, to a string of char16_t (maybe even ignoring surrogates), and to a string of char32_t, depending on what the user wants.

 

--

Nicol Bolas

unread,
May 8, 2013, 12:27:13 PM5/8/13
to std-pr...@isocpp.org


On Wednesday, May 8, 2013 9:14:09 AM UTC-7, Mikhail Semenov wrote:
When I mentioned the following encoding, some of the classes can be defined as implementation defined encoding:
class encoding
{
public:       
    virtual std::basic_string<EncodingElement> encode(const std::basic_string<CharType>& str) = 0;
    virtual std::basic_string<CharType> decode(const std::basic_string<EncodingElement>& str) = 0;   
};

First, we won't be using inheritance. It can't do the things we need to do. For example "EncodingElement" is a type that changes based on the encoding. Which you can't do with virtual functions. You also can't specialize iterators, which is important since most of the algorithms work on codepoint iterators. Oh, and there's no reason to throw performance away on virtual function overhead.

Second, it won't be using basic_string. The entire point of an encoded string is that you treat it like a sequence of codepoints. Nothing in the `basic_string` API can handle that. It must be a new type.

So pretty much everything about this suggestion is a bad idea.
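
What is being described instead is the encoding as a compile-time policy rather than a runtime interface; a rough, illustrative sketch:

#include <string>

// Sketch only: nested types and static primitives, no virtual dispatch.
struct utf8_encoding {
    using code_unit  = char;       // varies per encoding; impossible through a virtual base
    using code_point = char32_t;

    // static code_point decode(const code_unit*& p);
    // static void       encode(code_point cp, std::basic_string<code_unit>& out);
    // plus an iterator type that decodes code points on the fly
};

// An encoded_string<utf8_encoding> can then choose its storage and iterator
// types from these nested names at compile time.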

Mikhail Semenov

unread,
May 8, 2013, 12:37:43 PM5/8/13
to std-pr...@isocpp.org
(1) Are you saying that the Committee is happy with the idea of an Encoded String class?
 
(2) My proposal was to use the encoding class only for conversion. You can encode and decode the whole text in one go.
 
(3) I was thinking about having iterators as well in a different setting (without encoding/decoding). They could also crawl through a string (say a UTF-8 string), but they won't work if you'd like to replace, say, a 2-byte code with a 3-byte one inside a string: too inefficient!

 

--

Martinho Fernandes

unread,
May 8, 2013, 12:39:14 PM5/8/13
to std-pr...@isocpp.org
On Wed, May 8, 2013 at 5:09 PM, Mikhail Semenov <mikhailse...@gmail.com> wrote:
I think you are mistaking me for someone else.

I apologize if that is the case. I got the idea from statements like "It is not convenient to deal with a UTF-8 string as a string of char" and similar. No one here is arguing for dealing with a UTF-8 string as a string of char, so I cannot really understand why anyone would keep arguing about it.
 
I just want a good interface for manipulate between various encodings and to be able to deal with various files that use them.

But other people seem to want more than encoding conversions. Unicode is not encodings and I believe encodings should be the least important thing of all. Please note all the generic algorithms to handle text that were included in the proposal. And FWIW, C++11 already has encoding conversions for the UTF encodings in it.

Martinho Fernandes

unread,
May 8, 2013, 12:44:28 PM5/8/13
to std-pr...@isocpp.org
On Wed, May 8, 2013 at 6:37 PM, Mikhail Semenov <mikhailse...@gmail.com> wrote:
(3) I was think about have iterators as well in a different settings (without encoding/decoding). They are also to crawl through a string (say a UTF-8 string), but they won't work if the you'd like to replace say a 2-byte code with a 3-byte one inside a string: too inefficient!

Is this a relevant use case? What about wanting to replace "Århus" with "Москва"? Or even in an ASCII string, where every character is a single byte, what about wanting to replace "Paris" with "Moskva"?

Mikhail Semenov

unread,
May 8, 2013, 1:23:09 PM5/8/13
to std-pr...@isocpp.org
It depends what you want to do. I meant something like this:
 

for (char32_t& x : a_utf_string)
{
    if (x == U'a')
    {
        x = U'π';
    }
}

You may allow such manipulations even if they are inefficient. But is it worth it: we would have to resize the array.

 

 

Nicol Bolas

unread,
May 8, 2013, 10:09:19 PM5/8/13
to std-pr...@isocpp.org


On Wednesday, May 8, 2013 9:37:43 AM UTC-7, Mikhail Semenov wrote:
(1) Are you say that the Committee is happy with the idea of an Ecoded String class?
 
(2) My proposal was to use the ecoding class only for conversion. You can encode and decode the whole text in one go.

... and? Look at the proposal; it already has transcoding support for "the whole text in one go".

(3) I was think about have iterators as well in a different settings (without encoding/decoding). They are also to crawl through a string (say a UTF-8 string), but they won't work if the you'd like to replace say a 2-byte code with a 3-byte one inside a string: too inefficient!

Codepoint iterators would only provide value access to codepoints. You can't set a codepoint via a codepoint iterator. The only encoding where setting a codepoint by iterator would ever work (without the container) would be UTF-32.

Encoded string would have the ability to insert codepoints or codepoint ranges into explicit locations in the string (locations denoted by codepoint iterators).

Really, just look at the proposal sometime. It's got all this stuff in there, fairly well specified.

Lawrence Crowl

unread,
May 9, 2013, 2:39:13 PM5/9/13
to std-pr...@isocpp.org
On 5/8/13, Nicol Bolas <jmck...@gmail.com> wrote:
> On May 8, 2013, Mikhail Semenov wrote:
> > (1) Are you say that the Committee is happy with the idea of
> > an Ecoded String class?
> >
> > (2) My proposal was to use the ecoding class only for
> > conversion. You can encode and decode the whole text in one go.
>
> ... and? Look at the proposal; it already has transcoding support
> for "the whole text in one go".
>
> > (3) I was think about have iterators as well in a different
> > settings (without encoding/decoding). They are also to crawl
> > through a string (say a UTF-8 string), but they won't work if
> > the you'd like to replace say a 2-byte code with a 3-byte one
> > inside a string: too inefficient!
>
> Codepoint iterators would only provide value access to
> codepoints. You can't set a codepoint via a codepoint iterator. The
> only encoding where setting a codepoint by iterator would ever work
> (without the container) would be UTF-32.

I think an output iterator appending to the string would handle
codepoints just fine in any encoding. Indeed, that work must
effectively be done by any transcoder. We might as well make the
primitive available.

> Encoded string would have the ability to insert codepoints or
> codepoint ranges into explicit locations in the string (locations
> denoted by codepoint iterators).
>
> Really, just look at the proposal sometime. It's got all this
> stuff in there, fairly well specified.

--
Lawrence Crowl

Nicol Bolas

unread,
May 9, 2013, 9:16:18 PM5/9/13
to std-pr...@isocpp.org


On Thursday, May 9, 2013 11:39:13 AM UTC-7, Lawrence Crowl wrote:
On 5/8/13, Nicol Bolas <jmck...@gmail.com> wrote:
> On May 8, 2013, Mikhail Semenov wrote:
> > (1) Are you say that the Committee is happy with the idea of
> > an Ecoded String class?
> >
> > (2) My proposal was to use the ecoding class only for
> > conversion. You can encode and decode the whole text in one go.
>
> ... and? Look at the proposal; it already has transcoding support
> for "the whole text in one go".
>
> > (3) I was think about have iterators as well in a different
> > settings (without encoding/decoding). They are also to crawl
> > through a string (say a UTF-8 string), but they won't work if
> > the you'd like to replace say a 2-byte code with a 3-byte one
> > inside a string: too inefficient!
>
> Codepoint iterators would only provide value access to
> codepoints. You can't set a codepoint via a codepoint iterator. The
> only encoding where setting a codepoint by iterator would ever work
> (without the container) would be UTF-32.

I think an output iterator appending to the string would handle
codepoints just fine in any encoding.  Indeed, that work must
effectively be done by any transcoder.  We might as well make the
primitive available.

Those are iterators based on containers, not ranges. I'm talking about doing something like std::for_each and modifying the codepoints in-situ. That's not reasonable with pure iterator logic.
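
For contrast, a sketch of the output-iterator route (assuming a hypothetical encoded_string with a code-point push_back, which is what std::back_inserter needs):

#include <algorithm>
#include <iterator>

// Rather than mutating code points in place, re-encode into a fresh string
// through an output iterator; the container handles any change in length.
// Given, hypothetically, in : encoded_string<utf8>:
encoded_string<utf8> out;
std::transform(in.begin(), in.end(), std::back_inserter(out),
               [](char32_t cp) { return cp == U'a' ? U'\u03C0' : cp; });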

Tony V E

unread,
May 9, 2013, 11:58:41 PM5/9/13
to std-pr...@isocpp.org
You could use a proxy-based iterator, I suppose. Obviously it has trade-offs.

Sent from my portable Analytical Engine


From: "Nicol Bolas" <jmck...@gmail.com>
To: "std-pr...@isocpp.org" <std-pr...@isocpp.org>
Sent: 9 May, 2013 10:16 PM

Subject: Re: [std-proposals] Re: Committee feedback on N3572

Martinho Fernandes

unread,
May 10, 2013, 6:28:04 AM5/10/13
to std-pr...@isocpp.org
On Thu, May 9, 2013 at 8:39 PM, Lawrence Crowl <cr...@googlers.com> wrote:
I think an output iterator appending to the string would handle
codepoints just fine in any encoding.  Indeed, that work must
effectively be done by any transcoder.  We might as well make the
primitive available.

Wouldn't that simply be std::back_inserter?

Mikhail Semenov

unread,
May 11, 2013, 12:06:00 PM5/11/13
to std-pr...@isocpp.org
Lawrence,
 
Could you tell me, please, what is the situation with 3398?
Has it been approved?
 
Mikhail.
 

 