Wait a second, are you telling us that they want Unicode strings to have one specific encoding that the user has no control over at all, and that writing an application using, say, both UTF-8 and UTF-32 would not be possible?
The idea we wanted for a data type was that users could use a single
type to represent unicode characters, rather than a separate type for
each external encoding. This asserts that it's a good idea to
transcode data as it enters or exits the system and use a specific
encoding inside the system, rather than propagating variable encodings
throughout.
There are a couple options for implementing that:
1) Take Python 3's approach of representing strings in UTF-32, with an
optimization for strings with no code points above 255 (store in 1
byte each) and another optimization for strings with no code points
above 65535 (store in 2 bytes each). Optionally, if the unicode string
is a cord/rope type, then each segment can have the optimization
applied independently.
2) Store the string in UTF-8 or UTF-16 depending on the platform (or
another encoding for less common platforms), and provide, say,
"array_view<const char> as_utf8(std::vector<char>& storage);" and
"array_view<const char16_t> as_utf16(std::vector<char16_t>& storage);"
accessors: these would copy the data if it's stored in the other
encoding, or return a reference to the internal storage if it's
already in the desired encoding. Implementations would be free to
define other accessors, but I suspect these are all the standard
needs.
Option 1 has the benefit of allowing random-access iterators, or at
least indexing, which the ICU folks I spoke to thought would be
useful. Option 2 has the benefit of allowing some, maybe most,
external interactions without copying.
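To make these concrete: option 1's representation could look roughly like this (a sketch, not any proposal's actual layout):

#include <cstddef>

struct flexible_string            // option 1 sketch, Python 3 style
{
    enum width_t { one_byte, two_bytes, four_bytes } width;  // widest code point stored
    void* data;                   // elements of 1, 2 or 4 bytes each
    std::size_t length;           // length in code points
    // operator[] dispatches once on width, then indexes directly.
};

And option 2's accessors could be sketched as follows, assuming a platform whose internal encoding is UTF-8; the array_view stand-in, the unicode_string name and the decoding helper are all illustrative, and well-formed input is assumed:

#include <string>
#include <vector>

template <class T>
struct array_view { T* data; std::size_t size; };  // stand-in for the array_view above

class unicode_string
{
    std::string utf8_;            // internal storage: UTF-8 on this platform
public:
    explicit unicode_string(std::string utf8) : utf8_(std::move(utf8)) {}

    // Already in the internal encoding: return a view, no copy needed.
    array_view<const char> as_utf8(std::vector<char>&) const
    { return { utf8_.data(), utf8_.size() }; }

    // Other encoding: transcode into the caller's storage and view that.
    array_view<const char16_t> as_utf16(std::vector<char16_t>& storage) const
    {
        storage.clear();
        for (std::size_t i = 0; i < utf8_.size(); )
        {
            char32_t cp = decode_one(i);       // advances i
            if (cp < 0x10000)
                storage.push_back(static_cast<char16_t>(cp));
            else                               // needs a surrogate pair
            {
                cp -= 0x10000;
                storage.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
                storage.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
            }
        }
        return { storage.data(), storage.size() };
    }

private:
    // Decodes the UTF-8 sequence starting at i, advancing i past it.
    char32_t decode_one(std::size_t& i) const
    {
        unsigned char b = static_cast<unsigned char>(utf8_[i++]);
        if (b < 0x80) return b;
        int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
        char32_t cp = b & (0x3F >> extra);
        while (extra-- > 0)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8_[i++]) & 0x3F);
        return cp;
    }
};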
2013/4/20 Nicol Bolas <jmck...@gmail.com>:
> On Saturday, April 20, 2013 9:05:05 AM UTC-7, Jeffrey Yasskin wrote:
>>
>> The idea we wanted for a data type was that users could use a single
>> type to represent unicode characters, rather than a separate type for
>> each external encoding. This asserts that it's a good idea to
>> transcode data as it enters or exits the system and use a specific
>> encoding inside the system, rather than propagating variable encodings
>> throughout.
>
> By this logic, we shouldn't have `u8`, `u` and `U` prefixes either. We
> should just have had a single `u` type that would convert it into some
> Unicode representation, depending on the platform.
I agree that this would have been the most natural thing, had the new
character types been introduced that way from the beginning.
Comparison by Unicode code point should work, though it is not perfect; there are culture-specific collation issues.
I think there could be a middle ground to be found here. I was reviewing the original paper, and the only place the encoding parameter was actually used in the interface was to specify the return value for C string interoperation. If, instead, I changed that so you could request a C string of any encoding type from any encoding (so that, for example, c_str() was a template), as is supported by the original traits design, then that would make a polymorphic encoding possible, and the stored encoding would be non-observable, except perhaps in the complexity of requesting a C string.
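Something like the following sketch, with illustrative names and UTF-32 arbitrarily chosen as the hidden stored encoding, shows the shape of that interface:

#include <string>

struct utf16 { typedef char16_t code_unit; };
struct utf32 { typedef char32_t code_unit; };

class unicode_string
{
    std::u32string data_;         // the stored encoding is not observable
public:
    explicit unicode_string(std::u32string s) : data_(std::move(s)) {}

    // c_str() as a template: request a C string in any encoding.
    template <class Encoding>
    std::basic_string<typename Encoding::code_unit> c_str() const;
};

template <>
std::u32string unicode_string::c_str<utf32>() const
{
    return data_;                 // already stored this way: plain copy
}

template <>
std::u16string unicode_string::c_str<utf16>() const
{
    std::u16string out;           // stored differently: transcode on request
    for (char32_t cp : data_)
    {
        if (cp < 0x10000)
            out.push_back(static_cast<char16_t>(cp));
        else                      // encode as a surrogate pair
        {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

Requesting the stored encoding is a plain copy; requesting any other encoding costs a transcode, which is the only way the stored encoding shows through.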
* "The comparison is performed at L3 or greater" means that it's
implementation-defined which level is actually used? Why is that the
right decision?
* What sort of data size is needed to implement this?
Can the data be
shared with an ICU installation?
Can implementations for constrained
environments omit chunks of data at the programmer's discretion?
On Tue, Apr 23, 2013 at 2:02 AM, Jeffrey Yasskin <jyas...@google.com> wrote:
<snip>
> * "The comparison is performed at L3 or greater" means that it's
> implementation-defined which level is actually used? Why is that the
> right decision?

There isn't much of a decision to be made here anyway. The only thing that can be effectively required of an implementation is a minimum level. If an implementation using Ln is conforming, any implementation that uses Ln+1 is conforming as well. That means the "or greater" part in the text is actually redundant, since requiring L3 does not forbid L4 implementations (and there is no reason I can think of to forbid them).
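For reference, ICU already exposes these levels as collation "strengths" (L1 through L4 map onto PRIMARY through QUATERNARY). A sketch, with error handling elided; the ICU calls are real, the function name is illustrative:

#include <memory>
#include <unicode/coll.h>
#include <unicode/unistr.h>

bool less_at_L3(const icu::UnicodeString& a, const icu::UnicodeString& b)
{
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Collator> coll(
        icu::Collator::createInstance(icu::Locale::getFrench(), status));
    coll->setStrength(icu::Collator::TERTIARY);    // compare at L3
    return coll->compare(a, b, status) == UCOL_LESS;
}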
C++ is not Python. Stop trying to turn it into a low-rent version of Python. We don't use C++ because it's easy; we use it because it is powerful. We shouldn't throw away power just to allow slightly easier usage. We don't need a one-size-fits-all Unicode string. Give us choices.
I don't see why anyone would need a string that is explicitly unaware of its encoding. What does that gain you?
On Wed, Apr 24, 2013 at 12:46 PM, Nicol Bolas <jmck...@gmail.com> wrote:
I don't see why anyone would need a string that is explicitly unaware of its encoding. What does that gain you?
Don't we have that as std::basic_string already, anyway?
And if "conveying the bits across module boundaries" isn't talking about serialization, what's wrong with just passing C++ types? You know what string type the module uses, so just use the string type it uses, and everyone's fine.
unicode_string<utf8> str = module_a::get_some_string(...);
module_b::use_some_string(..., str);
How? This is what the mediating module would look like:
unicode_string<utf8> str = module_a::get_some_string(...);
module_b::use_some_string(..., str);
Whatever Unicode encoding `module_a::get_some_string` returns, `str` will always be UTF-8 encoded. It will simply transcode the return value. Whatever Unicode encoding `module_b::use_some_string` takes, `str` will be transcoded into it as needed.
That would assume that the transcoding cost is ok for the mediating part. I don't think that's the general case.
On 24 April 2013 16:01, Nicol Bolas <jmck...@gmail.com> wrote:
How? This is what the mediating module would look like:
unicode_string<utf8> str = module_a::get_some_string(...);
module_b::use_some_string(..., str);
I don't think that's what it would look like. It's using unicode_string<utf8> there, and potentially unicode_string<something_else> elsewhere, which is the explosion of types I mentioned.
Whatever Unicode encoding `module_a::get_some_string` returns, `str` will always be UTF-8 encoded. It will simply transcode the return value. Whatever Unicode encoding `module_b::use_some_string` takes, `str` will be transcoded into it as needed.
That would assume that the transcoding cost is ok for the mediating part. I don't think that's the general case.
On Wednesday, April 24, 2013 6:06:52 AM UTC-7, Ville Voutilainen wrote:
I don't think that's what it would look like. It's using unicode_string<utf8> there, and potentially unicode_string<something_else> elsewhere, which is the explosion of types I mentioned.
I'm still not clear on the problem. They're all inter-compatible; so what if they use UTF8 in some places and UTF16 in others? It won't break anything; they'll just get degraded performance due to user error.
Why should the standard be responsible for people who can't settle on a convention?
The only way for transcoding to be avoided is if none of the modules between the producing module and the consuming one does anything with the string that requires a specific encoding. Each must treat the string as a sequence of codepoints that are properly Unicode formatted and arranged. So the consuming module can't write it to a file, write it to a stream (not without some serious upgrades to iostream to start taking codepoint sequences), send it across the internet, or perform any number of other operations that need the actual encoding.
The general rubric for C++ is (and should be): you accept whatever, convert it ASAP into your standard encoding, do any manipulation in that encoding, and then convert it if some specific API needs a different encoding. This is how it must be, because the entire C++ world is not going to suddenly switch to our new Unicode string. There will still be many APIs that only accept a specific encoding.
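That rubric is already expressible with plain C++11 facilities; a minimal sketch for a program whose chosen internal encoding is UTF-8 (function names illustrative):

#include <codecvt>
#include <locale>
#include <string>

std::string accept_utf16(const std::u16string& in)   // convert ASAP on entry
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(in);
}

std::u16string emit_utf16(const std::string& utf8)   // convert only on exit
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}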
I don't see the need for the standard to support an alternate way of using Unicode strings.
As for the performance issue, I don't see how you can make a performance-based case for `any_unicode_string` at all. `any_unicode_string` will have significantly degraded access performance, since it will have to use type erasure to store and access the actual data.
On 24 April 2013 16:38, Nicol Bolas <jmck...@gmail.com> wrote:
On Wednesday, April 24, 2013 6:06:52 AM UTC-7, Ville Voutilainen wrote:
I don't think that's what it would look like. It's using unicode_string<utf8> there, and potentially unicode_string<something_else> elsewhere, which is the explosion of types I mentioned.
I'm still not clear on the problem. They're all inter-compatible; so what if they use UTF8 in some places and UTF16 in others? It won't break anything; they'll just get degraded performance due to user error.
Yes, "they'll". The question is who's "them", and where.
Why should the standard be responsible for people who can't settle on a convention?
In order to be useful?
The general rubric for C++ is (and should be): you accept whatever, convert it ASAP into your standard encoding, do any manipulation in that encoding, and then convert it if some specific API needs a different encoding. This is how it must be, because the entire C++ world is not going to suddenly switch to our new Unicode string. There will still be many APIs that only accept a specific encoding.
I don't see the need for the standard to support an alternate way of using Unicode strings.
The "accept whatever" includes accepting a general unicode type. Do we suddenly agree completely?
As for the performance issue, I don't see how you can make a performance-based case for `any_unicode_string` at all. `any_unicode_string` will have significantly degraded access performance, since it will have to use type erasure to store and access the actual data.
You're reaching too far and too early into implementation details if you think it *has to* use erasure.
I'm still not clear on the problem. They're all inter-compatible; so what if they use UTF8 in some places and UTF16 in others? It won't break anything; they'll just get degraded performance due to user error.
Yes, "they'll". The question is who's "them", and where.
I'm not sure what you're getting at here. "Where" will be when they transcode strings. Transcoding can't be hidden in this system.
Why should the standard be responsible for people who can't settle on a convention?
In order to be useful?
By that logic, we should also have garbage collection. Because it's "useful".
C++ simply doesn't do things this way. It doesn't tend to have types that could cover anything of a general category.
You're reaching too far and too early into implementation details if you think it *has to* use erasure.
Whether it's type erasure or something else, this iterator access is not going to be a simple pointer access. Each call to `++` or `*` is going to have to do a lot more work than a specific encoder's iterator. It's going to have to figure out which encoding the type actually is, then call an appropriate function based on that.
If all you're talking about is some memory object which cannot be useful until it is transferred into some other object, that's not a "string" by any definition I'm aware of. That's not even an iterator range.
Furthermore, what good is "blasting the raw bits into various sinks"? Ignoring the fact that there's no such thing as a "sink", the raw bits are useless to the recipient without knowing their encoding.
I'd classify the options into two general categories:
1) A unicode string class that presents its contents as a sequence of
code points, without exposing its clients to the sequence of bytes
that underlie these code points. This could be the python-style object
I've been suggesting or could be an object that presents a
bidirectional iterator that converts on the fly.
2) An "encoded" string class that presents its contents as a sequence
of bytes along with a description of the encoding that should be used
to interpret those bytes, probably along with an iterator that can
convert from each encoding.
Neither of these is wrong, but we only want to standardize one, and
it's not totally obvious which is better. (If it's totally obvious to
you, that probably means you're not considering enough viewpoints.)
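For what it's worth, here is a sketch of what category 1's "bidirectional iterator that converts on the fly" could look like over UTF-8 storage; the name is illustrative and well-formed input is assumed:

#include <cstddef>
#include <iterator>

class code_point_iterator
{
    const unsigned char* p_;
public:
    typedef std::bidirectional_iterator_tag iterator_category;
    typedef char32_t value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char32_t* pointer;
    typedef char32_t reference;

    explicit code_point_iterator(const char* p)
        : p_(reinterpret_cast<const unsigned char*>(p)) {}

    char32_t operator*() const                 // decode without advancing
    {
        unsigned char b = *p_;
        if (b < 0x80) return b;
        int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
        char32_t cp = b & (0x3F >> extra);
        for (int i = 1; i <= extra; ++i)
            cp = (cp << 6) | (p_[i] & 0x3F);
        return cp;
    }
    code_point_iterator& operator++()          // skip one whole sequence
    {
        unsigned char b = *p_;
        p_ += (b < 0x80) ? 1 : (b >= 0xF0) ? 4 : (b >= 0xE0) ? 3 : 2;
        return *this;
    }
    code_point_iterator& operator--()          // back up over continuations
    {
        do { --p_; } while ((*p_ & 0xC0) == 0x80);
        return *this;
    }
    friend bool operator==(code_point_iterator a, code_point_iterator b)
    { return a.p_ == b.p_; }
    friend bool operator!=(code_point_iterator a, code_point_iterator b)
    { return a.p_ != b.p_; }
};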
It's far more than just Windows. It's Java, .NET, and every Windows-focused application. You cannot just dump the other encodings.
It is nothing Windows-specific; it's simple compatibility. If you have an existing UTF-16 application that interoperates with those systems, you cannot simply drop UTF-16.
Standardizing a Unicode string as UTF-8 and then only ever using that in new Standard interfaces would be dumping the other encodings.
I can see the argument for a Pythonic mystery encoding, maybe. But there's no way I'd prevent any implementer from setting that encoding to UTF-16 on Windows, and Jeffrey would want to be able to set his to UTF-32 with some storage magic, and so on and so forth. Option 1 vs. Option 2 is a debate, but "UTF-8 everywhere" is not even a question. I will never propose such a thing.
You only need the other encodings at the edges of your app. Think long term. Some day down the road, Windows and Java won't exist, and/or they'll have seen the error of their ways and converted to UTF-8.
On Wed, Apr 24, 2013 at 3:41 PM, Tony V E <tvan...@gmail.com> wrote:
> - UTF8 is size efficient
You joke. UTF-8 uses 3 bytes to encode Asian characters, while any Asian
language-specific encoding needs only 2 bytes. More interestingly, GB18030,
as a full Unicode implementation, can encode any CJK character in 2
bytes. UTF-8 sucks.
Neither of these is wrong, but we only want to standardize one, and
it's not totally obvious which is better. (If it's totally obvious to
you, that probably means you're not considering enough viewpoints.)
On 4/24/13, Tony V E <tvan...@gmail.com> wrote:
> - UTF8 is not *too* iterator inefficient as you never need to go
> more than a few bytes left or right to find the start of a code
> point (ie you don't need to go to the beginning of the string, and
> you can tell if a byte is in the middle of a code point or not).
> Of course, with an iterator, you should never be in the middle
> of a codepoint anyhow.
UTF32 has the fastest iterator performance. It can matter, because
it is decision-less, which makes it viable for use in vector units.
UTF16 is somewhat harder. UTF8 is much harder.
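The "few bytes left or right" property Tony mentions follows from UTF-8's self-synchronization: every continuation byte matches the bit pattern 10xxxxxx, so from any byte position you back up at most three bytes to find the start of the code point. A sketch (illustrative name):

#include <cstddef>

std::size_t find_code_point_start(const unsigned char* s, std::size_t i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)    // 10xxxxxx: continuation byte
        --i;
    return i;
}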
If you don't care what encoding a string uses, then you can just use an encoding-aware string anyway. All of them should be inter-convertible between each other (though it should require explicit conversion). And they should all be buildable from raw data (an iterator range and the encoding of that range). So all you need to do is pick one and you're fine.
I don't see why anyone would need a string that is explicitly unaware of its encoding. What does that gain you?
UTF-16 balances the space usage, and it's very fast. Mixing the concepts
of bytes and strings was C's big fault.
I agree with Lawrence on that: UTF32 is more efficient for representing general Unicode characters.
I think the issue here is that it is difficult to resolve the following two points:
(1) selecting a preferable encoding (for reading from a file, system representation and exchange);
(2) selecting a common string format for internal representation (arrays of characters that we can easily compare).
The reason for the second point is that Unicode itself proposes four different forms of representation (http://unicode.org/reports/tr15/#Examples): NFD, NFC, NFKD and NFKC. I personally prefer NFKC, but then you lose ligatures and other character types.
Point (2) is about creating strings that allow easy access to elements and easy comparison. The comparison is an issue: even French words have a special way of comparison based on accented characters. Other languages have their own specific ways of comparing words, and there may be more than one way of doing so. I think this issue can be left aside.
My suggestion would be to have two basic forms of representation:
(1) encoded strings;
(2) simple strings of characters (char8, char16 and char32).
An implementation should provide:
(a) some forms of encoding for encoded strings;
(b) some conversions between encoded strings and simple strings (of char8, char16 and char32);
(c) in addition to the standard comparison of simple strings (like arrays of elements), conversion routines for various languages.
Nicol, I thought I made it clear. For example, the elements of UTF-8 are bytes: each element does not represent a character (unless it is an ASCII string). You can convert it to a string of char32 so that each character really represents one Unicode character.
On the other hand, if you are only interested in the main coding plane, a string of char16 will be enough. And if you only use European languages, a string of char8 will be fine. In UTF-8, on the other hand, each Unicode character can be coded by 1, 2, 3 ... bytes.
In .NET, Microsoft uses 2-byte characters because in most applications it's enough to use only the main Unicode plane, which covers most characters of most languages.
You cannot use UTF-8 strings, for example, to easily manipulate Chinese characters: each character is represented by several bytes in UTF-8.
There is also GB18030 Standard (for Chinese characters), which is different from Unicode.
Sorry, gentlemen. I think no one is listening to what I am saying.
Ville, I read it again. But I disagree with high-level manipulation of characters that does not use arrays. I would hate to manipulate, for instance, strings in Chinese directly as UTF-8 encoded strings; the same applies to Russian. I need one element per code point. UTF-8 is very good for files, but not for string manipulation (unless, of course, you use ASCII < 128).
Regards,
Mikhail.
I think there should be a base class for encoding:
#include <string>

template <class EncodingElement, class CharType>
class encoding
{
public:
    virtual ~encoding() = default;  // deleting through the base must be safe
    // Converts a plain string of characters into its encoded form.
    virtual std::basic_string<EncodingElement> encode(const std::basic_string<CharType>& str) = 0;
    // Converts an encoded string back into a plain string of characters.
    virtual std::basic_string<CharType> decode(const std::basic_string<EncodingElement>& str) = 0;
};
Then particular encoding classes can be implemented:
class encoding_utf8_char32: public encoding<char, char32_t>
{
...
};
class encoding_utf8_char16: public encoding<char, char16_t>
{
...
};
class encoding_utf16_char32: public encoding<char16_t, char32_t>
{
...
};
class encoding_GB18030_char32: public encoding<char, char32_t>
{
...
};
Inside the program, encoded strings should be decoded when necessary.
Such an approach makes it possible to use various encodings in one program.
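For example (a hypothetical use of the interface above; input_is_gb18030 and raw_bytes are illustrative, and the concrete classes are assumed to be filled in):

#include <memory>

std::unique_ptr<encoding<char, char32_t>> enc;
if (input_is_gb18030)                      // illustrative run-time choice
    enc.reset(new encoding_GB18030_char32());
else
    enc.reset(new encoding_utf8_char32());

std::basic_string<char32_t> text = enc->decode(raw_bytes);  // raw_bytes is a std::string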
Do we really need a one-size-fits-all encoding, or shall we deal with the typical cases used to cover most languages? Besides, there is a case for the end-of-line as well: you can easily identify it by one encoded element (depending on the size of the encoded element: 1, 2 or 4 bytes) with the same code (0x0A). That makes it easier to split the initial text into lines.
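For instance, in UTF-8 (and in other ASCII-transparent encodings such as GB18030) the line-feed byte never occurs inside a multi-byte sequence, so a sketch like this splits lines without decoding (illustrative name):

#include <cstddef>
#include <string>
#include <vector>

std::vector<std::string> split_lines(const std::string& bytes)
{
    std::vector<std::string> lines;
    std::size_t start = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        if (bytes[i] == '\n')              // 0x0A is always a whole code point here
        {
            lines.push_back(bytes.substr(start, i - start));
            start = i + 1;
        }
    }
    lines.push_back(bytes.substr(start));
    return lines;
}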
On 8 May 2013 12:48, DeadMG <wolfei...@gmail.com> wrote:
On Wednesday, May 8, 2013 8:48:37 AM UTC+1, Mikhail Semenov wrote:
Do we really need a one-size-fits-all encoding, or shall we deal with the typical cases used to cover most languages? Besides, there is a case for the end-of-line as well: you can easily identify it by one encoded element (depending on the size of the encoded element: 1, 2 or 4 bytes) with the same code (0x0A). That makes it easier to split the initial text into lines.

No, no, it doesn't. I seriously have to question how much you know about what we are even talking about here. The Unicode Standard provides a line-break algorithm, and they provide it for a reason; that reason is that "split on "\n"" doesn't work.
I am not speaking about how to do line-breaking of text without end-of-lines, but about the fact that for most of the available encodings (not all of them) the end-of-line can be easily identified; you just need to know what encoding is used in the text in question.
There are several issues here:
(1) Yes, we should allow UTF-7, UTF-8, UTF-16 (both endiannesses), UTF-32 and GB18030 (GBK is a subset of this); but they have one thing in common: the end-of-line can be easily identified without decoding.
(2) It is much easier to consider different types for an "encoded element" and a "string char"; for example, in UTF-8 the "encoded element" is a char (byte), but the decoded string will be a string of char, char16_t or char32_t depending on the requirement.
It is not convenient to deal with a UTF-8 string as a string of char, but you can easily find end-of-lines in the text and then decode it line by line.
To be honest, I don't like the reference to C: it's looking backwards.
The reality is that if somebody is using UTF-16 or UTF-32, it's just easier to use them as they are, with char16_t and char32_t, and probably without any decoding.
In this case, why should I be talking about a string of char? That's all. I think a lot of people are speaking about UTF-8, which obviously is a string of char (or an array of bytes, if you wish).
On Wed, May 8, 2013 at 10:38 AM, Martinho Fernandes
<martinho....@gmail.com> wrote:
> In normal usage of such strings there is no encoding. This is in the same
> vein of using primitive types like int: when you use int there is no
> endianness, there is no two's complement; there are only numbers. The
> language gives you operations that are completely agnostic of the underlying
> representation. It doesn't make sense to ask whether + for ints is little
> endian or big endian: it operates on numbers, not ordered sequences of
> bytes.
Yes, that might be the final answer. We need a class type, namely 'unicode' or whatever. Its representation is totally implementation-defined.
When I mentioned the following encoding interface, some of the classes could be defined with an implementation-defined encoding:
template <class EncodingElement, class CharType>
class encoding
{
public:
    virtual std::basic_string<EncodingElement> encode(const std::basic_string<CharType>& str) = 0;
    virtual std::basic_string<CharType> decode(const std::basic_string<EncodingElement>& str) = 0;
};
I think you are mistaking me for someone else.
I just want a good interface for converting between various encodings, and to be able to deal with various files that use them.
(3) I was thinking about having iterators as well, in a different setting (without encoding/decoding). They could also crawl through a string (say a UTF-8 string), but they won't work if you'd like to replace, say, a 2-byte code with a 3-byte one inside a string: too inefficient!
for (char32_t& x : a_utf_string)  // hypothetical: mutable code-point access into an encoded string
{
    if (x == U'a')
    {
        x = U'π';  // a 1-byte UTF-8 sequence replaced by a 2-byte one: the storage must shift
    }
}
You may allow such manipulations even if they are inefficient. But is it worth it? We would have to resize the array.
(1) Are you saying that the Committee is happy with the idea of an Encoded String class?
(2) My proposal was to use the encoding class only for conversion. You can encode and decode the whole text in one go.
(3) I was thinking about having iterators as well, in a different setting (without encoding/decoding). They could also crawl through a string (say a UTF-8 string), but they won't work if you'd like to replace, say, a 2-byte code with a 3-byte one inside a string: too inefficient!
On 5/8/13, Nicol Bolas <jmck...@gmail.com> wrote:
> On May 8, 2013, Mikhail Semenov wrote:
> > (1) Are you saying that the Committee is happy with the idea of
> > an Encoded String class?
> >
> > (2) My proposal was to use the encoding class only for
> > conversion. You can encode and decode the whole text in one go.
>
> ... and? Look at the proposal; it already has transcoding support
> for "the whole text in one go".
>
> > (3) I was thinking about having iterators as well, in a different
> > setting (without encoding/decoding). They could also crawl
> > through a string (say a UTF-8 string), but they won't work if
> > you'd like to replace, say, a 2-byte code with a 3-byte one
> > inside a string: too inefficient!
>
> Codepoint iterators would only provide value access to
> codepoints. You can't set a codepoint via a codepoint iterator. The
> only encoding where setting a codepoint by iterator would ever work
> (without the container) would be UTF-32.
I think an output iterator appending to the string would handle
codepoints just fine in any encoding. Indeed, that work must
effectively be done by any transcoder. We might as well make the
primitive available.
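A sketch of such a primitive: an output iterator whose assignment encodes each code point into UTF-8 and appends it (the name and shape are illustrative):

#include <iterator>
#include <string>

class utf8_appender
{
    std::string* out_;
public:
    typedef std::output_iterator_tag iterator_category;
    typedef void value_type;
    typedef void difference_type;
    typedef void pointer;
    typedef void reference;

    explicit utf8_appender(std::string& out) : out_(&out) {}
    utf8_appender& operator*()     { return *this; }
    utf8_appender& operator++()    { return *this; }
    utf8_appender& operator++(int) { return *this; }

    // Assigning a code point appends its UTF-8 encoding (1 to 4 bytes).
    utf8_appender& operator=(char32_t cp)
    {
        if (cp < 0x80)
            out_->push_back(static_cast<char>(cp));
        else if (cp < 0x800)
        {
            out_->push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out_->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else if (cp < 0x10000)
        {
            out_->push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out_->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out_->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else
        {
            out_->push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out_->push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out_->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out_->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        return *this;
    }
};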