> One thing I found surprisingly different / radical about Go was its
> Unicode handling. Strings are compared bytewise, length returns the
> number of bytes, [] returns a byte, yet all string literals are
> assumed to be UTF-8 Unicode and the for loop iterates using runes.
String literals need not contain valid UTF-8. You can use \x and
friends to put arbitrary bytes in string literals.
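
The behaviour under discussion can be sketched as follows. This is only an illustration: the sample string "héllo" and the helper runeCount are made up for this example and do not come from the thread.

```go
package main

import "fmt"

// runeCount is an illustrative helper: it counts the code points
// that a range loop yields for s.
func runeCount(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

func main() {
	s := "héllo" // é is U+00E9, which takes two bytes in UTF-8

	fmt.Println(len(s))       // 6: len counts bytes
	fmt.Println(runeCount(s)) // 5: range decodes code points
	fmt.Printf("%#x\n", s[1]) // 0xc3: indexing yields a single byte

	// range reports the byte offset at which each rune starts.
	for i, r := range s {
		fmt.Printf("%d: %U\n", i, r)
	}

	// A \x escape puts a raw byte into the value; comparison is bytewise.
	raw := "\xff"
	fmt.Println(len(raw), raw > s) // 1 true
}
```

Note that the offsets printed by the loop skip from 1 to 3, because the second rune occupies two bytes.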
> That's a bit odd, and yet there seems to be minimal doc and rationale
> on this. How many newbies won't realize a map of strings doesn't do
> Unicode normalization of the keys? How exactly do I get a map of
> UnicodeNormalizedStrings? Where is the Unicode FAQ, for people
> coming from other languages? For god's sake, the Wikipedia page for Go
> has one word matching 'unicode'... and your FAQ isn't any better!
We are actively working on support for normalization, collation, etc.,
but it's a complex subject and it's important to get it right. You can
see some initial ideas in the exp/norm package.
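
To make the map-key concern concrete, here is a small sketch using only the standard library. The constants and the helper distinctKeys are illustrative names, not anything from the thread. It shows two spellings of "é" that look the same on screen but are different keys, because keys are compared bytewise. Getting a single entry would mean normalizing keys before insertion with a normalization package, which is not shown here.

```go
package main

import "fmt"

// Two spellings of "é" that render identically but differ bytewise.
const (
	composed   = "\u00e9"  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
	decomposed = "e\u0301" // U+0065 followed by U+0301 COMBINING ACUTE ACCENT
)

// distinctKeys is an illustrative helper: it reports how many
// entries a map holds after the given keys are inserted.
func distinctKeys(keys ...string) int {
	m := make(map[string]bool)
	for _, k := range keys {
		m[k] = true
	}
	return len(m)
}

func main() {
	fmt.Println(composed == decomposed)             // false
	fmt.Println(len(composed), len(decomposed))     // 2 3
	fmt.Println(distinctKeys(composed, decomposed)) // 2
}
```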
> Which brings me to my next point.. the doc on map sucks. "The key can
> be of any type for which the equality operator is defined" excuse me,
> what is this operator and how do I define it on my type? Or am I not
> able to at all?
You cannot define equality operators. I assume you are quoting
Effective Go. Right after that clause, the document lists the types
that can be used as map key types.
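
As a sketch of what this means in practice, assuming Go 1 or later (where struct types whose fields are all comparable can be map keys): the type point and the helper label below are hypothetical names used only for illustration. Equality is the built-in field-by-field comparison, and there is no way to override it.

```go
package main

import "fmt"

// point is a hypothetical type. All of its fields are comparable,
// so the built-in == is defined for it and it can be a map key.
type point struct {
	x, y int
}

// label is an illustrative helper that looks up a point by value.
func label(m map[point]string, p point) (string, bool) {
	s, ok := m[p]
	return s, ok
}

func main() {
	m := map[point]string{point{1, 2}: "a"}

	// A separately constructed value with equal fields finds the entry.
	s, ok := label(m, point{1, 2})
	fmt.Println(s, ok) // a true

	// Types without ==, such as slices, are rejected at compile time:
	// var bad map[[]int]string // does not compile: invalid map key type
}
```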
Ian
> Karl <karl.p...@gmail.com> writes:
>
>> One thing I found surprisingly different / radical about Go was its
>> Unicode handling. Strings are compared bytewise, length returns the
>> number of bytes, [] returns a byte, yet all string literals are
>> assumed to be UTF-8 Unicode and the for loop iterates using runes.
>
> String literals need not contain valid UTF-8. You can use \x and
> friends to put arbitrary bytes in string literals.
Why is that even allowed? Why not require people to use a []byte if
they're going to pass around arbitrary bytes?
The answer might be that a []byte is not immutable, but that seems to be
an orthogonal issue, i.e., using a string where you really want an
immutable []byte feels like a misuse of string to me.
--
Martin Geisler
Mercurial links: http://mercurial.ch/
In short, there are three kinds of strings, and you're conflating
them, a common misunderstanding. They are:
1) the substring of the source that lexes into a string literal.
2) a string literal.
3) a value of type string.
Only the first is required to be UTF-8. The second is required to be
written in UTF-8, but its contents are interpreted various ways (*)
and may encode arbitrary bytes. The third can contain any bytes at
all.
Try this on:
var s string = "\xFF語"
Source substring: "\xFF語", UTF-8 encoded. The data:
22 5c 78 46 46 e8 aa 9e 22
String literal: \xFF語 (between the quotes). The data:
5c 78 46 46 e8 aa 9e
The string value (unprintable; this is a UTF-8 stream). The data:
ff e8 aa 9e
And for the record, the characters (code points):
<erroneous byte FF, will appear as U+FFFD if you range over the string value>
語 U+8A9E
Please make a note of it.
-rob
* Examples:
\u1234 \U00012345 ÿ 語 correspond to various numbers of bytes in UTF-8
\xFF and \377 each encode exactly one byte, not UTF-8
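
The third representation, the string value, can be checked with a short program. This is a sketch built around the declaration from the message; the helper runes is an illustrative name that is not part of the thread.

```go
package main

import "fmt"

// runes is an illustrative helper: it collects what a range loop
// decodes from s.
func runes(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	var s string = "\xFF語"

	// The bytes of the string value.
	fmt.Printf("% x\n", s) // ff e8 aa 9e

	// Ranging decodes UTF-8; the stray byte FF is reported as U+FFFD.
	for _, r := range runes(s) {
		fmt.Printf("%U\n", r) // U+FFFD, then U+8A9E
	}
}
```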
> Strings are *not* required to be UTF-8. Go source code *is* required
> to be UTF-8. There is a complex path between the two.
>
> In short, there are three kinds of strings, and you're conflating
> them, a common misunderstanding. They are:
>
> 1) the substring of the source that lexes into a string literal.
> 2) a string literal.
> 3) a value of type string.
This is all perfectly clear.
My question was really why the string type is allowed to carry non-UTF-8
data. Especially when the built-in range construct has a clear
preference for UTF-8 encoded characters.
Interoperability for another. What happens if you're reading a
database and it returns a string in Shift-JIS? If you know it's
Shift-JIS, you can plan for it and use the language to help you. If Go
said "no" that would not be helpful.
But mostly it's because there's no compelling need to restrict them
this way. The programmer should have some freedom in their use.
-rob
There will be a more complete story for Unicode in Go at some point,
but what's there now is less than rudimentary.
-rob
However, using an 8-bit Unicode string to contain arbitrary byte data
sounds like a mistake to me. You'd never know whether the sequences of
valid UTF-8 in the byte data were really characters or something else:
a SJIS source (or a JPEG source) could well contain fragments of valid
UTF-8. Iterating through the SJIS would get the wrong runes, wouldn't
divide up on the right character boundaries, etc. Programming languages
that handle this carefully distinguish between an array of bytes and an
8-bit Unicode string. Often the separation was made after it became
obvious that mixing them up causes no end of errors; one example is
http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
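
The iteration problem described here can be reproduced in a few lines. This is a sketch under stated assumptions: the bytes are assumed to be the Shift-JIS encoding of the three characters 日本語, and decode is an illustrative helper. Ranging over them as if they were UTF-8 yields six runes, none of them the intended characters.

```go
package main

import "fmt"

// sjis holds bytes assumed to be the Shift-JIS encoding of 日本語
// (three characters, two bytes each).
const sjis = "\x93\xfa\x96\x7b\x8c\xea"

// decode is an illustrative helper: it returns the runes that a
// range loop produces when it treats s as UTF-8.
func decode(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	// Every byte that is not valid UTF-8 becomes U+FFFD, and the byte
	// 0x7b is misread as the ASCII character '{'.
	for _, r := range decode(sjis) {
		fmt.Printf("%U ", r)
	}
	fmt.Println()
}
```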