| * The characters of a string are encoded in UTF-16. Decoding UTF-16, which |
| * combines surrogate pairs, yields Unicode code points. Following a similar |
| * terminology to Go we use the name "rune" for an integer representing a |
| * Unicode code point. The runes of a string are accessible through the [runes] |
| * getter. |

I know this is a bit of a bikeshedding issue, and "rune" is used in Go as well, but IMHO CodePoints, or even Java's Character (char), would be a better name.
It cannot be called "character", because with UTF-16 strings, getCharCodeAt(n) now returns a 16-bit int, which is NOT a character in the Unicode sense.
Right now, in Dart, it's impossible to distinguish between an array of ints and an array of Unicode characters represented as ints: both have the type List<int>.
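To make the code unit / code point distinction concrete, here is a small sketch in Java (mentioned in this thread for its Character/char naming; Java strings are UTF-16 just like Dart's). The same string gives different answers depending on whether you index 16-bit code units or decode code points:

```java
public class CodeUnitsVsCodePoints {
    public static void main(String[] args) {
        // "a" followed by U+1F600, which needs a surrogate pair in UTF-16
        String s = "a\uD83D\uDE00";

        System.out.println(s.length());             // 3 UTF-16 code units
        System.out.println((int) s.charAt(1));      // 0xD83D, a lone surrogate -- not a character
        System.out.println(s.codePointAt(1));       // 0x1F600, the actual code point ("rune")
        System.out.println(s.codePoints().count()); // 2 code points
    }
}
```

Note that charAt(1) hands back half of a surrogate pair, which is exactly the "16-bit int that is NOT a character" problem described above.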
Rune (according to Google):

Noun. A letter of an ancient Germanic alphabet, related to the Roman alphabet. A similar mark of mysterious or magic significance.

I like the "mysterious or magic significance" part. Cool word.

To critics: it cannot be called "character", because with UTF-16 strings, getCharCodeAt(n) now returns a 16-bit int, which is NOT a character in the Unicode sense.
Right now, in Dart, it's impossible to distinguish between an array of ints and an array of Unicode characters represented as ints: both have the type List<int>.
In the latest version, there are 3 different terms in use:

1. charCodes, as in
2. "Returns the 16-bit UTF-16 code unit at the given index."
3. runes, as in

Isn't it a bit confusing? Why not use "char" now instead of "rune"?
A very informative rant about Unicode support in popular languages:
http://unspecified.wordpress.com/2012/04/
> The word "Character" has several problems.
>
> 1) Character excludes Non-characters, code points do not:
> http://www.unicode.org/glossary/#noncharacter
And the word "rune" excludes all characters that are not runes.
The prevailing terminology seems to be "code units" and "code points", even if Unicode uses a bunch of synonyms (code value, scalar value, I don't know what else), so what's wrong with sticking to that?
LT
The fromCharCodes name is deliberately a little vague.
Rune is short, easy to type, close enough in meaning to character, but different enough that you initially need to look it up instead of making an incorrect assumption.
I like it.
> > ... and using plain old "decode"?
> Decode WHAT?
factory String.decode(List<int> codes);
Decode a list of integers that can be whatever the implementation takes. They don't necessarily have to encode characters.
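For illustration of what such a constructor would do, Java's String(int[], int, int) constructor is a close analogue of the proposed String.decode(List<int>): it takes plain integers (interpreted as code points) and produces a String, re-encoding each code point as one or two UTF-16 code units. A minimal sketch:

```java
public class DecodeCodePoints {
    public static void main(String[] args) {
        int[] codePoints = {0x48, 0x69, 0x1F600}; // 'H', 'i', U+1F600

        // "Decoding" a list of ints into a String: each code point is
        // re-encoded as one or two UTF-16 code units internally.
        String s = new String(codePoints, 0, codePoints.length);

        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
    }
}
```

This also shows why the decode/encode naming is slippery: the operation decodes integers into text, but it simultaneously encodes code points as UTF-16 under the hood.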
> I can as well argue that "encode" is more appropriate here. We have a code before, and a code after.
No. We have a String after.
> I have a better idea: let's revert to Characters here, and I take "rune" and use it in the meaning of "Representation of Unified Non-Equalities"
:-)
You know, the representations idea is not going to fly, as representations don't have meaning. People don't like to deal with stuff that doesn't mean anything -- it's hard to judge whether it is compatible with what we want.
LT
> People don't like to deal with stuff that doesn't mean anything -- it's hard to judge whether it is compatible with what we want.

Same applies to the current version of rune (let's call it rune-1). It's just a 32-bit integer.
@Ladislav: I found a rant that explains the intricate relationships between Unicode and the Unicode encoding schemes - please take a look. Among other things, it confirms my decode/encode conjecture: whatever you described as "decode" in your proposal is in fact very much encode, and vice versa.
There's another interesting paper to look at. It tells you that there are 3 standard ways to encode a Unicode code point as an int, one of which is UTF-32 (rarely mentioned anywhere).
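As a sketch of those three standard encoding forms (UTF-8, UTF-16, UTF-32) applied to a single code point, again in Java for illustration:

```java
import java.nio.charset.StandardCharsets;

public class ThreeEncodings {
    public static void main(String[] args) {
        int cp = 0x1F600; // one Unicode code point

        // UTF-32: the code point itself, as a single 32-bit integer.
        System.out.printf("UTF-32: %08X%n", cp);

        // UTF-16: one or two 16-bit code units (here, a surrogate pair).
        char[] utf16 = Character.toChars(cp);
        System.out.println("UTF-16 code units: " + utf16.length); // 2

        // UTF-8: one to four bytes.
        byte[] utf8 = new String(utf16).getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 bytes: " + utf8.length);        // 4
    }
}
```

UTF-32 is the one where "code point" and "encoded value" coincide, which is presumably why a 32-bit rune feels like "just an integer" -- it is the identity encoding.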