Rune?


Mehmet D. Akın

Feb 13, 2013, 6:08:50 AM
to mi...@dartlang.org
Hi,

In the experimental branch, it seems the name "Rune" is now used to represent Unicode code points.


 * The characters of a string are encoded in UTF-16. Decoding UTF-16, which
 * combines surrogate pairs, yields Unicode code points. Following a similar
 * terminology to Go we use the name "rune" for an integer representing a
 * Unicode code point. The runes of a string are accessible through the [runes]
 * getter.

I know this is a bit of a bike-shedding issue, and "rune" is used in Go as well, but IMHO "CodePoints", or even Java's Character/char, would be a better name.
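
To make the quoted doc comment concrete, here is a minimal sketch, assuming a current Dart SDK in which `String.codeUnits` and `String.runes` behave as that comment describes. The helper names `codeUnitCount` and `runeCount` are made up for illustration and are not part of any library.

```dart
// Hypothetical helpers, named here only for illustration.
int codeUnitCount(String s) => s.codeUnits.length; // UTF-16 code units
int runeCount(String s) => s.runes.length; // Unicode code points

void main() {
  // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual
  // Plane, so UTF-16 stores it as a surrogate pair.
  const clef = '\u{1D11E}';
  print(codeUnitCount(clef)); // 2
  print(runeCount(clef)); // 1
  print(clef.codeUnits.map((u) => u.toRadixString(16)).toList()); // [d834, dd1e]
  print(clef.runes.first.toRadixString(16)); // 1d11e
}
```

For text that stays inside the Basic Multilingual Plane, the two counts are the same, which is probably why the distinction is easy to overlook.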

Ladislav Thon

Feb 13, 2013, 6:13:02 AM
to mi...@dartlang.org

> * The characters of a string are encoded in UTF-16. Decoding UTF-16, which
> * combines surrogate pairs, yields Unicode code points. Following a similar
> * terminology to Go we use the name "rune" for an integer representing a
> * Unicode code point. The runes of a string are accessible through the [runes]
> * getter.
>
> I know this is a bit of a bike shedding issue and rune is used in Go as well, but IMHO, CodePoints or even java's Character - char would be a better name.

+1.

Runes? Are we all writing tengwar now?

LT

Alex Tatumizer

Feb 13, 2013, 9:51:44 AM
to mi...@dartlang.org
Rune (according to Google)
Noun
A letter of an ancient Germanic alphabet, related to the Roman alphabet.
A similar mark of mysterious or magic significance.

I like the "mysterious or magic significance" part. Cool word.

To critics:
It cannot be called "character", because with UTF-16 Strings, getCharCodeAt(n) now returns a 16-bit int, which is NOT a character in the Unicode sense.
Right now, in Dart, it's impossible to distinguish between an array of ints and an array of Unicode characters represented as ints: both have type List<int>.



Ladislav Thon

Feb 13, 2013, 9:54:54 AM
to mi...@dartlang.org
> it cannot be called "character", because with UTF-16 Strings, getCharCodeAt(n) now returns 16-bit int, which is NOT a character in Unicode sense.
> Right now, in dart, it's impossible to distinguish between array of ints and array of Unicode characters represented as ints - both have type List<int>

The Unicode terminology distinguishes between code points (= characters) and code units. Both of them are integers.

LT

Erik Corry

Feb 13, 2013, 9:58:31 AM
to mi...@dartlang.org
See also https://code.google.com/p/dart/issues/detail?id=8519
> --
> Consider asking HOWTO questions at Stack Overflow:
> http://stackoverflow.com/tags/dart
>
>



--
Erik Corry
Google Denmark ApS
Sankt Petri Passage 5, 2. sal
1165 København K - Denmark - CVR nr. 28 86 69 84

Florian Loitsch

Feb 13, 2013, 10:02:59 AM
to General Dart Discussion
On Wed, Feb 13, 2013 at 3:51 PM, Alex Tatumizer <tatu...@gmail.com> wrote:
> Rune (according to google)
> Noun
> A letter of an ancient Germanic alphabet, related to the Roman alphabet.
> A similar mark of mysterious or magic significance.
>
> I like "mysterious or magic significance" part. Cool word.
>
> To critics:
> it cannot be called "character", because with UTF-16 Strings, getCharCodeAt(n) now returns 16-bit int, which is NOT a character in Unicode sense.

FYI: getCharCodeAt(n) is deprecated in favor of getCodeUnitAt(n).

> Right now, in dart, it's impossible to distinguish between array of ints and array of Unicode characters represented as ints - both have type List<int>




--
Give a man a fire and he's warm for the whole day,
but set fire to him and he's warm for the rest of his life. - Terry Pratchett

Alex Tatumizer

Feb 13, 2013, 11:58:15 AM
to mi...@dartlang.org

Indeed:

abstract int codeUnitAt(int index)

Returns the 16-bit UTF-16 code unit at the given index.

Why not use "char" now instead of "rune"?

In the latest version, there are 3 different terms in use:
1. charCodes, as in

factory String.fromCharCodes(Iterable<int> charCodes)

(NOT DEPRECATED!)

2. codeUnits, as in

final Iterable<int> codeUnits


3. runes, as in

final Iterable<int> runes

Isn't it a bit confusing?









Alex Tatumizer

Feb 13, 2013, 12:34:50 PM
to mi...@dartlang.org
A very informative rant about Unicode support in popular languages:
http://unspecified.wordpress.com/2012/04/


Florian Loitsch

Feb 13, 2013, 1:20:41 PM
to General Dart Discussion
On Wed, Feb 13, 2013 at 5:58 PM, Alex Tatumizer <tatu...@gmail.com> wrote:

> Indeed:
>
> abstract int codeUnitAt(int index)
>
> Returns the 16-bit UTF-16 code unit at the given index.
>
> Why not use "char" now instead of "rune"?

Because "char" has a completely different meaning: "é" is generally considered to be one character. However, it might not be represented as a single code unit, or even a single rune. Calling it "character" would be extremely misleading.
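
The "é" case can be sketched as follows, assuming a current Dart SDK. The helper `runesOf` is a hypothetical name used only for this illustration; both strings below typically render as the same visible character.

```dart
// Hypothetical helper: the code points (runes) of a string as a list.
List<int> runesOf(String s) => s.runes.toList();

void main() {
  const precomposed = '\u00E9'; // "é" as the single code point U+00E9
  const decomposed = 'e\u0301'; // "e" followed by U+0301 COMBINING ACUTE ACCENT
  print(runesOf(precomposed)); // [233]
  print(runesOf(decomposed)); // [101, 769]
}
```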
 

> In the latest version, there are 3 different terms in use:
> 1. charCodes, as in

charCodes represent both codeUnits and runes.

> factory String.fromCharCodes(Iterable<int> charCodes)

You could remove this function and add String.fromCodeUnits and String.fromRunes.
It's a trade-off. Currently you can create Strings without really knowing what these integers are. If we split into fromCodeUnits and fromRunes you need to know more about your list (and Unicode).
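
A small sketch of that leniency, assuming a current Dart SDK: `String.fromCharCodes` yields the same string whether it is handed a surrogate pair of code units or the single rune they encode. The two function names are hypothetical and exist only for this example.

```dart
// Build the same string from UTF-16 code units and from a single rune.
// Both function names are illustrative, not part of dart:core.
String fromCodeUnitList() => String.fromCharCodes([0xD83D, 0xDE00]);
String fromRuneList() => String.fromCharCodes([0x1F600]);

void main() {
  print(fromCodeUnitList() == fromRuneList()); // true
  print(fromRuneList().codeUnits.length); // 2
  print(fromCodeUnitList().runes.length); // 1
}
```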
 
> (NOT DEPRECATED!)
>
> 2. codeUnits, as in
>
> final Iterable<int> codeUnits
>
> 3. runes, as in
>
> final Iterable<int> runes
>
> Isn't it a bit confusing?







Florian Loitsch

Feb 13, 2013, 1:31:26 PM
to General Dart Discussion
On Wed, Feb 13, 2013 at 6:34 PM, Alex Tatumizer <tatu...@gmail.com> wrote:
> very informative rant about unicode support in popular languages:
> http://unspecified.wordpress.com/2012/04/

Initially Dart had exactly the Strings he would prefer. After several months we switched to UTF-16.
The biggest reason was Webkit: Webkit is UTF-16 and the interaction with it would have been very expensive. It also caused trouble with Strings that weren't well-formed (not correct UTF-16). The initial spec required a String to be a sequence of Unicode scalar values. That doesn't work if your String comes from Webkit. You also want your strings to be the same after a round-trip through Dart code (for example JS->Webkit->Dart->Webkit->JS).
For all these reasons (and maybe some others I'm currently forgetting) we switched to UTF-16.
Also note that scalar values are not enough either. You still have to deal with combining characters, and similar. Scalar values make it slightly easier to make fewer errors, but you still get things wrong if you don't know about Unicode: even simple things like comparing two strings won't work on scalar values. Sorting probably doesn't do what you want either.
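
Two of these points can be sketched in a few lines, assuming a current Dart SDK where `==` on strings compares UTF-16 code units. The helper `roundTrip` is a hypothetical name for illustration.

```dart
// Hypothetical helper: round-trip a list of code units through a String.
List<int> roundTrip(List<int> codeUnits) =>
    String.fromCharCodes(codeUnits).codeUnits.toList();

void main() {
  // Same visible text, different code points: == compares code units only.
  print('\u00E9' == 'e\u0301'); // false

  // A lone lead surrogate is not well-formed UTF-16, yet it is preserved.
  print(roundTrip([0x61, 0xD800, 0x62])); // [97, 55296, 98]
}
```

Making the two spellings of "é" compare equal would need Unicode normalization, which is outside what this sketch covers.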





Alex Tatumizer

Feb 13, 2013, 3:28:30 PM
to mi...@dartlang.org
@Florian: I agree with almost everything you said, except a couple of points:

> Calling it "character" would be extremely misleading.
I agree it would be misleading, but not "extremely" misleading, just somewhat misleading, and even that residual confusion is due to the uncertainty that still surrounds the very idea of a character.
Note that the standard 10646 is called "Universal character set", and here's what wikipedia says about the differences:
ISO 10646 and Unicode have an identical repertoire and numbers — the same characters with the same numbers exist on both standards

It looks like 2 standards bodies couldn't agree on definitions.

I'm not against rune, it's just it looks like rune means exactly the same thing that is called a "character" in UCS-4  - please correct me if I'm wrong, I'm confused, too! (wikipedia says nothing at all about rune in that sense, so I couldn't figure it out).

> factory String.fromCharCodes(Iterable<int> charCodes)
> It's a trade-off. Currently you can create Strings without really knowing what these integers are. If we split into fromCodeUnits and fromRunes you need to know more about your list (and Unicode).

But if "character" is extremely misleading, what is "char"? Is it not an abbreviation for "character"? (BTW, abbreviations are taboo in Dart; I don't know how this "char" made it that far.) Moreover, in this function, charCodes are in a Schrödinger state, being either UTF-16, or UCS-4, or runes (if those really differ from UCS-4). Is it possible to NOT know about Unicode and still be able to easily manipulate notions like runes, UTF-16, charCodes, etc.?

Erik Corry

Feb 14, 2013, 3:43:03 AM
to mi...@dartlang.org
The word "Character" has several problems.

1) Character excludes Non-characters, code points do not:
http://www.unicode.org/glossary/#noncharacter
2) Character is not a number, it's a character. The rune is its
number. Unlike Smalltalk, Dart has no character type, but short
strings are normally used (like in JS).
3) In the future we might like to use the word 'character' for
composed characters that consist of a base character and perhaps some
accents (represented with multiple runes)

The fromCharCodes name is deliberately a little vague. You can give it
UTF-16 char units or arbitrary 21 runes and it will be liberal in what
it accepts. It is also the name that people from the JS world are
used to. We considered having separate fromCodeUnits and fromRunes
constructors, which would throw exceptions on seeing unexpected input,
but it's hard to see the utility: Normally if someone feeds your
program an unusual set of input codes you prefer the program to 'do
the obvious thing' rather than throw completely unexpected exceptions.
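
For comparison, the strict alternative that was considered might have looked roughly like the sketch below. This is only a guess at the idea: `strictFromCodeUnits` and `strictFromRunes` are hypothetical names, they are not part of dart:core, and the range checks are an assumption about what "unexpected input" would mean.

```dart
// Hypothetical strict variants; these are NOT part of dart:core.
String strictFromCodeUnits(Iterable<int> codeUnits) {
  final units = codeUnits.toList();
  for (final unit in units) {
    if (unit < 0 || unit > 0xFFFF) {
      throw ArgumentError.value(unit, 'codeUnits', 'not a 16-bit code unit');
    }
  }
  return String.fromCharCodes(units);
}

String strictFromRunes(Iterable<int> runes) {
  final points = runes.toList();
  for (final point in points) {
    if (point < 0 || point > 0x10FFFF) {
      throw ArgumentError.value(point, 'runes', 'not a Unicode code point');
    }
  }
  return String.fromCharCodes(points);
}

void main() {
  print(strictFromRunes([0x1F600]) == strictFromCodeUnits([0xD83D, 0xDE00]));
  try {
    strictFromCodeUnits([0x1F600]); // a rune, not a 16-bit code unit
  } on ArgumentError catch (e) {
    print(e.message); // not a 16-bit code unit
  }
}
```

The lenient `String.fromCharCodes` accepts both inputs above without complaint, which is the 'do the obvious thing' behavior described in the message.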

Erik Corry

Feb 14, 2013, 5:14:52 AM
to mi...@dartlang.org
Oops! For "char units or arbitrary 21 runes" you should read "code
units or arbitrary 21 bit runes".

Alex Tatumizer

Feb 14, 2013, 9:54:52 AM
to mi...@dartlang.org
@Erik:
My theory is that non-characters are in fact characters of type "non-character" (they are covered by UCS, but not called non-characters there).
It's like NaN, a number called "not-a-number". A function can have type "num" and return NaN, and it's not a bug.

Anyway, take it with a grain of salt. Term "rune" sounds good to me, especially "mysterious, magic" part. 

The only problem is that previously, there was a consistency between names of getter (charCodes) and constructor name (fromCharCodes).
Now it's gone. Reference to javascript is not convincing: javascript still has charCodeAt (which you renamed) and a bunch of others that you renamed and/or removed completely.
(Please think about it, no need to comment).
Thanks,
A.T


Alex Tatumizer

Feb 14, 2013, 10:14:53 AM
to mi...@dartlang.org
BTW, this post explains some difficulties in applying standard concepts to javascript treatment of Strings: 



Ladislav Thon

Feb 14, 2013, 11:00:39 AM
to General Dart Discussion


> The word "Character" has several problems.
>
> 1) Character excludes Non-characters, code points do not:
> http://www.unicode.org/glossary/#noncharacter

And the word "rune" excludes all characters that are not runes.

The prevailing terminology seems to be "code units" and "code points", even if Unicode uses a bunch of synonyms (code value, scalar value, I don't know what else), so what's wrong with sticking to that?

LT

William Hesse

Feb 14, 2013, 11:18:05 AM
to General Dart Discussion
I am strongly against using "rune". I think we should use "character"
as a synonym for code point. I don't think there are many confusions
or misunderstandings caused by this, and if there are, they should be
solved by repeating "A character is a unicode code point" many times
in the documentation.

I think we should be using "character" and "character code" in
documentation, discussions, and even function names as a synonym for
"code point". There is nothing else which "character" should mean -
if people mean "glyph" or "combined display form", they can say so,
and most often people won't need to talk about those things. The most
important concepts for people will then be:

Code unit: A (UTF-16) 16-bit number, either a basic plane code point
or half of a surrogate pair.

Code point, character, character code: A unicode code point.

Then people will need to talk about byte streams that contain
character data by saying that they are a sequence of UTF-8 code units,
or of ASCII characters, or of 1-byte code points (Latin 1, because of
Webkit's 8-bit strings).
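
The relationship between a code point and its UTF-16 code units is plain arithmetic, sketched below. The function name `toUtf16` is hypothetical, and this sketch deliberately does not reject code points in the surrogate range.

```dart
// Hypothetical helper: encode one code point as UTF-16 code units.
List<int> toUtf16(int codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw RangeError.range(codePoint, 0, 0x10FFFF, 'codePoint');
  }
  // Basic Multilingual Plane: the code unit equals the code point.
  if (codePoint < 0x10000) return [codePoint];
  // Supplementary planes: split the 20-bit offset into two 10-bit halves.
  final offset = codePoint - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

void main() {
  print(toUtf16(0x41)); // [65]
  print(toUtf16(0x1F600)); // [55357, 56832]
}
```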




--
William Hesse

Alex Tatumizer

Feb 14, 2013, 11:38:47 AM
to mi...@dartlang.org
From "Rune: disambiguation" (wikipedia)
Rune, programming jargon meaning a Unicode code point represented as a 32-bit integer


From "Code point" (wikipedia)
Code points are normally assigned to abstract characters. An abstract character is not a graphical glyph but a unit of textual data....
...The distinction between a code point and the corresponding abstract character is not pronounced in Unicode...

Looks like rune is just programming jargon for "character".

The only problem is that 99.9% of programmers are not familiar with this jargon (try searching for "rune" on Google). Otherwise, it's a cute word.




Mehmet D. Akın

Feb 14, 2013, 12:08:07 PM
to mi...@dartlang.org

Sorry to beat a dead horse. The term was probably coined by the Go authors much earlier, in the Plan 9 days (http://doc.cat-v.org/plan_9/4th_edition/papers/utf).

There is a very thin line between "cute" and "irritating", especially in programming languages.

Ladislav Thon

Feb 14, 2013, 2:00:01 PM
to mi...@dartlang.org
Oh, and this one:

> The fromCharCodes name is deliberately a little vague.

How about deliberately avoiding a new term without well-defined meaning that is visually so similar to all the others, and using plain old "decode"?

LT

Karl J. Smith

Feb 14, 2013, 2:27:26 PM
to mi...@dartlang.org

Rune is short, easy to type, close enough in meaning to character, but different enough that you initially need to look it up instead of making an incorrect assumption.

I like it.

Alex Tatumizer

Feb 14, 2013, 3:52:55 PM
to mi...@dartlang.org
Bad news: I found a logical flaw in definition of rune :(
If we define it as jargon (that is, codeword) for Unicode Code Point, we get nonsense.
There are too many levels of redirection here, which lead eventually to ... nowhere.

The point (pun not intended) is that Code Point is a code itself. Every time we mention code, we have to clearly realize what this code really encodes.
And if we pose a question in the form "Unicode Code Point is the code for what?" then the answer is either "Character" or "we don't know".
In either case, we are in a fix: if the former is true, then rune is a name of a name of a name; in the latter case, we don't know what we are talking about.

I think we have a paradox on our hands.
 




Alex Tatumizer

Feb 14, 2013, 4:36:10 PM
to mi...@dartlang.org
> ... and using plain old "decode"?
Decode WHAT?
We don't know what it is, otherwise we could just say fromIt(). What is this "it"?
I can just as well argue that "encode" is more appropriate here. We have a code before, and a code after. How is one better than the other?
Whatever choice we make, the reverse operation does the opposite. It's like complex "i" and "-i": you can rename them, and get the same result.

Full-blown paradox.

I have a better idea: let's revert to Characters here, and I'll take "rune" and use it to mean "Representation of Unified Non-Equalities". I was looking in vain for a good word, and here such an excellent word is wasted for no purpose.

Ladislav Thon

Feb 14, 2013, 11:55:00 PM
to General Dart Discussion


> > ... and using plain old "decode"?
> Decode WHAT?

factory String.decode(List<int> codes);

Decode a list of integers that can be whatever the implementation takes. They don't necessarily have to code characters.

> I can as well argue that "encode" is more appropriate here. We have a code before, and a code after.

No. We have a String after.

> I have a better idea: let's revert to Characters here, and I take "rune" and use it in the meaning of "Representation of Unified Non-Equalities"

:-)

You know, the representations idea is not going to fly, as representations don't have meaning. People don't like to deal with stuff that doesn't mean anything -- it's hard to judge whether it is compatible with what we want.

LT

Alex Tatumizer

Feb 15, 2013, 2:48:54 PM
to mi...@dartlang.org
> People don't like to deal with stuff that doesn't mean anything -- it's hard to judge whether it is compatible with what we want.

The same applies to the current version of rune (let's call it rune-1). It's just a 32-bit integer. What is it good for? Just one thing: you can pass it to fromCharCodes (which is even more abstract than rune). No other meaning. (Well, there's an exception, I admit: some programmers know how to construct glyphs (pictures) out of these codes using fonts etc., but for the rest of us, it's just a number.)
The proposed rune-2 is as abstract as rune-1, but it's good for all comparison operations (note: ALL of them, whereas rune-1 is good only for == and !=).

Alex Tatumizer

Feb 17, 2013, 2:01:38 PM
to mi...@dartlang.org
I have an uneasy feeling about this new API since the day of announcement.
The culprit of my grievances: CodeUnit.

The problem is that pompous CodeUnit doesn't play well with merry couple of char and rune.
Upon considerable deliberation, I suggest renaming CodeUnit to a simpler equivalent: "Kode"

After refactoring, we will have an aesthetically pleasing trio of terms:
1. char : abbrev 
2. rune : (fake) jargon
3. kode : misspelling

It would be much easier to remember and explain to others, along the following lines:
- do you want 16 bit? - misspelling
- do you want 32 bit? - jargon wannabe
- don't know what you are doing? - abbrev

My 3.14 cents.


Ladislav Thon

Feb 17, 2013, 4:42:42 PM
to mi...@dartlang.org
>> People don't like to deal with stuff that doesn't mean anything -- it's hard to judge whether it is compatible with what we want.
>
> Same applies to current version of rune (let's call it rune-1). It's just a 32-bit integer.

No. It is a Unicode code point, which is a well-defined thing.

LT

Alex Tatumizer

Feb 18, 2013, 9:47:44 AM
to mi...@dartlang.org
@Ladislav:
I found a rant that explains intricate relationships between Unicode and Unicode encoding schemes - please take a look

Among other things, it confirms my decode/encode conjecture - that is, whatever you described as "decode" in your proposal is in fact very much encode, and vice versa
It also questions some common stereotypes (like "Unicode encoding", which doesn't exist)

There's another interesting paper to look at:

It tells you that there are 3 standard ways to encode unicode code point as int, one of which is UTF-32 (rarely mentioned anywhere)
The very existence of this term means that the expression "returns Unicode code point" is a bit ambiguous, though everybody will understand it as UTF-32.
"Rune" (so to speak)  probably should be defined as "unicode code point in UTF-32 encoding"

> No. It is a Unicode code point, which is a well-defined thing
Yes, it is, but ... see above.
Does your program really depend on it being exactly UTF-32? What if you get it UTF-8-based (that is, 1-4 UTF-8 bytes packed in a 32-bit int value, still representing the same code point), and all standard libraries support it this way? Will your program break?

In fact, what you need to know in 99.99% cases is that you get unicode code point represented SOMEHOW as int. 

I can say the same about date representations: date.rune2 returns a number that represents this date SOMEHOW, (which is good enough for all comparisons).

Again, with rune-1 you can't do even this much: comparing rune1 > rune2 by itself doesn't mean a lot (e.g. to sort alphabetically, you need to implement collation rules, which are not completely trivial)
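
The point about ordering can be sketched as follows, assuming a current Dart SDK where `String.compareTo` orders strings by their UTF-16 code units. The helper `sortByCodeUnits` is a hypothetical name for illustration.

```dart
// Hypothetical helper: sort a copy using String.compareTo, which orders
// strings by their UTF-16 code units rather than by any collation rules.
List<String> sortByCodeUnits(List<String> words) => [...words]..sort();

void main() {
  print('Z'.runes.first < 'a'.runes.first); // true: 0x5A < 0x61
  print(sortByCodeUnits(['apple', 'Zebra', '\u00E9clair']));
  // [Zebra, apple, éclair]
}
```

A dictionary would more likely list these as apple, éclair, Zebra, which is where collation rules come in.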

Ladislav Thon

Feb 18, 2013, 4:12:56 PM
to mi...@dartlang.org
> @Ladislav:
> I found a rant that explains intricate relationships between Unicode and Unicode encoding schemes - please take a look
>
> Among other things, it confirms my decode/encode conjecture - that is, whatever you described as "decode" in your proposal is in fact very much encode, and vice versa

I don't get it. A sequence of integers is the unreadable form, string is the readable form, therefore going from the list of integers into the string is decoding. Especially given that the list of integers can contain stuff like surrogate pairs.
 
> There's another interesting paper to look at:
>
> It tells you that there are 3 standard ways to encode unicode code point as int, one of which is UTF-32 (rarely mentioned anywhere)

Not really. The article shows 3 ways of representing a Unicode code point as a sequence of integers. Only in one case (UTF-32), the sequence actually degenerates to a single integer. That article is a weak argument.

Now I think it's true that there are more ways of representing a Unicode code point as a single integer, but there's only one sensible today: UTF-32/UCS-4 as 32-bit integer.
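
The three representations can be put side by side in a short sketch, assuming a current Dart SDK with `dart:convert`. The helper `encodings` and its map keys are hypothetical names chosen for this example.

```dart
import 'dart:convert' show utf8;

// Hypothetical helper: one string, three sequences of integers.
Map<String, List<int>> encodings(String s) => {
      'utf8': utf8.encode(s), // 8-bit code units
      'utf16': s.codeUnits, // 16-bit code units
      'utf32': s.runes.toList(), // one integer per code point
    };

void main() {
  print(encodings('\u{1F600}'));
  // {utf8: [240, 159, 152, 128], utf16: [55357, 56832], utf32: [128512]}
}
```

Only the last sequence has exactly one integer per code point, which matches the observation that UTF-32 is the case where the sequence degenerates to a single integer.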

LT

Alex Tatumizer

Feb 18, 2013, 5:01:55 PM
to mi...@dartlang.org
What's really strange is that in one place, where we don't experience any confusion whatsoever (timeMillisecondsSinceEpoch), we want to be more pious than the Pope, while in another (the String API) we juggle loosely defined terms like char, codeUnit and rune with ease.

In fact, formal names of those entities, according to unicode standard (if my theory is correct) are:
codeUnit:  codeUnitUtf16
rune: codeUnitUtf32
char: codeUnitUtf16Or32   

These names are ugly, but certainly not uglier than millisecondsSinceEpoch, and consistent with each other.

However, if we are OK with sacrificing formality, then at least we have to be consistent in the way how we sacrifice it.

As I discovered just minutes ago, my earlier suggestion of renaming CodeUnit to Kode allows us to regain consistency in more than one way:

1. Each of newly minted words char, kode and rune is a crippled version of some other word (each crippled in a unique way, which is good for pedagogical purposes).

2. Each of them is ALSO a fish. 

Indeed, according to wikipedia, 

Arctic char (Salvelinus alpinus) is a cold-water fish in the Salmonidae family, native to Arctic, sub-Arctic and alpine lakes and coastal waters 

Kode, if misspelled again as Cod, is "the common name for the genus Gadus of demersal fishes, belonging to the family Gadidae."

Rune, fish of genus "Not-A-Fish", but I have a close friend in the institution charged with assigning names to newly discovered species of fish, he sometimes consults with me while looking for a good name, so
it won't be very difficult to sell him on "rune", especially given that it's really a very fitting name for a fish.

What's good about this idea is that it allows us to visualize complex concepts of string programming - for, as the saying goes, seeing once is better than hearing about it a hundred times.
(Please see images attached)




