[erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

Raimo Niskanen

Oct 31, 2012, 10:44:14 AM

to erlang-q...@erlang.org

Although there might be opinions on whether allowing Unicode variable
and atom names is a good idea, I would like to discuss EEP 40 itself.
In a previous thread there was much said about Unicode or not but I only
found the following about EEP 40, hoping I did not miss anything valuable:

On Thu, Oct 25, 2012 at 05:20:21PM +1300, Richard O'Keefe wrote:
>
> On 23/10/2012, at 10:20 PM, Jesper Louis Andersen wrote:
> >
> > Google Go takes two stances differently:
> >
> > * There is *no* normalization. This means that you can write the same symbol using one codepoint or with two code points combining into the same representation. Of course this is the conservative stance where it is expected that people do not do silly things. But my guess is that it is much easier to handle. Is there a specific reason to pick normalization, apart from the obvious one? I see some similarities to tabs vs spaces for indentation here.
>
> Normalisation is a pain in the πρωκτος. The only thing worse is _not_ doing it.
> (As it happens, I am planning to rewrite the tokeniser of my Smalltalk system to
> accept Unicode -- the run-time already does -- and this is one of the issues I've
> been thinking about.)
>
> I can see four options:
> (1) say that different encodings of the same text are different
> (2) leave it undefined whether they are different
> (3) say that it's someone else's problem (like XML 1.0, which says
> "Characters in names should be expressed using Normalization Form C"
> but leaves it to the author to make it so)
> (4) require normalisation.
>
> The issue is a severely practical one: can two people with different editors
> edit the same source file? As you sapiently observe, this _is_ very like tabs
> vs spaces: your editor may think tabs are every 3 columns, but mine thinks they
> are every 8, and you didn't tell _me_ otherwise. (Again, my Smalltalk system
> discerns method and class boundaries using indentation, and it has paid off to
> enforce no-tabs-in-source-files at check-in.) Of the options above, it is
> only option (4) that makes multiple editors safe to use.
>
> As it happens, I _have_ had the experience of typing exactly what I saw and having
> it fail to match, so I do not want to see anyone else suffering the same fate.
>
> > * In Go, identifiers are exported if they begin with a codepoint in class Lu. This is also a very conservative stance since now your programs must use an Lu codepoint for variable names if we just ported that solution to Erlang. But it is quite simple again, and very easy to handle from a parser perspective.
>
> Restriction to Lu is not an option for Erlang. We *have* to continue to
> allow "_" as well, which is a Pc character, not an Lu character. And if
> we allow _that_ Pc character, why not the others? They aren't used for
> anything else in Erlang.
>
> We really have to allow Lt as well. It would be surpassing strange if
> Ljudevit was a variable but ǈudevit was not.
> There are 31 "Lt" letters in Unicode 6. Of those, 27 are Greek.
> The other 4 exist for the sake of Croatian (which has an alphabet of 30
> letters). As it happens, my maternal grandfather came from a small
> town not far from Dubrovnik. Do I want to be the one to tell 4.4 million
> people who look rather like Granddad Covič they can't write a variable
> name in their own language using their own letters? No, not really.
>
> From a lexical analyser perspective, scanning variable names requires
> just two character sets: things that can begin a variable and things
> that can continue one. How those sets are derived really has no effect
> whatever on how complicated the parsing is. Scanning unquoted atoms is
> admittedly tricky, but that's entirely down to Erlang's _existing_
> treatment of "." and "@"; without those two to worry about we'd just
> have atom starts and atom continuations and again the derivation of
> the sets would make no difference to the scanner's complexity.
>
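
As a rough illustration of the normalisation point quoted above, here is a minimal Python sketch (Python is used only because its standard library ships the Unicode database; the name `café` and the helper `same_identifier` are made up for this example). It shows the same visible name written as two different code point sequences, and that comparing after NFC normalisation, i.e. option (4), makes them equal.

```python
import unicodedata

# Two spellings of the same visible identifier (hypothetical example name).
composed = "caf\u00e9"      # 'é' as one code point, U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

def same_identifier(a: str, b: str) -> bool:
    """Compare two names after normalising both to NFC (option 4)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Option (1): compare the text exactly as each editor stored it.
raw_equal = composed == decomposed
# Option (4): compare after normalisation.
nfc_equal = same_identifier(composed, decomposed)

print(raw_equal, nfc_equal)
```

Without normalisation, two people whose editors store the accent differently would end up with two distinct variables that print identically.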

That was the discussion so far. Here follow my thoughts.

Set notation mistake?
---------------------

I do not understand the BNF definition of variable in the EEP:
variable ::= var_start var_continue*

var_start ::= XID_Start ∩ (Lu ∪ Lt ∪ Pc ∪ Other_ID_Start)

var_continue ::= XID_Continue ∪ "@"

As I read the Unicode XID_Start definition
<http://www.unicode.org/Public/6.2.0/ucd/DerivedCoreProperties.txt>
there are no general category Pc (Connector_Punctuation) characters in
XID_Start, hence there will be none in the set intersection
(which as I understand it is what '∩' means) defining var_start. Therefore
U+5F LOW LINE, aka '_' (underscore), is not allowed to start a variable.

Is there something wrong in that set notation, or what did I misunderstand?

Was it not meant to be:
var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_ID_Start)) ∪ Pc
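
The difference between the two groupings can be sketched in Python. This is only an illustration under stated assumptions: `xid_start_approx` is a crude stand-in for the real XID_Start property (it looks at general categories only and ignores Other_ID_Start), and all function names are invented for the example.

```python
import unicodedata

# ASSUMPTION: approximate XID_Start by the letter-like general categories.
# The real property is derived in DerivedCoreProperties.txt and differs
# in details (for instance it also includes Other_ID_Start).
LETTERISH = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def xid_start_approx(ch: str) -> bool:
    return unicodedata.category(ch) in LETTERISH

def var_start_as_written(ch: str) -> bool:
    # XID_Start ∩ (Lu ∪ Lt ∪ Pc): no Pc character is in XID_Start,
    # so Pc can never survive the intersection.
    cat = unicodedata.category(ch)
    return xid_start_approx(ch) and cat in {"Lu", "Lt", "Pc"}

def var_start_regrouped(ch: str) -> bool:
    # (XID_Start ∩ (Lu ∪ Lt)) ∪ Pc: Pc is added after the intersection.
    cat = unicodedata.category(ch)
    return (xid_start_approx(ch) and cat in {"Lu", "Lt"}) or cat == "Pc"

for ch in ("A", "_", "a"):
    print(repr(ch), var_start_as_written(ch), var_start_regrouped(ch))
```

With the grouping as written, "_" is rejected as a variable start; with the regrouped form it is accepted, while lowercase letters stay excluded in both.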

More restricted variable names
------------------------------

Nevertheless, I would like a slightly more conservative change in how Erlang
should use Unicode in variable names and unquoted atoms.

I want to be able to read printed source code on paper and at least
understand if Ƽ = count() has a variable, an atom or an integer to the left.
This is an impossible goal because today we can put e.g. Cyrillic А in any .erl
file and it will look as if it should compile, but it will not.

So I have to change that requirement into: if it compiles, I want to be able
to tell from a non-colour printed source code listing what the semantics are.

Therefore I think a more conservative rule for variable start is needed:
variable ::= var_start var_continue*

var_start ::= ("A".."Z" ∪ "_")

var_continue ::= XID_Continue ∪ "@"
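
A minimal Python sketch of this conservative rule, for illustration only. It assumes CPython's `str.isidentifier()`, which checks XID_Start (or "_") followed by XID_Continue characters, as a way to test XID_Continue; the function names are made up, and the result depends on the Unicode version bundled with the interpreter.

```python
def is_xid_continue(ch: str) -> bool:
    # Prefixing a known start character turns str.isidentifier()
    # into an XID_Continue test for ch.
    return ("a" + ch).isidentifier()

def is_var_start(ch: str) -> bool:
    # var_start ::= "A".."Z" ∪ "_"
    return ("A" <= ch <= "Z") or ch == "_"

def is_variable(token: str) -> bool:
    # variable ::= var_start var_continue*
    # var_continue ::= XID_Continue ∪ "@"
    if not token or not is_var_start(token[0]):
        return False
    return all(c == "@" or is_xid_continue(c) for c in token[1:])

for token in ("Count", "_Æneas", "X@1", "count", "Ƽ", "Æneas"):
    print(token, is_variable(token))
```

Under this rule a name such as Æneas is rejected as a variable unless it is prefixed with an ASCII capital or an underscore.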

I hereby ditch the characters "À".."Ö" ∪ "Ø".."Þ" that are allowed today since
if they are allowed there is no telling which of all accents are allowed
and so we have to allow all LATIN CAPITAL and therefore all GREEK, CYRILLIC,
ARMENIAN, GEORGIAN, GLAGOLITIC, COPTIC and DESERET CAPITAL letters,
and that is too big a set for a human to handle. Tools would become
essential.

I think it is better to restrict to a subset of 7-bit US-ASCII. Decent
editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which
character is under the cursor and if it is A..Z or _ under U+7F it is a
variable start. That is a possible set to memorize even for non-English
programmers especially considering all reserved words are in 7-bit US-ASCII
and hence Erlang programmers must be somewhat familiar with that charset.

Removing the Latin-1 characters > 128 will need warnings in one release,
introduction in a later one, and probably a non-Unicode compile flag. But I do not
think that many have used such characters to start variables so far.

We can then define mst_variable (maybe singleton variable) much like
in the proposed EEP:
mst_variable ::= mst_var_start var_continue*

mst_var_start ::= "_" ("A".."Z" ∪ "a".."z" ∪ "0".."9" ∪ "_" ∪ "@")

An alternative suggestion is to allow "@" as var_start:
variable ::= var_start var_continue*
var_start ::= ("A".."Z" ∪ "_" ∪ "@")

which requires no change from today for maybe singleton variables:
mst_var_start ::= "_"

I can not think of anything particularly bad with allowing @隠者 as a
variable name. The "@" makes it distinct from an atom, and "@" is
one of the variable prefix characters in Perl (good or bad?!).

The underscore
--------------

I would like to argue against allowing all Unicode general category Pc
(Connector_Punctuation) characters in place of "_". This class contains
these characters in Unicode 6.2:
U+5F; LOW LINE
U+203F; UNDERTIE
U+2040; CHARACTER TIE
U+2054; INVERTED UNDERTIE
U+FE33; PRESENTATION FORM FOR VERTICAL LOW LINE
U+FE34; PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
U+FE4D; DASHED LOW LINE
U+FE4E; CENTERLINE LOW LINE
U+FE4F; WAVY LOW LINE
U+FF3F; FULLWIDTH LOW LINE

Of these at least U+2040 "⁀" is horizontal at the top of the line
and U+FE33 "︳" looks like a vertical bar (I guess intended for
vertical flow Chinese) so they do not resemble "_" very much.
Allowing all these would make it hard to remember if a given
character is category Pc or something else e.g "|". Therefore
I think it will be enough to allow U+5F LOW LINE ("_", underscore).

An Erlang programmer will have to be able to enter many other
7-bit US-ASCII punctuation characters e.g ".,?:;%'" so
the underscore should pose no particular problem.
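
The members of general category Pc can be listed mechanically rather than memorised. A small Python sketch, assuming only the standard `unicodedata` module (the function name is invented here); the output reflects whichever Unicode version the interpreter bundles, which `unicodedata.unidata_version` reports.

```python
import sys
import unicodedata

def connector_punctuation():
    """Return (code point, name) for every character of category Pc."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Pc":
            found.append((cp, unicodedata.name(ch)))
    return found

print("Unicode version:", unicodedata.unidata_version)
for cp, name in connector_punctuation():
    print(f"U+{cp:04X}; {name}")
```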

Unquoted atoms
--------------

The EEP proposes:
atom_start ::= XID_Start ∖ (Lu ∪ Lt ∪ Lo ∪ Pc)
| "." (Ll ∪ Lo)

I agree that Lu (Uppercase_Letter) and Lt (Titlecase_Letter) should
be excluded so an atom can not start with a capital looking letter,
but Pc ⊄ XID_Start so there is no reason to subtract it, and why
subtract Lo (Other_Letter)?

There also seems to be a typo in the definition of unquoted_atom
where an iteration of atom_continue is missing.

I propose:
unquoted_atom ::= atom_start atom_continue*

atom_start ::= atom_start_char
| "." atom_start_char

atom_start_char ::= XID_Start ∖ (Lu ∪ Lt)

atom_continue ::= XID_Continue ∪ "@"
| "." XID_Continue
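
The proposed grammar can be turned into a small recogniser. The Python sketch below is an illustration, not a reference scanner: it relies on CPython's `str.isidentifier()` to approximate XID_Start and XID_Continue, ignores reserved words, and uses invented function names.

```python
import unicodedata

def is_xid_start(ch: str) -> bool:
    # str.isidentifier() special-cases "_" as a start character; exclude it.
    return ch != "_" and ch.isidentifier()

def is_xid_continue(ch: str) -> bool:
    return ("a" + ch).isidentifier()

def is_atom_start_char(ch: str) -> bool:
    # atom_start_char ::= XID_Start ∖ (Lu ∪ Lt)
    return is_xid_start(ch) and unicodedata.category(ch) not in {"Lu", "Lt"}

def is_unquoted_atom(token: str) -> bool:
    i, n = 0, len(token)
    # atom_start ::= atom_start_char | "." atom_start_char
    if i < n and token[i] == ".":
        i += 1
    if i >= n or not is_atom_start_char(token[i]):
        return False
    i += 1
    # atom_continue ::= XID_Continue ∪ "@" | "." XID_Continue
    while i < n:
        if token[i] == ".":
            i += 1
            if i >= n or not is_xid_continue(token[i]):
                return False
        elif token[i] != "@" and not is_xid_continue(token[i]):
            return False
        i += 1
    return True

for token in ("foo", "foo@bar", "foo.bar", "Foo", "foo.", "_foo", "隠者"):
    print(token, is_unquoted_atom(token))
```

Note that a trailing "." is rejected, so the full stop that ends a clause is not swallowed into the atom.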

General explanation
-------------------

I think the EEP could benefit from explaining more about the used character
classes, what kind of stability annex #31 is designed to give and such.

When I read the EEP it took several days of Unicode standard reading to
start understanding, and I think many hesitate before trying to understand
the EEP, which is a pity.

My first concern was whether code I write for one Unicode Erlang release
in the future will still be valid for subsequent Erlang releases
based on later Unicode standards. It seems annex #31 is very much targeted
at solving that problem, and Unicode in itself is much about stability in
subsequent standards, so that problem seems handled, but I am not sure yet.

For example the EEP and my proposal both define atom_start to be XID_Start
minus a set containing uppercase and titlecase letters. XID_Start is
derived from ID_Start, and ID_Start contains Other_ID_Start. I have failed
in finding which codepoints are contained in Other_ID_Start. All I have
found is that it is used to give future stability to ID_Start so that when
the standard has to remove some codepoint from ID_Start it will be added
to Other_ID_Start and therefore XID_Start will not have lost a codepoint
so old code will still be valid.

But since we here define atom_start as above, moving a character from Lu
or Lt into Other_ID_Start will remove it from atom_start and old code
using it will not compile. If I am not mistaken. The same applies to
the EEP's definition of var_start.

I have not managed to find any stability statements from the Unicode
Consortium about whether that could happen, mostly because I have not had
the time yet. Maybe the definition of atom_start above is
flawed and should use set unions only instead...?

I anyway miss this kind of stability reasoning/explanation in the EEP.

/ Raimo Niskanen, Erlang/OTP, Ericsson AB
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Richard O'Keefe

Nov 1, 2012, 1:27:10 AM

to Raimo Niskanen, erlang-q...@erlang.org

On 1/11/2012, at 3:44 AM, Raimo Niskanen wrote:
>
> Was it not ment to be:
> var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_ID_Start)) ∪ Pc

Yes. I made a mistake there.

>
> More restricted variable names
> ------------------------------
>
> Nevertheless, I would like a slightly more conservative change in how Erlang
> should use Unicode in variable names and unquoted atoms.
>
> I want to be able to read printed source code on a paper and at least
> understand if Ƽ = count() has a variable, an atom or an integer to the left.
> This is an impossible goal because we can today e.g Cyrillic А in any .erl
> file and that will look as it should compile but it will not.

I am a little puzzled here. U+0410 (CYRILLIC CAPITAL LETTER A) looks
like this: А. I grant you that it is somewhere between exceptionally
difficult and impossible to tell an A from an А from an Α (Latin
capital A, Cyrillic, and Greek respectively). But they are all capital
letters. The point of the proposal is that since А (U+0410) is a
capital letter, А = count() _should_ compile.

If the example had been U+1EFD ỽ (LATIN SMALL LETTER MIDDLE-WELSH V)
that would have been hard to tell from a six, true.
But I don't see how this is any different from the fact that in a script
you don't know, you cannot tell _what_ a character is.
For example, I had a student this year whose native language was I
believe Malayalam. I can't tell a Malayalam letter from a digit from
a punctuation mark.

Did you mean U+0417 (CYRILLIC CAPITAL LETTER ZE) "З", which resembles 3?

Ah! Emacs to the rescue. It's the LATIN CAPITAL LETTER TONE FIVE.
Nothing to do with Cyrillic.
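
What Emacs reports here can also be reproduced with a few lines of Python, shown as a sketch using the standard `unicodedata` module (the helper name `describe` is made up). It prints the code point, general category and name of the look-alike characters discussed in this exchange.

```python
import unicodedata

def describe(ch: str):
    """Return (code point, general category, character name) for ch."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# Latin A, Cyrillic A, Greek Alpha, TONE FIVE, Middle-Welsh V, Cyrillic ZE
LOOKALIKES = ["A", "\u0410", "\u0391", "\u01BC", "\u1EFD", "\u0417"]

for ch in LOOKALIKES:
    print(*describe(ch))
```

The three "A" shapes are all category Lu, as is TONE FIVE, while the Middle-Welsh V is Ll.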

Reverting to the Middle Welsh letter, if I cannot tell a small letter
from a digit, does that mean that every unquoted atom should begin
with an English letter? (I cannot say "a Latin letter", because
ỽ _is_ a member of the extended Latin script.)

No, I'm sorry. This is ridiculous. Expecting everybody to begin
_their_ variables, which you will almost certainly never see, with
an ASCII letter so _you_ can tell this from that: what sense does
that make? If it is in a script you cannot read, then you cannot read it.

Can we just try, for a minute or two, to entertain a rather wild idea?
Here's the idea: most programmers are adults. They can make informed
choices. If they *want* you to read their code, they are smart enough
to write in a script you can read. If they decide that it's more
important to them that _they_ can read comfortably, that's their
decision to make. If you want a Malayalam-speaker to write code for
you, put the language (English, Finnish, whatever) in the contract.

I have a confession to make. My multiple-programming-languages to
multiple-styled-output-formats tool is currently Latin-1 only.
That's because it's for _me_; nobody paid me to write it and I didn't
expect anyone else to find it useful (although someone did). It can,
for example, be configured to generate HTML, and it can be made to
wrap keywords in <B> and could as easily wrap variables in <U>. It
would probably take me about a week to revise the thing to use
Unicode. So then I'd have a tool that could generate printed listings
with variables underlined, without needing to slap untold numbers of
people in the face with the notion that they are and must remain
second-class world citizens.

> So I have to change that requirement into; if it compiles I want to be able
> to tell from a noncolour printed source code listing what the semantics is.

You are, in fact, proposing a backwards-incompatible change to Erlang,
in order to achieve a goal which is not in general achievable, and not
in my view worth achieving if you could.

Let's be realistic here. If you cannot read any of the words, it is not
going to do you any good to tell the variables from the atoms from the
numbers. Let's take an example. I took a snippet of Erlang out of
the Erlang/OTP release and transliterated the English letters to
Russian ones. If you _don't_ read the Cyrillic script, precisely what
good does it do you to know which are the variables? If you _do_ read
the Cyrillic script, this will seem to you to be complete gibberish,
so imagine it's a language you don't know.

ҵӄҽҲӃҸҾҽ({ҵӄҽҲӃҸҾҽ,ҝҰҼҴ,ҐӁҸӃӈ,Ґӂ0,ҥұ,ҥҳұ}, ҐӃҾҼҜҾҳ, ҢӃ0) ->
try
{ҐӂҼ,ҔҽӃӁӈқҰұҴһ,ҢӃ} = ҲҶ_ҵӄҽ(ҥұ, Ґӂ0, ҥҳұ, ҐӃҾҼҜҾҳ, {ҝҰҼҴ,ҐӁҸӃӈ}, ҢӃ0),
ҕӄҽҲ = {ҵӄҽҲӃҸҾҽ,ҝҰҼҴ,ҐӁҸӃӈ,ҔҽӃӁӈқҰұҴһ,ҐӂҼ},
{ҕӄҽҲ,ҢӃ}
catch
ҒһҰӂӂ:ҔӁӁҾӁ ->
ҢӃҰҲҺ = ҴӁһҰҽҶ:ҶҴӃ_ӂӃҰҲҺӃӁҰҲҴ(),
ҸҾ:ҵӆӁҸӃҴ("ҕӄҽҲӃҸҾҽ: ~ӆ/~ӆ\ҽ", [ҝҰҼҴ,ҐӁҸӃӈ]),
ҴӁһҰҽҶ:ӁҰҸӂҴ(ҒһҰӂӂ, ҔӁӁҾӁ, ҢӃҰҲҺ)
end.

ҲҶ_ҵӄҽ(қҴӂ, җӅӂ, ҥҳұ, ҐӃҾҼҜҾҳ, ҝҰҼҴҐӁҸӃӈ, ҢӃ0) ->
{ҕҸ,ҢӃ1} = ҽҴӆ_һҰұҴһ(ҢӃ0),
{ҕһ,ҢӃ2} = һҾҲҰһ_ҵӄҽҲ_һҰұҴһ(ҝҰҼҴҐӁҸӃӈ, ҢӃ1),

ґҴҵ = ҲһҴҰӁ_ҳҴҰҳ(#ӂӁ{ӁҴҶ=ҵҾһҳһ(fun ({ӅҰӁ,ҥ}, ҡҴҶ) ->
ҿӄӃ_ӁҴҶ(ҥ, ҡҴҶ)
end, [], җӅӂ),
ӂӃҺ=[]}, 0, ҥҳұ),
{ґ2,_ҐҵӃ,ҢӃ3} = ҲҶ_һҸӂӃ(қҴӂ, 0, ҥҳұ, ґҴҵ,
ҢӃ2#ҲҶ{ұӃӈҿҴ=ҴӇҸӃ,ұҵҰҸһ=ҕҸ,ҵҸҽҵҾ=ҕҸ,Ҹӂ_ӃҾҿ_ұһҾҲҺ=ӃӁӄҴ}),
{ҝҰҼҴ,ҐӁҸӃӈ} = ҝҰҼҴҐӁҸӃӈ,
Ґ = [{һҰұҴһ,ҕҸ},{ҵӄҽҲ_ҸҽҵҾ,ҐӃҾҼҜҾҳ,{ҰӃҾҼ,ҝҰҼҴ},ҐӁҸӃӈ},
{һҰұҴһ,ҕһ}|ґ2],
{Ґ,ҕһ,ҢӃ3}.

I don't know about you, but I wouldn't dare to touch this.
It DOES NOT MATTER TO me which words are variables and which
are not, because that knowledge is not useful to me.

(By the way, it should now be clear that in a context like this
you'll _know_ that something is a Cyrillic capital A because
everything else is Cyrillic -- there are no capital letters in
keywords -- so what would a Latin capital A be doing there?)

Does that mean there will be Erlang files that I cannot read and
Raimo Niskanen cannot read? Certainly it does. Does that mean a
big problem for us? No. Nobody is going to _expect_ us to read
it. If someone ships us source code we can't read we shan't use
it.

Is this a NEW problem? No. It is already possible to use some
surprising languages in ASCII (Klingon, Ancient Egyptian, Greek
with a little ingenuity, ...) so ever since Erlang began, we've
had the possibility of entire files being written in words that
we did not understand. If you don't know what the *functions*
are about, what good does it do you to know which tokens are
variables?

I once had to maintain a large chunk of Prolog written by a
very clever programmer whose idea of good variable naming
style came from old BASIC (one letter, or one letter and one
digit). I could see _which_ tokens were the variables, but
not _what_ the variable names meant. I had to figure it out
from the predicate names. So from actual experience I can
tell you

JUST KNOWING WHICH TOKENS ARE VARIABLES IS
NEXT TO USELESS.

> I think it is better to restrict to a subset of 7-bit US-ASCII.

Yeah! Let's make Erlang ASCII-only! (Too bad about my father's
middle name: Æneas. Perfectly good English name, from Latin.)

> Decent
> editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which
> character is under the cursor and if it is A..Z or _ under U+7F it is a
> variable start.

I'm using Aquamacs.
From the Aquamacs help:
Emacs buffers and strings support a large repertoire of
characters from many different scripts, allowing users to
type and display text in almost any known written language.

To support this multitude of characters and scripts,
Emacs closely follows the Unicode Standard.
It's Meta-X describe-char, not Ctrl-X describe-char,
and it works perfectly with Unicode characters.
Here's sample output:

character: Ҳ (1202, #o2262, #x4b2)
preferred charset: unicode (Unicode (ISO10646))
code point: 0x04B2
syntax: w which means: word
category: .:Base, y:Cyrillic
buffer code: #xD2 #xB2
file code: #xD2 #xB2 (encoded by coding system utf-8)
display: by this font (glyph code)
nil:-apple-Lucida_Grande-medium-normal-normal-*-13-*-*-*-p-0-iso10646-1 (#x8A3)

Character code properties: customize what to show
name: CYRILLIC CAPITAL LETTER HA WITH DESCENDER
old-name: CYRILLIC CAPITAL LETTER KHA WITH RIGHT DESCENDER
general-category: Lu (Letter, Uppercase)

Trying this in Vim, it tells me what the numeric codes
of a letter are, but not that it is a letter.

>
> The underscore
> --------------
>
> I would like to argue against allowing all Unicode general category Pc
> (Connector_Punctuation) character in place of "_". This class contain
> in Unicode 6.2 these characters:
> U+5F; LOW LINE
> U+203F; UNDERTIE
> U+2040; CHARACTER TIE
> U+2054; INVERTED UNDERTIE
> U+FE33; PRESENTATION FORM FOR VERTICAL LOW LINE
> U+FE34; PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
> U+FE4D; DASHED LOW LINE
> U+FE4E; CENTERLINE LOW LINE
> U+FE4F; WAVY LOW LINE
> U+FF3F; FULLWIDTH LOW LINE
>
> Of these at least U+2040 "⁀" is horizontal at the top of the line

If it looks horizontal, you have a very poor font.
It's _supposed_ to look more like a c rotated 90 degrees
clockwise and flattened a bit.

> and U+FE33 "︳" looks like a vertical bar (I guess intended for
> vertical flow chinese) so they do not resemble "_" very much.

Who said they were _supposed_ to resemble "_"?
Not me.

I can see your point here, but allowing-all-of-Pc *is* the
Unicode UAX#31 recommendation. We *have* to tailor the
definition somewhat for the sake of backwards compatibility
(dots and at signs). We *could* tailor it here, but it is
definitely advantageous to have at least one more Pc
character, for reasons given in the EEP.

> Allowing all these would make it hard to remember if a given
> character is category Pc or something else e.g "|".

You are not *supposed* to remember what each and every character is.

BECAUSE YOU CAN'T.

If there's anyone who can, I don't want to meet them.
What _else_ could we talk about?

There are 110,117 defined characters in Unicode 6.2.
(The figure was 110,116 in Unicode 6.1 and 6.2 added one more.)
NOBODY is expected to know what all these characters are.

The idea is not
"if a character is to appear in an Erlang file,
everybody must know what it means"
but
"if someone wants to use their own script in
an Erlang file, they should be able to do so
in a way that is generally consistent with
other programming languages."

The idea that a character should be forbidden unless YOU
recognise it would take us right back to ASCII or Latin 1.
Please, do not put the cart before the horse.

It is perfectly acceptable to say "If someone wants to share
Erlang code with people in other countries, they should use
characters that all those people recognise." In the 21st
century it is no longer acceptable to say "nobody may use a
character unless I remember what it is."

>
> Unquoted atoms
> --------------
>
> The EEP proposes:
> atom_start ::= XID_Start ∖ (Lu ∪ Lt ∪ Lo ∪ Pc)
> | "." (Ll ∪ Lo)
>
> I agree that Lu (Uppercase_Letter) and Lt (Titlecase_Letter) should
> be excluded so an atom can not start with a capital looking letter,
> but Pc ⊄ XID_Start so there is no reason to subtract it, and why
> subtract Lo (Other_Letter)?

There is also no *harm* in making it obvious that variables
*can* start with Pc characters and unquoted atoms *cannot*.

Why subtract Lo? That was a combination of a backwards compatibility
issue and an oversight.

The backwards compatibility issue is that
ªº are Lo characters and are not allowed to begin an Erlang atom.
The oversight was forgetting that this category was the one with
most of the characters I wanted to allow.

This should read

atom_start ::= XID_Start ∖ (Lu ∪ Lt ∪ "ªº")
| "." (Ll ∪ Lo)

> There also seems to be a typo in the definition of unquoted_atom
> where an iteration of atom_continue is missing.
>
> I propose:
> unquoted_atom ::= atom_start atom_continue*

Yes.

>
> atom_start ::= atom_start_char
> | "." atom_start_char

That will allow Latin-1 atoms that are not now legal.

>
> atom_start_char ::= XID_Start ∖ (Lu ∪ Lt)
>
> atom_continue ::= XID_Continue ∪ "@"
> | "." XID_Continue

That will allow Latin-1 atoms that are not now legal.

> General explanation
> -------------------
>
> I think the EEP could benefit from explaining more about the used character
> classes, what kind of stability annex #31 is designed to give and such.
>
> When I did read the EEP it took several days of Unicode standard reading to
> start understanding, and I think many hesitate before trying to understand
> the EEP, which is a pity.

Well, yes. Is it my job to repeat all the material in the Unicode
standard? I don't think so. I mean, the thing's telephone-book size!

>
> My first concern was about if I write code for one Unicode Erlang release
> in the future, will then that code be valid for subsequent Erlang releases
> based on later Unicode standards.

Yes. Section 1.1 of UAX#31 could hardly be more explicit. Well,
maybe it could, which is why it points to
http://www.unicode.org/policies/stability_policy.html
which says

- Once a character is XID_Continue,
it must continue to be so in all future versions.
- If a character is XID_Start then it must also be XID_Continue.
- Once a character is XID_Start,
it must continue to be so in all future versions.

amongst other things.
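
The second invariant (every XID_Start character is also XID_Continue) can be checked against a single Unicode version with a short Python sketch. This assumes CPython's `str.isidentifier()` as the source of the XID properties and uses an invented function name; it says nothing about stability across versions, which only the policy itself can promise.

```python
import sys

def xid_start_not_continue():
    """Code points that may start an identifier but may not continue one."""
    bad = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # ch.isidentifier(): ch is XID_Start (or "_").
        # ("a" + ch).isidentifier(): ch is XID_Continue.
        if ch.isidentifier() and not ("a" + ch).isidentifier():
            bad.append(cp)
    return bad

violations = xid_start_not_continue()
print(len(violations), "violations found")
```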

> For example the EEP and my proposal both define atom_start to be XID_Start
> minus a set containing uppercase and titlecase letters. XID_Start is
> derived from ID_Start, and ID_Start contains Other_ID_Start. I have failed
> in finding which codepoints are contained in Other_ID_Start.

To start with, the purpose of Other_ID_Start is to provide stability.
Any character which _used_ to be an ID_Start but because of some change
would have ceased to be so will be given that property to compensate.

The properties Other_ID_Start and Other_ID_Continue are listed in
Proplist.txt in the Unicode data base. Here's the current set:

# ================================================

2118 ; Other_ID_Start # Sm SCRIPT CAPITAL P
212E ; Other_ID_Start # So ESTIMATED SYMBOL
309B..309C ; Other_ID_Start # Sk [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

# Total code points: 4

# ================================================

00B7 ; Other_ID_Continue # Po MIDDLE DOT
0387 ; Other_ID_Continue # Po GREEK ANO TELEIA
1369..1371 ; Other_ID_Continue # No [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
19DA ; Other_ID_Continue # No NEW TAI LUE THAM DIGIT ONE

# Total code points: 12

> But since we here define atom_start as above, moving a character from Lu
> or Lt into Other_ID_Start will remove it from atom_start and old code
> using it will not compile.

Lu and Lt are "General Categories". Other_ID_Start is a "property".
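
The distinction shows up directly in the data. In the Python sketch below (an illustration only; the helper name is invented, and `str.isidentifier()` is assumed as a proxy for XID_Start), U+2118 and U+212E from the list above have symbol general categories, yet they are accepted as identifier starts because of the Other_ID_Start property.

```python
import unicodedata

def category_and_start(ch: str):
    """Return (general category, accepted as identifier start?) for ch."""
    return (unicodedata.category(ch), ch.isidentifier())

# U+2118 SCRIPT CAPITAL P and U+212E ESTIMATED SYMBOL are Other_ID_Start;
# "+" is an ordinary symbol and "A" an ordinary uppercase letter.
for ch in ("\u2118", "\u212E", "+", "A"):
    cat, start = category_and_start(ch)
    print(f"U+{ord(ch):04X} category={cat} identifier_start={start}")
```

So a rule written purely in terms of general categories and a rule written in terms of XID_Start can disagree on such characters.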

OK, now we've got a genuine technical problem.

The set of characters that can begin a variable-OR-an-unquoted-atom
can only grow. That much stability we're promised.

If a character changes from Lu to Lt or Other_ID_Start,
no problem. If a character changes from Lt to Lu or
Other_ID_Start, no problem. But if a character changes
from Lu/Lt to Ll/Lo or vice versa, we have a problem.

Perhaps we can appeal to this:
Once a character is encoded, its properties may still be
changed, but not in such a way as to change the fundamental
identity of the character.
...
For example, the representative glyph for U+0041 “A”
cannot be changed to “B”; the General_Category for
U+0041 “A” cannot be changed to Ll (lowercase letter)
...

Case Pair stability _nearly_ gives us what we want.
If two characters form a case pair in a version of Unicode,
they will remain a case pair in each subsequent version of Unicode.

If two characters do not form a case pair in a version of Unicode,
they will never become a case pair in any subsequent version of Unicode.
That is, if "D" and "d" are unequal defined characters such that
lower("D") = "d" and upper("d") = "D", then this will remain true.
This means that
If "D" is an Lu character now and "d" the corresponding Ll
character, they are going to remain a case pair.
So we could fiddle a bit and say
Lu + Lt + Pc + (Other_ID_Start such that lower(x) != x)
is what we're after.

This doesn't handle the situation where there is a cased letter now
but not its case opposite, as Latin-1 had y-umlaut and sharp s as
lower case letters with no upper case version. But when case opposites
for them did go into Unicode, they didn't change.
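
The case-pair condition can be written down as a small Python sketch. Treat it as an approximation: Python's `str.upper()` and `str.lower()` apply full case mappings, whereas the stability policy speaks of case pairs, and the function name is made up for the example.

```python
import unicodedata

def is_case_pair(upper: str, lower: str) -> bool:
    """True if the two characters map to each other under case conversion."""
    return upper != lower and upper.lower() == lower and lower.upper() == upper

# Ordinary pair: D / d.
print(is_case_pair("D", "d"))
# y-umlaut: the capital (U+0178) was added outside Latin-1, and it pairs up.
print(is_case_pair("\u0178", "\u00FF"))
# Sharp s: a capital (U+1E9E) exists, but "ß".upper() gives "SS" in Python,
# so the two do not form a pair under this definition.
print(is_case_pair("\u1E9E", "\u00DF"))
print(unicodedata.category("\u00DF"))
```

In both Latin-1 cases the lowercase letter kept its Ll category when an uppercase counterpart was encoded, which is the behaviour the argument relies on.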

I don't think we actually have a problem.

However, the attached revision to EEP 40 has two recommendations.

eep-0040.md

Tim McNamara

Nov 1, 2012, 1:45:10 AM

to Richard O'Keefe, erlang-q...@erlang.org

+1 to ROK's ideas from me.

We should be allowing programmers and programming teams to make their own decisions regarding which characters to allow within projects. If people want to play tricks on each other by replacing ASCII chars with visibly indistinguishable chars from somewhere else, then that's their own business. We have the technology to be culturally sensitive and responsive. If someone is willing to invest energy to implement Unicode, we as a community should not put barriers in front of that.

Dmitry Belyaev

Nov 1, 2012, 8:36:13 AM

to Richard O'Keefe, erlang-q...@erlang.org

I've looked through the proposal and don't understand why there is no proposal to add localized keywords?

Suppose I will be using atoms and variables that are easy to read in my own language. Then I'll definitely be frustrated if I have to write keywords in any other language. More than that, it will be very annoying to anyone who has to switch keyboard layout from English to native.

--
Dmitry Belyaev

On 01.11.2012, at 9:27, Richard O'Keefe wrote:

> <eep-0040.md>

Raimo Niskanen

Nov 1, 2012, 12:52:39 PM

to erlang-q...@erlang.org

On Thu, Nov 01, 2012 at 06:27:10PM +1300, Richard O'Keefe wrote:
>
>
> On 1/11/2012, at 3:44 AM, Raimo Niskanen wrote:

: :

> >
> > More restricted variable names
> > ------------------------------
> >
> > Nevertheless, I would like a slightly more conservative change in how Erlang
> > should use Unicode in variable names and unquoted atoms.
> >
> > I want to be able to read printed source code on a paper and at least
> > understand if Ƽ = count() has a variable, an atom or an integer to the left.
> > This is an impossible goal because we can today e.g Cyrillic А in any .erl
> > file and that will look as it should compile but it will not.
>
> I am a little puzzled here. U+0410 (CYRILLIC CAPITAL LETTER A) looks
> like this: А. I grant you that it is somewhere between exceptionally
> difficult and impossible to tell an A from an А from an Α (Latin
> capital A, Cyrillic, and Greek respectively). But they are all capital
> letters. The point of the proposal is that since А (U+0410) is a
> capital letter, А = count() _should_ compile.

I think that point, which is a good one, did not come through in the
proposal, but your updated version has a very good
rationale that makes it clearer.

>
> If the example had been U+1EFD ỽ (LATIN SMALL LETTER MIDDLE-WELSH V)
> that would have been hard to tell from a six, true.
> But I don't see how this is any different from the fact that in a script
> you don't know, you cannot tell _what_ a character is.
> For example, I had a student this year whose native language was I
> believe Malayalam. I can't tell a Malayalam letter from a digit from
> a punctuation mark.
>
> Did you mean U+0417 (CYRILLIC CAPITAL LETTER ZE) "З", which resembles 3?
>
> Ah! Emacs to the rescue. It's the LATIN CAPITAL LETTER TONE FIVE.
> Nothing to do with Cyrillic.

Sorry, I mixed examples here and pushed you onto a side track. The TONE FIVE
was an example of not knowing the symbol's general category. The Cyrillic A
was an example of a glyph that looks similar to A in US-ASCII.

So here is what seems to be the core question:

I say I want to be able to see the difference between a variable and an
unquoted atom even if I can not make sense of the variables' and atoms' names.
I say it would be possible to achieve this by enforcing a small set of first
letters for variables. Then we would require a variable to start with
US-ASCII CAPITAL, "_" or "@".

You say that goal of mine is a lost cause because I will not have any use for
being able to tell the difference between a variable and an atom anyway. And
trying to achieve this by making backwards incompatible changes is plain
ridiculous.

Fair enough.

Just adding "@" to the current set of characters allowed to start a variable
would not be a backwards compatible change, or? But it would be ugly to allow
some Latin capitals while not the Latin extended nor Cyrillic etc.

You have a point. Now it is clearer to me.

>
> > I think it is better to restrict to a subset of 7-bit US-ASCII.
>
> Yeah! Let's make Erlang ASCII-only! (Too bad about my father's
> middle name: Æneas. Perfectly good English name, from Latin.)

I was of course talking about the start of a variable, not the
entire language. I am not that stupid. His variable could be
__Æneas, or @Æneas (the latter is unreadable).

>
> > Decent
> > editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which
> > character is under the cursor and if it is A..Z or _ under U+7F it is a
> > variable start.
>
> I'm using Aquamacs.
> From the Aquamacs help:
> Emacs buffers and strings support a large repertoire of
> characters from many different scripts, allowing users to
> type and display text in almost any known written language.
>
> To support this multitude of characters and scripts,
> Emacs closely follows the Unicode Standard.
> It's Meta-X describe-char, not Ctrl-X describe-char,

Yes. Meta-X. My mistake.

> and it works perfectly with Unicode characters.
> Here's sample output:
>
> character: Ҳ (1202, #o2262, #x4b2)
> preferred charset: unicode (Unicode (ISO10646))
> code point: 0x04B2
> syntax: w which means: word
> category: .:Base, y:Cyrillic
> buffer code: #xD2 #xB2
> file code: #xD2 #xB2 (encoded by coding system utf-8)
> display: by this font (glyph code)
> nil:-apple-Lucida_Grande-medium-normal-normal-*-13-*-*-*-p-0-iso10646-1 (#x8A3)
>
> Character code properties: customize what to show
> name: CYRILLIC CAPITAL LETTER HA WITH DESCENDER
> old-name: CYRILLIC CAPITAL LETTER KHA WITH RIGHT DESCENDER
> general-category: Lu (Letter, Uppercase)
>
> Trying this in Vim, it tells me what the numeric codes
> of a letter are, but not that it is a letter.

Yes. I know. I gave the example.

So in Vim you can easily see if the character is less than 128.
But not if it is a letter.
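
For readers without Emacs at hand, roughly the same lookup can be sketched in a few lines of Python. This is only an illustration using the standard unicodedata module; the helper name describe_char is made up here, and the reported category depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

def describe_char(ch):
    """Return code point, name, general category and an ASCII flag for one character."""
    return {
        "code_point": "U+%04X" % ord(ch),
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "is_ascii": ord(ch) < 128,  # the check that Vim's "ga" already makes easy
    }

# U+04B2 is the character from the describe-char output quoted above.
info = describe_char("\u04b2")
print(info)
```

The point of the sketch is that the general category (Lu here) is the piece of information Vim's numeric display leaves out.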

>
> >
> > The underscore
> > --------------
> >
> > I would like to argue against allowing all Unicode general category Pc
> > (Connector_Punctuation) characters in place of "_". This class contains
> > in Unicode 6.2 these characters:
> > U+5F; LOW LINE
> > U+203F; UNDERTIE
> > U+2040; CHARACTER TIE
> > U+2054; INVERTED UNDERTIE
> > U+FE33; PRESENTATION FORM FOR VERTICAL LOW LINE
> > U+FE34; PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
> > U+FE4D; DASHED LOW LINE
> > U+FE4E; CENTERLINE LOW LINE
> > U+FE4F; WAVY LOW LINE
> > U+FF3F; FULLWIDTH LOW LINE
> >
> > Of these at least U+2040 "⁀" is horizontal at the top of the line
>
> If it looks horizontal, you have a very poor font.
> It's _supposed_ to look more like a c rotated 90 degrees
> clockwise and flattened a bit.

Yes that describes it better. A horizontal flat C, rounded up.

>
> > and U+FE33 "︳" looks like a vertical bar (I guess intended for
> > vertical flow Chinese) so they do not resemble "_" very much.
>
> Who said they were _supposed_ to resemble "_"?
> Not me.

No. I did, because for me that would indicate the character's purpose.

>
> I can see your point here, but allowing-all-of-Pc *is* the
> Unicode UAX#31 recommendation. We *have* to tailor the
> definition somewhat for the sake of backwards compatibility
> (dots and at signs). We *could* tailor it here, but it is
> definitely advantageous to have at least one more Pc
> character, for the reasons given in the EEP.

Sorry I can not find those reasons. I find reasons and agree
that if we allow more than "_" we should allow all in Pc,
but I do not see why we need more than "_" other than because
it is UAX#31's recommendation.

I said I want to be able to understand the semantics without
knowing all characters. Is that a straw man attack?

The wildcard variable is "_" and starting a variable with that
character has a special meaning to the compiler. Why do we need
more aliases for that character?
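
As an aside, the members of Pc can be regenerated rather than copied by hand. The following is a minimal Python sketch, assuming the standard unicodedata module; the function name connector_punctuation is invented for this illustration, and the result reflects whatever Unicode version the interpreter ships, which may be newer than 6.2.

```python
import sys
import unicodedata

def connector_punctuation():
    """Collect every code point whose general category is Pc."""
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Pc"]

pc = connector_punctuation()
for cp in pc:
    # Same "code point; name" layout as the list in the discussion.
    print("U+%04X; %s" % (cp, unicodedata.name(chr(cp))))
```

Scanning the whole code space is slow but simple, which seems acceptable for a one-off check.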

> >
> > Unquoted atoms
> > --------------
> >
> > The EEP proposes:
> > atom_start ::= XID_Start ∖ (Lu ∪ Lt ∪ Lo ∪ Pc)
> > | "." (Ll ∪ Lo)
> >
> > I agree that Lu (Uppercase_Letter) and Lt (Titlecase_Letter) should
> > be excluded so an atom can not start with a capital looking letter,
> > but Pc ⊄ XID_Start so there is no reason to subtract it, and why
> > subtract Lo (Other_Letter)?
>
> There is also no *harm* in making it obvious that variables
> *can* start with Pc characters and unquoted atoms *cannot*.

Point taken. I agree.

>
> Why subtract Lo? That was a combination of a backwards compatibility
> issue and an oversight.
>
> The backwards compatibility issue is that
> ªº are Lo characters and are not allowed to begin an Erlang atom.

Would that be an issue? Since they are in Lo should we not start
allowing them?

> The oversight was forgetting that this category was the one with
> most of the characters I wanted to allow.

I guessed so.

>
> This should read
>
> atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")
> | "." (Ll ∪ Lo)

Ok. Now I get it. But should it not be the same set after a dot
as at the start?

>
> > There also seems to be a typo in the definition of unquoted_atom
> > where an iteration of atom_continue is missing.
> >
> > I propose:
> > unquoted_atom ::= atom_start atom_continue*
>
> Yes.
> >
> > atom_start ::= atom_start_char
> > | "." atom_start_char
>
> That will allow Latin-1 atoms that are not now legal.
> >
> > atom_start_char ::= XID_Start ∖ (Lu ∪ Lt)
> >
> > atom_continue ::= XID_Continue ∪ "@"
> > | "." XID_Continue
>
> That will allow Latin-1 atoms that are not now legal.
>
> > General explanation
> > -------------------
> >
> > I think the EEP could benefit from explaining more about the used character
> > classes, what kind of stability annex #31 is designed to give and such.
> >
> > When I did read the EEP it took several days of Unicode standard reading to
> > start understanding, and I think many hesitate before trying to understand
> > the EEP, which is a pity.
>
> Well, yes. Is it my job to repeat all the material in the Unicode
> standard? I don't think so. I mean, the thing's telephone-book size!

No. The rationale in your new version is a great improvement.
Pointers and reasons are what is needed.

> >
> > My first concern was about if I write code for one Unicode Erlang release
> > in the future, will then that code be valid for subsequent Erlang releases
> > based on later Unicode standards.
>
> Yes. Section 1.1 of UAX#31 could hardly be more explicit. Well,
> maybe it could, which is why it points to
> http://www.unicode.org/policies/stability_policy.html
> which says
>
> - Once a character is XID_Continue,
> it must continue to be so in all future versions.
> - If a character is XID_Start then it must also be XID_Continue.
> - Once a character is XID_Start,
> it must continue to be so in all future versions.
>
> amongst other things.

Thank you. The Unicode standard is hard to navigate.

>
> > For example the EEP and my proposal both define atom_start to be XID_Start
> > minus a set containing uppercase and titlecase letters. XID_Start is
> > derived from ID_Start, and ID_Start contains Other_ID_Start. I have failed
> > in finding which codepoints are contained in Other_ID_Start.
>
> To start with, the purpose of Other_ID_Start is to provide stability.
> Any character which _used_ to be an ID_Start but because of some change
> would have ceased to be so will be given that property to compensate.
>
> The properties Other_ID_Start and Other_ID_Continue are listed in
> Proplist.txt in the Unicode data base. Here's the current set:

So that's where it is... It is difficult to find out where the
different properties are attached to characters.

>
> # ================================================
>
> 2118 ; Other_ID_Start # Sm SCRIPT CAPITAL P
> 212E ; Other_ID_Start # So ESTIMATED SYMBOL
> 309B..309C ; Other_ID_Start # Sk [2] KATAKANA-HIRAGANA VOICED SOUND MARK..KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
>
> # Total code points: 4
>
> # ================================================
>
> 00B7 ; Other_ID_Continue # Po MIDDLE DOT
> 0387 ; Other_ID_Continue # Po GREEK ANO TELEIA
> 1369..1371 ; Other_ID_Continue # No [9] ETHIOPIC DIGIT ONE..ETHIOPIC DIGIT NINE
> 19DA ; Other_ID_Continue # No NEW TAI LUE THAM DIGIT ONE
>
> # Total code points: 12
>
> > But since we here define atom_start as above, moving a character from Lu
> > or Lt into Other_ID_Start will remove it from atom_start and old code
> > using it will not compile.
>
>
> Lu and Lt are "General Categories". Other_ID_Start is a "property".
>
> OK, now we've got a genuine technical problem.
>
> The set of characters that can begin a variable-OR-an-unquoted-atom
> can only grow. That much stability we're promised.
>
> If a character changes from Lu to Lt or Other_ID_Start,
> no problem. If a character changes from Lt to Lu or
> Other_ID_Start, no problem. But if a character changes
> from Lu/Lt to Ll/Lo or vice versa, we have a problem.

I agree that moving a character from Lu or Lt to Other_ID_Start would
increase the set of atom_start characters.

For the characters "ªº" you above called that a backwards compatibility
issue, which I doubt it is. Ignoring that issue would simplify atom_start.

I think I still see a problem, though:
unquoted_atom ::= atom_start atom_continue*

atom_start ::= XID_Start \ (Lu ∪ Lt ∪ Pc ∪ "ªº")
| "." (Ll ∪ Lo)

atom_continue ::= XID_Continue | "@"
| "." (Ll ∪ Lo)

Where XID_Start is practically:
(Lu ∪ Ll ∪ Lt ∪ Lm ∪ Lo ∪ Nl ∪ Other_ID_Start)
\ Pattern_Syntax \ Pattern_White_Space

If a character moves from Ll or Lo to Other_ID_Start it will suddenly
become not allowed after a ".". Right?

Should not the set after a "." be about the same as at the start?
unquoted_atom ::= atom_start atom_continue*

atom_start ::= atom_start_char | "." atom_start_char

atom_continue ::= XID_Continue | "@" | "." atom_start_char
atom_start_char ::= XID_Start \ (Lu ∪ Lt ∪ Pc ∪ "ªº")

I think you are right.

>
> However, the attached revision to EEP 40 has two recommendations.
>
>

Richard O'Keefe

Nov 1, 2012, 5:15:10 PM

to Dmitry Belyaev, erlang-q...@erlang.org

On 2/11/2012, at 1:36 AM, Dmitry Belyaev wrote:

> I've looked through the proposal and don't understand why there is no proposal to add localized keywords?

Because that's actually an orthogonal concern.

Suppose for example that you want

essayez mapped to try
... ...
attrapez catch
... ...
fin end

This has nothing to do with the character set.

The classic way to handle keywords in a tokeniser is FIRST to
recognise them (using an automatically generated or hand coded
deterministic finite state machine) as identifiers and LATER
to look them up in a table (possibly using perfect hashing) to
see if they are keywords.

There is no point in allowing people to plug Serbian keywords
into a table if they will never be recognised as identifiers to
start with. We have to get that part right first.

I have three observations on the general idea.
(1) I have seen Pascal localised in exactly this way.
That was French, which is why I used French in my example.
(2) When I mentioned EEP 40 to a colleague his immediate
reaction was precisely the same, that *obviously* people
should be able to plug their own keywords in too.
(3) Ada and Python have not done this.

Suppose we added a new directive:
-keywords(kw_set_id).
which looked in some path for a file containing
[{'essayez','try'},{'attrapez','catch'},{'fin','end'},...].
and used that to update a dictionary.
Then the lexical analyser could report the English keywords
to the parser. We might want two lists: one for keywords
and one for directives (other than -encoding and -keywords).
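
The two-stage approach (recognise an identifier first, look it up in a table afterwards) might be sketched like this in Python. Everything here is hypothetical: the table contents come from the French example above, the names FRENCH, ENGLISH_KEYWORDS and classify are invented, and keeping the English keywords reserved alongside the localised ones is an assumption this thread does not settle.

```python
# Hypothetical keyword set, as a "-keywords(fr)" file might supply it.
FRENCH = {"essayez": "try", "attrapez": "catch", "fin": "end"}

# Deliberately partial list of the real (English) keywords.
ENGLISH_KEYWORDS = {"try", "catch", "end"}

def classify(word, localized):
    """Classify an already-tokenised identifier as a keyword or an atom.

    The tokeniser has done its job before this is called; this step only
    maps a localised spelling to the English keyword the parser expects.
    """
    english = localized.get(word, word)
    if english in ENGLISH_KEYWORDS:
        return ("keyword", english)
    return ("atom", word)

print(classify("essayez", FRENCH))   # a localised keyword
print(classify("bonjour", FRENCH))   # an ordinary atom
```

Note that the lookup never sees a word the tokeniser refused to recognise as an identifier, which is why the character-set question has to be settled first.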

This is NOT an EEP; it is not a draft of an EEP; and I have
no intention of producing an EEP on this topic at this time.
Someone else can write that one.

> Suppose I will be using atoms and variables that are easy to read in my own language. Then I'll definitely be frustrated if I have to write keywords in any other language. More than that, it will be very annoying to anyone who has to switch keyboard layout from English to native.

One of the reasons that I have no intention of writing an EEP about this
is that flicking between two keyboards is for me a single keystroke.
(On the iPad: tap the globe. On the desktop Mac: command space.)
Switching keyboard layouts is about as hard as switching from lower to
upper case and back. It should also be possible to configure your
text editor, perhaps using abbreviation support, to turn
"@es" (or the equivalent in your language) into "try" and so on.

Until you've written your own wrappers around the library components
you use, you'll need to flick back into Latin script to call those
anyway. Such wrappers _can_ be written, so the need to use some
Latin script in everyday work may not continue forever, but it
does mean there has to be a transition period in which people using
non-Latin keyboards have to learn to use Cmd-Space.

Dmitry Belyaev

Nov 1, 2012, 5:37:42 PM

to Richard O'Keefe, Erlang Questions

Comments inside the quoted text below.

--
Dmitry Belyaev

On 02.11.2012, at 1:15, Richard O'Keefe wrote:

>
> On 2/11/2012, at 1:36 AM, Dmitry Belyaev wrote:
>
>> I've looked through the proposal and don't understand why there is no proposal to add localized keywords?
>
> Because that's actually an orthogonal concern.

> ...

> There is no point in allowing people to plug Serbian keywords
> into a table if they will never be recognised as identifiers to
> start with. We have to get that part right first.

It is like allowing only variable names to be localized and not atoms. It is no use if I cannot write all the text in the language I've chosen.

What about your Māori students? Will you tell them they may write some parts of the program in their language and some other words they have to write in English?

> ...

> (3) Ada and Python have not done this.

I don't think that pointing to other bad choices is good.

It's not only one shortcut to toggle the layout. It's another layout and the brain must be switched to that layout too just to type proper characters.
Another problem is bad layout design. The most widely used Russian layout has the Cyrillic letter "С" on the same key as the Latin "C".
By the way, while typing just this one letter I made two errors trying to type the symbol " because I forgot the layout was still Russian.
What I want to say is that it is not only the problem of one additional keystroke.

Yes, I'd choose "All or Nothing" option for all this proposal.

Richard O'Keefe

Nov 1, 2012, 6:41:46 PM

to Raimo Niskanen, erlang-q...@erlang.org

I'm not going to answer every point, because I'm supposed to be marking exams.
That doesn't mean they aren't good points.

Next revision of the EEP:

eep-0040.md

Richard O'Keefe

Nov 1, 2012, 7:11:05 PM

to Dmitry Belyaev, Erlang Questions

On 2/11/2012, at 10:37 AM, Dmitry Belyaev wrote:

> Comments inside the quoted text below.
>
> --
> Dmitry Belyaev
>
> On 02.11.2012, at 1:15, Richard O'Keefe wrote:
>
>>
>> On 2/11/2012, at 1:36 AM, Dmitry Belyaev wrote:
>>
>>> I've looked through the proposal and don't understand why there is no proposal to add localized keywords?
>>
>> Because that's actually an orthogonal concern.
>> ...
>> There is no point in allowing people to plug Serbian keywords
>> into a table if they will never be recognised as identifiers to
>> start with. We have to get that part right first.
>
> It is like allowing only variable names to be localized and not atoms. It is no use if I cannot write all the text in the language I've chosen.

I did not say "no, never do it".
I said "We have to handle Unicode variables and atoms FIRST".

Step 1: recognise and distinguish between variables and atoms-or-keywords.

THAT is what EEP 40 is about.

Step 2: decide which atoms-or-keywords are atoms and which are what keywords.

If you want keywords in Hebrew or Malayalam or whatever, you have to do
step 1 first.

For that matter, if you are willing to begin keywords with a special
character (as Algol and IMP programmers had to), you can just

-include('keywords/fr').
... ?essayez
...
?attrapez
...
?fin

right now. (%external %integer %fn %spec ring any bells with my readers?)

To repeat: I am NOT saying NO.
I am saying, let's get EEP 40 through *FIRST*.
Then you will be able to use ?slučaj (Croatian for 'case') or
whatever takes your fancy with _no_ extra support from the
Erlang/OTP maintainers right away. You get _that_ much ability
to use localised keywords *sooner* than if you put that into EEP 40.

> What about your Māori students? Will you tell them they may write some parts of the program in their language and some other words they have to write in English?

No, I'll tell them about the macro trick.

>> (3) Ada and Python have not done this.
>
> I don't think that pointing to other bad choices is good.

Considering the huge amount of design work that has gone into
Ada revision -- I once printed out a whole bunch of revision
documents and stopped when I had a pile 60 cm high and still
had a long way to go -- it's not clear how bad a choice
it is. As with EEP 40, it's not "no never" to localised
keywords, but "this _first_".

There are, after all, such things as preprocessors,
and at least keywords are not something you have to name in
a debugger in order to trace them or put breakpoints on them,
so unlike other identifier mapping, keyword localisation via
preprocessor actually works.

>
> Yes, I'd choose "All or Nothing" option for all this proposal.

EEP 40 is *ORTHOGONAL* to localised keywords.
You could have localised (in Latin-1 only) keywords without EEP 40.
You could have EEP 40 without localised keywords.
You can have both.
You can, as I have already said, have EEP 40 AS A STEP TOWARDS
localised keywords.

Here's how it goes:

- first one supports alternative encodings,
but still accepts only Latin-1 characters.

- next one supports non-Latin-1 characters in comments.

- next one supports non-Latin-1 characters in strings.

- next one supports non-Latin-1 characters in identifiers.

- next one supports non-Latin-1 characters in numbers.

- and at any point along the route one can consider
localised keywords.

Richard O'Keefe

Nov 2, 2012, 12:00:19 AM

to Raimo Niskanen, erlang-q...@erlang.org

A Unicode expert has suggested not allowing all of Pc at the
beginning of a variable but just the ASCII and FULLWIDTH
versions of "_". It's not yet clear to me what should be
done in the body of an identifier; allowing precisely these
characters instead of all of Pc is enough for us to begin
with, and we can add the other Pc characters later. Expect
yet another revision next week.

Raimo Niskanen

Nov 2, 2012, 5:35:39 AM

to erlang-q...@erlang.org

On Fri, Nov 02, 2012 at 11:41:46AM +1300, Richard O'Keefe wrote:
> I'm not going to answer every point, because I'm supposed to be marking exams.
> That doesn't mean they aren't good points.

Looking forward to later then...

>
> Next revision of the EEP:

It is now updated and published.

Formally, the EEP updates should go to ee...@erlang.org, according to
http://www.erlang.org/eep.html. I slipped on procedure by not
mailing to that list when accepting this EEP, but that will improve...

>
:
: :

> >
> > The wildcard variable is "_" and starting a variable with that
> > character has a special meaning to the compiler. Why do we need
> > more aliases for that character?
>

> BECAUSE that character has a special meaning,
> and the other characters are NOT aliases for it.
>
> Maybe it's not in the EEP, but it certainly was in this mailing list.
> Someone was arguing against internationalisation on the grounds that
> 變量 couldn't be used as a variable name, and to the proposal that
> _變量 be used, it was claimed that the compiler would have to treat
> this as something that was supposed to occur just once, and so I
> pointed out that there are other Pc characters available, so that
> ⁀變量 or ‿變量 could be used. It wasn't that word, and I think I
> didn't mention ⁀. But the point was that we could retain the
> current reading of "_" unchanged and begin caseless words used as
> variable names with some other Pc character. The idea is that the
> other Pc characters would or could be treated differently from "_".
>
> In fact I do prefer that all the Pc characters should be treated
> the same, but at the moment the EEP offers both alternatives for
> consideration.

Ok. I misread it as there was only one suggestion and that was to
treat all Pc characters alike. I think it is still somewhat unclear
that only treating "_" special _is_ an alternative in the EEP.

Also I do not clearly see what problem is solved for someone using
fonts with say Arabic letters but not say the undertie, by revising
the underscore rule. Bear with me. I have never used another keyboard
than Swedish or English. Is it so that when using such a font there
is no Pc character available except for the "_" (and why is that
available?) so there must be a possibility to express both non-singleton
and maybe-singleton variables using just the "_"?

:
> You cannot even understand the lexical semantics without knowing
> the characters. The most primitive level of "understand(ing)
> the semantics" I can imagine is being able to answer the question
> "Is this sequence of characters legal or not?"
>
> Consider this example: "र॰." (U+0930, U+0970, usual full stop.)
> If you were trying to read that from a file, would it be a legal
> term?
>
> No. The first character is a letter, but the second character is
> classified as a punctuation mark. I only know this because I was
> constantly referring to the tables while constructing the example.
> It will be instantly obvious, I imagine, to anyone familiar with
> the Devanagari script. For that matter, hawaiɁi is or ought to
> be a perfectly good atom. That glottal stop letter looked a lot
> like a question mark, didn't it? So it might not have _looked_
> like an atom, but it would be one.

I have realized that. I wanted a lesser degree of understanding the
lexical semantics: If it passes the compiler (which that example
does not) I would like to be able to see which identifiers are
variables and which are atoms.

Also, e.g. someone writing a syntax highlighter for Vim I guess would
appreciate a simple rule for how to recognize a variable.

>
> If someone gives you an Erlang file written entirely in ASCII,
> but using the Klingon language, just how much would it help you
> to know where the variables began? (Google Translate offers
> translation to Esperanto, why not Klingon? I haven't opened my
> copy of the how-to-learn-Klingon book in 20 years. Sigh.)

It would not help much, I agree. But if for example I get a bug report
about the compiler or runtime system not doing right for a few lines
of Klingon Erlang, it would be helpful to easily distinguish variables
from atoms.

>
> >>
> >> The backwards compatibility issue is that
> >> ªº are Lo characters and are not allowed to begin an Erlang atom.
> >
> > Would that be an issue? Since they are in Lo should we not start
> > allowing them?
>

> I wanted to preserve a somewhat stronger property than any I mentioned,
> namely that
> "this is a legal Erlang text using Latin-1 characters
> under the old rules"
> if and only if
> "this is a legal Erlang text using Latin-1 characters
> under the new rules".
>
> If anyone wants to propose allowing "ªº" at the beginning of an atom
> in Latin-1 Erlang, fine. Doesn't bother me. But I wasn't about to
> introduce _any_ incompatibility if I could avoid it. In particular,
> it seems like a nice thing for the transition period that if you have
> an Erlang file that works in Unicode Erlang and happens to include
> nothing outside Latin-1 (a trivial mechanical check) it should be
> guaranteed to work in Latin-1 Erlang.

Ok. Good point. That sounds maybe essential. And now that goal is in the
latest version of the EEP. Very good.

:
: ::

> >> This should read
> >>
> >> atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")
> >> | "." (Ll ∪ Lo)
> >
> > Ok. Now I get it. But should it not be the same set after a dot
> > as at the start?
>

> Consider
> 1> X = a.B.
> * 1: syntax error before: B
> 1> X = a._2.
> * 1: syntax error before: _2
> 1> X = a.3.
> * 1: syntax error before: 3
> 1> X = a.b.
> 'a.b'
>
> That tells us that currently, only Ll characters are allowed
> after a dot in the continuation of an identifier. That naturally
> generalised to (Ll ∪ Lo). So I made "what can follow a dot" the
> same everywhere in an atom. The mental model I had was to think
> of dot-followed-by-Ll-or-Lo as a single extended character.

Yes. And currently only Ll characters are allowed at the start
of an atom. So currently the same set is allowed at the start
as after a ".".

Your current suggestion allows a.ª as an unquoted atom since the character
after the dot is in Lo, but it is not allowed in Erlang today.

It also allows ᛮᛯᛰ as an atom but not ᛮᛯᛰ.ᛮᛯᛰ since these characters
are in Nl (Letter_Number), which is part of XID_Start.

So I think the mental model should be that after a dot there
should be as if a new atom was starting.

:
>
> Concerning stability, I did send a message to the Unicode consortium.
> I've had an informal response:
>
> An interesting question you raise, which I will pass along
> to some people here. I think the short answer is that you
> can tailor these things to particular environments, and you
> may not be able to rely on any given standard property for
> special purposes. Especially if that property is not
> formally stable. But I'll see what others say.
>
> There are sufficiently many programming languages that depend on
> initial alphabetic case that we may be looking at a revision of
> UAX#31. Wouldn't that be fun‽ (Groan.)

I think we need an XID_Start_Uppercase and XID_Start_Lowercase,
containing Other_ID_Start_Uppercase and Other_ID_Start_Lowercase.

>
> Remaining points skipped for now.
>
>

I especially anticipate a reply about what happens if a character
moves from Ll or Lo to Other_ID_Start...

Good luck with the exams!

Raimo Niskanen

Nov 2, 2012, 6:26:18 AM

to erlang-q...@erlang.org

On Fri, Nov 02, 2012 at 11:41:46AM +1300, Richard O'Keefe wrote:
:

>
> Next revision of the EEP:

> [-- Attachment #2: eep-0040.md --]
> [-- Type: application/octet-stream, Encoding: quoted-printable, Size: 16K --]

I think the EEP should elaborate on normalization. It seems to me that
prescribing NFC would be natural since a file consisting of Latin-1
characters is already in NFC (Normalization Form C, canonical composition).

On the other hand, that would make the atom ﬁ⁵ different from the atom fi5,
while using NFKC (Normalization Form KC, compatibility composition)
would make them equal. I do not know. That ﬁ =:= fi may be
good but that i⁵ =:= i5 may not be good. Anyway, normalizing
these character sequences in comments or strings is _not_
desirable. If NFKC were an option it could only be
for atoms and variables.
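
The difference between the two forms can be checked with a short Python sketch. It assumes only the standard unicodedata module; the variable names are chosen for this illustration and nothing here is taken from the EEP itself.

```python
import unicodedata

# "ﬁ⁵": LATIN SMALL LIGATURE FI followed by SUPERSCRIPT FIVE.
ligature = "\ufb01\u2075"
nfc = unicodedata.normalize("NFC", ligature)    # canonical composition only
nfkc = unicodedata.normalize("NFKC", ligature)  # also folds compatibility forms

# Every Latin-1 character, to check that such text is unchanged by NFC.
latin1 = bytes(range(256)).decode("latin-1")
latin1_is_nfc = unicodedata.normalize("NFC", latin1) == latin1

# "e" plus COMBINING ACUTE ACCENT composes to a single code point under NFC.
composed = unicodedata.normalize("NFC", "e\u0301")

print(nfc == ligature, nfkc, latin1_is_nfc, len(composed))
```

So NFC leaves the ligature and the superscript alone, while NFKC rewrites both, which is the trade-off described above.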

Richard O'Keefe

Nov 4, 2012, 4:24:30 PM

to Raimo Niskanen, erlang-q...@erlang.org

On 2/11/2012, at 10:35 PM, Raimo Niskanen wrote:
>
> Also I do not clearly see what problem is solved for someone using
> fonts with say Arabic letters but not say the undertie, by revising
> the underscore rule. Bear with me. I have never used another keyboard
> than Swedish or English. Is it so that when using such a font there
> is no Pc character available except for the "_" (and why is that
> available?) so there must be a possibility to express both non-singleton
> and maybe-singleton variables using just the "_"?

I have only tried the Macintosh interface, where three virtual keyboards
("Arabic", "Arabic - PC", and "Arabic - QWERTY") are
available. All of them have the underline. ISO 8859-6 (the ISO 8-bit
character set for Arabic) includes all of ASCII. However, I am not an
expert.

>
> I have realized that. I wanted a lesser degree of understanding the
> lexical semantics: If it passes the compiler (which that example
> does not) I would like to be able to see which identifiers are
> variables and which are atoms.
>
> Also, e.g. someone writing a syntax highlighter for Vim I guess would
> appreciate a simple rule for how to recognize a variable.

Well, the EEP gives them _that_. If Vim can highlight Ada and Python
and Java correctly, what's the problem? Copy the regular expressions
it uses for Java and tinker with them.

>
>>
>> If someone gives you an Erlang file written entirely in ASCII,
>> but using the Klingon language, just how much would it help you
>> to know where the variables began? (Google Translate offers
>> translation to Esperanto, why not Klingon? I haven't opened my
>> copy of the how-to-learn-Klingon book in 20 years. Sigh.)
>
> It would not help much, I agree. But if for example I get a bug report
> about the compiler or runtime system not doing right for a few lines
> of Klingon Erlang, it would be helpful to easily distinguish variables
> from atoms.

You don't have to do it by eye. You can use a tool (like the Vim
syntax colourer you mention above).

>> Consider
>> 1> X = a.B.
>> * 1: syntax error before: B
>> 1> X = a._2.
>> * 1: syntax error before: _2
>> 1> X = a.3.
>> * 1: syntax error before: 3
>> 1> X = a.b.
>> 'a.b'
>>
>> That tells us that currently, only Ll characters are allowed
>> after a dot in the continuation of an identifier. That naturally
>> generalised to (Ll ∪ Lo). So I made "what can follow a dot" the
>> same everywhere in an atom. The mental model I had was to think
>> of dot-followed-by-Ll-or-Lo as a single extended character.
>
> Yes. And currently only Ll characters are allowed at the start
> of an atom. So currently the same set is allowed at the start
> as after a ".".
>
> Your current suggestion allows a.ª as an unquoted atom since the character
> after the dot is in Lo, but it is not allowed in Erlang today.

Oh DRAT!

>
> It also allows ᛮᛯᛰ as an atom but not ᛮᛯᛰ.ᛮᛯᛰ since these characters
> are in Nl (Letter_Number), which is part of XID_Start.

Frankly that one doesn't bother me in the least.

>
> So I think the mental model should be that after a dot there
> should be as if a new atom was starting.

However, since I've got to fix the a.ª bug, I may as well adopt
your suggestion. The grammar now reads

unquoted_atom ::= "."? atom_start atom_continue*

atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")

atom_continue ::= XID_Continue ∪ "@" \ "ªº"
| "." atom_start
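
To make the grammar above concrete, here is a rough Python sketch of a recogniser for it. It is an illustration, not the EEP's reference implementation: the function names are invented, Python's str.isidentifier() is used as a stand-in for XID_Start and XID_Continue (Python's identifier rules are defined in terms of those properties, but it additionally accepts a leading "_", which is excluded by hand), reserved words are ignored, and the results depend on the interpreter's Unicode version.

```python
import unicodedata

ORDINALS = "\u00aa\u00ba"  # "ªº", excluded for Latin-1 compatibility

def is_xid_start(ch):
    # isidentifier() also accepts "_" as a first character; XID_Start does not.
    return ch != "_" and ch.isidentifier()

def is_xid_continue(ch):
    # A character may continue an identifier if "a" + ch is an identifier.
    return ("a" + ch).isidentifier()

def is_atom_start(ch):
    # atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")
    return (is_xid_start(ch)
            and unicodedata.category(ch) not in ("Lu", "Lt")
            and ch not in ORDINALS)

def is_unquoted_atom(text):
    # unquoted_atom ::= "."? atom_start atom_continue*
    i = 1 if text.startswith(".") else 0
    if i >= len(text) or not is_atom_start(text[i]):
        return False
    i += 1
    while i < len(text):
        ch = text[i]
        if ch == ".":
            # atom_continue ::= ... | "." atom_start
            if i + 1 >= len(text) or not is_atom_start(text[i + 1]):
                return False
            i += 2
        elif ch not in ORDINALS and (ch == "@" or is_xid_continue(ch)):
            i += 1
        else:
            return False
    return True

for sample in ["a.b", "a.B", "a._2", "a.3", "a.\u00aa"]:
    print(repr(sample), is_unquoted_atom(sample))
```

With this reading, the shell examples earlier in the thread come out the same way as in current Erlang, a.ª is rejected, and the runic ᛮᛯᛰ.ᛮᛯᛰ is accepted because what follows a dot is treated like the start of a new atom.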
