Although there may be opinions on whether allowing Unicode variable and atom names is a good idea, I would like to discuss EEP 40 itself. Much was said in a previous thread about whether to use Unicode at all, but I only found the following about EEP 40 (I hope I did not miss anything valuable):
On Thu, Oct 25, 2012 at 05:20:21PM +1300, Richard O'Keefe wrote:
> On 23/10/2012, at 10:20 PM, Jesper Louis Andersen wrote:
> > Google Go takes two stances differently:
> > * There is *no* normalization. This means that you can write the same symbol using one codepoint or with two code points combining into the same representation. Of course this is the conservative stance where it is expected that people do not do silly things. But my guess is that it is much easier to handle. Is there a specific reason to pick normalization, apart from the obvious one? I see some similarities to tabs vs spaces for indentation here.
> Normalisation is a pain in the πρωκτος. The only thing worse is _not_ doing it.
> (As it happens, I am planning to rewrite the tokeniser of my Smalltalk system to
> accept Unicode -- the run-time already does -- and this is one of the issues I've
> been thinking about.)
>
> I can see four options:
> (1) say that different encodings of the same text are different
> (2) leave it undefined whether they are different
> (3) say that it's someone else's problem (like XML 1.0, which says
>     "Characters in names should be expressed using Normalization Form C"
>     but leaves it to the author to make it so)
> (4) require normalisation.
>
> The issue is a severely practical one: can two people with different editors
> edit the same source file? As you sapiently observe, this _is_ very like tabs
> vs spaces: your editor may think tabs are every 3 columns, but mine thinks they
> are every 8, and you didn't tell _me_ otherwise. (Again, my Smalltalk system
> discerns method and class boundaries using indentation, and it has paid off to
> enforce no-tabs-in-source-files at check-in.) Of the options above, it is
> only option (4) that makes multiple editors safe to use.
> As it happens, I _have_ had the experience of typing exactly what I saw and having > it fail to match, so I do not want to see anyone else suffering the same fate.
> > * In Go, identifiers are exported if they begin with a codepoint in class Lu. This is also a very conservative stance since now your programs must use an Lu codepoint for variable names if we just ported that solution to Erlang. But it is quite simple again, and very easy to handle from a parser perspective.
> Restriction to Lu is not an option for Erlang. We *have* to continue to
> allow "_" as well, which is a Pc character, not an Lu character. And if
> we allow _that_ Pc character, why not the others? They aren't used for
> anything else in Erlang.
>
> We really have to allow Lt as well. It would be surpassing strange if
> Ljudevit was a variable but ǈudevit was not.
> There are 31 "Lt" letters in Unicode 6. Of those, 27 are Greek.
> The other 4 exist for the sake of Croatian (which has an alphabet of 30
> letters). As it happens, my maternal grandfather came from a small
> town not far from Dubrovnik. Do I want to be the one to tell 4.4 million
> people who look rather like Granddad Covič they can't write a variable
> name in their own language using their own letters? No, not really.
>
> From a lexical analyser perspective, scanning variable names requires
> just two character sets: things that can begin a variable and things
> that can continue one. How those sets are derived really has no effect
> whatever on how complicated the parsing is. Scanning unquoted atoms is
> admittedly tricky, but that's entirely down to Erlang's _existing_
> treatment of "." and "@"; without those two to worry about we'd just
> have atom starts and atom continuations and again the derivation of
> the sets would make no difference to the scanner's complexity.
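The normalisation question in the quoted discussion can be made concrete with a small sketch. It uses Python's standard unicodedata module purely for illustration; the function name same_identifier is made up here and is not part of EEP 40 or of any Erlang tool.

```python
import unicodedata

# Two encodings of the same visible text "é":
composed = "\u00e9"     # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # two code points: "e" + COMBINING ACUTE ACCENT


def same_identifier(a: str, b: str, normalise: bool) -> bool:
    """Compare two names, optionally normalising both to NFC first."""
    if normalise:
        a = unicodedata.normalize("NFC", a)
        b = unicodedata.normalize("NFC", b)
    return a == b


# Without normalisation (option 1) the two spellings are different names.
print(same_identifier(composed, decomposed, normalise=False))
# With required normalisation (option 4) they are the same name.
print(same_identifier(composed, decomposed, normalise=True))
```

Which editor produced which encoding is invisible on screen, which is why only the normalising comparison is safe across editors.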
That was the discussion so far. Here follow my thoughts.
Set notation mistake?
---------------------
I do not understand the BNF definition of variable in the EEP:
    variable ::= var_start var_continue*
As I read the Unicode XID_Start definition <http://www.unicode.org/Public/6.2.0/ucd/DerivedCoreProperties.txt> there are no general category Pc (Connector_Punctuation) characters in XID_Start, hence there will be none in the set intersection (which, as I understand it, is what '∩' means) defining var_start. Therefore U+5F LOW LINE, aka '_' (underscore), is not allowed to start a variable.
Is there something wrong in that set notation, or what did I misunderstand?
Was it not meant to be:
    var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_ID_Start)) ∪ Pc
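To illustrate the difference between keeping Pc inside the intersection and adding it by union, here is a rough sketch. It approximates XID_Start by general category only (it ignores Other_ID_Start and the NFKC adjustments of the real property), and the function names are invented for this example.

```python
import unicodedata

# Assumption: a rough stand-in for XID_Start, letters and letter-numbers only.
XID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def in_xid_start(ch: str) -> bool:
    return unicodedata.category(ch) in XID_START_CATEGORIES


def var_start_intersection(ch: str) -> bool:
    # Pc inside the intersection: XID_Start ∩ (Lu ∪ Lt ∪ Pc)
    return in_xid_start(ch) and unicodedata.category(ch) in {"Lu", "Lt", "Pc"}


def var_start_union(ch: str) -> bool:
    # Pc added by union: (XID_Start ∩ (Lu ∪ Lt)) ∪ Pc
    cat = unicodedata.category(ch)
    return (in_xid_start(ch) and cat in {"Lu", "Lt"}) or cat == "Pc"


for ch in ["_", "A", "a", "\u01c8"]:
    print(repr(ch), unicodedata.category(ch),
          var_start_intersection(ch), var_start_union(ch))
```

Under these assumptions only the union form accepts "_", because no Pc character survives an intersection with XID_Start.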
More restricted variable names
------------------------------
Nevertheless, I would like a slightly more conservative change in how Erlang should use Unicode in variable names and unquoted atoms.
I want to be able to read printed source code on paper and at least understand whether Ƽ = count() has a variable, an atom or an integer to the left. This is an impossible goal because today we can put e.g. a Cyrillic А in any .erl file, and that will look as if it should compile, but it will not.
So I have to change that requirement into: if it compiles, I want to be able to tell from a non-colour printed source code listing what the semantics are.
Therefore I think a more conservative rule for variable start is needed:
    variable ::= var_start var_continue*
    var_start ::= ("A".."Z" ∪ "_")
    var_continue ::= XID_Continue ∪ "@"
I hereby ditch the characters "À".."Ö" ∪ "Ø".."Þ" that are allowed today, since if they are allowed there is no telling which accents are allowed, and so we would have to allow all LATIN CAPITAL letters and therefore all GREEK, CYRILLIC, ARMENIAN, GEORGIAN, GLAGOLITIC, COPTIC and DESERET CAPITAL letters, and that is too big a set for a human to handle. Tools would become essential.
I think it is better to restrict to a subset of 7-bit US-ASCII. Decent editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which character is under the cursor, and if it is A..Z or _ (below U+7F) it is a variable start. That is a feasible set to memorize even for non-English programmers, especially considering that all reserved words are in 7-bit US-ASCII and hence Erlang programmers must be somewhat familiar with that charset.
Removing the Latin-1 characters above U+7F will need warnings in one release before the change is introduced in a later one, and probably a non-Unicode compile flag. But I do not think that many have used such characters to start variables so far.
We can then define mst_variable (maybe singleton variable) much like in the proposed EEP:
    mst_variable ::= mst_var_start var_continue*
An alternative suggestion is to allow "@" as var_start:
    variable ::= var_start var_continue*
    var_start ::= ("A".."Z" ∪ "_" ∪ "@")
which requires no change from today for maybe singleton variables:
    mst_var_start ::= "_"
I cannot think of anything particularly bad about allowing @隠者 as a variable name. The "@" makes it distinct from an atom, and "@" is one of the variable prefix characters in Perl (good or bad?!).
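The conservative rule in this section is simple enough to sketch in a few lines. The following is only an illustration, not a scanner implementation: the function names are made up, and the XID_Continue test leans on the assumption that CPython's str.isidentifier() checks XID_Continue for every character after the first.

```python
def is_xid_continue(ch: str) -> bool:
    # Assumption: prefixing an ASCII letter isolates CPython's
    # XID_Continue check for the single character ch.
    return ("a" + ch).isidentifier()


def is_variable(token: str, allow_at_start: bool = False) -> bool:
    """Conservative rule: only ASCII "A".."Z" or "_" may start a variable.

    allow_at_start models the alternative where "@" is also a var_start.
    """
    if not token:
        return False
    first, rest = token[0], token[1:]
    starts = ("A" <= first <= "Z") or first == "_" or (allow_at_start and first == "@")
    # var_continue ::= XID_Continue ∪ "@"
    return starts and all(ch == "@" or is_xid_continue(ch) for ch in rest)


print(is_variable("Count"))                           # ASCII capital start
print(is_variable("\u01bc"))                          # Ƽ is Lu but not ASCII
print(is_variable("@\u96a0\u8005", allow_at_start=True))  # @隠者 under the alternative
```

With this rule a reader only has to recognise 27 characters (28 with "@") to decide whether a token is a variable.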
The underscore
--------------
I would like to argue against allowing all Unicode general category Pc (Connector_Punctuation) characters in place of "_". In Unicode 6.2 this class contains these characters:
    U+5F; LOW LINE
    U+203F; UNDERTIE
    U+2040; CHARACTER TIE
    U+2054; INVERTED UNDERTIE
    U+FE33; PRESENTATION FORM FOR VERTICAL LOW LINE
    U+FE34; PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
    U+FE4D; DASHED LOW LINE
    U+FE4E; CENTRELINE LOW LINE
    U+FE4F; WAVY LOW LINE
    U+FF3F; FULLWIDTH LOW LINE
Of these at least U+2040 "⁀" is horizontal at the top of the line and U+FE33 "︳" looks like a vertical bar (I guess intended for vertical flow Chinese) so they do not resemble "_" very much. Allowing all these would make it hard to remember whether a given character is category Pc or something else, e.g. "|". Therefore I think it will be enough to allow U+5F LOW LINE ("_", underscore).
An Erlang programmer will have to be able to enter many other 7-bit US-ASCII punctuation characters, e.g. ".,?:;%'", so the underscore should pose no particular problem.
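The set of Pc characters can be regenerated from a Unicode database rather than copied by hand. A small sketch using Python's unicodedata follows; the result reflects whatever Unicode version the local Python ships, which may be newer than 6.2, and the function name is invented for this example.

```python
import sys
import unicodedata


def connector_punctuation():
    """Return (code point, name) for every character whose category is Pc."""
    return [
        (cp, unicodedata.name(chr(cp)))
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Pc"
    ]


for cp, name in connector_punctuation():
    print(f"U+{cp:04X}; {name}")
```

Restricting var_start to U+5F alone amounts to picking exactly one entry from this list.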
Unquoted atoms
--------------
The EEP proposes:
    atom_start ::= XID_Start ∖ (Lu ∪ Lt ∪ Lo ∪ Pc) | "." (Ll ∪ Lo)
I agree that Lu (Uppercase_Letter) and Lt (Titlecase_Letter) should be excluded so an atom cannot start with a capital-looking letter, but Pc ∩ XID_Start = ∅ so there is no reason to subtract it, and why subtract Lo (Other_Letter)?
There also seems to be a typo in the definition of unquoted_atom where an iteration of atom_continue is missing.
I propose:
    unquoted_atom ::= atom_start atom_continue*
I think the EEP could benefit from explaining more about the character classes it uses, what kind of stability Unicode Standard Annex #31 is designed to give, and so on.
When I read the EEP it took several days of reading the Unicode standard before I started to understand it, and I think many will hesitate before trying to understand the EEP, which is a pity.
My first concern was whether code I write for one Unicode Erlang release will still be valid in subsequent Erlang releases based on later Unicode standards. It seems Annex #31 is very much targeted at solving that problem, and Unicode in itself is much about stability across subsequent standards, so that problem seems handled, but I am not sure yet.
> Was it not meant to be:
> var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_ID_Start)) ∪ Pc
Yes. I made a mistake there.
> More restricted variable names
> ------------------------------
> Nevertheless, I would like a slightly more conservative change in how Erlang
> should use Unicode in variable names and unquoted atoms.
> I want to be able to read printed source code on a paper and at least
> understand if Ƽ = count() has a variable, an atom or an integer to the left.
> This is an impossible goal because today we can put e.g. a Cyrillic А in any .erl
> file, and that will look as if it should compile, but it will not.
I am a little puzzled here. U+0410 (CYRILLIC CAPITAL LETTER A) looks
like this: А. I grant you that it is somewhere between exceptionally
difficult and impossible to tell an A from an А from an Α (Latin
capital A, Cyrillic, and Greek respectively). But they are all capital
letters. The point of the proposal is that since А (U+0410) is a
capital letter, А = count() _should_ compile.
If the example had been U+1EFD ỽ (LATIN SMALL LETTER MIDDLE-WELSH V)
that would have been hard to tell from a six, true.
But I don't see how this is any different from the fact that in a script
you don't know, you cannot tell _what_ a character is.
For example, I had a student this year whose native language was I
believe Malayalam. I can't tell a Malayalam letter from a digit from
a punctuation mark.
Did you mean U+0417 (CYRILLIC CAPITAL LETTER ZE) "З", which resembles 3?
Ah! Emacs to the rescue. It's the LATIN CAPITAL LETTER TONE FIVE.
Nothing to do with Cyrillic.
Reverting to the Middle Welsh letter, if I cannot tell a small letter
from a digit, does that mean that every unquoted atom should begin
with an English letter? (I cannot say "a Latin letter", because
ỽ _is_ a member of the extended Latin script.)
No, I'm sorry. This is ridiculous. Expecting everybody to begin
_their_ variables which you will almost certainly never see to begin
with an ASCII letter so _you_ can tell this from that; what sense does
that make? If it is in a script you cannot read, then you cannot read it.
Can we just try, for a minute or two, to entertain a rather wild idea?
Here's the idea: most programmers are adults. They can make informed
choices. If they *want* you to read their code, they are smart enough
to write in a script you can read. If they decide that it's more
important to them that _they_ can read comfortably, that's their
decision to make. If you want a Malayalam-speaker to write code for
you, put the language (English, Finnish, whatever) in the contract.
I have a confession to make. My multiple-programming-languages to
multiple-styled-output-formats tool is currently Latin-1 only.
That's because it's for _me_; nobody paid me to write it and I didn't
expect anyone else to find it useful (although someone did). It can,
for example, be configured to generate HTML, and it can be made to
wrap keywords in <B> and could as easily wrap variables in <U>. It
would probably take me about a week to revise the thing to use
Unicode. So then I'd have a tool that could generate printed listings
with variables underlined, without needing to slap untold numbers of
people in the face with the notion that they are and must remain
second-class world citizens.
> So I have to change that requirement into; if it compiles I want to be able
> to tell from a noncolour printed source code listing what the semantics is.
You are, in fact, proposing a backwards-incompatible change to Erlang,
in order to achieve a goal which is not in general achievable, and not
in my view worth achieving if you could.
Let's be realistic here. If you cannot read any of the words, it is not
going to do you any good to tell the variables from the atoms from the
numbers. Let's take an example. I took a snippet of Erlang out of
the Erlang/OTP release and transliterated the English letters to
Russian ones. If you _don't_ read the Cyrillic script, precisely what
good does it do you to know which are the variables? If you _do_ read
the Cyrillic script, this will seem to you to be complete gibberish,
so imagine it's a language you don't know.
I don't know about you, but I wouldn't dare to touch this.
It DOES NOT MATTER TO me which words are variables and which
are not, because that knowledge is not useful to me.
(By the way, it should now be clear that in a context like this
you'll _know_ that something is a Cyrillic capital A because
everything else is Cyrillic -- there are no capital letters in
keywords -- so what would a Latin capital A be doing there?)
Does that mean there will be Erlang files that I cannot read and
Raimo Niskanen cannot read? Certainly it does. Does that mean a
big problem for us? No. Nobody is going to _expect_ us to read
it. If someone ships us source code we can't read we shan't use
it.
Is this a NEW problem? No. It is already possible to use some
surprising languages in ASCII (Klingon, Ancient Egyptian, Greek
with a little ingenuity, ...) so ever since Erlang began, we've
had the possibility of entire files being written in words that
we did not understand. If you don't know what the *functions*
are about, what good does it do you to know which tokens are
variables?
I once had to maintain a large chunk of Prolog written by a
very clever programmer whose idea of good variable naming
style came from old BASIC (one letter, or one letter and one
digit). I could see _which_ tokens were the variables, but
not _what_ the variable names meant. I had to figure it out
from the predicate names. So from actual experience I can
tell you
JUST KNOWING WHICH TOKENS ARE VARIABLES IS
NEXT TO USELESS.
> I think it is better to restrict to a subset of 7-bit US-ASCII.
Yeah! Let's make Erlang ASCII-only! (Too bad about my father's
middle name: Æneas. Perfectly good English name, from Latin.)
> Decent
> editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which
> character is under the cursor and if it is A..Z or _ under U+7F it is a
> variable start.
I'm using Aquamacs.
From the Aquamacs help:
Emacs buffers and strings support a large repertoire of characters from many different scripts, allowing users to
type and display text in almost any known written language.
To support this multitude of characters and scripts,
Emacs closely follows the Unicode Standard.
It's Meta-X describe-char, not Ctrl-X describe-char,
and it works perfectly with Unicode characters.
Here's sample output:
character: Ҳ (1202, #o2262, #x4b2)
preferred charset: unicode (Unicode (ISO10646))
code point: 0x04B2
syntax: w which means: word
category: .:Base, y:Cyrillic
buffer code: #xD2 #xB2
file code: #xD2 #xB2 (encoded by coding system utf-8)
display: by this font (glyph code)
nil:-apple-Lucida_Grande-medium-normal-normal-*-13-*-*-*-p-0-iso10646-1 (#x8A3)
Character code properties: customize what to show
name: CYRILLIC CAPITAL LETTER HA WITH DESCENDER
old-name: CYRILLIC CAPITAL LETTER KHA WITH RIGHT DESCENDER
general-category: Lu (Letter, Uppercase)
Trying this in Vim, it tells me what the numeric codes
of a letter are, but not that it is a letter.
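For anyone whose editor shows only the numeric code, the essential part of that describe-char output can be reproduced in a few lines. This is a sketch using Python's unicodedata, not what Emacs does internally; describe_char is a name chosen for this example.

```python
import unicodedata


def describe_char(ch: str) -> dict:
    """Report the properties that matter for deciding what a character is."""
    return {
        "code point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general-category": unicodedata.category(ch),
        "utf-8": " ".join(f"#x{b:02X}" for b in ch.encode("utf-8")),
    }


# The character from the sample output above, and the one from "Ƽ = count()".
for ch in ["\u04b2", "\u01bc"]:
    for key, value in describe_char(ch).items():
        print(f"{key}: {value}")
    print()
```

The general-category field is the part that tells a letter from a digit or a punctuation mark.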
> I would like to argue against allowing all Unicode general category Pc
> (Connector_Punctuation) characters in place of "_". In Unicode 6.2 this
> class contains these characters:
> U+5F; LOW LINE
> U+203F; UNDERTIE
> U+2040; CHARACTER TIE
> U+2054; INVERTED UNDERTIE
> U+FE33; PRESENTATION FORM FOR VERTICAL LOW LINE
> U+FE34; PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
> U+FE4D; DASHED LOW LINE
> U+FE4E; CENTRELINE LOW LINE
> U+FE4F; WAVY LOW LINE
> U+FF3F; FULLWIDTH LOW LINE
> Of these at least U+2040 "⁀" is horizontal at the top of the line
If it looks horizontal, you have a very poor font.
It's _supposed_ to look more like a c rotated 90 degrees
clockwise and flattened a bit.
> and U+FE33 "︳" looks like a vertical bar (I guess intended for
> vertical flow Chinese) so they do not resemble "_" very much.
Who said they were _supposed_ to resemble "_"?
Not me.
We should be allowing programmers and programming teams to make their own
decisions regarding which characters to allow within projects. If people
want to play tricks on each other by replacing ASCII chars with visibly
indistinguishable chars from somewhere else, then that's their own
business. We have the technology to be culturally sensitive and responsive.
If someone is willing to invest energy to implement Unicode, we as a
community should not put barriers in front of that.
I've looked through the proposal and don't understand why there is no proposal to add localized keywords.
Suppose I use atoms and variables that are easy to read in my own language. Then I'll definitely be frustrated if I have to write keywords in another language. More than that, it will be very annoying for anyone who has to keep switching keyboard layout between English and their native language.
>> Was it not ment to be: >> var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_ID_Start)) ∪ Pc
> Yes. I made a mistake there.
>> More restricted variable names >> ------------------------------
>> Nevertheless, I would like a slightly more conservative change in how Erlang >> should use Unicode in variable names and unquoted atoms.
>> I want to be able to read printed source code on a paper and at least >> understand if Ƽ = count() has a variable, an atom or an integer to the left. >> This is an impossible goal because we can today e.g Cyrillic А in any .erl >> file and that will look as it should compile but it will not.
> I am a little puzzled here. U+0410 (CYRILLIC CAPITAL LETTER A) looks > like this: А. I grant you that it is somewhere between exceptionally > difficult and impossible to tell an A from an А from an Α (Latin > capital A, Cyrillic, and Greek respectively). But they are all capital > letters. The point of the proposal is that since А (U+0410) is a > capital letter, А = count() _should_ compile.
> If the example had been U+1EFD ỽ (LATIN SMALL LETTER MIDDLE-WELSH V) > that would have been hard to tell from a six, true. > But I don't see how this is any different from the fact that in a script > you don't know, you cannot tell _what_ a character is. > For example, I had a student this year whose native language was I > believe Malayalam. I can't tell a Malayalam letter from a digit from > a punctuation mark.
> Did you mean U+0417 (CYRILLIC CAPITAL LETTER ZE) "З", which resembles 3?
> Ah! Emacs to the rescue. It's the LATIN CAPITAL LETTER TONE FIVE. > Nothing to do with Cyrillic.
> Reverting to the Middle Welsh letter, if I cannot tell a small letter > from a digit, does that mean that every unquoted atom should begin > with an English letter? (I cannot say "a Latin letter", because > ỽ _is_ a member of the extended Latin script.)
> No, I'm sorry. This is ridiculous. Expecting everybody to begin > _their_ variables which you will almost certainly never see to begin > with an ASCII letter so _you_ can tell this from that; what sense does > that make? If it is in a script you cannot read, then you cannot read it.
> Can we just try, for a minute or to, to entertain a rather wild idea? > Here's the idea: most programmers are adults. They can make informed > choices. If they *want* you to read their code, they are smart enough > to write in a script you can read. If they decide that it's more > important to them that _they_ can read comfortably, that's their > decision to make. If you want a Malayalam-speaker to write code for > you, put the language (English, Finnish, whatever) in the contract.
> I have a confession to make. My multiple-programming-languages to > multiple-styled-output-formats tool is currently Latin-1 only. > That's because it's for _me_; nobody paid me to write it and I didn't > expect anyone else to find it useful (although someone did). It can, > for example, be configured to generate HTML, and it can be made to > wrap keywords in <B> and could as easily wrap variables in <U>. It > would probably take me about a week to revised the thing to use > Unicode. So then I'd have a tool that could generate printed listings > with variables underlined, without needing to slap untold numbers of > people in the face with the notion that they are and must remain > second-class world citizens.
>> So I have to change that requirement into; if it compiles I want to be able >> to tell from a noncolour printed source code listing what the semantics is.
> You are, in fact, proposing a backwards-incompatible change to Erlang, > in order to achieve a goal which is not in general achievable, and not > in my view worth achieving if you could.
> Let's be realistic here. If you cannot read any of the words, it is not > going to do you any good to tell the variables from the atoms from the > numbers. Let's take an example. I took a snippet of Erlang out of > the Erlang/OTP release and transliterated the English letters to > Russian ones. If you _don't_ read the Cyrillic script, precisely what > good does it do you to know which are the variables? If you _do_ read > the Cyrillic script, this will seem to you to be complete gibberish, > so imagine it's a language you don't know.
> I don't know about you, but I wouldn't dare to touch this. > It DOES NOT MATTER TO me which words are variables and which > are not, because that knowledge is not useful to me.
> (By the way, it should now be clear that in a context like this > you'll _know_ that something is a Cyrillic capital A because > everything else is Cyrillic -- there are no capital letters in > keywords -- so what would a Latin capital A be doing there?)
> Does that mean there will be Erlang files that I cannot read and > Raimo Niskanen cannot read? Certainly it does. Does that mean a > big problem for us? No. Nobody is going to _expect_ us to read > it. If someone ships us source code we can't read we shan't use > it.
> Is this a NEW problem? No. It is already possible to use some > surprising languages in ASCII (Klingon, Ancient Egyptian, Greek > with a little ingenuity, ...) so ever since Erlang began, we've > had the possibility of entire files being written in words that > we did not understand. If you don't know what the *functions* > are about, what good does it do you to know which tokens are > variables?
> I once had to maintain a large chunk of Prolog written by a > very clever programmer whose idea of good variable naming > style came from old BASIC (one letter, or one letter and one > digit). I could see _which_ tokens were the variables, but > not _what_ the variable names meant. I had to figure it out > from the predicate names. So from actual experience I can > tell you
> JUST KNOWING WHICH TOKENS ARE VARIABLES IS > NEXT TO USELESS.
>> I think it is better to restrict to a subset of 7-bit US-ASCII.
> Yeah! Let's make Erlang ASCII-only! (Too bad about my father's > middle name: Æneas. Perfectly good English name, from Latin.)
>> Decent >> editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which >> character is under the cursor and if it is A..Z or _ under U+7F it is a >> variable start.
> I'm using Aquamacs. > From the Aquamacs help: > Emacs buffers and strings support a large repertoire of > characters from many different scripts, allowing users to > type and display text in almost any known written language.
> To support this multitude of characters and scripts, > Emacs closely follows the Unicode Standard. > It's Meta-X describe-char, not Ctrl-X describe-char, > and it works perfectly with Unicode characters. > Here's sample output:
> character: Ҳ (1202, #o2262, #x4b2) > preferred charset: unicode (Unicode (ISO10646)) > code point: 0x04B2 > syntax: w which means: word > category: .:Base, y:Cyrillic > buffer code: #xD2 #xB2 > file code: #xD2 #xB2 (encoded by coding system utf-8) > display: by this font (glyph code) > nil:-apple-Lucida_Grande-medium-normal-normal-*-13-*-*-*-p-0-iso10646-1 (#x8A3)
> Character code properties: customize what to show > name: CYRILLIC CAPITAL LETTER HA WITH DESCENDER > old-name: CYRILLIC CAPITAL LETTER KHA WITH RIGHT DESCENDER > general-category: Lu (Letter, Uppercase)
> Trying this in Vim, it tells me what the numeric codes > of a letter are, but not that it is a letter.
>> The underscore >> --------------
>> I would like to argue against allowing all Unicode general category Pc >> (Connector_Punctuation) characters in place of "_". This class contains >> in Unicode 6.2 these characters: >> U+5F; LOW LINE >> U+203F; UNDERTIE >> U+2040; CHARACTER TIE >> U+2054; INVERTED UNDERTIE >> U+FE33; PRESENTATION FORM FOR VERTICAL LOW LINE >> U+FE34; PRESENTATION FORM FOR VERTICAL WAVY LOW
On Thu, Nov 01, 2012 at 06:27:10PM +1300, Richard O'Keefe wrote:
> On 1/11/2012, at 3:44 AM, Raimo Niskanen wrote: : :
> > More restricted variable names > > ------------------------------
> > Nevertheless, I would like a slightly more conservative change in how Erlang > > should use Unicode in variable names and unquoted atoms.
> > I want to be able to read printed source code on paper and at least > > understand if Ƽ = count() has a variable, an atom or an integer to the left. > > This is an impossible goal because today we can put e.g. a Cyrillic А in any .erl > > file, and it will look as if it should compile, but it will not.
> I am a little puzzled here. U+0410 (CYRILLIC CAPITAL LETTER A) looks > like this: А. I grant you that it is somewhere between exceptionally > difficult and impossible to tell an A from an А from an Α (Latin > capital A, Cyrillic, and Greek respectively). But they are all capital > letters. The point of the proposal is that since А (U+0410) is a > capital letter, А = count() _should_ compile.
I think that point, which is a good one, did not come through in the proposal, but your updated version has a very good rationale that makes it clearer.
> If the example had been U+1EFD ỽ (LATIN SMALL LETTER MIDDLE-WELSH V) > that would have been hard to tell from a six, true. > But I don't see how this is any different from the fact that in a script > you don't know, you cannot tell _what_ a character is. > For example, I had a student this year whose native language was I > believe Malayalam. I can't tell a Malayalam letter from a digit from > a punctuation mark.
> Did you mean U+0417 (CYRILLIC CAPITAL LETTER ZE) "З", which resembles 3?
> Ah! Emacs to the rescue. It's the LATIN CAPITAL LETTER TONE FIVE. > Nothing to do with Cyrillic.
Sorry, I mixed examples here and pushed you onto a side track. The TONE FIVE was an example of not knowing the symbol's general category. The Cyrillic A was an example of a glyph that looks similar to A in US-ASCII.
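The two examples can be separated mechanically. The following is a minimal sketch in Python (not part of the original exchange) that uses the standard `unicodedata` module to do roughly what Emacs' describe-char does; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point, general category and Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}"

# Latin A, Cyrillic A, Greek Alpha (the look-alikes) and TONE FIVE.
for ch in ["A", "\u0410", "\u0391", "\u01BC"]:
    print(describe(ch))
```

All four characters report the general category Lu, so the look-alike problem (A, А, Α) and the unknown-category problem (Ƽ) are indeed two separate issues.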
> Reverting to the Middle Welsh letter, if I cannot tell a small letter > from a digit, does that mean that every unquoted atom should begin > with an English letter? (I cannot say "a Latin letter", because > ỽ _is_ a member of the extended Latin script.)
> No, I'm sorry. This is ridiculous. Expecting everybody to begin > _their_ variables which you will almost certainly never see to begin > with an ASCII letter so _you_ can tell this from that; what sense does > that make? If it is in a script you cannot read, then you cannot read it.
> Can we just try, for a minute or two, to entertain a rather wild idea? > Here's the idea: most programmers are adults. They can make informed > choices. If they *want* you to read their code, they are smart enough > to write in a script you can read. If they decide that it's more > important to them that _they_ can read comfortably, that's their > decision to make. If you want a Malayalam-speaker to write code for > you, put the language (English, Finnish, whatever) in the contract.
> I have a confession to make. My multiple-programming-languages to > multiple-styled-output-formats tool is currently Latin-1 only. > That's because it's for _me_; nobody paid me to write it and I didn't > expect anyone else to find it useful (although someone did). It can, > for example, be configured to generate HTML, and it can be made to > wrap keywords in <B> and could as easily wrap variables in <U>. It > would probably take me about a week to revise the thing to use > Unicode. So then I'd have a tool that could generate printed listings > with variables underlined, without needing to slap untold numbers of > people in the face with the notion that they are and must remain > second-class world citizens.
> > So I have to change that requirement into; if it compiles I want to be able > > to tell from a noncolour printed source code listing what the semantics is.
> You are, in fact, proposing a backwards-incompatible change to Erlang, > in order to achieve a goal which is not in general achievable, and not > in my view worth achieving if you could.
> Let's be realistic here. If you cannot read any of the words, it is not > going to do you any good to tell the variables from the atoms from the > numbers. Let's take an example. I took a snippet of Erlang out of > the Erlang/OTP release and transliterated the English letters to > Russian ones. If you _don't_ read the Cyrillic script, precisely what > good does it do you to know which are the variables? If you _do_ read > the Cyrillic script, this will seem to you to be complete gibberish, > so imagine it's a language you don't know.
So here is what seems to be the core question:
I say I want to be able to see the difference between a variable and an unquoted atom even if I cannot make sense of the variable and atom names. I say it would be possible to achieve this by enforcing a small set of first letters for variables. Then we would require a variable to start with a US-ASCII capital, "_" or "@".
You say that goal of mine is a lost cause because being able to tell the difference between a variable and an atom will not be of any use to me anyway. And trying to achieve this by making backwards-incompatible changes is plain ridiculous.
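To make the two positions concrete, here is a hedged Python sketch of both rules. The function names are hypothetical, and the category-based rule is a deliberate simplification (Lu, Lt or Pc as the first character), not the exact definition in EEP 40.

```python
import unicodedata

def starts_variable_ascii(token: str) -> bool:
    """Restricted rule: a variable starts with an ASCII capital, "_" or "@"."""
    if not token:
        return False
    first = token[0]
    return first in "_@" or "A" <= first <= "Z"

def starts_variable_by_category(token: str) -> bool:
    """Simplified category rule (an assumption): first character is Lu, Lt or Pc."""
    if not token:
        return False
    return unicodedata.category(token[0]) in {"Lu", "Lt", "Pc"}

for tok in ["Count", "count", "_x", "\u0410\u0431\u0432"]:
    print(tok, starts_variable_ascii(tok), starts_variable_by_category(tok))
```

The two rules agree on pure-ASCII tokens and differ only on tokens such as the Cyrillic one, which is exactly the case under dispute.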
Fair enough.
Just adding "@" to the current set of characters allowed to start a variable would not be a backwards-incompatible change, would it? But it would be ugly to allow some Latin capitals while not allowing the Latin extended, Cyrillic, etc.
> I don't know about you, but I wouldn't dare to touch this. > It DOES NOT MATTER TO me which words are variables and which > are not, because that knowledge is not useful to me.
> (By the way, it should now be clear that in a context like this > you'll _know_ that something is a Cyrillic capital A because > everything else is Cyrillic -- there are no capital letters in > keywords -- so what would a Latin capital A be doing there?)
> Does that mean there will be Erlang files that I cannot read and > Raimo Niskanen cannot read? Certainly it does. Does that mean a > big problem for us? No. Nobody is going to _expect_ us to read > it. If someone ships us source code we can't read we shan't use > it.
> Is this a NEW problem? No. It is already possible to use some > surprising languages in ASCII (Klingon, Ancient Egyptian, Greek > with a little ingenuity, ...) so ever since Erlang began, we've > had the possibility of entire files being written in words that > we did not understand. If you don't know what the *functions* > are about, what good does it do you to know which tokens are > variables?
> I once had to maintain a large chunk of Prolog written by a > very clever programmer whose idea of good variable naming > style came from old BASIC (one letter, or one letter and one > digit). I could see _which_ tokens were the variables, but > not _what_ the variable names meant. I had to figure it out > from the predicate names. So from actual experience I can > tell you
> JUST KNOWING WHICH TOKENS ARE VARIABLES IS > NEXT TO USELESS.
You have a point. Now it is clearer to me.
> > I think it is better to restrict to a subset of 7-bit US-ASCII.
> Yeah! Let's make Erlang ASCII-only! (Too bad about my father's > middle name: Æneas. Perfectly good English name, from Latin.)
I was of course talking about the start of a variable, not the entire language. I am not that stupid. His variable could be __Æneas, or @Æneas (the latter is unreadable).
> > Decent > > editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which > > character is under the cursor and if it is A..Z or _ under U+7F it is a > > variable start.
> I'm using Aquamacs. > From the Aquamacs help: > Emacs buffers and strings support a large repertoire of > characters from many different scripts, allowing users to > type and display text in almost any known written language.
> To support this multitude of characters and scripts, > Emacs closely follows the Unicode Standard. > It's Meta-X describe-char, not Ctrl-X describe-char,
> I've looked through the proposal and don't understand why there is no proposal to add localized keywords?
Because that's actually an orthogonal concern.
Suppose for example that you want
essayez mapped to try
... ...
attrapez catch
... ...
fin end
This has nothing to do with the character set.
The classic way to handle keywords in a tokeniser is FIRST to
recognise them (using an automatically generated or hand coded
deterministic finite state machine) as identifiers and LATER
to look them up in a table (possibly using perfect hashing) to
see if they are keywords.
There is no point in allowing people to plug Serbian keywords
into a table if they will never be recognised as identifiers to
start with. We have to get that part right first.
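A minimal Python sketch of that two-stage scheme, under stated assumptions: the regular expressions only approximate identifier syntax, the keyword sets are illustrative, and "покушај" is used merely as a sample Serbian word standing in for a localised keyword.

```python
import re

KEYWORDS = {"try", "catch", "end"}  # illustrative subset, not the full Erlang list

# Stage 1 recognisers: an ASCII-only one and a Unicode-aware one (both approximate).
ASCII_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_@]*")
UNICODE_IDENT = re.compile(r"[^\W\d][\w@]*")

def classify(word, ident=UNICODE_IDENT, keywords=KEYWORDS):
    """Stage 1: is it an identifier at all?  Stage 2: table lookup for keywords."""
    if ident.fullmatch(word) is None:
        return "unrecognised"
    return "keyword" if word in keywords else "identifier"

serbian = {"покушај"}  # hypothetical localised keyword table
print(classify("try"))                            # keyword
print(classify("покушај", ASCII_IDENT, serbian))   # unrecognised
print(classify("покушај", UNICODE_IDENT, serbian)) # keyword
```

With the ASCII-only recogniser the table lookup is never reached, which is why the identifier rules have to be settled before any keyword table is useful.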
I have three observations on the general idea.
(1) I have seen Pascal localised in exactly this way.
That was French, which is why I used French in my example.
(2) When I mentioned EEP 40 to a colleague his immediate
reaction was precisely the same, that *obviously* people
should be able to plug their own keywords in too.
(3) Ada and Python have not done this.
Suppose we added a new directive:
-keywords(kw_set_id).
which looked in some path for a file containing
[{'essayez','try'},{'attrapez','catch'},{'fin','end'},...].
and used that to update a dictionary.
Then the lexical analyser could report the English keywords
to the parser. We might want two lists: one for keywords
and one for directives (other than -encoding and -keywords).
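As a rough illustration only (in Python rather than Erlang, with a hypothetical function name), the dictionary step could look like this. The table holds just the three pairs given above; the elided entries are left out.

```python
# Localised keyword -> English keyword, taken from the three pairs in the example.
LOCAL_KEYWORDS = {"essayez": "try", "attrapez": "catch", "fin": "end"}

def to_english(tokens, table=LOCAL_KEYWORDS):
    """Replace localised keywords by their English equivalents; pass the rest through."""
    return [table.get(tok, tok) for tok in tokens]

print(to_english(["essayez", "f", "(", ")", "attrapez", "fin"]))
```

The parser then sees only English keywords, so nothing downstream of the lexical analyser has to change.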
This is NOT an EEP; it is not a draft of an EEP; and I have
no intention of producing an EEP on this topic at this time.
Someone else can write that one.
> Suppose I will be using atoms and variables that are easy to read in my own language. Then I'll definitely be frustrated if I have to write keywords in any other language. More than that, it will be very annoying to anyone who has to switch keyboard layout from English to native.
One of the reasons that I have no intention of writing an EEP about this
is that flicking between two keyboards is for me a single keystroke.
(On the iPad: tap the globe. On the desktop Mac: command space.)
Switching keyboard layouts is about as hard as switching from lower to
upper case and back. It should also be possible to configure your
text editor, perhaps using abbreviation support, to turn
"@es" (or the equivalent in your language) into "try" and so on.
Until you've written your own wrappers around the library components
you use, you'll need to flick back into Latin script to call those
anyway. Such wrappers _can_ be written, so the need to use some
Latin script in everyday work may not continue forever, but it
does mean there has to be a transition period in which people using
non-Latin keyboards have to learn to use Cmd-Space.
>> I've looked through the proposal and don't understand why there is no proposal to add localized keywords?
> Because that's actually an orthogonal concern. > ... > There is no point in allowing people to plug Serbian keywords > into a table if they will never be recognised as identifiers to > start with. We have to get that part right first.
It is like allowing only variable names to be localized but not atoms. It is no use if I cannot write all the text in the language I've chosen.
What about your Māori students? Will you tell them they may write some parts of the program in their language and some other words they have to write in English?
> ... > (3) Ada and Python have not done this.
I don't think that pointing to other bad choices is good.
> Suppose we added a new directive: > -keywords(kw_set_id). > which looked in some path for a file containing > [{'essayez','try'},{'attrapez','catch'},{'fin','end'},...]. > and used that to update a dictionary. > Then the lexical analyser could report the English keywords > to the parser. We might want two lists: one for keywords > and one for directives (other than -encoding and -keywords).
> This is NOT an EEP; it is not a draft of an EEP; and I have > no intention of producing an EEP on this topic at this time. > Someone else can write that one.
>> Suppose I will be using atoms and variables that are easy to read in my own language. Then I'll definitely be frustrated if I have to write keywords in any other language. More than that, it will be very annoying to anyone who has to switch keyboard layout from English to native.
> One of the reasons that I have no intention of writing an EEP about this > is that flicking between two keyboards is for me a single keystroke. > (On the iPad: tap the globe. On the desktop Mac: command space.) > Switching keyboard layouts is about as hard as switching from lower to > upper case and back. It should also be possible to configure your > text editor, perhaps using abbreviation support, to turn > "@es" (or the equivalent in your language) into "try" and so on.
> Until you've written your own wrappers around the library components > you use, you'll need to flick back into Latin script to call those > anyway. Such wrappers _can_ be written, so the need to use some > Latin script in everyday work may not continue forever, but it > does mean there has to be a transition period in which people using > non-Latin keyboards have to learn to use Cmd-Space.
It's not only one shortcut to toggle the layout. It's another layout, and the brain must be switched to that layout too just to type the proper characters. Another problem is bad layout design. The most widely used Russian layout has the Cyrillic letter "С" on the same key as the Latin "C". By the way, while typing only this one letter I made two errors trying to type the symbol " just because I forgot the layout was still Russian. What I want to say is that it is not only the problem of one additional keystroke.
Yes, I'd choose "All or Nothing" option for all this proposal.
> I say I want to be able to see the difference between a variable and an
> unquoted atom even if I can not make sense of the variables and atoms names'.
And I say that I don't see any significant benefit in being able to do this.
I also note that Haskell and Prolog also have identifiers whose properties
depend on the case of their initial letter. In Haskell, "conid"s begin with
a "large" letter and "varid"s begin with a "small" one (section 2.4,
Identifiers and Operators), where they take "_" as a "small" letter so that
it can begin a variable. And they do not require either varids or conids to
begin with an ASCII letter. Nor does SWI Prolog require this:
m% swipl
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 6.1.4)
...
?- Γαμμα = αλπηα.
Γαμμα = αλπηα.
[Meta-X describe-character]
> Yes. I know. I gave the example.
You seemed to be saying that describe-character didn't work
with non-Latin-1 characters. I am sorry to have misunderstood you.
> So in Vim you can easily see if the character is less than 128.
> But not if it is a letter.
>>> and U+FE33 "︳" looks like a vertical bar (I guess intended for
>>> vertical flow Chinese) so they do not resemble "_" very much.
>> Who said they were _supposed_ to resemble "_"?
>> Not me.
> No. I did, because for me that would indicate the character's purpose.
That's rather like saying that the Greeks should stop using
; for questions, because only ? would indicate the character's purpose.
> Sorry I can not find those reasons. I find reasons and agree
> that if we allow more than "_" we should allow all in Pc,
> but I do not see why we need more than "_" other than because
> it is UAX#31's recommendation.
> The wildcard variable is "_" and starting a variable with that
> character has a special meaning to the compiler. Why do we need
> more aliases for that character?
BECAUSE that character has a special meaning,
and the other characters are NOT aliases for it.
Maybe it's not in the EEP, but it certainly was in this mailing list.
Someone was arguing against internationalisation on the grounds that
變量 couldn't be used as a variable name, and to the proposal that
_變量 be used, it was claimed that the compiler would have to treat
this as something that was supposed to occur just once, and so I
pointed out that there are other Pc characters available, so that
⁀變量 or ‿變量 could be used. It wasn't that word, and I think I
didn't mention ⁀. But the point was that we could retain the
current reading of "_" unchanged and begin caseless words used as
variable names with some other Pc character. The idea is that the
other Pc characters would or could be treated differently from "_".
In fact I do prefer that all the Pc characters should be treated
the same, but at the moment the EEP offers both alternatives for
consideration.
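For reference, the full Pc set can be listed mechanically. This is a small Python sketch; the result reflects whatever Unicode version the local Python's `unicodedata` module ships with, which need not be 6.2.

```python
import sys
import unicodedata

# Collect every code point whose general category is Pc (Connector_Punctuation).
pc = [chr(cp) for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == "Pc"]

for ch in pc:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The list includes "_" (U+005F), ‿ (U+203F), ⁀ (U+2040) and the FULLWIDTH LOW LINE (U+FF3F).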
>> It is perfectly acceptable to say "If someone wants to share
>> Erlang code with people in other countries, they should use
>> characters that all those people recognise." In the 21st
>> century it is no longer acceptable to say "nobody may use a
>> character unless I remember what it is."
> I said I want to be able to understand the semantics without
> knowing all characters. Is that a straw man attack?
You cannot even understand the lexical semantics without knowing
the characters. The most primitive level of "understand(ing)
the semantics" I can imagine is being able to answer the question
"Is this sequence of characters legal or not?"
Consider this example: "र॰." (U+0930, U+0970, usual full stop.)
If you were trying to read that from a file, would it be a legal
term?
No. The first character is a letter, but the second character is
classified as a punctuation mark. I only know this because I was
constantly referring to the tables while constructing the example.
It will be instantly obvious, I imagine, to anyone familiar with
the Devanagari script. For that matter, hawaiɁi is or ought to
be a perfectly good atom. That glottal stop letter looked a lot
like a question mark, didn't it? So it might not have _looked_
like an atom, but it would be one.
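The table lookups behind both examples can be reproduced with Python's `unicodedata` module. This is a sketch, and the helper name `categories` is made up for illustration.

```python
import unicodedata

def categories(text):
    """Pair each character's code point with its Unicode general category."""
    return [(f"U+{ord(c):04X}", unicodedata.category(c)) for c in text]

print(categories("\u0930\u0970"))  # the Devanagari example
print(categories("hawai\u0241i"))  # atom containing the glottal stop letter
```

U+0930 comes out as Lo (a letter) and U+0970 as Po (punctuation), while U+0241 is Lu, a letter, despite looking like a question mark.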
If someone gives you an Erlang file written entirely in ASCII,
but using the Klingon language, just how much would it help you
to know where the variables began? (Google Translate offers
translation to Esperanto, why not Klingon? I haven't opened my
copy of the how-to-learn-Klingon book in 20 years. Sigh.)
>> The backwards compatibility issue is that
>> ªº are Lo characters and are not allowed to begin an Erlang atom.
> Would that be an issue? Since they are in Lo should we not start
> allowing them?
I wanted to preserve a somewhat stronger property than any I mentioned,
namely that
"this is a legal Erlang text using Latin-1 characters
under the old rules"
if and only if
"this is a legal Erlang text using Latin-1 characters
under the new rules".
If anyone wants to propose allowing "ªº" at the beginning of an atom
in Latin-1 Erlang, fine. Doesn't bother me. But I wasn't about to
introduce _any_ incompatibility if I could avoid it. In particular,
it seems like a nice thing for the transition period that if you have
an Erlang file that works in Unicode Erlang and happens to include
nothing outside Latin-1 (a trivial mechanical check) it should be
guaranteed to work in Latin-1 Erlang.
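The "trivial mechanical check" could be as small as the following Python sketch. The function name is hypothetical, and the source text is assumed to be already decoded to a string.

```python
def is_latin1_only(source: str) -> bool:
    """True if every character of the decoded source lies in U+0000..U+00FF."""
    return all(ord(ch) <= 0xFF for ch in source)

print(is_latin1_only("Æneas = 1."))      # True
print(is_latin1_only("\u0410 = 1."))     # False: Cyrillic capital A
```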
Oh FLAMING SWEARWORDS. Erlang doesn't currently allow "ªº" anywhere
in an unquoted atom. OK. There are two reasonable alternatives:
Backwards compatible: do not allow "ªº" in identifiers.
UAX#31 compatible: treat "ªº" just like any other Ll characters.
I never thought to check whether Erlang allowed "ªº" at the end of
an identifier because it _obviously_ would. But it doesn't. Sigh.
> Ok. Now I get it. But should it not be the same set after a dot
> as at the start?
Consider
1> X = a.B.
* 1: syntax error before: B
1> X = a._2.
* 1: syntax error before: _2
1> X = a.3.
* 1: syntax error before: 3
1> X = a.b.
'a.b'
That tells us that currently, only Ll characters are allowed
after a dot in the continuation of an identifier. That naturally
generalised to (Ll ∪ Lo). So I made "what can follow a dot" the
same everywhere in an atom. The mental model I had was to think
of dot-followed-by-Ll-or-Lo as a single extended character.
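The "single extended character" model can be sketched as a regular expression. This Python version is an ASCII-only approximation that merely reproduces the four shell observations above; it is not the actual Erlang scanner.

```python
import re

# Lowercase start, ordinary continuation characters, and "." only when it is
# immediately followed by a lowercase letter (dot + letter as one unit).
ATOM = re.compile(r"[a-z][A-Za-z0-9_@]*(?:\.[a-z][A-Za-z0-9_@]*)*")

def is_unquoted_atom(text: str) -> bool:
    return ATOM.fullmatch(text) is not None

for candidate in ["a.B", "a._2", "a.3", "a.b"]:
    print(candidate, is_unquoted_atom(candidate))
```

Only a.b is accepted, matching the shell session. Generalising `[a-z]` after the dot to (Ll ∪ Lo) gives the rule described here.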
>>> I agree that moving a character from Lu or Lt to Other_Id_Start would
> increase the set of atom_start characters.
> For the characters "ªº" you above called that a backwards compatibility
> issue, which I doubt it is.
There is definitely a backwards compatibility issue (whether one can
safely move a new-rules file that is entirely in Latin-1 back to an
old-rules system). Whether it is of any practical significance is
another matter. What's also clear is that I haven't quite got there
yet. One reason for revising the EEP again.
Concerning stability, I did send a message to the Unicode consortium.
I've had an informal response:
An interesting question you raise, which I will pass along
to some people here. I think the short answer is that you
can tailor these things to particular environments, and you
may not be able to rely on any given standard property for
special purposes. Especially if that property is not
formally stable. But I'll see what others say.
There are sufficiently many programming languages that depend on
initial alphabetic case that we may be looking at a revision of
UAX#31. Wouldn't that be fun‽ (Groan.)
>> On 2/11/2012, at 1:36 AM, Dmitry Belyaev wrote:
>>> I've looked through the proposal and don't understand why there is no proposal to add localized keywords?
>> Because that's actually an orthogonal concern. >> ... >> There is no point in allowing people to plug Serbian keywords >> into a table if they will never be recognised as identifiers to >> start with. We have to get that part right first.
> It is like allowing only variable names to be localized but not atoms. It is no use if I cannot write all the text in the language I've chosen.
I did not say "no, never do it". I said "We have to handle Unicode variables and atoms FIRST".
Step 1: recognise and distinguish between variables and atoms-or-keywords.
THAT is what EEP 40 is about.
Step 2: decide which atoms-or-keywords are atoms and which are what keywords.
If you want keywords in Hebrew or Malayalam or whatever, you have to do step 1 first.
For that matter, if you are willing to begin keywords with a special character (as Algol and IMP programmers had to), you can just
right now. (%external %integer %fn %spec ring any bells with my readers?)
To repeat: I am NOT saying NO. I am saying, let's get EEP 40 through *FIRST*. Then you will be able to use ?slučaj (Croatian for 'case') or whatever takes your fancy with _no_ extra support from the Erlang/OTP maintainers right away. You get _that_ much ability to use localised keywords *sooner* than if you put that into EEP 40.
> What about your Māori students? Will you tell them they may write some parts of the program in their language and some other words they have to write in English?
No, I'll tell them about the macro trick.
>> (3) Ada and Python have not done this.
> I don't think that pointing to other bad choices is good.
Considering the huge amount of design work that has gone into Ada revision -- I once printed out a whole bunch of revision documents and stopped when I had a pile 60 cm high and still had a long way to go -- it's not clear how bad a choice it is. As with EEP 40, it's not "no never" to localised keywords, but "this _first_".
There are, after all, such things as preprocessors, and at least keywords are not something you have to name in a debugger in order to trace them or put breakpoints on them, so unlike other identifier mapping, keyword localisation via preprocessor actually works.
> Yes, I'd choose "All or Nothing" option for all this proposal.
EEP 40 is *ORTHOGONAL* to localised keywords. You could have localised (in Latin-1 only) keywords without EEP 40. You could have EEP 40 without localised keywords. You can have both. You can, as I have already said, have EEP 40 AS A STEP TOWARDS localised keywords.
Here's how it goes:
- first one supports alternative encodings, but still accepts only Latin-1 characters.
- next one supports non-Latin-1 characters in comments.
- next one supports non-Latin-1 characters in strings.
- next one supports non-Latin-1 characters in identifiers.
- next one supports non-Latin-1 characters in numbers.
- and at any point along the route one can consider localised keywords.
A Unicode expert has suggested not allowing all of Pc at the
beginning of a variable but just the ASCII and FULLWIDTH
versions of "_". It's not yet clear to me what should be
done in the body of an identifier; allowing precisely these
characters instead of all of Pc is enough for us to begin
with, and we can add the other Pc characters later. Expect
yet another revision next week.
On Fri, Nov 02, 2012 at 11:41:46AM +1300, Richard O'Keefe wrote: > I'm not going to answer every point, because I'm supposed to be marking exams. > That doesn't mean they aren't good points.
Looking forward to later then...
> Next revision of the EEP:
It is now updated and published.
Formally, the EEP updates should go to e...@erlang.org, according to http://www.erlang.org/eep.html. I have missed on procedures by not mailing to that list when accepting this EEP, but that will improve...
> > The wildcard variable is "_" and starting a variable with that > > character has a special meaning to the compiler. Why do we need > > more aliases for that character?
> BECAUSE that character has a special meaning, > and the other characters are NOT aliases for it.
> Maybe it's not in the EEP, but it certainly was in this mailing list. > Someone was arguing against internationalisation on the grounds that > 變量 couldn't be used as a variable name, and to the proposal that > _變量 be used, it was claimed that the compiler would have to treat > this as something that was supposed to occur just once, and so I > pointed out that there are other Pc characters available, so that > ⁀變量 or ‿變量 could be used. It wasn't that word, and I think I > didn't mention ⁀. But the point was that we could retain the > current reading of "_" unchanged and begin caseless words used as > variable names with some other Pc character. The idea is that the > other Pc characters would or could be treated differently from "_".
> In fact I do prefer that all the Pc characters should be treated > the same, but at the moment the EEP offers both alternatives for > consideration.
Ok. I misread it as there was only one suggestion and that was to treat all Pc characters alike. I think it is still somewhat unclear that only treating "_" special _is_ an alternative in the EEP.
Also, I do not clearly see what problem is solved, for someone using fonts with say Arabic letters but not say the undertie, by revising the underscore rule. Bear with me. I have never used a keyboard other than Swedish or English. Is it so that when using such a font there is no Pc character available except for the "_" (and why is that available?) so there must be a possibility to express both non-singleton and maybe-singleton variables using just the "_"?
> You cannot even understand the lexical semantics without knowing > the characters. The most primitive level of "understand(ing) > the semantics" I can imagine is being able to answer the question > "Is this sequence of characters legal or not?"
> Consider this example: "र॰." (U+0930, U+0970, usual full stop.) > If you were trying to read that from a file, would it be a legal > term?
> No. The first character is a letter, but the second character is > classified as a punctuation mark. I only know this because I was > constantly referring to the tables while constructing the example. > It will be instantly obvious, I imagine, to anyone familiar with > the Devanagari script. For that matter, hawaiɁi is or ought to > be a perfectly good atom. That glottal stop letter looked a lot > like a question mark, didn't it? So it might not have _looked_ > like an atom, but it would be one.
I have realized that. I wanted a lesser degree of understanding the lexical semantics: If it passes the compiler (which that example does not) I would like to be able to see which identifiers are variables and which are atoms.
Also, e.g. someone writing a syntax highlighter for Vim would, I guess, appreciate a simple rule for how to recognize a variable.
> If someone gives you an Erlang file written entirely in ASCII, > but using the Klingon language, just how much would it help you > to know where the variables began? (Google Translate offers > translation to Esperanto, why not Klingon? I haven't opened my > copy of the how-to-learn-Klingon book in 20 years. Sigh.)
It would not help much, I agree. But if for example I get a bug report about the compiler or runtime system not doing right for a few lines of Klingon Erlang, it would be helpful to easily distinguish variables from atoms.
> >> The backwards compatibility issue is that > >> ªº are Lo characters and are not allowed to begin an Erlang atom.
> > Would that be an issue? Since they are in Lo should we not start > > allowing them?
> I wanted to preserve a somewhat stronger property than any I mentioned, > namely that > "this is a legal Erlang text using Latin-1 characters > under the old rules" > if and only if > "this is a legal Erlang text using Latin-1 characters > under the new rules".
> If anyone wants to propose allowing "ªº" at the beginning of an atom > in Latin-1 Erlang, fine. Doesn't bother me. But I wasn't about to > introduce _any_ incompatibility if I could avoid it. In particular, > it seems like a nice thing for the transition period that if you have > an Erlang file that works in Unicode Erlang and happens to include > nothing outside Latin-1 (a trivial mechanical check) it should be > guaranteed to work in Latin-1 Erlang.
Ok. Good point. That sounds maybe essential. And now that goal is in the latest version of the EEP. Very good.
> > Ok. Now I get it. But should it not be the same set after a dot > > as at the start?
> Consider > 1> X = a.B. > * 1: syntax error before: B > 1> X = a._2. > * 1: syntax error before: _2 > 1> X = a.3. > * 1: syntax error before: 3 > 1> X = a.b. > 'a.b'
> That tells us that currently, only Ll characters are allowed > after a dot in the continuation of an identifier. That naturally > generalised to (Ll ∪ Lo). So I made "what can follow a dot" the > same everywhere in an atom. The mental model I had was to think > of dot-followed-by-Ll-or-Lo as a single extended character.
Yes. And currently only Ll characters are allowed at the start of an atom. So currently the same set is allowed at the start as after a ".".
Your current suggestion allows a.ª as an unquoted atom since the character after the dot is in Lo, but it is not allowed in Erlang today.
It also allows ᛮᛯᛰ as an atom but not ᛮᛯᛰ.ᛮᛯᛰ since these characters are in Nl (Letter_Number), which is part of XID_Start.
So I think the mental model should be that after a dot there should be as if a new atom was starting.
:
> Concerning stability, I did send a message to the Unicode consortium. I've had an informal response:
>     An interesting question you raise, which I will pass along to some people here. I think the short answer is that you can tailor these things to particular environments, and you may not be able to rely on any given standard property for special purposes. Especially if that property is not formally stable. But I'll see what others say.
> There are sufficiently many programming languages that depend on initial alphabetic case that we may be looking at a revision of UAX#31. Wouldn't that be fun‽ (Groan.)
I think we need an XID_Start_Uppercase and XID_Start_Lowercase, containing Other_ID_Start_Uppercase and Other_ID_Start_Lowercase.
> Remaining points skipped for now.
I especially anticipate a reply about what happens if a character moves from Ll or Lo to Other_ID_Start...
On Fri, Nov 02, 2012 at 11:41:46AM +1300, Richard O'Keefe wrote:
:
> Next revision of the EEP:
> [-- Attachment #2: eep-0040.md --]
> [-- Type: application/octet-stream, Encoding: quoted-printable, Size: 16K --]
I think the EEP should elaborate on normalization. It seems to me that prescribing NFC would be natural, since a file consisting of Latin-1 characters is already in NFC (Normalization Form C, canonical composition).
On the other hand, that would make the atom ﬁ⁵ different from the atom fi5, whereas using NFKC (Normalization Form KC, compatibility composition) would make them equal. I do not know. That ﬁ =:= fi may be good, but that i⁵ =:= i5 may not be. Anyway, normalizing these character sequences in comments or strings is _not_ desirable. If NFKC were an option, it could only be for atoms and variables.
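The NFC versus NFKC difference can be checked mechanically. The following is a small Python sketch using the standard `unicodedata.normalize`; the variable names are made up for illustration, and the escapes stand for the "fi" ligature (U+FB01) and superscript five (U+2075).

```python
import unicodedata

ligature_name = "\ufb01\u2075"   # LATIN SMALL LIGATURE FI + SUPERSCRIPT FIVE
plain_name = "fi5"

# NFC only applies canonical mappings, so the ligature and the
# superscript digit survive and the two names stay different.
nfc_name = unicodedata.normalize("NFC", ligature_name)
print(nfc_name == ligature_name, nfc_name == plain_name)

# NFKC also applies compatibility mappings, folding both characters
# into their plain counterparts.
nfkc_name = unicodedata.normalize("NFKC", ligature_name)
print(nfkc_name == plain_name)

# Every Latin-1 code point is unchanged by NFC ...
latin1_is_nfc = all(unicodedata.normalize("NFC", chr(cp)) == chr(cp)
                    for cp in range(256))
print(latin1_is_nfc)

# ... but NFKC rewrites some of them, for example the ordinal indicators.
print(unicodedata.normalize("NFKC", "\u00aa"),
      unicodedata.normalize("NFKC", "\u00ba"))
```

One thing this suggests: NFKC would not leave every Latin-1 file untouched (ª and º become a and o), which may matter for the compatibility goal discussed earlier.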
> Also I do not clearly see what problem is solved, for someone using fonts with, say, Arabic letters but not, say, the underline, by revising the underscore rule. Bear with me. I have never used a keyboard other than Swedish or English. Is it the case that when using such a font there is no Pc character available except for the "_" (and why is that available?), so there must be a way to express both non-singleton and maybe-singleton variables using just the "_"?
I have only tried the Macintosh interface, where three virtual keyboards are available: "Arabic", "Arabic - PC", and "Arabic - QWERTY". All of them have the underline. ISO 8859-6 (the ISO 8-bit character set for Arabic) includes all of ASCII. However, I am not an expert.
> I have realized that. I wanted a lesser degree of understanding the lexical semantics: If it passes the compiler (which that example does not) I would like to be able to see which identifiers are variables and which are atoms.
> Also, e.g. someone writing a syntax highlighter for Vim, I guess, would appreciate a simple rule for how to recognize a variable.
Well, the EEP gives them _that_. If Vim can highlight Ada and Python and Java correctly, what's the problem? Copy the regular expressions it uses for Java and tinker with them.
>> If someone gives you an Erlang file written entirely in ASCII, but using the Klingon language, just how much would it help you to know where the variables began? (Google Translate offers translation to Esperanto, why not Klingon? I haven't opened my copy of the how-to-learn-Klingon book in 20 years. Sigh.)
> It would not help much, I agree. But if for example I get a bug report about the compiler or runtime system not doing right for a few lines of Klingon Erlang, it would be helpful to easily distinguish variables from atoms.
You don't have to do it by eye. You can use a tool (like the Vim syntax colourer you mention above).
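As an illustration of how small such a tool can be, here is a hedged Python sketch that decides "variable or atom" from the first character alone. It assumes the rule is "a variable starts with an uppercase or titlecase letter (Lu, Lt) or a connector punctuation character (Pc, which includes "_")"; that is a simplification for illustration, not the EEP's exact wording, and `classify_identifier` is a hypothetical name.

```python
import unicodedata


def classify_identifier(token):
    """Guess 'variable' or 'atom' from the first character's category.

    Assumes token is already known to be an unquoted identifier.
    """
    if not token:
        raise ValueError("empty token")
    category = unicodedata.category(token[0])
    if category in ("Lu", "Lt", "Pc"):
        return "variable"
    return "atom"


for name in ["X", "_unused", "foo", "Ärger", "über", "Ωmega", "ωmega"]:
    print(name, classify_identifier(name))
```

A syntax highlighter would need the same decision expressed as a character class in its own pattern language, but the lookup itself is just one general-category test.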
>> Consider
>>     1> X = a.B.
>>     * 1: syntax error before: B
>>     1> X = a._2.
>>     * 1: syntax error before: _2
>>     1> X = a.3.
>>     * 1: syntax error before: 3
>>     1> X = a.b.
>>     'a.b'
>> That tells us that currently, only Ll characters are allowed after a dot in the continuation of an identifier. That naturally generalised to (Ll ∪ Lo). So I made "what can follow a dot" the same everywhere in an atom. The mental model I had was to think of dot-followed-by-Ll-or-Lo as a single extended character.
> Yes. And currently only Ll characters are allowed at the start of an atom. So currently the same set is allowed at the start as after a ".".
> Your current suggestion allows a.ª as an unquoted atom since the character after the dot is in Lo, but it is not allowed in Erlang today.
Oh DRAT!
> It also allows ᛮᛯᛰ as an atom but not ᛮᛯᛰ.ᛮᛯᛰ since these characters are in Nl (Letter_Number), which is part of XID_Start.
Frankly that one doesn't bother me in the least.
> So I think the mental model should be that after a dot it is as if a new atom were starting.
However, since I've got to fix the a.ª bug, I may as well adopt your suggestion. The grammar now reads