EEP 40 - A proposal for Unicode variable and atom names in Erlang.

From: Raimo Niskanen <raimo+erlang-questi...@erix.ericsson.se>
Date: Wed, 31 Oct 2012 15:44:14 +0100
Subject: [erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.
Although there might be opinions on whether allowing Unicode variable
and atom names is a good idea, I would like to discuss EEP 40 itself.
In a previous thread there was much said about Unicode or not but I only
found the following about EEP 40, hoping I did not miss anything valuable:

That was the discussion so far. Here follow my thoughts.

Set notation mistake?
---------------------

I do not understand the BNF definition of variable in the EEP:
    variable ::= var_start var_continue*

    var_start ::= XID_Start ∩ (Lu ∪ Lt ∪ Pc ∪ Other_ID_Start)

    var_continue ::= XID_Continue ∪ "@"

As I read the Unicode XID_Start definition
<http://www.unicode.org/Public/6.2.0/ucd/DerivedCoreProperties.txt>
there are no general category Pc (Connector_Punctuation) characters in
XID_Start, hence there will be none in the set intersection
(which as I understand '∩' should mean) defining var_start. Therefore
U+5F LOW LINE aka '_' Underscore is not allowed to start a variable.

Is there something wrong in that set notation, or what did I misunderstand?

Was it not meant to be:
    var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_ID_Start)) ∪ Pc
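
The difference between the two definitions can be checked mechanically. What follows is a minimal sketch in Python, not anything taken from the EEP. It assumes that Python's `str.isidentifier()`, which accepts XID_Start or "_" as a first character, is an adequate stand-in for XID_Start once "_" is excluded, and it hard-codes the four Other_ID_Start characters of Unicode 6.2. All helper names are made up for illustration.

```python
import unicodedata

# Other_ID_Start as of Unicode 6.2 (assumption: hard-coded, not read from the UCD).
OTHER_ID_START = {"\u2118", "\u212E", "\u309B", "\u309C"}

def xid_start(ch):
    # str.isidentifier() accepts XID_Start or "_" for the first character,
    # so excluding "_" leaves XID_Start (for the Unicode version Python ships).
    return ch != "_" and ch.isidentifier()

def in_classes(ch, with_pc):
    cats = {"Lu", "Lt", "Pc"} if with_pc else {"Lu", "Lt"}
    return unicodedata.category(ch) in cats or ch in OTHER_ID_START

def var_start_as_written(ch):
    # XID_Start ∩ (Lu ∪ Lt ∪ Pc ∪ Other_ID_Start)
    return xid_start(ch) and in_classes(ch, with_pc=True)

def var_start_as_meant(ch):
    # (XID_Start ∩ (Lu ∪ Lt ∪ Other_ID_Start)) ∪ Pc
    return (xid_start(ch) and in_classes(ch, with_pc=False)) \
        or unicodedata.category(ch) == "Pc"

# "_" , Latin A, Latin a, Cyrillic capital A
for ch in ["_", "A", "a", "\u0410"]:
    print(repr(ch), var_start_as_written(ch), var_start_as_meant(ch))
```

Under these assumptions "_" is rejected by the definition as written and accepted by the regrouped one, while capital letters are accepted by both.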

More restricted variable names
------------------------------

Nevertheless, I would like a slightly more conservative change in how Erlang
should use Unicode in variable names and unquoted atoms.

I want to be able to read printed source code on a paper and at least
understand if Ƽ = count() has a variable, an atom or an integer to the left.
This is an impossible goal because today we can put e.g. Cyrillic А in any .erl
file and it will look as if it should compile but it will not.

So I have to change that requirement into; if it compiles I want to be able
to tell from a noncolour printed source code listing what the semantics is.

Therefore I think a more conservative rule for variable start is needed:
    variable ::= var_start var_continue*

    var_start ::= ("A".."Z" ∪ "_")

    var_continue ::= XID_Continue ∪ "@"

I hereby ditch the characters "À".."Ö" ∪ "Ø".."Þ" that are allowed today since
if they are allowed there is no telling which of all accents are allowed
and so we have to allow all LATIN CAPITAL and therefore all GREEK, CYRILLIC,
ARMENIAN, GEORGIAN, GLAGOLITIC, COPTIC and DESERET CAPITAL letters,
and that is too big a set for a human to handle. Tools would become
essential.
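
The conservative rule is small enough to sketch directly. The Python below is an illustration only: `is_variable` and `xid_continue` are hypothetical names, and XID_Continue is approximated with `str.isidentifier()`, which tests every character after the first against XID_Continue for whatever Unicode version Python ships.

```python
import string

# var_start ::= ("A".."Z" ∪ "_")
ASCII_VAR_START = set(string.ascii_uppercase) | {"_"}

def xid_continue(ch):
    # Prefixing an ASCII letter makes isidentifier() test ch as XID_Continue.
    return ("a" + ch).isidentifier()

def is_variable(token):
    # variable ::= var_start var_continue*
    # var_continue ::= XID_Continue ∪ "@"
    if not token or token[0] not in ASCII_VAR_START:
        return False
    return all(ch == "@" or xid_continue(ch) for ch in token[1:])

# Last three: LATIN CAPITAL LETTER TONE FIVE, "À" + b, CYRILLIC CAPITAL LETTER A
for token in ["Count", "_隠者", "X@1", "\u01BC", "\u00C0b", "\u0410"]:
    print(token, is_variable(token))
```

With this rule the first character alone decides, so a reader only has to recognise 27 ASCII characters to spot a variable.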

I think it is better to restrict to a subset of 7-bit US-ASCII. Decent
editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which
character is under the cursor and if it is A..Z or _ under U+7F it is a
variable start. That is a possible set to memorize even for non-English
programmers especially considering all reserved words are in 7-bit US-ASCII
and hence Erlang programmers must be somewhat familiar with that charset.

Removing the Latin-1 characters > 128 will need warnings in one release
and introduction in a later one, and probably a non-Unicode compile flag. But I do not
think that many have used such characters to start variables so far.

We can then define mst_variable (maybe singleton variable) much like
in the proposed EEP:
    mst_variable ::= mst_var_start var_continue*

    mst_var_start ::= "_" ("A".."Z" ∪ "a".."z" ∪ "0".."9" ∪ "_" ∪ "@")

An alternative suggestion is to allow "@" as var_start:
    variable ::= var_start var_continue*
    var_start ::= ("A".."Z" ∪ "_" ∪ "@")

which requires no change from today for maybe singleton variables:
    mst_var_start ::= "_"

I can not think of anything particularly bad with allowing @隠者 as a
variable name. The "@" makes it distinct from an atom, and "@" is
one of the variable prefix characters in Perl (good or bad?!).

The underscore
--------------

I would like to argue against allowing all Unicode general category Pc
(Connector_Punctuation) characters in place of "_". This class contains
in Unicode 6.2 these characters:
    U+5F;   LOW LINE
    U+203F; UNDERTIE
    U+2040; CHARACTER TIE
    U+2054; INVERTED UNDERTIE
    U+FE33; PRESENTATION FORM FOR VERTICAL LOW LINE
    U+FE34; PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
    U+FE4D; DASHED LOW LINE
    U+FE4E; CENTRELINE LOW LINE
    U+FE4F; WAVY LOW LINE
    U+FF3F; FULLWIDTH LOW LINE

Of these at least U+2040 "⁀" is horizontal at the top of the line
and U+FE33 "︳" looks like a vertical bar (I guess intended for
vertical flow Chinese) so they do not resemble "_" very much.
Allowing all these would make it hard to remember if a given
character is category Pc or something else e.g "|". Therefore
I think it will be enough to allow U+5F LOW LINE ("_", underscore).
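
The members of Pc can be listed by program rather than copied by hand. A minimal sketch in Python follows. It reports the Unicode version bundled with the Python build in use (see `unicodedata.unidata_version`), which is an assumption and need not be 6.2.

```python
import sys
import unicodedata

# Collect every code point whose general category is Pc (Connector_Punctuation).
pc = [cp for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == "Pc"]

print("Unicode", unicodedata.unidata_version)
for cp in pc:
    print("U+%04X; %s" % (cp, unicodedata.name(chr(cp))))
```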

An Erlang programmer will have to be able to enter many other
7-bit US-ASCII punctuation characters e.g ".,?:;%'" so
the underscore should pose no particular problem.

Unquoted atoms
--------------

The EEP proposes:
    atom_start ::= XID_Start ∖ (Lu ∪ Lt ∪ Lo ∪ Pc)
        | "." (Ll ∪ Lo)

I agree that Lu (Uppercase_Letter) and Lt (Titlecase_Letter) should
be excluded so an atom can not start with a capital looking letter,
but Pc ∩ XID_Start = ∅ so there is no reason to subtract it, and why
subtract Lo (Other_Letter)?

There also seems to be a typo in the definition of unquoted_atom
where an iteration of atom_continue is missing.

I propose:
    unquoted_atom ::= atom_start atom_continue*

    atom_start ::= atom_start_char
        | "." atom_start_char

    atom_start_char ::= XID_Start ∖ (Lu ∪ Lt)

    atom_continue ::= XID_Continue ∪ "@"
        | "." XID_Continue

General explanation
-------------------

I think the EEP could benefit from explaining more about the character
classes used, what kind of stability annex #31 is designed to give and such.

When I read the EEP it took several days of Unicode standard reading to
start understanding, and I think many hesitate before trying to understand
the EEP, which is a pity.

My first concern was whether code I write for one Unicode Erlang release
will still be valid for subsequent Erlang releases
based on later Unicode standards. It seems annex #31 is very much targeted
at solving that problem, and Unicode in itself is much about stability in
subsequent standards, so that problem seems handled, but I am not sure yet.

For example the EEP and my proposal both ...

From: "Richard O'Keefe" <o...@cs.otago.ac.nz>
Date: Thu, 1 Nov 2012 18:27:10 +1300
Subject: Re: [erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

[Attachment: eep-0040.md, 14K]

On 1/11/2012, at 3:44 AM, Raimo Niskanen wrote:

> Was it not ment to be:
>    var_start ::= (XID_Start ∩ (Lu ∪ Lt ∪ Other_ID_Start)) ∪ Pc

Yes.  I made a mistake there.

> More restricted variable names
> ------------------------------

> Nevertheless, I would like a slightly more conservative change in how Erlang
> should use Unicode in variable names and unquoted atoms.

> I want to be able to read printed source code on a paper and at least
> understand if Ƽ = count() has a variable, an atom or an integer to the left.
> This is an impossible goal because we can today e.g Cyrillic А in any .erl
> file and that will look as it should compile but it will not.

I am a little puzzled here.  U+0410 (CYRILLIC CAPITAL LETTER A) looks
like this:  А.  I grant you that it is somewhere between exceptionally
difficult and impossible to tell an A from an А from an Α (Latin
capital A, Cyrillic, and Greek respectively).  But they are all capital
letters.  The point of the proposal is that since А (U+0410) is a
capital letter, А = count() _should_ compile.

If the example had been U+1EFD ỽ (LATIN SMALL LETTER MIDDLE-WELSH V)
that would have been hard to tell from a six, true.
But I don't see how this is any different from the fact that in a script
you don't know, you cannot tell _what_ a character is.
For example, I had a student this year whose native language was I
believe Malayalam.  I can't tell a Malayalam letter from a digit from
a punctuation mark.

Did you mean U+0417 (CYRILLIC CAPITAL LETTER ZE) "З", which resembles 3?

Ah!  Emacs to the rescue.  It's the LATIN CAPITAL LETTER TONE FIVE.
Nothing to do with Cyrillic.
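
What describe-char reports here can also be obtained outside an editor. A small sketch using Python's standard `unicodedata` module follows. The `describe` helper is a made-up name, and the answers come from the Unicode version bundled with Python.

```python
import unicodedata

def describe(ch):
    # Code point, general category and character name, much like describe-char.
    return "U+%04X %s %s" % (ord(ch), unicodedata.category(ch), unicodedata.name(ch))

# TONE FIVE, Latin A, Cyrillic A, Greek Alpha, Cyrillic ZE, Middle-Welsh V
for ch in ["\u01BC", "A", "\u0410", "\u0391", "\u0417", "\u1EFD"]:
    print(describe(ch))
```

The general category in the second column (Lu or Ll) is what a Unicode-aware scanner would use to choose between variable and atom, whatever the glyph looks like.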

Reverting to the Middle Welsh letter, if I cannot tell a small letter
from a digit, does that mean that every unquoted atom should begin
with an English letter?  (I cannot say "a Latin letter", because
ỽ _is_ a member of the extended Latin script.)

No, I'm sorry.  This is ridiculous.  Expecting everybody to begin
_their_ variables, which you will almost certainly never see, with
an ASCII letter so _you_ can tell this from that; what sense does
that make?  If it is in a script you cannot read, then you cannot read it.

Can we just try, for a minute or two, to entertain a rather wild idea?
Here's the idea:  most programmers are adults.  They can make informed
choices.  If they *want* you to read their code, they are smart enough
to write in a script you can read.  If they decide that it's more
important to them that _they_ can read comfortably, that's their
decision to make.  If you want a Malayalam-speaker to write code for
you, put the language (English, Finnish, whatever) in the contract.

I have a confession to make.  My multiple-programming-languages to
multiple-styled-output-formats tool is currently Latin-1 only.
That's because it's for _me_; nobody paid me to write it and I didn't
expect anyone else to find it useful (although someone did).  It can,
for example, be configured to generate HTML, and it can be made to
wrap keywords in <B> and could as easily wrap variables in <U>.  It
would probably take me about a week to revise the thing to use
Unicode.  So then I'd have a tool that could generate printed listings
with variables underlined, without needing to slap untold numbers of
people in the face with the notion that they are and must remain
second-class world citizens.

> So I have to change that requirement into; if it compiles I want to be able
> to tell from a noncolour printed source code listing what the semantics is.

You are, in fact, proposing a backwards-incompatible change to Erlang,
in order to achieve a goal which is not in general achievable, and not
in my view worth achieving if you could.

Let's be realistic here.  If you cannot read any of the words, it is not
going to do you any good to tell the variables from the atoms from the
numbers.  Let's take an example.  I took a snippet of Erlang out of
the Erlang/OTP release and transliterated the English letters to
Russian ones.  If you _don't_ read the Cyrillic script, precisely what
good does it do you to know which are the variables?  If you _do_ read
the Cyrillic script, this will seem to you to be complete gibberish,
so imagine it's a language you don't know.

ҵӄҽҲӃҸҾҽ({ҵӄҽҲӃҸҾҽ,ҝҰҼҴ,ҐӁҸӃӈ,Ґӂ0,ҥұ,ҥҳұ}, ҐӃҾҼҜҾҳ, ҢӃ0) ->
    try
        {ҐӂҼ,ҔҽӃӁӈқҰұҴһ,ҢӃ} = ҲҶ_ҵӄҽ(ҥұ, Ґӂ0, ҥҳұ, ҐӃҾҼҜҾҳ, {ҝҰҼҴ,ҐӁҸӃӈ}, ҢӃ0),
        ҕӄҽҲ = {ҵӄҽҲӃҸҾҽ,ҝҰҼҴ,ҐӁҸӃӈ,ҔҽӃӁӈқҰұҴһ,ҐӂҼ},
        {ҕӄҽҲ,ҢӃ}
    catch
        ҒһҰӂӂ:ҔӁӁҾӁ ->
            ҢӃҰҲҺ = ҴӁһҰҽҶ:ҶҴӃ_ӂӃҰҲҺӃӁҰҲҴ(),
            ҸҾ:ҵӆӁҸӃҴ("ҕӄҽҲӃҸҾҽ: ~ӆ/~ӆ\ҽ", [ҝҰҼҴ,ҐӁҸӃӈ]),
            ҴӁһҰҽҶ:ӁҰҸӂҴ(ҒһҰӂӂ, ҔӁӁҾӁ, ҢӃҰҲҺ)
    end.

ҲҶ_ҵӄҽ(қҴӂ, җӅӂ, ҥҳұ, ҐӃҾҼҜҾҳ, ҝҰҼҴҐӁҸӃӈ, ҢӃ0) ->
    {ҕҸ,ҢӃ1} = ҽҴӆ_һҰұҴһ(ҢӃ0),
    {ҕһ,ҢӃ2} = һҾҲҰһ_ҵӄҽҲ_һҰұҴһ(ҝҰҼҴҐӁҸӃӈ, ҢӃ1),

    ґҴҵ = ҲһҴҰӁ_ҳҴҰҳ(#ӂӁ{ӁҴҶ=ҵҾһҳһ(fun ({ӅҰӁ,ҥ}, ҡҴҶ) ->
                                           ҿӄӃ_ӁҴҶ(ҥ, ҡҴҶ)
                                     end, [], җӅӂ),
                        ӂӃҺ=[]}, 0, ҥҳұ),
    {ґ2,_ҐҵӃ,ҢӃ3} = ҲҶ_һҸӂӃ(қҴӂ, 0, ҥҳұ, ґҴҵ,
       ҢӃ2#ҲҶ{ұӃӈҿҴ=ҴӇҸӃ,ұҵҰҸһ=ҕҸ,ҵҸҽҵҾ=ҕҸ,Ҹӂ_ӃҾҿ_ұһҾҲҺ=ӃӁӄҴ}),
    {ҝҰҼҴ,ҐӁҸӃӈ} = ҝҰҼҴҐӁҸӃӈ,
    Ґ = [{һҰұҴһ,ҕҸ},{ҵӄҽҲ_ҸҽҵҾ,ҐӃҾҼҜҾҳ,{ҰӃҾҼ,ҝҰҼҴ},ҐӁҸӃӈ},
         {һҰұҴһ,ҕһ}|ґ2],
    {Ґ,ҕһ,ҢӃ3}.

I don't know about you, but I wouldn't dare to touch this.
It DOES NOT MATTER TO me which words are variables and which
are not, because that knowledge is not useful to me.

(By the way, it should now be clear that in a context like this
you'll _know_ that something is a Cyrillic capital A because
everything else is Cyrillic -- there are no capital letters in
keywords -- so what would a Latin capital A be doing there?)

Does that mean there will be Erlang files that I cannot read and
Raimo Niskanen cannot read?  Certainly it does. Does that mean a
big problem for us?  No.  Nobody is going to _expect_ us to read
it.  If someone ships us source code we can't read we shan't use
it.

Is this a NEW problem?  No.  It is already possible to use some
surprising languages in ASCII (Klingon, Ancient Egyptian, Greek
with a little ingenuity, ...) so ever since Erlang began, we've
had the possibility of entire files being written in words that
we did not understand.  If you don't know what the *functions*
are about, what good does it do you to know which tokens are
variables?

I once had to maintain a large chunk of Prolog written by a
very clever programmer whose idea of good variable naming
style came from old BASIC (one letter, or one letter and one
digit).  I could see _which_ tokens were the variables, but
not _what_ the variable names meant.  I had to figure it out
from the predicate names.  So from actual experience I can
tell you

        JUST KNOWING WHICH TOKENS ARE VARIABLES IS
        NEXT TO USELESS.

> I think it is better to restrict to a subset of 7-bit US-ASCII.

Yeah!  Let's make Erlang ASCII-only!  (Too bad about my father's
middle name: Æneas.  Perfectly good English name, from Latin.)

> Decent
> editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which
> character is under the cursor and if it is A..Z or _ under U+7F it is a
> variable start.

I'm using Aquamacs.
From the Aquamacs help:
        Emacs buffers and strings support a large repertoire of
        characters from many different scripts, allowing users to
        type and display text in almost any known written language.

        To support this multitude of characters and scripts,
        Emacs closely follows the Unicode Standard.
It's Meta-X describe-char, not Ctrl-X describe-char,
and it works perfectly with Unicode characters.
Here's sample output:

        character: Ҳ (1202, #o2262, #x4b2)
preferred charset: unicode (Unicode (ISO10646))
       code point: 0x04B2
           syntax: w    which means: word
         category: .:Base, y:Cyrillic
      buffer code: #xD2 #xB2
        file code: #xD2 #xB2 (encoded by coding system utf-8)
          display: by this font (glyph code)
    nil:-apple-Lucida_Grande-medium-normal-normal-*-13-*-*-*-p-0-iso10646-1 (#x8A3)

Character code properties: customize what to show
  name: CYRILLIC CAPITAL LETTER HA WITH DESCENDER
  old-name: CYRILLIC CAPITAL LETTER KHA WITH RIGHT DESCENDER
  general-category: Lu (Letter, Uppercase)

Trying this in Vim, it tells me what the numeric codes
of a letter are, but not that it is a letter.

> Of these at least U+2040 "⁀" is horizontal at the top of the line

If it looks horizontal, you have a very poor font.
It's _supposed_ to look more like a c rotated 90 degrees
clockwise and flattened a bit.

> and U+FE33 "︳" looks like a vertical bar (I guess intended for
> vertical flow chinese) so they do not resemble "_" very much.

Who said they were _supposed_ to resemble "_"?
Not me.

I can see your point here, but allowing-all-of-Pc ...

_______________________________________________
erlang-questions mailing list
erlang-questi...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions


 
From: Tim McNamara <paperl...@timmcnamara.co.nz>
Date: Thu, 1 Nov 2012 18:45:10 +1300
Subject: Re: [erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

+1 to ROK's ideas from me.

We should be allowing programmers and programming teams to make their own
decisions regarding which characters to allow within projects. If people
want to play tricks on each other by replacing ASCII chars with visibly
indistinguishable chars from somewhere else, then that's their own
business. We have the technology to be culturally sensitive and responsive.
If someone is willing to invest energy to implement Unicode, we as a
community should not put barriers in front of that.
On Nov 1, 2012 6:27 PM, "Richard O'Keefe" <o...@cs.otago.ac.nz> wrote:

...

From: Dmitry Belyaev <be.dmi...@gmail.com>
Date: Thu, 1 Nov 2012 16:36:13 +0400
Subject: Re: [erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.
I've looked through the proposal and don't understand why there is no proposal to add localized keywords?

Suppose I will be using atoms and variables that are easy to read in my own language. Then I'll definitely be frustrated if I have to write keywords in any other language. More than that, it will be very annoying to anyone who has to switch keyboard layout from English to native.

--
Dmitry Belyaev

On 01.11.2012, at 9:27, Richard O'Keefe wrote:

...

From: Raimo Niskanen <raimo+erlang-questi...@erix.ericsson.se>
Date: Thu, 1 Nov 2012 17:52:39 +0100
Subject: Re: [erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

I think that point, which is a good one, did not come through in the
proposal, but the updated version of yours has a very good
rationale that makes it clearer.

> If the example had been U+1EFD ỽ (LATIN SMALL LETTER MIDDLE-WELSH V)
> that would have been hard to tell from a six, true.
> But I don't see how this is any different from the fact that in a script
> you don't know, you cannot tell _what_ a character is.
> For example, I had a student this year whose native language was I
> believe Malayalam.  I can't tell a Malayalam letter from a digit from
> a punctuation mark.

> Did you mean U+0417 (CYRILLIC CAPITAL LETTER ZE) "З", which resembles 3?

> Ah!  Emacs to the rescue.  It's the LATIN CAPITAL LETTER TONE FIVE.
> Nothing to do with Cyrillic.

Sorry I mixed examples here and pushed you on a side track. The TONE FIVE
was an example of not knowing the symbol's general category. The Cyrillic A
was an example of a similar-looking glyph to A in US-ASCII.

So here is what seems to be the core question:

I say I want to be able to see the difference between a variable and an
unquoted atom even if I can not make sense of the variable and atom names.
I say it would be possible to achieve this by enforcing a small set of first
letters for variables. Then we would require a variable to start with
US-ASCII CAPITAL, "_" or "@".

You say that goal of mine is a lost cause because I will not have any use for
being able to tell the difference between a
variable and an atom anyway. And trying to achieve this by making backwards
incompatible changes is plain ridiculous.

Fair enough.

Just adding "@" to the current set of characters allowed to start a variable
would not be a backwards compatible change, or? But it would be ugly to allow
some Latin capitals while not the Latin extended nor Cyrillic etc.

You have a point. Now it is clearer to me.

> > I think it is better to restrict to a subset of 7-bit US-ASCII.

> Yeah!  Let's make Erlang ASCII-only!  (Too bad about my father's
> middle name: Æneas.  Perfectly good English name, from Latin.)

I was of course talking about the start of a variable, not the
entire language. I am not that stupid. His variable could be
__Æneas, or @Æneas (the latter is unreadable).

> > Decent
> > editors have means (vim: ga, emacs: Ctrl-X describe-char) to show which
> > character is under the cursor and if it is A..Z or _ under U+7F it is a
> > variable start.

> I'm using Aquamacs.
> From the Aquamacs help:
>    Emacs buffers and strings support a large repertoire of
>    characters from many different scripts, allowing users to
>    type and display text in almost any known written language.

>    To support this multitude of characters and scripts,
>    Emacs closely follows the Unicode Standard.
> It's Meta-X describe-char, not Ctrl-X describe-char,

Yes. Meta-X. My mistake.

...

From: "Richard O'Keefe" <o...@cs.otago.ac.nz>
Date: Fri, 2 Nov 2012 10:15:10 +1300
Subject: Re: [erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

On 2/11/2012, at 1:36 AM, Dmitry Belyaev wrote:

> I've looked through the proposal and don't understand why there are no proposal to add localized keywords?

Because that's actually an orthogonal concern.

Suppose for example that you want

        essayez         mapped to       try
            ...                             ...
        attrapez                        catch
            ...                             ...
        fin                             end

This has nothing to do with the character set.

The classic way to handle keywords in a tokeniser is FIRST to
recognise them (using an automatically generated or hand coded
deterministic finite state machine) as identifiers and LATER
to look them up in a table (possibly using perfect hashing) to
see if they are keywords.

There is no point in allowing people to plug Serbian keywords
into a table if they will never be recognised as identifiers to
start with.  We have to get that part right first.

I have three observations on the general idea.
(1) I have seen Pascal localised in exactly this way.
    That was French, which is why I used French in my example.
(2) When I mentioned EEP 40 to a colleague his immediate
    reaction was precisely the same, that *obviously* people
    should be able to plug their own keywords in too.
(3) Ada and Python have not done this.

Suppose we added a new directive:
-keywords(kw_set_id).
which looked in some path for a file containing
[{'essayez','try'},{'attrapez','catch'},{'fin','end'},...].
and used that to update a dictionary.
Then the lexical analyser could report the English keywords
to the parser.  We might want two lists: one for keywords
and one for directives (other than -encoding and -keywords).
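
The two-step approach described above (recognise identifiers first, then look them up in a table) can be sketched as follows. This is an illustration in Python, not Erlang's scanner: the `KEYWORDS` table stands in for the hypothetical `-keywords` file, only three keywords are listed, and the token classes are invented for the example.

```python
import re

# Hypothetical table, as a -keywords file might supply it (French -> Erlang).
KEYWORDS = {"essayez": "try", "attrapez": "catch", "fin": "end"}
ENGLISH = {"try", "catch", "end"}

# \w is Unicode-aware for str patterns, so non-ASCII identifiers are recognised.
# Alternatives: identifier-like word, run of digits, any other non-space character.
TOKEN = re.compile(r"[^\W\d]\w*|\d+|\S")

def tokens(text):
    for word in TOKEN.findall(text):
        if word in KEYWORDS:
            # Localized keyword: report the English keyword to the parser.
            yield ("keyword", KEYWORDS[word])
        elif word in ENGLISH:
            yield ("keyword", word)
        elif word[0].isalpha() or word[0] == "_":
            yield ("identifier", word)
        else:
            yield ("other", word)

print(list(tokens("essayez f() attrapez _ fin")))
```

The lookup step only ever sees words the first step already accepted as identifiers, which is why the identifier rules have to be settled before any keyword table is useful.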

This is NOT an EEP; it is not a draft of an EEP; and I have
no intention of producing an EEP on this topic at this time.
Someone else can write that one.

> Suppose I will be using atoms and variables that are easy to read in my own language. Then I'll definitely be frustrated if I have to write keywords in any other language. More than that, it will be very annoying to anyone who has to switch keyboard layout from English to native.

One of the reasons that I have no intention of writing an EEP about this
is that flicking between two keyboards is for me a single keystroke.
(On the iPad: tap the globe.  On the desktop Mac: command space.)
Switching keyboard layouts is about as hard as switching from lower to
upper case and back.  It should also be possible to configure your
text editor, perhaps using abbreviation support, to turn
"@es" (or the equivalent in your language) into "try" and so on.

Until you've written your own wrappers around the library components
you use, you'll need to flick back into Latin script to call those
anyway.  Such wrappers _can_ be written, so the need to use some
Latin script in everyday work may not continue forever, but it
does mean there has to be a transition period in which people using
non-Latin keyboards have to learn to use Cmd-Space.

From: Dmitry Belyaev <be.dmi...@gmail.com>
Date: Fri, 2 Nov 2012 01:37:42 +0400
Subject: Re: [erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.
Comments inside the quoted text below.

--
Dmitry Belyaev

On 02.11.2012, at 1:15, Richard O'Keefe wrote:

> On 2/11/2012, at 1:36 AM, Dmitry Belyaev wrote:

>> I've looked through the proposal and don't understand why there are no proposal to add localized keywords?

> Because that's actually an orthogonal concern.
> ...
> There is no point in allowing people to plug Serbian keywords
> into a table if they will never be recognised as identifiers to
> start with.  We have to get that part right first.

It is like allowing only variable names to be localized and not atoms. No use if I cannot write all the text in the language I've chosen.

What about your Māori students? Will you tell them they may write some parts of the program in their language and some other words they have to write in English?

> ...
> (3) Ada and Python have not done this.

I don't think that pointing to other bad choices is good.

It's not only one shortcut to toggle the layout. It's another layout and the brain must be switched to that layout too just to type proper characters.
Another problem is bad layout design. The most widely used Russian layout has Cyrillic letter "С" on the same button as Latin "C".
By the way, typing only this one letter I have made two errors while trying to type symbol " just because I forgot the layout was still Russian.
What I want to say is that it is not only the problem of one additional keystroke.

Yes, I'd choose "All or Nothing" option for all this proposal.

From: "Richard O'Keefe" <o...@cs.otago.ac.nz>
Date: Fri, 2 Nov 2012 11:41:46 +1300
Subject: Re: [erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

I'm not going to answer every point, because I'm supposed to be marking exams.
That doesn't mean they aren't good points.

Next revision of the EEP:

[Attachment: eep-0040.md, 16K]

> So here is what seems to be the core question:

> I say I want to be able to see the difference between a variable and an
> unquoted atom even if I can not make sense of the variables and atoms names'.

And I say that I don't see any significant benefit in being able to do this.

I also note that Haskell and Prolog also have identifiers whose properties
depend on the case of their initial letter.  In Haskell, "conid"s begin with
a "large" letter and "varid"s begin with a "small" one (section 2.4,
Identifiers and Operators), where they take "_" as a "small" letter so that
it can begin a variable.  And they do not require either varids or conids to
begin with an ASCII letter.  Nor does SWI Prolog require this:
m% swipl
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 6.1.4)
...
?- Γαμμα = αλπηα.
Γαμμα = αλπηα.

[Meta-X describe-character]

> Yes. I know. I gave the example.

You seemed to be saying that describe-character didn't work
with non-Latin-1 characters.  I am sorry to have misunderstood you.

> So in Vim you can easilly see if the character is less than 128.
> But not if it is a letter.
>>> and U+FE33 "︳" looks like a vertical bar (I guess intended for
>>> vertical flow chinese) so they do not resemble "_" very much.

>> Who said they were _supposed_ to resemble "_"?
>> Not me.

> No. I did, because for me that would indicate the character's purpose.

That's rather like saying that the Greeks should stop using
; for questions, because only ? would indicate the character's purpose.

> Sorry I can not find those reasons. I find reasons and agree
> that if we allow more than "_" we should allow all in Pc,
> but I do not see why we need more than "_" other than because
> it is UAX#31's recommendation.

> The wildcard variable is "_" and starting a variable with that
> character has a special meaning to the compiler. Why do we need
> more aliases for that character?

BECAUSE that character has a special meaning,
and the other characters are NOT aliases for it.

Maybe it's not in the EEP, but it certainly was in this mailing list.
Someone was arguing against internationalisation on the grounds that
變量 couldn't be used as a variable name, and to the proposal that
_變量 be used, it was claimed that the compiler would have to treat
this as something that was supposed to occur just once, and so I
pointed out that there are other Pc characters available, so that
⁀變量 or ‿變量 could be used.  It wasn't that word, and I think I
didn't mention ⁀.  But the point was that we could retain the
current reading of "_" unchanged and begin caseless words used as
variable names with some other Pc character.  The idea is that the
other Pc characters would or could be treated differently from "_".

In fact I do prefer that all the Pc characters should be treated
the same, but at the moment the EEP offers both alternatives for
consideration.
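For readers who want to see which characters are under discussion, here is a minimal sketch that lists everything in general category Pc. It assumes Python's standard `unicodedata` module, whose Unicode version depends on the Python release, so the exact list may differ from the Unicode 6.2 tables cited in this thread. The function name is made up for illustration.

```python
import sys
import unicodedata


def connector_punctuation():
    """Return every character whose general category is Pc."""
    return [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Pc"]


if __name__ == "__main__":
    # Print code point and official name, e.g. for "_", UNDERTIE, CHARACTER TIE.
    for ch in connector_punctuation():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The list is short, which is why treating all of Pc alike and treating only "_" specially are both practical alternatives.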

>> It is perfectly acceptable to say "If someone wants to share
>> Erlang code with people in other countries, they should use
>> characters that all those people recognise."  In the 21st
>> century it is no longer acceptable to say "nobody may use a
>> character unless I remember what it is."

> I said I want to be able to understand the semantics without
> knowing all characters. Is that a straw man attack?

You cannot even understand the lexical semantics without knowing
the characters.  The most primitive level of "understand(ing)
the semantics" I can imagine is being able to answer the question
"Is this sequence of characters legal or not?"

Consider this example: "र॰." (U+0930, U+0970, usual full stop.)
If you were trying to read that from a file, would it be a legal
term?

No.  The first character is a letter, but the second character is
classified as a punctuation mark.  I only know this because I was
constantly referring to the tables while constructing the example.
It will be instantly obvious, I imagine, to anyone familiar with
the Devanagari script.  For that matter, hawaiɁi is or ought to
be a perfectly good atom.  That glottal stop letter looked a lot
like a question mark, didn't it?  So it might not have _looked_
like an atom, but it would be one.
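The table lookup described above can be done mechanically. This is only a sketch using Python's standard `unicodedata` module (not anything from the EEP); `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(text):
    """Pair each character's code point with its Unicode general category."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


if __name__ == "__main__":
    # Devanagari letter RA, Devanagari abbreviation sign, full stop.
    print(describe("\u0930\u0970."))
    # The glottal stop letter in hawaiɁi next to a real question mark.
    print(describe("\u0241?"))
```

The first character reports a letter category and the second a punctuation category, which is the distinction the example turns on. The glottal stop reports a letter category even though it resembles "?".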

If someone gives you an Erlang file written entirely in ASCII,
but using the Klingon language, just how much would it help you
to know where the variables began?  (Google Translate offers
translation to Esperanto, why not Klingon?  I haven't opened my
copy of the how-to-learn-Klingon book in 20 years.  Sigh.)

>> The backwards compatibility issue is that
>> ªº are Lo characters and are not allowed to begin an Erlang atom.

> Would that be an issue? Since they are in Lo should we not start
> allowing them?

I wanted to preserve a somewhat stronger property than any I mentioned,
namely that
        "this is a legal Erlang text using Latin-1 characters
         under the old rules"
     if and only if
        "this is a legal Erlang text using Latin-1 characters
         under the new rules".

If anyone wants to propose allowing "ªº" at the beginning of an atom
in Latin-1 Erlang, fine.  Doesn't bother me.  But I wasn't about to
introduce _any_ incompatibility if I could avoid it.  In particular,
it seems like a nice thing for the transition period that if you have
an Erlang file that works in Unicode Erlang and happens to include
nothing outside Latin-1 (a trivial mechanical check) it should be
guaranteed to work in Latin-1 Erlang.
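The "trivial mechanical check" could look like the sketch below. It assumes the file has already been decoded to text with the correct encoding; the function name is illustrative only.

```python
def is_latin1_only(text):
    """True if every character is in the Latin-1 range U+0000..U+00FF."""
    return all(ord(ch) <= 0xFF for ch in text)


if __name__ == "__main__":
    print(is_latin1_only("X = caf\u00e9."))       # Latin-1 only
    print(is_latin1_only("Γαμμα = αλπηα."))       # contains Greek letters
```

A file that passes this check and compiles under the new rules is the kind of file the compatibility property is meant to cover.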

Oh FLAMING SWEARWORDS.  Erlang doesn't currently allow "ªº" anywhere
in an unquoted atom.  OK.  There are two reasonable alternatives:

Backwards compatible: do not allow "ªº" in identifiers.
UAX#31 compatible:    treat "ªº" just like any other Ll characters.

I never thought to check whether Erlang allowed "ªº" at the end of
an identifier because it _obviously_ would.  But it doesn't.  Sigh.

>> This should read

>>    atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")
>>                |  "." (Ll ∪ Lo)

> Ok. Now I get it. But should it not be the same set after a dot
> as at the start?

Consider
1> X = a.B.
* 1: syntax error before: B
1> X = a._2.
* 1: syntax error before: _2
1> X = a.3.
* 1: syntax error before: 3
1> X = a.b.
'a.b'

That tells us that currently, only Ll characters are allowed
after a dot in the continuation of an identifier.  That naturally
generalised to (Ll ∪ Lo).  So I made "what can follow a dot" the
same everywhere in an atom.  The mental model I had was to think
of dot-followed-by-Ll-or-Lo as a single extended character.

>>> I agree that moving a character from Lu or Lt to Other_Id_Start would
> increase the set of atom_start characters.

> For the characters "ªº" you above called that a backwards compatibility
> issue, which I doubt it is.

There is definitely a backwards compatibility issue (whether one can
safely move a new-rules file that is entirely in Latin-1 back to an
old-rules system).  Whether it is of any practical significance is
another matter.  What's also clear is that I haven't quite got there
yet.  One reason for revising the EEP again.

Concerning stability, I did send a message to the Unicode consortium.
I've had an informal response:

        An interesting question you raise, which I will pass along
        to some people here.  I think the short answer is that you
        can tailor these things to particular environments, and you
        may not be able to rely on any given standard property for
        special purposes.  Especially if that property is not
        formally stable.  But I'll see what others say.

There are sufficiently many programming languages that depend on
initial alphabetic case that we may be looking at a revision of
UAX#31.  Wouldn't that be fun‽  (Groan.)

Remaining points skipped for now.

_______________________________________________
erlang-questions mailing list
erlang-questi...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions


 
Richard O'Keefe  
From: "Richard O'Keefe" <o...@cs.otago.ac.nz>
Date: Fri, 2 Nov 2012 12:11:05 +1300
Local: Thurs, Nov 1 2012 7:11 pm
Subject: Re: [erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

On 2/11/2012, at 10:37 AM, Dmitry Belyaev wrote:

I did not say "no, never do it".
I said "We have to handle Unicode variables and atoms FIRST".

Step 1: recognise and distinguish between variables and atoms-or-keywords.

THAT is what EEP 40 is about.

Step 2: decide which atoms-or-keywords are atoms and which are what keywords.

If you want keywords in Hebrew or Malayalam or whatever, you have to do
step 1 first.

For that matter, if you are willing to begin keywords with a special
character (as Algol and IMP programmers had to), you can just

        -include('keywords/fr').
        ... ?essayez
                ...
            ?attrapez
                ...
            ?fin

right now.  (%external %integer %fn %spec ring any bells with my readers?)

To repeat: I am NOT saying NO.
I am saying, let's get EEP 40 through *FIRST*.
Then you will be able to use ?slučaj (Croatian for 'case') or
whatever takes your fancy with _no_ extra support from the
Erlang/OTP maintainers right away.  You get _that_ much ability
to use localised keywords *sooner* than if you put that into EEP 40.

> What about your Māori students? Will you tell them they may write some parts of the program in their language and some other words they have to write in English?

No, I'll tell them about the macro trick.

>> (3) Ada and Python have not done this.

> I don't think that pointing to other bad choices is good.

Considering the huge amount of design work that has gone into
Ada revision -- I once printed out a whole bunch of revision
documents and stopped when I had a pile 60 cm high and still
had a long way to go -- it's not clear how bad a choice
it is.  As with EEP 40, it's not "no never" to localised
keywords, but "this _first_".

There are, after all, such things as preprocessors,
and at least keywords are not something you have to name in
a debugger in order to trace them or put breakpoints on them,
so unlike other identifier mapping, keyword localisation via
preprocessor actually works.

> Yes, I'd choose "All or Nothing" option for all this proposal.

EEP 40 is *ORTHOGONAL* to localised keywords.
You could have localised (in Latin-1 only) keywords without EEP 40.
You could have EEP 40 without localised keywords.
You can have both.
You can, as I have already said, have EEP 40 AS A STEP TOWARDS
localised keywords.

Here's how it goes:

        - first one supports alternative encodings,
          but still accepts only Latin-1 characters.

        - next one supports non-Latin-1 characters in comments.

        - next one supports non-Latin-1 characters in strings.

        - next one supports non-Latin-1 characters in identifiers.

        - next one supports non-Latin-1 characters in numbers.

        - and at any point along the route one can consider
          localised keywords.

Richard O'Keefe  
From: "Richard O'Keefe" <o...@cs.otago.ac.nz>
Date: Fri, 2 Nov 2012 17:00:19 +1300
Local: Fri, Nov 2 2012 12:00 am
Subject: Re: [erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.
A Unicode expert has suggested not allowing all of Pc at the
beginning of a variable but just the ASCII and FULLWIDTH
versions of "_".  It's not yet clear to me what should be
done in the body of an identifier; allowing precisely these
characters instead of all of Pc is enough for us to begin
with, and we can add the other Pc characters later.  Expect
yet another revision next week.
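A rough sketch of what that restricted rule might look like, written in Python purely for illustration. It assumes that a variable may start with an Lu or Lt letter or with one of the two underscores named above; it deliberately ignores Other_ID_Start and the XID_Start intersection in the EEP's BNF, and the names are hypothetical.

```python
import unicodedata

# ASCII LOW LINE and FULLWIDTH LOW LINE, the only Pc characters accepted here.
UNDERSCORES = {"_", "\uFF3F"}


def may_start_variable(ch):
    """Approximate test: underscore variant, or an uppercase/titlecase letter."""
    return ch in UNDERSCORES or unicodedata.category(ch) in ("Lu", "Lt")


if __name__ == "__main__":
    for ch in ["X", "_", "\uFF3F", "\u203F", "x", "變"]:
        print(f"U+{ord(ch):04X}", may_start_variable(ch))
```

Under this sketch ‿變量 would no longer be a variable, while _變量 and ＿變量 still would be.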

Raimo Niskanen  
From: Raimo Niskanen <raimo+erlang-questi...@erix.ericsson.se>
Date: Fri, 2 Nov 2012 10:35:39 +0100
Local: Fri, Nov 2 2012 5:35 am
Subject: Re: [erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

On Fri, Nov 02, 2012 at 11:41:46AM +1300, Richard O'Keefe wrote:
> I'm not going to answer every point, because I'm supposed to be marking exams.
> That doesn't mean they aren't good points.

Looking forward to later then...

> Next revision of the EEP:

It is now updated and published.

Formally, EEP updates should go to e...@erlang.org, according to
http://www.erlang.org/eep.html. I slipped on procedure by not
mailing to that list when accepting this EEP, but that will improve...


:
: :

Ok. I misread it as if there was only one suggestion, namely to
treat all Pc characters alike. I think it is still somewhat unclear
that treating only "_" specially _is_ an alternative in the EEP.

Also I do not clearly see what problem is solved for someone using
fonts with say Arabic letters but not say the underline, by revising
the underscore rule. Bear with me. I have never used another keyboard
than Swedish or English. Is it so that when using such a font there
is no Pc character available except for the "_" (and why is that
available?) so there must be a possibility to express both non-singleton
and maybe-singleton variables using just the "_"?

:

I have realized that. I wanted a lesser degree of understanding the
lexical semantics: If it passes the compiler (which that example
does not) I would like to be able to see which identifiers are
variables and which are atoms.

Also, e.g. someone writing a syntax highlighter for Vim would, I guess,
appreciate a simple rule for how to recognize a variable.

> If someone gives you an Erlang file written entirely in ASCII,
> but using the Klingon language, just how much would it help you
> to know where the variables began?  (Google Translate offers
> translation to Esperanto, why not Klingon?  I haven't opened my
> copy of the how-to-learn-Klingon book in 20 years.  Sigh.)

It would not help much, I agree. But if for example I get a bug report
about the compiler or runtime system not doing right for a few lines
of Klingon Erlang, it would be helpful to easily distinguish variables
from atoms.

Ok. Good point. That sounds maybe essential. And now that goal is in the
latest version of the EEP. Very good.

:
: ::

Yes. And currently only Ll characters are allowed at the start
of an atom. So currently the same set is allowed at the start
as after a ".".

Your current suggestion allows a.ª as an unquoted atom since the character
after the dot is in Lo, but it is not allowed in Erlang today.

It also allows ᛮᛯᛰ as an atom but not ᛮᛯᛰ.ᛮᛯᛰ since these characters
are in Nl (Letter_Number), which is part of XID_Start.

So I think the mental model should be that after a dot there
should be as if a new atom was starting.

:

> Concerning stability, I did send a message to the Unicode consortium.
> I've had an informal response:

>    An interesting question you raise, which I will pass along
>    to some people here.  I think the short answer is that you
>    can tailor these things to particular environments, and you
>    may not be able to rely on any given standard property for
>    special purposes.  Especially if that property is not
>    formally stable.  But I'll see what others say.

> There are sufficiently many programming languages that depend on
> initial alphabetic case that we may be looking at a revision of
> UAX#31.  Wouldn't that be fun‽  (Groan.)

I think we need an XID_Start_Uppercase and XID_Start_Lowercase,
containing Other_ID_Start_Uppercase and Other_ID_Start_Lowercase.

> Remaining points skipped for now.

I especially anticipate a reply about what happens if a character
moves from Ll or Lo to Other_ID_Start...

Good luck with the exams!

--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB
Raimo Niskanen  
From: Raimo Niskanen <raimo+erlang-questi...@erix.ericsson.se>
Date: Fri, 2 Nov 2012 11:26:18 +0100
Local: Fri, Nov 2 2012 6:26 am
Subject: Re: [erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.
On Fri, Nov 02, 2012 at 11:41:46AM +1300, Richard O'Keefe wrote:

:

> Next revision of the EEP:
> [-- Attachment #2: eep-0040.md --]
> [-- Type: application/octet-stream, Encoding: quoted-printable, Size: 16K --]

I think the EEP should elaborate on normalization. It seems to me that
prescribing NFC would be natural since a file consisting of Latin-1
characters is already in NFC (Normalization Form C, Composed).

On the other hand, that would make the atom ﬁ⁵ different from the
atom fi5, whereas using NFKC (Normalization Form KC, Compatibility
Composed) would make them equal. I do not know. That ﬁ =:= fi may be
good but that i⁵ =:= i5 may not be good. Anyway normalizing
these character sequences in comments or strings is _not_
desirable. If NFKC would be an option it could only be that
for atoms and variables.
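To make the difference concrete, here is a small sketch using Python's standard `unicodedata.normalize`. The test string is the ligature ﬁ (U+FB01) followed by superscript five (U+2075); the helper names are made up for illustration.

```python
import unicodedata


def forms(text):
    """Return the NFC and NFKC normalizations of text."""
    return {form: unicodedata.normalize(form, text) for form in ("NFC", "NFKC")}


def latin1_is_nfc_stable():
    """Check that NFC leaves every Latin-1 character unchanged."""
    return all(unicodedata.normalize("NFC", chr(cp)) == chr(cp)
               for cp in range(0x100))


if __name__ == "__main__":
    print(forms("\ufb01\u2075"))      # NFC keeps the ligature, NFKC folds it
    print(latin1_is_nfc_stable())
    # NFKC is not neutral on Latin-1: it folds the ordinal indicators.
    print(unicodedata.normalize("NFKC", "\u00aa\u00ba"))
```

Note that NFKC also maps "ªº" to "ao", so choosing NFKC for identifiers would interact with the "ªº" compatibility question raised earlier in the thread.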

--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB
Richard O'Keefe  
From: "Richard O'Keefe" <o...@cs.otago.ac.nz>
Date: Mon, 5 Nov 2012 10:24:30 +1300
Local: Sun, Nov 4 2012 4:24 pm
Subject: Re: [erlang-questions] EEP 40 - A proposal for Unicode variable and atom names in Erlang.

On 2/11/2012, at 10:35 PM, Raimo Niskanen wrote:

> Also I do not clearly see what problem is solved for someone using
> fonts with say Arabic letters but not say the underline, by revising
> the underscore rule. Bear with me. I have never used another keyboard
> than Swedish or English. Is it so that when using such a font there
> is no Pc character available except for the "_" (and why is that
> available?) so there must be a possibility to express both non-singleton
> and maybe-singleton variables using just the "_"?

I have only tried the Macintosh interface, where three virtual
keyboards are available: "Arabic", "Arabic - PC", and "Arabic - QWERTY".
All of them have the underline.  ISO 8859-6 (the ISO 8-bit
character set for Arabic) includes all of ASCII.  However, I am not an
expert.

> I have realized that. I wanted a lesser degree of understanding the
> lexical semantics: If it passes the compiler (which that example
> does not) I would like to be able to see which identifiers are
> variables and which are atoms.

> Also, e.g. someone writing a syntax highlighter for Vim would, I guess,
> appreciate a simple rule for how to recognize a variable.

Well, the EEP gives them _that_.  If Vim can highlight Ada and Python
and Java correctly, what's the problem?  Copy the regular expressions
it uses for Java and tinker with them.

>> If someone gives you an Erlang file written entirely in ASCII,
>> but using the Klingon language, just how much would it help you
>> to know where the variables began?  (Google Translate offers
>> translation to Esperanto, why not Klingon?  I haven't opened my
>> copy of the how-to-learn-Klingon book in 20 years.  Sigh.)

> It would not help much, I agree. But if for example I get a bug report
> about the compiler or runtime system not doing right for a few lines
> of Klingon Erlang, it would be helpful to easily distinguish variables
> from atoms.

You don't have to do it by eye.  You can use a tool (like the Vim
syntax colourer you mention above).

Oh DRAT!

> It also allows ᛮᛯᛰ as an atom but not ᛮᛯᛰ.ᛮᛯᛰ since these characters
> are in Nl (Letter_Number), which is part of XID_Start.

Frankly that one doesn't bother me in the least.

> So I think the mental model should be that after a dot there
> should be as if a new atom was starting.

However, since I've got to fix the a.ª bug, I may as well adopt
your suggestion.  The grammar now reads

        unquoted_atom ::= "."? atom_start atom_continue*

        atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")

        atom_continue ::= XID_Continue ∪ "@" \ "ªº"
                       |  "." atom_start
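For anyone who wants to experiment with this grammar, here is a rough Python sketch of a recogniser. It rests on several assumptions that are not in the EEP: XID_Start and XID_Continue are approximated with `str.isidentifier()` (which additionally accepts "_" as a start character, hence the explicit exclusion), the atom_continue rule is read as (XID_Continue ∪ "@") \ "ªº", and all function names are hypothetical. Results depend on the Unicode version shipped with the Python release.

```python
import unicodedata

EXCLUDED = "\u00aa\u00ba"  # the ordinal indicators "ªº"


def is_xid_start(ch):
    # str.isidentifier() accepts XID_Start or "_" for a single character.
    return ch != "_" and ch.isidentifier()


def is_xid_continue(ch):
    # Prefix a known start character so only the continue rule is tested.
    return ("a" + ch).isidentifier()


def is_atom_start(ch):
    # atom_start ::= XID_Start \ (Lu ∪ Lt ∪ "ªº")
    return (is_xid_start(ch)
            and unicodedata.category(ch) not in ("Lu", "Lt")
            and ch not in EXCLUDED)


def is_unquoted_atom(text):
    # unquoted_atom ::= "."? atom_start atom_continue*
    i = 1 if text.startswith(".") else 0
    if i >= len(text) or not is_atom_start(text[i]):
        return False
    i += 1
    while i < len(text):
        ch = text[i]
        if ch == ".":
            # atom_continue alternative: "." atom_start
            if i + 1 >= len(text) or not is_atom_start(text[i + 1]):
                return False
            i += 2
        elif ch == "@" or (is_xid_continue(ch) and ch not in EXCLUDED):
            i += 1
        else:
            return False
    return True


if __name__ == "__main__":
    for s in ["a.b", "a.B", "a._2", "a.3", "a.\u00aa", "ᛮᛯᛰ.ᛮᛯᛰ", "hawaiɁi"]:
        print(s, is_unquoted_atom(s))
```

With this reading, a.b is accepted while a.B, a._2, a.3 and a.ª are rejected, matching the shell session quoted earlier, and ᛮᛯᛰ.ᛮᛯᛰ becomes acceptable because the part after the dot is treated as the start of a new atom.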
