While I can understand the frustration that some people seem to fear, I can't help but think that there are solutions to problems that seem to be ignored. For example, if the language has full Unicode support and can tell the difference between characters which look the same to us but are obviously not the same to the machine... how hard would it be to create a parser for a source file, or input, which literally does nothing more than turn the code into strings of the Unicode values? Don't want the entire file? How about just the functions, listed in order? How about just one function matching the input just passed in? There just seem to be so many ways that a Unicode-capable system could compensate for our human failings.
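To make the "turn the code into Unicode values" idea concrete, here is a minimal sketch in Python (not Erlang, and not part of any proposal). The function name `describe_non_ascii` and the sample source are made up for illustration; it only reports characters outside ASCII, which is one possible reading of the idea.

```python
import unicodedata

def describe_non_ascii(source: str):
    """Yield (line, column, char, code point, name) for each non-ASCII character."""
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                yield (line_no, col, ch, f"U+{ord(ch):04X}",
                       unicodedata.name(ch, "<unnamed>"))

# Hypothetical two-line source: the second "Xa" uses a Cyrillic 'a' (U+0430).
sample = "Xa = 1,\nX\u0430 = 2."
report = list(describe_non_ascii(sample))
for entry in report:
    print(entry)
```

Restricting the report to a single function would only need the caller to pass in that function's text rather than the whole file.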
Now, you can absolutely tell me that implementing a system that is aware and capable in that way is a royal pain in the ass, and I would have zero experience with Unicode to counter your argument. But if you could have your VM default to accepting only a specific encoding and, when compiling or running a module, throw warnings and either stop or continue depending on your needs, then you could, with virtually no effort, make sure that your code only interacts with other code that you expect.
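As a rough illustration of "accept only a specific encoding, then stop or warn", here is a small Python sketch. It does not describe any real Erlang VM or compiler option; `load_source`, `expected` and `strict` are hypothetical names chosen for the example.

```python
import warnings

def load_source(raw: bytes, expected: str = "ascii", strict: bool = True) -> str:
    """Decode module bytes, refusing (or warning about) anything outside `expected`."""
    try:
        return raw.decode(expected)
    except UnicodeDecodeError as err:
        if strict:
            # "Stop" mode: refuse to load the module at all.
            raise ValueError(f"module is not valid {expected}: {err}") from err
        # "Continue" mode: warn, then fall back to UTF-8 so the caller can proceed.
        warnings.warn(f"module is not valid {expected}; decoding as UTF-8 instead")
        return raw.decode("utf-8", errors="replace")
```

In strict mode a module containing a Cyrillic letter is rejected outright; with `strict=False` it loads, but a warning is emitted.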
That one example is hardly a complete system but the point is this. If expanding the language respects the cultural wishes of peoples and laws of other countries without detracting from the experience of those people who don't want to be bothered with it then don't we at least need to give it a serious look? If it's technically impossible without seriously compromising the integrity of the language then sure, don't do it, but even if it's difficult but can be done without degrading the experience of others...why not?
Chris.
> From: o...@cs.otago.ac.nz
> Date: Fri, 26 Oct 2012 18:43:51 +1300
> To: hd2...@eonblast.com
> CC: erlang-questi...@erlang.org
> Subject: Re: [erlang-questions] A proposal for Unicode variable and atom names in Erlang.
> On 26/10/2012, at 4:28 PM, Henning Diedrich wrote:
> > As a third (true) horror I'll add Ulf's pseudo-whitespace experience to the list. I am in agony already over the days lost in the future due to someone inserting a Unicode look-alike into code that I cannot spot until I re-type the entire seemingly cursed code that-should-work-but-magically-doesn't. And have hex-view ready at my finger tips again to inspect awkward code. Thanks so much for the nightmare.
> But (a) THERE IS NO PROPOSAL TO ALLOW EXTRA KINDS OF SPACE IN ERLANG!
> And (b) the problem is not with there _being_ extra kinds of space
> character, but with their not being _treated as_ space characters.
> This is why _partial_ support for Unicode is a bad thing.
> > As an aside, I think I still don't believe what I understood there though: that a programming language could be banned on grounds of political incorrectness?
> Not that it can be *banned*,
> but that it cannot be *required* for any assessed work.
> Nobody says I can't use whatever I like.
> But there may very well be limits on what I can ask students to use.
> I would rather let that sleeping dog lie
> while the potential problem goes away.
> > Is it possible that those rules are wrong and banning a programming language for being, what, culturally biased, is over the top?
> > I still hope i read that wrong.
> Respect for the principles of Te Tiriti o Waitangi is part of
> the law of this country. Article the second reads (in a back
> translation of the Maori text into English):
> The Queen of England agrees to protect the Chiefs,
> the subtribes and all the people of New Zealand in
> the unqualified exercise of their chieftainship over
> their lands, villages and all their treasures
> and Te Reo is these days regarded as a taonga of the tangata whenua
> (a treasure of the [native] people of the land). [Every word in
> this paragraph counts as 'English' here...]
> I'm sufficiently distressed by the continuing replacement of
> New Zealand English by American that I have strong sympathy with
> people wanting to keep Māori alive and functioning in all modern
> contexts, so I _want_ to let students use Māori.
> But in addition to that, the University has a clear policy.
> Note in particular
> Principle 1
> In recognition of the status of te reo Māori as a taonga
> protected under the Treaty of Waitangi, and within the
> spirit of the Māori Language Act 1987, the University of
> Otago will endorse the right of students and staff to
> use te reo Māori, including for assessment.
> ^^^^^^^^^^^^^^^^^^^^^^^^
> I really don't want to ask for an official decision lest the answer
> be "no".
> I would expect any country with one or more minority languages
> whose speakers got a sufficient degree of legal protection to
> have similar policies.
On Fri, Oct 26, 2012 at 5:32 AM, Richard O'Keefe <o...@cs.otago.ac.nz> wrote:
> On 22/10/2012, at 11:45 PM, Yurii Rashkovskii wrote:
>> Also, consider this: there are characters that look the same but encoded differently.
> You did read the part of the proposal that said to normalise?
I think Yurii meant cases like latin 'a' and cyrillic 'а', for example.
They are certainly a problem. But mixing scripts in this way can only be done maliciously. In normal cases it is clear from context which script is being used.
Also, we already have a similar problem without leaving ASCII: in many fonts, I and l are indistinguishable...
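A tool could flag tokens that mix scripts, such as a Latin word with one Cyrillic letter slipped in. The Python sketch below is only a heuristic: Python's standard library does not expose the Unicode Script property, so it guesses the script from the first word of each character's name. The function names are invented for the example.

```python
import unicodedata

def scripts_in(token: str) -> set:
    """Rough script guess per letter: the first word of its Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in token if ch.isalpha()}

def is_mixed_script(token: str) -> bool:
    """True if the letters of the token appear to come from more than one script."""
    return len(scripts_in(token)) > 1

print(scripts_in("data"))             # all Latin letters
print(scripts_in("d\u0430ta"))        # second letter is Cyrillic U+0430
print(is_mixed_script("d\u0430ta"))
```

A token written entirely in Cyrillic or entirely in Latin passes; only the mixture is reported.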
On 22/10/2012, at 6:08 PM, Yurii Rashkovskii wrote:
> Richard,
> Please excuse my ignorance, but can you name a single good reason for non-latin atoms and variable names?
There are literally billions of people on this planet
who are most comfortable reading and writing non-Latin scripts,
and many of them write programs.
> From my personal point of view, this is a sure road to hell.
Once Erlang accepted non-ASCII characters, it had already gone
at least half way down that road. I note that IBM mainframe
programming languages like Fortran and PL/I supported DBCS
(double-byte character set) characters decades ago, and somehow,
hell completely failed to materialise.
> 1.
> Python made a choice to embrace unicode more thoroughly in going from python 2 to python 3. This seems to have caused some grief in that 'ASCII' code that used to work in python 2 now often does not in python 3. Maybe this has nothing to do with Richard's EEP because that is about the string data structure this is about variable names. Still just mentioning.
Can you be more specific? Each ASCII character has the same numeric value in Unicode, and an ASCII string represented as UTF-8 is exactly the same sequence of bytes. I can't help wondering if "ASCII" here really means some 8-bit character set rather than ASCII.
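The claim is easy to check. A small Python illustration (the sample strings are arbitrary): pure ASCII text encodes to identical bytes in ASCII and UTF-8, while a Latin-1 character such as "é" does not survive the move to UTF-8 byte-for-byte.

```python
text = "length(List)"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Same byte sequence, and each byte equals the character's Unicode code point.
same_bytes = ascii_bytes == utf8_bytes
same_values = all(ord(ch) == b for ch, b in zip(text, utf8_bytes))

# An 8-bit (Latin-1) character is where the encodings diverge.
latin1_e = "\u00e9".encode("latin-1")   # one byte
utf8_e = "\u00e9".encode("utf-8")       # two bytes

print(same_bytes, same_values, latin1_e, utf8_e)
```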
> In all fairness (for Yurii's points) I should mention:
> 1. I was typing this on a windows box and could not see the characters until I switched to linux
> 2. Our computers may become completely, effortlessly unicode-capable someday, our keyboards will never. So to the extent that code is meant to be written, ASCII will always trump. To the extent that it is to be read, a richer (within limits) character set has its attractions.
You are assuming that everyone who is using a keyboard is using a US keyboard. That's not true. For example, on a visit to Sweden, I was allowed to use my host's computer to read my mail remotely, and my fingers kept tripping up because it was a Swedish keyboard with lots of non-ASCII characters. Heck, my wife has an iPad, and I have one on loan from the department, and both of them have Greek keyboards installed, making it pretty much effortless to type Greek, which I assure you is NOT ASCII. It's just a matter of touching the globe symbol and flicking over to the other keyboard. This is old technology. The Xerox D-machines had fast-switch virtual keyboards back in the 1980s. It takes two mouse movements to switch from a US keyboard to a Greek one on my desktop Mac (or to a Hebrew one or a Russian one or ...).
RIGHT NOW, our keyboards ARE completely, effortlessly non-Latin-1 capable.
Nobody is suggesting that any one programmer will want to use all 100,000+ Unicode characters in the same document. What is suggested is that some programmers, who can effortlessly type Russian on their Russian keyboard or Gujarati on the Gujarati keyboard -- both of which Windows supports -- and see that on their screen, should be able to do so.
I cannot for the life of me understand why, at this late date, anyone should for an instant suppose that only ASCII can be easily typed.
As it happens, for my national needs, the Mac _does_ have Māori keyboard support. It's two mouse movements to switch from US keyboard to Māori one, and then getting a vowel with a macron is just a matter of pressing the Option key while typing the vowel. A Māori student would have little reason ever to switch over to the US keyboard. I can certainly type words like kurī and kīrehe and Ākarana without taking my fingers from the keyboard.
The idea that "ASCII will always trump" on account of being easier to type deserves some kind of award for wrongness.
> What is the problem about unicode variables is that some characters
> are not equal: Х != X, but they look the same.
This would be a persuasive argument IF (a) we did not already allow both XO and X0, Xl and X1, and so on; (b) mixed scripts in a single token were plausible. Neither is the case.
> Other problem about unicode is that a lot of algorithms are
> locale-based and difficult (a lot of rules and exceptions).
None of those algorithms applies to the current topic, except for normalisation, which is not locale-based.
> Even non-locale based (unified and simple version of to_lower) contains this:
> - Contains additional case mappings that map to more than one
> character, such as "ß" to "SS".
That already applies to Latin-1, which Erlang supports RIGHT NOW. (Nit-pick: that's an example of to_upper.)
> - Characters may have case mappings that depend on the locale.
> For example, in Turkish the letter U+0049 "I" capital letter i
> lowercases to U+0131 "ı" small dotless i.
Indeed. But since neither variable names nor unquoted atoms are subjected to any kind of case mapping by the Erlang parser, how is that relevant _here_?
You're mainly talking about problems with Unicode *data*, and we don't have any option about dealing with those.
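For readers who want to see the two case-mapping quirks being discussed, here they are in Python, whose `str.upper` and `str.lower` apply the default, locale-independent Unicode mappings. This is data-level behaviour; as noted above, a parser that never case-maps identifiers is not affected by it.

```python
# One character upper-cases to two: already true for Latin-1's sharp s.
upper_eszett = "\u00df".upper()          # "ß" -> "SS"

# The default (non-Turkish) mapping of capital I is the dotted small i.
lower_i = "I".lower()                    # "i", not dotless U+0131

# Identifiers are compared code point by code point, with no case folding,
# so dotless i and dotted i are simply two different letters.
distinct = "\u0131" != "i"

print(upper_eszett, lower_i, distinct)
```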
> There is no such thing as "language" in Unicode.
Actually, that's not quite true. Unicode *does* include so-called
"language tags", so it is perfectly possible to mark up sections
of text with the language they are supposed to be in, all in straight
Unicode.
> "language" is a locale.
No, locale is more specific than that. A locale is a script, a language,
a set of cultural conventions for writing numbers and money and dates,
and so on. An "English" phone book and an "English" dictionary would
use different locales, because they use different rules for sorting.
> Locale-based algorithms are difficult and each
> character can have different meaning for each locale.
Locale-based algorithms are difficult, true.
Give one example of a character that has a different meaning
in two locales. OK, characters stand for different *sounds*
in different languages, but there is no case I can find in Unicode
where the class a character belongs to depends on the locale.
> There are a lot of cases, when I even cannot say which case a variable
> is in.
Tell me just ONE. Hint: there aren't _any_ such cases.
Each defined Unicode character has one and only one class, and that
class is not in any way locale- or context-dependent.
> How I will detect is it a variable or an atom?
The proposal you are claiming to comment on gives a precise,
unambiguous, and natural way to do so, which is consistent with
other programming languages making a case distinction.
> Here is an example:
> I want to write a module in Turkish, then the "length" id will be a
> variable, not a function.
What on earth are you talking about? Lower case l is a lower case
letter, whether you're writing English, Turkish, or Old High Martian.
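The point that a character's class does not depend on locale can be illustrated with the Unicode General Category, which Python exposes as `unicodedata.category`. The `classify` function below is a deliberately simplified stand-in for the variable/atom distinction (leading uppercase, titlecase or underscore means variable, leading lowercase means atom); it is an assumption made for this sketch, not the exact rule in the proposal.

```python
import unicodedata

def classify(name: str) -> str:
    """Simplified variable/atom split based only on the first character's class."""
    first = name[0]
    category = unicodedata.category(first)
    if first == "_" or category in ("Lu", "Lt"):
        return "variable"
    if category == "Ll":
        return "atom"
    return "other"

# 'l' is lowercase whatever language the module is written in.
print(unicodedata.category("l"))          # Ll
print(classify("length"))                 # atom
# Turkish letters have fixed classes too: U+0130 is Lu, U+0131 is Ll.
print(classify("\u0130sim"), classify("\u0131s\u0131"))
```

No locale is consulted anywhere: the category lookup is a fixed table in the Unicode Character Database.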
> Using code, written in few languages will be a hell.
WE ALREADY HAVE THAT POSSIBILITY RIGHT NOW.
You could, *right now*, have a module containing words from a
couple of dozen languages. Imagine a mixture of English,
Swedish, Irish, Klingon, and Latino Sine Flexione.
Guess what? IT DOESN'T HAPPEN!
At most we get a mixture of English and one other language.
> On Mon, Oct 29, 2012 at 9:11 PM, Richard O'Keefe <o...@cs.otago.ac.nz> wrote:
>> On 22/10/2012, at 7:44 PM, Rustom Mody wrote:
>> > 1.
>> > Python made a choice to embrace unicode more thoroughly in going from
>> python 2 to python 3. This seems to have caused some grief in that 'ASCII'
>> code that used to work in python 2 now often does not in python 3. Maybe
>> this has nothing to do with Richard's EEP because that is about the string
>> data structure this is about variable names. Still just mentioning.
>> Can you be more specific? Each ASCII character has the same numeric value
>> in Unicode, and an ASCII string represented as UTF-8 is exactly the same
>> sequence of bytes. I can't help wondering if "ASCII" here really means
>> some 8-bit character set rather than ASCII.
> I'm an erlang-lurker, but long time Python user.
> The issues with Python 3 and "unicode vs ascii" have absolutely nothing to
> do with encoding and really, no impact at all on this discussion. Python
> 2.x had a "string" type and a "unicode" type, but the former was used both
> as a binary data type, and as a text data type. In Python 3, they have
> decided to make a firm distinction between 'binary data' and 'textual
> data', and this change in the fundamental nature of types (and what 'str'
> means) has led to some difficulties.
Can these problems be addressed? Of course.
Are they directly related to this EEP? Probably not...
I was just mentioning them so that Erlang can learn from python's mistakes.
Basically python has chosen a 'flexible string representation'
(http://www.python.org/dev/peps/pep-0393/), which does the magic of using only 1 byte for ascii, 2 for bmp and 4 for
the rest (Unicode 2.0 onwards).
In the process however (of detecting the optimal char-width) some inner
loops seem to have got less efficient (my guess; don't know for sure).
So python has traded time for space.
A command-line option to choose string-engine at start time could solve
this problem.
[Though in a world where one erlang node talking to another is a very
normal usecase, this could cause its own challenges]
Also 32 bits for 'wide' unicode is wasteful, given that the number of
unicode codepoints is 1114112.
1114112 = 17*2^16 < 32*2^16 = 2^21 < 2^24 < 2^32
IOW an acceptable width could be 3 bytes and at 21 bits one could even pack
3 chars into 64 bits
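The arithmetic can be checked directly, and the "3 chars into 64 bits" idea sketched, in a few lines of Python. `pack3` and `unpack3` are hypothetical helpers written for this example; nothing here describes how Python or Erlang actually stores strings.

```python
MAX_CODE_POINT = 0x10FFFF
CODE_POINT_COUNT = MAX_CODE_POINT + 1      # 1114112 = 17 * 2**16
BITS = MAX_CODE_POINT.bit_length()         # 21 bits are enough
MASK = (1 << BITS) - 1

def pack3(a: int, b: int, c: int) -> int:
    """Pack three code points into one integer of at most 63 bits."""
    for cp in (a, b, c):
        if not 0 <= cp <= MAX_CODE_POINT:
            raise ValueError(f"not a Unicode code point: {cp}")
    return (a << (2 * BITS)) | (b << BITS) | c

def unpack3(word: int) -> tuple:
    """Inverse of pack3."""
    return ((word >> (2 * BITS)) & MASK, (word >> BITS) & MASK, word & MASK)

word = pack3(ord("k"), ord("\u012b"), 0x1F600)
print(hex(word), unpack3(word))
```

Three 21-bit fields use 63 bits, leaving one bit of a 64-bit word spare.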
> Can these problems be addressed? Of course.
> Are they directly related to this EEP? Probably not...
Certainly not. EEP 40 is about the *lexical structure* of
Erlang *variables* and *unquoted atoms*. This is something
that happens at compile time. The speed of the Erlang *compiler*
will almost certainly be affected by the adoption of Unicode.
The speed of the run-time will be affected to the extent that
atom_to_list and list_to_atom will need changing to allow
Unicode characters in atom names, but that's going to happen
anyway; all EEP 40 has to say about that is which ones don't
need quotation marks.
On Monday, October 22, 2012 5:36:20 AM UTC-5, Loïc Hoguin wrote:
> On 10/22/2012 08:00 AM, Anthony Ramine wrote:
> > Also non-latin users who don't know English should be able to use
> > atoms and variables they understand.
That's because the last two letters were swapped. There's nothing here to do with Turkish. (For that matter, while Turkish has an extra dotted capital I İ and an extra dotless small i ı, it uses the same dotless capital I that we do, it's just the capital of a dotless small i.)
To quote a Pogo strip, "you have the wrong mistake".
I am reminded of a burglar indignantly protesting his innocence: "I didn't rob *THAT* house" (but don't ask me about the one next door).
We *already* have confusable characters in Latin-1: i/l/1, o/O/0 -- I'm seeing a slashed zero here and very much wish I weren't because that's not how I was taught to write a zero -- 2/Z, s/5, and if you had to read the handwriting I'm reading during marking, you'd wonder if there were _any_ two characters that couldn't be confused. (There was a time when Australian school-children were _taught_ to write unclosed small "p" letters so they looked like long-tailed "r". Why?) So our burglar is saying "I don't have THOSE [Unicode] confusable characters" (just don't ask me about all the others I do have).
If you are talking about the confusability of characters, you could bring in CAPITAL A WITH RING ABOVE and ANGSTROM SIGN, or for that matter the already mentioned Latin capital A, Cyrillic capital A, and Greek capital alpha, all of which look exactly the same.
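These two kinds of look-alike behave differently under normalisation, which a short Python check makes visible. ANGSTROM SIGN is canonically equivalent to CAPITAL A WITH RING ABOVE, so NFC folds them together; Latin A, Cyrillic A and Greek Alpha are distinct letters and stay distinct.

```python
import unicodedata

angstrom = "\u212b"      # ANGSTROM SIGN
a_ring = "\u00c5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"   # 'A' followed by COMBINING RING ABOVE

# Normalisation removes this family of look-alikes.
folded = {unicodedata.normalize("NFC", s) for s in (angstrom, a_ring, decomposed)}

# It does nothing for cross-script look-alikes.
latin_a, cyrillic_a, greek_alpha = "A", "\u0410", "\u0391"
still_distinct = {unicodedata.normalize("NFC", s)
                  for s in (latin_a, cyrillic_a, greek_alpha)}

print(folded, still_distinct)
```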
If we once allow any kind of vaguely stringly-like thing to include Unicode characters, we are *going* to have the problem of confusable letters in data. You could restrict identifiers to be sequences of a/A characters and we'd still have the problem in data.
Of all places, the very topmost *safest* place to have the problem is in Erlang variable names, because of the singleton style check. The next safest is probably in function names. These are places where the compiler will _tell_ us if things do not match up.
Suppose someone writes
Ο_Φόβος = ο_φονιάς(του_μυαλού)
Yes, the Ο and ο will look like an O and an o, so someone _could_ trick you. But they won't be TRYING to. And if they _do_ type too much with the wrong keyboard set (as I did while typing this!) the compiler will tell them.
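To show why a singleton check catches this kind of slip, here is a toy imitation in Python. It is not how the Erlang compiler works; `singleton_variables` is a made-up helper that treats any word starting with an uppercase letter as a variable and reports those that occur only once in a clause. The sample clause is hypothetical, with a Greek capital omicron (U+039F) typed in place of the Latin O in the first `Order`.

```python
import re
from collections import Counter

def singleton_variables(clause: str) -> list:
    """Report variable-like names (leading uppercase letter) used exactly once."""
    tokens = re.findall(r"\w+", clause)
    variables = [t for t in tokens if t[0].isupper()]
    counts = Counter(variables)
    return sorted(name for name, n in counts.items() if n == 1)

# The two names look identical on screen but differ in their first code point.
clause = "area(\u039frder) -> Order * 2."
print(singleton_variables(clause))
```

Both spellings are reported as singletons, which is exactly the signal that something does not match up; written consistently, the clause produces no report.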
And all this cowering in fear at the very time that we're seeing more and more type checking in Erlang, checking that would quite certainly catch such mistakes very well. Makes you wonder about people, really it does.
[The example is as close as I could get to 'Fear is the mind-killer'.]
It just seemed to be a short way to encapsulate the issues I see with the issue. There's no doubt that ROK's posts were far more detailed but a casual reader may miss the point. I have no doubt that if a coder can write in their native language then they would choose to do that more times than not. There's also no real reason that module or function names should not be "unicoded"... so the intent of the entire source could be natural-language encoded and balkanize the codebase. I'm not sure what the solution is, but is a gradual move towards introducing the ability to express source in natural language a solution to this problem? I'm not at all convinced of that.
On Nov 4, 2012, at 7:02 PM, Toby Thain <t...@telegraphics.com.au> wrote:
On Nov 5, 2012, at 12:23 PM, Henning Diedrich <hd2...@eonblast.com> wrote:
> On Nov 5, 2012, at 6:49 AM, "Richard O'Keefe" <o...@cs.otago.ac.nz> wrote:
>> To quote a Pogo strip, "you have the wrong mistake".
> That was the very point. In the instance I can do without more choice.
I think many of these problems about confusion are largely a non-issue. Given languages which actually allow you to write full, unnormalized Unicode (Google Go comes to mind), I see very few such actual problems in programs.
We already have these kinds of problems: tabs and spaces are not distinguishable. Neither is trailing white space. There are characters which are hard to tell apart, and some fonts make it a priority to distinguish them, like Richard said.
What I think is the key point is that I may be able to express certain things better with a larger symbol table. Yes, this also means I can obfuscate more, but I honestly only need the Erlang Pre-processor to win that battle of obfuscation.
Jesper Louis Andersen
Erlang Solutions Ltd., Copenhagen
Suppose the author is writing in a natural language where the *exact same unicode characters* have entirely different semantics?
Map = ...
In Dutch, "Map" translates to "Folder" for an English speaker -- but the kick is that the Dutch also happen to be amazing English speakers - so it could mean what you expect a map to be or not. So the naming in the source means precisely nothing and does not help you (no matter how much post-processing you may choose to apply).
I have enough of a hard time with computer languages without having to know over 200 natural languages to boot.
Is the right decision, perhaps, to say that we need to agree on just one natural language for source - since that means you need to learn at most two languages? (And, also, did that natural language decision not happen already in every major computer system?)
If you think it's a good idea to change that status quo, then please let me know which natural language to use (yes, even if the choice were not a natural language that I currently know), just so I have a limit on where I need to educate myself. I have enough issues with encodings without being asked to learn every natural language in existence.
/s
On Nov 5, 2012, at 8:28 AM, Steve Davis <steven.charles.da...@gmail.com> wrote:
> It just seemed to be a short way to encapsulate the issues I see with the issue. There's no doubt that ROK's posts were far more detailed but a casual reader may miss the point. I have no doubt that if a coder can write in their native language then they would choose to do that more times than not. There's also no real reason that module or function names should not be "unicoded"... so the intent of the entire source could be natural-language encoded and balkanize the codebase. I'm not sure what the solution is, but is a gradual move towards introducing the ability to express source in natural language a solution to this problem? I'm not at all convinced of that.
> On Nov 4, 2012, at 7:02 PM, Toby Thain <t...@telegraphics.com.au> wrote:
>> On 04/11/12 7:57 PM, Steve Davis wrote:
>>> I'm personally looking forward to attempting to maintain open source
>>> kanji. An awesome challenge.
>> Is it that a lot of people on this thread don't read ROK's posts? Or is there another explanation for what just looks like wilful obtuseness?
> I have enough of a hard time with computer languages without having to know over 200 natural languages to boot.
Please, don't be ridiculous, you'll never encounter 200 different natural languages in your life, code or not. It therefore does not matter if people write code in a language that you can't understand. You'll never have access to it anyway! Seen any French Erlang code yet? No? Then look harder.
I don't see why you guys are fixated on allowing more people using their own language through Unicode like it's something bad. Many languages can already be used in Erlang with latin1, and they certainly are. We just want to extend that to languages that require Unicode for writing.
Last I heard, Erlang models the real world. The world is concurrent. The world is also multilingual. The world has many different writing systems. Why would you want to prevent Erlang from catering to the billions of people who don't use English?
> Suppose the author writing in a natural language where the *exact same unicode characters* have entirely different semantics?
There's a science fiction story (sorry, I forget the title and author)
where one gimmick is the ambiguity of "Pet Shop".
> I have enough of a hard time with computer languages without having to know over 200 natural languages to boot.
> Is the right decision, perhaps, to say that we need to agree on just one natural language for source
No. It is that each *exchange* needs to involve an agreed language.
When I was at Quintus, we had a company in Israel develop some graphics
software for us. (Good software too, but for unrelated reasons we never
shipped it that I know of.) You say in the contract that the documentation
will be in English (although several of our people could read Hebrew) and
you say that the code and comments will be in English too.
What Unicode makes possible is a contract where a company in Israel asks
a company in the US to provide documentation and code in Hebrew, and
there is no technical barrier to them doing it. It also lets the
Israelis write scaffolding code in Hebrew if they want to.
We do not need "One Ring to rule them all and in the darkness bind them".
English for everything would suit me fine, if it _was_ English, and not
American (:-).
> - since that means you need to learn at most two languages? (And, also, did that natural language decision not happen already in every major computer system?)
Every major computer system has been busy unmaking that decision for
decades.
> If you think it's a good idea to change that status quo, then please let me know which natural language to use (yes, even if the choice were not a natural language that I currently know), just so I have a limit on where I need to educate myself. I have enough issues with encodings without being asked to learn every natural language in existence.
Nobody is asking you to do that.
For one thing, there are about six or seven thousand natural languages
in existence. Unicode covers dozens of _scripts_ that I've never heard
of. Heck, it includes scripts that nobody in the whole world can _read_.
(Unless you believe that the author of 'Code Breaker' got it right, and
I thought he was pretty convincing.) Yes, I do mean U+101D0 to U+101FD,
the PHAISTOS DISC SIGN ... characters.
We are *not* talking about something new here.
As I keep pointing out, *nothing* stops people writing Erlang
in Klingon. They don't even have to leave ASCII for that.
It's just that _if_ they do, they have to take the consequences of
nearly everyone else being unable to read it.
Nobody has forced you to learn Klingon just because it's possible
to write Erlang in Klingon, have they?
Or let's take a real example. Erlang currently uses Latin-1.
Latin-1 lets you write Icelandic. Has anybody been dumping Icelandic
Erlang on your desk, _expecting you to read it_?
Unicode introduces the problem that Erlang code might be written in
a *script* that you cannot read. But the problem that it might be
in a *language* you cannot make head or tail of has been with us for
a long time, and the sky has not fallen.
On Nov 5, 2012, at 9:24 PM, "Richard O'Keefe" <o...@cs.otago.ac.nz> wrote:
> Unicode introduces the problem that Erlang code might be written in
> a *script* that you cannot read. But the problem that it might be
> in a *language* you cannot make head or tail of has been with us for
> a long time, and the sky has not fallen.
I have to admit this to be true. So maybe it's not such a problematic issue, after all.