
PEP 3131: Supporting Non-ASCII Identifiers


"Martin v. Löwis"

May 13, 2007, 11:44:39 AM
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
pytho...@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural background
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?
- would you use them if it was possible to do so? in what cases?

Regards,
Martin


PEP: 3131
Title: Supporting Non-ASCII Identifiers
Version: $Revision: 55059 $
Last-Modified: $Date: 2007-05-01 22:34:25 +0200 (Di, 01 Mai 2007) $
Author: Martin v. Löwis <mar...@v.loewis.de>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 1-May-2007
Python-Version: 3.0
Post-History:


Abstract
========

This PEP suggests supporting non-ASCII letters (such as accented
characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.

Rationale
=========

Python code is written by many people in the world who are not familiar
with the English language, or even well-acquainted with the Latin
writing system. Such developers often desire to define classes and
functions with names in their native languages, rather than having to
come up with an (often incorrect) English translation of the concept
they want to name.

For some languages, common transliteration systems exist (in
particular, for the Latin-based writing systems). For other languages,
users have greater difficulty using Latin to write their native words.

Common Objections
=================

Some objections are often raised against proposals similar to this one.

People claim that they will not be able to use a library if, in order
to do so, they have to type characters they cannot produce on their
keyboards.
However, it is the choice of the designer of the library to decide on
various constraints for using the library: people may not be able to use
the library because they cannot get physical access to the source code
(because it is not published), or because licensing prohibits usage, or
because the documentation is in a language they cannot understand. A
developer wishing to make a library widely available needs to make a
number of explicit choices (such as publication, licensing, language
of documentation, and language of identifiers). It should always be the
choice of the author to make these decisions - not the choice of the
language designers.

In particular, projects wishing to have wide usage might want to
establish a policy that all identifiers, comments, and documentation
are written in English (see the GNU coding style guide for an example
of such a policy). Restricting the language to ASCII-only identifiers
does not ensure that comments and documentation are written in
English, or that the identifiers are actually English words, so an
additional policy is necessary anyway.

Specification of Language Changes
=================================

The syntax of identifiers in Python will be based on the Unicode
standard annex UAX-31 [1]_, with elaboration and changes as defined
below.

Within the ASCII range (U+0001..U+007F), the valid characters for
identifiers are the same as in Python 2.5. This specification only
introduces additional characters from outside the ASCII range. For
other characters, the classification uses the version of the Unicode
Character Database as included in the ``unicodedata`` module.

The identifier syntax is ``<ID_Start> <ID_Continue>*``.

``ID_Start`` is defined as all characters having one of the general
categories uppercase letters (Lu), lowercase letters (Ll), titlecase
letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers
(Nl), plus the underscore (XXX what are "stability extensions" listed in
UAX 31).

``ID_Continue`` is defined as all characters in ``ID_Start``, plus
nonspacing marks (Mn), spacing combining marks (Mc), decimal numbers
(Nd), and connector punctuation (Pc).
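
Taken together, the two category lists above can be sketched as a small
Python check (an illustration only, built on the ``unicodedata`` module
mentioned earlier; the real tokenizer would work at the parser level):

```python
import unicodedata

# General categories quoted above: Lu/Ll/Lt/Lm/Lo/Nl (plus "_")
# may start an identifier; Mn/Mc/Nd/Pc may additionally continue one.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_CATEGORIES = ID_START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}

def is_identifier(name):
    """Check a string against the <ID_Start> <ID_Continue>* syntax."""
    if not name:
        return False
    if name[0] != "_" and unicodedata.category(name[0]) not in ID_START_CATEGORIES:
        return False
    return all(ch == "_" or unicodedata.category(ch) in ID_CONTINUE_CATEGORIES
               for ch in name[1:])
```

With this sketch, ``is_identifier("Löffelstiel")`` and
``is_identifier("ошибка")`` hold, while ``is_identifier("2x")`` does not,
since Nd characters may continue but not start an identifier.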

All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.
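
A short illustration of why the NFC step matters: the same visible
identifier can be typed either with a precomposed character or with a
combining accent, and only normalization makes the two spellings compare
equal:

```python
import unicodedata

precomposed = "chang\u00e9"   # "changé" with é as a single code point
decomposed = "change\u0301"   # "changé" as e + COMBINING ACUTE ACCENT

assert precomposed != decomposed                 # raw strings differ...
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", decomposed)  # ...NFC unifies them
```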

Policy Specification
====================

As an addition to the Python Coding style, the following policy is
prescribed: All identifiers in the Python standard library MUST use
ASCII-only identifiers, and SHOULD use English words wherever feasible.

As an option, this specification can be applied to Python 2.x. In that
case, ASCII-only identifiers would continue to be represented as byte
string objects in namespace dictionaries; identifiers with non-ASCII
characters would be represented as Unicode strings.

Implementation
==============

The following changes will need to be made to the parser:

1. If a non-ASCII character is found in the UTF-8 representation of the
source code, a forward scan is made to find the first ASCII
non-identifier character (e.g. a space or punctuation character)

2. The entire UTF-8 string is passed to a function to normalize the
string to NFC, and then verify that it follows the identifier syntax.
No such callout is made for pure-ASCII identifiers, which continue to
be parsed the way they are today.

3. If this specification is implemented for 2.x, reflective libraries
(such as pydoc) must be verified to continue to work when Unicode
strings appear in ``__dict__`` slots as keys.
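
Steps 1 and 2 above can be modelled roughly as follows (a hypothetical
helper for illustration, not the actual CPython tokenizer code):

```python
import unicodedata

def scan_candidate(source, start):
    """Forward-scan from `start` to the first ASCII non-identifier
    character (step 1), then NFC-normalize the candidate (step 2)."""
    end = start
    while end < len(source):
        ch = source[end]
        # Stop at an ASCII character that can never occur in an
        # identifier, e.g. a space or punctuation character.
        if ord(ch) < 128 and not (ch.isalnum() or ch == "_"):
            break
        end += 1
    return unicodedata.normalize("NFC", source[start:end])
```

Validation against the identifier syntax would then follow as a separate
check; pure-ASCII tokens would skip this callout entirely, as the PEP
states.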

References
==========

.. [1] http://www.unicode.org/reports/tr31/


Copyright
=========

This document has been placed in the public domain.

dus...@v.igoro.us

May 13, 2007, 12:00:46 PM
to Martin v. Löwis, pytho...@python.org
On Sun, May 13, 2007 at 05:44:39PM +0200, "Martin v. Löwis" wrote:
> - should non-ASCII identifiers be supported? why?

The only objection that comes to mind is that adding such support may
make some distinct identifiers visually indistinguishable. IIRC the DNS
system has had this problem, leading to much phishing abuse.

I don't necessarily think that the objection is strong enough to reject
the idea -- programmers using non-ASCII symbols would be responsible for
the consequences of their character choice.

Dustin

"Martin v. Löwis"

May 13, 2007, 12:13:02 PM
to dus...@v.igoro.us, pytho...@python.org
> The only objection that comes to mind is that adding such support may
> make some distinct identifiers visually indistinguishable. IIRC the DNS
> system has had this problem, leading to much phishing abuse.

This is a commonly-raised objection, but I don't understand why people
see it as a problem. The phishing issue surely won't apply, as you
normally don't "click" on identifiers, but rather type them. In a
phishing case, it is normally difficult to type the fake character
(because the phishing relies on you mistaking the character for another
one, so you would type the wrong identifier).

People have mentioned that this could be used to obscure your code - but
there are so many ways to write obscure code that I don't see a problem
in adding yet another way.

People also mentioned that they might mistake identifiers in a regular,
non-phishing, non-joking scenario, because they can't tell whether the
second letter of MAXLINESIZE is a Latin A or Greek Alpha. I find that
hard to believe - if the rest of the identifier is Latin, the A surely
also is Latin, and if the rest is Greek, it's likely an Alpha. The issue
is only with single-letter identifiers, and those are most common
as local variables. Then, it's an Alpha if there is also a Beta and
a Gamma as a local variable - if you have B and C also, it's likely A.

> I don't necessarily think that the objection is strong enough to reject
> the idea -- programmers using non-ASCII symbols would be responsible for
> the consequences of their character choice.

Indeed.

Martin

André

May 13, 2007, 12:36:13 PM
On May 13, 12:44 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> PEP 1 specifies that PEP authors need to collect feedback from the
> community. As the author of PEP 3131, I'd like to encourage comments
> to the PEP included below, either here (comp.lang.python), or to
> python-3...@python.org

>
> In summary, this PEP proposes to allow non-ASCII letters as
> identifiers in Python. If the PEP is accepted, the following
> identifiers would also become valid as class, function, or
> variable names: Löffelstiel, changé, ошибка, or 売り場
> (hoping that the latter one means "counter").
>
> I believe this PEP differs from other Py3k PEPs in that it really
> requires feedback from people with different cultural background
> to evaluate it fully - most other PEPs are culture-neutral.
>
> So, please provide feedback, e.g. perhaps by answering these
> questions:
> - should non-ASCII identifiers be supported? why?

I used to think differently. However, I would say a strong YES. They
would be extremely useful when teaching programming.

> - would you use them if it was possible to do so? in what cases?

Only if I was teaching native French speakers.

> Policy Specification
> ====================
>
> As an addition to the Python Coding style, the following policy is
> prescribed: All identifiers in the Python standard library MUST use
> ASCII-only identifiers, and SHOULD use English words wherever feasible.
>

I would add something like:

Any module released for general use SHOULD use ASCII-only identifiers
in the public API.

Thanks for this initiative.

André

John Nagle

May 13, 2007, 1:30:11 PM
Martin v. Löwis wrote:
> PEP 1 specifies that PEP authors need to collect feedback from the
> community. As the author of PEP 3131, I'd like to encourage comments
> to the PEP included below, either here (comp.lang.python), or to
> pytho...@python.org
>
> In summary, this PEP proposes to allow non-ASCII letters as
> identifiers in Python. If the PEP is accepted, the following
> identifiers would also become valid as class, function, or
> variable names: Löffelstiel, changé, ошибка, or 売り場
> (hoping that the latter one means "counter").

> All identifiers are converted into the normal form NFC while parsing;
> comparison of identifiers is based on NFC.

That may not be restrictive enough, because it permits multiple
different lexical representations of the same identifier in the same
text. Search and replace operations on source text might not find
all instances of the same identifier. Identifiers should be required
to be written in source text with a unique source text representation,
probably NFC, or be considered a syntax error.
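
The stricter rule suggested here, rejecting any identifier whose source
spelling is not already NFC rather than silently normalizing it, could
look like this (a hypothetical check; ``unicodedata.is_normalized``
requires Python 3.8+):

```python
import unicodedata

def check_source_form(token):
    """Reject identifiers not written in NFC form in the source text."""
    if unicodedata.is_normalized("NFC", token):
        return token
    raise SyntaxError("identifier %r is not in NFC source form" % token)
```

Under this rule, ``chang\u00e9`` would pass while the decomposed
spelling ``change\u0301`` would be a syntax error, so search and replace
on source text could rely on a unique spelling.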

I'd suggest restricting identifiers under the rules of UTS #39,
profile 2, "Highly Restrictive". This limits mixing of scripts
in a single identifier; you can't mix Hebrew and ASCII, for example,
which prevents problems with mixing right to left and left to right
scripts. Domain names have similar restrictions.
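
A very rough approximation of such a mixed-script check is sketched
below. Python's ``unicodedata`` module does not expose the Unicode
Script property, so this sketch guesses the script from the character
name prefix; a real implementation would use the Unicode Scripts.txt
data, and UTS #39 additionally permits certain combinations (e.g. with
Han) that this simplification ignores:

```python
import unicodedata

def rough_script(ch):
    """Guess a character's script from its Unicode name (approximation)."""
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "GREEK", "CYRILLIC", "HEBREW", "ARABIC"):
        if name.startswith(script):
            return script
    return "COMMON"   # digits, underscore, anything unrecognized

def is_single_script(identifier):
    scripts = {rough_script(ch) for ch in identifier} - {"COMMON"}
    return len(scripts) <= 1
```

Under this check, ``MAXLINESIZ\u0395`` (ending in a Greek Epsilon) is
rejected as mixed-script, while plain ``MAXLINESIZE`` passes.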

John Nagle

Paul Rubin

May 13, 2007, 1:49:09 PM
"Martin v. Löwis" <mar...@v.loewis.de> writes:
> So, please provide feedback, e.g. perhaps by answering these
> questions:
> - should non-ASCII identifiers be supported? why?

No, and especially no without mandatory declarations of all variables.
Look at the problems of non-ascii characters in domain names and the
subsequent invention of Punycode. Maintaining code that uses those
identifiers in good faith will already be a big enough hassle, since
it will require installing and getting familiar with keyboard setups
and editing tools needed to enter those characters. Then there's the
issue of what happens when someone tries to slip a malicious patch
through a code review on purpose, by using homoglyphic characters
similar to the way domain name phishing works. Those tricks have also
been used to re-insert bogus articles into Wikipedia, circumventing
administrative blocks on the article names.

> - would you use them if it was possible to do so? in what cases?

I would never insert them into a program. In existing programs where
they were used, I would remove them everywhere I could.

André

May 13, 2007, 1:51:02 PM
On May 13, 2:30 pm, John Nagle <n...@animats.com> wrote:
> Martin v. Löwis wrote:
> > PEP 1 specifies that PEP authors need to collect feedback from the
> > community. As the author of PEP 3131, I'd like to encourage comments
> > to the PEP included below, either here (comp.lang.python), or to
> > python-3...@python.org

>
> > In summary, this PEP proposes to allow non-ASCII letters as
> > identifiers in Python. If the PEP is accepted, the following
> > identifiers would also become valid as class, function, or
> > variable names: Löffelstiel, changé, ошибка, or 売り場
> > (hoping that the latter one means "counter").
> > All identifiers are converted into the normal form NFC while parsing;
> > comparison of identifiers is based on NFC.
>
> That may not be restrictive enough, because it permits multiple
> different lexical representations of the same identifier in the same
> text. Search and replace operations on source text might not find
> all instances of the same identifier. Identifiers should be required
> to be written in source text with a unique source text representation,
> probably NFC, or be considered a syntax error.
>
> I'd suggest restricting identifiers under the rules of UTS #39,
> profile 2, "Highly Restrictive". This limits mixing of scripts
> in a single identifier; you can't mix Hebrew and ASCII, for example,
> which prevents problems with mixing right to left and left to right
> scripts. Domain names have similar restrictions.
>
> John Nagle

Python keywords MUST be in ASCII ... so the above restriction can't
work. Unless the restriction is removed (which would be a separate
PEP).

André

Paul Rubin

May 13, 2007, 1:52:12 PM
"Martin v. Löwis" <mar...@v.loewis.de> writes:
> This is a commonly-raised objection, but I don't understand why people
> see it as a problem. The phishing issue surely won't apply, as you
> normally don't "click" on identifiers, but rather type them. In a
> phishing case, it is normally difficult to type the fake character
> (because the phishing relies on you mistaking the character for another
> one, so you would type the wrong identifier).

It certainly does apply, if you're maintaining a program and someone
submits a patch. In that case you neither click nor type the
character. You'd normally just make sure the patched program passes
the existing test suite, and examine the patch on the screen to make
sure it looks reasonable. The phishing possibilities are obvious.

André

May 13, 2007, 2:16:25 PM
On May 13, 12:44 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> PEP 1 specifies that PEP authors need to collect feedback from the
> community. As the author of PEP 3131, I'd like to encourage comments
> to the PEP included below, either here (comp.lang.python), or to
> python-3...@python.org
>

It should be noted that the Python community may use other forums, in
other languages. They would likely be a lot more enthusiastic about
this PEP than the usual crowd here (comp.lang.python).

André


Anton Vredegoor

May 13, 2007, 2:21:06 PM
Martin v. Löwis wrote:

> In summary, this PEP proposes to allow non-ASCII letters as
> identifiers in Python. If the PEP is accepted, the following
> identifiers would also become valid as class, function, or
> variable names: Löffelstiel, changé, ошибка, or 売り場
> (hoping that the latter one means "counter").

I am against this PEP for the following reasons:

It will split up the Python user community into different language or
interest groups, without any corresponding benefit in making the
language more expressive algorithmically.

Some time ago there was a discussion about introducing macros into the
language. Among the reasons why macros were excluded was precisely
because anyone could start writing their own kind of dialect of Python
code, resulting in less people being able to read what other programmers
wrote. And that last thing: 'Being able to easily read what other people
wrote' (sometimes that 'other people' is yourself half a year later, but
that isn't relevant in this specific case) is one of the main virtues in
the Python programming community. Correct me if I'm wrong please.

At that time I was considering giving up some user conformity because
the very powerful syntax extensions would make Python rival Lisp. It's
worth sacrificing something if one gets some other thing in return.

However since then we have gained metaclasses, iterators and generators
and even a C-like 'if' construct. Personally I'd also like to have a
'repeat-until'. These things are enough to keep us busy for a long time
and in some respects this new syntax is even more powerful/dangerous
than macros. But most importantly these extra burdens on the ease with
which one is to read code are offset by gaining more expressiveness in
the *coding* of scripts.

While I have little doubt that in the end some stubborn mathematician or
Frenchman will succeed in writing a preprocessor that would enable him
to indoctrinate his students into his specific version of reality, I see
little reason to actively endorse such foolishness.

The last argument I'd like to make is about the very possible reality
that in a few years the Internet will be dominated by the Chinese
language instead of by the English language. As a Dutchman I have no
special interest in English being the language of the Internet but
-given the status quo- I can see the advantages of everyone speaking the
*same* language. If it be Chinese, Chinese I will start to learn,
however inept I might be at it at first.

That doesn't mean, however, that one should actively open up to a kind
of contest as to which language will become the main language! On the
contrary, one should hold on as long as possible to the united group one
has, instead of dispersing into all kinds of experimental directions.

Do we harm the Chinese in this way one might ask by making it harder for
them to gain access to the net? Do we harm ourselves by not opening up
in time to the new status quo? Yes, in a way these are valid points, but
one should not forget that more advanced countries also have a
responsibility to lead the way by providing an example, one should not
think too lightly about that.

Anyway, I feel that it will not be possible to hold off these
developments in the long run. But great beneficial effects can still be
attained by keeping the language as simple and expressive as possible,
and by adjusting to new realities as soon as one of them becomes
undeniably apparent (which is something entirely different from
enthusiastically inviting them in and letting them fight it out against
each other in your own house), all the while taking responsibility to
lead the way for as long as one has any consensus left.

A.


Stefan Behnel

May 13, 2007, 3:01:21 PM
to Anton Vredegoor
Anton Vredegoor wrote:
>> In summary, this PEP proposes to allow non-ASCII letters as
>> identifiers in Python. If the PEP is accepted, the following
>> identifiers would also become valid as class, function, or
>> variable names: Löffelstiel, changé, ошибка, or 売り場
>> (hoping that the latter one means "counter").
>
> I am against this PEP for the following reasons:
>
> It will split up the Python user community into different language or
> interest groups without having any benefit as to making the language
> more expressive in an algorithmic way.


We must distinguish between "identifiers named in a non-english language" and
"identifiers written with non-ASCII characters".

While the first is already allowed as long as the transcription uses only
ASCII characters, the second is currently forbidden and is what this PEP is about.

So, nothing currently keeps you from giving names to identifiers that are
impossible to understand by, say, Americans (ok, that's easy anyway).

For example, I could write

def zieheDreiAbVon(wert):
    return zieheAb(wert, 3)

and most people on earth would not have a clue what this is good for. However,
someone who is fluent enough in German could guess from the names what this does.

I do not think non-ASCII characters make this 'problem' any worse. So I must
ask people to restrict their comments to the actual problem that this PEP is
trying to solve.

Stefan

Jarek Zgoda

May 13, 2007, 3:07:05 PM
Martin v. Löwis napisał(a):

> So, please provide feedback, e.g. perhaps by answering these
> questions:
> - should non-ASCII identifiers be supported? why?

No, because "programs must be written for people to read, and only
incidentally for machines to execute". Using anything other than the
"lowest common denominator" (ASCII) will restrict the accessibility of
code. This is not literature, which requires qualified translators to
get the text from Hindi (or Persian, or Chinese, or Georgian, or...)
to Polish.

While I can read code with Hebrew, Russian or Greek names
transliterated to ASCII, I would not be able to read such code in its
native script.

> For some languages, common transliteration systems exist (in particular,
> for the Latin-based writing systems). For other languages, users have
> larger difficulties to use Latin to write their native words.

This is one of the least disturbing difficulties when it comes to programming.

--
Jarek Zgoda
http://jpa.berlios.de/

Stefan Behnel

May 13, 2007, 3:10:46 PM
to "Martin v. Löwis"
Martin v. Löwis wrote:

> PEP 1 specifies that PEP authors need to collect feedback from the
> community. As the author of PEP 3131, I'd like to encourage comments
> to the PEP included below, either here (comp.lang.python), or to
> pytho...@python.org
>
> In summary, this PEP proposes to allow non-ASCII letters as
> identifiers in Python. If the PEP is accepted, the following
> identifiers would also become valid as class, function, or
> variable names: Löffelstiel, changé, ошибка, or 売り場
> (hoping that the latter one means "counter").
>
> I believe this PEP differs from other Py3k PEPs in that it really
> requires feedback from people with different cultural background
> to evaluate it fully - most other PEPs are culture-neutral.
>
> So, please provide feedback, e.g. perhaps by answering these
> questions:
> - should non-ASCII identifiers be supported? why?
> - would you use them if it was possible to do so? in what cases?


To make it clear: this PEP considers "identifiers written with non-ASCII
characters", not "identifiers named in a non-english language".

While the latter is already allowed as long as the transcription uses
only ASCII characters, the former is currently forbidden and is what
this PEP is about.

Now, I am not a strong supporter (most public code will use English
identifiers anyway) but we should not forget that Python supports encoding
declarations in source files and thus has much cleaner support for non-ASCII
source code than, say, Java. So, introducing non-ASCII identifiers is just a
small step further. Disallowing this does *not* guarantee in any way that
identifiers are understandable for English native speakers. It only guarantees
that identifiers are always *typable* by people who have access to latin
characters on their keyboard. A rather small advantage, I'd say.
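
For readers unfamiliar with the encoding declarations mentioned here:
under PEP 263 a source file can already declare its encoding, so
non-ASCII text is legal today in comments and string literals, just not
in identifiers. A minimal illustration (the German names are invented
for the example):

```python
# -*- coding: utf-8 -*-
# Non-ASCII is already fine in comments and in string *values*...
begruessung = "Grüße aus Köln"

# ...what the PEP would add is the right to put it in the *names* too,
# e.g. a variable called gruß instead of the transcription begruessung.
```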

The capability of a Unicode-aware language to express non-English identifiers
in a non-ASCII encoding totally makes sense to me.

Stefan

Stefan Behnel

May 13, 2007, 3:31:19 PM

Luckily, you will never be able to touch every program in the world.

Stefan

Stefan Behnel

May 13, 2007, 3:44:49 PM
to Jarek Zgoda
Jarek Zgoda wrote:
> Martin v. Löwis napisał(a):

Uuups, is that a non-ASCII character in there? Why don't you keep them out of
an English speaking newsgroup?


>> So, please provide feedback, e.g. perhaps by answering these
>> questions:
>> - should non-ASCII identifiers be supported? why?
>
> No, because "programs must be written for people to read, and only
> incidentally for machines to execute". Using anything other than "lowest
> common denominator" (ASCII) will restrict accessibility of code.

No, but it would make it a lot easier for a lot of people to use descriptive
names. Remember: we're all adults here, right?


> While I can read the code with Hebrew, Russian or Greek names
> transliterated to ASCII, I would not be able to read such code in native.

Then maybe it was code that was not meant to be read by you?

In the (not so small) place where I work, we tend to use descriptive names *in
German* for the code we write, mainly for reasons of domain clarity. The
*only* reason why we still use the (simple but ugly) ASCII-transcription
(ü->ue etc.) for identifiers is that we program in Java and Java lacks a
/reliable/ way to support non-ASCII characters in source code. Thanks to PEP
263 and 3120, Python does not suffer from this problem, but it suffers from
the bigger problem of not *allowing* non-ASCII characters in identifiers. And
I believe that's a rather arbitrary decision.

The more I think about it, the more I believe that this restriction should be
lifted. 'Any' non-ASCII identifier should be allowed where developers decide
that it makes sense.

Stefan

Josiah Carlson

May 13, 2007, 3:58:27 PM
Stefan Behnel wrote:
> Anton Vredegoor wrote:
>>> In summary, this PEP proposes to allow non-ASCII letters as
>>> identifiers in Python. If the PEP is accepted, the following
>>> identifiers would also become valid as class, function, or
>>> variable names: Löffelstiel, changé, ошибка, or 売り場
>>> (hoping that the latter one means "counter").
>> I am against this PEP for the following reasons:
>>
>> It will split up the Python user community into different language or
>> interest groups without having any benefit as to making the language
>> more expressive in an algorithmic way.
>
> We must distinguish between "identifiers named in a non-english language" and
> "identifiers written with non-ASCII characters".
[snip]

> I do not think non-ASCII characters make this 'problem' any worse. So I must
> ask people to restrict their comments to the actual problem that this PEP is
> trying to solve.

Really? Because when I am reading source code, even if a particular
variable *name* is a sequence of characters that I cannot identify as a
word that I know, I can at least spell it out using Latin characters, or
perhaps even attempt to pronounce it (verbalization of a word, even if
it is an incorrect verbalization, I find helps me to remember a variable
and use it later).

On the other hand, the introduction of some 60k+ valid unicode glyphs
into the set of characters that can be seen as a name in Python would
make any such attempts by anyone who is not a native speaker (and even
native speakers in the case of the more obscure Kanji glyphs) an
exercise in futility.

As it stands, people who use Python (and the vast majority of other
programming languages) learn the 52 upper/lowercase variants of the
latin alphabet (and sometimes the 0-9 number characters for some parts
of the world). That's it. 62 glyphs at the worst. But a huge portion
of these people have already been exposed to these characters through
school, the internet, etc., and this isn't likely to change (regardless
of the 'impending' Chinese population dominance on the internet).

Indeed, the lack of the 60k+ glyphs as valid name characters can make
the teaching of Python to groups of people that haven't been exposed to
the Latin alphabet more difficult, but those people who are exposed to
programming are also typically exposed to the internet, on which Latin
alphabets dominate (never mind that html tags are Latin characters, as
are just about every daemon configuration file, etc.). Exposure to the
Latin alphabet isn't going to go away, and Python is very unlikely to be
the first exposure programmers have to the Latin alphabet (except for
OLPC, but this PEP is about a year late to the game to change that).
And even if Python *is* the first time children or adults are exposed to
the Latin alphabet, one would hope that 62 characters to learn to 'speak
the language of Python' is a small price to pay to use it.

Regarding different characters sharing the same glyphs, it is a problem.
Say that you are importing a module written by a mathematician that
uses an actual capital Greek alpha for a name. When a user sits down to
use it, they could certainly get NameErrors, AttributeErrors, etc., and
never understand why it is the case. Their fancy-schmancy unicode
enabled terminal will show them what looks like the Latin A, but it will
in fact be the Greek Α. Until they copy/paste, check its ord(), etc.,
they will be baffled. It isn't a problem now because A = Α is a syntax
error, but it can and will become a problem if it is allowed to.
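
The two capital letters in question really are distinct code points that
merely share a glyph in most fonts, and NFC normalization does not unify
them:

```python
import unicodedata

latin_a = "A"            # U+0041
greek_alpha = "\u0391"   # U+0391

assert latin_a != greek_alpha
assert unicodedata.name(latin_a) == "LATIN CAPITAL LETTER A"
assert unicodedata.name(greek_alpha) == "GREEK CAPITAL LETTER ALPHA"
# NFC does not help here: the two remain distinct identifiers.
assert unicodedata.normalize("NFC", latin_a) != \
       unicodedata.normalize("NFC", greek_alpha)
```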

But this issue isn't limited to different characters sharing glyphs!
It's also about being able to type names to use them in your own code
(generally very difficult if not impossible for many non-Latin
characters), or even be able to display them. And no number of
guidelines, suggestions, etc., against distributing libraries with
non-Latin identifiers will stop it from happening, and *will* fragment
the community as Anton (and others) have stated.

- Josiah

Michael Torrie

May 13, 2007, 3:49:11 PM
to pytho...@python.org
On Sun, 2007-05-13 at 21:01 +0200, Stefan Behnel wrote:

> For example, I could write
>
> def zieheDreiAbVon(wert):
>     return zieheAb(wert, 3)
>
> and most people on earth would not have a clue what this is good for. However,
> someone who is fluent enough in German could guess from the names what this does.
>
> I do not think non-ASCII characters make this 'problem' any worse. So I must
> ask people to restrict their comments to the actual problem that this PEP is
> trying to solve.

I think non-ASCII characters make the problem far, far worse. While I
may not understand what the function is from its name in your example,
allowing non-ASCII characters makes it worse by forcing all would-be
code readers to have all kinds of fonts installed just to view the
source code. The same goes for reporting exceptions. At least in your
example I know the exception occurred in zieheDreiAbVon. But if that
identifier is some UTF-8 string, how do I go about finding it in my text
editor, or even reporting the message to the developers? I don't happen
to have that particular keymap installed in my linux system, so I can't
even type the letters!

So given that people can already transliterate their language for use as
identifiers, I think avoiding non-ASCII character sets is a good idea.
ASCII is simply the lowest common denominator and is supported by *all*
configurations and locales on all developers' systems.

>
> Stefan

Jarek Zgoda

May 13, 2007, 4:01:18 PM
Stefan Behnel wrote:

>> While I can read the code with Hebrew, Russian or Greek names
>> transliterated to ASCII, I would not be able to read such code in native.
>
> Then maybe it was code that was not meant to be read by you?

OK, then. As a code obfuscation measure this would fit perfectly.

Stefan Behnel

May 13, 2007, 4:04:59 PM
to Josiah Carlson
Josiah Carlson wrote:
> It's also about being able to type names to use them in your own code
> (generally very difficult if not impossible for many non-Latin
> characters), or even be able to display them. And no number of
> guidelines, suggestions, etc., against distributing libraries with
> non-Latin identifiers will stop it from happening, and *will* fragment
> the community as Anton (and others) have stated.

Ever noticed how the community is already fragmented into people working on
project A and people not working on project A? Why shouldn't the people
working on project A agree on what language they write and spell their
identifiers in? And don't forget about projects B, C, and all the others.

I agree that code posted to comp.lang.python should use English identifiers
and that it is worth considering the use of English identifiers in open source
code that is posted to a public OS project site. Note that I didn't say "ASCII
identifiers" but plain English identifiers. All other code should use the
language and encoding that fits its environment best.

Stefan

Terry Reedy

unread,
May 13, 2007, 4:56:22 PM5/13/07
to pytho...@python.org

"Stefan Behnel" <stefan.beh...@web.de> wrote in message
news:46476081...@web.de...

| For example, I could write
|
| def zieheDreiAbVon(wert):
|     return zieheAb(wert, 3)
|
| and most people on earth would not have a clue what this is good for.
However,
| someone who is fluent enough in German could guess from the names what
this does.
|
| I do not think non-ASCII characters make this 'problem' any worse.

It is ridiculous claims like this and the consequent refusal to admit,
address, and ameliorate the 50x worse problems that would be introduced
that lead me to oppose the PEP in its current form.

Terry Jan Reedy

Bruno Desthuilliers

unread,
May 13, 2007, 5:55:11 PM5/13/07
to
Martin v. Löwis wrote:

> PEP 1 specifies that PEP authors need to collect feedback from the
> community. As the author of PEP 3131, I'd like to encourage comments
> to the PEP included below, either here (comp.lang.python), or to
> pytho...@python.org
>
> In summary, this PEP proposes to allow non-ASCII letters as
> identifiers in Python. If the PEP is accepted, the following
> identifiers would also become valid as class, function, or
> variable names: Löffelstiel, changé, ошибка, or 売り場
> (hoping that the latter one means "counter").
>
> I believe this PEP differs from other Py3k PEPs in that it really
> requires feedback from people with different cultural background
> to evaluate it fully - most other PEPs are culture-neutral.
>
> So, please provide feedback, e.g. perhaps by answering these
> questions:
> - should non-ASCII identifiers be supported?

No.

> why?

Because it will definitively make code-sharing impossible. Live with it
or not, but CS is English-speaking, period. I just can't understand
code with Spanish or German (two languages I have some notion of)
identifiers, so let's not even talk about other alphabets...

NB: I'm *not* a native English speaker, I do *not* live in an English-
speaking country, and my mother tongue requires a non-ASCII encoding.
And I don't have any special sympathy for the USA. And yes, I do write my
code - including comments - in English.

Bruno Desthuilliers

unread,
May 13, 2007, 6:09:39 PM5/13/07
to
Stefan Behnel wrote:

> Anton Vredegoor wrote:
>
>>>In summary, this PEP proposes to allow non-ASCII letters as
>>>identifiers in Python. If the PEP is accepted, the following
>>>identifiers would also become valid as class, function, or
>>>variable names: Löffelstiel, changé, ошибка, or 売り場
>>>(hoping that the latter one means "counter").
>>
>>I am against this PEP for the following reasons:
>>
>>It will split up the Python user community into different language or
>>interest groups without having any benefit as to making the language
>>more expressive in an algorithmic way.
>
>
>
> We must distinguish between "identifiers named in a non-english language" and
> "identifiers written with non-ASCII characters".
>
> While the first is already allowed as long as the transcription uses only
> ASCII characters, the second is currently forbidden and is what this PEP is about.
>
> So, nothing currently keeps you from giving names to identifiers that are
> impossible to understand by, say, Americans (ok, that's easy anyway).
>
> For example, I could write
>
> def zieheDreiAbVon(wert):
>     return zieheAb(wert, 3)
>
> and most people on earth would not have a clue what this is good for.

Which is exactly why I don't agree with adding support for non-ASCII
identifiers. Using non-English identifiers should be strongly
discouraged, not openly supported.

> However,
> someone who is fluent enough in German could guess from the names what this does.
>
> I do not think non-ASCII characters make this 'problem' any worse.

It does, by openly stating that it's ok to write unreadable code and
offering support for it.

> So I must
> ask people to restrict their comments to the actual problem that this PEP is
> trying to solve.

Sorry, but we can't dismiss the side effects. Learning enough
CS-oriented technical English to actually read and write code and
documentation is not such a big deal - even I managed to do so, and I'm
a bit impaired when it comes to foreign languages.

MRAB

unread,
May 13, 2007, 5:31:00 PM5/13/07
to
Perhaps there could be the option of typing and showing characters as
\uxxxx, e.g. \u00FC instead of ü (u-umlaut), or showing them in a
different colour if they're not in a specified set.
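As a rough illustration of the display half of that idea, here is a hedged sketch; the helper name is invented, and it assumes a modern Python with str.isascii():

```python
def escape_non_ascii(name):
    """Show each non-ASCII character in an identifier as a \\uXXXX escape."""
    return "".join(ch if ch.isascii() else "\\u%04X" % ord(ch) for ch in name)

# ü-style characters become explicit, typable escapes:
print(escape_non_ascii("größe"))  # → gr\u00F6\u00DFe
```

An editor could apply the same transformation (or a highlight colour) to any character outside a configured set.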

Bruno Desthuilliers

unread,
May 13, 2007, 6:27:06 PM5/13/07
to
Stefan Behnel wrote:

> Martin v. Löwis wrote:
>
>>PEP 1 specifies that PEP authors need to collect feedback from the
>>community. As the author of PEP 3131, I'd like to encourage comments
>>to the PEP included below, either here (comp.lang.python), or to
>>pytho...@python.org
>>
>>In summary, this PEP proposes to allow non-ASCII letters as
>>identifiers in Python. If the PEP is accepted, the following
>>identifiers would also become valid as class, function, or
>>variable names: Löffelstiel, changé, ошибка, or 売り場
>>(hoping that the latter one means "counter").
>>
>>I believe this PEP differs from other Py3k PEPs in that it really
>>requires feedback from people with different cultural background
>>to evaluate it fully - most other PEPs are culture-neutral.
>>
>>So, please provide feedback, e.g. perhaps by answering these
>>questions:
>>- should non-ASCII identifiers be supported? why?
>>- would you use them if it was possible to do so? in what cases?
>
>
>
> To make it clear: this PEP considers "identifiers written with non-ASCII
> characters", not "identifiers named in a non-english language".

You cannot just claim that these are two totally distinct issues and get
away with it. The fact is that non-English identifiers are already a
bad thing when it comes to sharing and cooperation, and it's obvious
that non-ASCII glyphs can only make things worse - since it's obvious
that people willing to use such a "feature" *won't* use it to spell
English identifiers anyway.

> While the first is already allowed as long as the transcription uses only
> ASCII characters, the second is currently forbidden and is what this PEP is about.
>
> Now, I am not a strong supporter (most public code will use English
> identifiers anyway) but we should not forget that Python supports encoding
> declarations in source files and thus has much cleaner support for non-ASCII
> source code than, say, Java. So, introducing non-ASCII identifiers is just a
> small step further.

I would certainly not qualify this as a "small" step.

> Disallowing this does *not* guarantee in any way that
> identifiers are understandable for English native speakers.

I'm not an English native speaker. And there's more than a subtle
distinction between "not guaranteeing" and "encouraging".

> It only guarantees
> that identifiers are always *typable* by people who have access to latin
> characters on their keyboard. A rather small advantage, I'd say.
>
> The capability of a Unicode-aware language to express non-English identifiers
> in a non-ASCII encoding totally makes sense to me.

It does of course make sense (at least if you add support for a non-English,
non-ASCII translation of the *whole* language - keywords, builtins and
the whole standard lib included). But it's still a very bad idea IMHO.

Virgil Dupras

unread,
May 13, 2007, 6:03:28 PM5/13/07
to
On May 13, 11:44 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> PEP 1 specifies that PEP authors need to collect feedback from the
> community. As the author of PEP 3131, I'd like to encourage comments
> to the PEP included below, either here (comp.lang.python), or to
> python-3...@python.org

I don't think that supporting non-ASCII characters for identifiers
would cause any problem. Most people won't use it anyway. People who
use non-English identifiers for their project and hope for it to be
popular worldwide will probably just fail because of their foolish
coding style policy choice. I put that kind of choice in the same
ballpark as deciding to use Hungarian notation for Python code.

As for malicious patch submission, I think this is a non-issue.
A tool to detect any non-ASCII identifier in a file
would be a trivial script to write.
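One hedged sketch of such a checker, using only the standard library's tokenize module; the function name and the sample source are invented, and this assumes a modern Python 3:

```python
import io
import tokenize

def find_non_ascii_names(source):
    """Return (line, column, name) for every NAME token that
    contains a character outside the ASCII range."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            hits.append((tok.start[0], tok.start[1], tok.string))
    return hits

sample = "def zieheDreiAbVon(wert):\n    größe = wert - 3\n    return größe\n"
print(find_non_ascii_names(sample))  # → [(2, 4, 'größe'), (3, 11, 'größe')]
```

A project could run something like this over every incoming patch and reject any hit outside its agreed character repertoire.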

I say that if there is a demand for it, let's do it.

Alexander Schmolck

unread,
May 13, 2007, 6:26:25 PM5/13/07
to
Jarek Zgoda <jzg...@o2.usun.pl> writes:

> Martin v. Löwis wrote:
>
>> So, please provide feedback, e.g. perhaps by answering these
>> questions:
>> - should non-ASCII identifiers be supported? why?
>
> No, because "programs must be written for people to read, and only
> incidentally for machines to execute". Using anything other than the "lowest
> common denominator" (ASCII) will restrict the accessibility of code. This is
> not literature, which requires qualified translators to get the text
> from Hindi (or Persian, or Chinese, or Georgian, or...) to Polish.
>
> While I can read code with Hebrew, Russian or Greek names
> transliterated to ASCII, I would not be able to read such code in the native script.

Who or what would force you to? Do you currently have to deal with Hebrew,
Russian or Greek names transliterated into ASCII? I don't, and I suspect this
whole panic about everyone suddenly having to deal with code written in Kanji,
Klingon, hieroglyphs etc. is unfounded -- such code would drastically
reduce its own "fitness" (much more so than the ASCII-transliterated Chinese,
Hebrew and Greek code I never seem to come across), so I think the chances
that it will be thrust upon you (or anyone else in this thread) are minuscule.


Plenty of programming languages already support unicode identifiers, so if
there is any rational basis for this fear it shouldn't be hard to come up with
-- where is it?

'as

BTW, I'm not sure you aren't underestimating your own intellectual faculties
if you think you couldn't cope with Greek or Russian characters. On the other hand,
I wonder if you aren't overestimating your ability to reasonably deal with code
written in a completely foreign language as long as it's ASCII -- for anything
of nontrivial length, surely doing anything with such code would already be
orders of magnitude harder?

Anders J. Munch

unread,
May 13, 2007, 6:32:06 PM5/13/07
to
Josiah Carlson wrote:
> On the other hand, the introduction of some 60k+ valid unicode glyphs
> into the set of characters that can be seen as a name in Python would
> make any such attempts by anyone who is not a native speaker (and even
> native speakers in the case of the more obscure Kanji glyphs) an
> exercise in futility.
>

So you gather up a list of identifiers and send it out for translation. Having
actual Kanji glyphs instead of a mix of transliterations and bad English will only
make that easier.

That won't even cost you anything, since you were already having docstrings
translated, along with comments and documentation, right?

> But this issue isn't limited to different characters sharing glyphs!
> It's also about being able to type names to use them in your own code
> (generally very difficult if not impossible for many non-Latin
> characters), or even be able to display them.

For display, tell your editor the UTF-8 source file is really Latin-1. For
entry, copy and paste.

- Anders

Alex Martelli

unread,
May 13, 2007, 6:35:15 PM5/13/07
to
Bruno Desthuilliers <bdesth.qu...@free.quelquepart.fr> wrote:

> > Disallowing this does *not* guarantee in any way that
> > identifiers are understandable for English native speakers.
>
> I'm not an English native speaker. And there's more than a subtle
> distinction between "not garantying" and "encouraging".

I agree with Bruno and the many others who have expressed disapproval
for this idea -- and I am not an English native speaker, either (and
neither, it seems to me, are many others who dislike this PEP). The
mild pleasure of using accented letters in code "addressed strictly to
Italian-speaking audiences and never intended to be of any use to
anybody not speaking Italian" (should I ever desire to write such code)
pales in comparison with the disadvantages, many of which have already
been analyzed or at least mentioned.

Homoglyphic characters _introduced by accident_ should not be discounted
as a risk, as, it seems to me, was done early in this thread after the
issue had been mentioned. In the past, I have erroneously
introduced such homoglyphs into a document I was preparing with
a word processor, through a slight error in the use of the system-provided
way of inserting characters not present on the keyboard; I found out
when later I went looking for the name I _thought_ I had input (but I
was looking for it spelled with the "right" glyph, not the one I had
actually used, which looked just the same) and just could not find it.

On that occasion, suspecting I had mistyped in some way or other, I
patiently tried looking for "pieces" of the word in question, eventually
locating it with just a mild amount of aggravation when I finally tried
a piece without the offending character. But when something similar
happens to somebody using a sufficiently fancy text editor to input
source in a programming language allowing arbitrary Unicode letters in
identifiers, the damage (the sheer waste of developer time) can be much
more substantial -- there will be two separate identifiers around, both
looking exactly like each other but actually distinct, and unbounded
amount of programmer time can be spent chasing after this extremely
elusive and tricky bug -- why doesn't a rebinding appear to "take", etc.
With some copy-and-paste during development and attempts at debugging,
several copies of each distinct version of the identifier can be spread
around the code, further hampering attempts at understanding.


Alex

Alan Franzoni

unread,
May 13, 2007, 6:41:01 PM5/13/07
to
Il Sun, 13 May 2007 17:44:39 +0200, "Martin v. Löwis" ha scritto:

[cut]

I'm from Italy, and I can say that some of Martin v. Löwis's points are
quite right. It's pretty easy to see code that uses "English" identifiers
and comments, but they're not really English - many times, they're just
"Englishized" versions of the Italian word. They might lure a real English
reader into an error rather than help him understand what the name really
stands for. It would be better to let the programmer pick the language he
or she prefers, without restrictions.

The patch problem doesn't seem a real issue to me, because it's the project
admin who picks the encoding, and he could easily refuse any
patch that doesn't conform to the standards he wants.

BTW, there are a couple of issues that should be solved. Even though I could
do with iso-8859-1, I usually pick utf-8 as the preferred encoding for my
files on various platforms, because I found it more portable and more compatible
with different editors and IDEs (I don't know if I just hit bugs in some specific
software, but I had problems with accented characters when switching
environments from Windows to Linux, especially when reading/writing to and from
non-native filesystems, e.g. reading files from an NTFS disk under Linux, or
reading an ext2 volume from Windows).

By the way, I would highly dislike anybody submitting a patch that contains
identifiers other than ASCII or iso-8859-1. Hence, I think there should be
a way, a kind of directive or something like that, to constrain the identifier
charset to a subset of the global one.

Also, there should be a way to convert source files in any 'exotic'
encoding to a pseudo-intelligible encoding for any reader - a kind of
transliteration (is that a proper English word?) system out of the box, not
requiring any tool that's not included in the Python distro. This
would let people retain their usual working environments even when
they're dealing with source code whose identifiers are in a really different
charset.

--
Alan Franzoni <alan.fra...@gmail.com>
-
Togli .xyz dalla mia email per contattarmi.
Remove .xyz from my address in order to contact me.
-
GPG Key Fingerprint (Key ID = FE068F3E):
5C77 9DC3 BD5B 3A28 E7BC 921A 0255 42AA FE06 8F3E

Alexander Schmolck

unread,
May 13, 2007, 6:46:31 PM5/13/07
to
"Martin v. Löwis" <mar...@v.loewis.de> writes:

> PEP 1 specifies that PEP authors need to collect feedback from the
> community. As the author of PEP 3131, I'd like to encourage comments
> to the PEP included below, either here (comp.lang.python), or to

> pytho...@python.org


>
> In summary, this PEP proposes to allow non-ASCII letters as
> identifiers in Python. If the PEP is accepted, the following
> identifiers would also become valid as class, function, or
> variable names: Löffelstiel, changé, ошибка, or 売り場
> (hoping that the latter one means "counter").
>
> I believe this PEP differs from other Py3k PEPs in that it really
> requires feedback from people with different cultural background
> to evaluate it fully - most other PEPs are culture-neutral.
>

> So, please provide feedback, e.g. perhaps by answering these
> questions:
> - should non-ASCII identifiers be supported?

Yes.

> why?

Because not everyone speaks English, not all languages can be losslessly
transliterated to ASCII, and because it's unreasonable to drastically restrict the
domain of things that can be conveniently expressed in a language that's also
targeted at a non-professional programmer audience.

I'm also not aware of any horror stories from languages which do already allow
unicode identifiers.

> - would you use them if it was possible to do so?

Possibly.

> in what cases?

Maybe mathematical code (greek letters) or code that is very culture and
domain specific (say code doing Japanese tax forms).

'as

Anders J. Munch

unread,
May 13, 2007, 6:53:01 PM5/13/07
to
Michael Torrie wrote:
>
> So given that people can already transliterate their language for use as
> identifiers, I think avoiding non-ASCII character sets is a good idea.

Transliteration makes people choose bad variable names, I see it all the time
with Danish programmers. Say e.g. the most descriptive name for a process is
"kør forlæns" (run forward). But "koer_forlaens" is ugly, so instead he'll
write "run_fremad", combining an English word with a slightly less appropriate
Danish word. Sprinkle in some English spelling errors and badly-chosen English
words, and you have the sorry state of the art that is today.

- Anders

Steven D'Aprano

unread,
May 13, 2007, 7:35:19 PM5/13/07
to
On Sun, 13 May 2007 15:35:15 -0700, Alex Martelli wrote:

> Homoglyphic characters _introduced by accident_ should not be discounted
> as a risk

...


> But when something similar
> happens to somebody using a sufficiently fancy text editor to input
> source in a programming language allowing arbitrary Unicode letters in
> identifiers, the damage (the sheer waste of developer time) can be much
> more substantial -- there will be two separate identifiers around, both
> looking exactly like each other but actually distinct, and unbounded
> amount of programmer time can be spent chasing after this extremely
> elusive and tricky bug -- why doesn't a rebinding appear to "take", etc.
> With some copy-and-paste during development and attempts at debugging,
> several copies of each distinct version of the identifier can be spread
> around the code, further hampering attempts at understanding.


How is that different from misreading "disk_burnt = True" as "disk_bumt =
True"? In the right (or perhaps wrong) font, like the ever-popular Arial,
the two can be visually indistinguishable. Or "call" versus "cal1"?

Surely the correct solution is something like pylint or pychecker? Or
banning the use of lower-case L and digit 1 in identifiers. I'm good with
both.


--
Steven.

Anders J. Munch

unread,
May 13, 2007, 7:25:04 PM5/13/07
to
Alex Martelli wrote:
>
> Homoglyphic characters _introduced by accident_ should not be discounted
> as a risk, as, it seems to me, was done early in this thread after the
> issue had been mentioned. In the past, it has happened to me to
> erroneously introduce such homoglyphs in a document I was preparing with
> a word processor, by a slight error in the use of the system- provided
> way for inserting characters not present on the keyboard; I found out
> when later I went looking for the name I _thought_ I had input (but I
> was looking for it spelled with the "right" glyph, not the one I had
> actually used which looked just the same) and just could not find it.

There's any number of things to be done about that.
1. # -*- encoding: ascii -*-
(I'd like to see you sneak those homoglyphic characters past *that*.)
2. pychecker and pylint - I'm sure you realise what they could do for you.
3. Use a font that doesn't have those characters or deliberately makes them
distinct (that could help web browsing safety too).

I'm not discounting the problem, I just don't believe it's a big one. Can we
choose a codepoint subset that doesn't have these dupes?
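Option 1 in the list above is mechanically enforceable: in modern Python, compile() applied to raw bytes honours the coding declaration, so an ASCII declaration makes the compiler itself reject any non-ASCII byte. A hedged sketch, with the helper name and the sample sources invented:

```python
def accepts(source_bytes):
    """True if the byte string compiles under its own coding declaration."""
    try:
        compile(source_bytes, "<patch>", "exec")
        return True
    except (SyntaxError, UnicodeDecodeError, ValueError):
        # the ascii codec cannot decode non-ASCII bytes, so compilation fails
        return False

print(accepts(b"# -*- coding: ascii -*-\nx = 1\n"))              # → True
print(accepts(b"# -*- coding: ascii -*-\ngr\xc3\xb6sse = 1\n"))  # → False
```

So a project that wants an ASCII-only policy can get the interpreter to police it, with no extra tooling.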

- Anders

Paul Rubin

unread,
May 13, 2007, 7:29:02 PM5/13/07
to
Alexander Schmolck <a.sch...@gmail.com> writes:
> Plenty of programming languages already support unicode identifiers,

Could you name a few? Thanks.

Steven D'Aprano

unread,
May 13, 2007, 7:45:56 PM5/13/07
to
On Sun, 13 May 2007 10:52:12 -0700, Paul Rubin wrote:

> "Martin v. Löwis" <mar...@v.loewis.de> writes:

>> This is a commonly-raised objection, but I don't understand why people
>> see it as a problem. The phishing issue surely won't apply, as you
>> normally don't "click" on identifiers, but rather type them. In a
>> phishing case, it is normally difficult to type the fake character
>> (because the phishing relies on you mistaking the character for another
>> one, so you would type the wrong identifier).
>
> It certainly does apply, if you're maintaining a program and someone
> submits a patch. In that case you neither click nor type the
> character. You'd normally just make sure the patched program passes
> the existing test suite, and examine the patch on the screen to make
> sure it looks reasonable. The phishing possibilities are obvious.

Not to me, I'm afraid. Can you explain how it works? A phisher might be
able to fool a casual reader, but how does he fool the compiler into
executing the wrong code?

As for project maintainers, surely a patch using some unexpected Unicode
locale would fail the "looks reasonable" test? That could even be
automated -- if the patch uses an unexpected "# -*- coding: blah" line, or
includes characters outside of a pre-defined range, ring alarm bells.
("Why is somebody patching my Turkish module in Korean?")
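The "characters outside of a pre-defined range" check could be sketched like this; the function name is invented, and taking the first word of a character's Unicode name is only a crude stand-in for a real script lookup:

```python
import unicodedata

def scripts_used(patch_text):
    """Crude per-character script guess for every non-ASCII character:
    the first word of its Unicode character name (e.g. CYRILLIC)."""
    # unicodedata.name() raises ValueError for unnamed codepoints;
    # a production checker would need to handle that case too.
    return {unicodedata.name(ch).split()[0]
            for ch in patch_text if not ch.isascii()}

print(scripts_used("ошибка = True\n"))    # → {'CYRILLIC'}
print(scripts_used("plain_ascii = 1\n"))  # → set()
```

A maintainer could then ring the alarm whenever the result contains anything outside the project's expected set of scripts.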

--
Steven

Marc 'BlackJack' Rintsch

unread,
May 13, 2007, 7:48:17 PM5/13/07
to
In <mailman.7627.1179086...@python.org>, Michael Torrie
wrote:

> I think non-ASCII characters makes the problem far far worse. While I
> may not understand what the function is by it's name in your example,
> allowing non-ASCII characters makes it works by forcing all would-be
> code readers have to have all kinds of necessary fonts just to view the
> source code. Things like reporting exceptions too. At least in your
> example I know the exception occurred in zieheDreiAbVon. But if that
> identifier is some UTF-8 string, how do I go about finding it in my text
> editor, or even reporting the message to the developers? I don't happen
> to have that particular keymap installed in my linux system, so I can't
> even type the letters!

You find it in the sources by the line number from the traceback, and the
letters can be copied and pasted if you don't know how to input them with your
keymap or keyboard layout.

Ciao,
Marc 'BlackJack' Rintsch

Aldo Cortesi

unread,
May 13, 2007, 7:42:13 PM5/13/07
to pytho...@python.org
Thus spake "Martin v. Löwis" (mar...@v.loewis.de):

> - should non-ASCII identifiers be supported? why?

No! I believe that:

- The security implications have not been sufficiently explored. I don't
want to be in a situation where I need to mechanically "clean" code (say,
from a submitted patch) with a tool because I can't reliably verify it by
eye. We should learn from the plethora of Unicode-related security
problems that have cropped up in the last few years.
- Non-ASCII identifiers would be a barrier to code exchange. If I know
Python I should be able to easily read any piece of code written in it,
regardless of the linguistic origin of the author. If PEP 3131 is
accepted, this will no longer be the case. A Python project that uses
Urdu identifiers throughout is just as useless to me, from a
code-exchange point of view, as one written in Perl.
- Unicode is harder to work with than ASCII in ways that are more important
in code than in human-language text. Human eyes don't care if two
visually indistinguishable characters are used interchangeably.
Interpreters do. There is no doubt that people will accidentally
introduce mistakes into their code because of this.


> - would you use them if it was possible to do so? in what cases?

No.


Regards,

Aldo

--
Aldo Cortesi
al...@nullcube.com
http://www.nullcube.com
Mob: 0419 492 863

Paul Rubin

unread,
May 13, 2007, 8:59:23 PM5/13/07
to
Steven D'Aprano <st...@REMOVE.THIS.cybersource.com.au> writes:
> > It certainly does apply, if you're maintaining a program and someone
> > submits a patch. In that case you neither click nor type the
> > character. You'd normally just make sure the patched program passes
> > the existing test suite, and examine the patch on the screen to make
> > sure it looks reasonable. The phishing possibilities are obvious.
>
> Not to me, I'm afraid. Can you explain how it works? A phisher might be
> able to fool a casual reader, but how does he fool the compiler into
> executing the wrong code?

The compiler wouldn't execute the wrong code; it would execute the code
that the phisher intended it to execute. That might be different from
what it looked like to the reviewer.

Terry Reedy

unread,
May 13, 2007, 10:12:30 PM5/13/07
to pytho...@python.org

"Alan Franzoni" <alan.franz...@geemail.invalid> wrote in message
news:1u9kz7l2gcz1p.1...@40tude.net...

On Sun, 13 May 2007 17:44:39 +0200, "Martin v. Löwis" wrote:
| Also, there should be a way to convert source files in any 'exotic'
encoding to a pseudo-intelligible encoding for any reader - a kind of
transliteration (is that a proper English word?) system out of the box, not
requiring any tool that's not included in the Python distro. This
would let people retain their usual working environments even when
they're dealing with source code whose identifiers are in a really different
charset.
=============================

When I proposed that PEP3131 include transliteration support, Martin
rejected the idea.

tjr

Neil Hodgson

unread,
May 13, 2007, 10:37:15 PM5/13/07
to
Paul Rubin wrote:
>> Plenty of programming languages already support unicode identifiers,
>
> Could you name a few? Thanks.

C#, Java, Ecmascript, Visual Basic.

Neil

Steven D'Aprano

unread,
May 13, 2007, 10:41:41 PM5/13/07
to
On Mon, 14 May 2007 09:42:13 +1000, Aldo Cortesi wrote:

> I don't
> want to be in a situation where I need to mechanically "clean"
> code (say, from a submitted patch) with a tool because I can't
> reliably verify it by eye.

But you can't reliably verify by eye. That's orders of magnitude more
difficult than debugging by eye, and we all know that you can't reliably
debug anything but the most trivial programs by eye.

If you're relying on cursory visual inspection to recognize harmful code,
you're already vulnerable to trojans.

> We should learn from the plethora of
> Unicode-related security problems that have cropped up in the last
> few years.

Of course we should. And one of the things we should learn is when and
how Unicode is a risk, and not imagine that Unicode is some sort of
mystical contamination that creates security problems just by being used.

> - Non-ASCII identifiers would be a barrier to code exchange. If I
> know
> Python I should be able to easily read any piece of code written
> in it, regardless of the linguistic origin of the author. If PEP
> 3131 is accepted, this will no longer be the case.

But it isn't the case now, so that's no different. Code exchange
regardless of human language is a nice principle, but it doesn't work in
practice. How do you use "any piece of code ... regardless of the
linguistic origin of the author" when you don't know what the functions
and classes and arguments _mean_?

Here's a tiny doc string from one of the functions in the standard
library, translated (more or less) to Portuguese. If you can't read
Portuguese at least well enough to get by, how could you possibly use
this function? What would you use it for? What does it do? What arguments
does it take?

def dirsorteinsercao(a, x, baixo=0, elevado=None):
    """da o artigo x insercao na lista a, e mantem-na a
    supondo classificado e classificado. Se x estiver ja em a,
    introduza-o a direita do x direita mais. Os args opcionais
    baixos (defeito 0) e elevados (len(a) do defeito) limitam
    a fatia de a a ser procurarado.
    """
# not a non-ASCII character in sight (unless I missed one...)

[Apologies to Portuguese speakers for the dog's breakfast I'm sure Babelfish
and I made of the translation.]

The particular function I chose is probably small enough and obvious
enough that you could work out what it does just by following the
algorithm. You might even be able to guess what it is, because Portuguese
is similar enough to other Latin languages that most people can guess
what some of the words might mean (elevados could be height, maybe?). Now
multiply this difficulty by a thousand for a non-trivial module with
multiple classes and dozens of methods and functions. And you might not
even know what language it is in.

No, code exchange regardless of natural language is a nice principle, but
it doesn't exist except in very special circumstances.

> A Python
> project that uses Urdu identifiers throughout is just as useless
> to me, from a code-exchange point of view, as one written in Perl.

That's because you can't read it, not because it uses Unicode. It could
be written entirely in ASCII, and still be unreadable and impossible to
understand.

> - Unicode is harder to work with than ASCII in ways that are more
> important
> in code than in human-language text. Human eyes don't care if two
> visually indistinguishable characters are used interchangeably.
> Interpreters do. There is no doubt that people will accidentally
> introduce mistakes into their code because of this.

That's no different from typos in ASCII. There's no doubt that we'll give
the same answer we've always given for this problem: unit tests, pylint
and pychecker.

--
Steven.

Steven D'Aprano

unread,
May 13, 2007, 10:46:17 PM5/13/07
to

How? Just repeating your original claim in more words doesn't explain a
thing.

It seems to me that your argument is, only slightly exaggerated, akin to
the following:

"Unicode identifiers are bad because phishers will no longer need to
write call_evil_func() but can write call_ƎvĬľ_func() instead."

Maybe I'm naive, but I don't see how giving phishers the ability to
insert a call to ƒunction() in some module is any more dangerous than
them inserting a call to function() instead.

If I'm mistaken, please explain why I'm mistaken, not just repeat your
claim in different words.


--
Steven.

Paul Rubin

unread,
May 13, 2007, 11:10:11 PM5/13/07
to
Neil Hodgson <nyamatong...@gmail.com> writes:
> >> Plenty of programming languages already support unicode identifiers,
> > Could you name a few? Thanks.
> C#, Java, Ecmascript, Visual Basic.

Java (and C#?) have mandatory declarations so homoglyphic identifiers aren't
nearly as bad a problem. Ecmascript is a horrible bug-prone language and
we want Python to move away from resembling it, not towards it. VB: well,
same as Ecmascript, I guess.

Paul Rubin

unread,
May 13, 2007, 11:12:23 PM5/13/07
to
Steven D'Aprano <ste...@REMOVE.THIS.cybersource.com.au> writes:
> If I'm mistaken, please explain why I'm mistaken, not just repeat your
> claim in different words.

if user_entered_password != stored_password_from_database:
    password_is_correct = False
...
if password_is_correct:
    log_user_in()

Does "password_is_correct" refer to the same variable in both places?
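[A minimal sketch of the confusion Paul is pointing at -- the identifiers here are hypothetical, but the mechanism is real: two names can render identically in many fonts while being distinct code-point sequences.]

```python
# Hedged illustration: visually identical names can be distinct strings.
# The lookalike is written with escapes so the difference survives any display.
import unicodedata

ascii_name = "password_is_correct"
# Looks the same, but the first two letters are Cyrillic ER and A:
lookalike = "\u0440\u0430ssword_is_correct"

print(ascii_name == lookalike)         # False: different code points
print(unicodedata.name(lookalike[0]))  # CYRILLIC SMALL LETTER ER
```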

Steven D'Aprano

unread,
May 13, 2007, 11:42:56 PM5/13/07
to

No way of telling without a detailed code inspection. Who knows what
happens in the ... ? If a black hat has access to the code, he could
insert anything he liked in there, ASCII or non-ASCII.

How is this a problem with non-ASCII identifiers? password_is_correct is
all ASCII. How can you justify saying that non-ASCII identifiers
introduce a security hole that already exists in all-ASCII Python?


--
Steven.

Paul Rubin

unread,
May 14, 2007, 12:21:57 AM5/14/07
to
Steven D'Aprano <ste...@REMOVE.THIS.cybersource.com.au> writes:
> password_is_correct is all ASCII.

How do you know that? What steps did you take to ascertain it? Those
are steps you currently don't have to bother with.

John Nagle

unread,
May 14, 2007, 12:35:22 AM5/14/07
to

That's the first substantive objection I've seen. In a language
without declarations, trouble is more likely. Consider the maintenance
programmer who sees a variable name and retypes it elsewhere, not realizing
the glyphs are different even though they look the same. In a language
with declarations, that generates a compile-time error. In Python, it
doesn't.

John Nagle

Aldo Cortesi

unread,
May 14, 2007, 1:19:36 AM5/14/07
to pytho...@python.org
Thus spake Steven D'Aprano (ste...@REMOVE.THIS.cybersource.com.au):

> If you're relying on cursory visual inspection to recognize harmful code,
> you're already vulnerable to trojans.

What a daft thing to say. How do YOU recognize harmful code in a patch
submission? Perhaps you blindly apply patches, and then run your test suite on
a quarantined system, with an instrumented operating system to allow you to
trace process execution, and then perform a few weeks worth of analysis on the
data?

Me, I try to understand a patch by reading it. Call me old-fashioned.


> Code exchange regardless of human language is a nice principle, but it
> doesn't work in practice.

And this is clearly bunk. I have come across code with transliterated
identifiers and comments in a different language, and while understanding
was hampered, it wasn't impossible.


> That's no different from typos in ASCII. There's no doubt that we'll give
> the same answer we've always given for this problem: unit tests, pylint
> and pychecker.

A typo that can't be detected visually is a fundamentally different problem
from an ASCII typo, as many people in this thread have pointed out.

"Martin v. Löwis"

unread,
May 14, 2007, 1:45:07 AM5/14/07
to André
> It should be noted that the Python community may use other forums, in
> other languages. They would likely be a lot more enthusiastic about
> this PEP than the usual crowd here (comp.lang.python).

Please spread the news.

Martin

Alex Martelli

unread,
May 14, 2007, 2:00:16 AM5/14/07
to
Steven D'Aprano <st...@REMOVE.THIS.cybersource.com.au> wrote:

> automated -- if the patch uses an unexpected "#-*- coding: blah" line, or

No need -- a separate PEP (also by Martin) makes UTF-8 the default
encoding, and UTF-8 can encode any Unicode character you like.


Alex

Alex Martelli

unread,
May 14, 2007, 2:00:17 AM5/14/07
to
Aldo Cortesi <al...@nullcube.com> wrote:

> Thus spake Steven D'Aprano (ste...@REMOVE.THIS.cybersource.com.au):
>
> > If you're relying on cursory visual inspection to recognize harmful code,
> > you're already vulnerable to trojans.
>
> What a daft thing to say. How do YOU recognize harmful code in a patch
> submission? Perhaps you blindly apply patches, and then run your test suite on
> a quarantined system, with an instrumented operating system to allow you to
> trace process execution, and then perform a few weeks worth of analysis on the
> data?
>
> Me, I try to understand a patch by reading it. Call me old-fashioned.

I concur, Aldo. Indeed, if I _can't_ be sure I understand a patch, I
don't accept it -- I ask the submitter to make it clearer.

Homoglyphs would ensure I could _never_ be sure I understand a patch,
without at least running it through some transliteration tool. I don't
think the world of open source needs this extra hurdle in its path.


Alex

Hendrik van Rooyen

unread,
May 14, 2007, 2:21:05 AM5/14/07
to pytho...@python.org

"Bruno Desthuilliers" <bd....q...hose@free.que..rt.fr> wrote:

>Martin v. Löwis wrote:


>> So, please provide feedback, e.g. perhaps by answering these
>> questions:

>> - should non-ASCII identifiers be supported?
>

>No.

Agreed - I also do not think it is a good idea

>
>> why?
>
>Because it will definitively make code-sharing impossible. Live with it
>or else, but CS is english-speaking, period. I just can't understand
>code with spanish or german (two languages I have notions of)
>identifiers, so let's not talk about other alphabets...
>

The understanding aside, it seems to me that the maintenance nightmare is
more irritating, as you are faced with stuff you can't type on your
keyboard without resorting to look-up tables and <alt> ... sequences.
And then you could still be wrong, as has been pointed out for capital
A and Greek alpha.

Then one should consider the effects of this on the whole issue of shared
open source python programs, as Bruno points out, before we argue that
I should not be "allowed" access to Greek, or French and German code
with umlauts and other diacritic marks, as someone else has done.

I think it is best to say nothing of Saint Cyril's script.

I think that to allow identifiers to be "native", while the rest of the
reserved words in the language remains ASCII English kind of
defeats the object of making the python language "language friendly".
It would need something like macros to enable the definition of
native language terms for things like "while", "for", "in", etc...

And we have been through the Macro thingy here, and the consensus
seemed to be that we don't want people to write their own dialects.

I think that the same arguments apply here.

>NB : I'm *not* a native english speaker, I do *not* live in an english
>speaking country, and my mother's language requires non-ascii encoding.
>And I don't have special sympathy for the USA. And yes, I do write my
>code - including comments - in english.
>

My case is similar, except that we are supposed to have eleven official
languages. - When my ancestors fought the English at Spion Kop*,
we could not even spell our names - and here I am defending the use of
this disease that masquerades as a language, in the interests of standardisation
of communication and ease of sharing and maintenance.

BTW - Afrikaans also has stuff like umlauts - my keyboard cannot type them
and I rarely miss it, because most of my communication is done in English.

- Hendrik

* Spion Kop is one of the few battles in history that went contrary to the
common usage whereby both sides claim victory. In this case, both sides
claimed defeat. "We have suffered a small reverse..." - Sir Redvers Buller,
who was known afterwards as Sir Reverse Buller, or the Ferryman of the
Tugela. To be fair, it was the first war with trenches in it, and nobody
knew how to handle them.

Jarek Zgoda

unread,
May 14, 2007, 3:39:54 AM5/14/07
to
Alexander Schmolck wrote:

>>> So, please provide feedback, e.g. perhaps by answering these
>>> questions:

>>> - should non-ASCII identifiers be supported? why?
>> No, because "programs must be written for people to read, and only
>> incidentally for machines to execute". Using anything other than "lowest
>> common denominator" (ASCII) will restrict accessibility of code. This is
>> not a literature, that requires qualified translators to get the text
>> from Hindi (or Persian, or Chinese, or Georgian, or...) to Polish.
>>
>> While I can read the code with Hebrew, Russian or Greek names
>> transliterated to ASCII, I would not be able to read such code in native.
>
> Who or what would force you to? Do you currently have to deal with hebrew,
> russian or greek names transliterated into ASCII? I don't and I suspect this
> whole panic about everyone suddenly having to deal with code written in kanji,
> klingon and hieroglyphs etc. is unfounded -- such code would drastically
> reduce its own "fitness" (much more so than the ASCII-transliterated chinese,
> hebrew and greek code I never seem to come across), so I think the chances
> that it will be thrust upon you (or anyone else in this thread) are minuscule.

I often must read code written by people using some kind of cyrillic
(Russians, Serbs, Bulgarians). "Native" names transliterated to ascii
are usual artifacts and I don't mind it.

> BTW, I'm not sure if you don't underestimate your own intellectual faculties
> if you think couldn't cope with greek or russian characters. On the other hand
> I wonder if you don't overestimate your ability to reasonably deal with code
> written in a completely foreign language, as long as it's ASCII -- for anything
> of nontrivial length, surely doing anything with such code would already be
> orders of magnitude harder?

While I don't have problems with some of non-latin character sets, such
as greek and cyrillic (I was attending school in time when learning
Russian was obligatory in Poland and later I learned Greek), there are a
plenty I wouldn't be able to read, such as Hebrew, Arabic or Persian.

--
Jarek Zgoda

"We read Knuth so you don't have to."

Marc 'BlackJack' Rintsch

unread,
May 14, 2007, 3:45:22 AM5/14/07
to

Haskell. AFAIK the Haskell Report says so, but the compilers didn't
support it last time I tried. :-)

Ciao,
Marc 'BlackJack' Rintsch

Neil Hodgson

unread,
May 14, 2007, 4:03:39 AM5/14/07
to
Martin v. Löwis:

> This PEP suggests to support non-ASCII letters (such as accented
> characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.

I support this to ease integration with other languages and
platforms that allow non-ASCII letters to be used in identifiers. Python
has a strong heritage as a glue language and this has been enabled by
adapting to the features of various environments rather than trying to
assert a Pythonic view of how things should work.

Neil

Eric Brunel

unread,
May 14, 2007, 4:38:28 AM5/14/07
to
On Sun, 13 May 2007 21:10:46 +0200, Stefan Behnel
<stefan.beh...@web.de> wrote:
[snip]
> Now, I am not a strong supporter (most public code will use English
> identifiers anyway)

How will you guarantee that? I'm quite convinced that most of the public
code today started its life as private code earlier...

> So, introducing non-ASCII identifiers is just a
> small step further. Disallowing this does *not* guarantee in any way that
> identifiers are understandable for English native speakers. It only
> guarantees
> that identifiers are always *typable* by people who have access to latin
> characters on their keyboard. A rather small advantage, I'd say.

I would certainly not qualify that as "rather small". There have been
quite a few times where I had to change some public code. If this code had
been written in a character set that did not exist on my keyboard, the
only possibility would have been to copy/paste every identifier I had to
type. Have you ever tried to do that? It's actually quite simple to test
it: just remove on your keyboard a quite frequent letter ('E' is a good
candidate), and try to update some code you have at hand. You'll see that
it takes 4 to 5 times longer than writing the code directly, because you
always have to switch between keyboard and mouse far too often. In
addition to the unnecessary movements, it also completely breaks your
concentration. Typing foreign words transliterated to english actually
does take longer than typing "proper" english words, but at least, it can
be done, and it's still faster than having to copy/paste everything.

So I'd say that it would be a major drawback for code sharing, which - if
I'm not mistaken - is the basis for the whole open-source philosophy.
--
python -c "print ''.join([chr(154 - ord(c)) for c in
'U(17zX(%,5.zmz5(17l8(%,5.Z*(93-965$l7+-'])"

Eric Brunel

unread,
May 14, 2007, 4:49:28 AM5/14/07
to
On Sun, 13 May 2007 23:55:11 +0200, Bruno Desthuilliers
<bdesth.qu...@free.quelquepart.fr> wrote:

> Martin v. Löwis wrote:

>> PEP 1 specifies that PEP authors need to collect feedback from the
>> community. As the author of PEP 3131, I'd like to encourage comments
>> to the PEP included below, either here (comp.lang.python), or to
>> pytho...@python.org
>> In summary, this PEP proposes to allow non-ASCII letters as
>> identifiers in Python. If the PEP is accepted, the following
>> identifiers would also become valid as class, function, or
>> variable names: Löffelstiel, changé, ошибка, or 売り場
>> (hoping that the latter one means "counter").
>> I believe this PEP differs from other Py3k PEPs in that it really
>> requires feedback from people with different cultural background
>> to evaluate it fully - most other PEPs are culture-neutral.


>> So, please provide feedback, e.g. perhaps by answering these
>> questions:
>> - should non-ASCII identifiers be supported?
>

> No.


>
>> why?
>
> Because it will definitively make code-sharing impossible. Live with it
> or else, but CS is english-speaking, period. I just can't understand
> code with spanish or german (two languages I have notions of)
> identifiers, so let's not talk about other alphabets...

+1 on everything.

> NB : I'm *not* a native english speaker, I do *not* live in an english
> speaking country,

... and so am I (and this happens to be the same country as Bruno's...)

> and my mother's language requires non-ascii encoding.

... and so does my wife's (she's Japanese).

> And I don't have special sympathy for the USA. And yes, I do write my
> code - including comments - in english.

Again, +1. Even when writing code that appears to be "private" at some
time, one *never* knows what will become of it in the future. If it ever
goes public, its chances to evolve - or just to be maintained - are far
bigger if it's written all in english.

Stefan Behnel

unread,
May 14, 2007, 4:54:21 AM5/14/07
to Eric Brunel
Eric Brunel wrote:
> Even when writing code that appears to be "private" at some
> time, one *never* knows what will become of it in the future. If it ever
> goes public, its chances to evolve - or just to be maintained - are far
> bigger if it's written all in english.
>
> --python -c "print ''.join([chr(154 - ord(c)) for c in
> 'U(17zX(%,5.zmz5(17l8(%,5.Z*(93-965$l7+-'])"

Oh well, why did *that* code ever go public?

Stefan

Stefan Behnel

unread,
May 14, 2007, 5:00:29 AM5/14/07
to Eric Brunel
Eric Brunel wrote:
> On Sun, 13 May 2007 21:10:46 +0200, Stefan Behnel
> <stefan.beh...@web.de> wrote:
> [snip]
>> Now, I am not a strong supporter (most public code will use English
>> identifiers anyway)
>
> How will you guarantee that? I'm quite convinced that most of the public
> code today started its life as private code earlier...

Ok, so we're back to my original example: the problem here is not the
non-ASCII encoding but the non-english identifiers.

If we move the problem to a pure unicode naming problem:

How likely is it that it's *you* (lacking a native, say, kanji keyboard) who
ends up with code that uses identifiers written in kanji? And that you are the
only person who is now left to do the switch to an ASCII transliteration?

Any chance there are still kanji-enabled programmes around that were not hit
by the bomb in this scenario? They might still be able to help you get the
code "public".

Stefan

Stefan Behnel

unread,
May 14, 2007, 5:04:36 AM5/14/07
to Alex Martelli
Alex Martelli wrote:

But then, where's the problem? Just stick to accepting only patches that are
plain ASCII *for your particular project*. And if you want to be sure, put an
ASCII encoding header in all source files (which you want to do anyway, to
prevent the same problem with string constants).

The PEP is only arguing to support this decision at a per-project level rather
than forbidding it at the language level. This makes sense as it moves the
power into the hands of those people who actually use it, not those who
designed the language.

Stefan

Stefan Behnel

unread,
May 14, 2007, 5:18:07 AM5/14/07
to Bruno Desthuilliers
Bruno Desthuilliers wrote:
> but CS is english-speaking, period.

That's a wrong assumption. I understand that people can have this impression
when they deal a lot with Open Source code, but I've seen a lot of places
where code was produced that was not written to become publicly available (and
believe me, it *never* will become Open Source). And the projects made strong
use of identifiers with domain specific names. And believe me, those are best
expressed in a language your client knows and expresses concepts in. And this
is definitely not the language you claim to be the only language in CS.

Stefan

Anton Vredegoor

unread,
May 14, 2007, 5:29:55 AM5/14/07
to
In article <vJU1i.37796$M.3...@news-server.bigpond.net.au>,
nyamatong...@gmail.com says...
Ouch! Now I seem to be disagreeing with the one who writes my editor.
What will become of me now?

A.

Nick Craig-Wood

unread,
May 14, 2007, 5:30:03 AM5/14/07
to
Martin v. Löwis <mar...@v.loewis.de> wrote:
> So, please provide feedback, e.g. perhaps by answering these
> questions:

Firstly on the PEP itself:

It defines characters that would be allowed. However not being up to
speed on unicode jargon I don't have a clear idea about which
characters those are. A page with some examples or even all possible
allowed characters would be great, plus some examples of disallowed
characters.

> - should non-ASCII identifiers be supported? why?

Only if PEP 8 was amended to state that ASCII characters only should
be used for publically released / library code. I'm quite happy with
Unicode in comments / docstrings (but that is supported already).

> - would you use them if it was possible to do so? in what cases?

My initial reaction is that it would be cool to use all those great
symbols. A variable called OHM etc! However on reflection I think it
would be a step back for the easy to read nature of python.

My worries are :-

a) English speaking people would invent their own dialects of python
which looked like APL with all those nice Unicode mathematical
operators / Greek letters you could use as variable/function names. I
like the symbol free nature of python which makes for easy
comprehension of code and don't want to see it degenerate.

b) Unicode characters would creep into the public interface of public
libraries. I think this would be a step back for the homogeneous
nature of the python community.

c) the python keywords are in ASCII/English. I hope you weren't
thinking of changing them?

...

In summary, I'm not particularly keen on the idea; though it might be
all right in private. Unicode identifiers are allowed in java though,
so maybe I'm worrying too much ;-)

--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick

Eric Brunel

unread,
May 14, 2007, 6:02:27 AM5/14/07
to
On Mon, 14 May 2007 11:00:29 +0200, Stefan Behnel
<stefan.beh...@web.de> wrote:

> Eric Brunel wrote:
>> On Sun, 13 May 2007 21:10:46 +0200, Stefan Behnel
>> <stefan.beh...@web.de> wrote:
>> [snip]
>>> Now, I am not a strong supporter (most public code will use English
>>> identifiers anyway)
>>
>> How will you guarantee that? I'm quite convinced that most of the public
>> code today started its life as private code earlier...
>
> Ok, so we're back to my original example: the problem here is not the
> non-ASCII encoding but the non-english identifiers.

As I said in the rest of my post, I do recognize that there is a problem
with non-english identifiers. I only think that allowing these identifiers
to use a non-ASCII encoding will make things worse, and so should be
avoided.

> If we move the problem to a pure unicode naming problem:
>
> How likely is it that it's *you* (lacking a native, say, kanji keyboard)
> who
> ends up with code that uses identifiers written in kanji? And that you
> are the
> only person who is now left to do the switch to an ASCII transliteration?
>
> Any chance there are still kanji-enabled programmes around that were not
> hit
> by the bomb in this scenario? They might still be able to help you get
> the
> code "public".

Contrary to what one might think seeing the great achievements of
open-source software, people willing to maintain public code and/or make
it evolve seem to be quite rare. If you add burdens on such people - such
as being able to read and write the language of the original code writer,
or forcing them to request a translation or transliteration from someone
else -, the chances are that they will become even rarer...

Stefan Behnel

unread,
May 14, 2007, 6:17:36 AM5/14/07
to Eric Brunel
Eric Brunel wrote:
> On Mon, 14 May 2007 11:00:29 +0200, Stefan Behnel
>> Any chance there are still kanji-enabled programmes around that were
>> not hit
>> by the bomb in this scenario? They might still be able to help you get
>> the
>> code "public".
>
> Contrarily to what one might think seeing the great achievements of
> open-source software, people willing to maintain public code and/or make
> it evolve seem to be quite rare. If you add burdens on such people -
> such as being able to read and write the language of the original code
> writer, or forcing them to request a translation or transliteration from
> someone else -, the chances are that they will become even rarer...

Ok, but then maybe that code just will not become Open Source. There's a
million reasons code cannot be made Open Source, licensing being one, lack of
resources being another, bad implementation and lack of documentation being
important also.

But that won't change by keeping Unicode characters out of source code.

Now that we're at it, badly named english identifiers chosen by non-english
native speakers, for example, are a sure way to keep people from understanding
the code and thus from being able to contribute resources.

I'm far from saying that all code should start using non-ASCII characters.
There are *very* good reasons why a lot of projects are well off with ASCII
and should obey the good advice of sticking to plain ASCII. But those are
mainly projects that are developed in English and use English documentation,
so there is not much of a risk to stumble into problems anyway.

I'm only saying that this shouldn't be a language restriction, as there
definitely *are* projects (I know some for my part) that can benefit from the
clarity of native language identifiers (just like English speaking projects
benefit from the English language). And yes, this includes spelling native
language identifiers in the native way to make them easy to read and fast to
grasp for those who maintain the code.

It should at least be an available option to use this feature.

Stefan

Neil Hodgson

unread,
May 14, 2007, 6:26:58 AM5/14/07
to
Anton Vredegoor:

> Ouch! Now I seem to be disagreeing with the one who writes my editor.
> What will become of me now?

It should be OK. I try to keep my anger under control and not cut
off the pixel supply at the first stirrings of dissent.

It may be an idea to provide some more help for multilingual text
such as allowing ranges of characters to be represented as hex escapes
or character names automatically. Then someone who only normally uses
ASCII can more easily audit patches that could contain non-ASCII characters.

Neil
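[The kind of audit Neil describes can already be approximated with the standard library's tokenize module. A rough sketch -- the helper name is mine, not from the thread:]

```python
# Sketch: flag every identifier in a patch that contains non-ASCII
# characters, rendering those characters as hex escapes so an
# ASCII-only reviewer can see exactly what is there.
import io
import tokenize

def flag_non_ascii_names(source: str) -> None:
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type == tokenize.NAME and not tok.string.isascii():
            escaped = "".join(
                ch if ch.isascii() else f"\\u{ord(ch):04x}" for ch in tok.string
            )
            print(f"line {tok.start[0]}: {tok.string!r} -> {escaped}")

flag_non_ascii_names("größe = 42\nsize = größe\n")
# line 1: 'größe' -> gr\u00f6\u00dfe
# line 2: 'größe' -> gr\u00f6\u00dfe
```

A real tool would also want to cover string literals and comments, but the NAME tokens are the part relevant to this PEP.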

Marc 'BlackJack' Rintsch

unread,
May 14, 2007, 6:49:04 AM5/14/07
to
In <slrnf4gaf...@irishsea.home.craig-wood.com>, Nick Craig-Wood
wrote:

> My initial reaction is that it would be cool to use all those great
> symbols. A variable called OHM etc!

This is a nice candidate for homoglyph confusion. There's the Greek
letter omega (U+03A9) Ω and the SI unit symbol (U+2126) Ω, and I think
some omegas in the mathematical symbols area too.

Ciao,
Marc 'BlackJack' Rintsch
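[For what it's worth, this particular pair is one the PEP's proposed NFKC normalization would fold together: OHM SIGN normalizes to GREEK CAPITAL LETTER OMEGA, so under the PEP both spellings would name the same variable. A sketch:]

```python
# OHM SIGN (U+2126) and GREEK CAPITAL LETTER OMEGA (U+03A9) are distinct
# code points, but NFKC normalization -- which PEP 3131 proposes to apply
# to identifiers -- maps the former to the latter.  Cross-script
# lookalikes (e.g. Latin vs. Cyrillic) are NOT folded this way.
import unicodedata

ohm = "\u2126"
omega = "\u03a9"
print(ohm == omega)                                  # False as raw strings
print(unicodedata.normalize("NFKC", ohm) == omega)   # True after NFKC
```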

Marco Colombo

unread,
May 14, 2007, 7:42:21 AM5/14/07
to
I suggest we keep focused on the main issue here, which is "should non-
ascii identifiers be allowed, given that we already allow non-ascii
string literals and comments?"

Most arguments against this proposal really fall into the category
"ascii-only source files". If you want to promote code-sharing, then
you should enforce quite restrictive policies:
- 7-bit only source files, so that everyone is able to correctly
display and _print_ them (somehow I feel that printing foreign glyphs
can be harder than displaying them) ;
- English-only, readable comments _and_ identifiers (if you think of
it, it's really the same issue, readability... I know no Coding Style
that requires good commenting but allows meaningless identifiers).

Now, why in the first place one should be allowed to violate those
policies? One reason is freedom. Let me write my code the way I like
it, and don't force me writing it the way you like it (unless it's
supposed to be part of _your_ project, then have me follow _your_
style).

Another reason is that readability is quite a relative term...
comments that won't make any sense in a real world program, may be
appropriate in a 'getting started with' guide example:

# this is another way to increment variable 'a'
a += 1

we know a comment like that is totally useless (and thus harmful) to
any programmer (makes me think "thanks, but i knew that already"), but
it's perfectly appropriate if you're introducing that += operator for
the first time to a newbie.

You could even say that most string literals are best made English-
only:

print "Ciao Mondo!"

it's better written:

print _("Hello World!")

or with any other means to allow the i18n of the output. The Italian
version should be implemented with a .po file or whatever.

Yet, we support non-ascii encodings for source files. That's in order
to give authors more freedom. And freedom comes at a price, of course,
as non-ascii string literals, comments and identifiers are all harmful
to some extents and in some contexts.

What I fail to see is a context in which it makes sense to allow non-
ascii literals and non-ascii comments but _not_ non-ascii identifiers.
Or a context in which it makes sense to rule out non-ascii identifiers
but not string literals and comments. E.g. would you accept a patch
with comments you don't understand (or even that you are not able to
display correctly)? How can you make sure the patch is correct, if you
can't read and understand the string literals it adds?

My point being that most public open source projects already have
plenty of good reasons to enforce an English-only, ascii-only policy
on source files. I don't think that allowing non-ascii indentifiers at
language level would hinder thier ability to enforce such a policy
more than allowing non-ascii comments or literals did.

OTOH, I won't be able to contribute much to a project that already
uses, say, Chinese for comments and strings. Even if I manage to
display the source code correctly here, still I won't understand much
of it. So I'm not losing much by allowing them to use Chinese for
identifiers too.
And whether it was a mistake on their part not to choose an "English
only, ascii only" policy it's their call, not ours, IMHO.

.TM.

Alexander Schmolck

unread,
May 14, 2007, 7:43:07 AM5/14/07
to
Neil Hodgson <nyamatong...@gmail.com> writes:

> Paul Rubin wrote:
>>> Plenty of programming languages already support unicode identifiers,
>>
>> Could you name a few? Thanks.
>

> C#, Java, Ecmascript, Visual Basic.

(i.e. everything that isn't a legacy or niche language)

scheme (major implementations such as PLT and the upcoming standard), the most
popular common lisp implementations, haskell[1], fortress[2], perl 6 and I should
imagine (but haven't checked) all new java or .NET based languages (F#,
IronPython, JavaFX, Groovy, etc.) as well -- the same goes for XML-based
languages.

(i.e. everything that's up and coming, too)

So as Neil said, I don't think keeping python ASCII and interoperable is an
option. I don't happen to think the anti-unicode arguments that have been
advanced so far are terribly convincing, but even if they were it
wouldn't matter much -- the ability to function as a painless glue language
has always been absolutely vital for python.

cheers

'as

Footnotes:
[1] <http://hackage.haskell.org/trac/haskell-prime/wiki/UnicodeInHaskellSource>

[2] <http://research.sun.com/projects/plrg/fortress.pdf>

[3] Although I do agree that mechanisms to avoid spoofing and similar
problems (what normalization scheme and constraints unicode identifiers
should be subjected to) merit careful discussion.

Laurent Pointal

unread,
May 14, 2007, 7:45:12 AM5/14/07
to
Martin v. Löwis wrote:
> PEP 1 specifies that PEP authors need to collect feedback from the
> community. As the author of PEP 3131, I'd like to encourage comments
> to the PEP included below, either here (comp.lang.python), or to
> pytho...@python.org
>
> In summary, this PEP proposes to allow non-ASCII letters as
> identifiers in Python. If the PEP is accepted, the following
> identifiers would also become valid as class, function, or
> variable names: Löffelstiel, changé, ошибка, or 売り場
> (hoping that the latter one means "counter").
>
> I believe this PEP differs from other Py3k PEPs in that it really
> requires feedback from people with different cultural background
> to evaluate it fully - most other PEPs are culture-neutral.
>
> So, please provide feedback, e.g. perhaps by answering these
> questions:
> - should non-ASCII identifiers be supported? why?
> - would you use them if it was possible to do so? in what cases?

I strongly prefer to stay with current standard limited ascii for
identifiers.

Ideally, it would be agreeable to have variables like greek letters for
some scientific vars, for french people using éèçà in names...

But... (I join common obections):

* where are they on my keyboard, how can I type them?
(I can see french éèçà, but a us-layout keyboard doesn't know them; imagine
kanji or greek)

* how do I spell this cyrillic/kanji char?

* when there are very similar chars, how can I distinguish them?
(even without considering chars that share the same representation but
have different unicode names)

* are the variables "amédé" and "amede" the same?

* it's an anti-KISS rule.

* I not only write code, I read it too, and having such variation
possible in names makes code really much less readable.
(unless I learn other script representations - maybe not a bad thing in
itself, but it's not the objective here).

* I've read "Restricting the language to ASCII-only identifiers does
not enforce comments and documentation to be English, or the identifiers
actually to be English words, so an additional policy is necessary,
anyway."
But even with comments in german or spanish or japanese, I can manage to
identify what a (well written) piece of code is doing with its data. It
would be very difficult with unicode-spanning identifiers.


==> I wouldn't use them.


So, keep ascii only.
Basic ascii is the lowest common denominator, known and available
everywhere; it's known by all developers, who can identify these chars
correctly (though 1 vs I or O vs 0 can cause problems with bad fonts).


Maybe make the default file encoding UTF-8 and make strings Unicode
by default (with, for example, an s"" prefix for byte strings), but that
is another problem.


L.Pointal.

Stefan Behnel

unread,
May 14, 2007, 7:49:41 AM5/14/07
to Marco Colombo

Very well written.

+1

Stefan

Duncan Booth

unread,
May 14, 2007, 8:24:49 AM5/14/07
to
Alexander Schmolck <a.sch...@gmail.com> wrote:

> scheme (major implementations such as PLT and the upcoming standard),
> the most popular common lisp implementations, haskell[1], fortress[2],
> perl 6 and I should imagine (but haven't checked) all new java or .NET
> based languages (F#, IronPython, JavaFX, Groovy, etc.) as well -- the
> same goes for XML-based languages.
>

Just to confirm that: IronPython does accept non-ASCII identifiers. From
"Differences between IronPython and CPython":

> IronPython will compile files whose identifiers use non-ASCII
> characters if the file has an encoding comment such as "# -*- coding:
> utf-8 -*-". CPython will not compile such a file in any case.
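To make that concrete, here is a sketch of a file that IronPython will compile
today and that the PEP would make legal in CPython. The French identifier
names are invented for the example; on a pre-PEP CPython, compile() raises a
SyntaxError instead:

```python
# Sketch of a source file with non-ASCII identifiers and a coding cookie.
source = (
    "# -*- coding: utf-8 -*-\n"
    "chang\u00e9 = 42\n"
    "def v\u00e9rifier(valeur):\n"
    "    return valeur * 2\n"
)

namespace = {}
# On an interpreter that accepts the PEP, this compiles and runs:
exec(compile(source, "<exemple>", "exec"), namespace)
print(namespace["v\u00e9rifier"](namespace["chang\u00e9"]))  # 84
```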

Stefan Behnel

unread,
May 14, 2007, 8:33:31 AM5/14/07
to duncan...@suttoncourtenay.org.uk

Sounds like CPython would do better to follow IronPython here.

Stefan

Paul McGuire

unread,
May 14, 2007, 9:17:58 AM5/14/07
to
On May 14, 4:30 am, Nick Craig-Wood <n...@craig-wood.com> wrote:
>
> A variable called OHM etc!
> --
> Nick Craig-Wood <n...@craig-wood.com> --http://www.craig-wood.com/nick

Then can 'lambda' -> 'λ' be far behind? (I know this is a keyword
issue, not covered by this PEP, but I also sense that the 'lambda'
keyword has always been ranklesome.)

In my own personal English-only experience, I've thought that it would
be helpful to the adoption of pyparsing if I could distribute class
name translations, since so much of my design goal of pyparsing is
that it be somewhat readable as in:

integer = Word(nums)

which reads as 'an integer is a word composed of numeric digits'.

By distributing a translation file, such as:

Palabra = Word
Grupo = Group
etc.

a Spanish-speaker could write their own parser using:

numero = Palabra(nums)

and this would still pass the "fairly easy-to-read" test, for that
user. While my examples don't use any non-ASCII characters, I'm sure
the issue would come up fairly quickly.
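The translation-file idea can be sketched in plain Python, since aliasing is
just name rebinding. A toy stand-in class is used below instead of the real
pyparsing Word, and the Spanish aliases are invented, not an actual pyparsing
feature:

```python
# Toy stand-in for pyparsing's Word, just to show the mechanism.
class Word:
    """Collects the characters of a text that belong to a given set."""
    def __init__(self, chars):
        self.chars = set(chars)

    def parse(self, text):
        return "".join(c for c in text if c in self.chars)

nums = "0123456789"

# The distributed "translation file" would be nothing more than rebinding:
Palabra = Word

# A Spanish-speaking user then writes:
numero = Palabra(nums)
print(numero.parse("tel: 123"))  # 123
```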

As to the responder who suggested not mixing ASCII/Latin with, say,
Hebrew in any given identifier, this is not always possible. On a
business trip to Israel, I learned that there are many terms that have
no Hebrew equivalents, and so Hebrew technical literature is
sprinkled with English terms in Latin characters. This is especially
interesting to watch being typed on a terminal, as the Hebrew
characters are written on the screen right-to-left, and then an
English word is typed by switching the editor to left-to-right mode.
The cursor remains in the same position and the typed Latin characters
push out to the left as they are typed. Then typing in right-to-left
mode is resumed, just to the left of the Latin characters just
entered.

-- Paul

Duncan Booth

unread,
May 14, 2007, 9:44:04 AM5/14/07
to
Stefan Behnel <stefan.beh...@web.de> wrote:

>> Just to confirm that: IronPython does accept non-ascii identifiers.
>> From "Differences between IronPython and CPython":
>>
>>> IronPython will compile files whose identifiers use non-ASCII
>>> characters if the file has an encoding comment such as "# -*-
>>> coding: utf-8 -*-". CPython will not compile such a file in any
>>> case.
>
> Sounds like CPython would better follow IronPython here.

I cannot find any documentation which says exactly which non-ASCII
characters IronPython will accept.
I would guess that it probably follows C# in general, but it doesn't
follow C# identifier syntax exactly (in particular the leading @ to
quote keywords is not supported).

The C# identifier syntax, from
http://msdn2.microsoft.com/en-us/library/aa664670(VS.71).aspx, differs
from the PEP, I think, only in also allowing the Cf class of characters:

identifier:
    available-identifier
    @ identifier-or-keyword
available-identifier:
    An identifier-or-keyword that is not a keyword
identifier-or-keyword:
    identifier-start-character identifier-part-characters(opt)
identifier-start-character:
    letter-character
    _ (the underscore character U+005F)
identifier-part-characters:
    identifier-part-character
    identifier-part-characters identifier-part-character
identifier-part-character:
    letter-character
    decimal-digit-character
    connecting-character
    combining-character
    formatting-character
letter-character:
    A Unicode character of classes Lu, Ll, Lt, Lm, Lo, or Nl
    A unicode-escape-sequence representing a character of classes Lu, Ll, Lt, Lm, Lo, or Nl
combining-character:
    A Unicode character of classes Mn or Mc
    A unicode-escape-sequence representing a character of classes Mn or Mc
decimal-digit-character:
    A Unicode character of the class Nd
    A unicode-escape-sequence representing a character of the class Nd
connecting-character:
    A Unicode character of the class Pc
    A unicode-escape-sequence representing a character of the class Pc
formatting-character:
    A Unicode character of the class Cf
    A unicode-escape-sequence representing a character of the class Cf

For information on the Unicode character classes mentioned above, see
The Unicode Standard, Version 3.0, section 4.5.
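These category tests are easy to experiment with from Python's unicodedata
module. Below is a rough sketch of a checker following the category lists
quoted in the PEP and in the C# grammar; it deliberately ignores NFC
normalization and the finer points of UAX-31:

```python
import unicodedata

# Category lists as quoted from the PEP (underscore handled separately).
ID_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE = ID_START | {"Mn", "Mc", "Nd", "Pc"}

def is_identifier(text, allow_cf=False):
    """Check a candidate name; allow_cf adds C#'s extra Cf class."""
    if not text:
        return False
    cont = ID_CONTINUE | ({"Cf"} if allow_cf else set())
    if text[0] != "_" and unicodedata.category(text[0]) not in ID_START:
        return False
    return all(unicodedata.category(c) in cont for c in text[1:])

print(is_identifier("L\u00f6ffelstiel"))  # True
print(is_identifier("1abc"))              # False: Nd may not start a name
```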

ga...@dsdata.it

unread,
May 14, 2007, 10:42:37 AM5/14/07
to
On May 13, 5:44 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:


> In summary, this PEP proposes to allow non-ASCII letters as
> identifiers in Python. If the PEP is accepted, the following
> identifiers would also become valid as class, function, or
> variable names: Löffelstiel, changé, ошибка, or 売り場
> (hoping that the latter one means "counter").

I am strongly against this PEP. The serious problems and huge costs
already explained by others are not balanced by the possibility of
using non-butchered identifiers in non-ASCII alphabets, especially
considering that one can write any language, in its full Unicode
glory, in the strings and comments of suitably encoded source files.
The diatribe about cross-language understanding of Python code is IMHO
off topic; if one doesn't care about international readers, using
annoying alphabets for identifiers has only a marginal impact. It's
the same situation as with IRIs (a bad idea) in HTML text (happily
Unicode).

> - should non-ASCII identifiers be supported? why?

No, they are useless.


> - would you use them if it was possible to do so? in what cases?

No, never.
Being Italian, I'm sometimes tempted to use accented vowels in my
code, but I restrain myself because of the possibility of annoying
foreign readers and the difficulty of convincing every text editor I
use to preserve them.

> Python code is written by many people in the world who are not familiar
> with the English language, or even well-acquainted with the Latin
> writing system. Such developers often desire to define classes and
> functions with names in their native languages, rather than having to
> come up with an (often incorrect) English translation of the concept
> they want to name.

The described set of users includes linguistically intolerant people
who don't accept the use of suitable languages instead of their own,
or of a compromised but readable spelling instead of the one they
prefer.
Most "people in the world who are not familiar with the English
language" are much more mature than that, even when they don't write
for international readers.

> The syntax of identifiers in Python will be based on the Unicode
> standard annex UAX-31 [1]_, with elaboration and changes as defined
> below.

Not providing an explicit listing of allowed characters is inexcusable
sloppiness.
The XML standard is an example of how listings of large parts of the
Unicode character set can be provided clearly, exactly and (almost)
concisely.

> ``ID_Start`` is defined as all characters having one of the general
> categories uppercase letters (Lu), lowercase letters (Ll), titlecase
> letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers
> (Nl), plus the underscore (XXX what are "stability extensions" listed in
> UAX 31).
>
> ``ID_Continue`` is defined as all characters in ``ID_Start``, plus
> nonspacing marks (Mn), spacing combining marks (Mc), decimal number
> (Nd), and connector punctuations (Pc).

Am I the first to notice how unsuitable these characters are? Many of
these would be utterly invisible ("variation selectors" are Mn) or
displayed out of sequence (overlays are Mn), or normalized away
(combining accents are Mn) or absurdly strange and ambiguous (roman
numerals are Nl, for instance).
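The variation-selector case is easy to demonstrate: U+FE00 VARIATION
SELECTOR-1 really is in category Mn (hence admitted by the quoted
ID_Continue rule), has no visible glyph of its own, and survives NFC
normalization:

```python
import unicodedata

plain = "data"
with_selector = "data\ufe00"  # same name plus an invisible U+FE00

print(unicodedata.category("\ufe00"))                        # Mn
print(plain == with_selector)                                # False
# NFC normalization, as proposed for identifiers, does not remove it:
print(unicodedata.normalize("NFC", with_selector) == plain)  # False
```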

Lorenzo Gatti


Bruno Desthuilliers

unread,
May 14, 2007, 10:49:31 AM5/14/07
to
Stefan Behnel wrote:

> Eric Brunel wrote:
>> On Mon, 14 May 2007 11:00:29 +0200, Stefan Behnel
>>> Any chance there are still kanji-enabled programmes around that were
>>> not hit
>>> by the bomb in this scenario? They might still be able to help you get
>>> the
>>> code "public".
>> Contrarily to what one might think seeing the great achievements of
>> open-source software, people willing to maintain public code and/or make
>> it evolve seem to be quite rare. If you add burdens on such people -
>> such as being able to read and write the language of the original code
>> writer, or forcing them to request a translation or transliteration from
>> someone else -, the chances are that they will become even rarer...
>
> Ok, but then maybe that code just will not become Open Source. There's a
> million reasons code cannot be made Open Source, licensing being one, lack of
> resources being another, bad implementation and lack of documentation being
> important also.
>
> But that won't change by keeping Unicode characters out of source code.

Nope, but adding support for Unicode glyphs in identifiers will only make
things worse, and we (free software authors/users/supporters)
definitely *don't* need this.

> Now that we're at it, badly named english identifiers chosen by non-english
> native speakers, for example, are a sure way to keep people from understanding
> the code and thus from being able to contribute resources.

Broken English is certainly better than German or French or Italian when
it comes to sharing code.

> I'm far from saying that all code should start using non-ASCII characters.
> There are *very* good reasons why a lot of projects are well off with ASCII
> and should obey the good advice of sticking to plain ASCII. But those are
> mainly projects that are developed in English and use English documentation,
> so there is not much of a risk to stumble into problems anyway.
>
> I'm only saying that this shouldn't be a language restriction, as there
> definitely *are* projects (I know some for my part) that can benefit from the
> clarity of native language identifiers (just like English speaking projects
> benefit from the English language).

As far as I'm concerned, I find "frenglish" source code (code with
identifiers in French) a total abomination. The fact is that all the
language (keywords, builtins, stdlib) *is* in English. Unless you
address that fact, your PEP is worthless (and even if you really plan to
do something about this, I still find it a very bad idea for reasons
already exposed).

The fact is also that anyone at least half-serious about CS will learn
technical English anyway. And, as others have already pointed out, learning
technical English is certainly not the most difficult part when it comes
to programming.

> And yes, this includes spelling native
> language identifiers in the native way to make them easy to read and fast to
> grasp for those who maintain the code.

Yes, fine. So we end up with code that's a mix of English (keywords,
builtins, stdlib, almost if not all third-party libs) and the native
language. So, while native speakers will still have to deal with
English, everyone else won't be able to understand anything. Talk
about a great idea...

Bruno Desthuilliers

unread,
May 14, 2007, 11:04:15 AM5/14/07
to
Stefan Behnel wrote:

> Bruno Desthuilliers wrote:
>> but CS is english-speaking, period.
>
> That's a wrong assumption.

I've never met anyone *serious* about programming and yet unable to read
and write CS-oriented technical English.

> I understand that people can have this impression
> when they deal a lot with Open Source code, but I've seen a lot of places
> where code was produced that was not written to become publicly available (and
> believe me, it *never* will become Open Source).

Yeah, fine. This doesn't mean that each and every person who may have to
work on this code is a native speaker of the language used - or even
fluent enough in it.

Eric Brunel

unread,
May 14, 2007, 11:14:09 AM5/14/07
to
On Mon, 14 May 2007 12:17:36 +0200, Stefan Behnel
<stefan.beh...@web.de> wrote:
> Eric Brunel wrote:
>> On Mon, 14 May 2007 11:00:29 +0200, Stefan Behnel
>>> Any chance there are still kanji-enabled programmes around that were
>>> not hit
>>> by the bomb in this scenario? They might still be able to help you get
>>> the
>>> code "public".
>>
>> Contrarily to what one might think seeing the great achievements of
>> open-source software, people willing to maintain public code and/or make
>> it evolve seem to be quite rare. If you add burdens on such people -
>> such as being able to read and write the language of the original code
>> writer, or forcing them to request a translation or transliteration from
>> someone else -, the chances are that they will become even rarer...
>
> Ok, but then maybe that code just will not become Open Source. There's a
> million reasons code cannot be made Open Source, licensing being one,
> lack of
> resources being another, bad implementation and lack of documentation
> being
> important also.
>
> But that won't change by keeping Unicode characters out of source code.

Maybe; maybe not. But this is one more reason preventing a piece of code
from becoming open-source. IMHO, there are already plenty of those reasons,
and I don't think we need a new one...

> Now that we're at it, badly named english identifiers chosen by
> non-english
> native speakers, for example, are a sure way to keep people from
> understanding
> the code and thus from being able to contribute resources.

I wish we could have an option forbidding those too ;-) But then, maybe
some of my own code would no longer execute when it's turned on...

> I'm far from saying that all code should start using non-ASCII
> characters.
> There are *very* good reasons why a lot of projects are well off with
> ASCII
> and should obey the good advice of sticking to plain ASCII. But those are
> mainly projects that are developed in English and use English
> documentation,
> so there is not much of a risk to stumble into problems anyway.
>
> I'm only saying that this shouldn't be a language restriction, as there
> definitely *are* projects (I know some for my part) that can benefit
> from the
> clarity of native language identifiers (just like English speaking
> projects
> benefit from the English language). And yes, this includes spelling
> native
> language identifiers in the native way to make them easy to read and
> fast to
> grasp for those who maintain the code.

My point is only that I don't think you can tell right from the start that
a project you're working on will stay private forever. See Java for
instance: Sun said for quite a long time that it wasn't a good idea to
release Java as open-source and that it was highly unlikely to happen. But
it finally did...

You could say that the rule should be that if the project has the
slightest chance of becoming open-source, or of being shared with people
not speaking the same language as the original coders, one should not use
non-ASCII identifiers. I'm personally convinced that *any* industrial
project falls into this category. So accepting non-ASCII identifiers is
just introducing a disaster waiting to happen.

But then, I have the same feeling about non-ASCII strings, and I - as a
project leader - won't ever accept a source file with a "-*- coding
-*-" line specifying anything other than ascii... So even if I usually
don't buy the "we're already half-dirty, so why can't we be the dirtiest
possible" argument, I'd understand if this feature went into the language.
But I personally won't ever use it, and will forbid it to others whenever
I'm able to.

> It should at least be an available option to use this feature.

If it's actually an option to the interpreter, I guess I'll just have to
alias python to 'python --ascii-only-please'...

Michel Claveau

unread,
May 14, 2007, 11:53:18 AM5/14/07
to
Hi !

> - should non-ASCII identifiers be supported? why?

> - would you use them if it was possible to do so? in what cases?

Yes.
And, more: yes yes yes

Because:

1) when I connect Python to J(ava)Script, if the "connected" pages
contain objects with non-ASCII characters, I can't use them; sniff...

2) when I connect Python to databases, if there are fields (columns)
with accented letters, I can't use class properties to drive these
fields. Examples:
"cité" (French for "city")
"téléphone" (for phone)

And, because non-ASCII characters would be possible but not obligatory,
the guys (snobs?) who want to stay in a pure-ASCII dimension still
can.
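The database case can be sketched like this (the column names and the Row
class are invented for the example); without the PEP, only the getattr()
spelling is available in source code, never the dotted one:

```python
# Stand-in for a real DB-API row wrapper; maps columns to attributes.
class Row:
    def __init__(self, mapping):
        for column, value in mapping.items():
            setattr(self, column, value)

enregistrement = Row({"cit\u00e9": "Paris",
                      "t\u00e9l\u00e9phone": "0102030405"})

# Without the PEP, the dotted form `enregistrement.téléphone` is a
# syntax error, so one is forced back to string keys:
print(getattr(enregistrement, "t\u00e9l\u00e9phone"))  # 0102030405
```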

* sorry for my bad english *

--
@-salutations

Michel Claveau


Anders J. Munch

unread,
May 14, 2007, 12:01:11 PM5/14/07
to
Eric Brunel wrote:
> You could tell that the rule should be that if the project has the
> slightest chance of becoming open-source, or shared with people not
> speaking the same language as the original coders, one should not use
> non-ASCII identifiers. I'm personnally convinced that *any* industrial
> project falls into this category. So accepting non-ASCII identifiers is
> just introducing a disaster waiting to happen.

Not at all. If the need arises, you just translate the whole thing. Contrary
to popular belief, this is a quick and easy thing to do.

So YAGNI applies, and even if you find that you do need it, you may still have
won on balance, as the time saved by using your native language just might
outweigh the time spent translating.

- Anders

Anders J. Munch

unread,
May 14, 2007, 12:15:44 PM5/14/07
to
Hendrik van Rooyen wrote:
> And we have been through the Macro thingy here, and the consensus
> seemed to be that we don't want people to write their own dialects.

Macros create dialects that are understood only by the three people in your
project group. It's unreasonable to compare that to a "dialect" such as
Mandarin, which is exclusive to a tiny little clique of one billion people.

- Anders

ru...@yahoo.com

unread,
May 14, 2007, 12:30:42 PM5/14/07
to
On May 14, 9:53 am, Michel Claveau

Can a discussion about support for non-English identifiers (1),
conducted in a group where 99.9% of the posters are fluent
speakers of English (2), have any chance of being objective
or fair?

Although probably not-sufficient to overcome this built-in
bias, it would be interesting if some bi-lingual readers would
raise this issue in some non-english Python discussion
groups to see if the opposition to this idea is as strong
there as it is here.

(1) No quibbles about the distinction between non-English
and non-ASCII, please.
(2) Several posters have claimed non-native English speaker
status to bolster their position, but since they are clearly at
or near native-speaker levels of fluency, the fact that English is
not their native language is really irrelevant.

Anton Vredegoor

unread,
May 14, 2007, 12:43:54 PM5/14/07
to
Neil Hodgson wrote:
> Anton Vredegoor:
>
>> Ouch! Now I seem to be disagreeing with the one who writes my editor.
>> What will become of me now?
>
> It should be OK. I try to keep my anger under control and not cut
> off the pixel supply at the first stirrings of dissent.

Thanks! I guess I won't have to make the obligatory Soviet Russia joke
now :-)

> It may be an idea to provide some more help for multilingual text
> such as allowing ranges of characters to be represented as hex escapes
> or character names automatically. Then someone who only normally uses
> ASCII can more easily audit patches that could contain non-ASCII characters.

Now that I read that IronPython already supports a larger character
set, I feel like I'm somewhat caught in the side effects of an
embrace-and-extend scheme.

A.

Stefan Behnel

unread,
May 14, 2007, 12:45:58 PM5/14/07
to Jarek Zgoda
Jarek Zgoda wrote:
> Stefan Behnel wrote:
>
>>> While I can read the code with Hebrew, Russian or Greek names
>>> transliterated to ASCII, I would not be able to read such code in native.
>> Then maybe it was code that was not meant to be read by you?
>
> OK, then. As a code obfuscation measure this would fit perfectly.

I actually meant it as a measure for clarity and readability for those who are
actually meant to *read* the code.

Stefan

Jakub Stolarski

unread,
May 14, 2007, 12:47:11 PM5/14/07
to
On May 13, 5:44 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> - should non-ASCII identifiers be supported? why?
No. It's a good convention to stick with English. And if we stick with
English, why should we need non-ASCII characters? Any non-ASCII
character makes code less readable. We never know if our code will
become public.

> - would you use them if it was possible to do so? in what cases?

No. I don't see any use for them. I'm Polish. A Polish-English mix looks funny.

Pierre Hanser

unread,
May 14, 2007, 12:54:57 PM5/14/07
to
This PEP is not technical, or at least not only. It has
larger implications for the model of society we want.

Let me explain with an analogy:
let's compare 'ascii english' to coca-cola.

It's available nearly everywhere.

It does not taste good at first try, and is especially
repulsive to young children.

It's cheap and you don't expect much of it.

You know you can drink some in case of real need.

Its imperialist connotation is widely accepted(?)

But it's not good as your favorite beverage, beer, wine, ...

The world is full of other possibilities. Think: in case
of necessity, you could even have to drink tea with yak
butter in the Himalayas! In normal circumstances you should
never see any, but in extreme situations you may have to!

Where is the freedom in a world where you could only drink coca?

I DON'T WANT TO HAVE TO DRINK COCA AT HOME ALL THE TIME.

and this PEP is a glorious occasion to get free from it.


[disclaimer: coca is used here as the generic name it has become,
and no real offense is intended]

--
Pierre

Terry Reedy

unread,
May 14, 2007, 2:01:16 PM5/14/07
to pytho...@python.org

"Stefan Behnel" <stefan.beh...@web.de> wrote in message
news:4648571B...@web.de...

| Sounds like CPython would better follow IronPython here.

One could also turn the argument around and say that there is no need to
follow IronPython; people who want non-ASCII identifiers can just use
IronPython.


Michel Claveau

unread,
May 14, 2007, 2:19:42 PM5/14/07
to
And Il1 O0 ?

--
@-salutations

Michel Claveau


Marc 'BlackJack' Rintsch

unread,
May 14, 2007, 2:25:09 PM5/14/07
to
In <mn.74c37d75c...@mclaveauPas.De.Spam.com>, Michel Claveau
wrote:

> And Il1 O0 ?

Hm, we should ban digits from identifier names. :-)

Ciao,
Marc 'BlackJack' Rintsch

Stefan Behnel

unread,
May 14, 2007, 2:52:50 PM5/14/07
to Marc 'BlackJack' Rintsch
Marc 'BlackJack' Rintsch wrote:

> In <mn.74c37d75c...@mclaveauPas.De.Spam.com>, Michel Claveau
> wrote:
>
>> And Il1 O0 ?
>
> Hm, we should ban digits from identifier names. :-)

Ah, good idea - and capital letters too. After all, they are rare enough in
English to just plain ignore their existence.

Stefan :)

Michael Yanowitz

unread,
May 14, 2007, 3:03:43 PM5/14/07
to pytho...@python.org
Let me guess - the next step will be to restrict the identifiers
to be at most 6 characters long.

Stefan :)


Grant Edwards

unread,
May 14, 2007, 3:14:49 PM5/14/07
to

And I don't really see any need for using more than two
characters. With just two letters (ignoring case, of course),
you can create 676 identifiers in any namespace. That's
certainly got to be enough. If not, adding a special character
suffix (e.g. $, %, #) to denote the data type should sufficiently
expand the namespace.

So, let's just silently ignore anything past the first two.
That way we'd be compatible with Commodore PET BASIC.

[You don't want to know how long it took me to find all of the
name-collision bugs after porting a basic program from a CP/M
system which had a fairly sophisticated Basic compiler (no line
numbers, all the normal structured programming flow control
constructs) to a Commodore PET which had a really crappy BASIC
interpreter.]

--
Grant Edwards grante Yow! Am I having fun yet?
at
visi.com

Grant Edwards

unread,
May 14, 2007, 3:16:25 PM5/14/07
to
On 2007-05-14, Michael Yanowitz <m.yan...@kearfott.com> wrote:

> Let me guess - the next step will be to restrict the identifiers
> to be at most 6 characters long.

Of course. If they're any longer than that then you can't fit
an entire identifier into a 26-bit CDC 6600 machine register so
you can do a compare with a single machine instruction.

--
Grant Edwards grante Yow! CHUBBY CHECKER just
at had a CHICKEN SANDWICH in
visi.com downtown DULUTH!

Grant Edwards

unread,
May 14, 2007, 3:22:16 PM5/14/07
to
On 2007-05-14, Grant Edwards <gra...@visi.com> wrote:
> On 2007-05-14, Michael Yanowitz <m.yan...@kearfott.com> wrote:
>
>> Let me guess - the next step will be to restrict the identifiers
>> to be at most 6 characters long.
>
> Of course. If they're any longer than that then you can't fit
> an entire identifier into a 26-bit CDC 6600 machine register so
36-bit


> you can do a compare with a single machine instruction.

--
Grant Edwards grante Yow! If our behavior is
at strict, we do not need fun!
visi.com

Méta-MCI

unread,
May 14, 2007, 3:49:42 PM5/14/07
to
Hi!

- should non-ASCII identifiers be supported? why?

- would you use them if it was possible to do so? in what cases?

Yes.

JScript can use letters with accents in identifiers
XML (1.1) can use letters with accents in tags
C# can use letters with accents in variables
SQL: MySQL/MS-SQL/Oracle/etc. can use accents in field names or queries
etc.
etc.

Python MUST make up for lost time.


MCI


Jakub Stolarski

unread,
May 14, 2007, 4:31:16 PM5/14/07
to

And generally nobody uses them.
It sounds like "art for art's sake".

But OK. Maybe it'll be some impulse to learn some new languages.

+1 for this PEP

Michel Claveau

unread,
May 14, 2007, 5:19:30 PM5/14/07
to
Hi!

;-)))

In the Whitespace programming language, only three characters are used:
Space - Tab - Linefeed.

No parasitic characters in listings; economy of ink; ecological
behavior; LOL programming...

Must Python follow this path?


--
@-salutations

Michel Claveau


"Martin v. Löwis"

unread,
May 14, 2007, 6:02:46 PM5/14/07
to Marc 'BlackJack' Rintsch
> In <slrnf4gaf...@irishsea.home.craig-wood.com>, Nick Craig-Wood
> wrote:
>
>> My initial reaction is that it would be cool to use all those great
>> symbols. A variable called OHM etc!
>
> This is a nice candidate for homoglyph confusion. There's the Greek
> letter omega (U+03A9) Ω and the SI unit symbol (U+2126) Ω, and I think
> some omegas in the mathematical symbols area too.

Under the PEP, identifiers are converted to normal form NFC, and
we have

py> unicodedata.normalize("NFC", u"\u2126")
u'\u03a9'

So, OHM SIGN compares equal to GREEK CAPITAL LETTER OMEGA. It can't
be confused with it - it is equal to it by the proposed language
semantics.

Regards,
Martin

"Martin v. Löwis"

unread,
May 14, 2007, 6:14:17 PM5/14/07
to ga...@dsdata.it
> Not providing an explicit listing of allowed characters is inexcusable
> sloppiness.

That is a deliberate part of the specification. It is intentional that
it does *not* specify a precise list, but instead defers that list
to the version of the Unicode standard used (in the unicodedata
module).

> The XML standard is an example of how listings of large parts of the
> Unicode character set can be provided clearly, exactly and (almost)
> concisely.

And, indeed, this is now recognized as one of the bigger mistakes
of the XML recommendation: they provide an explicit list, and fail
to consider characters that are unassigned. In XML 1.1, they try
to address this issue, by now allowing unassigned characters in
XML names even though it's not certain yet what those characters
mean (until they are assigned).

>> ``ID_Continue`` is defined as all characters in ``ID_Start``, plus
>> nonspacing marks (Mn), spacing combining marks (Mc), decimal number
>> (Nd), and connector punctuations (Pc).
>
> Am I the first to notice how unsuitable these characters are?

Probably. Nobody in the Unicode consortium noticed, but what
do they know about suitability of Unicode characters...

Regards,
Martin
