For better or worse, apart from string literals, comments and identifiers, Python is pure ASCII, which simplifies everything. λ (lambda) is not in the first 128 code points of Unicode, so it is highly unlikely to be accepted.
· ‘Lambda’ is exactly as discouraging to type as it needs to be. A more likely to be accepted alternate keyword is ‘whyareyounotusingdef’
· Python doesn’t attempt to look like mathematical formula
· ‘Lambda’ spelling is intuitive to most people who program
· TIMTOWTDI isn’t a religious edict. Python is more pragmatic than that.
· It’s hard to type in ALL editors unless your locale is set to (ancient?) Greek.
· … What are you doing to have an identifier outside of ‘[A-Za-z_][A-Za-z0-9_]*’?
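(A side note, since that last question invites it: Python 3 already accepts identifiers well outside that ASCII range; it is the keywords and operators that remain ASCII-only. A quick self-contained illustration:)

```python
# Python 3 identifiers may contain non-ASCII letters (PEP 3131);
# keywords and operators, by contrast, remain ASCII-only.
λ = 42                          # a legal Python 3 variable name
assert "λ".isidentifier()       # True: λ is a Unicode letter
assert not "->".isidentifier()  # punctuation cannot form identifiers
print(λ)
```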
> Here is my speculative language idea for Python:
Thank you for raising it.
> Allow the following alternative spelling of the keyword `lambda':
> λ
> […]
> Therefore I would really like this to be an official part of the
> Python syntax.
>
> I know people have been clamoring for shorter lambda-syntax in the
> past, I think this is a nice minimal extension.
The question is not whether it would be nice, nor how many people have
clamoured for it. That you feel this supports the proposal isn't a good
sign :-)
The question is: What significant improvements to the language are made
by this proposal, to counteract the significant cost of *any* such
change to the language syntax?
> Advantages:
>
> * The lambda keyword is quite long and distracts from the "meat" of
> the lambda expression. Replacing it by a single-character keyword
> improves readability.
I disagree on this point. Making lambda easier is an attractive
nuisance; the ‘def’ statement is superior (for Python) in most
situations, and so lambda expressions should not be easier than that.
> * The resulting code resembles more closely mathematical notation (in
> particular, lambda-calculus notation), so it brings Python closer to
> being "executable pseudo-code".
How is that an advantage?
> * The alternative spelling λ/lambda is quite intuitive (at least to
> anybody who knows Greek letters.)
I reject this use of “intuitive”: no one knows intuitively what lambda
is, what λ is, what the correspondence between them is, or what they
mean in various contexts. All of that needs to be learned, specifically.
So if this is an advantage, it needs to be expressed somehow other than
“intuitive”. Maybe you mean “familiar”, and avoid that term because it
makes for a weaker argument?
> Disadvantages:
I agree with your assessment of disadvantages, and re-iterate the
inherent disadvantage that any language change brings significant cost
to the Python core developers and the whole Python community. That's why
most such suggestions have a significant hurdle to demonstrate a
benefit.
--
\ “Prediction is very difficult, especially of the future.” |
`\ —Niels Bohr |
_o__) |
Ben Finney
It is a pure-Python project that pre-compiles its "Coconut" program
files to .py.
They already have a shorter syntax for lambda - maybe that could be of
use to you - and maybe you can get them to accept your suggestion; it
would certainly fit there.
"""
Lambdas
Coconut provides the simple, clean -> operator as an alternative to
Python’s lambda statements. The operator has the same precedence as
the old statement.
Rationale
In Python, lambdas are ugly and bulky, requiring the entire word
lambda to be written out every time one is constructed. This is fine
if in-line functions are very rarely needed, but in functional
programming in-line functions are an essential tool.
Example:
dubsums = map((x, y) -> 2*(x+y), range(0, 10), range(10, 20))
"""
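For comparison, the same computation in standard Python, spelled with the lambda keyword (`list()` added so the result is visible in Python 3, where `map` is lazy):

```python
# Standard-Python equivalent of the Coconut one-liner quoted above.
dubsums = list(map(lambda x, y: 2 * (x + y), range(0, 10), range(10, 20)))
print(dubsums)  # [20, 24, 28, 32, 36, 40, 44, 48, 52, 56]
```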
Doesn't this kind of violate Python's "one way to do it"?
(Also, sorry for the top post; I'm on mobile right now...)
--
Ryan
[ERROR]: Your autotools build scripts are 200 lines longer than your program. Something’s wrong.
http://kirbyfan64.github.io/
> A variant Python would be welcome to translate all the operators
> and keywords into single-character tokens, using Unicode symbols
> for NOT EQUAL TO and so on - including using U+03BB in place of
> 'lambda'.
Probably it would not be "welcome", except in the usual sense that
"Python is open source, you can do what you want".
There was extensive discussion about the issues surrounding the
natural languages used by programmers in source documentation (eg,
identifier choice and comments) at the time of PEP 263. The mojibake
(choice of charset) problem has largely improved since then, thanks to
Unicode adoption, especially UTF-8. But the "Tower of Babel" issue
has not. Fundamentally, it's like women's clothes (they wear them to
impress, ie, communicate to, other women -- few men have the interest
to understand what is impressive ;-): programming is about programmers
communicating to other programmers. Maintaining the traditional
spelling of keywords and operators is definitely useful for that
purpose.
This is not to say that individuals who want a personalized[1]
language are wrong, just that it would have a net negative impact on
communication in teams.
BTW, Barry long advocated use of some variant syntaxes (the one I like
to remember inaccurately is "><" instead of "!="), and in fact
provided an easter egg import (barry_is_flufl or something like that)
that changed the syntax to suit him. I believe that module is pure
Python, so people who want to customize the lexical definition of
Python at the language level can do so AFAIK. You could probably even
spell it "import λ等" (to take a very ugly page from the Book of GNU,
mixing scripts in a single word -- the Han character means "etc.").
Footnotes:
[1] I don't have a better word. I mean something like "seasoned to
taste", almost "tasteful" but not quite.
> There was extensive discussion about the issues surrounding the
> natural languages used by programmers in source documentation (eg,
> identifier choice and comments) at the time of PEP 263. The mojibake
> (choice of charset) problem has largely improved since then, thanks to
> Unicode adoption, especially UTF-8. But the "Tower of Babel" issue
> has not. Fundamentally, it's like women's clothes (they wear them to
> impress, ie, communicate to, other women -- few men have the interest
> to understand what is impressive ;-): programming is about programmers
> communicating to other programmers.
With respect Stephen, that's codswallop :-)
It might be true that the average bogan[1] bloke or socially awkward
geek (including myself) might not care about impressive clothes, but
many men do dress to compete. The difference is more socio-economic:
typically women dress to compete across most s-e groups, while men
mostly do so only in the upper-middle and upper classes. And in the
upper classes, competition tends to be more understated and subtle
("good taste"), i.e. expensive Italian suits rather than hot pants.
Historically, it is usually men who dress like peacocks to impress
socially, while women are comparatively restrained. The drab business
suit of Anglo-American influence is no more representative of male
clothing through the ages than is the Communist Chinese "Mao suit".
And as for programmers... the popularity of one-liners, the obfuscated C
competition, code golf, "clever coding tricks" etc is rarely for the
purposes of communication *about code*. Communication is taking place,
but it's about social status and cleverness. There's a very popular
StackOverflow site dedicated to code golf, where you will see people
have written their own custom languages specifically for writing terse
code. Nobody expects these languages to be used by more than a handful
of people. That's not their point.
> Maintaining the traditional
> spelling of keywords and operators is definitely useful for that
> purpose.
Okay, let's put aside the social uses of code-golfing and equivalent,
and focus on quote-unquote "real code", where programmers care more
about getting the job done and keeping it maintainable rather than
competing with other programmers for status, jobs, avoiding being the
sacrificial goat in the next round of stack-ranked layoffs, etc.
You're right of course that traditional spelling is useful, but perhaps
not as much as you think. After all, one person's traditional spelling
is another person's confusing notation and a third person's excessively
verbose spelling. Not too many people like Cobol-like spelling:
add 1 to the_number
over "n += 1". So I think that arguments for keeping "traditional
spelling" are mostly about familiarity. If we learned lambda calculus in
high school, perhaps λ would be less exotic.
I think that there is a good argument to be made in favour of increasing
the amount of mathematical notation used in code, but I would think that
since a lot of my code is mathematical in nature. I can see that makes
my code atypical.
Coming back to the specific change suggested here, λ as an alternative
keyword for lambda, I have a minor and major objection:
The minor objection is that I think that λ is too useful a one-letter
symbol to waste on a comparatively rare usage, anonymous functions. In
mathematical code, I would prefer to keep λ for wavelength, or for the
radioactive decay constant, rather than for anonymous functions.
The major objection is that I think it's still too hard to expect the
average programmer to be able to produce the λ symbol on demand. We
don't all have a Greek keyboard :-)
I *don't* think that expecting programmers to learn λ is too difficult.
It's no more difficult than the word "lambda", or that | means bitwise
OR. Or for that matter, that * means multiplication. Yes, I've seen
beginners stumped by that. (Sometimes we forget that * is not something
you learn in maths class.)
So overall, I'm a -1 on this specific proposal.
[1] Non-Australians will probably recognise similar terms hoser,
redneck, chav, gopnik, etc.
--
Steve
1.
What is the future of coding?
I feel it is not only the language that translates your ideas into
reality. Artificial intelligence in (future) editors (and also vim's
conceal feature) is probably the right way to enhance your coding power
(with lambdas too).
2.
If we would like to enhance Python syntax with Unicode characters,
then I think it is good to see the larger context. There is an ocean
of possibilities for how to do it (probably good possibilities too).
For example, Unicode could help us add new operators. But it also
brings a lot of questions (how do I write Knuth's arrow in my editor?)
and difficulties (how do we let classes implement special methods for
these (*) new operators? How do we make, for example, a triple arrow
possible?).
I propose to be prepared before opening Pandora's box. :)
(*) All of them? And that probably means all of them after future
enhancements of Unicode too?
3.
Questions around "only one way to write it" could probably be
answered with this?
a<b
a.__lt__(b)
> Questions around "only one way to write it" could probably be
> answered with this?
>
> a<b
> a.__lt__(b)
The maxim is not “only one way”. That is a common misconception, but it
is easily dispelled: read the Zen of Python (by ‘import this’ in the
interactive prompt).
Rather, the maxim is “There should be one obvious way to do it”, with a
parenthetical “and preferably only one”.
So the emphasis is on the way being *obvious*, and all other ways being
non-obvious. This leads, of course, to choosing the best way to also be
the one obvious way to do it.
Your example above supports this: the comparison ‘a < b’ is the one
obvious way to compare whether ‘a’ is less than ‘b’.
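Indeed, the two spellings go through the same protocol; roughly speaking (the real dispatch also considers the reflected ‘__gt__’ and subclass overrides), ‘a < b’ ends up calling ‘type(a).__lt__’. A minimal sketch:

```python
class Box:
    """Minimal example: '<' dispatches to the __lt__ special method."""
    def __init__(self, value):
        self.value = value
    def __lt__(self, other):
        return self.value < other.value

a, b = Box(1), Box(2)
print(a < b)        # True - the one obvious spelling
print(a.__lt__(b))  # True - the non-obvious spelling of the same test
```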
--
\ “It is forbidden to steal hotel towels. Please if you are not |
`\ person to do such is please not to read notice.” —hotel, |
_o__) Kowloon, Hong Kong |
Ben Finney
I don't support this lambda proposal (at this moment - but probably
somebody could convince me).
But if we do accept it, then wouldn't the Unicode version be the
obvious one?
It's probably also relevant in this context that more "modern"
languages tend to avoid the term lambda but embrace "anonymous
functions" with syntax such as
(x, y) -> x+y
or whatever.
So while "better syntax for lambda expressions" is potentially a
reasonable goal, I don't think that perpetuating the concept/name
"lambda" is necessary or valuable.
Paul
Exactly, there's not much value in having yet another way of writing
'lambda:'. Keeping other languages in mind and the conservative stance
Python usually takes, the arrow ('=>') would be the only valid
alternative for a "better syntax for lambda expressions". However, IIRC,
this has been debated and won't happen.
Personally, I have other associations with λ. Thus, I would rather see
it as a variable name in such contexts.
A fair point. But Python has a strong mathematical side (look how big
the numpy/scipy/matplotlib communities are), and we've already seen
how strongly they prefer "a @ b" to "a.matmul(b)". If there's support
for a language variant that uses more and shorter symbols, that would
be where I'd expect to find it.
> And as for programmers... the popularity of one-liners, the
> obfuscated C competition, code golf, "clever coding tricks" etc is
> rarely for the purposes of communication *about code*.
Sure, but *Python* is popular because it's easy to communicate *with*
and (usually) *about* Python code, and it does pretty well on "terse"
for many algorithmic idioms. (Yes, there are other reasons --
reasonable performance, batteries included, etc. That doesn't make
the design of the language *not* a reason for its popularity.)
You seem to be understanding my statements to be much more general
than they are. I'm only suggesting that this applies to Python as we
know and love it, and to Pythonic tradition.
> The major objection is that I think its still too hard to expect the
> average programmer to be able to produce the λ symbol on demand. We
> don't all have a Greek keyboard :-)
So what? If you run Mac OS X, Windows, or X11, you do have a keyboard
capable of producing Greek. And the same chords work in any Unicode-
capable editor, it's just that the Greek letters aren't printed on the
keycaps. Neither are emoticons, nor the CUA gestures (bucky-X[1],
bucky-C, bucky-V, and the oh-so-useful bucky-Z) but those are
everywhere. Any 10-year-old can find them somehow! To the extent
that Python would consider such changes (ie, a half-dozen or so
one-character replacements for multicharacter operators or keywords),
it would be very nearly as learnable to type them as to read them.
The problem (if it exists, of course -- obviously, I believe it does
but YMMV) is all about overloading people's ability to perceive the
meaning of code without reading it token by token.
Footnotes:
[1] Bucky = Control, Alt, Meta, Command, Option, Windows, etc. keys.
Um, as someone significantly older than 10 years old, I don't know how
to type a lambda character on my Windows UK keyboard...
Another identifier could be "/\" that looks like the uppercase lambda.
Great for those using an editor targeted at Python programmers, but most
editors are more general than that. Which means that programmers will
find themselves split into two camps: those who can easily type λ, and
those that cannot.
In the 1980s and 90s, I was a Macintosh user, and one nice feature of
the Macs at the time was the ease of typing non-ASCII characters. (Of
course there were a lot fewer back then: MacRoman is an 8-bit extension
to ASCII, compared to Unicode with its thousands of code points.)
Consequently I've used an Apple-specific language that included
operators like ≠ ≤ ≥ and it is *really nice*.
But Apple has the advantage of controlling the entire platform and they
could ensure that these characters could be input from any application
on any machine using exactly the same key sequence. (By memory, it was
option-= to get ≠.) We don't have that advantage, and frankly I think
you are underestimating the *practical* difficulties for input.
I recently discovered (by accident!) the Linux compose key. So now I
know how to enter µ at the keyboard: COMPOSE mu does the job. So maybe
COMPOSE lambda works? Nope. How about COMPOSE l or shift-l or ll or la
or yy (it's an upside-down y, right, and COMPOSE ee gives ə)?
No, none of these things work on my system. They may work on your
system: since discovering COMPOSE, I keep coming across people who
state "oh, it's easy to type such-and-such a character, just type
COMPOSE key-sequence, it's standard and will work on EVERY LINUX
SYSTEM EVERYWHERE". Not a chance. The key bindings for COMPOSE are
anything but standard.
And COMPOSE is *really* hard to use well: it gives no feedback if you
make a mistake except to silently ignore your keypresses (or insert the
wrong character). So invariably, every time I want to enter a non-ASCII
character, it takes me out of "thinking about code" into "thinking about
how to enter characters", sometimes for minutes at a time as I hunt for
the character in "Character Map" or google for it on the Internet.
It may be reasonable to argue that code is read more than it is written:
- suppose that reading λ has a *tiny* benefit of 1% over "lambda"
(for those who have learned what it means);
- but typing it is (let's say) 50 times harder than typing "lambda";
- but we read code 50 times as often as we type it;
- so the total benefit (50*1.01 - 50) is positive.
Invent your own numbers, and you'll come up with your own results. I
don't think there's any *objective* way to decide this question. And
that's why I don't think that Python should take this step: let other
languages experiment with non-ASCII keywords first, or let people
experiment with translators that transform ≠ into != and λ into lambda.
--
Steve
> On Jul 13, 2016, at 9:44 PM, Steven D'Aprano <st...@pearwood.info> wrote:
>
> Which means that programmers will
> find themselves split into two camps: those who can easily type λ, and
> those that cannot.
We already have two camps: those who don't mind using "lambda" and those who would only use "def."
I would expect that those who will benefit most are people who routinely write expressions that involve a lambda that returns a lambda that returns a lambda. There is a niche for such programming style and using λ instead of lambda will improve the readability of such programs for those who can understand them in the current form.
For the "def" camp, the possibility of a non-ascii spelling will serve as yet another argument to avoid using anonymous functions.
> On Jul 13, 2016, at 9:44 PM, Steven D'Aprano <st...@pearwood.info> wrote:
>
> I think
> you are underestimating the *practical* difficulties for input.
I appreciate those difficulties (I am typing this on an iPhone), but I think they are irrelevant. I can imagine 3 scenarios:
1. (The 99% case) You will never see λ in the code and never write it yourself. You can be happily unaware of this feature.
2. You see λ occasionally, but don't like it. You continue using spelled out "lambda" (or just use "def") in the code that you write.
3. You work on a project where local coding style mandates that lambda is spelled λ. In this case, there will be plenty of places in the code base to copy and paste λ from. (In the worst case you copy and paste it from the coding style manual.) More likely, however, the project that requires λ would have a precommit hook that translates lambda to λ in all new code and you can continue using the 6-character keyword in your input.
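A pre-commit hook of the kind described is easy to sketch with the standard tokenize module. This is a hypothetical illustration, not a real tool; the function name ‘respell’ is made up:

```python
# Hypothetical sketch of such a hook: re-spell the 'lambda' keyword using
# the tokenize module, so contributors can keep typing the ASCII form.
import io
import tokenize

def respell(source, old="lambda", new="λ"):
    """Return `source` with every `old` NAME token replaced by `new`."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = new if tok.type == tokenize.NAME and tok.string == old else tok.string
        # 2-tuples force untokenize's compatibility mode, which re-spaces
        # tokens instead of trying to honour the original columns.
        result.append((tok.type, text))
    return tokenize.untokenize(result)

translated = respell("f = lambda x: x + 1\n")
print(translated)                            # the λ spelling
print(respell(translated, "λ", "lambda"))    # valid Python again
```

The reverse direction works because λ already tokenizes as a NAME in Python 3, so the same function translates either way.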
> We already have two camps: those who don't mind using "lambda" and
> those who would only use "def."
I don't know anyone in the latter camp, do you?
I am in the camp that loves ‘lambda’ for some narrowly-specified
purposes *and* thinks ‘def’ is generally a better tool.
--
\ “… correct code is great, code that crashes could use |
`\ improvement, but incorrect code that doesn’t crash is a |
_o__) horrible nightmare.” —Chris Smith, 2008-08-22 |
Ben Finney
> 3. You work on a project where local coding style mandates that lambda is spelled λ. In this case, there will be plenty of places in the code base to copy and paste λ from. (In the worst case you copy and paste it from the coding style manual.) More likely, however, the project that requires λ would have a precommit hook that translates lambda to λ in all new code and you can continue using the 6-character keyword in your input.
> - suppose that reading λ has a *tiny* benefit of 1% over "lambda"
>   (for those who have learned what it means);
> - but typing it is (lets say) 50 times harder than typing "lambda";
> - but we read code 50 times as often as we type it;
> - so the total benefit (50*1.01 - 50) is positive.
But it also has the meaning of "the next character is special", such
as \n for newline or \uNNNN for a Unicode escape. However, I suspect
there might be a parsing conflict:
do_stuff(stuff_with_long_name, more_stuff, what_is_next_arg, \
At that point in the parsing, are you looking at a lambda function or
a line continuation? Sure, style guides would decry this (put the
backslash with its function, dummy!), but the parser can't depend on
style guides being followed.
-1 on using backslash for this.
-0 on λ.
ChrisA
> -1 on using backslash for this.
> -0 on λ.
Thanks,
S
Just to be a small data point, I have written code that uses λ as a
variable name (as someone mentioned elsewhere in the thread, Jupyter
Notebook makes typing Greek characters easy). Because this would
break code that I have written, and I suspect it would break other
code as well, I am -1 on the proposal. How selfish of me!
Cody
On Wed, Jul 13, 2016 at 7:44 PM, Steven D'Aprano <st...@pearwood.info> wrote:
> - suppose that reading λ has a *tiny* benefit of 1% over "lambda"
>   (for those who have learned what it means);
> - but typing it is (lets say) 50 times harder than typing "lambda";
> - but we read code 50 times as often as we type it;
> - so the total benefit (50*1.01 - 50) is positive.

I actually *do* think λ is a little bit more readable. And I have no
idea how to type it directly on my El Capitan system with the ABC
Extended keyboard. But I still get 100% of the benefit in readability
simply by using vim's conceal feature. If I used a different editor I'd
have to hope for a similar feature (or program it myself), but this is
purely a display question. Similarly, I think syntax highlighting makes
my code much more readable, but I don't want colors for keywords built
into the language. That is, and should remain, a matter of tooling, not
core language (I don't want https://en.wikipedia.org/wiki/ColorForth
for Python).
FWIW, my conceal configuration is at the link I give in a moment. I've customized a bunch of special stuff besides lambda; take it or leave it:
--
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons. Intellectual property is
to the 21st century what the slave trade was to the 16th.
On 14.07.2016 08:39, David Mertz wrote:
> [...]
> That is, and should remain, a matter of tooling not core language (I
> don't want https://en.wikipedia.org/wiki/ColorForth for Python).
Very good point. That now is basically the core argument against it at least for me. So, -100 on the proposal from me. :)
> On Jul 13, 2016, at 4:12 PM, John Wong <gokop...@gmail.com> wrote:
>
> Sorry to be blunt. Are we going to add omega, delta, epsilon and the entire Greek alphabet?
Breaking news: the entire Greek alphabet is already available for use in Python. If someone wants to write code that looks like a series of missing character boxes on your screen she already can.
On 14 July 2016 at 23:13, John Wong <gokop...@gmail.com> wrote:
> Why should I write pi in two English characters instead of typing π? Python
> is so popular among the science community, so shouldn't we add that as well?
> Excerpt from the question on
> http://programmers.stackexchange.com/questions/16010/is-it-bad-to-use-unicode-characters-in-variable-names:
>
> t = (µw-µl)/c # those are used in
> e = ε/c # multiple places.
> σw_new = (σw**2 * (1 - (σw**2)/(c**2)*Wwin(t, e)) + γ**2)**.5
I'm not sure what you're saying here. You do realise that the above is
perfectly valid Python 3? The SO question you quote is referring to
the fact that identifiers are restricted to (Unicode) *letters* and
that symbol characters can't be used as variable names.
All of which is tangential to the question here which is about using
Unicode in a *keyword*.
Unicode-as-identifier makes a lot of sense in situations where you
have a data-driven API (like a pandas dataframe or
collections.namedtuple) and the data you're working with contains
Unicode characters. Hence my choice of example in
http://developerblog.redhat.com/2014/09/09/transition-to-multilingual-programming-python/
- it's easy to imagine cases where the named tuple attributes are
coming from a data source like headers in a CSV file, and in
situations like that, folks shouldn't be forced into awkward
workarounds just because their data contains non-ASCII characters.
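A small illustration of the kind of case described here, with field names as they might arrive from a non-English CSV header (the type and data below are invented for the example):

```python
from collections import namedtuple

# Field names taken, say, from the header row of a German CSV file.
Messung = namedtuple("Messung", ["straße", "größe"])
row = Messung(straße="Hauptstraße 1", größe=42)
print(row.größe)  # 42 - no awkward workaround needed
```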
> When Python 3 was cooking I remember there were debates on whether removing
> "lambda". It stayed, and I'm glad it did, but IMO that should tell it's not
> important enough to deserve the breakage of a rule which has never been
> broken (non-ASCII for a keyword).
This I largely agree with, though. The *one* argument for improvement
I see potentially working is the one I advanced back in March when I
suggested that adding support for Java's lambda syntax might be worth
doing: https://mail.python.org/pipermail/python-ideas/2016-March/038649.html
However, any proposals along those lines need to be couched in terms
of how they will advance the Python ecosystem as a whole, rather than
"I like using lambda expressions in my code, but I don't like the
'lambda' keyword", as we have a couple of decades worth of evidence
informing us that the latter isn't sufficient justification for
change.
Cheers,
Nick.
--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia
> However, any proposals along those lines need to be couched in terms
> of how they will advance the Python ecosystem as a whole, rather than
> "I like using lambda expressions in my code, but I don't like the
> 'lambda' keyword", as we have a couple of decades worth of evidence
> informing us that the latter isn't sufficient justification for
> change.
I use the vim conceal plugin myself too. It's whimsical, but I like the appearance of it, so I get the sentiment of the original poster. In my conceal configuration, I substitute a bunch of characters visually (if the attachment works, a screenshot example of some, but not all, of them is in this message). And honestly, having my text editor make the substitution is exactly what I want.
On 18 July 2016 at 13:41, Rustom Mody <rusto...@gmail.com> wrote:
> Do consider:
>
>>>> Α = 1
>>>> A = 2
>>>> Α + 1 == A
> True
>>>>
>
> Can (IMHO) go all the way to
> https://en.wikipedia.org/wiki/IDN_homograph_attack

Yes, we know - that dramatic increase in the attack surface is why
PyPI is still ASCII only, even though full Unicode support is
theoretically possible.

It's not a major concern once an attacker already has you running
arbitrary code on your system though, as the main problem there is
that they're *running arbitrary code on your system*. That means the
usability gains easily outweigh the increased obfuscation potential,
as worrying about confusable attacks at that point is like worrying
about a dripping tap upstairs when the Brisbane River is already
flowing through the ground floor of your house :)

Cheers,
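The confusable pair in the example above can be made visible with the standard unicodedata module (a quick, self-contained illustration):

```python
import unicodedata

# The two identifiers in the example render alike but are distinct:
for ch in (chr(0x0391), "A"):  # GREEK CAPITAL LETTER ALPHA vs Latin A
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0391  GREEK CAPITAL LETTER ALPHA
# U+0041  LATIN CAPITAL LETTER A
```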
One solution would be to restrict identifiers to Unicode characters in appropriate categories only. The open quotation mark is in a punctuation category, so it doesn't make sense for it to be part of an identifier.
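For what it's worth, Python 3's identifier rules (PEP 3131) already work this way, keying off Unicode categories; a quick check:

```python
import unicodedata

# PEP 3131 restricts identifier characters by Unicode category:
print(unicodedata.category("“"))  # 'Pi' - initial-quote punctuation
print("“test".isidentifier())     # False: punctuation is excluded
print("µ".isidentifier())         # True: MICRO SIGN is a lowercase letter
```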
> There was this question on the python list a few days ago:
> Subject: SyntaxError: Non-ASCII character
[...]
> I pointed out that the python2 error was more helpful (to my eyes) than
> python3s
And I pointed out how I thought the Python 3 error message could be
improved, but the Python 2 error message was not very good.
> Python3
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/ariston/foo.py", line 31
> wf = wave.open(“test.wav”, “rb”)
> ^
> SyntaxError: invalid character in identifier
It would be much more helpful if the caret lined up with the offending
character. Better still, if the offending character was actually stated:
wf = wave.open(“test.wav”, “rb”)
^
SyntaxError: invalid character '“' in identifier
> Python2
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "foo.py", line 31
> SyntaxError: Non-ASCII character '\xe2' in file foo.py on line 31, but no
> encoding declared; see http://python.org/dev/peps/pep-0263/ for details
As I pointed out earlier, this is less helpful. The line itself is not
shown (although the line number is given), nor is the offending
character. (Python 2 can't show the character because it doesn't know
what it is -- it only knows the byte value, not the encoding.) But in
the person's text editor, chances are they will see what looks to them
like a perfectly reasonable character, and have no idea which is the
byte \xe2.
> IOW
> 1. The lexer is internally (evidently from the error message) so
> ASCII-oriented that any “unicode-junk” just defaults out to identifiers
> (presumably comments are dealt with earlier) and then if that lexing action
> fails it mistakenly pinpoints a wrong *identifier* rather than just an
> impermissible character like python 2
You seem to be jumping to a rather large conclusion here. Even if you
are right that the lexer considers all otherwise-unexpected characters
to be part of an identifier, why is that a problem?
I agree that it is mildly misleading to say
invalid character '“' in identifier
when “ is not part of an identifier:
py> '“test'.isidentifier()
False
but I don't think you can jump from that to your conclusion that
Python's unicode support is somewhat "wrongheaded". Surely a much
simpler, less inflammatory response would be to say that this one
specific error message could be improved?
But... is it REALLY so bad? What if we wrote it like this instead:
py> result = my§function(arg)
File "<stdin>", line 1
result = my§function(arg)
^
SyntaxError: invalid character in identifier
Isn't it more reasonable to consider that "my§function" looks like it is
intended as an identifier, but it happens to have an illegal character
in it?
> combine that with
> 2. matrix mult (@) Ok to emulate perl but not to go outside ASCII
How does @ emulate Perl?
As for your second part, about not going outside of ASCII, yes, that is
official policy for Python operators, keywords and builtins.
> makes it seem (to me) python's unicode support is somewhat wrongheaded.
--
Steve
--
---
You received this message because you are subscribed to a topic in the Google Groups "python-ideas" group.
On Tue, Jul 19, 2016 at 7:21 AM Steven D'Aprano wrote:
> On Mon, Jul 18, 2016 at 10:29:34PM -0700, Rustom Mody wrote:
> > IOW
> > 1. The lexer is internally (evidently from the error message) so
> > ASCII-oriented that any “unicode-junk” just defaults out to identifiers
> > (presumably comments are dealt with earlier) and then if that lexing action
> > fails it mistakenly pinpoints a wrong *identifier* rather than just an
> > impermissible character like python 2
>
> You seem to be jumping to a rather large conclusion here. Even if you
> are right that the lexer considers all otherwise-unexpected characters
> to be part of an identifier, why is that a problem?

It's a problem because those characters could never be part of an identifier. So it seems like a bug.
On Tuesday, July 19, 2016 at 5:06:17 PM UTC+5:30, Neil Girdhar wrote:
> It's a problem because those characters could never be part of an
> identifier. So it seems like a bug.
An armchair-design solution would say: we should give the most appropriate answer for every possible Unicode character category.
This would need to take all the Unicode character categories and Python lexical categories and 'cross-product' them — a humongous task to little advantage.
A more practical solution would be to take the best of the current Python 2 and Python 3 approaches:
"Invalid character XX in line YY"
and reveal nothing about which lexical category — like identifier — Python thinks the character belongs to.
The XX is like Python 2 and the YY like Python 3.
If it can do better than '\xe2' — i.e. report an actual codepoint rather than a raw byte — that's a bonus, but not strictly necessary.
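The proposed message shape is easy to sketch; for illustration (the function name is invented, and `unicodedata.name` supplies the "bonus" character name):

```python
import unicodedata

def lex_error(ch, lineno):
    # The proposed shape: name the offending character and the line,
    # while saying nothing about which lexical category the lexer was in.
    return "Invalid character %r (U+%04X %s) in line %d" % (
        ch, ord(ch), unicodedata.name(ch, "UNKNOWN"), lineno)

print(lex_error("“", 31))
```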
On Tue, Jul 19, 2016 at 8:18 AM Rustom Mody wrote:
> An armchair-design solution would say: we should give the most
> appropriate answer for every possible Unicode character category.
> This would need to take all the Unicode character categories and Python
> lexical categories and 'cross-product' them — a humongous task to
> little advantage.

I don't see why this is a "humongous task". Anyway, your solution boils down to the simplest fix in the lexer, which is to block some characters from matching any category, does it not?
There's historically been relatively little work put into designing
the error messages coming out of the lexer, so if it's a task you're
interested in stepping up and taking on, you could probably find
someone willing to review the patches.
But if you perceive "Volunteers used their time as efficiently as
possible whilst fully Unicode enabling the CPython compilation
toolchain, since it was a dependency that needed to be addressed in
order to permit other more interesting changes, rather than an
inherently rewarding activity in its own right" as "wrongheaded", you
may want to spend some time considering the differences between
community-driven and customer-driven development.
Cheers,
Nick.
--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia
> My suggested solution involved this:
> Currently the lexer — basically an automaton — reveals which state its in
> when it throws error involving "identifier"
> Suggested change:
>
> if in_ident_state:
> if current_char is allowable as ident_char:
> continue as before
> elif current_char is ASCII:
> Usual error
> else:
> throw error eliding the "in_ident state"
> else:
> as is...
I'm sorry, you've lost me. Is this pseudo-code (1) of the current
CPython lexer, (2) what you imagine the current CPython lexer does, or
(3) what you think it should do? Because you call it a "change", but
you're only showing one state, so it's not clear if its the beginning or
ending state.
Basically I guess what I'm saying is that if you are suggesting a
concrete change to the lexer, you should be more precise about what
needs to actually change.
> BTW after last post I tried some things and found other unsatisfactory (to
> me) behavior in this area; to wit:
>
> >>> x = 0o19
> File "<stdin>", line 1
> x = 0o19
> ^
> SyntaxError: invalid syntax
>
> Of course the 9 cannot come in an octal constant but "Syntax Error"??
> Seems a little over general
>
> My preferred fix:
> make a LexicalError sub exception to SyntaxError
What's the difference between a LexicalError and a SyntaxError?
Under what circumstances is it important to distinguish between them?
It would be nice to have a more descriptive error message, but why
should I care whether the invalid syntax "0o19" is caught by a lexer or
a parser or the byte-code generator or the peephole optimizer or
something else? Really all I need to care about is:
- it is invalid syntax;
- why it is invalid syntax (9 is not a legal octal digit);
- and preferably, that it is caught at compile-time rather than run-time.
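For what it's worth, the proposed subclass would cost little in backward compatibility, since existing `except SyntaxError` handlers would keep catching it. A minimal sketch (the class is hypothetical, not part of CPython):

```python
class LexicalError(SyntaxError):
    """Hypothetical: raised for errors detected during tokenization."""

# Existing handlers that catch SyntaxError keep working unchanged.
try:
    raise LexicalError("invalid digit '9' in octal literal 0o19")
except SyntaxError as exc:
    message = str(exc)
```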
--
Steve
The codepoint '“' doesn't match either of them, which is a good hint
that Python shouldn't really be saying "invalid character in identifier"
(it's the first character, but it can't be part of an identifier).
On 7/20/16, Danilo J. S. Bellini <danilo....@gmail.com> wrote:
> 4. Unicode have more than one codepoint for some symbols that look alike,
> for example "Σ𝚺𝛴𝜮𝝨𝞢" are all valid uppercase sigmas. There's also "∑",
> but this one is invalid in Python 3. The italic/bold/serif distinction
> seems enough for a distinction, and when editing a code with an Unicode
> char like that, most people would probably copy and paste the symbol
> instead of typing it, leading to a consistent use of the same symbol.
I am not sure what you are trying to say, so just to be sure, some info:
PEP-3131 (https://www.python.org/dev/peps/pep-3131/): "All identifiers
are converted into the normal form NFKC while parsing; comparison of
identifiers is based on NFKC."
From this point of view, all the sigmas are the same:
set(unicodedata.normalize('NFKC', i) for i in "Σ𝚺𝛴𝜮𝝨𝞢") == {'Σ'}
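A runnable illustration of that normalization (the codepoints are the sigma lookalikes from above; the parser applies the same NFKC step to identifiers, so assigning to the bold sigma binds the plain-sigma name):

```python
import unicodedata

# All six sigma lookalikes NFKC-normalize to the plain Greek sigma.
sigmas = "Σ𝚺𝛴𝜮𝝨𝞢"
assert {unicodedata.normalize("NFKC", c) for c in sigmas} == {"Σ"}

# The parser normalizes identifiers the same way (PEP 3131): an
# assignment to MATHEMATICAL BOLD CAPITAL SIGMA ends up bound to
# GREEK CAPITAL LETTER SIGMA.
ns = {}
exec("\U0001D6BA = 5", ns)
assert ns["\u03A3"] == 5
```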
2016-07-21 1:53 GMT-03:00 Pavol Lisy <pavol...@gmail.com>:
> PEP-3131 (https://www.python.org/dev/peps/pep-3131/): "All identifiers
> are converted into the normal form NFKC while parsing; comparison of
> identifiers is based on NFKC."
> From this point of view all sigmas are same:
> set(unicodedata.normalize('NFKC', i) for i in "Σ𝚺𝛴𝜮𝝨𝞢") == {'Σ'}
In this item I just said that most programmers would probably keep the same character in a source code file due to copying and pasting, and that even when the copy-and-paste action doesn't happen, visual differences like italic/bold/serif are enough for one to notice (when using another input method).

At first, I was thinking of code with one of those symbols as a variable name (any of them), but PEP 3131 challenges that. Actually, any conversion to a normal form means that one should never use Unicode identifiers outside the chosen normal form. It would be better to raise an error instead of converting.
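The raise-instead-of-convert suggestion amounts to a one-line check; a hedged sketch (the function name is invented, and this is not what CPython does):

```python
import unicodedata

def check_identifier(name):
    # Sketch of the suggestion: instead of silently normalizing,
    # reject identifiers that are not already in NFKC normal form.
    if unicodedata.normalize("NFKC", name) != name:
        raise SyntaxError("identifier %r is not in NFKC normal form" % name)
    return name

check_identifier("\u03A3")    # plain Σ is already NFKC: accepted
```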
So should we disable the lowercase 'l', the uppercase 'I', and the
digit '1', because they can be confused? What about the confusability
of "m" and "rn"? O and 0 are similar in some fonts. And case
insensitivity brings its own problems - is "ss" equivalent to "ß", and
is "ẞ" equivalent to either? Turkish distinguishes between "i", which
upper-cases to "İ", and "ı", which upper-cases to "I".
We already have interminable debates about letter similarities across
scripts. I'm sure everyone agrees that Cyrillic "и" is not the same
letter as Latin "i", but we have "AАΑ" in three different scripts.
Should they be considered equivalent? I think not, because in any
non-trivial context, you'll know whether the program's been written in
Greek, a Slavic language, or something using the Latin script. But
maybe you disagree. Okay; are "BВΒ" all to be considered equivalent
too? What about "СC"? "XХΧᚷ"? They're visually similar, but they're
not equivalent in any other way. And if you're going to say things
should be considered equivalent solely on the basis of visuals, you
get into a minefield - should U+200B ZERO WIDTH SPACE be completely
ignored, allowing "AB" to be equivalent to "A\u200bB" as an
identifier?
This debate should probably continue on python-list (if anywhere). I
doubt Python is going to change its normalization rules any time soon,
and if it does, it'll need a very solid reason (and probably a PEP
with all the pros and cons).
ChrisA
[getattr(obj, i) for i in dir(obj) if i in "Σ𝚺𝛴𝜮𝝨𝞢"] # [0, 1, 2, 3, 4, 5]
but:
[obj.Σ, obj.𝚺, obj.𝛴, obj.𝜮, obj.𝝨, obj.𝞢, ] # [0, 0, 0, 0, 0, 0]
So you could mix any of them while editing identifiers. (but you could
not mix them while writing parameters in getattr, setattr and type)
But getattr, setattr and type are other beasts, because they can use
"non identifiers", non letter characters too:
setattr(obj,'+', 7)
dir(obj) # ['+', ...] # but obj.+ is syntax error
setattr(obj,u"\udcb4", 7)
dir(obj) # [..., '\udcb4' ,...]
obj = type("SomeClass", (object,), {c: i for i, c in enumerate("+-*/")})()
Maybe there is still some Babel curse here, and some sort of
normalize_dir, normalize_getattr, normalize_setattr or normalize_type
could help? I am not sure. They would probably make things more
complicated rather than simpler.
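One of those helpers is simple to sketch, for what it's worth (the name `nfkc_setattr` is invented here, mirroring the normalize_setattr idea):

```python
import unicodedata

def nfkc_setattr(obj, name, value):
    # Hypothetical helper: normalize the attribute name the same way
    # the parser normalizes identifiers (PEP 3131), so that
    # setattr(obj, <bold sigma>, v) and obj.Σ refer to the same attribute.
    setattr(obj, unicodedata.normalize("NFKC", name), value)

class Obj:
    pass

obj = Obj()
nfkc_setattr(obj, "\U0001D6BA", 7)   # MATHEMATICAL BOLD CAPITAL SIGMA
assert getattr(obj, "\u03A3") == 7   # stored under the plain Σ
```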
No; I'm not saying that. I'm completely disagreeing with #1's value. I
don't think the language interpreter should concern itself with
visually-confusing identifiers. Unicode normalization is about
*equivalent characters*, not confusability, and I think that's as far
as Python should go.
> 1. Using SyntaxError for lexical errors sounds as strange as saying a
> misspell/typo is a syntax mistake in a natural language.
Why? Regardless of whether the error is found by the tokeniser, the
lexer, the parser, or something else, it is still a *syntax error*. Why
would the programmer need to know, or care, what part of the
compiler/interpreter detects the error?
Also consider that not all Python interpreters will divide up the task
of interpreting code exactly the same way. Tokenisers, lexers and
parsers are very closely related and not necessarily distinct. Should
the *exact same typo* generate TokenError in one Python, LexerError in
another, and ParserError in a third? What is the advantage of that?
> 2. About those lexical error messages, the caret is worse than the lack of
> it when it's not aligned, but unless I'm missing something, one can't
> guarantee that the terminal is printing the error message with the right
> encoding. Including the row and column numbers in the message would be
> helpful.
It would be nice for the caret to point to the illegal character, but
it's not *wrong* to point past it to the end of the token that contains
the illegal character.
> 4. Unicode have more than one codepoint for some symbols that look alike,
> for example "Σ𝚺𝛴𝜮𝝨𝞢" are all valid uppercase sigmas. [...]
Not really. Look at their names:
GREEK CAPITAL LETTER SIGMA
MATHEMATICAL BOLD CAPITAL SIGMA
MATHEMATICAL ITALIC CAPITAL SIGMA
MATHEMATICAL BOLD ITALIC CAPITAL SIGMA
MATHEMATICAL SANS-SERIF BOLD CAPITAL SIGMA
MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL SIGMA
Personally, I don't understand why the Unicode Consortium has included
all these variants. But whatever the reason, the names hint strongly
that they have specialised purposes, and shouldn't be used when you want
the letter Σ.
But, if you do, Python will normalise them all to Σ, so there's no real
harm done, except to the readability of your code.
[...]
> when editing a code with an Unicode
> char like that, most people would probably copy and paste the symbol
> instead of typing it, leading to a consistent use of the same symbol.
You are assuming that the programmer's font includes glyphs for all of
six of those code points. More likely, the programmer will see Σ for the
first code point, and the other five will display as a pair of "missing
glyph" boxes. (That's exactly what I see in my mail client, and in the
Python interpreter.)
Why a pair of boxes? Because they are code points in the Supplementary
Multilingual Planes, and require *two* 16-bit code units in UTF-16. So
naive Unicode software with poor support for the SMPs will display two
boxes, one for each surrogate code point.
Even if the code points display correctly, with distinct glyphs, your
comment that most people will be forced to copy and paste the symbol is
precisely why I am reluctant to see Python introduce non-ASCII keywords
or operators. It's a pity, because I think that non-ASCII operators at
least can make a much richer language (although I wouldn't want to see
anything as extreme as APL). Perhaps I will change my mind in a few more
years, as the popularity of emoji encourage more applications to have
better support for non-ASCII and the SMPs.
[...]
> 6. Python 3 code is UTF-8 and Unicode identifiers are allowed. Not having
> Unicode keywords is merely contingent on Python 2 behavior that emphasized
> ASCII-only code (besides comments and strings).
No, it is a *policy decision*. It is not because Python 2 didn't support
them. Python 2 didn't support non-ASCII identifiers either, but Python 3
intentionally broke with that.
> 7. The discussion isn't about lambda or anti-lambda bias, it's about
> keyword naming and readability. Who gains/loses with that resource? It
> won't hurt those who never uses lambda and never uses Unicode identifiers.
It will hurt those who have to read code with a mystery λ that they
don't know what it means and they have no idea how to search for it. At
least "python lambda" is easy to search for.
It will hurt those who want to use λ as an identifier. I include myself
in that category. I don't want λ to be reserved as a keyword.
I look at it like this: using λ as a keyword makes as much sense as
making f a keyword so that we can save a few characters by writing:
f myfunction(arg, x, y):
pass
instead of def. I use f as an identifier in many places, e.g.:
for f in list_of_functions:
...
or in functional code:
compose(f, g)
Yes, I can *work around it* by naming things f_ instead of f, but that's
ugly. Even though it saves a few keystrokes, I wouldn't want f to be
reserved as a keyword, and the same goes for λ as lambda.
> 8. I don't know if any consensus can emerge in this matter about lambdas,
> but there's another subject that can be discussed together: macros.
I'm pretty sure that Guido has ruled "Over My Dead Body" to anything
resembling macros in Python.
However, we can experiment with adding keywords and macro-like
facilities without Guido's permission. For example:
http://www.staringispolite.com/likepython/
It's a joke, of course, but the technology is real.
Imagine, if you will, that you could declare a "dialect" at the
start of Python modules, just after the optional coding cookie:
# -*- coding: utf-8 -*-
# -*- dialect math -*-
which would tell importlib to run the code through some sort of
source/AST transformation before importing it. That will allow us to
localise the keywords, introduce new operators, and all the other things
Guido hates *wink* and still be able to treat the code as normal Python.
A bad idea? Probably an awful one. But it's worth experimenting with.
It will be fun, and it *just might* turn out to be a good idea.
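A toy version of that dialect transformation might look like the following. The "math" dialect and its λ-to-lambda mapping are invented here, and a real version would hook into importlib with a custom loader rather than transform a string:

```python
import re

# Invented example dialect: spell `lambda` as λ.
DIALECTS = {"math": {"λ": "lambda "}}

def apply_dialect(source):
    """Rewrite dialect spellings back to standard Python before compiling.
    Note this naive text replacement would also touch strings and comments."""
    m = re.search(r"-\*-\s*dialect\s+(\w+)\s*-\*-", source)
    if not m:
        return source
    for spelling, standard in DIALECTS[m.group(1)].items():
        source = source.replace(spelling, standard)
    return source

src = '# -*- dialect math -*-\nsquare = λ x: x * x\n'
ns = {}
exec(compile(apply_dialect(src), "<dialect>", "exec"), ns)
assert ns["square"](5) == 25
```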
For the record, in the 1980s and 1990s, Apple used a similar idea for
two of their scripting languages, Hypertalk and Applescript, allowing
users to localise keywords. Hypertalk is now defunct, and Applescript
has dropped that feature, which suggests that it is a bad idea. Or maybe
it was just ahead of its time.
--
Steve
This idea of "visually confusable" seems like a very silly thing to worry about, as others have noted.
It's not just that completely different letters from different alphabets may "look similar", it's also that the similarity is completely dependent on the specific font used for display. My favorite font might have clearly distinguished glyphs for the Cyrillic, Roman, and Greek "A", even if your font uses identical glyphs.
So in this crazy scenario, Python would have to gain awareness of the fonts installed in every text editor and display device of every user.
On Jul 21, 2016 7:26 AM, "Steven D'Aprano" <st...@pearwood.info> wrote:
> You are assuming that the programmer's font includes glyphs for all of
> six of those code points. More likely, the programmer will see Σ for the
> first code point, and the other five will display as a pair of "missing
> glyph" boxes. (That's exactly what I see in my mail client, and in the
> Python interpreter.)
Fwiw, on my OSX laptop, with whatever particular fonts I have installed there, using a particular webmail service in the particular browser I use, I see all six glyphs.
If I were to copy-paste into a text editor, all bets would be off, and depend on the editor and its settings. Same for interactive shells run in particular terminal apps.
Viewing right now, on my Android tablet and the Gmail app, I see a bunch of missing glyph markers. But quite likely I could install fonts or change settings on this device to render them.
> >>> А = 1
> >>> A = A + 1
>
> because the A's look more indistinguishable than the sigmas and are
> internally more distinct
> If the choice is to simply disallow the confusables that’s probably the
> best choice
>
> IOW
> 1. Disallow co-existence of confusables (in identifiers)
That would require disallowing 1, l and I, as well as O and 0. Or are
you, after telling us off for taking an ASCII-centric perspective, going
to exempt ASCII confusables?
In a dynamic language like Python, how do you prohibit these
confusables? Every time Python does a name binding operation, is it
supposed to search the entire namespace for potential confusables?
That's going to be awful expensive.
Confusables are a real problem in URLs, because they can be used for
phishing attacks. While even the most tech-savvy user is vulnerable, it
is especially the *least* savvy users who are at risk, which makes it
all the more important to protect against confusables in URLs.
But in programming code? Your demonstration with the Latin A and the
Greek alpha Α or Cyrillic А is just a party trick. In a world where most
developers do something like:
pip install randompackage
python -m randompackage
without ever once looking at the source code, I think we have bigger
problems. Or rather, even the bigger problems are not that big.
If you're worried about confusables, there are alternatives other than
banning them: your editor or linter might highlight them. Or rather than
syntax highlighting, perhaps editors should use *semantic highlighting*
and colour-code variables:
https://medium.com/@evnbr/coding-in-color-3a6db2743a1e
in which case your A and A will be highlighted in completely different
colours, completely ruining the trick.
(Aside: this may also help with the "oops I misspelled my variable and
the compiler didn't complain" problem. If "self.dashes" is green and
"self.dahses" is blue, you're more likely to notice the typo.)
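A linter-level confusable check along those lines is easy to sketch. The script detection via character names below is a crude stand-in for the real Unicode Script property, and both function names are invented:

```python
import unicodedata

SCRIPTS = ("LATIN", "GREEK", "CYRILLIC")

def scripts_used(identifier):
    # Crude script detection: infer each character's script from the
    # start of its Unicode character name.
    found = set()
    for ch in identifier:
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                found.add(script)
    return found

def is_suspicious(identifier):
    # Flag identifiers that mix scripts, e.g. Cyrillic А with Latin "pple".
    return len(scripts_used(identifier)) > 1

assert is_suspicious("\u0410pple")   # Cyrillic А followed by Latin letters
assert not is_suspicious("Apple")    # all Latin: fine
```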
--
Steve