Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/6PHPDZRCYNA44NHSHXPBL7QMWXMHXWGU/
That brings us to possible changes in Python in this area, which is an
interesting topic.
As for \0, can we ban all ASCII & C1 control characters except
whitespace? I see no place for them in source code.
For homoglyphs/confusables, should there be a SyntaxWarning when an
identifier looks like ASCII but isn't?
For right-to-left text: does anyone actually name identifiers in
Hebrew/Arabic? AFAIK, we should allow a few non-printing
"joiner"/"non-joiner" characters to make it possible to use all Arabic
words. But it would be great to consult with users/teachers of the
languages.
Should Python run the bidi algorithm when parsing and disallow reordered
tokens? Maybe optionally?
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
I’ve not been following the thread, but Steve Holden forwarded me the email from Petr Viktorin, that I might share some of the info I found while recently diving into this topic.
As part of working on the next edition of “Python in a Nutshell” with Steve, Alex Martelli, and Anna Ravencroft, Alex suggested that I add a cautionary section on homoglyphs, specifically citing “A” (LATIN CAPITAL LETTER A) and “Α” (GREEK CAPITAL LETTER ALPHA) as an example problem pair. I wanted to look a little further at the use of characters in identifiers beyond the standard 7-bit ASCII, and so I found some of these same issues dealing with Unicode NFKC normalization. The first discovery was the overlapping normalization of “ªº” with “ao”. This was quite a shock to me, since I assumed that the inclusion of Unicode for identifier characters would preserve the uniqueness of the different code points. Even ligatures can be used, and will overlap with their multi-character ASCII forms. So we have added a second note in the upcoming edition on the risks of using these “homonorms” (which is a word I just made up for the occasion).
To explore the extreme case, I wrote a pyparsing transformer to convert identifiers in a body of Python source to mixed font, equivalent to the original source after NFKC normalization. Here are hello.py, and a snippet from unittest/utils.py:
def 𝚑𝓮𝖑𝒍𝑜():
try:
𝔥e𝗅𝕝𝚘︴ = "Hello"
𝕨𝔬r𝓵ᵈ﹎ = "World"
ᵖ𝖗𝐢𝘯𝓽(f"{𝗵e𝓵𝔩º_}, {𝖜ₒ𝒓lⅆ︴}!")
except 𝓣𝕪ᵖe𝖤𝗿ᵣ𝖔𝚛 as ⅇ𝗑c:
𝒑rℹₙₜ("failed: {}".𝕗𝗼ʳᵐªt(ᵉ𝐱𝓬))
if _︴ⁿ𝓪𝑚𝕖__ == "__main__":
𝒉eℓˡ𝗈()
# snippet from unittest/util.py
_𝓟Ⅼ𝖠𝙲𝗘ℋ𝒪Lᴰ𝑬𝕽﹏𝕷𝔼𝗡 = 12
def _𝔰ʰ𝓸ʳ𝕥𝙚𝑛(𝔰, p𝑟𝔢fi𝖝𝕝𝚎𝑛, sᵤ𝑓𝗳𝗂𝑥𝗹ₑ𝚗):
ˢ𝗸i𝗽 = 𝐥e𝘯(𝖘) - pr𝚎𝖋𝐢x𝗅ᵉ𝓷 - 𝒔𝙪ffi𝘅𝗹𝙚ₙ
if ski𝘱 > _𝐏𝗟𝖠𝘊𝙴H𝕺L𝕯𝙀𝘙﹏L𝔈𝒩:
𝘴 = '%s[%d chars]%s' % (𝙨[:𝘱𝐫𝕖𝑓𝕚xℓ𝒆𝕟], ₛ𝚔𝒊p, 𝓼[𝓁𝒆𝖓(𝚜) - 𝙨𝚞𝒇fix𝙡ᵉ𝘯:])
return ₛ
You should able to paste these into your local UTF-8-aware editor or IDE and execute them as-is.
(If this doesn’t come through, you can also see this as a GitHub gist at Hello, World rendered in a variety of Unicode characters (github.com). I have a second gist containing the transformer, but it is still a private gist atm.)
Some other discoveries:
“·” (ASCII 183) is a valid identifier body character, making “_···” a valid Python identifier. This could actually be another security attack point, in which “s·join(‘x’)” could be easily misread as “s.join(‘x’)”, but would actually be a call to potentially malicious method “s·join”.
“_” seems to be a special case for normalization. Only the ASCII “_” character is valid as a leading identifier character; the Unicode characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏_”) can only be used as identifier body characters. “︳” especially could be misread as “|” followed by a space, when it actually normalizes to “_”.
Potential beneficial uses:
I am considering taking my transformer code and experimenting with an orthogonal approach to syntax highlighting, using Unicode groups instead of colors. Module names using characters from one group, builtins from another, program variables from another, maybe distinguish local from global variables. Colorizing has always been an obvious syntax highlight feature, but is an accessibility issue for those with difficulty distinguishing colors. Unlike the “ransom note” code above, code highlighted in this way might even be quite pleasing to the eye.
-- Paul McGuire
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
def 𝚑𝓮𝖑𝒍𝑜():
try:
𝔥e𝗅𝕝𝚘︴ = "Hello"
𝕨𝔬r𝓵ᵈ﹎ = "World"
ᵖ𝖗𝐢𝘯𝓽(f"{𝗵e𝓵𝔩º_}, {𝖜ₒ𝒓lⅆ︴}!")
except 𝓣𝕪ᵖe𝖤𝗿ᵣ𝖔𝚛 as ⅇ𝗑c:
𝒑rℹₙₜ("failed: {}".𝕗𝗼ʳᵐªt(ᵉ𝐱𝓬))
if _︴ⁿ𝓪𝑚𝕖__ == "__main__":
𝒉eℓˡ𝗈()
# snippet from unittest/util.py
_𝓟Ⅼ𝖠𝙲𝗘ℋ𝒪Lᴰ𝑬𝕽﹏𝕷𝔼𝗡 = 12
def _𝔰ʰ𝓸ʳ𝕥𝙚𝑛(𝔰, p𝑟𝔢fi𝖝𝕝𝚎𝑛, sᵤ𝑓𝗳𝗂𝑥𝗹ₑ𝚗):
ˢ𝗸i𝗽 = 𝐥e𝘯(𝖘) - pr𝚎𝖋𝐢x𝗅ᵉ𝓷 - 𝒔𝙪ffi𝘅𝗹𝙚ₙ
if ski𝘱 > _𝐏𝗟𝖠𝘊𝙴H𝕺L𝕯𝙀𝘙﹏L𝔈𝒩:
𝘴 = '%s[%d chars]%s' % (𝙨[:𝘱𝐫𝕖𝑓𝕚xℓ𝒆𝕟], ₛ𝚔𝒊p, 𝓼[𝓁𝒆𝖓(𝚜) - 𝙨𝚞𝒇fix𝙡ᵉ𝘯:])
return ₛ
You should able to paste these into your local UTF-8-aware editor or IDE and execute them as-is.
(If this doesn’t come through, you can also see this as a GitHub gist at Hello, World rendered in a variety of Unicode characters (github.com). I have a second gist containing the transformer, but it is still a private gist atm.)
Some other discoveries:
“·” (ASCII 183) is a valid identifier body character, making “_···” a valid Python identifier. This could actually be another security attack point, in which “s·join(‘x’)” could be easily misread as “s.join(‘x’)”, but would actually be a call to potentially malicious method “s·join”.
“_” seems to be a special case for normalization. Only the ASCII “_” character is valid as a leading identifier character; the Unicode characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏_”) can only be used as identifier body characters. “︳” especially could be misread as “|” followed by a space, when it actually normalizes to “_”.
Potential beneficial uses:
I am considering taking my transformer code and experimenting with an orthogonal approach to syntax highlighting, using Unicode groups instead of colors. Module names using characters from one group, builtins from another, program variables from another, maybe distinguish local from global variables. Colorizing has always been an obvious syntax highlight feature, but is an accessibility issue for those with difficulty distinguishing colors. Unlike the “ransom note” code above, code highlighted in this way might even be quite pleasing to the eye.
-- Paul McGuire
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Code of Conduct: http://python.org/psf/codeofconduct/
> So why does Python apply NFKC normalization to variable names??
It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL LETTER
E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}",
which are different ways of writing the same thing.
Unfortunately, it goes too far, because it's unlikely that we want "ᵖ"
("\N{MODIFIER LETTER SMALL P}') to be equivalent to "P" ("\N{LATIN
CAPITAL LETTER P}".
On Sun, Nov 14, 2021 at 10:27 AM MRAB <pyt...@mrabarnett.plus.com> wrote:Unfortunately, it goes too far, because it's unlikely that we want "ᵖ"
("\N{MODIFIER LETTER SMALL P}') to be equivalent to "P" ("\N{LATIN
CAPITAL LETTER P}".Is it possible to only capture things like the combining characters and not the "equivalent" ones like the above?
It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL LETTER
E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}",
which are different ways of writing the same thing.
Why does someone that wants to use, .e.g. "é" in an identifier have to be able to represent it two different ways in a code file?
Out of all the approximately thousand bazillion ways to write obfuscated
Python code, which may or may not be malicious, why are Unicode
confusables worth this level of angst and concern?
def 𝚑𝓮𝖑𝒍𝑜():
try:
𝔥e𝗅𝕝𝚘︴ = "Hello"
𝕨𝔬r𝓵ᵈ﹎ = "World"
ᵖ𝖗𝐢𝘯𝓽(f"{𝗵e𝓵𝔩º_}, {𝖜ₒ𝒓lⅆ︴}!")
except 𝓣𝕪ᵖe𝖤𝗿ᵣ𝖔𝚛 as ⅇ𝗑c:
𝒑rℹₙₜ("failed: {}".𝕗𝗼ʳᵐªt(ᵉ𝐱𝓬))
if _︴ⁿ𝓪𝑚𝕖__ == "__main__":
𝒉eℓˡ𝗈()
# snippet from unittest/util.py
_𝓟Ⅼ𝖠𝙲𝗘ℋ𝒪Lᴰ𝑬𝕽﹏𝕷𝔼𝗡 = 12
def _𝔰ʰ𝓸ʳ𝕥𝙚𝑛(𝔰, p𝑟𝔢fi𝖝𝕝𝚎𝑛, sᵤ𝑓𝗳𝗂𝑥𝗹ₑ𝚗):
ˢ𝗸i𝗽 = 𝐥e𝘯(𝖘) - pr𝚎𝖋𝐢x𝗅ᵉ𝓷 - 𝒔𝙪ffi𝘅𝗹𝙚ₙ
if ski𝘱 > _𝐏𝗟𝖠𝘊𝙴H𝕺L𝕯𝙀𝘙﹏L𝔈𝒩:
𝘴 = '%s[%d chars]%s' % (𝙨[:𝘱𝐫𝕖𝑓𝕚xℓ𝒆𝕟], ₛ𝚔𝒊p, 𝓼[𝓁𝒆𝖓(𝚜) - 𝙨𝚞𝒇fix𝙡ᵉ𝘯:])
return ₛ
On Sat, Nov 13, 2021 at 5:04 PM <pt...@austin.rr.com> wrote:
def 𝚑𝓮𝖑𝒍𝑜():
return ₛ
0_o color me impressed, I did not think that would be legal syntax. Would be interesting to include in a textbook, if for nothing else other than to academically demonstrate that it is possible, as I suspect many are not aware.
It's interesting that the egalitarian wish to allow use of native "alphabetics" has turned out to be such a viper's nest.
Particular thanks to Stephen J. Turnbull for his thoughtful and well-informed contribution above.Kind regards,Steve
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Code of Conduct: http://python.org/psf/codeofconduct/