[Python-Dev] pre-PEP: Unicode Security Considerations for Python


Petr Viktorin

Nov 1, 2021, 8:23:20 AM
to pytho...@python.org
Hello,
Today, an attack called "Trojan Source" was revealed, where a malicious
contributor can use Unicode features (bidirectional text and homoglyphs)
to write code that, when shown in an editor, will look different from how a
computer language parser will process it.
See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.

This is not a bug in Python.
As far as I know, the Python Security Response team reviewed the report
and decided that it should be handled in code editors, diff viewers,
repository frontends and similar software, rather than in the language.

I agree: in my opinion, the attack is similar to abusing any other
"gotcha" where Python doesn't parse text as a non-expert human would.
For example: `if a or b == 'yes'`, mutable default arguments, or a
misleading typo.

Nevertheless, I did do a bit of research about similar gotchas in
Python, and I'd like to publish a summary as an informational PEP,
pasted below.



> PEP: 9999
> Title: Unicode Security Considerations for Python
> Author: Petr Viktorin <enc...@gmail.com>
> Status: Active
> Type: Informational
> Content-Type: text/x-rst
> Created: 01-Nov-2021
> Post-History:
>
> Abstract
> ========
>
> This document explains possible ways to misuse Unicode to write Python
> programs that appear to do something other than what they actually do.
>
> This document does not give any recommendations or solutions.
>
>
> Introduction
> ============
>
> Python code is written in `Unicode`_ – a system for encoding and
> handling all kinds of written language.
> While this allows programmers from all around the world to express themselves,
> it also allows writing code that is potentially confusing to readers.
>
> It is possible to misuse Python's Unicode-related features to write code that
> *appears* to do something other than what it does.
> Evildoers could take advantage of this to trick code reviewers into
> accepting malicious code.
>
> The possible issues generally can't be solved in Python itself without
> excessive restrictions of the language.
> They should be solved in code editors and review tools
> (such as *diff* displays), by enforcing project-specific policies,
> and by raising awareness among individual programmers.
>
> This document purposefully does not give any solutions
> or recommendations: it is rather a list of things to keep in mind.
>
> This document is specific to Python.
> For general security considerations in Unicode text, see [tr36]_ and [tr39]_.
>
>
> Acknowledgement
> ===============
>
> Investigation for this document was prompted by [CVE-2021-42574]_,
> *Trojan Source Attacks* reported by Nicholas Boucher and Ross Anderson,
> which focuses on Bidirectional override characters in a variety of languages.
>
>
> Confusing Features
> ==================
>
> This section lists some Unicode-related features that can be surprising
> or misusable.
>
>
> ASCII-only Considerations
> -------------------------
>
> ASCII is a subset of Unicode.
>
> While issues with the ASCII character set are generally well understood,
> they're presented here to help better understand the non-ASCII cases.
>
> Confusables and Typos
> '''''''''''''''''''''
>
> Some characters look alike.
> Before the age of computers, most mechanical typewriters lacked the keys for
> the digits ``0`` and ``1``: users typed ``O`` (capital o) and ``l``
> (lowercase L) instead. Human readers could tell them apart by context only.
> In programming languages, however, the distinction between digits and letters is
> critical -- and most fonts designed for programmers make it easy to tell them
> apart.
>
> Similarly, the uppercase “I” and lowercase “l” can look similar in fonts
> designed for human languages, but programmers' fonts make them noticeably
> different.
>
> However, what is “noticeably” different always depends on the context.
> Humans tend to ignore details in longer identifiers: the variable name
> ``accessibi1ity_options`` can still look indistinguishable from
> ``accessibility_options``, while they are distinct for the compiler.
>
> The same can be said for plain typos: most humans will not notice the typo in
> ``responsbility_chain_delegate``.
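To see how invisible such a swap is to the language, here is a small illustrative sketch (the variable names are hypothetical):

```python
# The two names below differ only in that the second uses the digit "1"
# in place of the letter "l" -- easy to miss in a longer identifier.
accessibility_options = {"contrast": "high"}
accessibi1ity_options = {"contrast": "low"}   # note the digit 1

# To Python these are two unrelated variables:
assert accessibility_options != accessibi1ity_options
print(accessibility_options["contrast"])  # -> high
```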
>
> Control Characters
> ''''''''''''''''''
>
> Python generally considers ``CR`` (``\r``), ``LF`` (``\n``), and ``CR-LF``
> pairs (``\r\n``) as end-of-line characters.
> Most code editors do as well, but there are editors that display “non-native”
> line endings as unknown characters (or nothing at all), rather than ending
> the line, displaying this example::
>
>     # Don't call this function:
>     fire_the_missiles()
>
> as a harmless comment like::
>
>     # Don't call this function:⬛fire_the_missiles()
>
> CPython treats the control character NUL (``\0``) as end of input,
> but many editors simply skip it, possibly showing code that Python will not
> run as a regular part of a file.
>
> Some characters can be used to hide/overwrite other characters when source is
> listed in common terminals:
>
> * BS (``\b``, Backspace) moves the cursor back, so the character after it
>   will overwrite the character before.
> * CR (``\r``, carriage return) moves the cursor to the start of line,
>   subsequent characters overwrite the start of the line.
> * DEL (``\x7F``) commonly initiates escape codes which allow arbitrary
>   control of the terminal.
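A review tool can catch such tricks simply by inspecting the raw text rather than its rendering. A minimal sketch (the hidden-code content is made up for illustration):

```python
# A line that a CR-honoring terminal displays misleadingly:
# the text before "\r" is overwritten at display time, but the file
# on disk still contains it, and Python still parses it.
line = "x = 1  # set up\rrun_evil()  "

# repr() makes the control character visible to a human reviewer:
print(repr(line))

# A simple automated check: flag any ASCII control character other
# than the usual whitespace (tab, newline, carriage return at EOL).
suspicious = [ch for ch in line if ord(ch) < 0x20 and ch not in "\t\n"]
print(suspicious)  # -> ['\r']
```

A linter could run a check like this on every source line before review.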
>
>
> Confusable Characters in Identifiers
> ------------------------------------
>
> Python allows characters of almost all scripts – from Latin letters to ancient Egyptian
> hieroglyphs – in identifiers (such as variable names).
> See :pep:`3131` for details and rationale.
> Only “letters and numbers” are allowed (see `Identifiers and keywords`_
> for details), so while ``γάτα`` is a valid Python identifier, ``🐱`` is not.
> Non-printing control characters are also not allowed.
>
> However, within the allowed set there is a large number of “confusables”.
> For example, the uppercase versions of the Latin `b`, Greek `β` (Beta), and
> Cyrillic `в` (Ve) often look identical: ``B``, ``Β`` and ``В``, respectively.
>
> This allows identifiers that look the same to humans, but not to Python.
> For example, all of the following are distinct identifiers:
>
> * ``scope`` (Latin, ASCII-only)
> * ``scоpe`` (with a Cyrillic `о`)
> * ``scοpe`` (with a Greek `ο`)
> * ``ѕсоре`` (all Cyrillic letters)
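The point is easy to verify; a small sketch using the look-alike names above (the assigned string values are just labels for illustration):

```python
# Four identifiers that render (near-)identically but are distinct to Python:
scope = "Latin"
scоpe = "Cyrillic o"      # contains U+043E CYRILLIC SMALL LETTER O
scοpe = "Greek omicron"   # contains U+03BF GREEK SMALL LETTER OMICRON
ѕсоре = "all Cyrillic"    # every letter is Cyrillic

# As strings they are four different values, so NFKC normalization
# (which Python applies to identifiers) does not merge them either:
assert len({"scope", "scоpe", "scοpe", "ѕсоре"}) == 4
print(scope, scоpe)  # -> Latin Cyrillic o
```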
>
> Additionally, some letters can look like non-letters:
>
> * The letter for the Hawaiian *ʻokina* looks like an apostrophe;
>   ``ʻHelloʻ`` is a Python identifier, not a string.
> * The East Asian symbol for *ten* looks like a plus sign,
>   so ``十= 10`` is a complete Python statement.
>
> .. note::
>
>    The converse also applies – some symbols look like letters – but since
>    Python does not allow arbitrary symbols in identifiers, this is not an
>    issue.
>
>
> Confusable Digits
> ------------------
>
> Numeric literals in Python only use the ASCII digits 0-9 (and non-digits such
> as ``.`` or ``e``).
>
> However, when numbers are converted from strings, such as in the ``int`` and
> ``float`` constructors or by the ``str.format`` method, any decimal digit
> can be used. For example ``߅`` (``NKO DIGIT FIVE``) or ``௫``
> (``TAMIL DIGIT FIVE``) work as the digit ``5``.
>
> Some scripts include digits that look similar to ASCII ones, but have a
> different value. For example::
>
>     >>> int('৪୨')
>     42
>     >>> '{٥}'.format('zero', 'one', 'two', 'three', 'four', 'five')
>     'five'
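A linter-style check for such digits can be sketched with the standard ``unicodedata`` module (the helper name is made up for illustration):

```python
import unicodedata

def non_ascii_digits(s):
    """Return (character, Unicode name) pairs for non-ASCII digits in s."""
    return [(ch, unicodedata.name(ch)) for ch in s
            if ch.isdigit() and not ch.isascii()]

# The example from above: BENGALI DIGIT FOUR followed by ORIYA DIGIT TWO.
assert int('৪୨') == 42

suspicious = non_ascii_digits('৪୨')
print(suspicious[0][1])  # -> BENGALI DIGIT FOUR
print(suspicious[1][1])  # -> ORIYA DIGIT TWO
```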
>
>
> Bidirectional Text
> ------------------
>
> Some scripts, such as Hebrew or Arabic, are written right-to-left.
> Phrases in such scripts interact with nearby text in ways that can be
> surprising to people who aren't familiar with these writing systems and their
> computer representation.
>
> The exact process is complicated, and explained in Unicode® Standard Annex #9,
> "Unicode Bidirectional Algorithm".
>
> Some surprising examples include:
>
> * In the statement ``ערך = 23``, the variable ``ערך`` is set to the integer 23.
>
> * In the statement ``قيمة = ערך``, the variable ``قيمة`` is set
>   to the value of ``ערך``.
>
> * In the statement ``قيمة - (ערך ** 2)``, the value of ``ערך`` is squared and
>   then subtracted from ``قيمة``.
>   The *opening* parenthesis is displayed as ``)``.
>
> * In the following, the second line is the same as the first, except
>   ``A`` is replaced by the Hebrew ``א``. Both assign a 100-character string to
>   the variable ``s``.
>   Note how the symbols and numbers between Hebrew characters are shown in
>   reverse order::
>
>     s = "A" * 100 # "A" is assigned
>
>     s = "א" * 100 # "א" is assigned
>
>
> Bidirectional Marks, Embeddings, Overrides and Isolates
> -------------------------------------------------------
>
> The rules for determining the direction of text do not always yield the
> intended results, so Unicode provides several ways to alter it.
>
> The most basic are **directional marks**, which are invisible but affect text
> as a left-to-right (or right-to-left) character would.
> Following with the example above, in the next example the ``A``/``א`` is
> replaced by the Latin ``x`` followed or preceded by a
> right-to-left mark (``U+200F``). This assigns a 200-character string to ``s``
> (100 copies of `x` interspersed with 100 invisible marks)::
>
>     s = "x‏" * 100 # "‏x" is assigned
>
> The directional **embedding**, **override** and **isolate** characters
> are also invisible, but affect the ordering of all text after them until either
> ended by a dedicated character, or until the end of line.
> (Unicode specifies the effect to last until the end of a “paragraph” (see [tr9]_),
> but allows tools to interpret newline characters as paragraph ends
> (see [u5.8]_). Most code editors and terminals do so.)
>
> These characters essentially allow arbitrary reordering of the text that
> follows them. Python only allows them in strings and comments, which does limit
> their potential (especially in combination with the fact that Python's comments
> always extend to the end of a line), but it doesn't render them harmless.
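Review tools can scan for these formatting characters directly; a minimal sketch of such a scanner (the character list and function name are illustrative, not exhaustive):

```python
import unicodedata

# Explicit bidirectional formatting characters (a representative subset):
BIDI_CONTROLS = set(map(chr, [
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,   # embeddings/overrides + PDF
    0x2066, 0x2067, 0x2068, 0x2069,           # isolates + PDI
    0x200E, 0x200F,                            # LRM / RLM marks
]))

def find_bidi_controls(source):
    """Return (index, Unicode name) for each bidi control in the text."""
    return [(i, unicodedata.name(ch)) for i, ch in enumerate(source)
            if ch in BIDI_CONTROLS]

# The string-assignment example from above, with its invisible mark:
code = 's = "x\u200f" * 100'
print(find_bidi_controls(code))  # -> [(6, 'RIGHT-TO-LEFT MARK')]
```

A check like this is essentially what the "Trojan Source" mitigations in code hosts and compilers perform.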
>
>
> Normalizing identifiers
> -----------------------
>
> Python strings are collections of *Unicode codepoints*, not “characters”.
>
> For reasons like compatibility with earlier encodings, Unicode often has
> several ways to encode what is essentially a single “character”.
> For example, these are all different ways of writing ``Å`` as a Python string,
> each of which is unequal to the others:
>
> * ``"\N{LATIN CAPITAL LETTER A WITH RING ABOVE}"`` (1 codepoint)
> * ``"\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}"`` (2 codepoints)
> * ``"\N{ANGSTROM SIGN}"`` (1 codepoint, but different)
>
> For another example, the ligature ``ﬁ`` has a dedicated Unicode codepoint,
> even though it has the same meaning as the two letters ``fi``.
>
> Also, common letters frequently have several distinct variations.
> Unicode provides them for contexts where the difference has some semantic
> meaning, like mathematics. For example, some variations of ``n`` are:
>
> * ``n`` (LATIN SMALL LETTER N)
> * ``𝐧`` (MATHEMATICAL BOLD SMALL N)
> * ``𝘯`` (MATHEMATICAL SANS-SERIF ITALIC SMALL N)
> * ``ｎ`` (FULLWIDTH LATIN SMALL LETTER N)
> * ``ⁿ`` (SUPERSCRIPT LATIN SMALL LETTER N)
>
> Unicode includes algorithms to *normalize* variants like these to a single
> form, and Python identifiers are normalized.
> (There are several normal forms; Python uses ``NFKC``.)
>
> For example, ``xn`` and ``xⁿ`` are the same identifier in Python::
>
>     >>> xⁿ = 8
>     >>> xn
>     8
>
> … as are ``ﬁ`` and ``fi``, and as are the different ways to encode ``Å``.
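These equivalences can be checked directly with the standard ``unicodedata`` module:

```python
import unicodedata

# NFKC normalization, as applied to Python identifiers:
assert unicodedata.normalize('NFKC', 'xⁿ') == 'xn'   # superscript n -> n
assert unicodedata.normalize('NFKC', 'ﬁ') == 'fi'    # ligature -> two letters

# The three spellings of Å from above all normalize to one codepoint:
forms = {unicodedata.normalize('NFKC', s) for s in (
    '\N{LATIN CAPITAL LETTER A WITH RING ABOVE}',
    '\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}',
    '\N{ANGSTROM SIGN}',
)}
print(len(forms))  # -> 1
```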
>
> This normalization applies *only* to identifiers, however.
> Functions that treat strings as identifiers, such as ``getattr``,
> do not perform normalization::
>
>     >>> class Test:
>     ...     def ﬁnalize(self):
>     ...         print('OK')
>     ...
>     >>> Test().ﬁnalize()
>     OK
>     >>> Test().finalize()
>     OK
>     >>> getattr(Test(), 'ﬁnalize')
>     Traceback (most recent call last):
>       ...
>     AttributeError: 'Test' object has no attribute 'ﬁnalize'
>
> This also applies when importing:
>
> * ``import ﬁnalization`` performs normalization, and looks for a file
>   named ``finalization.py`` (and other ``finalization.*`` files).
> * ``importlib.import_module("ﬁnalization")`` does not normalize,
>   so it looks for a file named ``ﬁnalization.py``.
>
> Some filesystems independently apply normalization and/or case folding.
> On some systems, ``ﬁnalization.py``, ``finalization.py`` and
> ``FINALIZATION.py`` are three distinct filenames; on others, some or all
> of these can name the same file.
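A tool could warn when a name in source is not already in NFKC form, so that what a reviewer sees matches what Python parses. A sketch (requires Python 3.8+ for ``unicodedata.is_normalized``; the function name is made up for illustration):

```python
import unicodedata

def check_identifier(name):
    """Return a warning string if the name is not in NFKC form, else None."""
    if not unicodedata.is_normalized('NFKC', name):
        normalized = unicodedata.normalize('NFKC', name)
        return f'{name!r} is parsed as {normalized!r}'
    return None

# An already-normalized name passes:
assert check_identifier('finalize') is None

# The ligature spelling is flagged:
print(check_identifier('ﬁnalize'))  # -> 'ﬁnalize' is parsed as 'finalize'
```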
>
>
> Source Encoding
> ---------------
>
> The encoding of Python source files is given by a specific regex on the first
> two lines of a file, as per `Encoding declarations`_.
> This mechanism is very liberal in what it accepts, and thus easy to obfuscate.
>
> This can be misused in combination with Python-specific special-purpose
> encodings (see `Text Encodings`_).
> For example, with ``encoding: unicode_escape``, characters like
> quotes or braces can be hidden in an (f-)string, with many tools (syntax
> highlighters, linters, etc.) considering them part of the string.
> For example::
>
>     # For writing Japanese, you don't need an editor that supports
>     # UTF-8 source encoding: unicode_escape sequences work just as well.
>
>     import os
>
>     message = '''
>     This is "Hello World" in Japanese:
>     \u3053\u3093\u306b\u3061\u306f\u7f8e\u3057\u3044\u4e16\u754c
>
>     This runs `echo WHOA` in your shell:
>     \u0027\u0027\u0027\u002c\u0028\u006f\u0073\u002e
>     \u0073\u0079\u0073\u0074\u0065\u006d\u0028
>     \u0027\u0065\u0063\u0068\u006f\u0020\u0057\u0048\u004f\u0041\u0027
>     \u0029\u0029\u002c\u0027\u0027\u0027
>     '''
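The declaration is recognized by a regex roughly like the one below (simplified from the pattern described in the language reference; treat it as an approximation). Note how the seemingly innocuous comment from the example above matches:

```python
import re

# A simplified version of the encoding-declaration pattern:
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')

line = "# UTF-8 source encoding: unicode_escape sequences work just as well."
m = CODING_RE.match(line)
print(m.group(1))  # -> unicode_escape
```

Because the pattern only needs ``coding[:=]`` to appear *somewhere* in a leading comment, an encoding declaration can hide in prose.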
>
>
> Open Issues
> ===========
>
> We should probably write and publish:
>
> * Recommendations for Text Editors and Code Tools
> * Recommendations for Programmers and Teams
> * Possible Improvements in Python
>
>
> References
> ==========
>
> .. _Unicode: https://home.unicode.org/
> .. _`Encoding declarations`: https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations
> .. _`Identifiers and keywords`: https://docs.python.org/3/reference/lexical_analysis.html#identifiers
> .. _`Text Encodings`: https://docs.python.org/3/library/codecs.html#text-encodings
> .. [u5.8] http://www.unicode.org/versions/Unicode14.0.0/ch05.pdf#G10213
> .. [tr9] http://www.unicode.org/reports/tr9/
> .. [tr36] Unicode Technical Report #36: Unicode Security Considerations
>    http://www.unicode.org/reports/tr36/
> .. [tr39] Unicode® Technical Standard #39: Unicode Security Mechanisms
>    http://www.unicode.org/reports/tr39/
> .. [CVE-2021-42574] CVE-2021-42574
>    https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-42574
>
>
> Copyright
> =========
>
> This document is placed in the public domain or under the
> CC0-1.0-Universal license, whichever is more permissive.
>
>
>
> ..
> Local Variables:
> mode: indented-text
> indent-tabs-mode: nil
> sentence-end-double-space: t
> fill-column: 70
> coding: utf-8
> End:
>
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/6DBJJRQHA2SP5Q27MOMDSTCOXMW7ITNR/
Code of Conduct: http://python.org/psf/codeofconduct/

Steven D'Aprano

Nov 1, 2021, 9:43:53 AM
to pytho...@python.org
Thanks for writing this Petr!

A few comments below.

On Mon, Nov 01, 2021 at 01:17:02PM +0100, Petr Viktorin wrote:

> >ASCII-only Considerations
> >-------------------------
> >
> >ASCII is a subset of Unicode
> >
> >While issues with the ASCII character set are generally well understood,
> >the're presented here to help better understanding of the non-ASCII cases.

You should mention that some very common typefaces (fonts) are more
confusable than others. For instance, Arial (a common font on Windows
systems) makes the two letter combination 'rn' virtually
indistinguishable from the single letter 'm'.


> >Before the age of computers, most mechanical typewriters lacked the keys
> >for the digits ``0`` and ``1``

I'm not sure that "most" is justified here. One of the most popular
typewriters in history, the Underwood #5 (from 1900 to 1920), lacked
the 1 key but had a 0 distinct from O.

https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-no-5-standard-typewriter-circa-1901.jpg

The Oliver 5 (1894 – 1928) had both a 0 and a 1, as did the 1895 Ford
Typewriter. As did possibly the best selling typewriter in history, the
IBM Selectric (introduced in 1961).

http://www.technocrazed.com/the-interesting-history-of-evolution-of-typewriters-photo-gallery

Perhaps you should say "many older mechanical typewriters"?


> >Bidirectional Text
> >------------------

The section on bidirectional text is interesting, because reading it in
my email client mutt, all the examples are left to right.

You might like to note that not all applications support bidirectional
text.


> >Unicode includes alorithms to *normalize* variants like these to a
> >single form, and Python identifiers are normalized.

Typo: "algorithms".



This is a good and useful document, thank you again.


--
Steve
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/CHGK6LLBMVRQ6GGEMRWYJNRLUL7KUMVS/

Serhiy Storchaka

Nov 1, 2021, 1:36:17 PM
to pytho...@python.org
This is excellent!

01.11.21 14:17, Petr Viktorin wrote:
>> CPython treats the control character NUL (``\0``) as end of input,
>> but many editors simply skip it, possibly showing code that Python
>> will not
>> run as a regular part of a file.

It is an implementation detail and we will get rid of it. It only
happens when you read the Python script from a file. If you import it as
a module or run with runpy, the NUL character is an error.

>> Some characters can be used to hide/overwrite other characters when
>> source is
>> listed in common terminals:
>>
>> * BS (``\b``, Backspace) moves the cursor back, so the character after it
>>   will overwrite the character before.
>> * CR (``\r``, carriage return) moves the cursor to the start of line,
>>   subsequent characters overwrite the start of the line.
>> * DEL (``\x7F``) commonly initiates escape codes which allow arbitrary
>>   control of the terminal.

ESC (``\x1B``) starts many control sequences.

``\x1A`` (Ctrl-Z) marks the end of a text file on Windows. Some programs
(for example ``type``) ignore the rest of the file.

_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/CBI7ME3YUAVVH5B6LSC745GJSVUIZJHO/

Toshio Kuratomi

Nov 1, 2021, 2:42:58 PM
to Petr Viktorin, pytho...@python.org
This is an excellent enumeration of some of the concerns!

One minor comment about the introductory material:

On Mon, Nov 1, 2021 at 5:21 AM Petr Viktorin <enc...@gmail.com> wrote:

> >
> > Introduction
> > ============
> >
> > Python code is written in `Unicode`_ – a system for encoding and
> > handling all kinds of written language.

Unicode specifies the mapping of glyphs to code points. Then a second
mapping from code points to sequences of bytes is what is actually
recorded by the computer. The second mapping is what programmers
using Python will commonly think of as the encoding while the majority
of what you're writing about has more to do with the first mapping.
I'd try to word this in a way that doesn't lead a reader to conflate
those two mappings.

Maybe something like this?

`Unicode`_ is a system for handling all kinds of written language.
It aims to allow any character from any human natural language (as
well as a few characters which are not from natural languages) to be
used. Python code may consist of almost all valid Unicode characters.

> > While this allows programmers from all around the world to express themselves,
> > it also allows writing code that is potentially confusing to readers.
> >

-Toshio
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/Q2T3GKC6R6UH5O7RZJJNREG3XQDDZ6N4/

Jim J. Jewett

Nov 1, 2021, 7:48:30 PM
to pytho...@python.org
"The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete Python statement."

Normally, an identifier must begin with a letter, and numbers can only be used in the second and subsequent positions. (XID_CONTINUE instead of XID_START) The fact that some characters with numeric values are considered letters (in this case, category Lo, Other Letters) is a different problem than just looking visually confusable with "+", and it should probably be listed on its own.
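Jim's observation can be verified directly with ``unicodedata``:

```python
import unicodedata

# "十" has a numeric value but is categorized as a letter (Lo, Other Letter),
# so it can *start* an identifier:
assert unicodedata.category('十') == 'Lo'
assert unicodedata.numeric('十') == 10
assert '十'.isidentifier()

# An ASCII digit, by contrast, cannot start an identifier:
assert not '5x'.isidentifier()
print('十 is a valid identifier')
```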

-jJ
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/RV7RU7DGWFIBEGFKNYDP63ZRJNP5Y4YU/

Terry Reedy

Nov 1, 2021, 9:08:54 PM
to pytho...@python.org
On 11/1/2021 8:17 AM, Petr Viktorin wrote:

> Nevertheless, I did do a bit of research about similar gotchas in
> Python, and I'd like to publish a summary as an informational PEP,
> pasted below.

Very helpful.

>> Bidirectional Text
>> ------------------
>>
>> Some scripts, such as Hebrew or Arabic, are written right-to-left.

[Suggested addition, subject to further revision.]

There are at least three levels of handling r2l chars: none, local
(contiguous sequences are properly reversed), and extended (see below).
The handling depends on the display software and may depend on the
quoting. Tk and hence tkinter (and IDLE) text widgets do local handling.
Windows Notepad++ does local handling of unquoted code but extended
handling of quoted text. Windows Notepad currently does extended
handling even without quotes.

In extended handling, phrases ...

>> Phrases in such scripts interact with nearby text in ways that can be
>> surprising to people who aren't familiar with these writing systems
>> and their
>> computer representation.
>>
>> The exact process is complicated, and explained in Unicode® Standard
>> Annex #9,
>> "Unicode Bidirectional Algorithm".
>>
>> Some surprising examples include:
>>
>> * In the statement ``ערך = 23``, the variable ``ערך`` is set to the
>> integer 23.

In local handling, one sees <hebrew-rtl> = 23`. In extended handling,
one sees 23 = <hebrew-rtl>. (Notepad++ sees backticks as quotes.)


>> Source Encoding
>> ---------------
>>
>> The encoding of Python source files is given by a specific regex on
>> the first
>> two lines of a file, as per `Encoding declarations`_.
>> This mechanism is very liberal in what it accepts, and thus easy to
>> obfuscate.
>>
>> This can be misused in combination with Python-specific special-purpose
>> encodings (see `Text Encodings`_).


Are `Encoding declarations`_ and `Text Encodings`_ supposed to link to
something?


>> For example, with ``encoding: unicode_escape``, characters like
>> quotes or braces can be hidden in an (f-)string, with many tools (syntax
>> highlighters, linters, etc.) considering them part of the string.
>> For example::

I don't see the connection between the text above and the example that
follows.

>>     # For writing Japanese, you don't need an editor that supports
>>     # UTF-8 source encoding: unicode_escape sequences work just as well.
[etc]


--
Terry Jan Reedy
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/34JROXNUHEUDC4TOWUAM74KIGIRRHHG4/

Steven D'Aprano

Nov 1, 2021, 9:21:58 PM
to pytho...@python.org
On Mon, Nov 01, 2021 at 11:41:06AM -0700, Toshio Kuratomi wrote:

> Unicode specifies the mapping of glyphs to code points. Then a second
> mapping from code points to sequences of bytes is what is actually
> recorded by the computer. The second mapping is what programmers
> using Python will commonly think of as the encoding while the majority
> of what you're writing about has more to do with the first mapping.

I don't think that is correct.

According to the Unicode consortium -- and I hope that they would know
*wink* -- Unicode is the universal character encoding. In other words:

"Unicode provides a unique number for every character"

https://www.unicode.org/standard/WhatIsUnicode.html

Not glyphs.

("Character" in natural language is a bit of a fuzzy concept, so I think
that Unicode here is referring to what their glossary calls an abstract
character.)

The usual meaning of glyph is for the graphical images used
by fonts (typefaces) for display. Sense 2 in the Unicode glossary here:

https://www.unicode.org/glossary/#glyph

I'm not really sure what they mean by sense 1, unless they mean a
representative glyph, which is intended to stand in as an example of the
entire range of glyphs.

Unicode does not specify what the glyphs for code points are, although
it does provide representative samples. See, for example, their comment
on emoji:

"The Unicode Consortium provides character code charts that show a
representative glyph"

http://www.unicode.org/faq/emoji_dingbats.html

Their code point charts likewise show representative glyphs for other
letters and symbols, not authoritative. And of course, many abstract
characters do not have glyphs at all, e.g. invisible joiners, control
characters, variation selectors, noncharacters, etc.

The mapping from bytes to code points and abstract characters is also
part of Unicode. The UTF encodings are part of Unicode:

https://www.unicode.org/faq/utf_bom.html#gen2

The "U" in UTF literally stands for Unicode :-)


--
Steve
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/I7ZRNIHSQ7UL4NSKOXFRYBYHQEXGNBPA/

Petr Viktorin

Nov 2, 2021, 10:04:28 AM
to pytho...@python.org, st...@pearwood.info, a.ba...@gmail.com, jimjj...@gmail.com, p.sc...@guideline.com, tjr...@udel.edu
On 01. 11. 21 13:17, Petr Viktorin wrote:
> Hello,
> Today, an attack called "Trojan source" was revealed, where a malicious
> contributor can use Unicode features (left-to-right text and homoglyphs)
> to code that, when shown in an editor, will look different from how a
> computer language parser will process it.
> See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.
>
> This is not a bug in Python.
> As far as I know, the Python Security Response team reviewed the report
> and decided that it should be handled in code editors, diff viewers,
> repository frontends and similar software, rather than in the language.
>
> I agree: in my opinion, the attack is similar to abusing any other
> "gotcha" where Python doesn't parse text as a non-expert human would.
> For example: `if a or b == 'yes'`, mutable default arguments, or a
> misleading typo.
>
> Nevertheless, I did do a bit of research about similar gotchas in
> Python, and I'd like to publish a summary as an informational PEP,
> pasted below.


Thanks for the comments, everyone! I've updated the document and sent it
to https://github.com/python/peps/pull/2129
A rendered version is at
https://github.com/encukou/peps/blob/pep-0672/pep-0672.rst



Toshio Kuratomi wrote:
> `Unicode`_ is a system for handling all kinds of written language.
> It aims to allow any character from any human natural language (as
> well as a few characters which are not from natural languages) to be
> used. Python code may consist of almost all valid Unicode characters.

Thanks! That's a nice summary; I condensed it a bit more and used it.
(I'm not joining the conversation on glyphs, characters, codepoints and
encodings -- that's much too technical for this document. Using the
specific technical terms unfortunately doesn't help understanding, so I
use the vague ones like "character" and "letter".)


Jim J. Jewett wrote:
>> "The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete Python statement."
>
> Normally, an identifier must begin with a letter, and numbers can only be used in the second and subsequent positions. (XID_CONTINUE instead of XID_START) The fact that some characters with numeric values are considered letters (in this case, category Lo, Other Letters) is a different problem than just looking visually confusable with "+", and it should probably be listed on its own.

I'm not a native speaker, but as I understand it, "十" is closer to a
single-letter word than a single-digit number. It translates better as
"ten" than "10". (And it appears in "十四", "fourteen", just like "four"
appears in "fourteen".)


Patrick Schultz wrote:
> - The Unicode consortium has a list of confusables, in case useful

Yup, and it's linked from the documents that describe how to use it. I
link to those rather than just the list.
But thank you!


Terry Reedy wrote:
>>> Bidirectional Text
>>> ------------------
>>>
>>> Some scripts, such as Hebrew or Arabic, are written right-to-left.
>
> [Suggested addition, subject to further revision.]
>
> There are at least three levels of handling r2l chars: none, local (contiguous sequences are properly reversed), and extended (see below). The handling depends on the display software and may depend on the quoting. Tk and hence tkinter (and IDLE) text widgets do local handing. Windows Notepad++ does local handling of unquoted code but extending handling of quoted text. Windows Notepad currently does extended handling even without quotes.

I'd like to leave these details out of the document. The examples should
render convincingly in browsers. The text should now describe the
behavior even if you open it in an editor that does things differently,
and acknowledge that such editors exist. (The behavior of specific
editors/toolkits might well change in the future.)

>>> For example, with ``encoding: unicode_escape``, characters like
>>> quotes or braces can be hidden in an (f-)string, with many tools (syntax
>>> highlighters, linters, etc.) considering them part of the string.
>>> For example::
>
> I don't see the connection between the text above and the example that follows.
>
>>> # For writing Japanese, you don't need an editor that supports
>>> # UTF-8 source encoding: unicode_escape sequences work just as well.
> [etc]

Let me know if it's clear in the newest version, with this note:

> Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> declaration. The ``unicode_escape`` encoding instructs Python to treat
> ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
> a comma (punctuator), etc.


Steven D'Aprano wrote:

>>> Before the age of computers, most mechanical typewriters lacked the keys
>>> for the digits ``0`` and ``1``
>
> I'm not sure that "most" is justifed here. One of the most popular
> typewriters in history, the Underwood #5 (from 1900 to 1920), lacked
> the 1 key but had a 0 distinct from O.
>
> https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-no-5-standard-typewriter-circa-1901.jpg
>
> The Oliver 5 (1894 – 1928) had both a 0 and a 1, as did the 1895 Ford
> Typewriter. As did possibly the best selling typewriter in history, the
> IBM Selectric (introduced in 1961).
>
> http://www.technocrazed.com/the-interesting-history-of-evolution-of-typewriters-photo-gallery
>
> Perhaps you should say "many older mechanical typewriters"?
>
>

Ah, interesting! I only ever saw and read about ones that have a bunch
of accented letters, leaving no space for dedicated 0/1 keys :)
My typewriter looks like this: https://imgur.com/a/J34gqVZ

>>> Bidirectional Text
>>> ------------------
>
> The section on bidirectional text is interesting, because reading it in
> my email client mutt, all the examples are left to right.
>
> You might like to note that not all applications support bidirectional
> text.

It might be handled by your terminal rather than mutt.
I made the text work even if the examples don't render the way I'd like.


_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/OB6C54HCBESUTANUVOTTIUI7N2IYDPQV/

Petr Viktorin

Nov 2, 2021, 10:17:55 AM
to Serhiy Storchaka, pytho...@python.org


On 01. 11. 21 18:32, Serhiy Storchaka wrote:
> This is excellent!
>
> 01.11.21 14:17, Petr Viktorin пише:
>>> CPython treats the control character NUL (``\0``) as end of input,
>>> but many editors simply skip it, possibly showing code that Python
>>> will not
>>> run as a regular part of a file.
>
> It is an implementation detail and we will get rid of it. It only
> happens when you read the Python script from a file. If you import it as
> a module or run with runpy, the NUL character is an error.

That brings us to possible changes in Python in this area, which is an
interesting topic.

As for \0, can we ban all ASCII & C1 control characters except
whitespace? I see no place for them in source code.
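A check along those lines is easy to sketch with the standard `unicodedata` module (a rough illustration of the idea, not a proposed implementation):

```python
import unicodedata

def control_chars(source, allowed="\t\n\r\f"):
    """Locate ASCII/C1 control characters (Unicode category Cc)
    other than the whitespace Python already accepts."""
    return [(i, "U+%04X" % ord(c)) for i, c in enumerate(source)
            if unicodedata.category(c) == "Cc" and c not in allowed]

print(control_chars("x = 1\0y = 2"))  # [(5, 'U+0000')] -- the embedded NUL
```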


For homoglyphs/confusables, should there be a SyntaxWarning when an
identifier looks like ASCII but isn't?

For right-to-left text: does anyone actually name identifiers in
Hebrew/Arabic? AFAIK, we should allow a few non-printing
"joiner"/"non-joiner" characters to make it possible to use all Arabic
words. But it would be great to consult with users/teachers of the
languages.
Should Python run the bidi algorithm when parsing and disallow reordered
tokens? Maybe optionally?
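For illustration, the bidi classes present on a line can be inspected with `unicodedata.bidirectional`; the explicit override/embedding controls are the main Trojan-source ingredient (a sketch only, not a parser proposal):

```python
import unicodedata

# Explicit bidi controls: embeddings, overrides, isolates and their pops.
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def bidi_report(line):
    classes = {unicodedata.bidirectional(ch) for ch in line}
    return {
        "has_rtl_text": bool(classes & {"R", "AL", "AN"}),   # Hebrew/Arabic etc.
        "has_bidi_controls": bool(classes & BIDI_CONTROLS),  # U+202A..U+2069
    }

print(bidi_report("x = 'a\u202eb'"))  # flags the RIGHT-TO-LEFT OVERRIDE
```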
Serhiy Storchaka

Nov 2, 2021, 11:57:35 AM
to Petr Viktorin, pytho...@python.org
02.11.21 16:16, Petr Viktorin пише:
> As for \0, can we ban all ASCII & C1 control characters except
> whitespace? I see no place for them in source code.

All control characters except CR, LF, TAB and FF are banned outside
comments and string literals. I think it is worth banning them in
comments and string literals too. In string literals you can use
backslash-escape sequences, and comments should be human readable; there
is no reason to include control characters in them. There is a
precedent of emitting warnings for some invalid escapes in strings.


> For homoglyphs/confusables, should there be a SyntaxWarning when an
> identifier looks like ASCII but isn't?

It would virtually ban Cyrillic. There are a lot of Cyrillic letters
which look like Latin letters, and there are complete words written in
Cyrillic which by accident look like other words written in Latin.

This is a job for linters, which can have many options for configuring
acceptable scripts, use spelling dictionaries and dictionaries of
homoglyphs, etc.
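Such a linter check could, for instance, fold identifiers through a homoglyph map and flag the ones that read as ASCII. A minimal sketch (the map below is a tiny illustrative subset, not Unicode's real confusables data):

```python
import ast

# A few Cyrillic lookalikes; a real linter would use the full
# confusables.txt data published by Unicode.
HOMOGLYPHS = {"\u0430": "a", "\u0435": "e", "\u043e": "o",
              "\u0440": "p", "\u0441": "c", "\u0445": "x"}

def ascii_lookalikes(source):
    """Return identifiers that are non-ASCII but fold to pure ASCII."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and not node.id.isascii():
            folded = "".join(HOMOGLYPHS.get(ch, ch) for ch in node.id)
            if folded.isascii():
                hits.add(node.id)
    return hits

# "s\u0441\u043ere" renders as "score" but contains Cyrillic letters.
print(ascii_lookalikes("s\u0441\u043ere = 1"))
```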
Chris Angelico

Nov 2, 2021, 12:10:02 PM
to pytho...@python.org
On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin <enc...@gmail.com> wrote:
> Let me know if it's clear in the newest version, with this note:
>
> > Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> > declaration. The ``unicode_escape`` encoding instructs Python to treat
> > ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
> > a comma (punctuator), etc.
>

Huh. Is that level of generality actually still needed? Can Python
deprecate all but a small handful of encodings?

ChrisA
Jim J. Jewett

Nov 2, 2021, 12:52:25 PM
to pytho...@python.org
Serhiy Storchaka wrote:
> 02.11.21 16:16, Petr Viktorin пише:
> > As for \0, can we ban all ASCII & C1 control characters except
> > whitespace? I see no place for them in source code.

> All control characters except CR, LF, TAB and FF are banned outside
> comments and string literals. I think it is worth to ban them in
> comments and string literals too. In string literals you can use
> backslash-escape sequences, and comments should be human readable, there
> are no reason to include control characters in them.

If escape sequences were also allowed in comments (or at least in strings within comments), this would make sense. I don't like banning them otherwise, since odd characters are often a good reason to need a comment, but it is definitely a "mention, not use" situation.

> > For homoglyphs/confusables, should there be a SyntaxWarning when an
> > identifier looks like ASCII but isn't?
> > It would virtually ban Cyrillic. There is a lot of Cyrillic letters
> which look like Latin letters, and there are complete words written in
> Cyrillic which by accident look like other words written in Latin.

At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _).

Simplicity won, in part because of existing practice in EMACS scripting, particularly with some Asian languages.

> It is a work for linters, which can have many options for configuring
> acceptable scripts, use spelling dictionaries and dictionaries of
> homoglyphs, etc.

It might be time for the documentation to mention a specific linter/configuration that does this. It also might be reasonable to do by default in IDLE or even the interactive shell.

-jJ
Marc-Andre Lemburg

Nov 2, 2021, 1:04:08 PM
to Petr Viktorin, pytho...@python.org
On 01.11.2021 13:17, Petr Viktorin wrote:
>> PEP: 9999
>> Title: Unicode Security Considerations for Python
>> Author: Petr Viktorin <enc...@gmail.com>
>> Status: Active
>> Type: Informational
>> Content-Type: text/x-rst
>> Created: 01-Nov-2021
>> Post-History:

Thanks for writing this up. I'm not sure whether a PEP is the right place
for such documentation, though. Wouldn't it be more visible in the standard
Python documentation?

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Nov 02 2021)
>>> Python Projects, Coaching and Support ... https://www.egenix.com/
>>> Python Product Development ... https://consulting.egenix.com/
________________________________________________________________________

::: We implement business ideas - efficiently in both time and costs :::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
https://www.egenix.com/company/contact/
https://www.malemburg.com/

David Mertz, Ph.D.

Nov 2, 2021, 2:05:45 PM
to Marc-Andre Lemburg, Python-Dev
This is an amazing document, Petr. Really great work!

I think I agree with Marc-André that putting it in the actual Python documentation would give it more visibility than in a PEP. 

Chris Angelico

Nov 2, 2021, 2:17:26 PM
to Python-Dev
On Wed, Nov 3, 2021 at 5:07 AM David Mertz, Ph.D. <david...@gmail.com> wrote:
>
> This is an amazing document, Petr. Really great work!
>
> I think I agree with Marc-André that putting it in the actual Python documentation would give it more visibility than in a PEP.
>

There are quite a few other PEPs that have similar sorts of advice,
like PEP 257 on docstrings, and several of the type hinting PEPs. IMO
it's fine.

ChrisA
Terry Reedy

Nov 2, 2021, 5:07:05 PM
to pytho...@python.org
On 11/2/2021 1:02 PM, Marc-Andre Lemburg wrote:
> On 01.11.2021 13:17, Petr Viktorin wrote:
>>> PEP: 9999
>>> Title: Unicode Security Considerations for Python
>>> Author: Petr Viktorin <enc...@gmail.com>
>>> Status: Active
>>> Type: Informational
>>> Content-Type: text/x-rst
>>> Created: 01-Nov-2021
>>> Post-History:
>
> Thanks for writing this up. I'm not sure whether a PEP is the right place
> for such documentation, though. Wouldn't it be more visible in the standard
> Python documentation ?

There is already a "Unicode HOWTO". We could add "Unicode problems and
pitfalls".


--
Terry Jan Reedy

Steven D'Aprano

Nov 2, 2021, 8:09:02 PM
to pytho...@python.org
On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
> On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin <enc...@gmail.com> wrote:
> > Let me know if it's clear in the newest version, with this note:
> >
> > > Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> > > declaration. The ``unicode_escape`` encoding instructs Python to treat
> > > ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
> > > a comma (punctuator), etc.
> >
>
> Huh. Is that level of generality actually still needed? Can Python
> deprecate all but a small handful of encodings?

To be clear, are you proposing to deprecate the encodings *completely*
or just as the source code encoding?

Personally, I think that using obscure encodings as the source encoding
is one of those "linters and code reviews should check it" issues.

Besides, now that I've learned about this unicode_escape encoding, I
think that's going to be *awesome* for winning obfuscated Python
competitions! *wink*


--
Steve
Chris Angelico

Nov 2, 2021, 8:25:13 PM
to python-dev
On Wed, Nov 3, 2021 at 11:09 AM Steven D'Aprano <st...@pearwood.info> wrote:
>
> On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
> > On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin <enc...@gmail.com> wrote:
> > > Let me know if it's clear in the newest version, with this note:
> > >
> > > > Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> > > > declaration. The ``unicode_escape`` encoding instructs Python to treat
> > > > ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
> > > > a comma (punctuator), etc.
> > >
> >
> > Huh. Is that level of generality actually still needed? Can Python
> > deprecate all but a small handful of encodings?
>
> To be clear, are you proposing to deprecate the encodings *completely*
> or just as the source code encoding?

Only source code encodings. Obviously we still need to be able to cope
with all manner of *data*, but Python source code shouldn't need to be
in bizarre, weird encodings.

(Honestly, I'd love to just require that Python source code be UTF-8,
but that would probably cause problems, so mandating that it be one of
a small set of encodings would be a safer option.)

> Personally, I think that using obscure encodings as the source encoding
> is one of those "linters and code reviews should check it" issues.
>
> Besides, now that I've learned about this unicode_escape encoding, I
> think that's going to be *awesome* for winning obfuscated Python
> competitions! *wink*

TBH, I'm not entirely sure how valid it is to talk about *security*
considerations when we're dealing with Python source code and variable
confusions, but that's a term that is well understood.

But to the extent that it is a security concern, it's not one that
linters can really cope with. I'm not sure how a linter would stop
someone from publishing code on PyPI that causes confusion by its
character encoding, for instance.

ChrisA
Kyle Stanley

Nov 2, 2021, 10:00:35 PM
to david...@gmail.com, Marc-Andre Lemburg, Python-Dev
I'd suggest both: a briefer, easier-to-read write-up for the average user in the docs, and more details/semantics in an informational PEP. Thanks for working on this, Petr!

Jim J. Jewett

Nov 3, 2021, 12:17:24 AM
to pytho...@python.org
Chris Angelico wrote:
> I'm not sure how a linter would stop
> someone from publishing code on PyPI that causes confusion by its
> character encoding, for instance.

If it becomes important, the cheeseshop backend can run various validations (including a linter) on submissions, and include those results in the display template.
Stephen J. Turnbull

Nov 3, 2021, 1:21:08 AM
to Serhiy Storchaka, pytho...@python.org
Serhiy Storchaka writes:
> This is excellent!
>
> 01.11.21 14:17, Petr Viktorin пише:
> >> CPython treats the control character NUL (``\0``) as end of input,
> >> but many editors simply skip it, possibly showing code that Python
> >> will not
> >> run as a regular part of a file.
>
> It is an implementation detail and we will get rid of it.

You can't, probably not for a decade, because people will be running
versions of Python released before you change it. I hope this PEP
will address Python as it is as well as as it will be.

> It only happens when you read the Python script from a file.

Which is one of the likely vectors for malware. It might be worth
teaching virus checkers about this, for example.

Stephen J. Turnbull

Nov 3, 2021, 1:28:07 AM
to Serhiy Storchaka, pytho...@python.org
Serhiy Storchaka writes:

> All control characters except CR, LF, TAB and FF are banned outside
> comments and string literals. I think it is worth to ban them in
> comments and string literals too.

+1

> > For homoglyphs/confusables, should there be a SyntaxWarning when an
> > identifier looks like ASCII but isn't?
>
> It would virtually ban Cyrillic.

+1 (for the comment and for the implied -1 on SyntaxWarning, let's
keep the Cyrillic repertoire in Python!)

> It is a work for linters,

+1

Aside from the reasons Serhiy presents, I'd rather not tie
this kind of rather ambiguous improvement in Unicode handling to the
release cycle.

It might be worth having a pep9999 module/script in Python (perhaps
more likely, PyPI but maintained by whoever does the work to make
these improvements + Petr or somebody Petr trusts to do it), that
lints scripts specifically for confusables and other issues.

Steve
Stephen J. Turnbull

Nov 3, 2021, 1:45:57 AM
to Jim J. Jewett, pytho...@python.org
Jim J. Jewett writes:

> At the time, we considered it, and we also considered a narrower
> restriction on using multiple scripts in the same identifier, or at
> least the same identifier portion (so it was OK if separated by
> _).

This would ban "παν語", aka "pango". That's arguably a good idea
(IMO, 0.9 wink), but might make some GTK/GNOME folks sad.

> Simplicity won, in part because of existing practice in EMACS
> scripting, particularly with some Asian languages.

Interesting. I maintained a couple of Emacs libraries (dictionaries
and input methods) for Japanese in XEmacs, and while hyphen-separated
mixtures of ASCII and Japanese are common, I don't recall ever seeing
an identifier with ASCII and Japanese glommed together without a
separator. It was almost always of the form "English verb - Japanese
lexical component". Or do you consider that "relatively complicated"?

> It might be time for the documentation to mention a specific
> linter/configuration that does this. It also might be reasonable
> to do by default in IDLE or even the interactive shell.

It would have to be easy to turn off, perhaps even provide
instructions in the messages. I would guess that for code that uses
it at all, it would be common. So the warnings would likely make
those tools somewhere between really annoying and unusable.

Stephen J. Turnbull

Nov 3, 2021, 2:14:22 AM
to Chris Angelico, pytho...@python.org
Chris Angelico writes:

> Huh. Is that level of generality actually still needed? Can Python
> deprecate all but a small handful of encodings?

I think that's pointless. With few exceptions (GB18030, Big5 has a
couple of code point pairs that encode the same very rare characters,
ISO 2022 extensions) you're not going to run into the confuseables
problem, and AFAIK the only generic BIDI solution is Unicode (the ISO
8859 encodings of Hebrew and Arabic do not have direction markers).

What exactly are you thinking?

The only thing I'd like to see is to rearrange the codec aliases so
that the "common names" would denote the maximal repertoires in each
family (gb denotes gb18030, sjis denotes shift_jisx0213, etc) as in
the WhatWG recommendations for web browsers. But that's probably too
backward incompatible to fly.
Chris Angelico

Nov 3, 2021, 2:55:10 AM
to pytho...@python.org
On Wed, Nov 3, 2021 at 5:12 PM Stephen J. Turnbull
<stephenj...@gmail.com> wrote:
>
> Chris Angelico writes:
>
> > Huh. Is that level of generality actually still needed? Can Python
> > deprecate all but a small handful of encodings?
>
> I think that's pointless. With few exceptions (GB18030, Big5 has a
> couple of code point pairs that encode the same very rare characters,
> ISO 2022 extensions) you're not going to run into the confuseables
> problem, and AFAIK the only generic BIDI solution is Unicode (the ISO
> 8859 encodings of Hebrew and Arabic do not have direction markers).
>
> What exactly are you thinking?

You'll never eliminate confusables (even ASCII has some, depending on
font). But I was surprised to find that Python would let you use
unicode_escape for source code.



# coding: unicode_escape

x = '''

Code example:

\u0027\u0027\u0027 # format in monospaced on the web site

print("Did you think this would be executed?")

\u0027\u0027\u0027 # end monospaced

Surprise!
'''

print("There are %d lines in x." % len(x.split(chr(10))))



With some carefully-crafted comments, a lot of human readers will
ignore the magic tokens. It's not uncommon to put example code into
triple-quoted strings, and it's also not all that surprising when
simplified examples do things that you wouldn't normally want done
(like monkeypatching other modules), since it's just an example, after
all.

I don't have access to very many editors, but SciTE, VS Code, nano,
and the GitHub gist display all syntax-highlighted this as if it were
a single large string. Only Idle showed it as code in between, and
that's because it actually decoded it using the declared character
coding, so the magic lines showed up with actual apostrophes.

Maybe the phrase "a small handful" was a bit too hopeful, but would it
be possible to mandate (after, obviously, a deprecation period) that
source encodings be ASCII-compatible?

ChrisA
Serhiy Storchaka

Nov 3, 2021, 4:00:17 AM
to pytho...@python.org
02.11.21 18:49, Jim J. Jewett пише:
> If escape sequences were also allowed in comments (or at least in strings within comments), this would make sense. I don't like banning them otherwise, since odd characters are often a good reason to need a comment, but it is definitely a "mention, not use" situation.

If you mean backslash-escaped sequences like \uXXXX, there is no reason
to ban them in comments. Unlike in Java, they have no special meaning
outside of string literals. But if you mean terminal control sequences
(which change color or move the cursor), they should not be allowed in comments.

> At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _).

I implemented these restrictions in one of my projects. The character set
was limited, and even this did not solve all issues with homoglyphs.

I think that we should not introduce such arbitrary limitations at the
parser level and should leave it to linters.

Stephen J. Turnbull

Nov 3, 2021, 5:03:11 AM
to Chris Angelico, pytho...@python.org
Chris Angelico writes:

> But I was surprised to find that Python would let you use
> unicode_escape for source code.

I'm not surprised. Today it's probably not necessary, but I've
exchanged a lot of code (not Python, though) with folks whose editors
were limited to 8 bit codes or even just ASCII. It wasn't frequent
that I needed to discuss non-ASCII code with them (that they needed to
run) but it would have been painful to do without some form of codec
that encoded Japanese using only ASCII bytes.

> Maybe the phrase "a small handful" was a bit too hopeful, but would it
> be possible to mandate (after, obviously, a deprecation period) that
> source encodings be ASCII-compatible?

Not sure what you mean there. In the usual sense of ASCII-compatible
(the ASCII bytes always mean the corresponding character in the ASCII
encoding), I think there are at least two ASCII-incompatible encodings
that would cause a lot of pain if they were prohibited, specifically
Shift JIS and Big5. (In certain contexts in those encodings an ASCII
byte frequently is a trailing byte in a multibyte character.) I'm sure
there is a ton of legacy Python code in those encodings in East Asia,
some of which is still maintained in the original encoding. And of
course UTF-16 is incompatible in that sense, although I don't know if
anybody actually saves Python code in UTF-16.
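The Shift JIS case is easy to demonstrate: many double-byte characters have a second byte in the ASCII range, the notorious 0x5C (backslash) problem. A quick illustration:

```python
# U+8868 (表) encodes in Shift JIS as 0x95 0x5C; the trailing byte is
# the ASCII code for backslash, so naive byte-level scanning for ASCII
# punctuation misfires on such ASCII-incompatible encodings.
data = "\u8868".encode("shift_jis")
print(data)  # b'\x95\\'
```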

It might make sense to prohibit unicode_escape nowadays -- I think
almost all systems now can handle Unicode properly, but I don't think
we can go farther than that.

Serhiy Storchaka

Nov 3, 2021, 5:19:16 AM
to pytho...@python.org
03.11.21 11:01, Stephen J. Turnbull пише:
> And of
> course UTF-16 is incompatible in that sense, although I don't know if
> anybody actually saves Python code in UTF-16.

CPython does not currently support UTF-16 for source files.

Chris Angelico

Nov 3, 2021, 5:48:23 AM
to pytho...@python.org
On Wed, Nov 3, 2021 at 8:01 PM Stephen J. Turnbull
<stephenj...@gmail.com> wrote:
>
> Chris Angelico writes:
>
> > But I was surprised to find that Python would let you use
> > unicode_escape for source code.
>
> I'm not surprised. Today it's probably not necessary, but I've
> exchanged a lot of code (not Python, though) with folks whose editors
> were limited to 8 bit codes or even just ASCII. It wasn't frequent
> that I needed to discuss non-ASCII code with them (that they needed to
> run) but it would have been painful to do without some form of codec
> that encoded Japanese using only ASCII bytes.

Bearing in mind that string literals can always have their own
escapes, this feature is really only important to the source code
tokens themselves.

> > Maybe the phrase "a small handful" was a bit too hopeful, but would it
> > be possible to mandate (after, obviously, a deprecation period) that
> > source encodings be ASCII-compatible?
>
> Not sure what you mean there. In the usual sense of ASCII-compatible
> (the ASCII bytes always mean the corresponding character in the ASCII
> encoding), I think there are at least two ASCII-incompatible encodings
> that would cause a lot of pain if they were prohibited, specifically
> Shift JIS and Big5. (In certain contexts in those encodings an ASCII
> byte frequently is a trailing byte in a multibyte character.)

Ah, okay, so much for that, then. What about the weaker sense:
Characters below 128 are always and only represented by those byte
values? So if you find byte value 39, it might not actually be an
apostrophe, but if you're looking for an apostrophe, you know for sure
that it'll be represented by byte value 39?

> It might make sense to prohibit unicode_escape nowadays -- I think
> almost all systems now can handle Unicode properly, but I don't think
> we can go farther than that.
>

Yes. I'm sure someone will come along and say "but I have to have an
all-ASCII source file, directly runnable, with non-ASCII variable
names", because XKCD 1172, but I don't have enough sympathy for that
obscure situation to want the mess that unicode_escape can give.

ChrisA
Marc-Andre Lemburg

Nov 3, 2021, 6:12:45 AM
to Chris Angelico, python-dev
On 03.11.2021 01:21, Chris Angelico wrote:
> On Wed, Nov 3, 2021 at 11:09 AM Steven D'Aprano <st...@pearwood.info> wrote:
>>
>> On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
>>> On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin <enc...@gmail.com> wrote:
>>>> Let me know if it's clear in the newest version, with this note:
>>>>
>>>>> Here, ``encoding: unicode_escape`` in the initial comment is an encoding
>>>>> declaration. The ``unicode_escape`` encoding instructs Python to treat
>>>>> ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
>>>>> a comma (punctuator), etc.
>>>>
>>>
>>> Huh. Is that level of generality actually still needed? Can Python
>>> deprecate all but a small handful of encodings?
>>
>> To be clear, are you proposing to deprecate the encodings *completely*
>> or just as the source code encoding?
>
> Only source code encodings. Obviously we still need to be able to cope
> with all manner of *data*, but Python source code shouldn't need to be
> in bizarre, weird encodings.
>
> (Honestly, I'd love to just require that Python source code be UTF-8,
> but that would probably cause problems, so mandating that it be one of
> a small set of encodings would be a safer option.)

Most Python code will be written in UTF-8 going forward, but there's
still a lot of code out there in other encodings. Limiting this
to some reduced set doesn't really make sense, since it's not
clear where to draw the line.

Coming back to the thread topic, many of the Unicode security
considerations don't apply to non-Unicode encodings, since those
usually don't support e.g. changing the bidi direction within a
stream of text or other interesting features you have in Unicode
such as combining code points, invisible (space) code points, font
rendering hint code points, etc.

So in a sense, those non-Unicode encodings are safer than
using UTF-8 :-)

Please also note that most character lookalikes are not encoding
issues, but instead font issues, which then result in the characters
looking similar.

There are fonts which are designed to avoid this, and it's no
surprise that source code fonts typically make e.g. 0 and O, as well
as 1 and l, look different enough to tell apart.

Things get a lot harder when dealing with combining characters, since
it's not always easy to spot the added diacritics, e.g. try
this:

>>> print ('a\u0348bc') # strong articulation
a͈bc
>>> print ('a\u034Fbc') # combining grapheme joiner
a͏bc

The latter is only "visible" in the unicode_escape encoding:

>>> print ('a\u034Fbc'.encode('unicode_escape'))
b'a\\u034fbc'

Projects wanting to limit code encoding settings, disallow using
bidi markers and other special code points in source code, can easily
do this via e.g. pre-commit hooks, special editor settings, code
linters or security scanners.

I don't think limiting the source code encoding is the right approach
to making code more secure. Instead, tooling has to be used to detect
potentially malicious code points in code.
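For illustration, such a detection pass can be quite small. The sketch below is a hypothetical helper (not any particular linter): it flags the bidirectional controls abused in the Trojan Source attack plus other format characters. A real tool would allowlist the joiner characters that Arabic and Persian identifiers legitimately need.

```python
import sys
import unicodedata

# Bidirectional control characters abused in the "Trojan Source" attack.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def suspicious_code_points(source):
    """Yield (line, column, codepoint, name) for bidi controls and other
    format characters (category Cf) found in the source text."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS or unicodedata.category(ch) == "Cf":
                yield lineno, col, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            for lineno, col, cp, name in suspicious_code_points(f.read()):
                print(f"{path}:{lineno}:{col}: {cp} {name}")
```

Run over a file list, this is essentially what a pre-commit hook or CI check would do.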

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Nov 03 2021)
>>> Python Projects, Coaching and Support ... https://www.egenix.com/
>>> Python Product Development ... https://consulting.egenix.com/
________________________________________________________________________

::: We implement business ideas - efficiently in both time and costs :::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
https://www.egenix.com/company/contact/
https://www.malemburg.com/


Paul Moore

Nov 3, 2021, 6:27:34 AM
to Marc-Andre Lemburg, python-dev
On Wed, 3 Nov 2021 at 10:11, Marc-Andre Lemburg <m...@egenix.com> wrote:
> I don't think limiting the source code encoding is the right approach
> to making code more secure. Instead, tooling has to be used to detect
> potentially malicious code points in code.

+1

Discussing "making code more secure" without being clear on what the
threat model is, is always going to be inconclusive. In this case, I
believe the threat model is "an untrusted 3rd party submitting a PR
which potentially contains malicious code to a Python project". For
that threat, I think the correct approach is for core Python to
promote awareness (via this PEP and maybe something in the docs
themselves) and for projects to implement appropriate code checks that
are run against all PRs to flag this sort of issue.

What threat can't be addressed at a per-project level, but *can* be
addressed in core Python (without triggering so many false positives
that people are trained to ignore the warnings or work around the
prohibitions, defeating the purpose of the change)?

Paul

Petr Viktorin

Nov 3, 2021, 6:40:23 AM
to pytho...@python.org
On 03. 11. 21 2:58, Kyle Stanley wrote:
> I'd suggest both: briefer, easier to read write up for average user in
> docs, more details/semantics in informational PEP. Thanks for working on
> this, Petr!

Well, this is the brief write-up :)
Maybe it would work better if the info was integrated into the relevant
parts of the docs, rather than be a separate HOWTO.

I went with an informational PEP because it's quicker to publish.

>
> On Tue, Nov 2, 2021 at 2:07 PM David Mertz, Ph.D. <david...@gmail.com
> <mailto:david...@gmail.com>> wrote:
>
> This is an amazing document, Petr. Really great work!
>
> I think I agree with Marc-André that putting it in the actual Python
> documentation would give it more visibility than in a PEP.
>
> On Tue, Nov 2, 2021, 1:06 PM Marc-Andre Lemburg <m...@egenix.com
> <mailto:m...@egenix.com>> wrote:
>
> On 01.11.2021 13:17, Petr Viktorin wrote:
> >> PEP: 9999
> >> Title: Unicode Security Considerations for Python
> >> Author: Petr Viktorin <enc...@gmail.com
> <mailto:enc...@gmail.com>>
> >> Status: Active
> >> Type: Informational
> >> Content-Type: text/x-rst
> >> Created: 01-Nov-2021
> >> Post-History:
>
> Thanks for writing this up. I'm not sure whether a PEP is the
> right place
> for such documentation, though. Wouldn't it be more visible in
> the standard
> Python documentation ?
>
>

Steven D'Aprano

Nov 3, 2021, 6:47:20 AM
to pytho...@python.org
On Tue, Nov 02, 2021 at 05:55:55PM +0200, Serhiy Storchaka wrote:

> All control characters except CR, LF, TAB and FF are banned outside
> comments and string literals. I think it is worth to ban them in
> comments and string literals too. In string literals you can use
> backslash-escape sequences, and comments should be human readable, there
> are no reason to include control characters in them. There is a
> precedence of emitting warnings for some superficial escapes in strings.

Agreed. I don't think there is any good reason for including control
characters (apart from whitespace) in comments.

In strings, I would consider allowing VT (vertical tab) as well, that is
whitespace.

>>> '\v'.isspace()
True

But I don't have a strong opinion on that.


[Petr]
> > For homoglyphs/confusables, should there be a SyntaxWarning when an
> > identifier looks like ASCII but isn't?

Let's not enshrine as a language "feature" that non Western European
languages are dangerous second-class citizens.


> It would virtually ban Cyrillic. There is a lot of Cyrillic letters
> which look like Latin letters, and there are complete words written in
> Cyrillic which by accident look like other words written in Latin.

Agreed.


> It is a work for linters, which can have many options for configuring
> acceptable scripts, use spelling dictionaries and dictionaries of
> homoglyphs, etc.

Linters and editors. I have no objection to people using editors that
highlight non-ASCII characters in blinking red letters, so long as I can
turn that option off :-)



--
Steve

Steven D'Aprano

Nov 3, 2021, 7:22:45 AM
to pytho...@python.org
On Wed, Nov 03, 2021 at 11:21:53AM +1100, Chris Angelico wrote:

> TBH, I'm not entirely sure how valid it is to talk about *security*
> considerations when we're dealing with Python source code and variable
> confusions, but that's a term that is well understood.

It's not like Unicode is the only way to write obfuscated code,
malicious or otherwise.


> But to the extent that it is a security concern, it's not one that
> linters can really cope with. I'm not sure how a linter would stop
> someone from publishing code on PyPI that causes confusion by its
> character encoding, for instance.

Do we require that PyPI prevents people from publishing code that causes
confusion by its poorly written code and obfuscated and confusing
identifiers?

The linter is to *flag the issue* during, say, code review or before
running the code, like other code quality issues.

If you're just running random code you downloaded from the internet
using pip, then Unicode confusables are the least of your worries.

I'm not really sure why people get so uptight about Unicode confusables,
while being blasé about the opportunities to smuggle malicious code into
pure ASCII code.

https://en.wikipedia.org/wiki/Underhanded_C_Contest

Is it unfamiliarity? Worse? "Real programmers write identifiers in
English." And the ironic thing is, while it is very difficult indeed for
automated checkers to detect underhanded code in ASCII, it is trivially
easier for editors, linters and other tools to spot the sort of Unicode
confusables we're talking about here. But we spend all our energy
worrying about the minor issue, and almost none on the broader problem
of malicious code in general.

I'm pretty sure I could upload a library to PyPI that included

os.system('rm -rf .')

and nobody would blink an eye, but if I write:

A = 1
А = 2
Α = 3
print(A, А, Α)

everyone goes insane. Let's keep the threat in perspective. Writing an
informational PEP for the education of people is a great idea. Rushing
into making wholesale changes to the interpreter, not so much.
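The three lookalike characters in that snippet can be told apart with the standard `unicodedata` module, which is exactly the kind of check an editor or linter could surface:

```python
import unicodedata

# The three visually identical capital "A"s from the example above:
for ch in "A\u0410\u0391":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0041 LATIN CAPITAL LETTER A
# U+0410 CYRILLIC CAPITAL LETTER A
# U+0391 GREEK CAPITAL LETTER ALPHA
```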


--
Steve

Steven D'Aprano

Nov 3, 2021, 7:26:55 AM
to pytho...@python.org
On Wed, Nov 03, 2021 at 11:11:00AM +0100, Marc-Andre Lemburg wrote:

> Coming back to the thread topic, many of the Unicode security
> considerations don't apply to non-Unicode encodings, since those
> usually don't support e.g. changing the bidi direction within a
> stream of text or other interesting features you have in Unicode
> such as combining code points, invisible (space) code points, font
> rendering hint code points, etc.
>
> So in a sense, those non-Unicode encodings are safer than
> using UTF-8 :-)

Thank you MAL for that timely reminder that most encodings are not
Unicode. I have to admit that I often forget that there is a whole
universe of non-Unicode, non-ASCII encodings.


> Please also note that most character lookalikes are not encoding
> issues, but instead font issues, which then result in the characters
> looking similar.

+1


--
Steve

Serhiy Storchaka

Nov 3, 2021, 7:38:19 AM
to pytho...@python.org
03.11.21 12:36, Petr Viktorin пише:
> On 03. 11. 21 2:58, Kyle Stanley wrote:
>> I'd suggest both: briefer, easier to read write up for average user in
>> docs, more details/semantics in informational PEP. Thanks for working
>> on this, Petr!
>
> Well, this is the brief write-up :)
> Maybe it would work better if the  info was integrated into the relevant
> parts of the docs, rather than be a separate HOWTO.
>
> I went with an informational PEP because it's quicker to publish.

What is the supposed target audience of this document? If it is core
Python developers only, then PEP is the right place to publish it. But I
think that it rather describes potential issues in arbitrary Python
project, and as such, it will be more accessible as a part of the Python
documentation (as a HOW-TO article perhaps). AFAIK all other
informational PEPs are about developing Python, not developing in Python
(even if they are (mis)used (e.g. PEP 8) outside their scope).


Chris Angelico

Nov 3, 2021, 7:45:39 AM
to python-dev
On Wed, Nov 3, 2021 at 10:22 PM Steven D'Aprano <st...@pearwood.info> wrote:
>
> On Wed, Nov 03, 2021 at 11:21:53AM +1100, Chris Angelico wrote:
>
> > TBH, I'm not entirely sure how valid it is to talk about *security*
> > considerations when we're dealing with Python source code and variable
> > confusions, but that's a term that is well understood.
>
> It's not like Unicode is the only way to write obfuscated code,
> malicious or otherwise.
>
>
> > But to the extent that it is a security concern, it's not one that
> > linters can really cope with. I'm not sure how a linter would stop
> > someone from publishing code on PyPI that causes confusion by its
> > character encoding, for instance.
>
> Do we require that PyPI prevents people from publishing code that causes
> confusion by its poorly written code and obfuscated and confusing
> identifiers?
>
> The linter is to *flag the issue* during, say, code review or before
> running the code, like other code quality issues.
>
> If you're just running random code you downloaded from the internet
> using pip, then Unicode confusables are the least of your worries.
>
> I'm not really sure why people get so uptight about Unicode confusables,
> while being blasé about the opportunities to smuggle malicious code into
> pure ASCII code.
>

Right, which is why I was NOT talking about confusables. I don't
consider them to be a particularly Unicode-related threat, although
the larger range of available characters does make it more plausible
than in ASCII.

But I do see a problem with code where most editors misrepresent the
code, where abuse of a purely ASCII character encoding for purely
ASCII code can cause all kinds of tooling issues. THAT is a more
viable attack vector, since code reviewers will be likely to assume
that their syntax highlighting is correct.

And yes, I'm aware that Python can't be expected to cope with poor
tools, but when *many* well-known tools have the same problem, one
must wonder who should be solving the issue.

ChrisA
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/WMSJLYG5YQ7SMNHXKSXNEMM7UKKIARCN/

Petr Viktorin

Nov 3, 2021, 8:32:52 AM
to Stephen J. Turnbull, Serhiy Storchaka, pytho...@python.org
We seem to agree that this is work for linters. That's reasonable; I'd
generalize it to "tools and policies". But even so, discussing what we'd
expect linters to do is on topic here.
Perhaps we can even find ways for the language to support linters --
type checking is also for external tools, but has language support.

For example: should the parser emit a lightweight audit event if it
finds a non-ASCII identifier? (See below for why ASCII is special.)
Or for encoding declarations?

On 03. 11. 21 6:26, Stephen J. Turnbull wrote:
> Serhiy Storchaka writes:
>
> > All control characters except CR, LF, TAB and FF are banned outside
> > comments and string literals. I think it is worth to ban them in
> > comments and string literals too.
>
> +1
>
> > > For homoglyphs/confusables, should there be a SyntaxWarning when an
> > > identifier looks like ASCII but isn't?
> >
> > It would virtually ban Cyrillic.
>
> +1 (for the comment and for the implied -1 on SyntaxWarning, let's
> keep the Cyrillic repertoire in Python!)

I don't think this would actually ban Cyrillic/Greek.
(My suggestion is not vanilla confusables detection; it might require
careful reading: "should there be a [linter] warning when an identifier
looks like ASCII but isn't?")

I am not a native speaker, but I did try a bit to find an actual
ASCII-like word in a language that uses Cyrillic. I didn't succeed; I
think they might be very rare.
Even if there was such a word -- or a one-letter abbreviation used as a
variable name -- it would be confusing to use. Removing the possibility
of confusion could *help* Cyrillic users. (I can't speak for them; this
is just a brainstorming idea.)
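A minimal sketch of that check (the homoglyph table here is a toy stand-in; real tools use the confusables data from Unicode TR39):

```python
# Toy homoglyph table: Cyrillic letters that render like Latin ones.
# Real tools use the confusables data from Unicode TR39 instead.
CYRILLIC_TO_LATIN = str.maketrans(
    "аеорсухАВЕКМНОРСТХ",
    "aeopcyxABEKMHOPCTX",
)

def looks_ascii_but_isnt(identifier):
    """True if the identifier is non-ASCII but maps to ASCII once
    lookalike letters are substituted."""
    if identifier.isascii():
        return False
    return identifier.translate(CYRILLIC_TO_LATIN).isascii()

print(looks_ascii_but_isnt("sсore"))   # True: the "с" is Cyrillic
print(looks_ascii_but_isnt("привет"))  # False: plainly non-ASCII
print(looks_ascii_but_isnt("score"))   # False: plain ASCII
```

Note how "привет" passes: it mixes letters that do and don't resemble ASCII, so it doesn't trip the check.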

Steven adds:
> Let's not enshrine as a language "feature" that non Western European
> languages are dangerous second-class citizens.

That would be going too far, yes, but the fact is that non-English
languages *are* second-class citizens. Code that uses Python keywords
and stdlib must use English, and possibly another language. It is the
mixing of languages that can be dangerous/confusing, not the languages
themselves.


>
> > It is a work for linters,
>
> +1
>
> Aside from the reasons Serhiy presents, I'd rather not tie
> this kind of rather ambiguous improvement in Unicode handling to the
> release cycle.
>
> It might be worth having a pep9999 module/script in Python (perhaps
> more likely, PyPI but maintained by whoever does the work to make
> these improvements + Petr or somebody Petr trusts to do it), that
> lints scripts specifically for confusables and other issues.

If I have any say in it, the name definitely won't include a PEP number ;)

Petr Viktorin

Nov 3, 2021, 8:46:02 AM
to pytho...@python.org
This is a very good point. Let's not point fingers, but figure out how
to make users' lives easier together :)


This was the first time I was "in" on an embargoed "issue", and let me
tell you, I was surprised by the amount of time spent on polishing the
messaging. Now, you can't reasonably twist all this into a "Python is
insecure" or "Company X products are insecure" headline, which is good,
but with that out of the way we can focus on *what* could be improved
over *where* the improvement could be and who should do it.

Petr Viktorin

Nov 3, 2021, 8:51:33 AM
to pytho...@python.org


On 03. 11. 21 12:33, Serhiy Storchaka wrote:
> 03.11.21 12:36, Petr Viktorin пише:
>> On 03. 11. 21 2:58, Kyle Stanley wrote:
>>> I'd suggest both: briefer, easier to read write up for average user in
>>> docs, more details/semantics in informational PEP. Thanks for working
>>> on this, Petr!
>>
>> Well, this is the brief write-up :)
>> Maybe it would work better if the  info was integrated into the relevant
>> parts of the docs, rather than be a separate HOWTO.
>>
>> I went with an informational PEP because it's quicker to publish.
>
> What is the supposed target audience of this document?

Good question! At this point it looks like it's linter authors.

> If it is core
> Python developers only, then PEP is the right place to publish it. But I
> think that it rather describes potential issues in arbitrary Python
> project, and as such, it will be more accessible as a part of the Python
> documentation (as a HOW-TO article perhaps). AFAIK all other
> informational PEPs are about developing Python, not developing in Python
> (even if they are (mis)used (e.g. PEP 8) outside their scope).

There's a bunch of packaging PEPs, or a PEP on what the
/usr/bin/python command should be. I think PEP 672 is in good company
for now.

Stephen J. Turnbull

Nov 3, 2021, 9:15:55 AM
to Chris Angelico, pytho...@python.org
Chris Angelico writes:

> Ah, okay, so much for that, then. What about the weaker sense:
> Characters below 128 are always and only represented by those byte
> values? So if you find byte value 39, it might not actually be an
> apostrophe, but if you're looking for an apostrophe, you know for sure
> that it'll be represented by byte value 39?

1. The apostrophe that Python considers a string delimiter is always
represented by byte value 39 in the compiler input. So the only
time that wouldn't be true is if escape sequences are allowed to
represent characters. I believe unicode_escape is the only codec
that does.

2. There's always eval which will accept a string containing escape
sequences.
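A small demonstration of point 1, showing what the `unicode_escape` codec does to compiler input before the tokenizer runs:

```python
# The escape sequence \u0027 becomes a real apostrophe during decoding,
# before the tokenizer ever sees the text:
text = b"print(\\u0027hi\\u0027)".decode("unicode_escape")
print(text)  # print('hi')
```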

> Yes. I'm sure someone will come along and say "but I have to have an
> all-ASCII source file, directly runnable, with non-ASCII variable
> names", because XKCD 1172, but I don't have enough sympathy for that
> obscure situation to want the mess that unicode_escape can give.

It's not an obscure situation to me. As I wrote earlier, been there,
done that, made my own T-shirt. I don't *think* it matters today, but
the number of DOS machines and Windows 98 machines left in Japan is
not zero. Probably they can't run Python 3, but that's not something
I can testify to.


Serhiy Storchaka

Nov 3, 2021, 11:47:30 AM
to pytho...@python.org
03.11.21 14:31, Petr Viktorin пише:
> For example: should the parser emit a lightweight audit event if it
> finds a non-ASCII identifier? (See below for why ASCII is special.)
> Or for encoding declarations?

There are audit events for import and compile. You can also register
import hooks if you want fancier preprocessing than just
Unicode decoding. I do not think we need to add more specific audit
events; they were not designed for this.

And I think it is too late to detect suspicious code at the time of its
execution. It should be detected before adding that code to the code
base (review tools, pre-commit hooks).
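For what it's worth, the existing `compile` audit event can already be used for this kind of check. A hedged sketch (the hook and its reporting are illustrative, and Serhiy's caveat stands: this fires only when the code is already being compiled for execution):

```python
import sys

# Bidirectional control characters from the Trojan Source attack.
BIDI = set(map(chr, [0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
                     0x2066, 0x2067, 0x2068, 0x2069]))
findings = []

def hook(event, args):
    # The built-in "compile" audit event carries (source, filename);
    # the source may be str, bytes, an AST object, or None.
    if event == "compile":
        source = args[0]
        if isinstance(source, bytes):
            source = source.decode("utf-8", errors="replace")
        if isinstance(source, str) and BIDI & set(source):
            findings.append(args[1])

sys.addaudithook(hook)
compile("x = 1  # \u202e", "<demo>", "exec")
print(findings)  # ['<demo>']
```

Audit hooks cannot be removed once installed, which is another reason this fits monitoring better than linting.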

> I don't think this would actually ban Cyrillic/Greek.
> (My suggestion is not vanilla confusables detection; it might require
> careful reading: "should there be a [linter] warning when an identifier
> looks like ASCII but isn't?")

Yes, but it should be optional and configurable, and not be part of
the Python compiler. This is not our business as Python core developers.

> I am not a native speaker, but I did try a bit to find an actual
> ASCII-like word in a language that uses Cyrillic. I didn't succeed; I
> think they might be very rare.

With a simple script I have found 62 words common between English and
Ukrainian: гасу/racy, горе/rope, рима/puma, міх/mix, etc. But there are
many more English and Ukrainian words which contain only letters that
can be confused with letters from the other script. And identifiers can
contain abbreviations and shortenings, not all of which can be found in
dictionaries.
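The idea behind such a script can be sketched roughly like this (Serhiy's actual script is not shown; the tiny word lists and lookalike table below are illustrative stand-ins for real dictionaries):

```python
# The Cyrillic letters on the left render like the Latin letters on
# the right in most fonts.
LOOKALIKE = str.maketrans("асеіорухгми", "aceiopyxrmu")

english_words = {"racy", "rope", "puma", "mix"}
ukrainian_words = ["гасу", "горе", "рима", "міх"]

# Report Ukrainian words whose letters all map onto an English word:
for word in ukrainian_words:
    latin = word.translate(LOOKALIKE)
    if latin in english_words:
        print(f"{word} / {latin}")
```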

> Even if there was such a word -- or a one-letter abbreviation used as a
> variable name -- it would be confusing to use. Removing the possibility
> of confusion could *help* Cyrillic users. (I can't speak for them; this
> is just a brainstorming idea.)

I never used non-Latin identifiers in Python, but I guess that where
they are used (in schools?) there is a mix of English and non-English
identifiers, and identifiers consisting of parts of English and
non-English words without even an underscore between them. I know,
because in other languages they just use inconsistent transliteration.
Emitting any warning by default would be discrimination against
non-English users. It would be better not to add support for non-ASCII
identifiers in the first place.


Serhiy Storchaka

Nov 3, 2021, 11:59:43 AM
to pytho...@python.org
03.11.21 15:14, Stephen J. Turnbull пише:
> So the only
> time that wouldn't be true is if escape sequences are allowed to
> represent characters. I believe unicode_escape is the only codec
> that does.

Also raw_unicode_escape and utf_7. And maybe punycode or idna, I am not
sure.
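For example, UTF-7 can spell ASCII punctuation as a base64 run, so byte value 39 is not the only representation of an apostrophe:

```python
# Base64-encoded runs in UTF-7 can spell ASCII punctuation:
print(b"+ACc-".decode("utf_7"))  # '  (an apostrophe, with no byte value 39)
print(b"+AFw-".decode("utf_7"))  # \  (a backslash)
```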


Chris Jerdonek

Nov 3, 2021, 2:02:12 PM
to Petr Viktorin, pytho...@python.org
On Tue, Nov 2, 2021 at 7:21 AM Petr Viktorin <enc...@gmail.com> wrote:
That brings us to possible changes in Python in this  area, which is an
interesting topic.

Is there a use case or need for allowing the comment-starting character “#” to occur when text is still in the right-to-left direction? Disallowing that would prevent Petr’s examples in which active code is displayed after the comment mark, which to me seems to be one of the more egregious examples. Or maybe this case is no worse than others and isn’t worth singling out.

—Chris





As for \0, can we ban all ASCII & C1 control characters except
whitespace? I see no place for them in source code.


For homoglyphs/confusables, should there be a SyntaxWarning when an
identifier looks like ASCII but isn't?

For right-to-left text: does anyone actually name identifiers in
Hebrew/Arabic? AFAIK, we should allow a few non-printing
"joiner"/"non-joiner" characters to make it possible to use all Arabic
words. But it would be great to consult with users/teachers of the
languages.
Should Python run the bidi algorithm when parsing and disallow reordered
tokens? Maybe optionally?


Jim J. Jewett

Nov 3, 2021, 7:07:35 PM
to pytho...@python.org
Stephen J. Turnbull wrote:
> Jim J. Jewett writes:
> > At the time, we considered it, and we also considered a narrower
> > restriction on using multiple scripts in the same identifier, or at
> > least the same identifier portion (so it was OK if separated by
> > _).

> > This would ban "παν語", aka "pango". That's arguably a good idea
> (IMO, 0.9 wink), but might make some GTK/GNOME folks sad.

I am not quite motivated enough to search the archives, but I'm pretty sure the examples actually found were less prominent than that. There seemed to be at least one or two fora where it was something of a local idiom.

>... I don't recall ever seeing
> an identifier with ASCII and Japanese glommed together without a
> separator. It was almost always of the form "English verb - Japanese
> lexical component".

The problem was that some were written without a "-" or "_" to separate the halves. It looked fine -- the script change was obvious to even someone who didn't speak the non-English language. But having to support that meant any remaining restriction on mixed scripts would be either too weak to be worthwhile, or too complicated to write into the python language specification.

-jJ

pt...@austin.rr.com

Nov 13, 2021, 5:01:08 PM
to pytho...@python.org, Alex Martelli, Alex Martelli, Anna Martelli Ravenscroft

I’ve not been following the thread, but Steve Holden forwarded me the email from Petr Viktorin, that I might share some of the info I found while recently diving into this topic.

 

As part of working on the next edition of “Python in a Nutshell” with Steve, Alex Martelli, and Anna Ravencroft, Alex suggested that I add a cautionary section on homoglyphs, specifically citing “A” (LATIN CAPITAL LETTER A) and “Α” (GREEK CAPITAL LETTER ALPHA) as an example problem pair. I wanted to look a little further at the use of characters in identifiers beyond the standard 7-bit ASCII, and so I found some of these same issues dealing with Unicode NFKC normalization. The first discovery was the overlapping normalization of “ªº” with “ao”. This was quite a shock to me, since I assumed that the inclusion of Unicode for identifier characters would preserve the uniqueness of the different code points. Even ligatures can be used, and will overlap with their multi-character ASCII forms. So we have added a second note in the upcoming edition on the risks of using these “homonorms” (which is a word I just made up for the occasion).
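The NFKC overlaps described above are easy to verify interactively (a short demonstration; the variable names are illustrative):

```python
import unicodedata

# NFKC folds visually distinct code points onto plain ASCII:
print(unicodedata.normalize("NFKC", "ªº"))   # ao
print(unicodedata.normalize("NFKC", "ﬁle"))  # file  (the "fi" ligature)

# Because Python NFKC-normalizes identifiers (PEP 3131), these two
# assignments bind the *same* name:
ao = 1
ªº = 2
print(ao)  # 2
```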

 

To explore the extreme case, I wrote a pyparsing transformer to convert identifiers in a body of Python source to mixed font, equivalent to the original source after NFKC normalization. Here are hello.py, and a snippet from unittest/util.py:

 

def 𝚑𝓮𝖑𝒍𝑜():
    try:
        𝔥e𝗅𝕝𝚘︴ = "Hello"
        𝕨𝔬r𝓵ᵈ﹎ = "World"
        ᵖ𝖗𝐢𝘯𝓽(f"{𝗵e𝓵𝔩º_}, {𝖜ₒ𝒓lⅆ︴}!")
    except 𝓣𝕪ᵖe𝖤𝗿ᵣ𝖔𝚛 as ⅇ𝗑c:
        𝒑rℹₙₜ("failed: {}".𝕗𝗼ʳᵐªt(ᵉ𝐱𝓬))

if _︴ⁿ𝓪𝑚𝕖__ == "__main__":
    𝒉eℓˡ𝗈()

 

 

# snippet from unittest/util.py

_𝓟Ⅼ𝖠𝙲𝗘ℋ𝒪Lᴰ𝑬𝕽﹏𝕷𝔼𝗡 = 12

def _𝔰ʰ𝓸ʳ𝕥𝙚𝑛(𝔰, p𝑟𝔢fi𝖝𝕝𝚎𝑛, sᵤ𝑓𝗳𝗂𝑥𝗹ₑ𝚗):
    ˢ𝗸i𝗽 = 𝐥e𝘯(𝖘) - pr𝚎𝖋𝐢x𝗅ᵉ𝓷 - 𝒔𝙪ffi𝘅𝗹𝙚ₙ
    if ski𝘱 > _𝐏𝗟𝖠𝘊𝙴H𝕺L𝕯𝙀𝘙﹏L𝔈𝒩:
        𝘴 = '%s[%d chars]%s' % (𝙨[:𝘱𝐫𝕖𝑓𝕚xℓ𝒆𝕟], ₛ𝚔𝒊p, 𝓼[𝓁𝒆𝖓(𝚜) - 𝙨𝚞𝒇fix𝙡ᵉ𝘯:])
    return ₛ

 

 

You should be able to paste these into your local UTF-8-aware editor or IDE and execute them as-is.

 

(If this doesn’t come through, you can also see this as a GitHub gist at Hello, World rendered in a variety of Unicode characters (github.com). I have a second gist containing the transformer, but it is still a private gist atm.)

 

 

Some other discoveries:

“·” (U+00B7 MIDDLE DOT, which is not actually in 7-bit ASCII) is a valid identifier body character, making “_···” a valid Python identifier. This could actually be another security attack point, in which “s·join(‘x’)” could be easily misread as “s.join(‘x’)”, but would actually be a call to potentially malicious method “s·join”.

“_” seems to be a special case for normalization. Only the ASCII “_” character is valid as a leading identifier character; the Unicode characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏_”) can only be used as identifier body characters. “︳” especially could be misread as “|” followed by a space, when it actually normalizes to “_”.
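Both observations can be verified with `str.isidentifier()`, which applies the tokenizer's rules before any normalization (the identifier names below are just illustrations):

```python
# U+00B7 MIDDLE DOT is a valid identifier *continue* character, so
# "s·join" is one single identifier, easily misread as attribute access.
assert "s·join".isidentifier()
assert "_···".isidentifier()

# It cannot *start* an identifier, though.
assert not "·x".isidentifier()

# The Unicode underscores normalize to "_", but only the ASCII "_"
# may appear in the leading position.
assert "a﹏b".isidentifier()
assert not "﹏b".isidentifier()
```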

 

 

Potential beneficial uses:

I am considering taking my transformer code and experimenting with an orthogonal approach to syntax highlighting, using Unicode groups instead of colors. Module names using characters from one group, builtins from another, program variables from another, maybe distinguish local from global variables. Colorizing has always been an obvious syntax highlight feature, but is an accessibility issue for those with difficulty distinguishing colors. Unlike the “ransom note” code above, code highlighted in this way might even be quite pleasing to the eye.

 

 

-- Paul McGuire

 

 

Stestagg

Nov 13, 2021, 5:15:30 PM
to pytho...@python.org
This is my favourite version of the issue:

е = lambda е, e: е if е > e else e
print(е(2, 1), е(1, 2)) # python 3 outputs: 2 2
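The trick works because the name `е` uses CYRILLIC SMALL LETTER IE, not LATIN SMALL LETTER E, so the lambda has two distinct parameters; `unicodedata.name()` makes the difference visible:

```python
import unicodedata

# The two "e"s render identically in most fonts but are different
# code points, hence different identifiers.
cyrillic_e = "\u0435"
latin_e = "e"
assert unicodedata.name(cyrillic_e) == "CYRILLIC SMALL LETTER IE"
assert unicodedata.name(latin_e) == "LATIN SMALL LETTER E"
assert cyrillic_e != latin_e

# NFKC does not unify them either, so the parser also sees two names.
assert unicodedata.normalize("NFKC", cyrillic_e) != latin_e
```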

Steve


Terry Reedy

Nov 13, 2021, 5:37:29 PM
to pytho...@python.org
On 11/13/2021 4:35 PM, pt...@austin.rr.com wrote:
> I’ve not been following the thread, but Steve Holden forwarded me the

> To explore the extreme case, I wrote a pyparsing transformer to convert
> identifiers in a body of Python source to mixed font, equivalent to the
> original source after NFKC normalization. Here are hello.py, and a
> snippet from unittest/utils.py:
>
> def 𝚑𝓮𝖑𝒍𝑜():
>
>     try:
>
> 𝔥e𝗅𝕝𝚘︴ = "Hello"
>
> 𝕨𝔬r𝓵ᵈ﹎ = "World"
>
>         ᵖ𝖗𝐢𝘯𝓽(f"{𝗵e𝓵𝔩º_}, {𝖜ₒ𝒓lⅆ︴}!")
>
>     except 𝓣𝕪ᵖe𝖤𝗿ᵣ𝖔𝚛 as ⅇ𝗑c:
>
> 𝒑rℹₙₜ("failed: {}".𝕗𝗼ʳᵐªt(ᵉ𝐱𝓬))
>
> if _︴ⁿ𝓪𝑚𝕖__ == "__main__":
>
> 𝒉eℓˡ𝗈()
>
> # snippet from unittest/util.py
>
> _𝓟Ⅼ𝖠𝙲𝗘ℋ𝒪Lᴰ𝑬𝕽﹏𝕷𝔼𝗡 = 12
>
> def _𝔰ʰ𝓸ʳ𝕥𝙚𝑛(𝔰, p𝑟𝔢fi𝖝𝕝𝚎𝑛, sᵤ𝑓𝗳𝗂𝑥𝗹ₑ𝚗):
>
>     ˢ𝗸i𝗽 = 𝐥e𝘯(𝖘) - pr𝚎𝖋𝐢x𝗅ᵉ𝓷 - 𝒔𝙪ffi𝘅𝗹𝙚ₙ
>
>     if ski𝘱 > _𝐏𝗟𝖠𝘊𝙴H𝕺L𝕯𝙀𝘙﹏L𝔈𝒩:
>
> 𝘴 = '%s[%d chars]%s' % (𝙨[:𝘱𝐫𝕖𝑓𝕚xℓ𝒆𝕟], ₛ𝚔𝒊p, 𝓼[𝓁𝒆𝖓(𝚜) -
> 𝙨𝚞𝒇fix𝙡ᵉ𝘯:])
>
>     return ₛ
>
> You should able to paste these into your local UTF-8-aware editor or IDE
> and execute them as-is.

Wow. After pasting the util.py snippet into current IDLE, which on my
Windows machine* displays the complete text:

>>> dir()
['_PLACEHOLDER_LEN', '__annotations__', '__builtins__', '__doc__',
'__loader__', '__name__', '__package__', '__spec__', '_shorten']
>>> _shorten('abc', 1, 1)
'abc'
>>> _shorten('abcdefghijklmnopqrw', 2, 2)
'ab[15 chars]rw'

* Does not at all work in CommandPrompt, even after supposedly changing
to a utf-8 codepage with 'chcp 65000'.

--
Terry Jan Reedy

Christopher Barker

Nov 14, 2021, 12:18:36 PM
to pt...@austin.rr.com, Python Dev, Alex Martelli, Alex Martelli, Anna Martelli Ravenscroft
On Sat, Nov 13, 2021 at 2:03 PM <pt...@austin.rr.com> wrote:

def 𝚑𝓮𝖑𝒍𝑜():
    try:
        𝔥e𝗅𝕝𝚘︴ = "Hello"
        𝕨𝔬r𝓵ᵈ﹎ = "World"
        ᵖ𝖗𝐢𝘯𝓽(f"{𝗵e𝓵𝔩º_}, {𝖜ₒ𝒓lⅆ︴}!")
    except 𝓣𝕪ᵖe𝖤𝗿ᵣ𝖔𝚛 as ⅇ𝗑c:
        𝒑rℹₙₜ("failed: {}".𝕗𝗼ʳᵐªt(ᵉ𝐱𝓬))


Wow. Just Wow.

So why does Python apply  NFKC normalization to variable names?? I can't for the life of me figure out why that would be helpful at all.

The string methods, sure, but names?

And, in fact, the normalization is not used for string comparisons or hashes as far as I can tell.

In [36]: weird
Out[36]: 'ᵖ𝖗𝐢𝘯𝓽'

In [37]: normal
Out[37]: 'print'

In [38]: eval(weird + "('yup, that worked')")
yup, that worked

In [39]: weird == normal
Out[39]: False

In [40]: weird[0] in normal
Out[40]: False
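The session above can be reproduced in plain Python: identifier normalization happens in the compiler, not in `str` comparison, so applying NFKC by hand is the only way to see the collision:

```python
import unicodedata

weird = "ᵖ𝖗𝐢𝘯𝓽"
normal = "print"

# As strings, the two names are simply different.
assert weird != normal
assert weird[0] not in normal

# After the compiler's NFKC normalization they are the same identifier.
assert unicodedata.normalize("NFKC", weird) == normal
```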

This seems very odd (and dangerous) to me.

Is there a good reason? and is it too late to change it?

-CHB







 

 



--
Christopher Barker, PhD (Chris)

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython

Jim J. Jewett

Nov 14, 2021, 12:40:15 PM
to pytho...@python.org
ptmcg@austin.rr.com wrote:

> ... add a cautionary section on homoglyphs, specifically citing
> “A” (LATIN CAPITAL LETTER A) and “Α” (GREEK CAPITAL LETTER ALPHA)
> as an example problem pair.

There is a unicode tech report about confusables, but it is never clear where to stop. Are I (upper case I), l (lower case l) and 1 (numeric 1) from ASCII already a problem? And if we do it at all, is there any way to avoid making Cyrillic languages second-class?

I'm not quickly finding the contemporary report, but these should be helpful if you want to go deeper:

http://www.unicode.org/reports/tr36/
http://unicode.org/reports/tr36/confusables.txt
https://util.unicode.org/UnicodeJsps/confusables.jsp


> I wanted to look a little further at the use of characters in identifiers
> beyond the standard 7-bit ASCII, and so I found some of these same
> issues dealing with Unicode NFKC normalization. The first discovery was
> the overlapping normalization of “ªº” with “ao”.

Here I don't see the problem. Things that look slightly different are really the same, and you can write it either way. So you can use what looks like a funny font, but the closest it comes to a security risk is that maybe you could access something without a casual reader realizing that you are doing so. They would know that you *could* access it, just not that you *did*.

> Some other discoveries:
> “·” (ASCII 183) is a valid identifier body character, making “_···” a valid
> Python identifier.

That and the apostrophe are Unicode consortium regrets, because they are normally punctuation, but there are also languages that use them as letters.
The apostrophe is (supposedly) used only by Afrikaans. I asked a native speaker about where/how often it was used, and the similarity to Dutch was enough that Guido felt comfortable excluding it. (It *may* have been similar to using the apostrophe for a contraction in English, and saying it therefore represents a letter, but the scope was clearly smaller.) But the dot is used in Catalan, and ... we didn't find anyone ready to say it wouldn't be needed for sensible identifiers. It is worth listing as a warning, and linters should probably complain.

> “_” seems to be a special case for normalization. Only the ASCII “_”
> character is valid as a leading identifier character; the Unicode
> characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏_”)
> can only be used as identifier body characters. “︳” especially could be
> misread as “|” followed by a space, when it actually normalizes to “_”.

So go ahead and warn, but it isn't clear how that could be abused to look like something other than a syntax error, except maybe through soft keywords. (Ha! I snuck in a call to async︳def that had been imported with *, and you didn't worry about the import *, or the apparently wild cursor position marker, or the strange async definition that was never used! No way I could have just issued a call to _flush and done the same thing!)

> Potential beneficial uses:
> I am considering taking my transformer code and experimenting with an
> orthogonal approach to syntax highlighting, using Unicode groups
> instead of colors. Module names using characters from one group,
> builtins from another, program variables from another, maybe
> distinguish local from global variables. Colorizing has always been an
> obvious syntax highlight feature, but is an accessibility issue for those
> with difficulty distinguishing colors.

I kind of like the idea, but ... if you're doing it on-the-fly in the editor, you could just use different fonts. If you're actually saving those changes, it seems likely to lead to a lot of spurious diffs if anyone uses a different editor.

-jJ
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/NPTL43EVT2FF76LXIBBWVHDU6NXH3HF5/

MRAB

Nov 14, 2021, 1:21:50 PM
to pytho...@python.org
On 2021-11-14 17:17, Christopher Barker wrote:
> On Sat, Nov 13, 2021 at 2:03 PM <pt...@austin.rr.com
> <mailto:pt...@austin.rr.com>> wrote:
>
> def 𝚑𝓮𝖑𝒍𝑜():
>
>     try:
>
>         𝔥e𝗅𝕝𝚘︴ = "Hello"
>
>         𝕨𝔬r𝓵ᵈ﹎ = "World"
>
>         ᵖ𝖗𝐢𝘯𝓽(f"{𝗵e𝓵𝔩º_}, {𝖜ₒ𝒓lⅆ︴}!")
>
>     except 𝓣𝕪ᵖe𝖤𝗿ᵣ𝖔𝚛 as ⅇ𝗑c:
>
>         𝒑rℹₙₜ("failed: {}".𝕗𝗼ʳᵐªt(ᵉ𝐱𝓬))
>
>
> Wow. Just Wow.
>
> So why does Python apply  NFKC normalization to variable names?? I can't
> for the life of me figure out why that would be helpful at all.
>
> The string methods, sure, but names?
>
> And, in fact, the normalization is not used for string comparisons or
> hashes as far as I can tell.
>
[snip]

It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL LETTER
E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}",
which are different ways of writing the same thing.

Unfortunately, it goes too far, because it's unlikely that we want "ᵖ"
("\N{MODIFIER LETTER SMALL P}") to be equivalent to "p" ("\N{LATIN
SMALL LETTER P}").
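The distinction drawn here is exactly NFC versus NFKC; a minimal sketch:

```python
import unicodedata

decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE

# NFC composes the two spellings of "é" onto one code point...
assert unicodedata.normalize("NFC", decomposed) == composed

# ...without touching compatibility characters such as
# MODIFIER LETTER SMALL P (U+1D56).
assert unicodedata.normalize("NFC", "\u1d56") == "\u1d56"

# NFKC goes further and folds it onto a plain "p".
assert unicodedata.normalize("NFKC", "\u1d56") == "p"
```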

Alex Martelli via Python-Dev

Nov 14, 2021, 1:40:50 PM
to Christopher Barker, pt...@austin.rr.com, Python Dev, Alex Martelli, Anna Martelli Ravenscroft
Indeed, normative annex https://www.unicode.org/reports/tr31/tr31-35.html section 5 says: "if the programming language has case-sensitive identifiers, then Normalization Form C is appropriate" (vs NFKC for a language with case-insensitive identifiers) so to follow the standard we should have used NFC rather than NFKC. Not sure if it's too late to fix this "oops" in future Python versions.

Alex

Christopher Barker

Nov 14, 2021, 2:08:18 PM
to MRAB, Python Dev
On Sun, Nov 14, 2021 at 10:27 AM MRAB <pyt...@mrabarnett.plus.com> wrote:
> So why does Python apply  NFKC normalization to variable names??
 
It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL LETTER
E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}",
which are different ways of writing the same thing.

sure, but this is code, written by humans (or meta-programming). Maybe I'm showing my English bias, but would it be that limiting to have identifiers be based on code points, period?

Why does someone who wants to use, e.g., "é" in an identifier have to be able to represent it two different ways in a code file?

But if so ...
 
Unfortunately, it goes too far, because it's unlikely that we want "ᵖ"
("\N{MODIFIER LETTER SMALL P}") to be equivalent to "p" ("\N{LATIN
SMALL LETTER P}").

Is it possible to only capture things like the combining characters and not the "equivalent" ones like the above?

-CHB

Daniel Pope

Nov 14, 2021, 2:21:20 PM
to pytho...@python.org

On Sun, 14 Nov 2021, 19:07 Christopher Barker, <pyth...@gmail.com> wrote:
On Sun, Nov 14, 2021 at 10:27 AM MRAB <pyt...@mrabarnett.plus.com> wrote:
Unfortunately, it goes too far, because it's unlikely that we want "ᵖ"
("\N{MODIFIER LETTER SMALL P}") to be equivalent to "p" ("\N{LATIN
SMALL LETTER P}").

Is it possible to only capture things like the combining characters and not the "equivalent" ones like the above?

Yes, that is NFC. NFKC converts to compatibility-equivalent characters and also composes; NFC just composes.

Richard Damon

Nov 14, 2021, 2:28:32 PM
to pytho...@python.org
On 11/14/21 2:07 PM, Christopher Barker wrote:
> Why does someone that wants to use, .e.g. "é" in an identifier have
> to be able to represent it two different ways in a code file?
>
The issue here is that fundamentally, some editors will produce composed
characters and some decomposed characters to represent the same actual
'character'

These two methods are defined by Unicode to really represent the same
'character', it is just that some defined sequences of combining
codepoints just happen to have a composed 'abbreviation' defined also.

Having to match the exact byte sequence means that some people will
have a VERY hard time entering usable code if their tools support
Unicode but use the other convention.

--
Richard Damon


David Mertz, Ph.D.

Nov 14, 2021, 2:41:52 PM
to Christopher Barker, Python Dev
On Sun, Nov 14, 2021, 2:14 PM Christopher Barker 
It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL LETTER
E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}",
which are different ways of writing the same thing.

Why does someone that wants to use, .e.g. "é" in an identifier have to be able to represent it two different ways in a code file?

Imagine that two different programmers work with the same code base, and their text editors or keystrokes enter "é" in different ways.

Or imagine just one programmer doing so on two different machines/environments.

As an example, I wrote this reply on my Android tablet (with such-and-such OS version). I have no idea what actual codepoint(s) are entered when I press and hold the "e" key for a couple seconds to pop up character variations.

If I wrote it on OSX, I'd probably press "alt-e e" on my US International key layout. Again, no idea what codepoints actually are entered. If I did it on Linux, I'd use "ctrl-shift u 00e9". In that case, I actually know the codepoint.

Richard Damon

Nov 14, 2021, 4:00:23 PM
to pytho...@python.org
But you would have to look up the actual number to enter them.

Imagine if ALL your source code had to be entered via code-point numbers.

BTW, you should be able to enable 'composing' under Linux too, just like
under OSX with the right input driver loaded.

--
Richard Damon


Steven D'Aprano

Nov 14, 2021, 7:48:16 PM
to pytho...@python.org
Out of all the approximately thousand bazillion ways to write obfuscated
Python code, which may or may not be malicious, why are Unicode
confusables worth this level of angst and concern?

I looked up "Unicode homoglyph" on CVE, and found a grand total of seven
hits:

https://www.cvedetails.com/google-search-results.php?q=unicode+homoglyph

all of which appear to be related to impersonation of account names. I
daresay if I expanded my search terms, I would probably find some more,
but it is clear that Unicode homoglyphs are not exactly a major threat.

In my opinion, the other Steve's (Stestagg) example of obfuscated code
with homoglyphs for e (as well as a few similar cases, such as
homoglyphs for A) mostly makes for an amusing curiosity, perhaps worth a
plugin for Pylint and other static checkers, but not much more. I'm not
entirely sure what Paul's more lurid examples are supposed to indicate.
If your threat relies on a malicious coder smuggling in identifiers like
"𝚑𝓮𝖑𝒍𝑜" or "ªº" and having the reader not notice, then I'm not going to
lose much sleep over it.

Confusable account names and URL spoofing are proven, genuine threats.
Beyond that, IMO the actual threat window from confusables is pretty
small. Yes, you can write obfuscated code, and smuggle in calls to
unexpected functions:

result = lеn(sequence) # Cyrillic letter small Ie

but you still have to smuggle in a function to make it work:

def lеn(obj):
    ...  # something malicious

And if you can do that, the Unicode letter is redundant. I'm not sure
why any attacker would bother.


--
Steve

Christopher Barker

Nov 15, 2021, 1:15:12 AM
to Steven D'Aprano, Python Dev
On Sun, Nov 14, 2021 at 4:53 PM Steven D'Aprano <st...@pearwood.info> wrote:
Out of all the approximately thousand bazillion ways to write obfuscated
Python code, which may or may not be malicious, why are Unicode
confusables worth this level of angst and concern?

I for one am not full of angst nor particularly concerned. Though it's a fine idea to inform folks about these issues.

I am, however, surprised and disappointed by the NFKC normalization.

For example, in writing math we often use different scripts to mean different things (e.g. TeX's 
Blackboard Bold). So if I were to use some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't want them to get normalized.

Then there's the question of when this normalization happens (and when it doesn't). If one is doing any kind of metaprogramming, even just using getattr() and setattr(), things could get very confusing:

In [55]: class Junk:
    ...:     𝗵e𝓵𝔩º = "hello"
    ...:

In [56]: setattr(Junk, "ᵖ𝖗𝐢𝘯𝓽", "print")

In [57]: dir(Junk)
Out[57]:
['__weakref__',
 <snip>
 'hello',
 'ᵖ𝖗𝐢𝘯𝓽']

In [58]: Junk.hello
Out[58]: 'hello'

In [59]: Junk.𝗵e𝓵𝔩º
Out[59]: 'hello'

In [60]: Junk.print
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-60-f2a7d3de5d06> in <module>
----> 1 Junk.print

AttributeError: type object 'Junk' has no attribute 'print'

In [61]: Junk.ᵖ𝖗𝐢𝘯𝓽
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-61-004f4c8b2f07> in <module>
----> 1 Junk.ᵖ𝖗𝐢𝘯𝓽

AttributeError: type object 'Junk' has no attribute 'print'

In [62]: getattr(Junk, "ᵖ𝖗𝐢𝘯𝓽")
Out[62]: 'print'

Would a proposal to switch the normalization to NFC only have any hope of being accepted?

and/or adding normalization to setattr() and maybe other places where names are set in code?
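There is no built-in hook for this today, but the idea of a normalizing `setattr()` can be sketched with a `__setattr__` override (a toy illustration only; the class name is made up):

```python
import unicodedata

class NormalizingNamespace:
    """Toy namespace that NFKC-normalizes attribute names on the way in,
    mirroring what the compiler does for ordinary identifiers."""

    def __setattr__(self, name, value):
        object.__setattr__(self, unicodedata.normalize("NFKC", name), value)

ns = NormalizingNamespace()
setattr(ns, "ᵖ𝖗𝐢𝘯𝓽", "print")

# Unlike the Junk example above, the dynamically set name now matches
# the compiler-normalized identifier.
assert ns.print == "print"
```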

-CHB

Stephen J. Turnbull

Nov 15, 2021, 3:26:03 AM
to Christopher Barker, Python Dev
Christopher Barker writes:

> Would a proposal to switch the normalization to NFC only have any hope of
> being accepted?

Hope, yes. Counting you, it's been proposed twice. :-) I don't know
whether it would get through. We know this won't affect the stdlib,
since that's restricted to ASCII. I suppose we could trawl PyPI and
GitHub for "compatibles" (the Unicode term for "K" normalizations).

> For example, in writing math we often use different scripts to mean
> different things (e.g. TeX's Blackboard Bold). So if I were to use
> some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't
> want them to get normalized.

Independent of the question of the normalization of Python
identifiers, I think using those characters this way is a bad idea.
In fact, I think adding these symbols to Unicode was a bad idea; they
should be handled at a higher level in the linguistic stack (by
semantic markup).

You're confusing two things here. In Unicode, a script is a
collection of characters used for a specific language, typically a set
of Unicode blocks of characters (more or less; there are a lot of Han
ideographs that are recognizable as such to Japanese but are not part
of the repertoire of the Japanese script). That is, these characters
are *different* from others that look like them.

Blackboard Bold is more what we would usually call a "font": the
(math) italic "x" and the (math) bold italic "x" are the same "x", but
one denotes a scalar and the other a vector in many math books. A
roman "R" probably denotes the statistical application, an italic "R"
the reaction function in game theory model, and a Blackboard Bold "R"
the set of real numbers. But these are all the same character.

It's a bad idea to rely on different (Unicode) scripts that use the
same glyphs for different characters to look different from each
other, unless you "own" the fonts to be used. As far as I know
there's no way for a Python program to specify the font to be used to
display itself though. :-)

It's also a UX problem. At slightly higher layer in the stack, I'm
used to using Japanese input methods to input sigma and pi which
produce characters in the Greek block, and at least the upper case
forms that denote sum and product have separate characters in the math
operators block. I understand why people who literally write
mathematics in Greek might want those not normalized, but I sure am
going to keep using "Greek sigma", not "math sigma"! The probability
that I'm going to have a Greek uppercase sigma in my papers is nil,
the probability of a summation symbol near unity. But the summation
symbol is not easily available, I have to scroll through all the
preceding Unicode blocks to find Mathematical Operators. So I am
perfectly happy with uppercase Greek sigma for that role (as is
XeTeX!!)

And the thing is, of course those Greek letters really are Greek
letters: they were chosen because pi is the homophone of p which is
the first letter of "product", and sigma is the homophone of s which
is the first letter of "sum". Å for Ångström is similar, it's the
initial letter of a Swedish name.

Sure, we could fix the input methods (and search methods!! -- people
are going to input the character they know that corresponds to the
glyph *they* see, not the bit pattern the *CPU* sees). But that's as
bad as trying to fix mail clients. Not worth the effort because I'm
pretty sure you're gonna fail -- it's one of those "you'll have to pry
this crappy software that annoys admins around the world from my cold
dead fingers" issues, which is why their devs refuse to fix them.

Steve

Abdur-Rahmaan Janhangeer

Nov 15, 2021, 3:34:58 AM
to Steven D'Aprano, Python Dev
Well,

Yet another issue is adding vulnerabilities in plain sight.

Human code reviewers will see this:

if user.admin == "something":

Static analysers will see

if user.admin == "something<hidden chars>":

but will not flag it, as it's up to the user to verify the logic of things,

and as such software authors can plant backdoors in plain sight
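The scenario is easy to construct: a hidden formatting character makes two visually identical literals compare unequal. A minimal sketch, using U+200B ZERO WIDTH SPACE as the hidden character:

```python
import unicodedata

visible = "something"
hidden = "something\u200b"  # trailing ZERO WIDTH SPACE, invisible in most editors

# Renders the same, compares different: the check silently fails.
assert visible != hidden
assert len(hidden) == len(visible) + 1

# Such characters are format (Cf) code points, so a checker *can* flag them.
assert unicodedata.category("\u200b") == "Cf"
```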

Kind Regards,

Abdur-Rahmaan Janhangeer
github
Mauritius

Kyle Stanley

Nov 15, 2021, 3:38:36 AM
to pt...@austin.rr.com, Python Dev, Alex Martelli, Alex Martelli, Anna Martelli Ravenscroft
On Sat, Nov 13, 2021 at 5:04 PM <pt...@austin.rr.com> wrote:

 

def 𝚑𝓮𝖑𝒍𝑜():
    try:
        𝔥e𝗅𝕝𝚘︴ = "Hello"
        𝕨𝔬r𝓵ᵈ﹎ = "World"
        ᵖ𝖗𝐢𝘯𝓽(f"{𝗵e𝓵𝔩º_}, {𝖜ₒ𝒓lⅆ︴}!")
    except 𝓣𝕪ᵖe𝖤𝗿ᵣ𝖔𝚛 as ⅇ𝗑c:
        𝒑rℹₙₜ("failed: {}".𝕗𝗼ʳᵐªt(ᵉ𝐱𝓬))

if _︴ⁿ𝓪𝑚𝕖__ == "__main__":
    𝒉eℓˡ𝗈()

# snippet from unittest/util.py

_𝓟Ⅼ𝖠𝙲𝗘ℋ𝒪Lᴰ𝑬𝕽﹏𝕷𝔼𝗡 = 12

def _𝔰ʰ𝓸ʳ𝕥𝙚𝑛(𝔰, p𝑟𝔢fi𝖝𝕝𝚎𝑛, sᵤ𝑓𝗳𝗂𝑥𝗹ₑ𝚗):
    ˢ𝗸i𝗽 = 𝐥e𝘯(𝖘) - pr𝚎𝖋𝐢x𝗅ᵉ𝓷 - 𝒔𝙪ffi𝘅𝗹𝙚ₙ
    if ski𝘱 > _𝐏𝗟𝖠𝘊𝙴H𝕺L𝕯𝙀𝘙﹏L𝔈𝒩:
        𝘴 = '%s[%d chars]%s' % (𝙨[:𝘱𝐫𝕖𝑓𝕚xℓ𝒆𝕟], ₛ𝚔𝒊p, 𝓼[𝓁𝒆𝖓(𝚜) - 𝙨𝚞𝒇fix𝙡ᵉ𝘯:])
    return ₛ


0_o color me impressed, I did not think that would be legal syntax. Would be interesting to include in a textbook, if for nothing else other than to academically demonstrate that it is possible, as I suspect many are not aware.

--
--Kyle R. Stanley, Python Core Developer (what is a core dev?)
Pronouns: they/them (why is my pronoun here?)

Petr Viktorin

Nov 15, 2021, 4:00:55 AM
to pytho...@python.org
On 15. 11. 21 9:25, Stephen J. Turnbull wrote:
> Christopher Barker writes:
>
> > Would a proposal to switch the normalization to NFC only have any hope of
> > being accepted?
>
> Hope, yes. Counting you, it's been proposed twice. :-) I don't know
> whether it would get through. We know this won't affect the stdlib,
> since that's restricted to ASCII. I suppose we could trawl PyPI and
> GitHub for "compatibles" (the Unicode term for "K" normalizations).

I don't think PyPI/GitHub are good resources to trawl.

Non-ASCII identifiers were added for the benefit of people who use
non-English languages. But both PyPI and GitHub overwhelmingly host
projects written in English -- especially if you look at the more
popular projects.
It would be interesting to reach out to the target audience here... but
they're not on this list, either. Do we actually know anyone using this?


I do teach beginners in a non-English language, but tell them that they
need to learn English if they want to do any serious programming. Any
code that's to be shared more widely than a country effectively has to
be in English. It seems to me that at the level where you worry about
supply chain attacks and you're doing code audits, something like
CPython's policy (ASCII only except proper names and Unicode-related
tests) is a good idea.
Or not? I don't know anyone who actually uses non-ASCII identifiers for
a serious project.

Steven D'Aprano

Nov 15, 2021, 5:54:39 AM
to pytho...@python.org
On Mon, Nov 15, 2021 at 12:33:54PM +0400, Abdur-Rahmaan Janhangeer wrote:

> Yet another issue is adding vulnerabilities in plain sight.
> Human code reviewers will see this:
>
> if user.admin == "something":
>
> Static analysers will see
>
> if user.admin == "something<hidden chars>":

Okay, you have a string literal with hidden characters. Assuming that
your editor actually renders them as invisible characters, rather than
"something???" or "something□□□" or "something���" or equivalent.

Now what happens? where do you go from there to a vunerability or
backdoor? I think it might be a bit obvious that there is something
funny going on if I see:

    if (user.admin == "root" and check_password_securely()
            or user.admin == "root"
            # Second string has hidden characters, do not remove it.
            ):
        elevate_privileges()

even without the comment :-)

In another thread, Serhiy already suggested we ban invisible control
characters (other than whitespace) in comments and strings.

https://mail.python.org/archives/list/pytho...@python.org/message/DN24FK3A2DSO4HBGEDGJXERSAUYK6VK6/

I think that is a good idea.

But beyond the C0 and C1 control characters, we should be conservative
about banning "hidden characters" without a *concrete* threat. For
example, variation selectors are "hidden", but they change the visual
look of emoji and other characters. Even if you think that being able to
set the skin tone of your emoji or choose different national flags using
variation selectors is pure frippery, they are also necessary for
Mongolian and some CJK ideographs.

http://unicode.org/reports/tr28/tr28-3.html#13_7_variation_selectors

I'm not sure about bidirectional controls; I have to leave that to
people with more experience in bidirectional text than I do. I think
that many editors in common use don't support bidirectional text, or at
least the ones I use don't seem to support it fully or correctly. But
for what little it is worth, my feeling is that people who use RTL or
bidirectional strings and have editors that support them will be annoyed
if we ban them from strings for the comfort of people who may never in
their life come across a string containing such bidirectional text.

But, if there is a concrete threat beyond "it looks weird", that is
another issue.


> but will not flag it as it's up to the user to verify the logic of
> things

There is no reason why linters and code checkers shouldn't check for
invisible characters, Unicode confusables or mixed script identifiers
and flag them. The interpreter shouldn't concern itself with such purely
stylistic issues unless there is a concrete threat that can only be
handled by the interpreter itself.
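For example, a linter could approximate a mixed-script check like this (a crude sketch using character names, not the full UTS #39 confusables machinery):

```python
import unicodedata

def scripts_used(identifier):
    """Collect the scripts of an identifier's letters, crudely,
    from the start of each character's Unicode name."""
    found = set()
    for ch in identifier:
        name = unicodedata.name(ch, "")
        for script in ("LATIN", "GREEK", "CYRILLIC"):
            if name.startswith(script):
                found.add(script)
    return found

# The identifier below hides a Cyrillic "а" (U+0430) among Latin letters:
print(sorted(scripts_used("p\u0430ge")))  # ['CYRILLIC', 'LATIN']
```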


--
Steve
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/KSIBL3KMONIETBKXSBPPMA27MACWIH33/

Abdur-Rahmaan Janhangeer

Nov 15, 2021, 6:21:30 AM
to Steven D'Aprano, Python Dev
Greetings,


> Now what happens? Where do you go from there to a vulnerability or
> backdoor? I think it might be a bit obvious that there is something
> funny going on if I see:

    if (user.admin == "root" and check_password_securely()
            or user.admin == "root"
            # Second string has hidden characters, do not remove it.
            ):
        elevate_privileges()


Well, it's not so obvious. From Ross Anderson and Nicholas Boucher:
src: https://trojansource.codes/trojan-source.pdf

See appendix H for Python.

With implementations:

https://github.com/nickboucher/trojan-source/tree/main/Python

These rely precisely on bidirectional control chars and/or on replacing look-alikes.

> There is no reason why linters and code checkers shouldn't check for
> invisible characters, Unicode confusables or mixed script identifiers
> and flag them. The interpreter shouldn't concern itself with such purely
> stylistic issues unless there is a concrete threat that can only be
> handled by the interpreter itself.


I mean current linters. But it will be good to check those for sure.
As a programmer, i don't want a language which bans unicode stuffs.
If there's something that should be fixed, it's the unicode standard, maybe
defining a sane mode where weird unicode stuffs are not allowed. It could
also come from the language side, in the event that it's not addressed in
the standard itself.

I don't see it as a language fault, nor as a client fault, since both
follow the Unicode docs. But the response was mixed: some languages
decided to patch it from their side, some linters implemented detection
for it, and some editors flag it while others render it as the exploit
intended.

Steven D'Aprano

Nov 15, 2021, 6:45:09 AM
to pytho...@python.org
On Sun, Nov 14, 2021 at 10:12:39PM -0800, Christopher Barker wrote:

> I am, however, surprised and disappointed by the NKFC normalization.
>
> For example, in writing math we often use different scripts to mean
> different things (e.g. TeX's Blackboard Bold). So if I were to use
> some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't want
> them to get normalized.

Hmmm... would you really want these to all be different identifiers?

𝕭 𝓑 𝑩 𝐁 B

You're assuming the reader of the code has the right typeface to view
them (rather than as mere boxes), and that their eyesight is good enough
to distinguish the variations even if their editor applies bold or
italic as part of syntax highlighting. That's very bold of you :-)

In any case, the question of NFKC versus NFC was certainly considered,
but unfortunately PEP 3131 doesn't document why NFKC was chosen.

https://www.python.org/dev/peps/pep-3131/

Before we change the normalisation rules, it would probably be a good
idea to trawl through the archives of the mailing list and work out why
NFKC was chosen in the first place, or contact Martin von Löwis and see
if he remembers.


> Then there's the question of when this normalization happens (and when it
> doesn't). If one is doing any kind of metaprogramming, even just using
> getattr() and setattr(), things could get very confusing:

For ordinary identifiers, they are normalised at some point during
compilation or interpretation. It probably doesn't matter exactly when.

Strings should *not* be normalised when using subscripting on a dict,
not even on globals():

https://bugs.python.org/issue42680

I'm not sure about setattr and getattr. I think that they should be
normalised. But apparently they aren't:

>>> from types import SimpleNamespace
>>> obj = SimpleNamespace(B=1)
>>> setattr(obj, '𝕭', 2)
>>> obj
namespace(B=1, 𝕭=2)
>>> obj.B
1
>>> obj.𝕭
1

See also here:

https://bugs.python.org/issue35105



--
Steve

Chris Angelico

Nov 15, 2021, 6:46:19 AM
to Python Dev
On Mon, Nov 15, 2021 at 10:22 PM Abdur-Rahmaan Janhangeer
<arj.p...@gmail.com> wrote:
>
> Greetings,
>
>
> > Now what happens? Where do you go from there to a vulnerability or
> > backdoor? I think it might be a bit obvious that there is something
> > funny going on if I see:
>
> if (user.admin == "root" and check_password_securely()
> or user.admin == "root"
> # Second string has hidden characters, do not remove it.
> ):
> elevate_privileges()
>
>
> Well, it's not so obvious. From Ross Anderson and Nicholas Boucher
> src: https://trojansource.codes/trojan-source.pdf
>
> See appendix H. for Python.
>
> with implementations:
>
> https://github.com/nickboucher/trojan-source/tree/main/Python
>
> Rely precisely on bidirectional control chars and/or replacing look alikes

The point of those kinds of attacks is that syntax highlighters and
related code review tools would misinterpret them. So I pulled them
all up in both GitHub's view and the editor I personally use (SciTE,
albeit a fairly old version now). GitHub specifically flags it as a
possible exploit in a couple of cases, but also syntax highlights the
return keyword appropriately. SciTE doesn't give any sort of warnings,
but again, correctly highlights the code - early-return shows "return"
as a keyword, invisible-function shows the name "is_" as the function
name and the rest not, homoglyph-function shows a quite
distinct-looking letter that definitely isn't an H.

The problems here are not Python's, they are code reviewers', and that
means they're really attacks against the code review tools. It's no
different from using the variable m in one place and rn in another,
and hoping that code review uses a proportionally-spaced font that
makes those look similar. So to count as a viable attack, there needs
to be at least one tool that misparses these; so far, I haven't found
one, but if I do, wouldn't it be more appropriate to raise the bug
report against the tool?

> > There is no reason why linters and code checkers shouldn't check for
> > invisible characters, Unicode confusables or mixed script identifiers
> > and flag them. The interpreter shouldn't concern itself with such purely
> > stylistic issues unless there is a concrete threat that can only be
> > handled by the interpreter itself.
>
>
> I mean current linters. But it will be good to check those for sure.
> As a programmer, i don't want a language which bans unicode stuffs.
> If there's something that should be fixed, it's the unicode standard, maybe
> defining a sane mode where weird unicode stuffs are not allowed. Can also
> be from language side in the event where it's not being considered in the standard
> itself.

Uhhm..... "weird unicode stuffs"? Please clarify.

> I don't see it as a language fault nor as a client fault as they are considering
> the unicode docs but the response was mixed with some languages decided to patch it
> from their side, some linters implementing detection for it as well as some editors flagging
> it and rendering it as the exploit intended.

I see it as an editor issue (or code review tool, as the case may be).
You'd be hard-pressed to get something past code review if it looks to
everyone else like you slipped a "return" statement at the end of a
docstring.

So far, I've seen fewer problems from "weird unicode stuffs" than from
the quoted-printable encoding, and that's an attack that involves
nothing but ASCII text. It's also an attack that far more code review
tools seem to be vulnerable to.

ChrisA

Marc-Andre Lemburg

Nov 15, 2021, 7:09:10 AM
to Steven D'Aprano, pytho...@python.org
On 15.11.2021 12:36, Steven D'Aprano wrote:
> On Sun, Nov 14, 2021 at 10:12:39PM -0800, Christopher Barker wrote:
>
>> I am, however, surprised and disappointed by the NKFC normalization.
>>
>> For example, in writing math we often use different scripts to mean
>> different things (e.g. TeX's Blackboard Bold). So if I were to use
>> some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't want
>> them to get normalized.
>
> Hmmm... would you really want these to all be different identifiers?
>
> 𝕭 𝓑 𝑩 𝐁 B
>
> You're assuming the reader of the code has the right typeface to view
> them (rather than as mere boxes), and that their eyesight is good enough
> to distinguish the variations even if their editor applies bold or
> italic as part of syntax highlighting. That's very bold of you :-)
>
> In any case, the question of NFKC versus NFC was certainly considered,
> but unfortunately PEP 3131 doesn't document why NFKC was chosen.
>
> https://www.python.org/dev/peps/pep-3131/
>
> Before we change the normalisation rules, it would probably be a good
> idea to trawl through the archives of the mailing list and work out why
> NFKC was chosen in the first place, or contact Martin von Löwis and see
> if he remembers.

This was raised in the discussion, but never conclusively answered:

https://mail.python.org/pipermail/python-3000/2007-May/007995.html

NFKC is the standard normalization form when you want to remove any
typography-related variants/hints from the text before comparing
strings. See http://www.unicode.org/reports/tr15/

I guess that's why Martin chose this form, since the point
was to maintain readability, even if different variants of a
character are used in the source code. A "B" in the source code
should be interpreted as an ASCII B, even when written
as 𝕭 𝓑 𝑩 or 𝐁.
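MAL's point can be checked directly with `unicodedata`, a minimal sketch:

```python
import unicodedata

# All of these typographic variants fold to plain "B" under NFKC:
for ch in "𝕭𝓑𝑩𝐁":
    assert unicodedata.normalize("NFKC", ch) == "B"
print(unicodedata.normalize("NFKC", "𝕭"))  # B
```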

This simplifies writing code and does away with many of the
security issues you could otherwise run into (where e.g. the
absence of an identifier causes the application flow to
be different).

>> Then there's the question of when this normalization happens (and when it
>> doesn't).

It happens in the parser when reading a non-ASCII identifier
(see Parser/pegen.c), so only applies to source code, not attributes
you dynamically add to e.g. class or module namespaces.
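This parser-only scope is easy to demonstrate (a sketch; the identifier in the compiled source is normalized, while the dynamically added attribute name is stored as-is):

```python
# NFKC is applied by the parser to identifiers in source code,
# but not to names added dynamically at runtime.
ns = {}
exec("𝕭 = 42", ns)        # identifier in source: normalized to "B"
print("B" in ns)           # True

class C:
    pass

setattr(C, "𝕭", 1)        # dynamic attribute: stored unnormalized
print(hasattr(C, "B"))     # False
```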

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Nov 15 2021)
>>> Python Projects, Coaching and Support ... https://www.egenix.com/
>>> Python Product Development ... https://consulting.egenix.com/
________________________________________________________________________

::: We implement business ideas - efficiently in both time and costs :::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
https://www.egenix.com/company/contact/
https://www.malemburg.com/


Stephen J. Turnbull

Nov 15, 2021, 11:19:41 AM
to Abdur-Rahmaan Janhangeer, Python Dev
Abdur-Rahmaan Janhangeer writes:

> As a programmer, i don't want a language which bans unicode stuffs.

But that's what Unicode says should be done (see below).

> If there's something that should be fixed, it's the unicode standard,

Unicode is not going to get "fixed". Most features are important for
some natural language or other. One could argue that (for example)
math symbols that are adopted directly from some character repertoire
should not have been -- I did so elsewhere, although not terribly
seriously.

> maybe defining a sane mode where weird unicode stuffs are not
> allowed.

Unicode denies responsibility for that by permitting arbitrary
subsetting. It does have a couple of (very broad) subsets predefined,
ie, the normalization formats.



Terry Reedy

Nov 15, 2021, 12:29:25 PM
to pytho...@python.org
On 11/15/2021 5:45 AM, Steven D'Aprano wrote:

> In another thread, Serhiy already suggested we ban invisible control
> characters (other than whitespace) in comments and strings.

He said in string *literals*. One would put them in strings by using
visible escape sequences.

>>> '\033' is '\x1b' is '\u001b'
True

If one is outputting terminal control sequences, making the escape char
visible is a good idea anyway. It would be easier if '\e' worked. (But
see below.)

> But beyond the C0 and C1 control characters, we should be conservative
> about banning "hidden characters" without a *concrete* threat. For
> example, variation selectors are "hidden", but they change the visual
> look of emoji and other characters.
I can imagine that a complete emoji point and click input method might
have one select the emoji and the variation and output the pair
together. An option to output the selection character as the
appropriate python-specific '\unnnn' is unlikely, and even if there
were, who would know what it meant? Users would want the selected
variation visible if the editor supported such.

If terminal escape sequences were also selected by point and click, my
comment above would change.

--
Terry Jan Reedy


Abdur-Rahmaan Janhangeer

Nov 15, 2021, 12:36:21 PM
to Chris Angelico, Python Dev
> GitHub specifically flags it as a
> possible exploit in a couple of cases, but also syntax highlights the
> return keyword appropriately.

My guess is that Github did patch it afterwards as the paper does list Github
as vulnerable

> Uhhm..... "weird unicode stuffs"? Please clarify.

Wriggly texts just because they appear different

Well, it's tool-based, but maybe compiler checks, i.e. checks from the
language side, are something that should be insisted upon too, to patch
over inconsistent checks across editors.

The reason I was saying it's related to encodings is that when languages
are impacted en masse, maybe it hints at a revision of the Unicode
standards, or at the very least warnings. Steven above was hinting
towards the vulnerability even before I posted the paper, so maybe those
in charge of the Unicode standards should study and predict angles of
attack.

Steven D'Aprano

Nov 15, 2021, 6:07:18 PM
to pytho...@python.org
On Mon, Nov 15, 2021 at 12:28:01PM -0500, Terry Reedy wrote:
> On 11/15/2021 5:45 AM, Steven D'Aprano wrote:
>
> >In another thread, Serhiy already suggested we ban invisible control
> >characters (other than whitespace) in comments and strings.
>
> He said in string *literals*. One would put them in stromgs by using
> visible escape sequences.

Thanks Terry for the clarification, of course I didn't mean to imply
that we should ban control characters in strings completely. Only actual
control characters embedded in string literals in the source, just as we
already currently ban them outside of comments and strings.


--
Steve

Steven D'Aprano

Nov 15, 2021, 7:49:18 PM
to pytho...@python.org
On Mon, Nov 15, 2021 at 03:20:26PM +0400, Abdur-Rahmaan Janhangeer wrote:

> Well, it's not so obvious. From Ross Anderson and Nicholas Boucher
> src: https://trojansource.codes/trojan-source.pdf

Thanks for the link. But it discusses a whole range of Unicode attacks,
and the specific attack you mentioned (Invisible Character Attacks) is
described in section D page 7 as "unlikely to work in practice".

As they say, compilers and interpreters in general already display
errors, or at least a warning, for invisible characters in code.

In addition, there is the difficulty that it's not enough just to use
invisible characters to call a different function; you have to smuggle
in the hostile function that you actually want to call.

It does seem that the Trojan-Source attack listed in the paper is new,
but others (such as the homoglyph attacks that get most people's
attention) are neither new nor especially easy to actually exploit.
Unicode has been warning about it for many years. We discussed it in PEP
3131. This is not new, and not easy to exploit.

Perhaps that's why there are no, or very few, actual exploits of this in
the wild. Homoglyph attacks against user-names and URLs, absolutely, but
homoglyph attacks against source code are a different story.

Yes, you can cunningly have two classes like Α and A and the Python
interpreter will treat them as distinct, but you still have to smuggle
in your hostile code in Α (greek Alpha) without anyone noticing, and you
have to avoid anyone asking why you have two classes with the same name.
And that's the hard part.
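The distinctness Steven describes is easy to verify: NFKC does not merge cross-script confusables (a sketch):

```python
import unicodedata

# Greek capital alpha (U+0391) survives NFKC unchanged; it never
# collides with Latin "A", so the two classes remain distinct names.
alpha = "\u0391"
print(unicodedata.normalize("NFKC", alpha) == alpha)  # True
print(alpha == "A")                                   # False
```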

We don't need Unicode for homoglyph attacks. func0 and funcO may look
identical, or nearly identical, but you still have to smuggle in your
hostile code into funcO without anyone noticing, and that's why there
are so few real-world homoglyph attacks.

Whereas the Trojan Source attacks using BIDI controls does seem to be
genuinely exploitable.


--
Steve

Steven D'Aprano

Nov 15, 2021, 8:11:33 PM
to pytho...@python.org
On Mon, Nov 15, 2021 at 10:43:12PM +1100, Chris Angelico wrote:

> The problems here are not Python's, they are code reviewers', and that
> means they're really attacks against the code review tools.

I think that's a bit strong. Boucher and Anderson's paper describes
multiple kinds of vulnerabilities. At a fairly quick glance, the BIDI
attacks do seem to be novel, and probably exploitable.

But unfortunately it seems to be the Unicode confusables or homoglyph
attack that seems to be getting most of the attention, and that's not
new, it is as old as ASCII, and not so easily exploitable. Being able to
have А (Cyrillic) Α (Greek alpha) and A (Latin) in the same code base
makes for a nice way to write obfuscated code, but it's *obviously*
obfuscated and not so easy to smuggle in hostile code.

Whereas the BIDI attacks do (apparently) make it easy to smuggle in
code: using invisible BIDI control codes, you can introduce source code
where the way the editor renders the code, and the way the coder reads
it, is different from the way the interpreter or compiler runs it.

That is, I think, new and exploitable: something that looks like a
comment is actually code that the interpreter runs, and something that
looks like code is actually a string or comment which is not executed,
but editors may syntax-colour it as if it were code.

Obviously we can mitigate against this by improving the editors (at the
very least, all editors should have a Show Invisible Characters option).
Linters and code checks should also flag problematic code containing
BIDI codes, or attacks against docstrings.

Beyond that, it is not clear to me what, if anything, we should do in
response to this new class of Trojan Source attacks, beyond documenting
it.

--
Steve

Chris Angelico

Nov 15, 2021, 8:59:44 PM
to pytho...@python.org
On Tue, Nov 16, 2021 at 12:13 PM Steven D'Aprano <st...@pearwood.info> wrote:
>
> On Mon, Nov 15, 2021 at 10:43:12PM +1100, Chris Angelico wrote:
>
> > The problems here are not Python's, they are code reviewers', and that
> > means they're really attacks against the code review tools.
>
> I think that's a bit strong. Boucher and Anderson's paper describes
> multiple kinds of vulnerabilities. At a fairly quick glance, the BIDI
> attacks do seem to be novel, and probably exploitable.

The BIDI attacks basically amount to making this:

def func():
    """This is a docstring"""; return

look like this:

def func():
    """This is a docstring; return"""

If you see something that looks like the second, but the word "return"
is syntax-highlighted as a keyword instead of part of the string, the
attack has failed. (Or if you ignore that, then your code review is
flawed, and you're letting malicious code in.) The attack depends for
its success on some human approving some piece of code that doesn't do
what they think it does, and that means it has to look like what it
doesn't do - which is an attack against what the code looks like,
since what it does is very well defined.
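What the interpreter does is indeed well defined regardless of rendering. Inspecting Chris's first form with the `ast` module (a sketch):

```python
import ast

src = 'def func():\n    """This is a docstring"""; return\n'
body = ast.parse(src).body[0].body

# Python sees two statements here, however an editor displays the line:
print([type(stmt).__name__ for stmt in body])  # ['Expr', 'Return']
```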

> Whereas the BIDI attacks do (apparently) make it easy to smuggle in
> code: using invisible BIDI control codes, you can introduce source code
> where the way the editor renders the code, and the way the coder reads
> it, is different from the way the interpreter or compiler runs it.

Right: the way the editor renders the code, that's the essential part.
That's why I consider this an attack against some editor (or set of
editors). When you find an editor that is vulnerable to this, file a
bug report against that editor.

The way the coder reads it will be heavily based upon the way the
editor colours it.

> That is, I think, new and exploitable: something that looks like a
> comment is actually code that the interpreter runs, and something that
> looks like code is actually a string or comment which is not executed,
> but editors may syntax-colour it as if it were code.

Right. Exactly my point: editors may syntax-colour it incorrectly.

That's why I consider this not an attack on the language, but on the
editor. As long as the editor parses it the exact same way that the
interpreter does, there isn't a problem.

ChrisA

Jim J. Jewett

Nov 16, 2021, 6:55:05 PM
to pytho...@python.org
Compatibility variants can look different, but they can also look identical. Allowing any non-ASCII characters was worrisome because of the security implications of confusables. Squashing compatibility characters seemed the more conservative choice at the time. Stestagg's example:
е = lambda е, e: е if е > e else e
shows it wasn't perfect, but adding more invisible differences does have risks, even beyond the backwards incompatibility and the problem with (hopefully rare, but are we sure?) editors that don't distinguish between them in the way a programming language would prefer.
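Stestagg's lambda can be reduced to a minimal demonstration (a sketch; the two names differ only by Cyrillic "е", U+0435, versus Latin "e"):

```python
# Two visually identical identifiers, two distinct variables:
code = "\u0435 = 1\ne = 2\nresult = \u0435 + e\n"
ns = {}
exec(code, ns)
print(ns["result"])  # 3
```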

I think (but won't swear) that there were also several problematic characters that really should have been treated as (at most) glyph variants, but ... weren't. If I Recall Correctly, the largest number were Arabic presentation forms, but there were also a few characters that were in Unicode only to support round-trip conversion with a legacy charset, even if that charset had been declared buggy. In at least a few of these cases, it seemed likely that a beginning user would expect them to be equivalent.

-jJ

Jim J. Jewett

Nov 16, 2021, 7:29:05 PM
to pytho...@python.org
Stephen J. Turnbull wrote:
> Christopher Barker writes:

> > For example, in writing math we often use different scripts to mean
> > different things (e.g. TeX's Blackboard Bold). So if I were to use
> > some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't
> > want them to get normalized.

Agreed, for careful writers. But Stephen's answer about people using the wrong one and expecting it to work means that normalization is probably the lesser of evils for most people, and the ones who don't want it normalized are more likely to be able to specify custom processing when it is important enough. (The compatibility characters aren't normalized in strings, largely because that should still be possible.)

> In fact, I think adding these symbols to Unicode was a bad idea; they
> should be handled at a higher level in the linguistic stack (by
> semantic markup).

When I was a math student, these were clearly different symbols, with much less relation to each other than a mere case difference.
So by the Unicode consortium's goals, they are independent characters that should each be defined. I admit that isn't ideal for most use cases outside of math, but ... supporting those other cases is what compatibility normalization is for.

> It's also a UX problem. At slightly higher layer in the stack, I'm
> used to using Japanese input methods to input sigma and pi which
> produce characters in the Greek block, and at least the upper case
> forms that denote sum and product have separate characters in the math
> operators block. I understand why people who literally write
> mathematics in Greek might want those not normalized, but I sure am
> going to keep using "Greek sigma", not "math sigma"! The probability
> that I'm going to have a Greek uppercase sigma in my papers is nil,
> the probability of a summation symbol near unity. But the summation
> symbol is not easily available, I have to scroll through all the
> preceding Unicode blocks to find Mathematical Operators. So I am
> perfectly happy with uppercase Greek sigma for that role (as is
> XeTeX!!)

I think that is mostly a backwards compatibility problem; XeTeX itself had to worry about compatibility with TeX (which preceded Unicode) and with the fonts actually available and then with earlier versions of XeTeX.

-jJ

Jim J. Jewett

Nov 16, 2021, 8:28:39 PM
to pytho...@python.org
Steven D'Aprano wrote:
> I think
> that many editors in common use don't support bidirectional text, or at
> least the ones I use don't seem to support it fully or correctly. ...
> But, if there is a concrete threat beyond "it looks weird", that it
> another issue.

Based on the original post (and how it looked in my web browser, after
various automated reformattings), it seems that one of the failure modes
of buggy editors is that
stuff can be part of the code, even though it looks like part of a comment, or vice versa

This problem might be limited to only some of the bidi controls, and there might even be a workaround specific to # ... but it is an issue. I do not currently have an opinion on how important an issue it is, or how adequate the workarounds are.

-jJ

Stephen J. Turnbull

Nov 16, 2021, 10:59:55 PM
to Jim J. Jewett, pytho...@python.org
Executive summary:

I guess the bottom line is that I'm sympathetic to both the NFC and
NFKC positions.

I think that wetware is such that people will go to the trouble of
picking out a letter-like symbol from a palette rarely, and in my
environment that's not going to happen at all because I use Japanese
phonetic input to get most symbols ("sekibun" = integral, "siguma" =
sigma), and I don't use calligraphic R for the real line, I use
\newcommand{\R}{{\cal R}}, except on a physical whiteboard, where I
use blackboard bold (go figure that one out!). So to my mind the
letter-like block in Unicode is a failed experiment.

Jim J. Jewett writes:

> When I was a math student, these were clearly different symbols,
> with much less relation to each other than a mere case difference.

Arguable. The letter-like symbols block has script (cursive),
blackboard bold, and Fraktur versions of R. I've seen all of them as
well as plain Roman, bold, italic and bold italic faces used to denote
the real line, and I've personally used most of them for that purpose
depending on availability of fonts and input methods and medium (ie,
computer text vs. hand-written). I've also seen several of them used
for reaction functions or spaces thereof in game theory (although
blackboard bold and Fraktur seem to be used uniquely for the real
line). Clearly the common denominator is the uppercase latin letter
"R", and the glyph being recognizably "R" is necessary and sufficient
to each of those purposes. The story for uppercase sigma as sum is
somewhat similar: sum is by far not the only use of that letter,
although I don't know of any other operator symbol for sum over a set
or series (outside of programming languages, which I think we can
discount).

I agree that we should consider math to be a separate language, but it
doesn't have a consistent script independent of the origins of the
symbols. Even today none of my engineering and economics students can
type any symbols except those in the JIS repertoire, which they type
by original name ("siguma", "ramuda", "arefu", "yajirushi" == arrow,
etc, "sekibun" == integration does bring up the integral sign in at
least some modern input methods, but it doesn't have a script name,
while "kasann" == addition does not bring up sigma, although "siguma"
does, and "essu" brings up sigma -- but only in "ASCII emoji" strings,
go figure). I have seen students use fullwidth R for the real line,
though, but distinguishing that is a deprecated compatibility feature
of Unicode (and of Japanese practice -- even in very formal university
documents such as grade reports for a final doctoral examination I've
seen numbers and names containing mixed half-width and full-width
ASCII).

So I think "letter-like" was a reasonable idea (I'm pretty sure this
block goes back to the '90s but I'm too lazy to check), but it hasn't
turned out well, and I doubt it ever will.

> So by the Unicode consortium's goals, they are independent
> characters that should each be defined. I admit that isn't ideal
> for most use cases outside of math,

I don't think it even makes sense *inside* of math for the letter-like
symbols. The nature of math means that any "R" will be grabbed for
something whose name starts with "r" as soon as that's convenient.
Something like the integral sign (which is a stretched "S" for "sum"),
OK -- although category theory uses that for "ends" which still don't
look anything like integrals even if you turn them inside out, rotate
90 degrees, and paint them blue.

> > It's also a UX problem. At slightly higher layer in the stack, I'm
> > used to using Japanese input methods to input sigma and pi which
> > produce characters in the Greek block, and at least the upper case
> > forms that denote sum and product have separate characters in the math
> > operators block.
>
> I think that is mostly a backwards compatibility problem; XeTeX
> itself had to worry about compatibility with TeX (which preceded
> Unicode) and with the fonts actually available and then with
> earlier versions of XeTeX.

IMO, the analogy fails because the backward compatibility issue for
Unicode is in the wetware, not in the software.

Steve

_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/YTIIFIF75RMWP5J3GCSXWVXSUP5SX7AA/

Eryk Sun

Nov 18, 2021, 5:01:33 AM11/18/21
to Terry Reedy, pytho...@python.org
On 11/13/21, Terry Reedy <tjr...@udel.edu> wrote:
> On 11/13/2021 4:35 PM, pt...@austin.rr.com wrote:
>>
>> _𝓟Ⅼ𝖠𝙲𝗘ℋ𝒪Lᴰ𝑬𝕽﹏𝕷𝔼𝗡 = 12
>>
>> def _𝔰ʰ𝓸ʳ𝕥𝙚𝑛(𝔰, p𝑟𝔢fi𝖝𝕝𝚎𝑛, sᵤ𝑓𝗳𝗂𝑥𝗹ₑ𝚗):
>>
>> ˢ𝗸i𝗽 = 𝐥e𝘯(𝖘) - pr𝚎𝖋𝐢x𝗅ᵉ𝓷 - 𝒔𝙪ffi𝘅𝗹𝙚ₙ
>>
>> if ski𝘱 > _𝐏𝗟𝖠𝘊𝙴H𝕺L𝕯𝙀𝘙﹏L𝔈𝒩:
>>
>> 𝘴 = '%s[%d chars]%s' % (𝙨[:𝘱𝐫𝕖𝑓𝕚xℓ𝒆𝕟], ₛ𝚔𝒊p, 𝓼[𝓁𝒆𝖓(𝚜) -
>> 𝙨𝚞𝒇fix𝙡ᵉ𝘯:])
>>
>> return ₛ
>>
> * Does not at all work in CommandPrompt

It works for me when pasted into the REPL using the console in Windows
10. I pasted the code into a raw multiline string assignment and then
executed the string with exec(). The only issue is that most of the
pasted characters are displayed using the font's default glyph since
the console host doesn't have font fallback support. Even Windows
Terminal doesn't have font fallback support yet in the command-line
editing mode that Python's REPL uses. But Windows Terminal does
implement font fallback for normal output rendering, so if you assign
the pasted text to string `s`, then print(s) should display properly.

> even after supposedly changing to a utf-8 codepage with 'chcp 65000'.

Changing the console code page is unnecessary with Python 3.6+, which
uses the console's wide-character API. Also, even though it's
irrelevant for the REPL, UTF-8 is code page 65001. Code page 65000 is
UTF-7.
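As a side note, since Python 3.6 (PEP 528/529) the interpreter's console I/O no longer depends on the active code page at all. A minimal sketch for checking what encodings a given interpreter is using (the stdout value varies by platform and by whether output is redirected):

```python
import sys

# On 3.6+ with PEP 528, stdout attached to a Windows console reports
# "utf-8"; redirected output may report a locale encoding instead.
print(sys.stdout.encoding)

# The default text encoding is always UTF-8 on Python 3,
# independent of any console code page:
print(sys.getdefaultencoding())
```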
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/7FGNJ7TMASDOMQAS2LSSQAD2PPURT5W6/

Steve Holden

Nov 29, 2021, 4:17:48 AM11/29/21
to Python Dev, Paul McGuire, Alex Martelli, Alex Martelli, Anna Martelli Ravenscroft
On Mon, Nov 15, 2021 at 8:42 AM Kyle Stanley <aero...@gmail.com> wrote:
On Sat, Nov 13, 2021 at 5:04 PM <pt...@austin.rr.com> wrote:

 

def 𝚑𝓮𝖑𝒍𝑜():

[... Python code it's easy to believe isn't grammatical ...]

    return


0_o color me impressed; I did not think that would be legal syntax. It would be interesting to include in a textbook, if for nothing else than to academically demonstrate that it is possible, as I suspect many are unaware.

I'm afraid the best Paul, Alex, Anna and I can hope to do is bring it to the attention of readers of Python in a Nutshell's fourth edition (on current plans, hitting the shelves about the same time as 3.11, please tell your friends ;-) ). Sadly, I'm not aware of any academic classes that use the Nutshell as a course text, so it seems unlikely to gain the attention of academic communities.

Given the wider reach of this list, however, one might hope that by the time the next edition comes out this will be old news due to the publication of blogs and the like. With luck, a small fraction of the programming community will become better-informed about Unicode and the design of programming languages. It's interesting that the egalitarian wish to allow use of native "alphabetics" has turned out to be such a viper's nest.

Particular thanks to Stephen J. Turnbull for his thoughtful and well-informed contribution above.

Kind regards,
Steve

Christopher Barker

Nov 29, 2021, 12:49:00 PM11/29/21
to Steve Holden, Alex Martelli, Alex Martelli, Anna Martelli Ravenscroft, Paul McGuire, Python Dev
On Mon, Nov 29, 2021 at 1:21 AM Steve Holden <st...@holdenweb.com> wrote:
It's interesting that the egalitarian wish to allow use of native "alphabetics" has turned out to be such a viper's nest. 

Indeed.

However, is there no way to restrict identifiers at least to the alphabets of natural languages? Maybe it wouldn't help much, but does anyone need to use letter-like symbols designed for math expressions? I would say maybe, but certainly not to have them auto-converted to the "normal" letter.

For that matter, why have any auto-conversion at all?

The answer may be that it’s too late to change now, but I don’t think I’ve seen a compelling (or any?) use case for that conversion.
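The auto-conversion in question is the NFKC normalization that PEP 3131 specifies for Python identifiers. A minimal sketch of the effect, reusing the spelling from the earlier example:

```python
import unicodedata

# PEP 3131: identifiers are NFKC-normalized at parse time, so
# mathematical-alphanumeric letters fold into their plain forms.
fancy = "𝚑𝓮𝖑𝒍𝑜"  # five MATHEMATICAL-style letters, no ASCII at all
plain = unicodedata.normalize("NFKC", fancy)
print(plain)           # hello
print(fancy == plain)  # False: as *strings* they are distinct...

# ...but as *identifiers* they name the same variable:
namespace = {}
exec("𝚑𝓮𝖑𝒍𝑜 = 42", namespace)
print(namespace["hello"])  # 42
```

So two visually different spellings silently collapse into one name, which is exactly the gotcha under discussion.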

-CHB


