[Python-Dev] pth file encoding

Inada Naoki

Mar 15, 2021, 10:48:24 PM
to Python-Dev
Hi, all.

I found that .pth files are decoded using the default (i.e. locale-specific) encoding.
https://github.com/python/cpython/blob/0269ce87c9347542c54a653dd78b9f60bb9fa822/Lib/site.py#L173

pth files contain:

* import statements
* paths

For import statements, UTF-8 is the default Python source encoding.
For paths, fsencoding is the right encoding. It is UTF-8 on Windows
(except when PYTHONLEGACYWINDOWSFSENCODING is set), and the locale-specific
encoding on Linux.

What encoding should we use?

* UTF-8
* sys.getfilesystemencoding()
* Keep status-quo.

Regards,

--
Inada Naoki <songof...@gmail.com>
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/RKXH7QGIBC3UNCLGUSCLWLZX2WM6IGWW/
Code of Conduct: http://python.org/psf/codeofconduct/
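
To make the two kinds of content concrete, here is a minimal sketch that classifies the lines of a .pth file and prints the two encodings named in the message above. The helper name `classify_pth_lines` is hypothetical, and its rules are a simplification for illustration, not the actual code in site.py.

```python
import locale
import sys


def classify_pth_lines(text):
    """Split .pth text into ("import", line) and ("path", line) entries.

    Simplified on purpose: blank lines and '#' comments are skipped.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Lines beginning with "import" are code; everything else is a path.
        kind = "import" if line.startswith(("import ", "import\t")) else "path"
        entries.append((kind, line))
    return entries


# The two candidate encodings discussed in this thread.
print("locale encoding:    ", locale.getpreferredencoding(False))
print("filesystem encoding:", sys.getfilesystemencoding())
print(classify_pth_lines("# comment\nimport foo\n/opt/libs/a\u00a3b\n"))
```

Because a single file mixes source code and filesystem paths, no one of the listed encodings is obviously right for every line.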

Brett Cannon

Mar 16, 2021, 3:58:06 PM
to Inada Naoki, Python-Dev
On Mon, Mar 15, 2021 at 7:53 PM Inada Naoki <songof...@gmail.com> wrote:
> Hi, all.
>
> I found .pth file is decoded by the default (i.e. locale-specific) encoding.
> https://github.com/python/cpython/blob/0269ce87c9347542c54a653dd78b9f60bb9fa822/Lib/site.py#L173
>
> pth files contain:
>
> * import statements
> * paths
>
> For import statement, UTF-8 is the default Python code encoding.
> For paths, fsencoding is the right encoding. It is UTF-8 on Windows
> (except PYTHONLEGACYWINDOWSFSENCODING is set), and locale-specific
> encoding in Linux.
>
> What encoding should we use?
>
> * UTF-8
> * sys.getfilesystemencoding()
> * Keep status-quo.

What are packaging tools like pip and setuptools writing .pth files out as?

Inada Naoki

Mar 17, 2021, 12:58:28 AM
to Brett Cannon, Python-Dev
OK. setuptools doesn't specify an encoding at all, so the locale-specific
encoding is used.
We cannot fix it in the short term.
--
Inada Naoki <songof...@gmail.com>

Michał Górny

Mar 17, 2021, 4:12:33 AM
to Inada Naoki, Brett Cannon, Python-Dev
On Wed, 2021-03-17 at 13:55 +0900, Inada Naoki wrote:
> OK. setuptools doesn't specify encoding at all. So locale-specific
> encoding is used.
> We can not fix it in short term.

How about writing paths as bytestrings in the long term? I think this
should eliminate the necessity of knowing the correct encoding for
the filesystem.

--
Best regards,
Michał Górny



Paul Moore

Mar 17, 2021, 4:35:32 AM
to Michał Górny, Python-Dev
On Wed, 17 Mar 2021 at 08:13, Michał Górny <mgo...@gentoo.org> wrote:
>
> On Wed, 2021-03-17 at 13:55 +0900, Inada Naoki wrote:
> > OK. setuptools doesn't specify encoding at all. So locale-specific
> > encoding is used.
> > We can not fix it in short term.
>
> How about writing paths as bytestrings in the long term? I think this
> should eliminate the necessity of knowing the correct encoding for
> the filesystem.

If I have a path in my Python program that is "a£b" (a unicode string)
and I want to write it to a .pth file, what encoding should I use to
"write it as a bytestring"? I don't understand what you're trying to
suggest here.
Paul
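
The question can be made concrete with a short sketch: the same string yields different bytes under each encoding, so "write it as a bytestring" still requires picking one. The three encodings below are chosen only for illustration.

```python
path = "a\u00a3b"  # the string "a£b" from the message above

# Encode the same text three ways; each result is a different bytestring.
encoded = {enc: path.encode(enc) for enc in ("utf-8", "latin-1", "utf-16-le")}

for enc, data in encoded.items():
    print(f"{enc:10} {data!r}")
```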

Inada Naoki

Mar 17, 2021, 4:55:03 AM
to Paul Moore, Michał Górny, Python-Dev
On Wed, Mar 17, 2021 at 5:33 PM Paul Moore <p.f....@gmail.com> wrote:
>
> On Wed, 17 Mar 2021 at 08:13, Michał Górny <mgo...@gentoo.org> wrote:
> >
> > On Wed, 2021-03-17 at 13:55 +0900, Inada Naoki wrote:
> > > OK. setuptools doesn't specify encoding at all. So locale-specific
> > > encoding is used.
> > > We can not fix it in short term.
> >
> > How about writing paths as bytestrings in the long term? I think this
> > should eliminate the necessity of knowing the correct encoding for
> > the filesystem.
>
> If I have a path in my Python program that is "a£b" (a unicode string)
> and I want to write it to a .pth file, what encoding should I use to
> "write it as a bytestring"? I don't understand what you're trying to
> suggest here.
> Paul

On Windows, it must be UTF-8. For example, we use `chcp 65001` in
`activate.bat` to support Unicode paths.
On Unix, a raw path is a bytestring, so paths can be written as-is. Python
decodes them with fsencoding.

So I think this is the ideal solution. But this solution requires
platform-specific code in site.py.
I don't think pth files are important enough for this complexity.

A sub-optimal idea is to use UTF-8. It is the best encoding for Windows,
and most Unix systems use UTF-8 too.

Regards,

--
Inada Naoki <songof...@gmail.com>

Paul Moore

Mar 17, 2021, 5:29:07 AM
to Inada Naoki, Michał Górny, Python-Dev
On Wed, 17 Mar 2021 at 08:52, Inada Naoki <songof...@gmail.com> wrote:
> On Windows, it must be UTF-8. For example, we use `chcp 65001` in
> `activate.bat` to support unicode path.
> On Unix, raw path is bytestring. So paths can be written as-is. Python
> decode it with fsencoding.

Remember that .pth files contain executable code as well as paths, so
fsencoding is not correct for a .pth file as a whole.

> So I think this is the ideal solution. But this solution requires
> platform-specific code in the site.py.
> I don't think pth files are important enough for this complexity.

.pth files are pretty important in the packaging community. I'd
strongly support making their format and behaviour more precisely
defined.

> Sub-optimal idea is using UTF-8. It is the best encoding for Windows.
> And most Unix systems use UTF-8 too.

+1. IMO, UTF-8 is the only reasonable choice here.

The problem is with the transition - we need to find a way to deal
with existing `.pth` files, and with people using older versions of
tools (like setuptools and pipx) that write `.pth` files (so we can't
assume, for example, that Python 3.12 will never see a .pth file using
the old-style encoding).

It's worth noting that using the default encoding is the *correct* way
of writing .pth files at the moment (as that's how site.py reads them
- see https://github.com/python/cpython/blob/master/Lib/site.py#L173)
so this is technically a file format change - tools writing .pth files
will *have* to include version-specific code if they want to support
multiple versions of Python. We need to be very clear about this -
it's not just a case of "tools need to specify the encoding".

Paul
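
As a rough sketch of what "version-specific code" in a writing tool might look like, the helper below picks an encoding from the target interpreter version. The function name and the cutoff constant are hypothetical: no version had been decided in this thread, and (3, 12) is used only because it appears as an example above.

```python
import locale
import sys

# Hypothetical cutoff: first Python version assumed to read .pth files as UTF-8.
# This is an illustration only; the thread does not settle on any version.
UTF8_PTH_MIN_VERSION = (3, 12)


def pth_write_encoding(target_version=None):
    """Return the encoding a tool would use when writing a .pth file."""
    if target_version is None:
        target_version = sys.version_info[:2]
    if tuple(target_version) >= UTF8_PTH_MIN_VERSION:
        return "utf-8"
    # Older interpreters read .pth files with the locale encoding.
    return locale.getpreferredencoding(False)


print(pth_write_encoding((3, 9)), pth_write_encoding((3, 13)))
```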

Antoine Pitrou

Mar 17, 2021, 5:29:14 AM
to pytho...@python.org
On Tue, 16 Mar 2021 11:44:13 +0900
Inada Naoki <songof...@gmail.com> wrote:
> Hi, all.
>
> I found .pth file is decoded by the default (i.e. locale-specific) encoding.
> https://github.com/python/cpython/blob/0269ce87c9347542c54a653dd78b9f60bb9fa822/Lib/site.py#L173
>
> pth files contain:
>
> * import statements
> * paths
>
> For import statement, UTF-8 is the default Python code encoding.
> For paths, fsencoding is the right encoding. It is UTF-8 on Windows
> (except PYTHONLEGACYWINDOWSFSENCODING is set), and locale-specific
> encoding in Linux.
>
> What encoding should we use?
>
> * UTF-8
> * sys.getfilesystemencoding()
> * Keep status-quo.

You could add special markup to specify utf8 encoding:

# -*- encoding: utf8 -*-

If no markup is present, use locale encoding. If markup is present,
use utf8 encoding. Bail out if markup specifies something other than
utf8.

Then update all pth-producing tools to write utf8-encoded pth files
(at least on the Python versions that support the encoding markup).
In 15 years, you can switch to utf8 by default.

Regards

Antoine.
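
The three rules in this proposal could be sketched as follows. The function name is hypothetical and the marker regex is an assumption modelled on the example line above; the thread does not define an exact syntax.

```python
import locale
import re

# Assumed marker syntax, modelled on the example line in the message above.
_MARKER = re.compile(rb"^#.*?-\*-\s*(?:en)?coding:\s*([-\w.]+)\s*-\*-")


def pth_encoding_from_markup(data):
    """Return the encoding to use for the raw bytes of a .pth file."""
    first_line = data.split(b"\n", 1)[0]
    match = _MARKER.match(first_line)
    if match is None:
        # No markup present: keep using the locale encoding.
        return locale.getpreferredencoding(False)
    name = match.group(1).decode("ascii").lower().replace("-", "").replace("_", "")
    if name != "utf8":
        # Markup names anything other than UTF-8: bail out.
        raise ValueError(f"unsupported .pth encoding: {match.group(1)!r}")
    return "utf-8"


print(pth_encoding_from_markup(b"# -*- encoding: utf8 -*-\n/opt/libs\n"))
```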



Paul Moore

Mar 17, 2021, 5:55:40 AM
to Inada Naoki, Michał Górny, Python-Dev
On Wed, 17 Mar 2021 at 09:26, Paul Moore <p.f....@gmail.com> wrote:
> The problem is with the transition - we need to find a way to deal
> with existing `.pth` files, and with people using older version of
> tools (like setuptools and pipx) that write `.pth` files (so we can't
> assume, for example, that Python 3.12 will never see a .pth file using
> the old-style encoding).

Hmm, I just checked and pipx uses UTF-8 when writing .pth files. See
https://github.com/pipxproject/pipx/blob/master/src/pipx/venv.py#L176
(and lol, it was my mistake, I wrote that code -
https://github.com/pipxproject/pipx/pull/168). I'm inclined to report
that as a bug, even though it appears no-one has complained about it.
But that seems counter-productive given the context here.

Paul

Steve Dower

Mar 17, 2021, 1:33:16 PM
to Michał Górny, Inada Naoki, Brett Cannon, Python-Dev
On 3/17/2021 8:00 AM, Michał Górny wrote:
> How about writing paths as bytestrings in the long term? I think this
> should eliminate the necessity of knowing the correct encoding for
> the filesystem.

That's what we're trying to do, the problem is that they start as
strings, and so we need to convert them to a bytestring.

That conversion is the encoding ;)

And yeah, for reading, I'd use a UTF-8 reader that falls back to locale
on failure (and restarts reading the file). But for writing, we need the
tools that create these files (including Notepad!) to use the encoding
we want.

Cheers,
Steve
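
A minimal sketch of the reading strategy described above: try UTF-8 first, and on failure decode the whole file again with the locale encoding. The function name and the `fallback_encoding` parameter are hypothetical; this is not what site.py does today.

```python
import locale


def decode_pth_bytes(data, fallback_encoding=None):
    """Decode .pth bytes as UTF-8, falling back to the locale encoding."""
    if fallback_encoding is None:
        fallback_encoding = locale.getpreferredencoding(False)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Restart from the beginning with the fallback, as described above.
        return data.decode(fallback_encoding)


print(decode_pth_bytes("a\u00a3b".encode("utf-8")))
print(decode_pth_bytes(b"a\xa3b", fallback_encoding="latin-1"))
```

If the locale encoding is itself UTF-8, the fallback raises the same error, so a caller would still need to decide how to report an undecodable file.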


Stefan Ring

Mar 17, 2021, 2:10:57 PM
to Steve Dower, Michał Górny, Python-Dev
On Wed, Mar 17, 2021 at 6:37 PM Steve Dower <steve...@python.org> wrote:
>
> On 3/17/2021 8:00 AM, Michał Górny wrote:
> > How about writing paths as bytestrings in the long term? I think this
> > should eliminate the necessity of knowing the correct encoding for
> > the filesystem.
>
> That's what we're trying to do, the problem is that they start as
> strings, and so we need to convert them to a bytestring.
>
> That conversion is the encoding ;)
>
> And yeah, for reading, I'd use a UTF-8 reader that falls back to locale
> on failure (and restarts reading the file). But for writing, we need the
> tools that create these files (including Notepad!) to use the encoding
> we want.

A somewhat radical idea carrying this to the extreme would be to use
UTF-16 (LE) on Windows. After all, this _is_ the native file system
encoding, and Notepad will happily read and write it.

Steve Dower

Mar 17, 2021, 3:28:34 PM
to Stefan Ring, Michał Górny, Python-Dev
On 3/17/2021 6:08 PM, Stefan Ring wrote:
> A somewhat radical idea carrying this to the extreme would be to use
> UTF-16 (LE) on Windows. After all, this _is_ the native file system
> encoding, and Notepad will happily read and write it.

I'm not opposed to detecting a BOM by default (when no other encoding is
specified), but that won't help most UTF-8 files which these days come
with no marker at all.

I wouldn't change the default file encoding for writing though (except
to unmarked UTF-8, and only with the compatibility approach Inada is
working on). Everyone has basically come around to the idea that UTF-8
is the only needed encoding, and I'm sure if it had existed when Windows
decided to support a universal character set, it would have been chosen.
But with what we have now, UTF-16-LE is not a good choice for anything
apart from compatibility with Windows.

Cheers,
Steve
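
BOM detection of the kind mentioned above might look like the sketch below. The helper name is hypothetical, and UTF-32 is deliberately ignored to keep the example short. It returns None for unmarked files, which is exactly the case a BOM check cannot help with.

```python
import codecs

# (BOM bytes, codec that consumes that BOM while decoding)
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def encoding_from_bom(data):
    """Return a codec name if `data` starts with a known BOM, else None."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    return None


notepad_style = codecs.BOM_UTF16_LE + "a\u00a3b".encode("utf-16-le")
print(encoding_from_bom(notepad_style))
print(encoding_from_bom("a\u00a3b".encode("utf-8")))
```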


Ivan Pozdeev via Python-Dev

Mar 17, 2021, 3:43:19 PM
to pytho...@python.org
On 17.03.2021 20:30, Steve Dower wrote:
> On 3/17/2021 8:00 AM, Michał Górny wrote:
>> How about writing paths as bytestrings in the long term?  I think this
>> should eliminate the necessity of knowing the correct encoding for
>> the filesystem.
>
> That's what we're trying to do, the problem is that they start as strings, and so we need to convert them to a bytestring.
>
> That conversion is the encoding ;)
>
> And yeah, for reading, I'd use a UTF-8 reader that falls back to locale on failure (and restarts reading the file). But for writing, we
> need the tools that create these files (including Notepad!) to use the encoding we want.
>

I don't see a problem with using a file encoding specification like in Python source files.
Since site.py is under our control, we can introduce it easily.

We can opt to allow only UTF-8 here -- then we wait out a transitional period and disallow anything other than UTF-8 (then the specification
can be removed, too).

> Cheers,
> Steve

--
Regards,
Ivan


Steve Dower

Mar 17, 2021, 4:06:24 PM
to Ivan Pozdeev, pytho...@python.org
On 3/17/2021 7:34 PM, Ivan Pozdeev via Python-Dev wrote:
> On 17.03.2021 20:30, Steve Dower wrote:
>> On 3/17/2021 8:00 AM, Michał Górny wrote:
>>> How about writing paths as bytestrings in the long term?  I think this
>>> should eliminate the necessity of knowing the correct encoding for
>>> the filesystem.
>>
>> That's what we're trying to do, the problem is that they start as
>> strings, and so we need to convert them to a bytestring.
>>
>> That conversion is the encoding ;)
>>
>> And yeah, for reading, I'd use a UTF-8 reader that falls back to
>> locale on failure (and restarts reading the file). But for writing, we
>> need the tools that create these files (including Notepad!) to use the
>> encoding we want.
>>
>
> I don't see a problem with using a file encoding specification like in
> Python source files.
> Since site.py is under our control, we can introduce it easily.
>
> We can opt to allow only UTF-8 here -- then we wait out a transitional
> period and disallow anything else than UTF-8 (then the specification can
> be removed, too).

The only thing we can introduce *easily* is an error when the
(exclusively third-party) tools that create them aren't up to date.
Getting everyone to specify the encoding we want is a much bigger
problem with a much slower solution.

This particular file is probably the worst case scenario, but preferring
UTF-8 and handling existing files with a fallback is the best we can do
(especially since an assumption of UTF-8 can be invalidated on a
particular file, whereas most locale encodings cannot). Once we openly
document that it should be UTF-8, tools will have a chance to catch up,
and eventually the fallback will become harmless.

Cheers,
Steve
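
The asymmetry mentioned above (a UTF-8 assumption can be disproved by the file's bytes, while many single-byte locale encodings accept anything) can be seen in a small sketch. Latin-1 stands in here for "a locale encoding" purely as an illustration.

```python
def can_decode(data, encoding):
    """Return True if `data` decodes without error in `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


legacy = "a\u00a3b".encode("latin-1")  # b"a\xa3b"
modern = "a\u00a3b".encode("utf-8")    # b"a\xc2\xa3b"

print(can_decode(legacy, "utf-8"))    # False: UTF-8 rejects the legacy bytes
print(can_decode(modern, "latin-1"))  # True: Latin-1 accepts any byte sequence
print(modern.decode("latin-1"))       # decodes "successfully", but to the wrong text
```

This is why trying UTF-8 first is the safer order: a wrong guess fails loudly instead of silently producing mojibake.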

Ivan Pozdeev via Python-Dev

Mar 17, 2021, 6:41:01 PM
to Steve Dower, pytho...@python.org

On 17.03.2021 23:04, Steve Dower wrote:
> On 3/17/2021 7:34 PM, Ivan Pozdeev via Python-Dev wrote:
>> On 17.03.2021 20:30, Steve Dower wrote:
>>> On 3/17/2021 8:00 AM, Michał Górny wrote:
>>>> How about writing paths as bytestrings in the long term?  I think this
>>>> should eliminate the necessity of knowing the correct encoding for
>>>> the filesystem.
>>>
>>> That's what we're trying to do, the problem is that they start as strings, and so we need to convert them to a bytestring.
>>>
>>> That conversion is the encoding ;)
>>>
>>> And yeah, for reading, I'd use a UTF-8 reader that falls back to locale on failure (and restarts reading the file). But for writing, we
>>> need the tools that create these files (including Notepad!) to use the encoding we want.
>>>
>>
>> I don't see a problem with using a file encoding specification like in Python source files.
>> Since site.py is under our control, we can introduce it easily.
>>
>> We can opt to allow only UTF-8 here -- then we wait out a transitional period and disallow anything else than UTF-8 (then the
>> specification can be removed, too).
>
> The only thing we can introduce *easily* is an error when the (exclusively third-party) tools that create them aren't up to date. Getting
> everyone to specify the encoding we want is a much bigger problem with a much slower solution.

I don't see a problem with either.
If we want to standardize something, we have to encourage, then ultimately enforce compliance, one way or another.

>
> This particular file is probably the worst case scenario, but preferring UTF-8 and handling existing files with a fallback is the best we
> can do (especially since an assumption of UTF-8 can be invalidated on a particular file, whereas most locale encodings cannot). Once we
> openly document that it should be UTF-8, tools will have a chance to catch up, and eventually the fallback will become harmless.
>
> Cheers,
> Steve

--
Regards,
Ivan


Dan Stromberg

Mar 20, 2021, 1:12:57 AM
to Michał Górny, Python-Dev
On Wed, Mar 17, 2021 at 1:11 AM Michał Górny <mgo...@gentoo.org> wrote:
> On Wed, 2021-03-17 at 13:55 +0900, Inada Naoki wrote:
> > OK. setuptools doesn't specify encoding at all. So locale-specific
> > encoding is used.
> > We can not fix it in short term.
>
> How about writing paths as bytestrings in the long term?  I think this
> should eliminate the necessity of knowing the correct encoding for
> the filesystem.

On Linux and many Unixes, there is no "correct" filesystem encoding.  ASCII and UTF-8 are probably the most common encodings for individual files, maybe even large collections of files, but nevertheless, paths are bytestrings.  Treating paths as UTF-8 works fine for most files, but once in a while there'll be a filename that fails to convert, and that's not the fault of the filename.

For example, what happens if you need a file whose name contains a byte that is not valid UTF-8, such as the one created by touch "Ma$(echo | tr '\012' '\361')ana"?

For a presentation application (for example), assuming UTF-8 is probably fine, maybe even a good thing.  But for a filesystem backup tool, it's important not to assume an encoding, so you can back up and restore all filenames irrespective of what encoding the files' creators intended.
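
To illustrate the point about not assuming an encoding, here is a minimal sketch using the bytes that the shell command above would produce as a file name. A strict UTF-8 decode fails, while the `surrogateescape` error handler (which `os.fsdecode` and `os.fsencode` use on POSIX systems) round-trips the name unchanged.

```python
raw_name = b"Ma\xf1ana"  # the file name bytes from the shell command above

# A strict UTF-8 decode rejects the name ...
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ... but surrogateescape smuggles the bad byte through as a lone surrogate,
# so encoding the text again restores the original bytes exactly.
as_text = raw_name.decode("utf-8", "surrogateescape")
round_tripped = as_text.encode("utf-8", "surrogateescape")

print(strict_ok, ascii(as_text), round_tripped == raw_name)
```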
