[Python-ideas] Changing the default text encoding of pathlib


Inada Naoki

Jan 24, 2021, 9:34:17 PM
to python-ideas
My previous thread was hijacked by the "auto guessing" idea, so I am
splitting off this thread for pathlib.

Path.open() was added in Python 3.4. Path.read_text() and
Path.write_text() were added in Python 3.5.
Their history is shorter than that of the built-in open(), so changing
their default encoding should be easier than changing it for the
built-in open() and TextIOWrapper.
The proposed new default encodings are:

* read_text() default encoding is "utf-8-sig"
* write_text() default encoding is "utf-8"
* open() default encoding is "utf-8-sig" when mode is "r" or None,
"utf-8" otherwise.
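
To make the difference between the two codecs concrete, here is a minimal sketch using only the existing `encoding=` argument. The file names are made up for illustration. It shows that "utf-8-sig" strips a leading byte order mark (BOM) on read and still accepts files without one, while "utf-8" on write never adds a BOM.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    bom_file = Path(tmp) / "with_bom.txt"      # hypothetical file names
    plain_file = Path(tmp) / "plain.txt"

    # A file that starts with the UTF-8 BOM, as some Windows editors write it.
    bom_file.write_bytes(b"\xef\xbb\xbfhello")
    # Writing with "utf-8" does not add a BOM.
    plain_file.write_text("hello", encoding="utf-8")

    read_as_utf8 = bom_file.read_text(encoding="utf-8")       # BOM kept as U+FEFF
    read_as_sig = bom_file.read_text(encoding="utf-8-sig")    # BOM stripped
    plain_as_sig = plain_file.read_text(encoding="utf-8-sig") # no BOM needed
    plain_bytes = plain_file.read_bytes()

print(repr(read_as_utf8))
print(repr(read_as_sig))
print(repr(plain_as_sig))
print(plain_bytes)
```

This is why the proposal uses "utf-8-sig" only for reading: it is a superset of "utf-8" on input, but using it for output would add a BOM to every new file.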

Of course, we need a regular deprecation period.
When the encoding is omitted, they would emit a DeprecationWarning (or
an EncodingWarning, which would be a subclass of DeprecationWarning) for
three versions (Python 3.10 to 3.12).
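
The deprecation step could behave roughly like the sketch below. This is an assumption about the shape of the change, not pathlib's implementation: `read_text_transitional` is a hypothetical stand-in for `Path.read_text()`, the message text is invented, and it uses plain `DeprecationWarning` rather than a dedicated warning class.

```python
import tempfile
import warnings
from pathlib import Path


def read_text_transitional(path, encoding=None, errors=None):
    """Hypothetical stand-in for Path.read_text() during the deprecation period."""
    if encoding is None:
        # Behaviour is unchanged for now; the caller is only told to be explicit.
        warnings.warn(
            "read_text() without an explicit encoding will default to "
            "'utf-8-sig' in a future version; pass encoding= explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
    return Path(path).read_text(encoding=encoding, errors=errors)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("hello", encoding="utf-8")

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        read_text_transitional(sample)                    # encoding omitted: warns
        read_text_transitional(sample, encoding="utf-8")  # explicit: silent

deprecations = [w for w in caught if issubclass(w.category, DeprecationWarning)]
print(len(deprecations), "deprecation warning(s) recorded")
```

Code that already passes `encoding=` is unaffected at every stage, which is the main reason to add the argument now.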

What do you think of this idea?
Should we "change all at once" rather than "step-by-step"?

Regards,
--
Inada Naoki <songof...@gmail.com>
_______________________________________________
Python-ideas mailing list -- python...@python.org
To unsubscribe send an email to python-id...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python...@python.org/message/J5VR56YRXA3PVPUH3KM72OX7SUBAZUKL/
Code of Conduct: http://python.org/psf/codeofconduct/

Christopher Barker

Jan 25, 2021, 3:02:11 PM
to Inada Naoki, python-ideas
On Sun, Jan 24, 2021 at 6:33 PM Inada Naoki <songof...@gmail.com> wrote:
> My previous thread is hijacked about "auto guessing" idea,

Yes -- I'm a bit confused by that. Are folks advocating for making some sort of encoding detection the default, or for making it available as an option in the stdlib? In any case, I think that could be an independent proposal.

First: I really want to see this get pushed forward and get done, one way or another -- using a system setting as a default is a really bad idea in this day of interconnected computers.

But back to PEP 597, and how to get there:

1) We need to start with a consensus about where we want Python to be in N versions. That is not specifically laid out in the PEP but it does imply that in the sometime-long-in-the-future:

- TextIOWrapper will have utf-8 as the default, rather than `locale.getpreferredencoding(False)`.
  This behaviour will then be inherited by:
  - `open()` without a binary flag in the mode
  - `Path.read_text`
  - (and any other utility functions that use TextIOWrapper)
- There will be a string that can be passed to encoding that will indicate that the system default should be used.
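
Until such a string exists, code that really wants the system default can already say so by passing the value of `locale.getpreferredencoding(False)` itself. A minimal sketch, with a made-up file name:

```python
import locale
import tempfile
from pathlib import Path

# The encoding that text-mode open() falls back to when none is given.
system_encoding = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "legacy.txt"  # hypothetical file name
    # Explicitly ask for the system encoding instead of relying on the default.
    path.write_text("plain ASCII text", encoding=system_encoding)
    round_trip = path.read_text(encoding=system_encoding)

print(system_encoding, repr(round_trip))
```

Code written this way keeps its meaning whichever default TextIOWrapper ends up with.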

Forgive me if there is already a consensus on this -- but this discussion has brought up some thoughts.

1) As TextIOWrapper is an "implementation detail" for most Python developers, maybe it shouldn't have a default encoding at all, and should leave the choice of default up to the helper functions, like open() and Path.read_text() -- that would mean changes in more places, but would allow different utility functions to make different choices.

2) Inada proposed that an open_text() function be introduced as a stepping stone, with the new behaviour. This led to one person asking if that would imply an open_binary() function as well. An answer to that was no -- as no one is suggesting any changes to open()'s behavior for binary files.
However, I kind of like the idea. We now have two (at least) different file objects potentially returned by open(): TextIOWrapper, and BufferedReader/Writer. And the TextIOWrapper has some pretty different behavior. I *think* that in virtually all cases, when the code is written, the author knows whether they want a binary or text file, so it may make sense to have two different open() functions, rather than having the Type returned be a function of what mode flags are passed.

This would make it easier for people (and tools) to reason about the code with static analysis:

e.g.:

open_text().read() would return a string
open_binary().read() would return bytes
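
A rough sketch of what such a pair could look like, purely as an illustration: `open_text` and `open_binary` do not exist in the stdlib, and the signatures and the "utf-8" default below are assumptions. The point is that each function has a single return type, so a type checker does not have to inspect the mode string.

```python
import tempfile
from pathlib import Path
from typing import BinaryIO, TextIO


def open_text(path, mode="r", encoding="utf-8", **kwargs) -> TextIO:
    """Hypothetical helper: always returns a text file object."""
    if "b" in mode:
        raise ValueError("open_text() does not accept a binary mode")
    return open(path, mode, encoding=encoding, **kwargs)


def open_binary(path, mode="r", **kwargs) -> BinaryIO:
    """Hypothetical helper: always returns a binary file object."""
    if "t" in mode:
        raise ValueError("open_binary() does not accept a text mode")
    if "b" not in mode:
        mode += "b"  # "r" -> "rb", "w" -> "wb", "r+" -> "r+b"
    return open(path, mode, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.txt"
    with open_text(path, "w") as f:
        f.write("héllo")
    with open_text(path) as f:
        as_text = f.read()    # str
    with open_binary(path) as f:
        as_bytes = f.read()   # bytes

print(type(as_text).__name__, type(as_bytes).__name__)
```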

This would also make the path to a future with different defaults smoother -- plain "open" gets deprecated -- any new code uses one of the open_* functions, and that new code will never need to be changed again.

Back in the day, a single open() function made more sense. After all, the only difference in the result for binary mode was that linefeed translation was turned off (and the C legacy of course). In fact, this did lead to errors, when folks accidentally left off the 'b', and tested only on *nix systems. That, at least, is less of an issue now; as the text and binary objects are more different, you are far more likely to get errors right away -- but still at run time -- static analysis is still tricky.


On to:

> Path.open() was added in Python 3.4. Path.read_text() and
> Path.write_text() was added in Python 3.5.
> Their history is shorter than built-in open(). Changing its default
> encoding should be easier than built-in open and TextIOWrapper.
> New default encodings are:
>
> * read_text() default encoding is "utf-8-sig"
> * write_text() default encoding is "utf-8"
> * open() default encoding is "utf-8-sig" when mode is "r" or None,
> "utf-8" otherwise.
>
> How do you think this idea?

+1 there is a lot less legacy with Path -- we can move faster. And I honestly still wonder if making utf-8 the default will cause or fix more bugs :-)

A thought on that -- there are currently two kinds of code "in the wild":
 (A) code that uses the default, when they really want utf-8 -- currently a bug, won't be a bug in the future.
 (B) code that uses the default when it really does want the system encoding. -- currently correct, will become a bug in the future

It's anyone's guess which of these is more common, but one thing to consider is that (A) is a hidden bug that might reveal itself in the hands of end users at some unknown point in the future, whereas (B) is a bug that is likely to reveal itself fairly quickly (though perhaps also in the (confused) hands of end users).
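
The asymmetry between the two cases can be seen without touching any files. In this sketch cp1252 and cp932 are only example legacy encodings, chosen as assumptions for the demonstration; the actual system encoding depends on the machine.

```python
# (A) UTF-8 data decoded with a legacy single-byte default (cp1252 here):
# no exception is raised, the text is just silently wrong.
utf8_bytes = "café".encode("utf-8")
mojibake = utf8_bytes.decode("cp1252")

# (B) legacy-encoded data (cp932 here) decoded as UTF-8:
# the bytes are not valid UTF-8, so decoding fails immediately.
cp932_bytes = "日本語".encode("cp932")
try:
    cp932_bytes.decode("utf-8")
    failed_loudly = False
except UnicodeDecodeError:
    failed_loudly = True

print(repr(mojibake), failed_loudly)
```

Not every legacy-encoded file fails this loudly: pure ASCII content decodes identically either way, and some byte sequences happen to be valid UTF-8.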

-Chris B

--
Christopher Barker, PhD (Chris)

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython

Paul Moore

Jan 25, 2021, 3:44:32 PM
to Christopher Barker, python-ideas
On Mon, 25 Jan 2021 at 20:02, Christopher Barker <pyth...@gmail.com> wrote:
> using a system setting as a default is a really bad idea in this day of interconnected computers.

I'd mildly dispute this. There are (significant) downsides with the
default behaviour being system-dependent, yes, but there are *also*
disadvantages in having Python not behave consistently with other
tools/programs on the same system.

However, on POSIX, things are generally consistent, and *already*
default to UTF-8. So the proposal is mostly going to affect Windows.
And on Windows, there's not much consistency even on a single machine
at the moment. Between OEM and ANSI codepages, and other tools that
default to UTF-8 "because that's the future", there's not much
platform consistency for Python to conform to anyway...
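
Whether a given machine would be affected can be checked directly. A small sketch using only stdlib calls; the variable names are illustrative:

```python
import codecs
import locale
import sys

# The encoding text-mode open() currently uses when none is given.
preferred = locale.getpreferredencoding(False)

try:
    # Normalise spellings such as "UTF-8", "utf8" and "UTF8" to one codec name.
    codec_name = codecs.lookup(preferred).name
except LookupError:
    codec_name = preferred.lower()

default_is_utf8 = codec_name == "utf-8"

print(f"platform:           {sys.platform}")
print(f"preferred encoding: {preferred}")
print(f"already UTF-8:      {default_is_utf8}")
```

On a system where the last line prints True, changing Python's default would make no observable difference.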

> But back to PEP 597, and how to get there:
>
> 1) We need to start with a consensus about where we want Python to be in N versions. That is not specifically laid out in the PEP but it does imply that in the sometime-long-in-the-future:
>
> - TextIOWrapper will have utf-8 as the default, rather than `locale.getpreferredencoding(False)`
> this behaviour will then be inherited by:
> - `open()` without a binary flag in the mode
>
> - `Path.read_text`
> - there will be a string that can be passed to encoding that will indicate that the system default should be used.
>
> (and any other utility functions that use TextIOWrapper)
>
> Forgive me if there is already a consensus on this -- but this discussion has brought up some thoughts.

There's a fundamental assumption here that I think needs to be made
explicit. Which is that we're assuming that whatever N happens to be,
we anticipate that `locale.getpreferredencoding(False)` will still be
something other than UTF-8. That's *already* false on most POSIX
systems, and TBH I get the impression that Microsoft is pushing quite
hard to move Windows 10 to a UTF-8 by default position (although
"fast" in Microsoft terms may still be slow to the rest of us ;-))

So I think that the real question here is "do we want to move Python
to 'UTF8-by-default' faster than the OS vendors are going?" And I think
that the answer to that is much less obvious. It probably also depends
heavily on your locale - I doubt it's an accident that Inada-san¹ is
proposing this, and he's from Japan :-) Personally, as an English
speaker based in the UK, I'll be happy when UTF-8 is the default
everywhere, but I can live with the status quo until that happens. But
I'm not the main target for this change.

> 1) As TextIOWrapper is an "implementation detail" for most Python developers, maybe it shouldn't have a default encoding at all, and leave the default implementation(s) up to the helper functions, like open() and Path.read_text() -- that would mean changes in more places, but would allow different utility functions to make different choices.

*shrug*. That sounds plausible, but it's a backward compatibility
break that doesn't offer any significant benefits, so I suspect it's
not worth doing in practice.

> 2) Inada proposed an open_text() function be introduced as a stepping stone, with the new behaviour. This led to one person asking if that would imply a open_binary() function as well. An answer to that was no -- as no one is suggesting any changes to open()'s behavior for binary files.
> However, I kind of like the idea. We now have two (at least) different file objects potentially returned by open(): TextIOWrapper, and BufferedReader/Writer. And the TextIOWrapper has some pretty different behavior. I *think* that in virtually all cases, when the code is written, the author knows whether they want a binary or text file, so it may make sense to have two different open() functions, rather than having the Type returned be a function of what mode flags are passed.
>
> This would make it easier for people (and tools) to reason about the code with static analysis:
>
> e.g.:
>
> open_text().read() would return a string
> open_binary().read() would return bytes

These are good arguments for having explicit open_text and open_binary
functions. I don't *like* the idea, because they feel unnecessarily
verbose to me, but I can accept that this might just be because I'm
used to open().

I do think that having open_text, but *not* having open_binary, would
be a bit confusing. Particularly as pathlib has read_text and
read_bytes, so it would be inconsistent as well.

> This would also make the path to a future with different defaults smoother -- plain "open" gets deprecated -- any new code uses one of the open_* functions, and that new code will never need to be changed again.
>
> Back in the day, a single open() function made more sense. After all, the only difference in the result for binary mode was that linefeed translation was turned off (and the C legacy of course). In fact, this did lead to errors, when folks accidentally left off the 'b', and tested only on *nix systems. That, at least, is less of an issue now; as the text and binary objects are more different, you are far more likely to get errors right away -- but still at run time -- static analysis is still tricky.

This, on the other hand, I'm unequivocally against. The sheer quantity
of breakage that would be caused by deprecating open() makes this a
complete non-starter. Even if we only "deprecate in documentation",
we'd be invalidating huge amounts of advice, books and training
materials.

> On to:
>
>> Path.open() was added in Python 3.4. Path.read_text() and
>> Path.write_text() was added in Python 3.5.
>> Their history is shorter than built-in open(). Changing its default
>> encoding should be easier than built-in open and TextIOWrapper.
>> New default encodings are:
>>
>> * read_text() default encoding is "utf-8-sig"
>> * write_text() default encoding is "utf-8"
>> * open() default encoding is "utf-8-sig" when mode is "r" or None,
>> "utf-8" otherwise.
>
>> How do you think this idea?
>
> +1 there is a lot less legacy with Path -- we can move faster. And I honestly still wonder if making utf-8 the default with cause or fix more bugs :-)

But having open(filename) do something different than
Path(filename).open() seems like it's asking for trouble. It would be
a source of a lot of unexpected bugs for people migrating from
filenames as strings to pathlib, and the *last* thing you want during
a migration is having to track down unexpected behavioural differences
you hadn't planned for.
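
For reference, the two spellings currently pick the same default, which is the invariant a pathlib-only change would break. A minimal sketch that compares the encodings actually chosen (the file name is made up):

```python
import codecs
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    filename = str(Path(tmp) / "example.txt")  # hypothetical file name
    Path(filename).write_text("hello", encoding="utf-8")

    # Neither call passes encoding=, so each reports its own default.
    with open(filename) as f:
        builtin_default = f.encoding
    with Path(filename).open() as f:
        pathlib_default = f.encoding

# Compare codec names rather than raw strings, in case the spelling differs.
same_default = (
    codecs.lookup(builtin_default).name == codecs.lookup(pathlib_default).name
)
print(builtin_default, pathlib_default, same_default)
```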

> A thought on that -- there is currently both kinds of code "in the wild":
> (A) code that uses the default, when they really want utf-8 -- currently a bug, won't be a bug in the future.
> (B) code that uses the default when it really does want the system encoding. -- currently correct, will become a bug in the future
>
> It's anyone's guess which of these is more common, but one thing to consider is that (A) is a hidden bug that might reveal itself in the hands of end users who knows when in the future. Whereas (B) will be a bug that is likely to reveal itself fairly quickly (though perhaps also in the (confused) hands of end users as well)

There's also (C) code that uses the default, where that default is
already UTF-8. Which is probably most non-Windows systems. Those have
no bug, and this change will make no difference to them.

Also, (A) is "currently a bug, won't be a bug when the system encoding
switches to UTF-8", whereas (B) is "currently correct, will remain
correct when the system default becomes UTF-8". So switching Python's
default can be seen as:

(A) removes an existing bug a bit sooner.
(B) introduces a bug which will go away again when the system switches
to UTF-8 or the user changes their code.
(C) makes no difference.

Frankly, I don't think there's a good answer here, and there will
likely be as many opinions as there are participants in the
discussion.

Paul

¹ I'm not 100% clear on what the polite form of address is for
Japanese names, please let me know if I should be using a different
form :-)

Message archived at https://mail.python.org/archives/list/python...@python.org/message/VKDWSFDU4WTP3BTPO3LQKVQQDKGOPWDU/