[Python-ideas] Fix default encodings on Windows


Steve Dower

Aug 10, 2016, 2:16:56 PM
to python...@python.org
I suspect there's a lot of discussion to be had around this topic, so I
want to get it started. There are some fairly drastic ideas here and I
need help figuring out whether the impact outweighs the value.

Some background: within the Windows API, the preferred encoding is
UTF-16. This is a 16-bit format that is typed as wchar_t in the APIs
that use it. These APIs are generally referred to as the *W APIs
(because they have a W suffix).

There are also (broadly deprecated) APIs that use an 8-bit format
(char), where the encoding is assumed to be "the user's active code
page". These are *A APIs. AFAIK, there are no cases where a *A API
should be preferred over a *W API, and many newer APIs are *W only.

In general, Python passes byte strings into the *A APIs and text strings
into the *W APIs.

Right now, sys.getfilesystemencoding() on Windows returns "mbcs", which
translates to "the system's active code page". As this encoding
generally cannot represent all paths on Windows, it is deprecated and
Unicode strings are recommended instead. This, however, means you need
to write significantly different code between POSIX (use bytes) and
Windows (use text).

ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and
updating path_converter() (Python/posixmodule.c; likely similar code in
other places) to decode incoming byte strings would allow us to
undeprecate byte strings and add the requirement that they *must* be
encoded with sys.getfilesystemencoding(). I assume that this would allow
cross-platform code to handle paths similarly by encoding to whatever
the sys module says they should and using bytes consistently (starting
this thread is meant to validate/refute my assumption).
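
For example, this is the kind of cross-platform pattern I'd like to
enable (a rough sketch using the existing os.fsencode/os.fsdecode
helpers; note that today the byte-path calls still emit a
DeprecationWarning on Windows):

import os

def normalize_path(path):
    # os.fsencode wraps sys.getfilesystemencoding(), which under this
    # proposal would be UTF-8 on Windows just as it usually is on POSIX.
    return os.fsencode(path) if isinstance(path, str) else path

p = normalize_path("data/café.txt")
os.makedirs(b"data", exist_ok=True)
with open(p, "w", encoding="utf-8") as f:   # same bytes-based code on both platforms
    f.write("hello\n")
print(os.listdir(b"data"))                  # byte filenames come back out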

(Yes, I know that people on POSIX should just change to using Unicode
and surrogateescape. Unfortunately, rather than doing that they complain
about Windows and drop support for the platform. If you want to keep
hitting them with the stick, go ahead, but I'm inclined to think the
carrot is more valuable here.)

Similarly, locale.getpreferredencoding() on Windows returns a legacy
value - the user's active code page - which should generally not be used
for any reason. The one exception is as a default encoding for opening
files when no other information is available (e.g. a Unicode BOM or
explicit encoding argument). BOMs are very common on Windows, since the
default assumption is nearly always a bad idea.

Making open()'s default encoding detect a BOM before falling back to
locale.getpreferredencoding() would resolve many issues, but I'm also
inclined towards making the fallback utf-8, leaving
locale.getpreferredencoding() solely as a way to get the active system
codepage (with suitable warnings about it only being useful for
back-compat). This would match the behavior that the .NET Framework has
used for many years - effectively, utf_8_sig on read and utf_8 on write.
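
In other words, the default would behave as if you had spelled it out
explicitly (just a sketch of the intended behaviour, using encodings
that already exist today):

# Reading: utf-8-sig is a plain UTF-8 decoder that also silently strips
# a UTF-8 BOM if one happens to be present.
with open("notes.txt", "r", encoding="utf-8-sig") as f:
    text = f.read()

# Writing: plain utf-8, so no BOM is prepended to new files.
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write(text)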

Finally, the encodings of stdin, stdout and stderr are currently
(correctly) inferred from the encoding of the console window that Python
is attached to. However, this is typically a codepage that is different
from the system codepage (i.e. it's not mbcs) and is almost certainly
not Unicode. If users are starting Python from a console, they can use
"chcp 65001" first to switch to UTF-8, and then *most* functionality
works (input() has some issues, but those can be fixed with a slight
rewrite and possibly breaking readline hooks).

It is also possible for Python to change the current console encoding to
be UTF-8 on initialize and change it back on finalize. (This would leave
the console in an unexpected state if Python segfaults, but console
encoding is probably the least of anyone's worries at that point.) So
I'm proposing actively changing the current console to be Unicode while
Python is running, and hence sys.std[in|out|err] will default to utf-8.
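
Concretely, the initialize/finalize dance amounts to something like
this (a rough ctypes sketch of the idea rather than the real
implementation, and it assumes a console is actually attached):

import atexit
import ctypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
CP_UTF8 = 65001

# Remember what the console started with so it can be restored.
old_in = kernel32.GetConsoleCP()
old_out = kernel32.GetConsoleOutputCP()

kernel32.SetConsoleCP(CP_UTF8)
kernel32.SetConsoleOutputCP(CP_UTF8)

@atexit.register
def _restore_console_codepages():
    # Only runs on a clean exit; a segfault would leave UTF-8 active,
    # which is the caveat mentioned above.
    kernel32.SetConsoleCP(old_in)
    kernel32.SetConsoleOutputCP(old_out)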

So that's a broad range of changes, and I have little hope of figuring
out all the possible issues, back-compat risks, and flow-on effects on
my own. Please let me know (either on-list or off-list) how a change
like this would affect your projects, either positively or negatively,
and whether you have any specific experience with these changes/fixes
and think they should be approached differently.


To summarise the proposals (remembering that these would only affect
Python 3.6 on Windows):

* change sys.getfilesystemencoding() to return 'utf-8'
* automatically decode byte paths assuming they are utf-8
* remove the deprecation warning on byte paths
* make the default open() encoding check for a BOM or else use utf-8
* [ALTERNATIVE] make the default open() encoding check for a BOM or else
use locale.getpreferredencoding()
* force the console encoding to UTF-8 on initialize and revert on finalize

So what are your concerns? Suggestions?

Thanks,
Steve
_______________________________________________
Python-ideas mailing list
Python...@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Paul Moore

Aug 10, 2016, 2:44:49 PM
to Steve Dower, Python-Ideas
On 10 August 2016 at 19:10, Steve Dower <steve...@python.org> wrote:
> To summarise the proposals (remembering that these would only affect Python
> 3.6 on Windows):
>
> * change sys.getfilesystemencoding() to return 'utf-8'
> * automatically decode byte paths assuming they are utf-8
> * remove the deprecation warning on byte paths
> * make the default open() encoding check for a BOM or else use utf-8
> * [ALTERNATIVE] make the default open() encoding check for a BOM or else use
> locale.getpreferredencoding()
> * force the console encoding to UTF-8 on initialize and revert on finalize
>
> So what are your concerns? Suggestions?

I presume you'd be targeting 3.7 for this change. Broadly, I'm +1 on
all of this. Personally, I'm moving to UTF-8 everywhere, so it seems
OK to me, but I suspect defaulting open() to UTF-8 in the absence of a
BOM might cause issues for people. Most text editors still (AFAIK) use
the ANSI codepage by default, and it's the one place where an
identifying BOM isn't possible. So your alternative may be a safer
choice. On the other hand, files from Unix (via say github) would
typically be UTF-8 without BOM, so it becomes a question of choosing
the best compromise. I'm inclined to go for cross-platform and UTF-8
and clearly document the change. We might want a more convenient short
form for open(filename, "r", encoding=locale.getpreferredencoding()),
though, to ease the transition... We'd also need to consider how the
new default encoding would interact with PYTHONIOENCODING.

For the console, does this mean that the win_unicode_console module
will no longer be needed when these changes go in?

Sorry, not much in the way of direct experience or information I can
add, but a strong +1 on the change (and I'd be happy to help where
needed).

Paul

Random832

Aug 10, 2016, 2:47:12 PM
to python...@python.org
On Wed, Aug 10, 2016, at 14:10, Steve Dower wrote:
> To summarise the proposals (remembering that these would only affect
> Python 3.6 on Windows):
>
> * change sys.getfilesystemencoding() to return 'utf-8'
> * automatically decode byte paths assuming they are utf-8
> * remove the deprecation warning on byte paths

Why? What's the use case?

> * make the default open() encoding check for a BOM or else use utf-8
> * [ALTERNATIVE] make the default open() encoding check for a BOM or else
> use locale.getpreferredencoding()

For reading, I assume. When opened for writing, it should probably be
utf-8-sig [if it's not mbcs] to match what Notepad does. What about
files opened for appending or updating? In theory it could ingest the
whole file to see if it's valid UTF-8, but that has a time cost.
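
(Something along these lines, say - a naive sketch whose cost obviously
grows with the size of the file:)

def looks_like_utf8(path):
    try:
        with open(path, 'rb') as f:
            f.read().decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True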

Notepad, if there's no BOM, checks the first 256 bytes of the file for
whether it's likely to be utf-16 or mbcs [utf-8 isn't considered AFAIK],
and can get it wrong for certain very short files [i.e. the infamous
"this app can break"]

What to do on opening a pipe or device? [Is os.fstat able to detect
these cases?]

Maybe the BOM detection phase should be deferred until the first read.
What should encoding be at that point if this is done? Is there a
"utf-any" encoding that can handle all five BOMs? If not, should there
be? how are "utf-16" and "utf-32" files opened for appending or updating
handled today?

> * force the console encoding to UTF-8 on initialize and revert on
> finalize

Why not implement a true unicode console? What if sys.stdin/stdout are
pipes (or non-console devices such as a serial port)?

Steve Dower

Aug 10, 2016, 3:09:40 PM
to Paul Moore, Python-Ideas
On 10Aug2016 1144, Paul Moore wrote:
> I presume you'd be targeting 3.7 for this change.

Does 3.6 seem too aggressive? I think I have time to implement the
changes before beta 1, as it's mostly changing default values and
mopping up resulting breaks. (Doing something like reimplementing files
using the Win32 API rather than the CRT would be too big a task for 3.6.)

> Most text editors still (AFAIK) use
> the ANSI codepage by default, and it's the one place where an
> identifying BOM isn't possible. So your alternative may be a safer
> choice. On the other hand, files from Unix (via say github) would
> typically be UTF-8 without BOM, so it becomes a question of choosing
> the best compromise. I'm inclined to go for cross-platform and UTF-8
> and clearly document the change.

That last point was my thinking. Notepad's default is just as bad as
Python's default right now, but basically everyone acknowledges that
it's bad. I don't think we should prevent Python from behaving better
because one Windows tool doesn't.

> We might want a more convenient short
> form for open(filename, "r", encoding=sys.getpreferredencoding()),
> though, to ease the transition... We'd also need to consider how the
> new default encoding would interact with PYTHONIOENCODING.

PYTHONIOENCODING doesn't affect locale.getpreferredencoding() (but it
does affect sys.std*.encoding).

> For the console, does this mean that the win_unicode_console module
> will no longer be needed when these changes go in?

That's the hope, though that module approaches the solution differently
and may still have its uses. An alternative way for us to fix this whole thing
would be to bring win_unicode_console into the standard library and use
it by default (or probably whenever PYTHONIOENCODING is not specified).

> Sorry, not much in the way of direct experience or information I can
> add, but a strong +1 on the change (and I'd be happy to help where
> needed).

Testing with obscure filenames and strings is where help will be needed
most :)

Cheers,
Steve

Steve Dower

Aug 10, 2016, 3:23:15 PM
to Random832, python...@python.org
On 10Aug2016 1146, Random832 wrote:
> On Wed, Aug 10, 2016, at 14:10, Steve Dower wrote:
>> To summarise the proposals (remembering that these would only affect
>> Python 3.6 on Windows):
>>
>> * change sys.getfilesystemencoding() to return 'utf-8'
>> * automatically decode byte paths assuming they are utf-8
>> * remove the deprecation warning on byte paths
>
> Why? What's the use case?

Allowing library developers who support POSIX and Windows to just use
bytes everywhere to represent paths.

>> * make the default open() encoding check for a BOM or else use utf-8
>> * [ALTERNATIVE] make the default open() encoding check for a BOM or else
>> use sys.getpreferredencoding()
>
> For reading, I assume. When opened for writing, it should probably be
> utf-8-sig [if it's not mbcs] to match what Notepad does. What about
> files opened for appending or updating? In theory it could ingest the
> whole file to see if it's valid UTF-8, but that has a time cost.

Writing out the BOM automatically basically makes your files
incompatible with other platforms, which rarely expect a BOM. By
omitting it but writing and reading UTF-8 we ensure that Python can
handle its own files on any platform, while potentially upsetting some
older applications on Windows or platforms that don't assume UTF-8 as a
default.

> Notepad, if there's no BOM, checks the first 256 bytes of the file for
> whether it's likely to be utf-16 or mbcs [utf-8 isn't considered AFAIK],
> and can get it wrong for certain very short files [i.e. the infamous
> "this app can break"]

Yeah, this is a pretty horrible idea :) I don't want to go there by
default, but people can install chardet if they want the functionality.

> What to do on opening a pipe or device? [Is os.fstat able to detect
> these cases?]

We should be able to detect them, but why treat them any differently
from a file? Right now they're just as broken as they will be after the
change if you aren't specifying 'b' or an encoding - probably more
broken, since at least you'll get fewer encoding errors when the encoding
is UTF-8.

> Maybe the BOM detection phase should be deferred until the first read.
> What should encoding be at that point if this is done? Is there a
> "utf-any" encoding that can handle all five BOMs? If not, should there
> be? how are "utf-16" and "utf-32" files opened for appending or updating
> handled today?

Yes, I think it would be. I suspect we'd have to leave the encoding
unknown until the first read, and perhaps force it to utf-8-sig if
someone asks before we start. I don't *think* this is any less
predictable than the current behaviour, given it only applies when
you've left out any encoding specification, but maybe it is.

It probably also entails opening the file descriptor in bytes mode,
which might break programs that pass the fd directly to CRT functions.
Personally I wish they wouldn't, but it's too late to stop them now.

>> * force the console encoding to UTF-8 on initialize and revert on
>> finalize
>
> Why not implement a true unicode console? What if sys.stdin/stdout are
> pipes (or non-console devices such as a serial port)?

Mostly because it's much more work. As I mentioned in my other post, an
alternative would be to bring win_unicode_console into the stdlib and
enable it by default (which considering the package was largely
developed on bugs.p.o is probably okay, but we'd probably need to
rewrite it in C, which is basically implementing a true Unicode console).

You're right that changing the console encoding after launching Python
is probably going to mess with pipes. We can detect whether the streams
are interactive or not and adjust accordingly, but that's going to get
messy if you're only piping in/out and stdin/out end up with different
encodings. I'll put some more thought into this part.

Thanks,
Steve

Paul Moore

Aug 10, 2016, 3:24:13 PM
to Steve Dower, Python-Ideas
On 10 August 2016 at 20:08, Steve Dower <steve...@python.org> wrote:
> On 10Aug2016 1144, Paul Moore wrote:
>>
>> I presume you'd be targeting 3.7 for this change.
>
> Does 3.6 seem too aggressive? I think I have time to implement the changes
> before beta 1, as it's mostly changing default values and mopping up
> resulting breaks. (Doing something like reimplementing files using the Win32
> API rather than the CRT would be too big a task for 3.6.)

I guess I just assumed it was a bigger change than that. I don't
object to it going into 3.6 as such, but it might need longer for any
debates to die down. I guess that comes down to how big this thread
gets, though.

Personally, I'd be OK with it being in 3.6, we'll see if others think
it's too aggressive :-)

> Testing with obscure filenames and strings is where help will be needed most
> :)

I'm happy to invent hard cases for you, but I'm in the UK. For real
use, the Euro symbol's about as obscure as we get around here ;-)

Paul

Random832

Aug 10, 2016, 3:27:34 PM
to python...@python.org
On Wed, Aug 10, 2016, at 15:08, Steve Dower wrote:
> Testing with obscure filenames and strings is where help will be needed
> most :)

How about filenames with invalid surrogates? For added fun, consider
that the file system encoding is normally used with surrogateescape.

Steve Dower

Aug 10, 2016, 3:40:11 PM
to python...@python.org
On 10Aug2016 1226, Random832 wrote:
> On Wed, Aug 10, 2016, at 15:08, Steve Dower wrote:
>> Testing with obscure filenames and strings is where help will be needed
>> most :)
>
> How about filenames with invalid surrogates? For added fun, consider
> that the file system encoding is normally used with surrogateescape.

This is where it gets extra fun, since surrogateescape is not normally
used on Windows because we receive paths as Unicode text and pass them
back as Unicode text without ever encoding or decoding them.

Currently a broken filename (such as '\udee1.txt') can be correctly seen
with os.listdir('.') but not os.listdir(b'.') (because Windows will
return it as '?.txt'). It can be passed to open(), but encoding the name
to utf-8 or utf-16 fails, and I doubt there's any encoding that is going
to succeed.

As far as I can tell, if you get a weird name in bytes today you are
broken, and there is no way to be unbroken without doing the actual
right thing and converting paths on POSIX into Unicode with
surrogateescape. So our official advice has to stay the same - treating
paths as text with smuggled bytes is the *only* way to be truly correct.
But unless we also deprecate byte paths on POSIX, we'll never get there.
(Now there's a dangerous idea ;) )
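
For reference, the smuggling itself is trivial on POSIX (a tiny
illustration, nothing new):

raw = b'caf\xe9.txt'                      # Latin-1 bytes, not valid UTF-8
name = raw.decode('utf-8', 'surrogateescape')
print(ascii(name))                        # 'caf\udce9.txt' - the bad byte is smuggled
assert name.encode('utf-8', 'surrogateescape') == raw   # and round-trips losslessly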

Cheers,
Steve

Random832

Aug 10, 2016, 4:09:53 PM
to Steve Dower, python...@python.org
On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
> > Why? What's the use case? [byte paths]
>
> Allowing library developers who support POSIX and Windows to just use
> bytes everywhere to represent paths.

Okay, how is that use case impacted by it being mbcs instead of utf-8?

What about only doing the deprecation warning if non-ascii bytes are
present in the value?

> > For reading, I assume. When opened for writing, it should probably be
> > utf-8-sig [if it's not mbcs] to match what Notepad does. What about
> > files opened for appending or updating? In theory it could ingest the
> > whole file to see if it's valid UTF-8, but that has a time cost.
>
> Writing out the BOM automatically basically makes your files
> incompatible with other platforms, which rarely expect a BOM.

Yes but you're not running on other platforms, you're running on the
platform you're running on. If files need to be moved between platforms,
converting files with a BOM to without ought to be the responsibility of
the same tool that converts CRLF line endings to LF.

> By
> omitting it but writing and reading UTF-8 we ensure that Python can
> handle its own files on any platform, while potentially upsetting some
> older applications on Windows or platforms that don't assume UTF-8 as a
> default.

Okay, you haven't addressed updating and appending. I realized after
posting that updating should be in binary, but that leaves appending.
Should we detect BOMs and/or attempt to detect the encoding by other
means in those cases?

> > Notepad, if there's no BOM, checks the first 256 bytes of the file for
> > whether it's likely to be utf-16 or mbcs [utf-8 isn't considered AFAIK],
> > and can get it wrong for certain very short files [i.e. the infamous
> > "this app can break"]
>
> Yeah, this is a pretty horrible idea :)

Eh, maybe the utf-16 because it can give some hilariously bad results,
but using it to differentiate between utf-8 and mbcs might not be so
bad. But what to do if all we see is ascii?

> > What to do on opening a pipe or device? [Is os.fstat able to detect
> > these cases?]
>
> We should be able to detect them, but why treat them any differently
> from a file?

Eh, I was mainly concerned about if the first few bytes aren't a BOM?
What about blocking waiting for them? But if this is delayed until the
first read then it's fine.

> It probably also entails opening the file descriptor in bytes mode,
> which might break programs that pass the fd directly to CRT functions.
> Personally I wish they wouldn't, but it's too late to stop them now.

The only thing O_TEXT does rather than O_BINARY is convert CRLF line
endings (and maybe end on ^Z), and I don't think we even expose the
constants for the CRT's unicode modes.
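
(We do expose os.O_TEXT, os.O_BINARY and msvcrt.setmode, though. A
Windows-only sketch of the one translation O_TEXT buys you:)

import os

fd = os.open('crlf-demo.txt', os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_TEXT)
os.write(fd, b'one\ntwo\n')               # CRT text mode expands \n to \r\n
os.close(fd)
print(os.path.getsize('crlf-demo.txt'))   # 10 rather than 8, if the CRT translated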

eryk sun

Aug 10, 2016, 4:18:26 PM
to python...@python.org
On Wed, Aug 10, 2016 at 6:10 PM, Steve Dower <steve...@python.org> wrote:
> Similarly, locale.getpreferredencoding() on Windows returns a legacy value -
> the user's active code page - which should generally not be used for any
> reason. The one exception is as a default encoding for opening files when no
> other information is available (e.g. a Unicode BOM or explicit encoding
> argument). BOMs are very common on Windows, since the default assumption is
> nearly always a bad idea.

The CRT doesn't allow UTF-8 as a locale encoding because Windows
itself doesn't allow this. So locale.getpreferredencoding() can't
change, but in practice it can be ignored.

Speaking of locale, Windows Python should call setlocale(LC_CTYPE, "")
in pylifecycle.c in order to work around an inconsistency between
LC_TIME and LC_CTYPE in the default "C" locale. The former is ANSI
while the latter is effectively Latin-1, which leads to mojibake in
time.tzname and elsewhere. Calling setlocale(LC_CTYPE, "") is already
done on most Unix systems, so this would actually improve
cross-platform consistency.

> Finally, the encoding of stdin, stdout and stderr are currently (correctly)
> inferred from the encoding of the console window that Python is attached to.
> However, this is typically a codepage that is different from the system
> codepage (i.e. it's not mbcs) and is almost certainly not Unicode. If users
> are starting Python from a console, they can use "chcp 65001" first to
> switch to UTF-8, and then *most* functionality works (input() has some
> issues, but those can be fixed with a slight rewrite and possibly breaking
> readline hooks).

Using codepage 65001 for output is broken prior to Windows 8 because
WriteFile/WriteConsoleA returns (as an output parameter) the number of
decoded UTF-16 codepoints instead of the number of bytes written,
which makes a buffered writer repeatedly write garbage at the end of
each write in proportion to the number of non-ASCII characters. This
can be worked around by decoding to get the UTF-16 size before each
write, or by just blindly assuming that a console write always
succeeds in writing the entire buffer. In this case the console should
be detected by GetConsoleMode(). isatty() isn't right for this since
it's true for all character devices, which includes NUL among others.

Codepage 65001 is broken for non-ASCII input (via
ReadFile/ReadConsoleA) in all versions of Windows that I've tested,
including Windows 10. By attaching a debugger to conhost.exe you can
see how it fails in WideCharToMultiByte because it assumes one byte
per character. If you try to read 10 bytes, it assumes you're trying
to read 10 UTF-16 'characters' into a 10 byte buffer, which fails for
UTF-8 when even a single non-ASCII character is read. The
ReadFile/ReadConsoleA call returns that it successfully read 0 bytes,
which is interpreted as EOF. This cannot be worked around. The only
way to read the full range of Unicode from the console is via the
wide-character APIs ReadConsoleW and ReadConsoleInputW.

IMO, Python needs a C implementation of the win_unicode_console
module, using the wide-character APIs ReadConsoleW and WriteConsoleW.
Note that this sets sys.std*.encoding as UTF-8 and transcodes, so
Python code never has to work directly with UTF-16 encoded text.
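
The write side is simple enough to sketch with ctypes (illustration
only; a real implementation belongs in C, needs proper surrogate-pair
counting, and should fall back to the normal path when stdout isn't a
console):

import ctypes
from ctypes import wintypes

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
kernel32.GetStdHandle.restype = wintypes.HANDLE
kernel32.WriteConsoleW.argtypes = (
    wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
    ctypes.POINTER(wintypes.DWORD), wintypes.LPVOID)
STD_OUTPUT_HANDLE = -11

def console_write(text):
    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    written = wintypes.DWORD(0)
    # WriteConsoleW counts UTF-16 code units; len(text) is only right for BMP text.
    if not kernel32.WriteConsoleW(handle, text, len(text),
                                  ctypes.byref(written), None):
        raise ctypes.WinError(ctypes.get_last_error())
    return written.value

console_write('€ and friends, without chcp 65001\n')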

Paul Moore

Aug 10, 2016, 4:59:41 PM
to eryk sun, Python-Ideas
On 10 August 2016 at 21:16, eryk sun <ery...@gmail.com> wrote:
> IMO, Python needs a C implementation of the win_unicode_console
> module, using the wide-character APIs ReadConsoleW and WriteConsoleW.
> Note that this sets sys.std*.encoding as UTF-8 and transcodes, so
> Python code never has to work directly with UTF-16 encoded text.

+1 on this (and if this means we need to wait till 3.7, so be it). I'd
originally thought this was what Steve was proposing.

Paul

Chris Angelico

Aug 10, 2016, 5:32:04 PM
to python-ideas
On Thu, Aug 11, 2016 at 6:09 AM, Random832 <rand...@fastmail.com> wrote:
> On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
>> > Why? What's the use case? [byte paths]
>>
>> Allowing library developers who support POSIX and Windows to just use
>> bytes everywhere to represent paths.
>
> Okay, how is that use case impacted by it being mbcs instead of utf-8?

AIUI, the data flow would be: Python bytes object -> decode to Unicode
text -> encode to UTF-16 -> Windows API. If you do the first
transformation using mbcs, you're guaranteed *some* result (all
Windows codepages have definitions for all byte values, if I'm not
mistaken), but a hard-to-predict one - and worse, one that can change
based on system settings. Also, if someone naively types
"bytepath.decode()", Python will default to UTF-8, *not* to the system
codepage.

I'd rather a single consistent default encoding.

> What about only doing the deprecation warning if non-ascii bytes are
> present in the value?

-1. Data-dependent warnings just serve to strengthen the feeling that
"weird characters" keep breaking your programs, instead of writing
your program to cope with all characters equally. It's like being
racist against non-ASCII characters :)

On Thu, Aug 11, 2016 at 4:10 AM, Steve Dower <steve...@python.org> wrote:
> To summarise the proposals (remembering that these would only affect Python
> 3.6 on Windows):
>
> * change sys.getfilesystemencoding() to return 'utf-8'
> * automatically decode byte paths assuming they are utf-8
> * remove the deprecation warning on byte paths

+1 on these.

> * make the default open() encoding check for a BOM or else use utf-8

-0.5. Is there any precedent for this kind of data-based detection
being the default? An explicit "utf-sig" could do a full detection,
but even then it's not perfect - how do you distinguish UTF-32LE from
UTF-16LE that starts with U+0000? Do you say "UTF-32 is rare so we'll
assume UTF-16", or do you say "files starting U+0000 are rare, so
we'll assume UTF-32"?

> * [ALTERNATIVE] make the default open() encoding check for a BOM or else use
> locale.getpreferredencoding()

-1. Same concerns as the above, plus I'd rather use the saner default.

> * force the console encoding to UTF-8 on initialize and revert on finalize

-0 for Python itself; +1 for Python's interactive interpreter.
Programs that mess with console settings get annoying when they crash
out and don't revert properly. Unless there is *no way* that you could
externally kill the process without also bringing the terminal down,
there's the distinct possibility of messing everything up.

Would it be possible to have a "sys.setconsoleutf8()" that changes the
console encoding and slaps in an atexit() to revert? That would at
least leave it in the hands of the app.

Overall I'm +1 on shifting from eight-bit encodings to UTF-8. Don't be
held back by what Notepad does.

ChrisA

Brett Cannon

Aug 10, 2016, 6:16:45 PM
to Steve Dower, python...@python.org


On Wed, 10 Aug 2016 at 11:16 Steve Dower <steve...@python.org> wrote:
[SNIP]


Finally, the encoding of stdin, stdout and stderr are currently
(correctly) inferred from the encoding of the console window that Python
is attached to. However, this is typically a codepage that is different
from the system codepage (i.e. it's not mbcs) and is almost certainly
not Unicode. If users are starting Python from a console, they can use
"chcp 65001" first to switch to UTF-8, and then *most* functionality
works (input() has some issues, but those can be fixed with a slight
rewrite and possibly breaking readline hooks).

It is also possible for Python to change the current console encoding to
be UTF-8 on initialize and change it back on finalize. (This would leave
the console in an unexpected state if Python segfaults, but console
encoding is probably the least of anyone's worries at that point.) So
I'm proposing actively changing the current console to be Unicode while
Python is running, and hence sys.std[in|out|err] will default to utf-8.

So that's a broad range of changes, and I have little hope of figuring
out all the possible issues, back-compat risks, and flow-on effects on
my own. Please let me know (either on-list or off-list) how a change
like this would affect your projects, either positively or negatively,
and whether you have any specific experience with these changes/fixes
and think they should be approached differently.


To summarise the proposals (remembering that these would only affect
Python 3.6 on Windows):

[SNIP]

* force the console encoding to UTF-8 on initialize and revert on finalize

Don't have enough Windows experience to comment on the other parts of this proposal, but for the console encoding I am a hearty +1 as I'm tired of Unicode characters failing to show up in the REPL.

eryk sun

Aug 10, 2016, 7:05:33 PM
to python...@python.org
On Wed, Aug 10, 2016 at 8:09 PM, Random832 <rand...@fastmail.com> wrote:
> On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
>>
>> Allowing library developers who support POSIX and Windows to just use
>> bytes everywhere to represent paths.
>
> Okay, how is that use case impacted by it being mbcs instead of utf-8?

Using 'mbcs' doesn't work reliably with arbitrary bytes paths in
locales that use a DBCS codepage such as 932. If a sequence is
invalid, it gets passed to the filesystem as the default Unicode
character, so it won't successfully roundtrip. In the following
example b"\x81\xad", which isn't defined in CP932, gets mapped to the
codepage's default Unicode character, Katakana middle dot, which
encodes back as b"\x81E":

>>> locale.getpreferredencoding()
'cp932'
>>> open(b'\x81\xad', 'w').close()
>>> os.listdir('.')
['・']
>>> unicodedata.name(os.listdir('.')[0])
'KATAKANA MIDDLE DOT'
>>> '・'.encode('932')
b'\x81E'

This isn't a problem for single-byte codepages, since every byte value
uniquely maps to a Unicode code point, even if it's simply b'\x81' =>
u"\x81". Obviously there's still the general problem of dealing with
arbitrary Unicode filenames created by other programs, since the ANSI
API can only return a best-fit encoding of the filename, which is
useless for actually accessing the file.

>> It probably also entails opening the file descriptor in bytes mode,
>> which might break programs that pass the fd directly to CRT functions.
>> Personally I wish they wouldn't, but it's too late to stop them now.
>
> The only thing O_TEXT does rather than O_BINARY is convert CRLF line
> endings (and maybe end on ^Z), and I don't think we even expose the
> constants for the CRT's unicode modes.

Python 3 uses O_BINARY when opening files, unless you explicitly call
os.open. Specifically, FileIO.__init__ adds O_BINARY to the open flags
if the platform defines it.

The Windows CRT reads the BOM for the Unicode modes O_WTEXT,
O_U16TEXT, and O_U8TEXT. For O_APPEND | O_WRONLY mode, this requires
opening the file twice, the first time with read access. See
configure_text_mode() in "Windows
Kits\10\Source\10.0.10586.0\ucrt\lowio\open.cpp".

Python doesn't expose or use these Unicode text-mode constants. That's
for the best because in Unicode mode the CRT invokes the invalid
parameter handler when a buffer doesn't have an even number of bytes,
i.e. a multiple of sizeof(wchar_t). Python could copy how
configure_text_mode() handles the BOM, except it shouldn't write a BOM
for new UTF-8 files.

Random832

Aug 10, 2016, 7:31:34 PM
to python...@python.org
On Wed, Aug 10, 2016, at 19:04, eryk sun wrote:
> Using 'mbcs' doesn't work reliably with arbitrary bytes paths in
> locales that use a DBCS codepage such as 932.

Er... utf-8 doesn't work reliably with arbitrary bytes paths either,
unless you intend to use surrogateescape (which you could also do with
mbcs).

Is there any particular reason to expect all bytes paths in this
scenario to be valid UTF-8?

> Python 3 uses O_BINARY when opening files, unless you explicitly call
> os.open. Specifically, FileIO.__init__ adds O_BINARY to the open flags
> if the platform defines it.

Fair enough. I wasn't sure, particularly considering that python does
expose O_BINARY, O_TEXT, and msvcrt.setmode.

I'm not sure I approve of os.open not also adding it (or perhaps adding
it only if O_TEXT is not explicitly added), but... meh.

> Python could copy how
> configure_text_mode() handles the BOM, except it shouldn't write a BOM
> for new UTF-8 files.

I disagree. I think that *on windows* it should, just like *on windows*
it should write CR-LF for line endings.

Steve Dower

Aug 10, 2016, 7:41:25 PM
to Chris Angelico, python-ideas
On 10Aug2016 1431, Chris Angelico wrote:
> On Thu, Aug 11, 2016 at 6:09 AM, Random832 <rand...@fastmail.com> wrote:
>> On Wed, Aug 10, 2016, at 15:22, Steve Dower wrote:
>>>> Why? What's the use case? [byte paths]
>>>
>>> Allowing library developers who support POSIX and Windows to just use
>>> bytes everywhere to represent paths.
>>
>> Okay, how is that use case impacted by it being mbcs instead of utf-8?
>
> AIUI, the data flow would be: Python bytes object -> decode to Unicode
> text -> encode to UTF-16 -> Windows API. If you do the first
> transformation using mbcs, you're guaranteed *some* result (all
> Windows codepages have definitions for all byte values, if I'm not
> mistaken), but a hard-to-predict one - and worse, one that can change
> based on system settings. Also, if someone naively types
> "bytepath.decode()", Python will default to UTF-8, *not* to the system
> codepage.
>
> I'd rather a single consistent default encoding.

I'm proposing to make that single consistent default encoding utf-8. It
sounds like we're in agreement?

>> What about only doing the deprecation warning if non-ascii bytes are
>> present in the value?
>
> -1. Data-dependent warnings just serve to strengthen the feeling that
> "weird characters" keep breaking your programs, instead of writing
> your program to cope with all characters equally. It's like being
> racist against non-ASCII characters :)

Agreed. This won't happen.

> On Thu, Aug 11, 2016 at 4:10 AM, Steve Dower <steve...@python.org> wrote:
>> To summarise the proposals (remembering that these would only affect Python
>> 3.6 on Windows):
>>
>> * change sys.getfilesystemencoding() to return 'utf-8'
>> * automatically decode byte paths assuming they are utf-8
>> * remove the deprecation warning on byte paths
>
> +1 on these.
>
>> * make the default open() encoding check for a BOM or else use utf-8
>
> -0.5. Is there any precedent for this kind of data-based detection
> being the default? An explicit "utf-sig" could do a full detection,
> but even then it's not perfect - how do you distinguish UTF-32LE from
> UTF-16LE that starts with U+0000? Do you say "UTF-32 is rare so we'll
> assume UTF-16", or do you say "files starting U+0000 are rare, so
> we'll assume UTF-32"?

The BOM exists solely for data-based detection, and the UTF-8 BOM is
different from the UTF-16 and UTF-32 ones. So we either find an exact
BOM (which IIRC decodes as a no-op spacing character, though I have a
feeling some version of Unicode redefined it exclusively for being the
marker) or we use utf-8.

But the main reason for detecting the BOM is that currently opening
files with 'utf-8' does not skip the BOM if it exists. I'd be quite
happy with changing the default encoding to:

* utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
* utf-8 when writing (so the BOM is *not* written)

This provides the best compatibility when reading/writing files without
making any guesses. We could reasonably extend this to read utf-16 and
utf-32 if they have a BOM, but that's an extension and not necessary for
the main change.

>> * force the console encoding to UTF-8 on initialize and revert on finalize
>
> -0 for Python itself; +1 for Python's interactive interpreter.
> Programs that mess with console settings get annoying when they crash
> out and don't revert properly. Unless there is *no way* that you could
> externally kill the process without also bringing the terminal down,
> there's the distinct possibility of messing everything up.

The main problem here is that if the console is not forced to UTF-8 then
it won't render any of the characters correctly.

> Would it be possible to have a "sys.setconsoleutf8()" that changes the
> console encoding and slaps in an atexit() to revert? That would at
> least leave it in the hands of the app.

Yes, but if the app is going to opt-in then I'd suggest the
win_unicode_console package, which won't require any particular changes.

It sounds like we'll have to look into effectively merging that package
into the core. I'm afraid that'll come with a much longer tail of bugs
(and will quite likely break code that expects to use file descriptors
to access stdin/out), but it's the least impactful way to do it.

Cheers,
Steve

Steve Dower

Aug 10, 2016, 7:49:25 PM
to Random832, python...@python.org
On 10Aug2016 1630, Random832 wrote:
> On Wed, Aug 10, 2016, at 19:04, eryk sun wrote:
>> Using 'mbcs' doesn't work reliably with arbitrary bytes paths in
>> locales that use a DBCS codepage such as 932.
>
> Er... utf-8 doesn't work reliably with arbitrary bytes paths either,
> unless you intend to use surrogateescape (which you could also do with
> mbcs).
>
> Is there any particular reason to expect all bytes paths in this
> scenario to be valid UTF-8?

On Windows, all paths are effectively UCS-2 (they are defined as UTF-16,
but surrogate pairs don't seem to be validated, which IIUC means it's
really UCS-2), so while the majority can be encoded as valid UTF-8,
there are some paths which cannot. (These paths are going to break many
other tools though, such as PowerShell, so we won't be in bad company if
we can't handle them properly in edge cases).

surrogateescape is irrelevant because it's only for decoding from bytes.
An alternative approach would be to replace mbcs with a ucs-2 encoding
that is basically just a blob of the path that was returned from Windows
(using the Unicode APIs). None of the manipulation functions would work
on this though, since nearly every second character would be \x00, but
it's the only way (besides using str) to maintain full fidelity for
every possible path name.

Compromising on UTF-8 is going to increase consistency across platforms
and across different Windows installations without increasing the rate
of errors above what we currently see (given that invalid characters are
currently replaced with '?'). It's not a 100% solution, but it's a 99%
solution where the 1% is not handled well by anyone.

Cheers,
Steve

eryk sun

Aug 10, 2016, 7:50:51 PM
to python...@python.org
On Wed, Aug 10, 2016 at 11:30 PM, Random832 <rand...@fastmail.com> wrote:
> Er... utf-8 doesn't work reliably with arbitrary bytes paths either,
> unless you intend to use surrogateescape (which you could also do with
> mbcs).
>
> Is there any particular reason to expect all bytes paths in this
> scenario to be valid UTF-8?

The problem is more so that data is lost without an error when using
the legacy ANSI API. If the path is invalid UTF-8, Python will at
least raise an exception when decoding it. To work around this, the
developers may decide they need to just bite the bullet and use
Unicode, or maybe there could be legacy Latin-1 and ANSI modes enabled
by an environment variable or sys flag.
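
To make the difference concrete, with the bytes from my earlier CP932
example:

>>> b'\x81\xad'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte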

Chris Angelico

Aug 10, 2016, 8:41:49 PM
to python-ideas
On Thu, Aug 11, 2016 at 9:40 AM, Steve Dower <steve...@python.org> wrote:
> On 10Aug2016 1431, Chris Angelico wrote:
>> I'd rather a single consistent default encoding.
>
> I'm proposing to make that single consistent default encoding utf-8. It
> sounds like we're in agreement?

Yes, we are. I was disagreeing with Random's suggestion that mbcs
would also serve. Defaulting to UTF-8 everywhere is (a) consistent on
all systems, regardless of settings; and (b) consistent with
bytes.decode() and str.encode(), both of which default to UTF-8.

>> -0.5. Is there any precedent for this kind of data-based detection
>> being the default? An explicit "utf-sig" could do a full detection,
>> but even then it's not perfect - how do you distinguish UTF-32LE from
>> UTF-16LE that starts with U+0000? Do you say "UTF-32 is rare so we'll
>> assume UTF-16", or do you say "files starting U+0000 are rare, so
>> we'll assume UTF-32"?
>
>
> The BOM exists solely for data-based detection, and the UTF-8 BOM is
> different from the UTF-16 and UTF-32 ones. So we either find an exact BOM
> (which IIRC decodes as a no-op spacing character, though I have a feeling
> some version of Unicode redefined it exclusively for being the marker) or we
> use utf-8.
>
> But the main reason for detecting the BOM is that currently opening files
> with 'utf-8' does not skip the BOM if it exists. I'd be quite happy with
> changing the default encoding to:
>
> * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
> * utf-8 when writing (so the BOM is *not* written)
>
> This provides the best compatibility when reading/writing files without
> making any guesses. We could reasonably extend this to read utf-16 and
> utf-32 if they have a BOM, but that's an extension and not necessary for the
> main change.

AIUI the utf-8-sig encoding is happy to decode something that doesn't
have a signature, right? If so, then yes, I would definitely support
that mild mismatch in defaults. Chew up that UTF-8 aBOMination and
just use UTF-8 as is.

I've almost never seen files stored in UTF-32 (even UTF-16 isn't all
that common compared to UTF-8), so I wouldn't stress too much about
that. Recognizing FE FF or FF FE and decoding as UTF-16 might be worth
doing, but it could easily be retrofitted (that byte sequence won't
decode as UTF-8).

>>> * force the console encoding to UTF-8 on initialize and revert on
>>> finalize
>>
>>
>> -0 for Python itself; +1 for Python's interactive interpreter.
>> Programs that mess with console settings get annoying when they crash
>> out and don't revert properly. Unless there is *no way* that you could
>> externally kill the process without also bringing the terminal down,
>> there's the distinct possibility of messing everything up.
>
>
> The main problem here is that if the console is not forced to UTF-8 then it
> won't render any of the characters correctly.

Ehh, that's annoying. Is there a way to guarantee, at the process
level, that the console will be returned to "normal state" when Python
exits? If not, there's the risk that people run a Python program and
then the *next* program gets into trouble.

But if that happens only on abnormal termination ("I killed Python
from Task Manager, and it left stuff messed up so I had to close the
console"), it's probably an acceptable risk. And the benefit sounds
well worthwhile. Revising my recommendation to +0.9.

ChrisA

eryk sun

Aug 10, 2016, 9:56:54 PM
to python-ideas
On Wed, Aug 10, 2016 at 11:40 PM, Steve Dower <steve...@python.org> wrote:
> It sounds like we'll have to look into effectively merging that package into
> the core. I'm afraid that'll come with a much longer tail of bugs (and will
> quite likely break code that expects to use file descriptors to access
> stdin/out), but it's the least impactful way to do it.

Programs that use sys.std*.encoding but use the file descriptor seem
like a weird case to me. Do you have an example?

Steven D'Aprano

Aug 10, 2016, 11:15:06 PM
to python...@python.org
On Wed, Aug 10, 2016 at 04:40:31PM -0700, Steve Dower wrote:

> On 10Aug2016 1431, Chris Angelico wrote:
> >>* make the default open() encoding check for a BOM or else use utf-8
> >
> >-0.5. Is there any precedent for this kind of data-based detection
> >being the default?

There is precedent: the Python interpreter will accept a BOM instead of
an encoding cookie when importing .py files.


[Chris]
> >An explicit "utf-sig" could do a full detection,
> >but even then it's not perfect - how do you distinguish UTF-32LE from
> >UTF-16LE that starts with U+0000?

BOMs are a heuristic, nothing more. If you're reading arbitrary files
could start with anything, then of course they can guess wrong. But then
if I dumped a bunch of arbitrary Unicode codepoints in your lap and
asked you to guess the language, you would likely get it wrong too :-)

[Chris]
> >Do you say "UTF-32 is rare so we'll
> >assume UTF-16", or do you say "files starting U+0000 are rare, so
> >we'll assume UTF-32"?

The way I have done auto-detection based on BOMs is you start by reading
four bytes from the file in binary mode. (If there are fewer than four
bytes, it cannot be a text file with a BOM.) Compare those first four
bytes against the UTF-32 BOMs first, and the UTF-16 BOMs *second*
(otherwise UTF-16 will shadow UFT-32). Note that there are two BOMs
(big-endian and little-endian). Then check for UTF-8, and if you're
really keen, UTF-7 and UTF-1.

def bom2enc(bom, default=None):
    """Return encoding name from a four-byte BOM."""
    if bom.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    elif bom.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif bom.startswith(b'\xEF\xBB\xBF'):
        return 'utf_8_sig'
    elif bom.startswith(b'\x2B\x2F\x76'):
        if len(bom) == 4 and bom[3] in b'\x2B\x2F\x38\x39':
            return 'utf_7'
    elif bom.startswith(b'\xF7\x64\x4C'):
        return 'utf_1'
    elif default is None:
        raise ValueError('no recognisable BOM signature')
    else:
        return default
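
The caller side is then roughly (a sketch; the utf_8_sig, utf_16 and
utf_32 codecs all consume their own BOM when decoding, so re-opening
from the start of the file is fine):

def open_with_bom(path, default='utf_8_sig'):
    with open(path, 'rb') as f:
        enc = bom2enc(f.read(4), default)
    return open(path, 'r', encoding=enc)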



[Steve Dower]
> The BOM exists solely for data-based detection, and the UTF-8 BOM is
> different from the UTF-16 and UTF-32 ones. So we either find an exact
> BOM (which IIRC decodes as a no-op spacing character, though I have a
> feeling some version of Unicode redefined it exclusively for being the
> marker) or we use utf-8.

The Byte Order Mark is always U+FEFF encoded into whatever bytes your
encoding uses. You should never use U+FEFF except as a BOM, but of
course arbitrary Unicode strings might include it in the middle of the
string Just Because. In that case, it may be interpreted as a legacy
"ZERO WIDTH NON-BREAKING SPACE" character. But new content should never
do that: you should use U+2060 "WORD JOINER" instead, and treat a U+FEFF
inside the body of your file or string as an unsupported character.

http://www.unicode.org/faq/utf_bom.html#BOM


[Steve]
> But the main reason for detecting the BOM is that currently opening
> files with 'utf-8' does not skip the BOM if it exists. I'd be quite
> happy with changing the default encoding to:
>
> * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
> * utf-8 when writing (so the BOM is *not* written)

Sounds reasonable to me.

Rather than hard-coding that behaviour, can we have a new encoding that
does that? "utf-8-readsig" perhaps.
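
Off the top of my head, a pure-Python sketch of such a codec (the name
and approach are purely illustrative; the search function sees the name
with hyphens normalised to underscores):

import codecs

def _readsig_search(name):
    if name != 'utf_8_readsig':
        return None
    sig = codecs.lookup('utf_8_sig')      # BOM-stripping decoder
    plain = codecs.lookup('utf_8')        # BOM-free encoder
    return codecs.CodecInfo(
        name='utf-8-readsig',
        encode=plain.encode,
        decode=sig.decode,
        incrementalencoder=plain.incrementalencoder,
        incrementaldecoder=sig.incrementaldecoder,
        streamreader=sig.streamreader,
        streamwriter=plain.streamwriter,
    )

codecs.register(_readsig_search)

with open('demo.txt', 'w', encoding='utf-8-readsig') as f:
    f.write('no BOM written')             # encodes as plain UTF-8
print(open('demo.txt', encoding='utf-8-readsig').read())  # BOM, if any, is skipped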


[Steve]
> This provides the best compatibility when reading/writing files without
> making any guesses. We could reasonably extend this to read utf-16 and
> utf-32 if they have a BOM, but that's an extension and not necessary for
> the main change.

The use of a BOM is always a guess :-) Maybe I just happen to have a
Latin1 file that starts with "ï»¿", or a Mac Roman file that starts with
"Ôªø". Either case will be wrongly detected as UTF-8. That's the risk
you take when using a heuristic.

And if you don't want to use that heuristic, then you must specify the
actual encoding in use.


--
Steven D'Aprano

Nick Coghlan

Aug 10, 2016, 11:27:11 PM
to Steve Dower, python...@python.org
On 11 August 2016 at 04:10, Steve Dower <steve...@python.org> wrote:
>
> I suspect there's a lot of discussion to be had around this topic, so I want to get it started. There are some fairly drastic ideas here and I need help figuring out whether the impact outweighs the value.

My main reaction would be that if Drekin (Adam Bartoš) agrees the
changes natively solve the problems that
https://pypi.python.org/pypi/win_unicode_console works around, it's
probably a good idea.

The status quo is also sufficiently broken from both a native Windows
perspective and a cross-platform compatibility perspective that your
proposals are highly unlikely to make things *worse* :)

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Nick Coghlan

Aug 10, 2016, 11:29:44 PM
to Steve Dower, python...@python.org
On 11 August 2016 at 13:26, Nick Coghlan <ncog...@gmail.com> wrote:
> On 11 August 2016 at 04:10, Steve Dower <steve...@python.org> wrote:
>>
>> I suspect there's a lot of discussion to be had around this topic, so I want to get it started. There are some fairly drastic ideas here and I need help figuring out whether the impact outweighs the value.
>
> My main reaction would be that if Drekin (Adam Bartoš) agrees the
> changes natively solve the problems that
> https://pypi.python.org/pypi/win_unicode_console works around, it's
> probably a good idea.

Also, a reminder that Adam has a couple of proposals on the tracker
aimed at getting CPython to use a UTF-16-LE console on Windows:
http://bugs.python.org/issue22555#msg242943 (last two issue references
in that comment)

Chris Angelico

Aug 11, 2016, 12:09:49 AM
to python-ideas
On Thu, Aug 11, 2016 at 1:14 PM, Steven D'Aprano <st...@pearwood.info> wrote:
> On Wed, Aug 10, 2016 at 04:40:31PM -0700, Steve Dower wrote:
>
>> On 10Aug2016 1431, Chris Angelico wrote:
>> >>* make the default open() encoding check for a BOM or else use utf-8
>> >
>> >-0.5. Is there any precedent for this kind of data-based detection
>> >being the default?
>
> There is precedent: the Python interpreter will accept a BOM instead of
> an encoding cookie when importing .py files.

Okay, that's good enough for me.

> [Chris]
>> >An explicit "utf-sig" could do a full detection,
>> >but even then it's not perfect - how do you distinguish UTF-32LE from
>> >UTF-16LE that starts with U+0000?
>
> BOMs are a heuristic, nothing more. If you're reading arbitrary files
> could start with anything, then of course they can guess wrong. But then
> if I dumped a bunch of arbitrary Unicode codepoints in your lap and
> asked you to guess the language, you would likely get it wrong too :-)

I have my own mental heuristics, but I can't recognize one Cyrillic
language from another. And some Slavic languages can be written with
either Latin or Cyrillic letters, just to further confuse matters. Of
course, "arbitrary Unicode codepoints" might not all come from one
language, and might not be any language at all.

(Do you wanna build a U+2603?)

> [Chris]
>> >Do you say "UTF-32 is rare so we'll
>> >assume UTF-16", or do you say "files starting U+0000 are rare, so
>> >we'll assume UTF-32"?
>
> The way I have done auto-detection based on BOMs is you start by reading
> four bytes from the file in binary mode. (If there are fewer than four
> bytes, it cannot be a text file with a BOM.)

Interesting. Are you assuming that a text file cannot be empty?
Because 0xFF 0xFE could represent an empty file in UTF-16, and 0xEF
0xBB 0xBF likewise for UTF-8. Or maybe you don't care about files with
less than one character in them?

> Compare those first four
> bytes against the UTF-32 BOMs first, and the UTF-16 BOMs *second*
> (otherwise UTF-16 will shadow UFT-32). Note that there are two BOMs
> (big-endian and little-endian). Then check for UTF-8, and if you're
> really keen, UTF-7 and UTF-1.

For a default file-open encoding detection, I would minimize the
number of options. The UTF-7 BOM could be the beginning of a file
containing Base 64 data encoded in ASCII, which is a very real
possibility.

> elif bom.startswith(b'\x2B\x2F\x76'):
> if len(bom) == 4 and bom[4] in b'\x2B\x2F\x38\x39':
> return 'utf_7'

So I wouldn't include UTF-7 in the detection. Nor UTF-1. Both are
rare. Even UTF-32 doesn't necessarily have to be included. When was
the last time you saw a UTF-32LE-BOM file?

> [Steve]
>> But the main reason for detecting the BOM is that currently opening
>> files with 'utf-8' does not skip the BOM if it exists. I'd be quite
>> happy with changing the default encoding to:
>>
>> * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
>> * utf-8 when writing (so the BOM is *not* written)
>
> Sounds reasonable to me.
>
> Rather than hard-coding that behaviour, can we have a new encoding that
> does that? "utf-8-readsig" perhaps.

+1. Makes the documentation easier by having the default value for
encoding not depend on the value for mode.

ChrisA

Random832

Aug 11, 2016, 12:58:30 AM
to python...@python.org
On Wed, Aug 10, 2016, at 17:31, Chris Angelico wrote:
> AIUI, the data flow would be: Python bytes object

Nothing _starts_ as a Python bytes object. It has to be read from
somewhere or encoded in the source code as a literal. The scenario is
very different for "defined internally within the program" (how are
these not gonna be ASCII) vs "user input" (user input how? from the
console? from tkinter? how'd that get converted to bytes?) vs "from a
network or something like a tar file where it represents a path on some
other system" (in which case it's in whatever encoding that system used,
or *maybe* an encoding defined as part of the network protocol or file
format).

The use case has not been described adequately enough to answer my
question.

Paul Moore

Aug 11, 2016, 4:47:19 AM
to Random832, Python-Ideas
On 11 August 2016 at 00:30, Random832 <rand...@fastmail.com> wrote:
>> Python could copy how
>> configure_text_mode() handles the BOM, except it shouldn't write a BOM
>> for new UTF-8 files.
>
> I disagree. I think that *on windows* it should, just like *on windows*
> it should write CR-LF for line endings.

Tools like git and hg, and cross platform text editors, handle
transparently managing the differences between line endings for you.
But nothing much handles BOM stripping/adding automatically. So while
in theory the two cases are similar, in practice lack of tool support
means that if we start adding BOMs on Windows (and requiring them so
that we can detect UTF8) then we'll be setting up new interoperability
problems for Python users, for little benefit.

Paul

Paul Moore

Aug 11, 2016, 5:08:32 AM
to Chris Angelico, python-ideas
On 11 August 2016 at 01:41, Chris Angelico <ros...@gmail.com> wrote:
> I've almost never seen files stored in UTF-32 (even UTF-16 isn't all
> that common compared to UTF-8), so I wouldn't stress too much about
> that. Recognizing FE FF or FF FE and decoding as UTF-16 might be worth
> doing, but it could easily be retrofitted (that byte sequence won't
> decode as UTF-8).

I see UTF-16 relatively often as a result of redirecting stdout in
Powershell and forgetting that it defaults (stupidly, IMO) to UTF-16.

>> The main problem here is that if the console is not forced to UTF-8 then it
>> won't render any of the characters correctly.
>
> Ehh, that's annoying. Is there a way to guarantee, at the process
> level, that the console will be returned to "normal state" when Python
> exits? If not, there's the risk that people run a Python program and
> then the *next* program gets into trouble.

There's also the risk that Python programs using subprocess.Popen
start the subprocess with the console in a non-standard state. Should
we be temporarily restoring the console codepage in that case? How
does the following work?

<start>
set codepage to UTF-8
...
set codepage back
spawn subprocess X, but don't wait for it
set codepage to UTF-8
...
... At this point what codepage does Python see? What codepage does
process X see? (Note that they are both sharing the same console).
...
<end>
restore codepage

Paul

Steven D'Aprano

Aug 11, 2016, 10:29:44 AM
to python...@python.org
On Thu, Aug 11, 2016 at 02:09:00PM +1000, Chris Angelico wrote:
> On Thu, Aug 11, 2016 at 1:14 PM, Steven D'Aprano <st...@pearwood.info> wrote:

> > The way I have done auto-detection based on BOMs is you start by reading
> > four bytes from the file in binary mode. (If there are fewer than four
> > bytes, it cannot be a text file with a BOM.)
>
> Interesting. Are you assuming that a text file cannot be empty?

Hmmm... not consciously, but I guess I was.

If the file is empty, how do you know it's text?

> Because 0xFF 0xFE could represent an empty file in UTF-16, and 0xEF
> 0xBB 0xBF likewise for UTF-8. Or maybe you don't care about files with
> less than one character in them?

I'll have to think about it some more :-)


> For a default file-open encoding detection, I would minimize the
> number of options. The UTF-7 BOM could be the beginning of a file
> containing Base 64 data encoded in ASCII, which is a very real
> possibility.

I'm coming from the assumption that you're reading unformated text in an
unknown encoding, rather than some structured format.

But we're getting off topic here. In context of Steve's suggestion, we
should only autodetect UTF-8. In other words, if there's a UTF-8 BOM,
skip it, otherwise treat the file as UTF-8.


> When was the last time you saw a UTF-32LE-BOM file?

Two minutes ago, when I looked at my test suite :-P


--
Steve

Random832

unread,
Aug 11, 2016, 10:53:40 AM8/11/16
to python...@python.org
On Thu, Aug 11, 2016, at 10:25, Steven D'Aprano wrote:
> > Interesting. Are you assuming that a text file cannot be empty?
>
> Hmmm... not consciously, but I guess I was.
>
> If the file is empty, how do you know it's text?

Heh. That's the *other* thing that Notepad does wrong in the opinion of
people coming from the Unix world - a Windows text file does not need to
end with a [CR]LF, and normally will not.

> But we're getting off topic here. In context of Steve's suggestion, we
> should only autodetect UTF-8. In other words, if there's a UTF-8 BOM,
> skip it, otherwise treat the file as UTF-8.

I think there's still room for UTF-16. It's two of the four encodings
supported by Notepad, after all.
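
For illustration, a minimal sketch of detecting just those BOMs before
falling back to UTF-8 (the helper name and the exact fallback are
assumptions, not a concrete proposal for the codec machinery):

import codecs

def sniff_encoding(path, default='utf-8'):
    # Check only the BOMs Notepad writes: UTF-8, UTF-16 LE, UTF-16 BE.
    # Anything else (including an empty file) falls back to the default.
    with open(path, 'rb') as f:
        head = f.read(3)
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'   # decode as UTF-8, skipping the BOM
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'      # the utf-16 codec reads the BOM to pick endianness
    return default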

Steve Dower

unread,
Aug 11, 2016, 11:33:21 AM8/11/16
to Random832, python...@python.org
Unless someone else does the implementation, I'd rather add a utf8-readsig encoding that initially only skips a utf8 BOM - notably, you always get the same encoding, it just sometimes skips the first three bytes.
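
For what it's worth, a rough approximation of that behaviour in terms of existing codecs (a sketch only; 'utf8-readsig' itself doesn't exist, and the helper below is hypothetical):

def open_readsig(path, mode='r', **kwargs):
    # Skip a UTF-8 BOM if present when reading, but never write one:
    # 'utf-8-sig' already tolerates a missing BOM on read, while plain
    # 'utf-8' is used for writing so no BOM is ever emitted.
    encoding = 'utf-8-sig' if 'r' in mode else 'utf-8'
    return open(path, mode, encoding=encoding, **kwargs)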

I think we can change this later to detect and switch to utf16 without it being disastrous, though we've made it this far without it and frankly there are good reasons to "encourage" utf8 over utf16.

My big concern is the console... I think that change is inevitably going to have to break someone, but I need to map out the possibilities first to figure out just how bad it'll be.

Top-posted from my Windows Phone

From: Random832
Sent: ‎8/‎11/‎2016 7:54
To: python...@python.org
Subject: Re: [Python-ideas] Fix default encodings on Windows

Chris Angelico

unread,
Aug 11, 2016, 12:25:10 PM8/11/16
to python-ideas
On Fri, Aug 12, 2016 at 1:31 AM, Steve Dower <steve...@python.org> wrote:
> My big concern is the console... I think that change is inevitably going to
> have to break someone, but I need to map out the possibilities first to
> figure out just how bad it'll be.

Obligatory XKCD: https://xkcd.com/1172/

Subprocess invocation has been mentioned. What about logging? Will
there be issues with something that attempts to log to both console
and file?

ChrisA

Adam Bartoš

unread,
Aug 11, 2016, 2:34:55 PM8/11/16
to python...@python.org
On 11 August 2016 at 04:10, Steve Dower <steve.dower at python.org> wrote:
>
> I suspect there's a lot of discussion to be had around this topic, so I want to get it started. There are some fairly drastic ideas here and I need help figuring out whether the impact outweighs the value.

My main reaction would be that if Drekin (Adam Bartoš) agrees the
changes natively solve the problems that
https://pypi.python.org/pypi/win_unicode_console works around, it's
probably a good idea.

The status quo is also sufficiently broken from both a native Windows
perspective and a cross-platform compatibility perspective that your
proposals are highly unlikely to make things *worse* :)

Cheers,
Nick.

The main idea of win_unicode_console is simple: to use the WinAPI functions ReadConsoleW and WriteConsoleW to communicate with the interactive console on Windows and to wrap this in the standard Python IO hierarchy – that's why sys.std*.encoding would be 'utf-16-le': it corresponds to the widechar strings used by the Windows wide APIs. But this is only about sys.std*.encoding, which I think is not so important. AFAIK sys.std*.encoding should be used only when you want to communicate in bytes (which I think is not a good idea), so it tells you which encoding sys.std*.buffer is assuming. In fact sys.std* may not even have the buffer attribute, so its encoding attribute would be useless in that case.

Unfortunately, sys.std*.encoding is used in some other places – namely, the consumers of the old PyOS_Readline API (the tokenizer and input()) use it to decode the bytes returned. Actually, the consumers assume different encodings (sys.stdin.encoding vs. sys.stdout.encoding), so it is impossible to write a correct readline hook when the encodings are not the same. So I think it would be nice to have a Python- and string-based implementation of readline hooks – a sys.readlinehook attribute, which would use sys.std* by default on Windows and GNU readline on Unix.

Nevertheless, I think it is a good idea to have more 'utf-8' defaults (or 'utf-8-readsig' for open()). I don't know whether it helps with the console issue to open the standard streams in 'utf-8'.

Adam Bartoš

Adam Bartoš

unread,
Aug 11, 2016, 2:42:00 PM8/11/16
to python...@python.org
Eryk Sun wrote:
IMO, Python needs a C implementation of the win_unicode_console
module, using the wide-character APIs ReadConsoleW and WriteConsoleW.
Note that this sets sys.std*.encoding as UTF-8 and transcodes, so
Python code never has to work directly with UTF-16 encoded text.

The transcoding wrappers with 'utf-8' encoding are used just as a workaround for the fact that the Python tokenizer cannot use utf-16-le and that the readlinehook machinery is unfortunately bytes-based. The transcoding wrapper just has encoding 'utf-8' and no buffer attribute, so there is no actual transcoding in sys.std* objects. It's just a signal for PyOS_Readline consumers, and the transcoding occurs in a custom readline hook. Nothing like this would be needed if PyOS_Readline were replaced by some Python API wrapper around sys.readlinehook that would be Unicode string based.

Adam Bartoš
 

eryk sun

unread,
Aug 12, 2016, 8:32:54 AM8/12/16
to python...@python.org
On Thu, Aug 11, 2016 at 6:41 PM, Adam Bartoš <dre...@gmail.com> wrote:
> The transcoding wrappers with 'utf-8' encoding are used just as a work
> around the fact that Python tokenizer cannot use utf-16-le and that the
> readlinehook machinery is unfortunately bytes-based. The transcoding wrapper
> just has encoding 'utf-8' and no buffer attribute, so there is no actual
> transcoding in sys.std* objects. It's just a signal for PyOS_Readline
> consumers, and the transcoding occurs in a custom readline hook. Nothing
> like this would be needed if PyOS_Readline was replaced by some Python API
> wrapper around sys.readlinehook that would be Unicode string based.

If win_unicode_console gets added to the standard library, I think it
should provide at least a std*.buffer interface that transcodes
between UTF-16 and UTF-8 (with errors='replace'), to make this as much
of a drop-in replacement as possible. I know it's not required. For
example, IDLE doesn't implement this. But I'm also sure there's code
out there that uses stdout.buffer, including in the standard library.
It's mostly test code (not including cases for piping output from a
child process) and simple script interfaces, but if we don't have to
break people's code, we really shouldn't.

eryk sun

unread,
Aug 12, 2016, 8:40:18 AM8/12/16
to python-ideas
On Thu, Aug 11, 2016 at 9:07 AM, Paul Moore <p.f....@gmail.com> wrote:
> set codepage to UTF-8
> ...
> set codepage back
> spawn subprocess X, but don't wait for it
> set codepage to UTF-8
> ...
> ... At this point what codepage does Python see? What codepage does
> process X see? (Note that they are both sharing the same console).

The input and output codepages are global data in conhost.exe. They
aren't tracked for each attached process (unlike input history and
aliases). That's how chcp.com works in the first place. Otherwise its
calls to SetConsoleCP and SetConsoleOutputCP would be pointless.
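
For example, this ctypes sketch (Windows only, and purely illustrative) shows that shared state directly:

import ctypes

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

# The output codepage lives in the console host, not in the process, so
# changing it here is visible to every other process attached to this console.
old_cp = kernel32.GetConsoleOutputCP()
kernel32.SetConsoleOutputCP(65001)      # the same thing "chcp 65001" does
print(kernel32.GetConsoleOutputCP())    # 65001, also seen by child processes
kernel32.SetConsoleOutputCP(old_cp)     # restore the previous codepage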

But IMHO all talk of using codepage 65001 is a waste of time. I think
the trailing garbage output with this codepage in Windows 7 is
unacceptable. And getting EOF for non-ASCII input is a show stopper.
The problem occurs in conhost. All you get is the EOF result from
ReadFile/ReadConsoleA, so it can't be worked around. This kills the
REPL and raises EOFError for input(). ISTM the only people who think
codepage 65001 actually works are those using Windows 8+ who
occasionally need to print non-OEM text and never enter (or paste)
anything but ASCII text.

Steve Dower

unread,
Aug 12, 2016, 9:35:36 AM8/12/16
to eryk sun, python-ideas
I was thinking we would end up using the console API for input but stick with the standard handles for output, mostly to minimize the amount of magic switching we have to do. But since we can just switch the entire stream object in __std*__ once at startup if nothing is redirected it probably isn't that much of a simplification.
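
(For reference, one way to detect "nothing is redirected" is GetConsoleMode, which only succeeds on real console handles; this is a sketch only, not necessarily how CPython would do it:)

import ctypes, msvcrt, sys

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

def is_attached_to_console(stream):
    # GetConsoleMode fails for pipes and files, i.e. whenever the
    # stream has been redirected away from the console.
    try:
        handle = msvcrt.get_osfhandle(stream.fileno())
    except (OSError, ValueError, AttributeError):
        return False
    mode = ctypes.c_uint(0)
    return bool(kernel32.GetConsoleMode(handle, ctypes.byref(mode)))

# e.g. only swap in a console-API stream when is_attached_to_console(sys.stdout)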

I have some airport/aeroplane time today where I can experiment.


Top-posted from my Windows Phone

From: eryk sun
Sent: ‎8/‎12/‎2016 5:40
To: python-ideas

Subject: Re: [Python-ideas] Fix default encodings on Windows

Paul Moore

unread,
Aug 12, 2016, 9:42:41 AM8/12/16
to eryk sun, python-ideas
On 12 August 2016 at 13:38, eryk sun <ery...@gmail.com> wrote:
>> ... At this point what codepage does Python see? What codepage does
>> process X see? (Note that they are both sharing the same console).
>
> The input and output codepages are global data in conhost.exe. They
> aren't tracked for each attached process (unlike input history and
> aliases). That's how chcp.com works in the first place. Otherwise its
> calls to SetConsoleCP and SetConsoleOutputCP would be pointless.

That's what I expected, but hadn't had time to confirm (your point
about chcp didn't occur to me). Thanks.

> But IMHO all talk of using codepage 65001 is a waste of time. I think
> the trailing garbage output with this codepage in Windows 7 is
> unacceptable. And getting EOF for non-ASCII input is a show stopper.
> The problem occurs in conhost. All you get is the EOF result from
> ReadFile/ReadConsoleA, so it can't be worked around. This kills the
> REPL and raises EOFError for input(). ISTM the only people who think
> codepage 65001 actually works are those using Windows 8+ who
> occasionally need to print non-OEM text and never enter (or paste)
> anything but ASCII text.

Agreed, mucking with global state that subprocesses need was
sufficient for me, but the other issues you mention seem conclusive. I
understand Steve's point about being an improvement over 100% wrong,
but we've lived with the current state of affairs long enough that I
think we should take whatever time is needed to do it right, rather
than briefly postponing the inevitable with a partial solution.

Paul

PS I've spent the last week on a different project trying to "save
time" with partial solutions to precisely this issue, so apologies if
I'm in a particularly unforgiving mood about it right now :-(

Random832

unread,
Aug 12, 2016, 10:21:36 AM8/12/16
to python...@python.org
On Wed, Aug 10, 2016, at 14:10, Steve Dower wrote:
> * force the console encoding to UTF-8 on initialize and revert on
> finalize
>
> So what are your concerns? Suggestions?

As far as I know, the single biggest problem caused by the status quo
for console encoding is "some string containing characters not in the
console codepage is printed out; unhandled UnicodeEncodeError". Is there
any particular reason not to use errors='replace'?

Is there any particular reason for the REPL, when printing the repr of a
returned object, not to replace characters not in the stdout encoding
with backslash sequences?

Does Python provide any mechanism to access the built-in "best fit"
mappings for windows codepages (which mostly consist of removing accents
from latin letters)?

Random832

unread,
Aug 12, 2016, 11:34:22 AM8/12/16
to python...@python.org
On Wed, Aug 10, 2016, at 15:08, Steve Dower wrote:
> That's the hope, though that module approaches the solution differently
> and may still uses. An alternative way for us to fix this whole thing
> would be to bring win_unicode_console into the standard library and use
> it by default (or probably whenever PYTHONIOENCODING is not specified).

I have concerns about win_unicode_console:
- For the "text_transcoded" streams, stdout.encoding is utf-8. For the
"text" streams, it is utf-16.
- There is no object, as far as I can find, which can be used as an
unbuffered unicode I/O object.
- raw output streams silently drop the last byte if an odd number of
bytes are written.
- The sys.stdout obtained via streams.enable does not support .buffer /
.buffer.raw / .detach
- All of these objects provide a fileno() interface.
- When using os.read/write for data that represents text, the data still
should be encoded in the console encoding and not in utf-8 or utf-16.

How important is it to preserve the validity of the conventional advice
for "putting stdin/stdout in binary mode" using .buffer or .detach? I
suspect this is mainly used for programs intended to have their output
redirected, but today it 'kind of works' to run such a program on the
console and inspect its output. How important is it for
os.read/write(stdxxx.fileno()) to be consistent with stdxxx.encoding?

Should errors='surrogatepass' be used? It's unlikely, but not
impossible, to paste an invalid surrogate into the console. With
win_unicode_console, this results in a UnicodeDecodeError and, if this
happened during a readline, disables the readline hook.

Is it possible to break this by typing a valid surrogate pair that falls
across a buffer boundary?

Adam Bartoš

unread,
Aug 12, 2016, 12:27:38 PM8/12/16
to python...@python.org
On Fri Aug 12 11:33:35 EDT 2016, Random832 wrote:

> On Wed, Aug 10, 2016, at 15:08, Steve Dower wrote:
>> That's the hope, though that module approaches the solution differently
>> and may still uses. An alternative way for us to fix this whole thing
>> would be to bring win_unicode_console into the standard library and use
>> it by default (or probably whenever PYTHONIOENCODING is not specified).
>
> I have concerns about win_unicode_console:
> - For the "text_transcoded" streams, stdout.encoding is utf-8. For the
>   "text" streams, it is utf-16.

UTF-16 is the "native" encoding since it corresponds to the wide chars used by Read/WriteConsoleW. The UTF-8 is used just as a signal for the consumers of PyOS_Readline.

> - There is no object, as far as I can find, which can be used as an
> unbuffered unicode I/O object.

There is no buffer just on those wrapping streams because the bytes I have are not in UTF-8. Adding one would mean a fake buffer that just decodes and writes to the text stream. AFAIK there is no guarantee that sys.std* objects have a buffer attribute, and any code relying on that is incorrect. But I understand that there may be such code and we may want to be compatible.


> - raw output streams silently drop the last byte if an odd number of
>   bytes are written.

That's not true, it doesn't write an odd number of bytes, but returns the correct number of bytes written. If only one byte is given, it raises a ValueError.


> - The sys.stdout obtained via streams.enable does not support .buffer /
>   .buffer.raw / .detach
> - All of these objects provide a fileno() interface.

Is this wrong? If I remember, I provide it because of some check -- maybe in input() -- to be viewed as a stdio stream.


> - When using os.read/write for data that represents text, the data still
>   should be encoded in the console encoding and not in utf-8 or utf-16.

I don't know what to do with this. Generally I wouldn't use bytes to communicate textual data.


Regards,
Adam Bartoš

Chris Barker

unread,
Aug 12, 2016, 1:06:56 PM8/12/16
to Paul Moore, python-ideas
On Fri, Aug 12, 2016 at 6:41 AM, Paul Moore <p.f....@gmail.com> wrote:
> I understand Steve's point about being an improvement over 100% wrong,
> but we've lived with the current state of affairs long enough that I
> think we should take whatever time is needed to do it right,

Sure -- but this is such a freakin' mess that there may well not BE a "right" solution.

In which case, something IS better than nothing.

-CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris....@noaa.gov

Paul Moore

unread,
Aug 12, 2016, 1:19:52 PM8/12/16
to Chris Barker, python-ideas
On 12 August 2016 at 18:05, Chris Barker <chris....@noaa.gov> wrote:
> On Fri, Aug 12, 2016 at 6:41 AM, Paul Moore <p.f....@gmail.com> wrote:
>>
>> I
>> understand Steve's point about being an improvement over 100% wrong,
>> but we've lived with the current state of affairs long enough that I
>> think we should take whatever time is needed to do it right,
>
>
> Sure -- but this is such a freakin' mess that there may well not BE a "right"
> solution.
>
> In which case, something IS better than nothing.

Using Unicode APIs for console IO *is* better. Powershell does it, and
it works there. All I'm saying is that we should focus on that as our
"improved solution", rather than looking at CP_UTF8 as a "quick and
dirty" solution, as there's no evidence that people need "quick and
dirty" (they have win_unicode_console if the current state of affairs
isn't sufficient for them).

I'm not arguing that we do nothing. Are you saying we should use
CP_UTF8 *in preference* to wide character APIs? Or that we should
implement CP_UTF8 first and then wide chars later? Or are we in
violent agreement that we should implement wide chars?

Paul

Random832

unread,
Aug 12, 2016, 2:35:59 PM8/12/16
to python...@python.org
On Fri, Aug 12, 2016, at 12:24, Adam Bartoš wrote:
> There is no buffer just on those wrapping streams because the bytes I
> have are not in UTF-8. Adding one would mean a fake buffer that just
> decodes and writes to the text stream. AFAIK there is no guarantee
> that sys.std* objects have buffer attribute and any code relying on
> that is incorrect. But I inderstand that there may be such code and we
> may want to be compatible.

Yes that's what I meant, I just think it needs to be considered if we're
thinking about making it (or something like it) the default python
sys.std*. Maybe the decision will be that maintaining compatibility with
these cases isn't important.

> > - The sys.stdout obtained via streams.enable does not support
> > .buffer / .buffer.raw / .detach
> > - All of these objects provide a fileno() interface.
>
> Is this wrong? If I remember, I provide it because of some check --
> maybe in input() -- to be viewed as a stdio stream.

I don't know if it's *wrong* per se (same with the no buffer/raw thing
etc), I'm just concerned about the possible effects on code that is
written against the current implementation.

tritiu...@sdamon.com

unread,
Aug 12, 2016, 4:11:39 PM8/12/16
to Paul Moore, eryk sun, python-ideas


> -----Original Message-----
> From: Python-ideas [mailto:python-ideas-bounces+tritium-
> list=sdamo...@python.org] On Behalf Of Paul Moore
> Sent: Friday, August 12, 2016 9:42 AM
> To: eryk sun <ery...@gmail.com>
> Cc: python-ideas <python...@python.org>
> Subject: Re: [Python-ideas] Fix default encodings on Windows
>
> On 12 August 2016 at 13:38, eryk sun <ery...@gmail.com> wrote:
> >> ... At this point what codepage does Python see? What codepage does
> >> process X see? (Note that they are both sharing the same console).
> >
> > The input and output codepages are global data in conhost.exe. They
> > aren't tracked for each attached process (unlike input history and
> > aliases). That's how chcp.com works in the first place. Otherwise its
> > calls to SetConsoleCP and SetConsoleOutputCP would be pointless.
>
> That's what I expected, but hadn't had time to confirm (your point
> about chcp didn't occur to me). Thanks.
>
> > But IMHO all talk of using codepage 65001 is a waste of time. I think
> > the trailing garbage output with this codepage in Windows 7 is
> > unacceptable. And getting EOF for non-ASCII input is a show stopper.
> > The problem occurs in conhost. All you get is the EOF result from
> > ReadFile/ReadConsoleA, so it can't be worked around. This kills the
> > REPL and raises EOFError for input(). ISTM the only people who think
> > codepage 65001 actually works are those using Windows 8+ who
> > occasionally need to print non-OEM text and never enter (or paste)
> > anything but ASCII text.
>
> Agreed, mucking with global state that subprocesses need was
> sufficient for me, but the other issues you mention seem conclusive. I
> understand Steve's point about being an improvement over 100% wrong,
> but we've lived with the current state of affairs long enough that I
> think we should take whatever time is needed to do it right, rather
> than briefly postponing the inevitable with a partial solution.

For the love of all that is holy and good, ignore that sentiment. We need
ANY AND ALL improvements to this miserable console experience.

Chris Barker

unread,
Aug 12, 2016, 5:15:45 PM8/12/16
to python-ideas
On Fri, Aug 12, 2016 at 10:19 AM, Paul Moore <p.f....@gmail.com> wrote:
>> In which case, something IS better than nothing.
>
> I'm not arguing that we do nothing. Are you saying we should use
> CP_UTF8 *in preference* to wide character APIs? Or that we should
> implement CP_UTF8 first and then wide chars later?

Honestly, I don't understand the details enough to argue either way.

> Or are we in
> violent agreement that we should implement wide chars?

probably -- to the extent I understand the issues :-)

But I am arguing that anything that makes it "better" and actually gets implemented is better than a "right" solution that no one has the time to make happen, or that we can't agree on anyway.

-CHB


Victor Stinner

unread,
Aug 12, 2016, 7:04:44 PM8/12/16
to Steve Dower, python-ideas

Hello,

I'm on holiday and writing on a phone, so sorry in advance for the short answer.

In short: we should drop support for the bytes API. Just use Unicode on all platforms, especially for filenames.

Sorry, but most of these changes look like very bad ideas. Or maybe I misunderstood something. The Windows bytes APIs are broken in different ways; in short, your proposal is to put another layer on top of them to try to work around the issues.

Unicode is complex. Unicode issues are hard to debug. Adding a new layer makes debugging even harder. Is the bug in the input data? In the layer? In the final Windows function?

In my experience on UNIX, the most important part is the interoperability with other applications. I understand that Python 2 will speak the ANSI code page but Python 3 will speak UTF-8. I don't understand how it can work. Almost all Windows applications speak the ANSI code page (I'm talking about stdin, stdout, pipes, ...).

Do you propose to first try to decode from UTF-8 and fall back to decoding from the ANSI code page? What about encoding? Always encode to UTF-8?

About BOMs: I hate them. Many applications don't understand them. Again, think about Python 2. I vaguely recall that the Unicode standard suggests not using a BOM (I have to check).

I recall a bug in gettext. The tool doesn't understand BOM. When I opened the file in vim, the BOM was invisible (hidden). I had to use hexdump to understand the issue!

BOMs introduce issues that are very difficult to debug :-/ I also think that it goes in the wrong direction in terms of interoperability.

For the Windows console: I played with all the Windows functions, tried all fonts and many code pages. I also read technical blog articles by Microsoft employees. I gave up on this issue. It doesn't seem possible to fully support Unicode in the Windows console (at least the last time I checked). By the way, it seems like the Windows functions have bugs, and code page 65001 fixes a few issues but introduces new ones...

Victor


On 10 August 2016 at 20:16, "Steve Dower" <steve...@python.org> wrote:
I suspect there's a lot of discussion to be had around this topic, so I want to get it started. There are some fairly drastic ideas here and I need help figuring out whether the impact outweighs the value.

Some background: within the Windows API, the preferred encoding is UTF-16. This is a 16-bit format that is typed as wchar_t in the APIs that use it. These APIs are generally referred to as the *W APIs (because they have a W suffix).

There are also (broadly deprecated) APIs that use an 8-bit format (char), where the encoding is assumed to be "the user's active code page". These are *A APIs. AFAIK, there are no cases where a *A API should be preferred over a *W API, and many newer APIs are *W only.

In general, Python passes byte strings into the *A APIs and text strings into the *W APIs.

Right now, sys.getfilesystemencoding() on Windows returns "mbcs", which translates to "the system's active code page". As this encoding generally cannot represent all paths on Windows, it is deprecated and Unicode strings are recommended instead. This, however, means you need to write significantly different code between POSIX (use bytes) and Windows (use text).

ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and updating path_converter() (Python/posixmodule.c; likely similar code in other places) to decode incoming byte strings would allow us to undeprecate byte strings and add the requirement that they *must* be encoded with sys.getfilesystemencoding(). I assume that this would allow cross-platform code to handle paths similarly by encoding to whatever the sys module says they should and using bytes consistently (starting this thread is meant to validate/refute my assumption).

(Yes, I know that people on POSIX should just change to using Unicode and surrogateescape. Unfortunately, rather than doing that they complain about Windows and drop support for the platform. If you want to keep hitting them with the stick, go ahead, but I'm inclined to think the carrot is more valuable here.)

Similarly, locale.getpreferredencoding() on Windows returns a legacy value - the user's active code page - which should generally not be used for any reason. The one exception is as a default encoding for opening files when no other information is available (e.g. a Unicode BOM or explicit encoding argument). BOMs are very common on Windows, since the default assumption is nearly always a bad idea.

Making open()'s default encoding detect a BOM before falling back to locale.getpreferredencoding() would resolve many issues, but I'm also inclined towards making the fallback utf-8, leaving locale.getpreferredencoding() solely as a way to get the active system codepage (with suitable warnings about it only being useful for back-compat). This would match the behavior that the .NET Framework has used for many years - effectively, utf_8_sig on read and utf_8 on write.

Finally, the encoding of stdin, stdout and stderr are currently (correctly) inferred from the encoding of the console window that Python is attached to. However, this is typically a codepage that is different from the system codepage (i.e. it's not mbcs) and is almost certainly not Unicode. If users are starting Python from a console, they can use "chcp 65001" first to switch to UTF-8, and then *most* functionality works (input() has some issues, but those can be fixed with a slight rewrite and possibly breaking readline hooks).

It is also possible for Python to change the current console encoding to be UTF-8 on initialize and change it back on finalize. (This would leave the console in an unexpected state if Python segfaults, but console encoding is probably the least of anyone's worries at that point.) So I'm proposing actively changing the current console to be Unicode while Python is running, and hence sys.std[in|out|err] will default to utf-8.

So that's a broad range of changes, and I have little hope of figuring out all the possible issues, back-compat risks, and flow-on effects on my own. Please let me know (either on-list or off-list) how a change like this would affect your projects, either positively or negatively, and whether you have any specific experience with these changes/fixes and think they should be approached differently.


To summarise the proposals (remembering that these would only affect Python 3.6 on Windows):

* change sys.getfilesystemencoding() to return 'utf-8'
* automatically decode byte paths assuming they are utf-8
* remove the deprecation warning on byte paths

* make the default open() encoding check for a BOM or else use utf-8
* [ALTERNATIVE] make the default open() encoding check for a BOM or else use sys.getpreferredencoding()

* force the console encoding to UTF-8 on initialize and revert on finalize

So what are your concerns? Suggestions?

Thanks,
Steve

Victor Stinner

unread,
Aug 12, 2016, 7:14:14 PM8/12/16
to Steve Dower, python-ideas

On 10 August 2016 at 20:16, "Steve Dower" <steve...@python.org> wrote:
> So what are your concerns? Suggestions?

Add a new option specific to Windows to switch to UTF-8 everywhere, use BOM, whatever you want, *but* don't change the defaults.

IMO mbcs encoding is the least worst encoding for the default.

I have an idea of a similar option for UNIX: ignore user preference (LC_ALL, LC_CTYPE, LANG environment variables) and force UTF-8. It's a common request on UNIX where UTF-8 is now the encoding of almost all systems, whereas the C library continues to use ASCII when the POSIX locale is used (which occurs in many cases).

Perl already has such a utf8 option.

Victor

eryk sun

unread,
Aug 12, 2016, 10:45:36 PM8/12/16
to python...@python.org
On Fri, Aug 12, 2016 at 2:20 PM, Random832 <rand...@fastmail.com> wrote:
> On Wed, Aug 10, 2016, at 14:10, Steve Dower wrote:
>> * force the console encoding to UTF-8 on initialize and revert on
>> finalize
>>
>> So what are your concerns? Suggestions?
>
> As far as I know, the single biggest problem caused by the status quo
> for console encoding is "some string containing characters not in the
> console codepage is printed out; unhandled UnicodeEncodeError". Is there
> any particular reason not to use errors='replace'?

If that's all you want then you can set PYTHONIOENCODING=:replace.
Prepare to be inundated with question marks.
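
(The same effect without the environment variable, as a sketch, is to rewrap the stream yourself:)

import io, sys

# Rewrap stdout so unencodable characters come out as '?' instead of raising
# UnicodeEncodeError -- roughly what PYTHONIOENCODING=:replace gives you.
# The original wrapper stays alive via sys.__stdout__, so its buffer isn't closed.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,
                              encoding=sys.stdout.encoding,
                              errors='replace',
                              line_buffering=True)
print('日本語')   # prints '???' on a legacy-codepage console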

Python's 'cp*' encodings are cross-platform, so they don't call
Windows NLS APIs. If you want a best-fit encoding, then 'mbcs' is the
only choice. Use chcp.com to switch to your system's ANSI codepage and
set PYTHONIOENCODING=mbcs:replace.

An 'oem' encoding could be added, but I'm no fan of these best-fit
encodings. Writing question marks at least hints that the output is
wrong.

> Is there any particular reason for the REPL, when printing the repr of a
> returned object, not to replace characters not in the stdout encoding
> with backslash sequences?

sys.displayhook already does this. It falls back on
sys_displayhook_unencodable if printing the repr raises a
UnicodeEncodeError.
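
(Roughly, that fallback behaves like this pure-Python sketch:)

import sys

def display_fallback(obj):
    # Approximation of what the REPL does when repr(obj) can't be encoded
    # for stdout: escape the unencodable characters instead of raising.
    text = repr(obj)
    enc = sys.stdout.encoding or 'ascii'
    sys.stdout.write(text.encode(enc, 'backslashreplace').decode(enc))
    sys.stdout.write('\n')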

> Does Python provide any mechanism to access the built-in "best fit"
> mappings for windows codepages (which mostly consist of removing accents
> from latin letters)?

As mentioned above, for output this is only available with 'mbcs'. For
reading input via ReadFile or ReadConsoleA (and thus also C _read,
fread, and fgets), the console already encodes its UTF-16 input buffer
using a best-fit encoding to the input codepage. So there's no error
in the following example, even though the result is wrong:

>>> sys.stdin.encoding
'cp437'
>>> s = 'Ā'
>>> s, ord(s)
('A', 65)

Jumping back to the codepage 65001 discussion, here's a function to
simulate the bad output that Windows Vista and 7 users see:

def write(text):
    # Simulate conhost on Windows Vista/7: WriteFile/WriteConsoleA reports
    # the number of UTF-16 codes written instead of the number of bytes, so
    # the caller advances its buffer by too little and rewrites the tail.
    writes = []
    buffer = text.replace('\n', '\r\n').encode('utf-8')
    while buffer:
        decoded = buffer.decode('utf-8', 'replace')
        # the caller believes only len(decoded) bytes were consumed
        buffer = buffer[len(decoded):]
        writes.append(decoded.replace('\r', '\n'))
    return ''.join(writes)

For example:

>>> greek = 'αβγδεζηθι\n'
>>> write(greek)
'αβγδεζηθι\n\n�ηθι\n\n�\n\n'

It gets worse with characters that require 3 bytes in UTF-8:

>>> devanagari = 'ऄअआइईउऊऋऌ\n'
>>> write(devanagari)
'ऄअआइईउऊऋऌ\n\n�ईउऊऋऌ\n\n��ऋऌ\n\n��\n\n'

This problem doesn't exist in Windows 8+ because the old LPC-based
communication (LPC is an undocumented protocol that's used extensively
for IPC between Windows subsystems) with the console was rewritten to
use a kernel driver (condrv.sys). Now it works like any other device
by calling NtReadFile, NtWriteFile, and NtDeviceIoControlFile.
Apparently in the rewrite someone fixed the fact that the conhost code
that handles WriteFile and WriteConsoleA was incorrectly returning the
number of UTF-16 codes written instead of the number of bytes.

Unfortunately the rewrite also broke Ctrl+C handling because ReadFile
no longer sets the last error to ERROR_OPERATION_ABORTED when a
console read is interrupted by Ctrl+C. I'm surprised so few Windows
users have noticed or cared that Ctrl+C kills the REPL and misbehaves
with input() in the Windows 8/10 console. The source of the Ctrl+C bug
is an incorrect NTSTATUS code STATUS_ALERTED, which should be
STATUS_CANCELLED. The console has always done this wrong, but before
the rewrite there was common code for ReadFile and ReadConsole that
handled STATUS_ALERTED specially. It's still there in ReadConsole, so
Ctrl+C handling works fine in Unicode programs that use ReadConsoleW
(e.g. cmd.exe, powershell.exe). It also works fine if
win_unicode_console is enabled.

Finally, here's a ctypes example in Windows 10.0.10586 that shows the
unsolvable problem with non-ASCII input when using codepage 65001:

import ctypes, msvcrt
conin = open(r'\\.\CONIN$', 'r+')
hConin = msvcrt.get_osfhandle(conin.fileno())
kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
nread = (ctypes.c_uint * 1)()

ASCII-only input works:

>>> buf = (ctypes.c_char * 100)()
>>> kernel32.ReadFile(hConin, buf, 100, nread, None)
spam
1
>>> nread[0], buf.value
(6, b'spam\r\n')

But it returns EOF if "a" is replaced by Greek "α":

>>> buf = (ctypes.c_char * 100)()
>>> kernel32.ReadFile(hConin, buf, 100, nread, None)
spαm
1
>>> nread[0], buf.value
(0, b'')

Notice that the read is successful but nread is 0. That signifies EOF.
So the REPL will just silently quit as if you entered Ctrl+Z, and
input() will raise EOFError. This can't be worked around. The problem
is in conhost.exe, which assumes a request for N bytes wants N UTF-16
codes from the input buffer. This can only work with ASCII in UTF-8.

Stephen J. Turnbull

unread,
Aug 13, 2016, 4:13:27 AM8/13/16
to Steve Dower, python...@python.org
Steve Dower writes:

> ISTM that changing sys.getfilesystemencoding() on Windows to
> "utf-8" and updating path_converter() (Python/posixmodule.c;

I think this proposal requires the assumption that strings intended to
be interpreted as file names invariably come from the Windows APIs. I
don't think that is true: Makefiles and similar, configuration files,
all typically contain filenames. Zipfiles (see below). Python is
frequently used as a glue language, so presumably receives such file
name information as (more or less opaque) bytes objects over IPC
channels. These just aren't under OS control, so the assumption will
fail.

Supporting Windows users in Japan means dealing with lots of crap
produced by standard-oblivious software. Eg, Shift JIS filenames in
zipfiles. AFAICT Windows itself never does that, but the majority of
zipfiles I get from colleagues have Shift JIS in the directory (and
it's the great majority if you assume that people who use ASCII
transliterations are doing so because they know that non-Windows-users
can't handle Shift JIS file names in zipfiles).

So I believe bytes-oriented software must expect non-UTF-8 file names
in Japan. UTF-8 may have penetration in the rest of the world, but
the great majority of my Windows-using colleagues in Japan still
habitually and by preference use Shift JIS in text files. I suppose
that includes files that are used by programs, and thus file names,
and probably extends to most Windows users here.

I suspect a similar situation holds in China, where AIUI "GB is not
just a good idea, it's the law,"[1] and possibly Taiwan (Big 5) and Korea
(KSC) as those standards have always provided the benefits of (nearly)
universal repertoires[2].

> and add the requirement that [bytes file names] *must* be encoded
> with sys.getfilesystemencoding().

To the extent that this *can* work, it *already* works. Trying to
enforce a particular encoding will simply break working code that
depends on sys.getfilesystemencoding() matching the encoding that
other programs use.

You have no carrot. These changes enforce an encoding on bytes for
Windows APIs but can't do so for data, and so will make file-names-
are-just-bytes programmers less happy with Python, not more happy.

The exception is the proposed console changes, because there you *do*
perform all I/O with OS APIs. But I don't know anything about the
Windows console except that nobody seems happy with it.

> Similarly, locale.getpreferredencoding() on Windows returns a
> legacy value - the user's active code page - which should generally
> not be used for any reason.

This is even less supportable, because it breaks much code that used
to work without specifying an encoding.

Refusing to respect the locale preferred encoding would force most
Japanese scripters to specify encodings where they currently accept
the system default, I suspect. On those occasions when my Windows-using
colleagues deliver text files, they are *always* encoded in Shift JIS.
University databases that deliver CSV files allow selecting Shift JIS
or UTF-8, and most people choose Shift JIS. And so on. In Japan,
Shift JIS remains pervasive on Windows.

I don't think Japan is special in this, except in the pervasiveness of
Shift JIS. For everybody I think there will be more loss than benefit
imposed.

> BOMs are very common on Windows, since the default assumption is
> nearly always a bad idea.

I agree (since 1990!) that Shift JIS by default is a bad idea, but
there's no question that it is still overwhelmingly popular. I
suspect UTF-8 signatures are uncommon, too, as most UTF-8 originates
on Mac or *nix platforms.

> This would match the behavior that the .NET Framework has used for
> many years - effectively, utf_8_sig on read and utf_8 on write.

But .NET is a framework. It expects to be the world in which programs
exist, no? Python is very frequently used as a glue language, and I
suspect the analogy fails due to that distinction.


Footnotes:
[1] Strictly speaking, certain programs must support GB 18030. I
don't think it's legally required to be the default encoding.

[2] For example, the most restricted Japanese standard, JIS X 0208,
includes not only "full-width" versions of ASCII characters, but the
full Greek and Cyrillic alphabets, many math symbols, a full line
drawing set, and much more besides the native syllabary and Han
ideographs. The elderly Chinese GB 2312 not only includes Greek and
Cyrillic, and the various symbols, but also the Japanese syllabaries.
(And the more recent GB 18030 swallowed Unicode whole.)

Adam Bartoš

unread,
Aug 13, 2016, 6:29:34 AM8/13/16
to python...@python.org
On Fri Aug 12 19:03:38 EDT 2016 Victor Stinner wrote:
> For the Windows console: I played with all the Windows functions, tried all
> fonts and many code pages. I also read technical blog articles by Microsoft
> employees. I gave up on this issue. It doesn't seem possible to fully
> support Unicode in the Windows console (at least the last time I checked).
> By the way, it seems like the Windows functions have bugs, and code page
> 65001 fixes a few issues but introduces new ones...
Do you mean that it doesn't seem possible to support Unicode on the Windows console by means of ANSI codepages? Because using the wide APIs seems to work (as win_unicode_console shows). There are some issues like non-BMP characters, which are encoded as surrogate pairs and which the console doesn't understand for display (it shows two boxes), but this is just a matter of display and not of corruption of the actual data (e.g. you can copy the text from the console). Also, there seems to be no font that supports all of Unicode, and AFAIK you cannot configure the console to use multiple fonts, but again this is a display issue of the console window itself rather than of the essential communication between Python and the console.

Adam Bartoš

Adam Bartoš

unread,
Aug 13, 2016, 6:47:43 AM8/13/16
to python...@python.org
Stephen J. Turnbull writes: 
> The exception is the proposed console changes, because there you *do*
> perform all I/O with OS APIs.  But I don't know anything about the
> Windows console except that nobody seems happy with it.
I'm quite happy with it. I mean, it's far from perfect, and when you look at discussions on Stack Overflow regarding Unicode on the Windows console, almost everyone blames the Windows console, but I think that software often doesn't communicate with it correctly. When the Windows Unicode wide APIs are used, it just works.
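
As a minimal illustration of that wide-API path (a ctypes sketch only, not the actual win_unicode_console code):

import ctypes, msvcrt, sys

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

def console_write(text):
    # Write the text straight to the console's UTF-16 buffer via WriteConsoleW;
    # no ANSI/OEM codepage is involved at any point.
    # Note: len(text) counts code points, so this sketch assumes BMP-only text.
    handle = msvcrt.get_osfhandle(sys.stdout.fileno())
    written = ctypes.c_uint(0)
    ok = kernel32.WriteConsoleW(handle, text, len(text),
                                ctypes.byref(written), None)
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
    return written.value   # UTF-16 code units written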

Adam Bartoš

Random832

unread,
Aug 13, 2016, 8:24:27 AM8/13/16
to python...@python.org
On Sat, Aug 13, 2016, at 04:12, Stephen J. Turnbull wrote:
> Steve Dower writes:
> > ISTM that changing sys.getfilesystemencoding() on Windows to
> > "utf-8" and updating path_converter() (Python/posixmodule.c;
>
> I think this proposal requires the assumption that strings intended to
> be interpreted as file names invariably come from the Windows APIs. I
> don't think that is true: Makefiles and similar, configuration files,
> all typically contain filenames. Zipfiles (see below).

And what's going to happen if you shovel those bytes into the
filesystem without conversion on Linux, or worse, OSX? This problem
isn't unique to Windows.

> Python is frequently used as a glue language, so presumably receives
> such file name information as (more or less opaque) bytes objects over
> IPC channels.

They *can't* be opaque. Someone has to decide what they mean, and you as
the application developer might well have to step up and *be that
someone*. If you don't, someone else will decide for you.

> These just aren't under OS control, so the assumption will
> fail.
>
> So I believe bytes-oriented software must expect non-UTF-8 file names
> in Japan.

The only way to deal with data representing filenames and destined for
the filesystem on windows is to convert it, somehow, ultimately to
UTF-16-LE. Not doing so is impossible, it's only a question of what
layer it happens in. If you convert it using the wrong encoding, you
lose. The only way to deal with it on Mac OS X is to convert it to
UTF-8. If you don't, you lose. If you convert it using the wrong
encoding, you lose.

This proposal embodies an assumption that bytes from unknown sources
used as filenames are more likely to be UTF-8 than in the locale ACP
(i.e. "mbcs" in pythonspeak, and Shift-JIS in Japan). Personally, I
think the whole edifice is rotten, and choosing one encoding over
another isn't a solution; the only solution is to require the
application to make a considered decision about what the bytes mean and
pass its best effort at converting to a Unicode string to the API. This
is true on Windows, it's true on OSX, and I would argue it's pretty
close to being true on Linux except in a few very niche cases. So I
think for the filesystem encoding we should stay the course, continuing
to print a DeprecationWarning and maybe, just maybe, eventually actually
deprecating it.

On Windows and OSX, this "glue language" business of shoveling bytes
from one place to another without caring what they mean can only last as
long as they don't touch the filesystem.

> You have no carrot. These changes enforce an encoding on bytes for
> Windows APIs but can't do so for data, and so will make file-names-
> are-just-bytes programmers less happy with Python, not more happy.

I think the use case that the proposal has in mind is a
file-names-are-just-bytes program (or set of programs) that reads from
the filesystem,
converts to bytes for a file/network, and then eventually does the
reverse - either end may be on windows. Using UTF-8 will allow those to
make the round trip (strictly speaking, you may need surrogatepass, and
OSX does its weird normalization thing), using any other encoding
(except for perhaps GB18030) will not.

Steve Dower

unread,
Aug 13, 2016, 1:25:36 PM8/13/16
to python...@python.org
Just a heads-up that I've assigned http://bugs.python.org/issue1602 to
myself and started a patch for the console changes. Let's move the
console discussion back over there.

Hopefully it will show up in 3.6.0b1, but if you're prepared to apply a
patch and test on Windows, feel free to grab my work so far. There's a
lot of "making sure other things aren't broken" left to do.

Cheers,
Steve

Steve Dower

unread,
Aug 13, 2016, 1:46:28 PM8/13/16
to Random832, python...@python.org
On 13Aug2016 0523, Random832 wrote:
> On Sat, Aug 13, 2016, at 04:12, Stephen J. Turnbull wrote:
>> Steve Dower writes:
>> > ISTM that changing sys.getfilesystemencoding() on Windows to
>> > "utf-8" and updating path_converter() (Python/posixmodule.c;
>>
>> I think this proposal requires the assumption that strings intended to
>> be interpreted as file names invariably come from the Windows APIs. I
>> don't think that is true: Makefiles and similar, configuration files,
>> all typically contain filenames. Zipfiles (see below).
>
> And what's going to happen if you shovel those bytes into the
> filesystem without conversion on Linux, or worse, OSX? This problem
> isn't unique to Windows.

Yeah, this is basically my view too. If your path bytes don't come from
the filesystem, you need to know the encoding regardless. But it's very
reasonable to be able to round-trip. Currently, the following two lines
of code can have different behaviour on Windows (i.e. the latter fails
to open the file):

>>> open(os.listdir('.')[-1])
>>> open(os.listdir(b'.')[-1])

On Windows, the filesystem encoding is inherently Unicode, which means
you can't reliably round-trip filenames through the current code page.
Changing all of Python to use the Unicode APIs internally and making the
bytes encoding utf-8 (or utf-16-le, which would save a conversion)
resolves this and doesn't really affect

>> These just aren't under OS control, so the assumption will
>> fail.
>>
>> So I believe bytes-oriented software must expect non-UTF-8 file names
>> in Japan.

Even on Japanese Windows, non-UTF-8 file names must be encodable with
UTF-16 or they cannot exist on the file system. This moves the encoding
boundary into the application, which is where it needed to be anyway for
robust software - "Correct" path handling still requires decoding to
text, and if you know that your source is the encoded with the active
code page then byte_path.decode('mbcs', 'surrogateescape') is still valid.
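
(In other words, the explicit version of moving the boundary into the application is a sketch like the following; the helper name is made up:)

import os, sys

def path_from_legacy_bytes(byte_path):
    # The bytes came from something that used the active code page (a config
    # file, another process, ...), so decode them explicitly at the boundary
    # and work with str from there on.
    if sys.platform == 'win32':
        return byte_path.decode('mbcs', 'surrogateescape')
    return os.fsdecode(byte_path)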

Cheers,
Steve

Stephen J. Turnbull

unread,
Aug 13, 2016, 3:10:44 PM8/13/16
to Random832, python...@python.org
Random832 writes:

> And what's going to happen if you shovel those bytes into the
> filesystem without conversion on Linux, or worse, OSX?

Off topic. See Subject: field.

> This proposal embodies an assumption that bytes from unknown sources
> used as filenames are more likely to be UTF-8 than in the locale ACP

Then it's irrelevant: most bytes are not from "unknown sources",
they're from correspondents (or from yourself!) -- and for most users
most of the time, those correspondents share the locale encoding with
them. At least where I live, they use that encoding frequently.

> the only solution is to require the application to make a
> considered decision

That's not a solution. Code is not written with every decision
considered, and it never will be. The (long-run) solution is a la
Henry Ford: "you can encode text any way you want, as long as it's
UTF-8". Then it won't matter if people ever make considered decisions
about encoding! But trying to enforce that instead of letting it
evolve naturally (as it is doing) will cause unnecessary pain for
Python programmers, and I believe quite a lot of pain.

I used to be in the "make them speak UTF-8" camp. But in the 15 years
since PEP 263, experience has shown me that mostly it doesn't matter,
and that when it does matter, you have to deal with the large variety
of encodings anyway -- assuming UTF-8 is not a win. For use cases
that can be encoding-agnostic because all cooperating participants
share a locale encoding, making them explicitly specify the locale
encoding is just a matter of "misery loves company". Please, let's
not do things for that reason.

> I think the use case that the proposal has in mind is a
> file-names-are-just-bytes program (or set of programs) that reads
> from the filesystem, converts to bytes for a file/network, and then
> eventually does the reverse - either end may be on windows.

You have misspoken somewhere. The programs under discussion do not
"convert" input to bytes; they *receive* bytes, either from POSIX APIs
or from Windows *A APIs, and use them as is. Unless I am greatly
mistaken, Steve simply wants that to work as well on Windows as on
POSIX platforms, so that POSIX programmers who do encoding-agnostic
programming have one less barrier to supporting their software on
Windows. But you'll have to ask Steve to rule on that.

Steve

Steve Dower

unread,
Aug 13, 2016, 6:02:00 PM8/13/16
to Stephen J. Turnbull, Random832, python...@python.org
The last point is correct: if you get bytes from a file system API, you should be able to pass them back in without losing information. CP_ACP (a.k.a. the *A API) does not allow this, so I'm proposing using the *W API everywhere and encoding to utf-8 when the user wants/gives bytes.


Top-posted from my Windows Phone

From: Stephen J. Turnbull
Sent: ‎8/‎13/‎2016 12:11
To: Random832
Cc: python...@python.org

Subject: Re: [Python-ideas] Fix default encodings on Windows

Victor Stinner

unread,
Aug 14, 2016, 12:20:56 PM8/14/16
to Steve Dower, python-ideas

> The last point is correct: if you get bytes from a file system API, you should be able to pass them back in without losing information. CP_ACP (a.k.a. the *A API) does not allow this, so I'm proposing using the *W API everywhere and encoding to utf-8 when the user wants/gives bytes.

You get into trouble when the filename comes from a file, another application, a registry key, ... which is encoded in CP_ACP.

Do you plan to transcode all of this data (decode from CP_ACP, encode back to UTF-8)?

Steve Dower

unread,
Aug 14, 2016, 1:50:45 PM8/14/16
to Victor Stinner, python-ideas
I plan to use only Unicode to interact with the OS and then utf8 within Python if the caller wants bytes.

Currently we effectively use Unicode to interact with the OS and then CP_ACP if the caller wants bytes.

All the *A APIs just decode strings and call the *W APIs, and encode the return values. I'm proposing that we move the decoding and encoding into Python and make it (nearly) lossless.

In practice, this means all *A APIs are banned within the CPython source, and if we get/need bytes we have to convert to text first using the FS encoding, which will be utf8.


Top-posted from my Windows Phone

From: Victor Stinner
Sent: ‎8/‎14/‎2016 9:20
To: Steve Dower
Cc: Stephen J. Turnbull; python-ideas; Random832

Subject: Re: [Python-ideas] Fix default encodings on Windows

Stephen J. Turnbull

unread,
Aug 15, 2016, 1:06:22 AM8/15/16
to Steve Dower, python-ideas
Steve Dower writes:

> I plan to use only Unicode to interact with the OS and then utf8
> within Python if the caller wants bytes.

This doesn't answer Victor's questions, or mine.

This proposal requires identifying and transcoding bytes that
represent text in encodings other than UTF-8.

1. How do you propose to identify "bytes that represent text (and
might be filenames)" if they did *not* originate in a filesystem or
console API?

2. How do you propose to identify the non-UTF-8 encoding, if you have
forced all variables signifying bytes encodings to UTF-8?

Additional considerations:

As far as I can see, this is just a recipe for a different way to get
mojibake. *The* way to avoid mojibake is to "let text be text"
*internally*. Developers who insist on processing text as bytes are
going to get what they deserve *in edge cases*. But mostly (ie, in
the mono-encoding environments of most users) it just (barely ;-) works.

And there are many use cases where you *can* process bytes that happen
to encode text as "just bytes" (eg, low-level networking code). These
cases have performance issues if the bytes-text-bytes-text-bytes
double-round-trip implied for *stream content* (vs the OS APIs you're
concerned with, which effectively round-trip text-bytes-text) is
imposed on them.

Steve Dower

unread,
Aug 15, 2016, 9:24:54 AM8/15/16
to Stephen J. Turnbull, python-ideas
I guess I'm not sure what your question is then.

Using text internally is of course the best way to deal with it. But for those who insist on using bytes, this change at least makes Windows a feasible target without requiring manual encoding/decoding at every boundary.


Top-posted from my Windows Phone
Sent: ‎8/‎14/‎2016 22:06
To: Steve Dower
Cc: Victor Stinner; python-ideas; Random832
Subject: RE: [Python-ideas] Fix default encodings on Windows

Random832

unread,
Aug 15, 2016, 9:41:54 AM8/15/16
to Steve Dower, Stephen J. Turnbull, python-ideas
On Mon, Aug 15, 2016, at 09:23, Steve Dower wrote:
> I guess I'm not sure what your question is then.
>
> Using text internally is of course the best way to deal with it. But for
> those who insist on using bytes, this change at least makes Windows a
> feasible target without requiring manual encoding/decoding at every
> boundary.

Why isn't it already? What's "not feasible" about requiring manual
encoding/decoding?

Basically your assumption is that people using Python on windows and
having to deal with files that contain filename data encoded as bytes
are more likely to be dealing with data that is either UTF-8 anyway
(coming from Linux or some other platform) or came from the current
version of Python (which will encode things in UTF-8 under the change)
than they are to deal with data that came from other Windows programs
that encoded things in the codepage used by them and by other Windows
users in the same country / who speak the same language.

Steve Dower

unread,
Aug 15, 2016, 12:36:22 PM8/15/16
to Random832, Stephen J. Turnbull, python-ideas
I'm still not sure we're talking about the same thing right now.

For `open(path_as_bytes).read()`, are we talking about the way path_as_bytes is passed to the file system? Or the codec used to decide the returned string?


Top-posted from my Windows Phone

From: Random832
Sent: ‎8/‎15/‎2016 6:41
To: Steve Dower; Stephen J. Turnbull
Cc: Victor Stinner; python-ideas

Subject: Re: [Python-ideas] Fix default encodings on Windows

Random832

unread,
Aug 15, 2016, 12:54:53 PM8/15/16
to Steve Dower, Stephen J. Turnbull, python-ideas
On Mon, Aug 15, 2016, at 12:35, Steve Dower wrote:
> I'm still not sure we're talking about the same thing right now.
>
> For `open(path_as_bytes).read()`, are we talking about the way
> path_as_bytes is passed to the file system? Or the codec used to decide
> the returned string?

We are talking about the way path_as_bytes is passed to the filesystem,
and in particular what encoding path_as_bytes is *actually* in, when it
was obtained from a file or other stream opened in binary mode.

Steve Dower

unread,
Aug 15, 2016, 2:27:35 PM8/15/16
to Random832, Stephen J. Turnbull, python-ideas
On 15Aug2016 0954, Random832 wrote:
> On Mon, Aug 15, 2016, at 12:35, Steve Dower wrote:
>> I'm still not sure we're talking about the same thing right now.
>>
>> For `open(path_as_bytes).read()`, are we talking about the way
>> path_as_bytes is passed to the file system? Or the codec used to decide
>> the returned string?
>
> We are talking about the way path_as_bytes is passed to the filesystem,
> and in particular what encoding path_as_bytes is *actually* in, when it
> was obtained from a file or other stream opened in binary mode.

Okay good, we are talking about the same thing.

Passing path_as_bytes in that location has been deprecated since 3.3, so
we are well within our rights (and probably overdue) to make it a
TypeError in 3.6. While it's obviously an invalid assumption, for the
purposes of changing the language we can assume that no existing code is
passing bytes into any functions where it has been deprecated.

As far as I'm concerned, there are currently no filesystem APIs on
Windows that accept paths as bytes.


Given that, I'm proposing adding support for using byte strings encoded
with UTF-8 in file system functions on Windows. This allows Python users
to omit switching code like:

if os.name == 'nt':
    f = os.stat(os.listdir('.')[-1])
else:
    f = os.stat(os.listdir(b'.')[-1])

Or simply using the bytes variant unconditionally because they heard it
was faster (sacrificing cross-platform correctness, since it may not
correctly round-trip on Windows).

My proposal is to remove all use of the *A APIs and only use the *W
APIs. That completely removes the (already deprecated) use of bytes as
paths. I then propose to change the (unused on Windows)
sys.getfsdefaultencoding() to 'utf-8' and handle bytes being passed into
filesystem functions by transcoding into UTF-16 and calling the *W APIs.

This completely removes the active codepage from the chain, allows paths
returned from the filesystem to correctly roundtrip via bytes in Python,
and allows those bytes paths to be manipulated at '\' characters.
(Frankly I don't mind what encoding we use, and I'd be quite happy to
force bytes paths to be UTF-16-LE encoded, which would also round-trip
invalid surrogate pairs. But that would prevent basic manipulation which
seems to be a higher priority.)
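
To make that last point concrete, a quick sketch (the path is invented
purely for illustration): in UTF-16-LE the byte 0x5C can occur inside an
unrelated code unit, so splitting a bytes path on b'\\' can cut a
character in two:

>>> raw = 'C:\\\u5c71.txt'.encode('utf-16-le')   # U+5C71 encodes as b'q\\'
>>> raw.split(b'\\')                             # splits inside the character
[b'C\x00:\x00', b'\x00q', b'.\x00t\x00x\x00t\x00']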

This does not allow you to take bytes from an arbitrary source and
assume that they are correctly encoded for the file system. Python 3.3,
3.4 and 3.5 have been warning that doing that is deprecated and the path
needs to be decoded to a known encoding first. At this stage, it's time
for us to either make byte paths an error, or to specify a suitable
encoding that can correctly round-trip paths.
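
To make the round-trip requirement concrete, a minimal sketch, assuming
sys.getfilesystemencoding() starts returning 'utf-8' on Windows as
proposed (os.fsencode()/os.fsdecode() already defer to it):

import os

name = 'h\u00e9.txt'              # a name as the *W APIs would return it
raw = os.fsencode(name)           # b'h\xc3\xa9.txt' once the encoding is utf-8
assert os.fsdecode(raw) == name   # lossless round-trip back to str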


If this does not answer the question, I'm going to need the question to
be explained more clearly for me.

Cheers,
Steve

Steve Dower

unread,
Aug 15, 2016, 2:38:52 PM8/15/16
to Random832, Stephen J. Turnbull, python-ideas
On 15Aug2016 1126, Steve Dower wrote:
> My proposal is to remove all use of the *A APIs and only use the *W
> APIs. That completely removes the (already deprecated) use of bytes as
> paths. I then propose to change the (unused on Windows)
> sys.getfsdefaultencoding() to 'utf-8' and handle bytes being passed into
> filesystem functions by transcoding into UTF-16 and calling the *W APIs.

Of course, I meant sys.getfilesystemencoding() here. The C functions
have "FSDefault" in many of the names, which is why I guessed the wrong
Python variant.

Big Stone

unread,
Aug 15, 2016, 5:00:41 PM8/15/16
to python-ideas, rand...@fastmail.com, turnbull....@u.tsukuba.ac.jp, python...@python.org, steve...@python.org
hi,

As a Windows user facing unicode issues:
- the sooner the Microsoft world shifts to plain 'utf-8' the better,
- the problem is that in the Unix world (SQLite, for example), a UTF-8 BOM (the 'utf-8-sig' form that is Microsoft's current workaround) tends to be treated as an error.
==> I don't think Steve's proposal will make things any worse than today; rather, it's a step along the right convergence path.

eryk sun

unread,
Aug 15, 2016, 9:20:32 PM8/15/16
to python-ideas
On Mon, Aug 15, 2016 at 6:26 PM, Steve Dower <steve...@python.org> wrote:
>
> (Frankly I don't mind what encoding we use, and I'd be quite happy to force bytes
> paths to be UTF-16-LE encoded, which would also round-trip invalid surrogate
> pairs. But that would prevent basic manipulation which seems to be a higher
> priority.)

The CRT manually decodes and encodes using the private functions
__acrt_copy_path_to_wide_string and __acrt_copy_to_char. These use
either the ANSI or OEM codepage, depending on the value returned by
WinAPI AreFileApisANSI. CPython could follow suit. Doing its own
encoding and decoding would enable using filesystem functions that
will never get an [A]NSI version (e.g. GetFileInformationByHandleEx),
while still retaining backward compatibility.

Filesystem encoding could use WC_NO_BEST_FIT_CHARS and raise a warning
when lpUsedDefaultChar is true. Filesystem decoding could use
MB_ERR_INVALID_CHARS and raise a warning and retry without this flag
for ERROR_NO_UNICODE_TRANSLATION (e.g. an invalid DBCS sequence). This
could be implemented with a new "warning" handler for
PyUnicode_EncodeCodePage and PyUnicode_DecodeCodePageStateful. A new
'fsmbcs' encoding could be added that checks AreFileApisANSI to choose
between CP_ACP and CP_OEMCP.
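
A rough Windows-only ctypes sketch of that decode-strictly-then-retry
idea (purely illustrative -- the helper name and structure are invented,
not an actual CPython patch):

import ctypes

CP_ACP = 0
MB_ERR_INVALID_CHARS = 0x08
ERROR_NO_UNICODE_TRANSLATION = 1113

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

def acp_bytes_to_wide(raw):
    def convert(flags):
        # First call with a NULL buffer asks for the size in wide chars.
        n = kernel32.MultiByteToWideChar(CP_ACP, flags, raw, len(raw), None, 0)
        if n == 0:
            return None
        buf = ctypes.create_unicode_buffer(n)
        kernel32.MultiByteToWideChar(CP_ACP, flags, raw, len(raw), buf, n)
        return buf[:n]

    text = convert(MB_ERR_INVALID_CHARS)   # strict: reject bad DBCS sequences
    if text is None and ctypes.get_last_error() == ERROR_NO_UNICODE_TRANSLATION:
        # This is where a warning (or an error) could be raised before
        # retrying permissively, which substitutes a default character.
        text = convert(0)
    return text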

Chris Barker - NOAA Federal

unread,
Aug 15, 2016, 9:35:46 PM8/15/16
to Steve Dower, python-ideas
> Given that, I'm proposing adding support for using byte strings encoded with UTF-8 in file system functions on Windows. This allows Python users to omit switching code like:
>
> if os.name == 'nt':
>     f = os.stat(os.listdir('.')[-1])
> else:
>     f = os.stat(os.listdir(b'.')[-1])

REALLY? Do we really want to encourage using bytes as paths? IIUC,
anyone that wants to platform-independentify that code just needs to
use proper strings (or pathlib) for paths everywhere, yes?

I understand that pre-surrogate-escape, there was a need for bytes
paths, but those days are gone, yes?

So why, at this late date, kludge what should be a deprecated pattern
into the Windows build???

-CHB

> My proposal is to remove all use of the *A APIs and only use the *W APIs. That completely removes the (already deprecated) use of bytes as paths.

Yes, this is good.

> I then propose to change the (unused on Windows) sys.getfsdefaultencoding() to 'utf-8' and handle bytes being passed into filesystem functions by transcoding into UTF-16 and calling the *W APIs.

I'm really not sure utf-8 is magic enough to do this. Where do you
imagine that utf-8 is coming from as bytes???

AIUI, while utf-8 is almost universal in *nix for file system names,
folks do not want to count on it -- hence the use of bytes. And it is
far less prevalent in the Windows world...

> , allows paths returned from the filesystem to correctly roundtrip via bytes in Python,

That you could do with native bytes (UTF-16, yes?)

> . But that would prevent basic manipulation which seems to be a higher priority.)

Still think Unicode is the answer to that...

> At this stage, it's time for us to either make byte paths an error,

+1. :-)

CHB

Steve Dower

unread,
Aug 15, 2016, 9:40:34 PM8/15/16
to python...@python.org
On 15Aug2016 1819, eryk sun wrote:
> On Mon, Aug 15, 2016 at 6:26 PM, Steve Dower <steve...@python.org> wrote:
>>
>> (Frankly I don't mind what encoding we use, and I'd be quite happy to force bytes
>> paths to be UTF-16-LE encoded, which would also round-trip invalid surrogate
>> pairs. But that would prevent basic manipulation which seems to be a higher
>> priority.)
>
> The CRT manually decodes and encodes using the private functions
> __acrt_copy_path_to_wide_string and __acrt_copy_to_char. These use
> either the ANSI or OEM codepage, depending on the value returned by
> WinAPI AreFileApisANSI. CPython could follow suit. Doing its own
> encoding and decoding would enable using filesystem functions that
> will never get an [A]NSI version (e.g. GetFileInformationByHandleEx),
> while still retaining backward compatibility.
>
> Filesystem encoding could use WC_NO_BEST_FIT_CHARS and raise a warning
> when lpUsedDefaultChar is true. Filesystem decoding could use
> MB_ERR_INVALID_CHARS and raise a warning and retry without this flag
> for ERROR_NO_UNICODE_TRANSLATION (e.g. an invalid DBCS sequence). This
> could be implemented with a new "warning" handler for
> PyUnicode_EncodeCodePage and PyUnicode_DecodeCodePageStateful. A new
> 'fsmbcs' encoding could be added that checks AreFileApisANSI to choose
> betwen CP_ACP and CP_OEMCP.

None of that makes it less complicated or more reliable. Warnings based
on values are bad (they should be based on types) and using the *W APIs
exclusively is the right way to go. The question then is whether we
allow file system functions to return bytes, and if so, which encoding
to use. This then directly informs what the functions accept, for the
purposes of round-tripping.

*Any* encoding that may silently lose data is a problem, which basically
leaves utf-16 as the only option. However, as that causes other
problems, maybe we can accept the tradeoff of returning utf-8 and
failing when a path contains invalid surrogate pairs (which is extremely
rare by comparison to characters outside of CP_ACP)?

If utf-8 is unacceptable, we're back to the current situation and should
be removing the support for bytes that was deprecated three versions ago.

Cheers,
Steve

Nick Coghlan

unread,
Aug 15, 2016, 10:00:55 PM8/15/16
to Chris Barker - NOAA Federal, python-ideas
On 16 August 2016 at 11:34, Chris Barker - NOAA Federal
<chris....@noaa.gov> wrote:
>> Given that, I'm proposing adding support for using byte strings encoded with UTF-8 in file system functions on Windows. This allows Python users to omit switching code like:
>>
>> if os.name == 'nt':
>>     f = os.stat(os.listdir('.')[-1])
>> else:
>>     f = os.stat(os.listdir(b'.')[-1])
>
> REALLY? Do we really want to encourage using bytes as paths? IIUC,
> anyone that wants to platform-independentify that code just needs to
> use proper strings (or pat glib) for paths everywhere, yes?

The problem is that bytes-as-paths actually *does* work for Mac OS X
and systemd based Linux distros properly configured to use UTF-8 for
OS interactions. This means that a lot of backend network service code
makes that assumption, especially when it was originally written for
Python 2, and rather than making it work properly on Windows, folks
just drop Windows support as part of migrating to Python 3.

At an ecosystem level, that means we're faced with a choice between
implicitly encouraging folks to make their code *nix only, and finding
a way to provide a more *nix like experience when running on Windows
(where UTF-8 encoded binary data just works, and either other
encodings lead to mojibake or else you use chardet to figure things
out).

Steve is suggesting that the latter option is preferable, a view I
agree with since it lowers barriers to entry for Windows based
developers to contribute to primarily *nix focused projects.

> I understand that pre-surrogate-escape, there was a need for bytes
> paths, but those days are gone, yes?

No, UTF-8 encoded bytes are still the native language of network
service development: http://utf8everywhere.org/

It also helps with cases where folks are switching back and forth
between Python and other environments like JavaScript and Go where the
UTF-8 assumption is more prevalent.

> So why, at this late date, kludge what should be a deprecated pattern
> into the Windows build???

Promoting cross-platform consistency often leads to enabling patterns
that are considered a bad idea from a native platform perspective, and
this strikes me as an example of that (just as the binary/text
separation itself is a case where Python 3 diverged from the POSIX
text model to improve consistency across *nix, Windows, JVM and CLR
environments).

Cheers,
Nick.

eryk sun

unread,
Aug 16, 2016, 2:07:41 AM8/16/16
to python...@python.org
>> On Mon, Aug 15, 2016 at 6:26 PM, Steve Dower <steve...@python.org>
>> wrote:
>
> and using the *W APIs exclusively is the right way to go.

My proposal was to use the wide-character APIs, but transcoding CP_ACP
without best-fit characters and raising a warning whenever the default
character is used (e.g. substituting Katakana middle dot when creating
a file using a bytes path that has an invalid sequence in CP932). This
proposal was in response to the case made by Stephen Turnbull. If
using UTF-8 is getting such heavy pushback, I thought half a solution
was better than nothing, and it also sets up the infrastructure to
easily switch to UTF-8 if that idea eventually gains acceptance. It
could raise exceptions instead of warnings if that's preferred, since
bytes paths on Windows are already deprecated.

> *Any* encoding that may silently lose data is a problem, which basically
> leaves utf-16 as the only option. However, as that causes other problems,
> maybe we can accept the tradeoff of returning utf-8 and failing when a
> path contains invalid surrogate pairs

Are there any common sources of illegal UTF-16 surrogates in Windows
filenames? I see that WTF-8 (Wobbly) was developed to handle this
problem. A WTF-8 path would roundtrip back to the filesystem, but it
should only be used internally in a program.
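
For what it's worth, the 'surrogatepass' error handler already gives a
WTF-8-style round-trip for a lone surrogate (a quick illustrative
session, nothing Windows-specific):

>>> s = 'bad\ud800name'                      # lone surrogate
>>> s.encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 3: surrogates not allowed
>>> b = s.encode('utf-8', 'surrogatepass')
>>> b
b'bad\xed\xa0\x80name'
>>> b.decode('utf-8', 'surrogatepass') == s
True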

Victor Stinner

unread,
Aug 16, 2016, 6:30:08 AM8/16/16
to eryk sun, python-ideas
2016-08-16 8:06 GMT+02:00 eryk sun <ery...@gmail.com>:
> My proposal was to use the wide-character APIs, but transcoding CP_ACP
> without best-fit characters and raising a warning whenever the default
> character is used (e.g. substituting Katakana middle dot when creating
> a file using a bytes path that has an invalid sequence in CP932).

A problem with all these proposals is that they *add* new code to the
CPython code base, code specific to Windows. There are very few core
developers (1 or 2?) who work on this code specific to Windows.


I would prefer to *drop* code specific to Windows rather than *add*
(or change) code specific to Windows, just to make the CPython code
base simpler to maintain.

It's already annoying enough. It's common that a Python function has
one implementation for all platforms except Windows, and a second
implementation specific to Windows.

An example: os.listdir()

* ~150 lines of C code for the Windows implementation
* ~100 lines of C code for the UNIX/BSD implementation
* The Windows implementation is split into two parts: Unicode and
bytes, so the code is basically duplicated (2 versions)

If you remove the bytes support, the Windows function is reduced to
100 lines (-50).


I'm not sure that modifying the API using bytes would solve any issue
on Windows, and there is an obvious risk of regression (mojibake when
you concatenate strings encoded to UTF-8 and to the ANSI code page).

I'm in favor of forcing developers to use Unicode on Windows, which is
the correct way to use the Windows API. The side effect is that such
code works perfectly well on UNIX/BSD ;-) To be clear: drop the
deprecated code to support bytes on Windows.

I already proposed to drop bytes support on Windows and most answers
were "please keep them", so another option is to keep the "broken
code" as the status quo...

I really hate APIs using bytes on Windows because they use
WideCharToMultiByte() (encode unicode to bytes) in a mode which is
likely to lead to mojibake: unencodable characters are replaced with
"best fit characters" or "?".
https://unicodebook.readthedocs.io/operating_systems.html#encode-and-decode-functions


In a perfect world, I would also propose to deprecate bytes filenames
on UNIX, but I expect an insane flamewar on the definition of "UNIX",
history of UNIX, etc. (non technical discussion, since Unicode works
very well on Python 3...).

Victor

Paul Moore

unread,
Aug 16, 2016, 6:54:05 AM8/16/16
to Nick Coghlan, python-ideas
On 15 August 2016 at 19:26, Steve Dower <steve...@python.org> wrote:
> Passing path_as_bytes in that location has been deprecated since 3.3, so we
> are well within our rights (and probably overdue) to make it a TypeError in
> 3.6. While it's obviously an invalid assumption, for the purposes of
> changing the language we can assume that no existing code is passing bytes
> into any functions where it has been deprecated.
>
> As far as I'm concerned, there are currently no filesystem APIs on Windows
> that accept paths as bytes.

[...]

On 16 August 2016 at 03:00, Nick Coghlan <ncog...@gmail.com> wrote:
> The problem is that bytes-as-paths actually *does* work for Mac OS X
> and systemd based Linux distros properly configured to use UTF-8 for
> OS interactions. This means that a lot of backend network service code
> makes that assumption, especially when it was originally written for
> Python 2, and rather than making it work properly on Windows, folks
> just drop Windows support as part of migrating to Python 3.
>
> At an ecosystem level, that means we're faced with a choice between
> implicitly encouraging folks to make their code *nix only, and finding
> a way to provide a more *nix like experience when running on Windows
> (where UTF-8 encoded binary data just works, and either other
> encodings lead to mojibake or else you use chardet to figure things
> out).
>
> Steve is suggesting that the latter option is preferable, a view I
> agree with since it lowers barriers to entry for Windows based
> developers to contribute to primarily *nix focused projects.

So does this mean that you're recommending reverting the deprecation
of bytes as paths in favour of documenting that bytes as paths is
acceptable, but it will require an encoding of UTF-8 rather than the
current behaviour? If so, that raises some questions:

1. Is it OK to backtrack on a deprecation by changing the behaviour
like this? (I think it is, but others who rely on the current,
deprecated, behaviour may not).
2. Should we be making "always UTF-8" the behaviour on all platforms,
rather than just Windows (e.g., Unix systems which haven't got UTF-8
as their locale setting)? This doesn't seem to be a Windows-specific
question any more (I'm assuming that if bytes-as-paths are deprecated,
that's a cross-platform change, but see below).

Having said all this, I can't find the documentation stating that
bytes paths are deprecated - the open() documentation for 3.5 says
"file is either a string or bytes object giving the pathname (absolute
or relative to the current working directory) of the file to be opened
or an integer file descriptor of the file to be wrapped" and there's
no mention of a deprecation. Steve - could you provide a reference?

Paul

eryk sun

unread,
Aug 16, 2016, 9:11:00 AM8/16/16
to python-ideas
On Tue, Aug 16, 2016 at 10:53 AM, Paul Moore <p.f....@gmail.com> wrote:
>
> Having said all this, I can't find the documentation stating that
> bytes paths are deprecated - the open() documentation for 3.5 says
> "file is either a string or bytes object giving the pathname (absolute
> or relative to the current working directory) of the file to be opened
> or an integer file descriptor of the file to be wrapped" and there's
> no mention of a deprecation.

Bytes paths aren't deprecated on Unix -- only on Windows, and only for
the os functions. You can see the deprecation warning with -Wall:

>>> os.listdir(b'.')
__main__:1: DeprecationWarning: The Windows bytes API has been
deprecated, use Unicode filenames instead

AFAIK this isn't documented.

Since the Windows CRT's _open implementation uses MultiByteToWideChar
without the flag MB_ERR_INVALID_CHARS, bytes paths should also be
deprecated for io.open. The problem is that bad DBCS sequences are
mapped silently to the default Unicode character instead of raising an
error.
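
As a quick illustration (my own, with a deliberately truncated sequence):

raw = b'abc\x82'      # 0x82 is a CP932 lead byte with no trail byte
raw.decode('cp932')   # Python's strict codec raises UnicodeDecodeError
# ...whereas opening such a bytes path on 3.5 goes through the CRT, which
# silently maps the bad sequence to the default character instead.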

Stephen J. Turnbull

unread,
Aug 16, 2016, 9:50:04 AM8/16/16
to Nick Coghlan, python-ideas
Nick Coghlan writes:

> At an ecosystem level, that means we're faced with a choice between
> implicitly encouraging folks to make their code *nix only, and
> finding a way to provide a more *nix like experience when running
> on Windows (where UTF-8 encoded binary data just works, and either
> other encodings lead to mojibake or else you use chardet to figure
> things out).

Most of the time we do know what the encoding is, we can just ask
Windows (although Steve proposes to make Python fib about that, we
could add other APIs).

This change means that programs that until now could be encoding-
agnostic and just pass around bytes on Windows, counting on Python to
consistently convert those to the appropriate form for the API, can't
do that any more. They have to find out what the encoding is, and
transcode to UTF-8, or rewrite to do their processing as text. This
is a potential burden on existing user code.

I suppose that there are such programs, for the same reasons that
networking programs tend to use bytes I/O: ports from Python 2, a
(misplaced?) emphasis on performance, etc.

> Steve is suggesting that the latter option is preferable, a view I
> agree with since it lowers barriers to entry for Windows based
> developers to contribute to primarily *nix focused projects.

Sure, but do you have any idea what the costs might be? Aside from
the code burden mentioned above, there's a reputational issue. Just
yesterday I was having a (good-natured) Perl vs. Python discussion on
my LUG ML, and two developers volunteered that they avoid Python
because "the Python developers frequently break backward
compatibility". These memes tend to go off on their own anyway, but
this change will really feed that one.

> Promoting cross-platform consistency often leads to enabling
> patterns that are considered a bad idea from a native platform
> perspective, and this strikes me as an example of that (just as the
> binary/text separation itself is a case where Python 3 diverged
> from the POSIX text model to improve consistency across *nix,
> Windows, JVM and CLR environments).

I would say rather Python 3 chose an across-the-board better, more
robust model supporting internationalization and multilingualization
properly. POSIX's text model is suitable at best for a fragile
localization.

This change, OTOH, is a step backward we wouldn't consider except for
the intended effect on ease of writing networking code. That's
important, but I really don't think that's going to be the only major
effect, and I fear it won't be the most important effect.

Of course that's FUD -- I have no data on potential burden to existing
use cases, or harm to reputation. But neither do you and Steve. :-(

Steve Dower

unread,
Aug 16, 2016, 10:00:17 AM8/16/16
to Paul Moore, Nick Coghlan, python-ideas
Hmm, doesn't seem to be explicitly listed as a deprecation, though discussion from around that time makes it clear that everyone thought it was.

I also found this proposal to use strict mbcs to decode bytes for use against the file system, which is basically the same as what I'm proposing now apart from the more limited encoding: https://mail.python.org/pipermail/python-dev/2011-October/114203.html

It definitely results in less C code to maintain if we do the decode ourselves. We could use strict mbcs, but I'd leave the deprecation warnings in there. Or perhaps we provide an env var to use mbcs as the file system encoding but default to utf8 (I already have one for selecting legacy console encoding)? Callers should be asking the sys module for the encoding anyway, so I'd expect few libraries to be impacted, though applications might prefer it.


Top-posted from my Windows Phone

From: Paul Moore
Sent: 8/16/2016 3:54
To: Nick Coghlan
Cc: python-ideas

Subject: Re: [Python-ideas] Fix default encodings on Windows

Paul Moore

unread,
Aug 16, 2016, 10:01:14 AM8/16/16
to eryk sun, python-ideas
On 16 August 2016 at 14:09, eryk sun <ery...@gmail.com> wrote:
> On Tue, Aug 16, 2016 at 10:53 AM, Paul Moore <p.f....@gmail.com> wrote:
>>
>> Having said all this, I can't find the documentation stating that
>> bytes paths are deprecated - the open() documentation for 3.5 says
>> "file is either a string or bytes object giving the pathname (absolute
>> or relative to the current working directory) of the file to be opened
>> or an integer file descriptor of the file to be wrapped" and there's
>> no mention of a deprecation.
>
> Bytes paths aren't deprecated on Unix -- only on Windows, and only for
> the os functions. You can see the deprecation warning with -Wall:
>
> >>> os.listdir(b'.')
> __main__:1: DeprecationWarning: The Windows bytes API has been
> deprecated, use Unicode filenames instead

Thanks. So this remains a Windows-only issue (which is good).

> AFAIK this isn't documented.

It probably should be. Although if we're changing the deprecation to a
behaviour change, then maybe there's no point. But some of the
arguments here about breaking code are hinging on the idea that people
currently using the bytes API are using an (on the way to being)
unsupported feature and it's not really acceptable to take that
position if the deprecation wasn't announced. If the objections being
raised here (in the context of Japanese encodings and similar) would
apply equally to the bytes API being removed, then it seems to me that
we have a failure in our deprecation process, as those objections
should have been addressed when we started the deprecation.

Alternatively, if the deprecation of the os functions is OK, but it's
the deprecation of open (and presumably io.open) that's the issue,
then the whole process is somewhat problematic - it seems daft in the
long term to deprecate bytes paths in os functions like os.open and
yet allow them in the supposedly higher level io.open and the open
builtin. (And in the short term, it's illogical to me that the
deprecation isn't for open as well as the os functions).

I don't have a view on whether the cost to Japanese users is
sufficiently high that we should continue along the deprecation path
(or even divert to an enforced-UTF8 approach that's just as
problematic for them). But maybe it's worth a separate thread,
specifically focused on the use of bytes paths, rather than being
lumped in with other Windows encoding issues?

Paul

Random832

unread,
Aug 16, 2016, 11:00:50 AM8/16/16
to python...@python.org
On Tue, Aug 16, 2016, at 09:59, Paul Moore wrote:
> It probably should be. Although if we're changing the deprecation to a
> behaviour change, then maybe there's no point. But some of the
> arguments here about breaking code are hinging on the idea that people
> currently using the bytes API are using an (on the way to being)
> unsupported feature and it's not really acceptable to take that
> position if the deprecation wasn't announced. If the objections being
> raised here (in the context of Japanese encodings and similar) would
> apply equally to the bytes API being removed,

There also seems to be an undercurrent in the discussions we're having
now that using bytes paths and not unicode paths is somehow The Right
Thing for unix-like OSes, and that breaking it (in whatever way) on
windows causes code that Does The Right Thing on unix to require extra
work to port to windows. That's seemingly both the rationale for the
proposal itself and for the objections.

Chris Barker - NOAA Federal

unread,
Aug 16, 2016, 11:33:50 AM8/16/16
to Random832, python...@python.org
> There also seems to be an undercurrent in the discussions we're having
> now that using bytes paths and not unicode paths is somehow The Right
> Thing for unix-like OSes,

Almost -- from my perusing of discussions from the last few years,
there do seem to be some library developers and *nix aficionados
who DO think it's The Right Thing -- after all, a char* has always
worked, yes? But these folks also seem to think that a *nix system
with no way of knowing what encoding the names in the file system
are in (and which could have more than one) is not "broken" in any way.

A note about "utf-8 everywhere": while maybe a good idea, it's my
understanding that *nix developers absolutely do not want utf-8 to be
assumed in the Python APIs. Rather, this is all about punting the
handling of encodings down to the application level, rather than the
OS and Library level. Which is more backward compatible, but otherwise
a horrible idea. And very much in conflict with Python 3's approach.

So it seems odd to assume utf-8 on Windows, where it is less ubiquitous.

Back to "The Right Thing" -- it's clear to me that everyone supporting
this proposal is very much doing so because it's "The Pragmatic Thing".

But it seems folks porting from py2 need to explicitly convert the
calls from str to bytes anyway to get the bytes behavior. With
surrogate escapes, now you need to do nothing. So we're really
supporting code that was ported to py3 earlier in the game - but it
seems a bad idea to cement that hacky solution in place.

And if the filenames in question are coming from a byte stream
somehow, rather than file system API calls, then you really do need to
know the encoding -- yes really! If a developer wants to assume utf-8,
that's fine, but the developer should be making that decision, not
Python itself. And not on Windows only.

-CHB

Guido van Rossum

unread,
Aug 16, 2016, 11:37:35 AM8/16/16
to Chris Barker, Python-Ideas

I am going to mute this thread but I am worried about the outcome. Once there is agreement please check with me first.

--Guido (mobile)

Steve Dower

unread,
Aug 16, 2016, 11:57:50 AM8/16/16
to python...@python.org
I just want to clearly address two points, since I feel like multiple
posts have been unclear on them.

1. The bytes API was deprecated in 3.3 and it is listed in
https://docs.python.org/3/whatsnew/3.3.html. Lack of mention in the docs
is an unfortunate oversight, but it was certainly announced and the
warning has been there for three released versions. We can freely change
or remove the support now, IMHO.

2. Windows file system encoding is *always* UTF-16. There's no "assuming
mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what
encoding it is". We know exactly what the encoding is on every supported
version of Windows. UTF-16.

This discussion is for the developers who insist on using bytes for
paths within Python, and the question is, "how do we best represent
UTF-16 encoded paths in bytes?"

The choices are:

* don't represent them at all (remove bytes API)
* convert and drop characters not in the (legacy) active code page
* convert and fail on characters not in the (legacy) active code page
* convert and fail on invalid surrogate pairs
* represent them as UTF-16-LE in bytes (with embedded '\0' everywhere)

Currently we have the second option.

My preference is the fourth option, as it will cause the least breakage
of existing code and enable the most amount of code to just work in the
presence of non-ACP characters.

The fifth option is the best for round-tripping within Windows APIs.

The only code that will break with any change is code that was using an
already deprecated API. Code that correctly uses str to represent
"encoding agnostic text" is unaffected.

If you see an alternative choice to those listed above, feel free to
contribute it. Otherwise, can we focus the discussion on these (or any
new) choices?

Cheers,
Steve

Chris Barker

unread,
Aug 16, 2016, 12:08:18 PM8/16/16
to Random832, python...@python.org
Just to make sure this is clear, the Pragmatic logic is thus:

* There are more *nix-centric developers in the Python ecosystem than Windows-centric (or even Windows-agnostic) developers.

* The bytes path approach works fine on *nix systems.

* Whatever might be Right and Just -- the reality is that a number of projects, including important and widely used libraries and frameworks, use the bytes API for working with filenames and paths, etc.

Therefore, there is a lot of code that does not work right on Windows.

Currently, to get it to work right on Windows, you need to write Windows specific code, which many folks don't want or know how to do (or just can't support one way or the other).

So the Solution is to either:

 (A) get everyone to use Unicode  "properly", which will work on all platforms (but only on py3.5 and above?)

or

(B) kludge some *nix-compatible support for byte paths into Windows, that will work at least much of the time.

It's clear (to me at least) that (A) is the "Right Thing", but real world experience has shown that it's unlikely to happen any time soon.

Practicality beats Purity and all that -- this is a judgment call.

Have I got that right?

-CHB


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris....@noaa.gov

Brendan Barnwell

unread,
Aug 16, 2016, 12:09:23 PM8/16/16
to python...@python.org
On 2016-08-16 08:56, Steve Dower wrote:
> I just want to clearly address two points, since I feel like multiple
> posts have been unclear on them.
>
> 1. The bytes API was deprecated in 3.3 and it is listed in
> https://docs.python.org/3/whatsnew/3.3.html. Lack of mention in the docs
> is an unfortunate oversight, but it was certainly announced and the
> warning has been there for three released versions. We can freely change
> or remove the support now, IMHO.

I strongly disagree with that. If using the code does not raise a
visible warning (because DeprecationWarning is silent by default), and
the documentation does not say it's deprecated, it hasn't actually been
deprecated. Deprecation is the communicative act of saying "don't do
this anymore". If that information is not communicated in the
appropriate places (e.g., the docs), the deprecation has not occurred.

--
Brendan Barnwell
"Do not follow where the path may lead. Go, instead, where there is no
path, and leave a trail."
--author unknown

Chris Barker

unread,
Aug 16, 2016, 12:14:08 PM8/16/16
to Steve Dower, Python-Ideas
Thanks for the clarity, Steve, a couple questions/thoughts:

> The choices are:
>
> * don't represent them at all (remove bytes API)

Would the bytes API be removed on *nix also?

> * convert and drop characters not in the (legacy) active code page
> * convert and fail on characters not in the (legacy) active code page

"Failure is not an option" -- These two seem like a plain old bad idea.

> * convert and fail on invalid surrogate pairs

where would an invalid surrogate pair come from? never from a file system API call, yes?

> * represent them as UTF-16-LE in bytes (with embedded '\0' everywhere)

would this be doing anything -- or just keeping whatever the Windows API takes/returns? i.e. exactly what is done on *nix?

> The fifth option is the best for round-tripping within Windows APIs.

How is it better? only performance (i.e. no encoding/decoding required) -- or would it be more reliable as well?
-CHB

Random832

unread,
Aug 16, 2016, 12:35:55 PM8/16/16
to python...@python.org
On Tue, Aug 16, 2016, at 12:12, Chris Barker wrote:
> * convert and fail on invalid surrogate pairs
>
> where would an invalid surrogate pair come from? never from a file system
> API call, yes?

In principle it could, if the filesystem contains a file with an invalid
surrogate pair. Nothing else, in general, prevents such a file from
being created, though it's not easy to do so by accident.

Sven R. Kunze

unread,
Aug 16, 2016, 1:15:37 PM8/16/16
to python...@python.org
On 16.08.2016 18:06, Chris Barker wrote:
> It's clear (to me at least) that (A) it the "Right Thing", but real
> world experience has shown that it's unlikely to happen any time soon.
>
> Practicality beats Purity and all that -- this is a judgment call.

Maybe, but even when it takes a lot of time to get it right, I always
prefer the "Right Thing".

My past experience taught me that everything always comes back to
the "Right Thing" eventually, at least in part, because it is *surprise* the "Right Thing" (TM).


Question is: are we in a hurry? Has somebody too little time to wait for
the "Right Thing" to happen?


Sven

Steve Dower

unread,
Aug 16, 2016, 1:45:28 PM8/16/16
to Sven R. Kunze, python...@python.org
On 16Aug2016 1006, Sven R. Kunze wrote:
> Question is: are we in a hurry? Has somebody too little time to wait for
> the "Right Thing" to happen?

Not really in a hurry. It's just that I decided to attack a few of the
encoding issues on Windows and this was one of them.

Ideally I'd want to get the change in for 3.6.0b1 so there's plenty of
testing time. But we've been waiting many years for this already so I
guess a few more won't hurt. The current situation of making Linux
developers write different path handling code for Windows vs Linux (or
just use str for paths) is painful for some, but not as bad as the other
issues I want to fix.

Cheers,
Steve

Sven R. Kunze

unread,
Aug 16, 2016, 2:09:32 PM8/16/16
to Steve Dower, python...@python.org
On 16.08.2016 19:44, Steve Dower wrote:
> On 16Aug2016 1006, Sven R. Kunze wrote:
>> Question is: are we in a hurry? Has somebody too little time to wait for
>> the "Right Thing" to happen?
>
> Not really in a hurry. It's just that I decided to attack a few of the
> encoding issues on Windows and this was one of them.
>
> Ideally I'd want to get the change in for 3.6.0b1 so there's plenty of
> testing time. But we've been waiting many years for this already so I
> guess a few more won't hurt. The current situation of making Linux
> developers write different path handling code for Windows vs Linux (or
> just use str for paths) is painful for some, but not as bad as the
> other issues I want to fix.
>

I assume one overall goal will be Windows and Linux programs handling
paths the same way which I personally find a very good idea.

And as long as such long-term goals are properly communicated, people
are educated the right way and official deprecation phases are in place,
everything is good, I guess. :)


Sven

Paul Moore

unread,
Aug 16, 2016, 2:21:29 PM8/16/16
to Steve Dower, Python-Ideas
On 16 August 2016 at 16:56, Steve Dower <steve...@python.org> wrote:
> I just want to clearly address two points, since I feel like multiple posts
> have been unclear on them.
>
> 1. The bytes API was deprecated in 3.3 and it is listed in
> https://docs.python.org/3/whatsnew/3.3.html. Lack of mention in the docs is
> an unfortunate oversight, but it was certainly announced and the warning has
> been there for three released versions. We can freely change or remove the
> support now, IMHO.

For clarity, the statement was:

"""
issue 13374: The Windows bytes API has been deprecated in the os
module. Use Unicode filenames, instead of bytes filenames, to not
depend on the ANSI code page anymore and to support any filename.
"""

First of all, note that I'm perfectly OK with deprecating bytes paths.
However, this statement specifically does *not* say anything about use
of bytes paths outside of the os module (builtin open and the io
module being the obvious places). Secondly, it appears that
unfortunately the main Python documentation wasn't updated to state
this.

So while "we can freely change or remove the support now" may be true,
it's not that simple - the debate here is at least in part about
builtin open, and there's nothing anywhere that I can see that states
that bytes support in open has been deprecated. Maybe there should
have been, and maybe everyone involved at the time assumed that it
was, but that's water under the bridge.

> 2. Windows file system encoding is *always* UTF-16. There's no "assuming
> mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding
> it is". We know exactly what the encoding is on every supported version of
> Windows. UTF-16.
>
> This discussion is for the developers who insist on using bytes for paths
> within Python, and the question is, "how do we best represent UTF-16 encoded
> paths in bytes?"

People passing bytes to open() have in my view, already chosen not to
follow the standard advice of "decode incoming data at the boundaries
of your application". They may have good reasons for that, but it's
perfectly reasonable to expect them to take responsibility for
manually tracking the encoding of the resulting bytes values flowing
through their code. It is of course, also true that "works for me in
my environment" is a viable strategy - but the maintenance cost of
this strategy if things change (whether in Python, or in the
environment) is on the application developers - they are hoping that
cost is minimal, but that's a risk they choose to take.

> The choices are:
>
> * don't represent them at all (remove bytes API)
> * convert and drop characters not in the (legacy) active code page
> * convert and fail on characters not in the (legacy) active code page
> * convert and fail on invalid surrogate pairs
> * represent them as UTF-16-LE in bytes (with embedded '\0' everywhere)

Actually, with the exception of the last one (which seems "obviously
not sensible") these all feel more to me like answers to the question
"how do we best interpret bytes provided to us as UTF-16?". It's a
subtle point, but IMO important. It's much easier to answer the
question you posed, but what people are actually concerned about is
interpreting bytes, not representing Unicode. The correct answer to
"how do we interpret bytes" is "in the face of ambiguity, refuse to
guess" - but people using the bytes API have *already* bought into the
current heuristic for guessing, so changing affects them.

> Currently we have the second option.
>
> My preference is the fourth option, as it will cause the least breakage of
> existing code and enable the most amount of code to just work in the
> presence of non-ACP characters.

It changes the encoding used to interpret bytes. While it preserves
more information in the "UTF-16 to bytes" direction, nobody really
cares about that direction. And in the "bytes to UTF-16" direction, it
changes the interpretation of basically all non-ASCII bytes. That's a
lot of breakage. Although as already noted, it's only breaking things
that currently work while relying on a (maybe) undocumented API (byte
paths to builtin open isn't actually documented) and on an arguably
bad default that nevertheless works for them.

> The fifth option is the best for round-tripping within Windows APIs.
>
> The only code that will break with any change is code that was using an
> already deprecated API. Code that correctly uses str to represent "encoding
> agnostic text" is unaffected.

Code using Unicode is unaffected, certainly. Ideally that means that
only a tiny minority of users should be affected. Are we over-reacting
to reports of standard practices in Japan? I've no idea.

> If you see an alternative choice to those listed above, feel free to
> contribute it. Otherwise, can we focus the discussion on these (or any new)
> choices?

Accept that we should have deprecated builtin open and the io module,
but didn't do so. Extend the existing deprecation of bytes paths on
Windows, to cover *all* APIs, not just the os module, But modify the
deprecation to be "use of the Windows CP_ACP code page (via the ...A
Win32 APIs) is deprecated and will be replaced with use of UTF-8 as
the implied encoding for all bytes paths on Windows starting in Python
3.7". Document and publicise it much more prominently, as it is a
breaking change. Then leave it one release for people to prepare for
the change.

Oh, and (obviously) check back with Guido on his view - he's expressed
concern, but I for one don't have the slightest idea in this case what
his preference would be...

Paul

Victor Stinner

unread,
Aug 16, 2016, 7:05:12 PM8/16/16
to Steve Dower, python-ideas
2016-08-16 17:56 GMT+02:00 Steve Dower <steve...@python.org>:
> 2. Windows file system encoding is *always* UTF-16. There's no "assuming
> mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding
> it is". We know exactly what the encoding is on every supported version of
> Windows. UTF-16.

I think that you missed an important issue (or "use case") which is
called the "Makefile problem" by Mercurial developers:
https://www.mercurial-scm.org/wiki/EncodingStrategy#The_.22makefile_problem.22

I already explained it before, but maybe you misunderstood or just
missed it, so here is a more concrete example.

A runner.py script produces a bytes filename and sends it to a second
read_file.py script through stdin/stdout. The read_file.py script
opens the file using open(filename). The read_file.py script is run by
Python 2 which works naturally on bytes. The question is how the
runner.py produces (encodes) the filename.

runner.py (script run by Python 3.7):
---
import os, sys, subprocess, tempfile

filename = 'h\xe9.txt'
content = b'foo bar'
print("filename unicode: %a" % filename)

root = os.path.realpath(os.path.dirname(__file__))
script = os.path.join(root, 'read_file.py')

old_cwd = os.getcwd()

with tempfile.TemporaryDirectory() as tmpdir:
    os.chdir(tmpdir)
    with open(filename, 'wb') as fp:
        fp.write(content)

    filenameb = os.listdir(b'.')[0]
    # Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
    # what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
    print("filename bytes: %a" % filenameb)

    proc = subprocess.Popen(['py', '-2', script],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    stdout = proc.communicate(filenameb)[0]
    print("File content: %a" % stdout)

    # restore the old cwd before the temporary directory is cleaned up
    os.chdir(old_cwd)
---

read_file.py (run by Python 2):
---
import sys
filename = sys.stdin.read()
# Python 2 calls the Windows C open() function
# which expects a filename encoded to the ANSI code page
with open(filename) as fp:
    content = fp.read()
sys.stdout.write(content)
sys.stdout.flush()
---

read_file.py only works if the non-ASCII filename is encoded to the
ANSI code page.

The question is how you expect developers to handle such a use case.

For example, are developers responsible for transcoding communicate()
data (inputs and outputs) manually?

That's why I keep repeating that the ANSI code page is the best *default*
encoding because it is the encoding expected by other applications.

I know that the ANSI code page is usually limited and causes various
painful issues when handling non-ASCII data, but it's the status quo
if you really want to handle data as bytes...

Sorry, I didn't read all emails of this long thread, so maybe I missed
your answer to this issue.

Victor

Steve Dower

unread,
Aug 16, 2016, 7:28:47 PM8/16/16
to Victor Stinner, python-ideas
On 16Aug2016 1603, Victor Stinner wrote:
> 2016-08-16 17:56 GMT+02:00 Steve Dower <steve...@python.org>:
>> 2. Windows file system encoding is *always* UTF-16. There's no "assuming
>> mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding
>> it is". We know exactly what the encoding is on every supported version of
>> Windows. UTF-16.
>
> I think that you missed an important issue (or "use case") which is
> called the "Makefile problem" by Mercurial developers:
> https://www.mercurial-scm.org/wiki/EncodingStrategy#The_.22makefile_problem.22
>
> I already explained it before, but maybe you misunderstood or just
> missed it, so here is a more concrete example.

I guess I misunderstood. The concrete example really helps, thank you.

The problem here is that there is an application boundary without a
defined encoding, right where you put the comment.

> filenameb = os.listdir(b'.')[0]
> # Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
> # what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
> print("filename bytes: %a" % filenameb)
>
> proc = subprocess.Popen(['py', '-2', script],
> stdin=subprocess.PIPE, stdout=subprocess.PIPE)
> stdout = proc.communicate(filenameb)[0]
> print("File content: %a" % stdout)

If you are defining the encoding as 'mbcs', then you need to check that
sys.getfilesystemencoding() == 'mbcs', and if it doesn't then reencode.

Alternatively, since this script is the "new" code, you would use
`os.listdir('.')[0].encode('mbcs')`, given that you have explicitly
determined that mbcs is the encoding for the later transfer.
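
Spelled out as a sketch (only an illustration of the fix being described,
with 'mbcs' chosen explicitly because that is what the Python 2 child
expects):

# In runner.py, make the encoding of the pipe explicit instead of relying
# on whatever os.listdir(bytes) happens to produce:
filenameb = os.listdir('.')[0].encode('mbcs')
proc = subprocess.Popen(['py', '-2', script],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
stdout = proc.communicate(filenameb)[0]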

Essentially, the problem is that this code is relying on a certain
non-guaranteed behaviour of a deprecated API, where using
sys.getfilesystemencoding() as documented would have prevented any issue
(see
https://docs.python.org/3/library/os.html#file-names-command-line-arguments-and-environment-variables).
In one of the emails I think you missed, I called this out as the only
case where code will break with a change to sys.getfilesystemencoding().

So yes, breaking existing code is something I would never do lightly.
However, I'm very much of the opinion that the only code that will break
is code that is already broken (or at least fragile) and that nobody is
forced to take a major upgrade to Python or should necessarily expect
100% compatibility between major versions.

Cheers,
Steve

Victor Stinner

unread,
Aug 16, 2016, 7:52:19 PM8/16/16
to Steve Dower, python-ideas
2016-08-17 1:27 GMT+02:00 Steve Dower <steve...@python.org>:
>> filenameb = os.listdir(b'.')[0]
>> # Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
>> # what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
>> print("filename bytes: %a" % filenameb)
>>
>> proc = subprocess.Popen(['py', '-2', script],
>> stdin=subprocess.PIPE, stdout=subprocess.PIPE)
>> stdout = proc.communicate(filenameb)[0]
>> print("File content: %a" % stdout)
>
>
> If you are defining the encoding as 'mbcs', then you need to check that
> sys.getfilesystemencoding() == 'mbcs', and if it doesn't then reencode.

Sorry, I don't understand. What do you mean by "defining an encoding"?
It's not possible to modify sys.getfilesystemencoding() in Python.
What does "reencode" mean? I'm lost.


> Alternatively, since this script is the "new" code, you would use
> `os.listdir('.')[0].encode('mbcs')`, given that you have explicitly
> determined that mbcs is the encoding for the later transfer.

My example is not new code. It is a very simplified script to explain
the issue that can occur in a large code base which *currently* works
well on Python 2 and Python 3 in the common case (only handling data
encodable to the ANSI code page).


> Essentially, the problem is that this code is relying on a certain
> non-guaranteed behaviour of a deprecated API, where using
> sys.getfilesystemencoding() as documented would have prevented any issue
> (see
> https://docs.python.org/3/library/os.html#file-names-command-line-arguments-and-environment-variables).

sys.getfilesystemencoding() is used in applications which store data
as Unicode, but we are talking about applications storing data as
bytes, no?


> So yes, breaking existing code is something I would never do lightly.
> However, I'm very much of the opinion that the only code that will break is
> code that is already broken (or at least fragile) and that nobody is forced
> to take a major upgrade to Python or should necessarily expect 100%
> compatibility between major versions.

Well, it's somehow the same issue that we had in Python 2:
applications work in most cases, but start to fail with non-ASCII
characters, or maybe only in some cases.

In this case, the ANSI code page is fine if all data can be encoded to
the ANSI code page. You start to run into trouble when you start to use
characters not encodable to your ANSI code page. Last time I checked,
Microsoft Visual Studio behaved badly (has bugs) with such filenames.
It's the same for many applications. So it's not like Windows
applications already handle this case very well. So let me call it a
corner case.

I'm not sure that it's worth it to explicitly break the Python
backward compatibility on Windows for such a corner case, especially
because it's already possible to fix applications by starting to use
Unicode everywhere (which would likely fix more issues than expected
as a side effect).

It's still unclear to me if it's simpler to modify an application
using bytes to start using Unicode (for filenames), or if your
proposition requires less changes.

My main concern is the "makefile issue" which requires more complex
code to transcode data between UTF-8 and ANSI code page. To me, it's
like we are going back to Python 2 where no data had known encoding
and mojibake was the default. If you manipulate strings in two
encodings, it's likely to make mistakes and concatenate two strings
encoded to two different encodings (=> mojibake).

Victor

Steve Dower

unread,
Aug 16, 2016, 8:15:07 PM8/16/16
to Victor Stinner, python-ideas
On 16Aug2016 1650, Victor Stinner wrote:
> 2016-08-17 1:27 GMT+02:00 Steve Dower <steve...@python.org>:
>>> filenameb = os.listdir(b'.')[0]
>>> # Python 3.5 encodes Unicode (UTF-16) to the ANSI code page
>>> # what if Python 3.7 encodes Unicode (UTF-16) to UTF-8?
>>> print("filename bytes: %a" % filenameb)
>>>
>>> proc = subprocess.Popen(['py', '-2', script],
>>> stdin=subprocess.PIPE, stdout=subprocess.PIPE)
>>> stdout = proc.communicate(filenameb)[0]
>>> print("File content: %a" % stdout)
>>
>>
>> If you are defining the encoding as 'mbcs', then you need to check that
>> sys.getfilesystemencoding() == 'mbcs', and if it doesn't then reencode.
>
> Sorry, I don't understand. What do you mean by "defining an encoding"?
> It's not possible to modify sys.getfilesystemencoding() in Python.
> What does "reencode" mean? I'm lost.

You are transferring text between two applications without specifying
what the encoding is. sys.getfilesystemencoding() does not apply to
proc.communicate() - you can use your choice of encoding for
communicating between two processes.

>> Alternatively, since this script is the "new" code, you would use
>> `os.listdir('.')[0].encode('mbcs')`, given that you have explicitly
>> determined that mbcs is the encoding for the later transfer.
>
> My example is not new code. It is a very simplified script to explain
> the issue that can occur in a large code base which *currently* works
> well on Python 2 and Python 3 in the common case (only handling data
> encodable to the ANSI code page).

If you are planning to run it with Python 3.6, then I'd argue it's "new"
code. When you don't want anything to change, you certainly don't change
the major version of your runtime.

>> Essentially, the problem is that this code is relying on a certain
>> non-guaranteed behaviour of a deprecated API, where using
>> sys.getfilesystemencoding() as documented would have prevented any issue
>> (see
>> https://docs.python.org/3/library/os.html#file-names-command-line-arguments-and-environment-variables).
>
> sys.getfilesystemencoding() is used in applications which store data
> as Unicode, but we are talking about applications storing data as
> bytes, no?

No, we're talking about how Python code communicates with the file
system. Applications can store their data however they like, but when
they pass it to a filesystem function they need to pass it as str or
bytes encoded with sys.getfilesystemencoding() (this has always been
the case).

>> So yes, breaking existing code is something I would never do lightly.
>> However, I'm very much of the opinion that the only code that will break is
>> code that is already broken (or at least fragile) and that nobody is forced
>> to take a major upgrade to Python or should necessarily expect 100%
>> compatibility between major versions.
>
> Well, it's somehow the same issue that we had in Python 2:
> applications work in most cases, but start to fail with non-ASCII
> characters, or maybe only in some cases.
>
> In this case, the ANSI code page is fine if all data can be encoded to
> the ANSI code page. You start to run into trouble when you start to use
> characters not encodable to your ANSI code page. Last time I checked,
> Microsoft Visual Studio behaved badly (has bugs) with such filenames.
> It's the same for many applications. So it's not like Windows
> applications already handle this case very well. So let me call it a
> corner case.

The existence of bugs in other applications is not a good reason to help
people create new bugs.

> I'm not sure that it's worth it to explicitly break the Python
> backward compatibility on Windows for such a corner case, especially
> because it's already possible to fix applications by starting to use
> Unicode everywhere (which would likely fix more issues than expected
> as a side effect).
>
> It's still unclear to me if it's simpler to modify an application
> using bytes to start using Unicode (for filenames), or if your
> proposition requires less changes.

My proposition requires fewer changes *when you target multiple platforms
and would prefer to use bytes*. It allows the below code to be written
as either branch without losing the ability to round-trip whatever
filename happens to be returned:

if os.name == 'nt':
    f = open(os.listdir('.')[-1])
else:
    f = open(os.listdir(b'.')[-1])

If you choose just the first branch (use str for paths), then you do get
a better result. However, we have been telling people to do that since
3.0 (and made it easier in 3.2 IIRC) and it's now 3.5 and they are still
complaining about not getting to use bytes for paths. So rather than
have people say "Windows support is too hard", this change enables the
second branch to be used on all platforms.

> My main concern is the "makefile issue" which requires more complex
> code to transcode data between UTF-8 and ANSI code page. To me, it's
> like we are going back to Python 2 where no data had known encoding
> and mojibake was the default. If you manipulate strings in two
> encodings, it's likely to make mistakes and concatenate two strings
> encoded to two different encodings (=> mojibake).

Your makefile example is going back to Python 2, as it has no known
encoding. If you want to associate an encoding with bytes, you decode it
to text or you explicitly specify what the encoding should be. Your own
example makes assumptions about what encoding the bytes have, which is
why it has a bug.

Cheers,
Steve

Brendan Barnwell

unread,
Aug 16, 2016, 10:16:14 PM8/16/16
to python...@python.org
On 2016-08-16 17:14, Steve Dower wrote:
> The existence of bugs in other applications is not a good reason to help
> people create new bugs.

I haven't been following all the details in this thread, but isn't the
whole purpose of this proposed change to accommodate code (apparently on
Linux?) that is buggy in that it assumes it can use bytes for paths
without knowing the encoding? It seems like from one perspective
allowing bytes in paths is just helping to accommodate a certain very
widespread class of bugs.

--
Brendan Barnwell
"Do not follow where the path may lead. Go, instead, where there is no
path, and leave a trail."
--author unknown

Steve Dower

unread,
Aug 16, 2016, 11:45:47 PM8/16/16
to Brendan Barnwell, python...@python.org
On 16Aug2016 1915, Brendan Barnwell wrote:
> On 2016-08-16 17:14, Steve Dower wrote:
>> The existence of bugs in other applications is not a good reason to help
>> people create new bugs.
>
> I haven't been following all the details in this thread, but isn't
> the whole purpose of this proposed change to accommodate code
> (apparently on Linux?) that is buggy in that it assumes it can use bytes
> for paths without knowing the encoding? It seems like from one
> perspective allowing bytes in paths is just helping to accommodate a
> certain very widespread class of bugs.

Using bytes on Linux (in Python) is incorrect but works reliably, while
using bytes on Windows is incorrect and unreliable. This change makes it
incorrect and reliable on both platforms.

I said at the start the correct alternative would be to actually force
all developers to use str for paths everywhere. That seems infeasible,
so I'm trying to at least improve the situation for Windows users who
are running code written by Linux developers. Hence there are tradeoffs,
rather than perfection.

(Also, you took my quote out of context - it was referring to the fact
that non-Python developers sometimes fail to get path encoding correct
too. But your question was fair.)

Cheers,
Steve

Steve Dower

unread,
Aug 16, 2016, 11:52:33 PM8/16/16
to python...@python.org
I've just created http://bugs.python.org/issue27781 with a patch
removing use of the *A API from posixmodule.c and changing the default
FS encoding to utf-8.

Since we're still discussing whether the encoding should be utf-8 or
something else, let's keep that here. But if you want to see how the
changes would look, feel free to check out the patch and comment on the
issue.

When we reach some agreement here I'll try and summarize the points of
view on the issue so we have a record there.

Cheers,
Steve

eryk sun

unread,
Aug 17, 2016, 1:51:31 AM8/17/16
to python...@python.org
On Tue, Aug 16, 2016 at 3:56 PM, Steve Dower <steve...@python.org> wrote:
>
> 2. Windows file system encoding is *always* UTF-16. There's no "assuming
> mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what encoding
> it is". We know exactly what the encoding is on every supported version of
> Windows. UTF-16.

Internal filesystem details don't directly affect this issue, except
for how each filesystem handles invalid surrogates in names passed to
functions in the wide-character API. Some filesystems that are
available on Windows do reject a filename that has an invalid
surrogate, so I think any program that attempts to create such
malformed names is already broken.

For example, with NTFS I can create a file named
"\ud800b\ud800a\ud800d", but trying this in a VirtualBox shared folder
fails because the VBoxSF filesystem can't transcode the name to its
internal UTF-8 encoding. Thus I don't think supporting invalid
surrogates should be a deciding factor in favor of UTF-16, which I
think is an impractical choice. Bytes coming from files, databases,
and the network are likely to be either UTF-8 or some legacy encoding,
so the practical choice is between ANSI/OEM and UTF-8. The reliable
choice is UTF-8.
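
A quick illustration of why a filesystem that transcodes names to UTF-8
rejects such a name (using the name from the NTFS example above):

    name = "\ud800b\ud800a\ud800d"
    try:
        name.encode("utf-8")   # strict UTF-8: lone surrogates fail
    except UnicodeEncodeError:
        print("not encodable as valid UTF-8")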

Using UTF-8 for bytes paths can be adopted at first in 3.6 as an
option that gets enabled via an environment variable. If it's not
enabled or explicitly disabled, show a visible warning (i.e. not
requiring -Wall) that legacy bytes paths are deprecated. In 3.7 UTF-8
can become the default, but the same environment variable should allow
opting out to use the legacy encoding. The infrastructure put in place
to support this should be able to work either way.
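
As a sketch only (the variable name is made up and nothing has been
decided), the opt-in/opt-out could look something like this at startup:

    import os

    # Hypothetical switch: UTF-8 bytes paths are opt-in for 3.6 and
    # become the default in 3.7, with the same variable used to opt out.
    if os.environ.get("PYTHONLEGACYFSENCODING"):
        fs_encoding = "mbcs"    # legacy ANSI code page behaviour
    else:
        fs_encoding = "utf-8"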

Victor, I haven't checked Steve's patch yet in issue 27781, but making
this change should largely simplify the Windows support code in many
cases, as the bytes path conversion can be centralized, and relatively
few functions return strings that need to be encoded back as bytes.
posixmodule.c will no longer need separate code paths that call *A
functions, e.g.:

CreateFileA, CreateDirectoryA, CreateHardLinkA, CreateSymbolicLinkA,
DeleteFileA, RemoveDirectoryA, FindFirstFileA, MoveFileExA,
GetFileAttributesA, GetFileAttributesExA, SetFileAttributesA,
GetCurrentDirectoryA, SetCurrentDirectoryA, SetEnvironmentVariableA,
ShellExecuteA
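
A rough Python-level analogue of what that centralized conversion could
look like (the real change is in C, and the error handler shown is only
one of the options being discussed):

    def convert_path(path):
        # Hypothetical helper: decode bytes paths once, up front, and
        # call only the wide-character (*W) APIs afterwards.
        if isinstance(path, bytes):
            return path.decode("utf-8", "surrogatepass")
        return path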

Stephen J. Turnbull

unread,
Aug 17, 2016, 5:36:27 AM8/17/16
to Paul Moore, Python-Ideas
Paul Moore writes:
> On 16 August 2016 at 16:56, Steve Dower <steve...@python.org> wrote:

> > This discussion is for the developers who insist on using bytes
> > for paths within Python, and the question is, "how do we best
> > represent UTF-16 encoded paths in bytes?"

That's incomplete, AFAICS. (Paul makes this point somewhat
differently.) We don't want to represent paths in bytes on Windows if
we can avoid it. Nor does UTF-16 really enter into it (except for the
technical issue of invalid surrogate pairs). So a full statement is,
"How do we best represent Windows file system paths in bytes for
interoperability with systems that natively represent paths in bytes?"
("Other systems" refers to both other platforms and existing programs
on Windows.)

BTW, why "surrogate pairs"? Does Windows validate surrogates to
ensure they come in pairs, but not necessarily in the right order (or
perhaps sometimes they resolve to non-characters such as U+1FFFF)?

Paul says:

> People passing bytes to open() have in my view, already chosen not
> to follow the standard advice of "decode incoming data at the
> boundaries of your application". They may have good reasons for
> that, but it's perfectly reasonable to expect them to take
> responsibility for manually tracking the encoding of the resulting
> bytes values flowing through their code.

Abstractly true, but in practice there's no such need for those who
made the choice! In a properly set up POSIX locale[1], it Just Works by
design, especially if you use UTF-8 as the preferred encoding. It's
Windows developers and users who suffer, not those who wrote the code,
nor their primary audience which uses POSIX platforms.

> It is of course, also true that "works for me in my environment" is
> a viable strategy - but the maintenance cost of this strategy if
> things change (whether in Python, or in the environment) is on the
> application developers - they are hoping that cost is minimal, but
> that's a risk they choose to take.

Nick's point is that the risk is on Windows users and developers for
the Windows platform who did *not* make that choice, but rather had it
made for them by developers on a different platform where it Just
Works. He argues that we should level the playing field.

It's also relevant that those developers on the originating platform
for the code typically resist complexifying changes to make things
work on other platforms too (cf. Victor's advocacy of removing the
bytes APIs on Windows). Victor's points are good IMO; he's not just
resisting Windows, there are real resource consequences.

> Code using Unicode is unaffected, certainly. Ideally that means that
> only a tiny minority of users should be affected. Are we over-reacting
> to reports of standard practices in Japan? I've no idea.

AFAIK, India and Southeast Asia have already abandoned their
indigenous standards in favor of Unicode/UTF-8, so it doesn't matter
if they use str or bytes, either way Steve's proposal will Just Work.
I don't know anything about Arabic, Hebrew, Cyrillic, and Eastern
Europeans. That leaves China, which is like Japan in having had a
practically universal encoding (ie, every script you'll actually see
roundtrips, emoji being the only practical issue) since the 1970s. So
I suspect Chinese also primarily use their local code page (GB2312 or
GB18030) for plain text documents, possibly including .ini and
Makefiles.

Over-reaction? I have no idea either. Just a potentially widespread
risk, both to users and to Python's reputation for maintaining
compatibility. (I don't think it's "fair", but among my acquaintances
Python has a poor rep -- Steve's argument that if you develop code for
3.5 you should expect to have to modify it to use it with 3.6 cuts no
ice with them.)

> > If you see an alternative choice to those listed above, feel free
> > to contribute it. Otherwise, can we focus the discussion on these
> > (or any new) choices?
>
> Accept that we should have deprecated builtin open and the io module,
> but didn't do so. Extend the existing deprecation of bytes paths on
> Windows to cover *all* APIs, not just the os module, but modify the
> deprecation to be "use of the Windows CP_ACP code page (via the ...A
> Win32 APIs) is deprecated and will be replaced with use of UTF-8 as
> the implied encoding for all bytes paths on Windows starting in Python
> 3.7". Document and publicise it much more prominently, as it is a
> breaking change. Then leave it one release for people to prepare for
> the change.

I like this one! If my paranoid fears are realized, in practice it
might have to wait two releases, but at least this announcement should
get people who are at risk to speak up. If they don't, then you can
just call me "Chicken Little" and go ahead!


Footnotes:
[1] An oxymoron, but there you go.

eryk sun

unread,
Aug 17, 2016, 9:39:03 AM8/17/16
to Python-Ideas
On Wed, Aug 17, 2016 at 9:35 AM, Stephen J. Turnbull
<turnbull....@u.tsukuba.ac.jp> wrote:
> BTW, why "surrogate pairs"? Does Windows validate surrogates to
> ensure they come in pairs, but not necessarily in the right order (or
> perhaps sometimes they resolve to non-characters such as U+1FFFF)?

A program can pass the filesystem a name containing one or more
surrogate codes that aren't part of a valid UTF-16 surrogate pair (i.e. a
leading code in the range D800-DBFF followed by a trailing code in the
range DC00-DFFF). In the user-mode runtime library and kernel
executive, nothing up to the filesystem driver checks for a valid
UTF-16 string. Microsoft's filesystems remain compatible with UCS-2
from the 90s and don't care that the name isn't legal UTF-16. The same
goes for the in-memory filesystems used for named pipes (NPFS,
\\.\pipe) and mailslots (MSFS, \\.\mailslot). But non-Microsoft
filesystems don't necessarily store names as wide-character strings.
They may use UTF-8, in which case an invalid UTF-16 name will cause
the system call to fail because it's an invalid parameter.

If the filesystem allows creating such a badly named file or
directory, it can still be accessed using a regular unicode path,
which is how things stand currently. I see that Victor has suggested
using "surrogatepass" in issue 27781. That would allow seamless
operation. The downside is that bytes have a higher chance of leaking
out of Python than strings created by 'surrogateescape' on Unix. But
since it isn't a proper Unicode string on disk, at least nothing has
changed substantively by transcoding to "surrogatepass" UTF-8.
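
A small check of that round-trip (a made-up name containing a lone
surrogate):

    name = "\ud800b"
    raw = name.encode("utf-8", "surrogatepass")
    assert raw.decode("utf-8", "surrogatepass") == name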

Steve Dower

unread,
Aug 17, 2016, 11:34:19 AM8/17/16
to Stephen J. Turnbull, Paul Moore, Python-Ideas
On 17Aug2016 0235, Stephen J. Turnbull wrote:
> Paul Moore writes:
> > On 16 August 2016 at 16:56, Steve Dower <steve...@python.org> wrote:
>
> > > This discussion is for the developers who insist on using bytes
> > > for paths within Python, and the question is, "how do we best
> > > represent UTF-16 encoded paths in bytes?"
>
> That's incomplete, AFAICS. (Paul makes this point somewhat
> differently.) We don't want to represent paths in bytes on Windows if
> we can avoid it. Nor does UTF-16 really enter into it (except for the
> technical issue of invalid surrogate pairs). So a full statement is,
> "How do we best represent Windows file system paths in bytes for
> interoperability with systems that natively represent paths in bytes?"
> ("Other systems" refers to both other platforms and existing programs
> on Windows.)

That's incorrect, or at least easy to interpret as meaning the wrong
thing. The goal is "code compatibility with systems ...", not
interoperability.

Nothing about this will make it easier to take a path from Windows and
use it on Linux or vice versa, but it will make it easier/more reliable
to take code that uses paths on Linux and use it on Windows.

> BTW, why "surrogate pairs"? Does Windows validate surrogates to
> ensure they come in pairs, but not necessarily in the right order (or
> perhaps sometimes they resolve to non-characters such as U+1FFFF)?

Eryk answered this better than I would have.

> Paul says:
>
> > People passing bytes to open() have in my view, already chosen not
> > to follow the standard advice of "decode incoming data at the
> > boundaries of your application". They may have good reasons for
> > that, but it's perfectly reasonable to expect them to take
> > responsibility for manually tracking the encoding of the resulting
> > bytes values flowing through their code.
>
> Abstractly true, but in practice there's no such need for those who
> made the choice! In a properly set up POSIX locale[1], it Just Works by
> design, especially if you use UTF-8 as the preferred encoding. It's
> Windows developers and users who suffer, not those who wrote the code,
> nor their primary audience which uses POSIX platforms.

You mentioned "locale", "preferred" and "encoding" in the same sentence,
so I hope you're not thinking of locale.getpreferredencoding()? Changing
that function is orthogonal to this discussion, despite the fact that in
most cases it returns the same code page as the one used by the file
system functions (which, in most cases, is also the encoding returned
from sys.getfilesystemencoding()).
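
To make the distinction concrete, these are two separate functions, and
the values they report today are typically different (actual values vary
by system):

    import locale
    import sys

    print(locale.getpreferredencoding(False))   # e.g. 'cp1252' on Windows
    print(sys.getfilesystemencoding())          # currently 'mbcs' on Windows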

When Windows developers and users suffer, I see it as my responsibility
to reduce that suffering. Changing Python on Windows should do that
without affecting developers on Linux, even though the Right Way is to
change all the developers on Linux to use str for paths.

> > > If you see an alternative choice to those listed above, feel free
> > > to contribute it. Otherwise, can we focus the discussion on these
> > > (or any new) choices?
> >
> > Accept that we should have deprecated builtin open and the io module,
> > but didn't do so. Extend the existing deprecation of bytes paths on
> > Windows to cover *all* APIs, not just the os module, but modify the
> > deprecation to be "use of the Windows CP_ACP code page (via the ...A
> > Win32 APIs) is deprecated and will be replaced with use of UTF-8 as
> > the implied encoding for all bytes paths on Windows starting in Python
> > 3.7". Document and publicise it much more prominently, as it is a
> > breaking change. Then leave it one release for people to prepare for
> > the change.
>
> I like this one! If my paranoid fears are realized, in practice it
> might have to wait two releases, but at least this announcement should
> get people who are at risk to speak up. If they don't, then you can
> just call me "Chicken Little" and go ahead!

I don't think there's any reasonable way to noisily deprecate these
functions within Python, but certainly the docs can be made clearer.
People who explicitly encode with sys.getfilesystemencoding() should not
get the deprecation message, but we can't tell whether they got their
bytes from the right encoding or an RNG, so there's no way to discriminate.

I'm going to put together a summary post here (hopefully today) and get
those who have been contributing to basically sign off on it, then I'll
take it to python-dev. The possible outcomes I'll propose will basically
be "do we keep the status quo, undeprecate and change the functionality,
deprecate the deprecation and undeprecate/change in a couple releases,
or say that it wasn't a real deprecation so we can deprecate and then
change functionality in a couple releases".

Cheers,
Steve