Finally, the encodings of stdin, stdout and stderr are currently
(correctly) inferred from the encoding of the console window that Python
is attached to. However, this is typically a codepage that is different
from the system codepage (i.e. it's not mbcs) and is almost certainly
not Unicode. If users are starting Python from a console, they can use
"chcp 65001" first to switch to UTF-8, and then *most* functionality
works (input() has some issues, but those can be fixed with a slight
rewrite and possibly breaking readline hooks).
It is also possible for Python to change the current console encoding to
be UTF-8 on initialize and change it back on finalize. (This would leave
the console in an unexpected state if Python segfaults, but console
encoding is probably the least of anyone's worries at that point.) So
I'm proposing actively changing the current console to be Unicode while
Python is running, and hence sys.std[in|out|err] will default to utf-8.
So that's a broad range of changes, and I have little hope of figuring
out all the possible issues, back-compat risks, and flow-on effects on
my own. Please let me know (either on-list or off-list) how a change
like this would affect your projects, either positively or negatively,
and whether you have any specific experience with these changes/fixes
and think they should be approached differently.
To summarise the proposals (remembering that these would only affect
Python 3.6 on Windows):
[SNIP]
* force the console encoding to UTF-8 on initialize and revert on finalize
Using 'mbcs' doesn't work reliably with arbitrary bytes paths in
locales that use a DBCS codepage such as 932. If a sequence is
invalid, it gets passed to the filesystem as the default Unicode
character, so it won't successfully roundtrip. In the following
example b"\x81\xad", which isn't defined in CP932, gets mapped to the
codepage's default Unicode character, Katakana middle dot, which
encodes back as b"\x81E":
>>> locale.getpreferredencoding()
'cp932'
>>> open(b'\x81\xad', 'w').close()
>>> os.listdir('.')
['・']
>>> unicodedata.name(os.listdir('.')[0])
'KATAKANA MIDDLE DOT'
>>> '・'.encode('932')
b'\x81E'
This isn't a problem for single-byte codepages, since every byte value
uniquely maps to a Unicode code point, even if it's simply b'\x81' =>
u"\x81". Obviously there's still the general problem of dealing with
arbitrary Unicode filenames created by other programs, since the ANSI
API can only return a best-fit encoding of the filename, which is
useless for actually accessing the file.
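For example (a minimal check, assuming a system whose ANSI codepage is the
single-byte 1252), 'mbcs' round-trips even a byte value the codepage leaves
undefined:
>>> locale.getpreferredencoding()
'cp1252'
>>> b'\x81'.decode('mbcs')
'\x81'
>>> '\x81'.encode('mbcs')
b'\x81'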
>> It probably also entails opening the file descriptor in bytes mode,
>> which might break programs that pass the fd directly to CRT functions.
>> Personally I wish they wouldn't, but it's too late to stop them now.
>
> The only thing O_TEXT does rather than O_BINARY is convert CRLF line
> endings (and maybe end on ^Z), and I don't think we even expose the
> constants for the CRT's unicode modes.
Python 3 uses O_BINARY when opening files, unless you explicitly call
os.open. Specifically, FileIO.__init__ adds O_BINARY to the open flags
if the platform defines it.
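For reference, a rough Python-level sketch of that flag handling (the real
logic is in C; 'spam.txt' is just an example name):

import os
flags = os.O_WRONLY | os.O_CREAT
if hasattr(os, 'O_BINARY'):    # defined on Windows, absent on POSIX
    flags |= os.O_BINARY       # suppress the CRT's CRLF translation
fd = os.open('spam.txt', flags, 0o666)
os.close(fd)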
The Windows CRT reads the BOM for the Unicode modes O_WTEXT,
O_U16TEXT, and O_U8TEXT. For O_APPEND | O_WRONLY mode, this requires
opening the file twice, the first time with read access. See
configure_text_mode() in "Windows
Kits\10\Source\10.0.10586.0\ucrt\lowio\open.cpp".
Python doesn't expose or use these Unicode text-mode constants. That's
for the best because in Unicode mode the CRT invokes the invalid
parameter handler when a buffer doesn't have an even number of bytes,
i.e. a multiple of sizeof(wchar_t). Python could copy how
configure_text_mode() handles the BOM, except it shouldn't write a BOM
for new UTF-8 files.
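A rough sketch of that approach (detect_encoding is a hypothetical helper,
not an existing API): sniff the BOM of an existing file to choose the codec,
default to UTF-8 otherwise, and never add a BOM when writing:

import codecs

def detect_encoding(path, default='utf-8'):
    try:
        with open(path, 'rb') as f:
            head = f.read(3)
    except OSError:
        return default
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'              # skips the BOM on read
    if head[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return 'utf-16'                 # the utf-16 codec consumes the BOM
    return default

# usage: open(path, encoding=detect_encoding(path))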
My main reaction would be that if Drekin (Adam Bartoš) agrees the
changes natively solve the problems that
https://pypi.python.org/pypi/win_unicode_console works around, it's
probably a good idea.
The status quo is also sufficiently broken from both a native Windows
perspective and a cross-platform compatibility perspective that your
proposals are highly unlikely to make things *worse* :)
Cheers,
Nick.
--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia
Also, a reminder that Adam has a couple of proposals on the tracker
aimed at getting CPython to use a UTF-16-LE console on Windows:
http://bugs.python.org/issue22555#msg242943 (last two issue references
in that comment)
IMO, Python needs a C implementation of the win_unicode_console module, using the wide-character APIs ReadConsoleW and WriteConsoleW. Note that win_unicode_console sets sys.std*.encoding to 'utf-8' and transcodes, so Python code never has to work directly with UTF-16 encoded text.
If win_unicode_console gets added to the standard library, I think it
should provide at least a std*.buffer interface that transcodes
between UTF-16 and UTF-8 (with errors='replace'), to make this as much
of a drop-in replacement as possible. I know it's not required. For
example, IDLE doesn't implement this. But I'm also sure there's code
out there that uses stdout.buffer, including in the standard library.
It's mostly test code (not including cases for piping output from a
child process) and simple script interfaces, but if we don't have to
break people's code, we really shouldn't.
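To make the idea concrete, here's a minimal sketch of such a raw stream
(ConsoleRawWriter is a hypothetical name; this is not win_unicode_console's
actual code): it accepts UTF-8 bytes and hands UTF-16 text to WriteConsoleW:

import ctypes, io, msvcrt

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

class ConsoleRawWriter(io.RawIOBase):
    def __init__(self, fd):
        self._handle = msvcrt.get_osfhandle(fd)
    def writable(self):
        return True
    def write(self, data):
        # Transcode UTF-8 -> str and write wide characters to the console.
        text = bytes(data).decode('utf-8', 'replace')
        written = ctypes.c_uint(0)
        if not kernel32.WriteConsoleW(self._handle, text, len(text),
                                      ctypes.byref(written), None):
            raise ctypes.WinError(ctypes.get_last_error())
        # Report every input byte as consumed (ignores partial writes and
        # surrogate-pair length differences; good enough for a sketch).
        return len(data)

Wrapping it as io.TextIOWrapper(io.BufferedWriter(ConsoleRawWriter(1)),
encoding='utf-8', line_buffering=True) would give a stdout replacement that
still exposes .buffer and reports encoding 'utf-8'.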
On Fri Aug 12 11:33:35 EDT 2016, Random832 wrote:
> On Wed, Aug 10, 2016, at 15:08, Steve Dower wrote:
>> That's the hope, though that module approaches the solution differently
>> and may still have issues. An alternative way for us to fix this whole
>> thing would be to bring win_unicode_console into the standard library
>> and use it by default (or probably whenever PYTHONIOENCODING is not
>> specified).
>
> I have concerns about win_unicode_console:
> - For the "text_transcoded" streams, stdout.encoding is utf-8. For the
>   "text" streams, it is utf-16.
> - There is no object, as far as I can find, which can be used as an
>   unbuffered unicode I/O object.
> - raw output streams silently drop the last byte if an odd number of
>   bytes are written.
> - The sys.stdout obtained via streams.enable does not support
>   .buffer / .buffer.raw / .detach
> - All of these objects provide a fileno() interface.
> - When using os.read/write for data that represents text, the data
>   still should be encoded in the console encoding and not in utf-8 or
>   utf-16.
I understand Steve's point about being an improvement over 100% wrong,
but we've lived with the current state of affairs long enough that I
think we should take whatever time is needed to do it right.
Yes, that's what I meant; I just think it needs to be considered if we're
thinking about making it (or something like it) the default Python
sys.std*. Maybe the decision will be that maintaining compatibility with
these cases isn't important.
> > - The sys.stdout obtained via streams.enable does not support
> > .buffer / .buffer.raw / .detach
> > - All of these objects provide a fileno() interface.
>
> Is this wrong? If I remember, I provide it because of some check --
> maybe in input() -- to be viewed as a stdio stream.
I don't know if it's *wrong* per se (same with the no buffer/raw thing
etc), I'm just concerned about the possible effects on code that is
written against the current implementation.
> In which case, something IS better than nothing.
I'm not arguing that we do nothing. Are you saying we should use
CP_UTF8 *in preference* to wide character APIs? Or that we should
implement CP_UTF8 first and then wide chars later?
Or are we in
violent agreement that we should implement wide chars?
Hello,
I'm on holiday and writing on a phone, so sorry in advance for the short answer.
In short: we should drop support for the bytes API. Just use Unicode on all platforms, especially for filenames.
Sorry, but most of these changes look like very bad ideas (or maybe I misunderstood something). The Windows bytes APIs are broken in different ways; in short, your proposal is to put another layer on top of them to try to work around the issues.
Unicode is complex. Unicode issues are hard to debug. Adding a new layer makes debugging even harder. Is the bug in the input data? In the layer? In the final Windows function?
In my experience on UNIX, the most important part is interoperability with other applications. I understand that Python 2 will speak the ANSI code page but Python 3 will speak UTF-8; I don't understand how that can work. Almost all Windows applications speak the ANSI code page (I'm talking about stdin, stdout, pipes, ...).
Do you propose to first try to decode from UTF-8 and fall back to decoding from the ANSI code page? What about encoding? Always encode to UTF-8?
About BOMs: I hate them. Many applications don't understand them. Again, think about Python 2. I vaguely recall that the Unicode standard recommends against using a BOM (I have to check).
I recall a bug in gettext: the tool doesn't understand BOMs. When I opened the file in vim, the BOM was invisible (hidden). I had to use hexdump to understand the issue!
BOMs introduce issues that are very difficult to debug :-/ I also think they go in the wrong direction in terms of interoperability.
For the Windows console: I played with all the Windows functions, tried all fonts and many code pages. I also read technical blog articles by Microsoft employees. I gave up on this issue. It doesn't seem possible to fully support Unicode in the Windows console (at least the last time I checked). By the way, it seems like the Windows functions have bugs, and code page 65001 fixes a few issues but introduces new ones...
Victor
I suspect there's a lot of discussion to be had around this topic, so I want to get it started. There are some fairly drastic ideas here and I need help figuring out whether the impact outweighs the value.
Some background: within the Windows API, the preferred encoding is UTF-16. This is a 16-bit format that is typed as wchar_t in the APIs that use it. These APIs are generally referred to as the *W APIs (because they have a W suffix).
There are also (broadly deprecated) APIs that use an 8-bit format (char), where the encoding is assumed to be "the user's active code page". These are *A APIs. AFAIK, there are no cases where a *A API should be preferred over a *W API, and many newer APIs are *W only.
In general, Python passes byte strings into the *A APIs and text strings into the *W APIs.
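For example, with ctypes you can see both flavours of the same call (a
sketch; GetFileAttributes just happens to exist in both forms):

import ctypes
kernel32 = ctypes.WinDLL('kernel32')
kernel32.GetFileAttributesA(b'C:\\Windows')   # bytes, decoded via the ANSI codepage
kernel32.GetFileAttributesW('C:\\Windows')    # str, passed through as UTF-16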
Right now, sys.getfilesystemencoding() on Windows returns "mbcs", which translates to "the system's active code page". As this encoding generally cannot represent all paths on Windows, it is deprecated and Unicode strings are recommended instead. This, however, means you need to write significantly different code between POSIX (use bytes) and Windows (use text).
ISTM that changing sys.getfilesystemencoding() on Windows to "utf-8" and updating path_converter() (Python/posixmodule.c; likely similar code in other places) to decode incoming byte strings would allow us to undeprecate byte strings and add the requirement that they *must* be encoded with sys.getfilesystemencoding(). I assume that this would allow cross-platform code to handle paths similarly by encoding to whatever the sys module says they should and using bytes consistently (starting this thread is meant to validate/refute my assumption).
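As an illustration of what such cross-platform code could look like under the proposal (hypothetical on Windows today, where bytes paths are deprecated; 'café.txt' is just an example name):

import os, sys

fsenc = sys.getfilesystemencoding()      # would return 'utf-8' under the proposal
bname = 'café.txt'.encode(fsenc)         # the same code works on POSIX and Windows
with open(bname, 'w', encoding='utf-8') as f:
    f.write('spam')
print(bname in os.listdir(b'.'))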
(Yes, I know that people on POSIX should just change to using Unicode and surrogateescape. Unfortunately, rather than doing that they complain about Windows and drop support for the platform. If you want to keep hitting them with the stick, go ahead, but I'm inclined to think the carrot is more valuable here.)
Similarly, locale.getpreferredencoding() on Windows returns a legacy value - the user's active code page - which should generally not be used for any reason. The one exception is as a default encoding for opening files when no other information is available (e.g. a Unicode BOM or explicit encoding argument). BOMs are very common on Windows, since assuming the default encoding is nearly always a bad idea.
Making open()'s default encoding detect a BOM before falling back to locale.getpreferredencoding() would resolve many issues, but I'm also inclined towards making the fallback utf-8, leaving locale.getpreferredencoding() solely as a way to get the active system codepage (with suitable warnings about it only being useful for back-compat). This would match the behavior that the .NET Framework has used for many years - effectively, utf_8_sig on read and utf_8 on write.
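In other words, roughly the behaviour you already get by being explicit today (a sketch of the proposed default, not current behaviour; 'settings.txt' is an example name):

with open('settings.txt', 'w', encoding='utf-8') as f:      # never writes a BOM
    f.write('spam = 1\n')
with open('settings.txt', 'r', encoding='utf-8-sig') as f:  # skips a BOM if present
    text = f.read()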
Finally, the encodings of stdin, stdout and stderr are currently (correctly) inferred from the encoding of the console window that Python is attached to. However, this is typically a codepage that is different from the system codepage (i.e. it's not mbcs) and is almost certainly not Unicode. If users are starting Python from a console, they can use "chcp 65001" first to switch to UTF-8, and then *most* functionality works (input() has some issues, but those can be fixed with a slight rewrite and possibly breaking readline hooks).
It is also possible for Python to change the current console encoding to be UTF-8 on initialize and change it back on finalize. (This would leave the console in an unexpected state if Python segfaults, but console encoding is probably the least of anyone's worries at that point.) So I'm proposing actively changing the current console to be Unicode while Python is running, and hence sys.std[in|out|err] will default to utf-8.
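A sketch of what that init/finalize dance could look like (shown with ctypes for illustration; CPython would do this in C during startup and shutdown):

import ctypes
kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

old_in, old_out = kernel32.GetConsoleCP(), kernel32.GetConsoleOutputCP()
kernel32.SetConsoleCP(65001)              # "initialize": switch the console to UTF-8
kernel32.SetConsoleOutputCP(65001)
try:
    print('αβγ')                          # sys.stdout could now default to utf-8
finally:
    kernel32.SetConsoleCP(old_in)         # "finalize": restore the previous codepages
    kernel32.SetConsoleOutputCP(old_out)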
So that's a broad range of changes, and I have little hope of figuring out all the possible issues, back-compat risks, and flow-on effects on my own. Please let me know (either on-list or off-list) how a change like this would affect your projects, either positively or negatively, and whether you have any specific experience with these changes/fixes and think they should be approached differently.
To summarise the proposals (remembering that these would only affect Python 3.6 on Windows):
* change sys.getfilesystemencoding() to return 'utf-8'
* automatically decode byte paths assuming they are utf-8
* remove the deprecation warning on byte paths
* make the default open() encoding check for a BOM or else use utf-8
* [ALTERNATIVE] make the default open() encoding check for a BOM or else use sys.getpreferredencoding()
* force the console encoding to UTF-8 on initialize and revert on finalize
So what are your concerns? Suggestions?
Thanks,
Steve
On 10 August 2016 at 20:16, "Steve Dower" <steve...@python.org> wrote:
> So what are your concerns? Suggestions?
Add a new option specific to Windows to switch to UTF-8 everywhere, use BOM, whatever you want, *but* don't change the defaults.
IMO the mbcs encoding is the least bad choice for the default.
I have an idea of a similar option for UNIX: ignore user preference (LC_ALL, LC_CTYPE, LANG environment variables) and force UTF-8. It's a common request on UNIX where UTF-8 is now the encoding of almost all systems, whereas the C library continues to use ASCII when the POSIX locale is used (which occurs in many cases).
Perl already has such a utf8 option.
Victor
If that's all you want then you can set PYTHONIOENCODING=:replace.
Prepare to be inundated with question marks.
Python's 'cp*' encodings are cross-platform, so they don't call
Windows NLS APIs. If you want a best-fit encoding, then 'mbcs' is the
only choice. Use chcp.com to switch to your system's ANSI codepage and
set PYTHONIOENCODING=mbcs:replace.
An 'oem' encoding could be added, but I'm no fan of these best-fit
encodings. Writing question marks at least hints that the output is
wrong.
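For example, with a Western-European ANSI codepage the replace handler
turns anything outside the codepage into question marks:
>>> 'αβγ'.encode('cp1252', 'replace')
b'???'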
> Is there any particular reason for the REPL, when printing the repr of a
> returned object, not to replace characters not in the stdout encoding
> with backslash sequences?
sys.displayhook already does this. It falls back on
sys_displayhook_unencodable if printing the repr raises a
UnicodeEncodeError.
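For example, at a cp437 console the repr of a euro sign can't be encoded,
so the fallback writes it with backslashreplace (roughly what you'd see):
>>> sys.stdout.encoding
'cp437'
>>> '€'
'\u20ac'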
> Does Python provide any mechanism to access the built-in "best fit"
> mappings for windows codepages (which mostly consist of removing accents
> from latin letters)?
As mentioned above, for output this is only available with 'mbcs'. For
reading input via ReadFile or ReadConsoleA (and thus also C _read,
fread, and fgets), the console already encodes its UTF-16 input buffer
using a best-fit encoding to the input codepage. So there's no error
in the following example, even though the result is wrong:
>>> sys.stdin.encoding
'cp437'
>>> s = 'Ā'
>>> s, ord(s)
('A', 65)
Jumping back to the codepage 65001 discussion, here's a function to
simulate the bad output that Windows Vista and 7 users see:
def write(text):
    writes = []
    buffer = text.replace('\n', '\r\n').encode('utf-8')
    while buffer:
        decoded = buffer.decode('utf-8', 'replace')
        # the console reports the number of UTF-16 codes written as the
        # number of bytes written, so the caller resumes at the wrong offset
        buffer = buffer[len(decoded):]
        writes.append(decoded.replace('\r', '\n'))
    return ''.join(writes)
For example:
>>> greek = 'αβγδεζηθι\n'
>>> write(greek)
'αβγδεζηθι\n\n�ηθι\n\n�\n\n'
It gets worse with characters that require 3 bytes in UTF-8:
>>> devanagari = 'ऄअआइईउऊऋऌ\n'
>>> write(devanagari)
'ऄअआइईउऊऋऌ\n\n�ईउऊऋऌ\n\n��ऋऌ\n\n��\n\n'
This problem doesn't exist in Windows 8+ because the old LPC-based
communication (LPC is an undocumented protocol that's used extensively
for IPC between Windows subsystems) with the console was rewritten to
use a kernel driver (condrv.sys). Now it works like any other device
by calling NtReadFile, NtWriteFile, and NtDeviceIoControlFile.
Apparently in the rewrite someone fixed the fact that the conhost code
that handles WriteFile and WriteConsoleA was incorrectly returning the
number of UTF-16 codes written instead of the number of bytes.
Unfortunately the rewrite also broke Ctrl+C handling because ReadFile
no longer sets the last error to ERROR_OPERATION_ABORTED when a
console read is interrupted by Ctrl+C. I'm surprised so few Windows
users have noticed or cared that Ctrl+C kills the REPL and misbehaves
with input() in the Windows 8/10 console. The source of the Ctrl+C bug
is an incorrect NTSTATUS code STATUS_ALERTED, which should be
STATUS_CANCELLED. The console has always done this wrong, but before
the rewrite there was common code for ReadFile and ReadConsole that
handled STATUS_ALERTED specially. It's still there in ReadConsole, so
Ctrl+C handling works fine in Unicode programs that use ReadConsoleW
(e.g. cmd.exe, powershell.exe). It also works fine if
win_unicode_console is enabled.
Finally, here's a ctypes example in Windows 10.0.10586 that shows the
unsolvable problem with non-ASCII input when using codepage 65001:
import ctypes, msvcrt
conin = open(r'\\.\CONIN$', 'r+')
hConin = msvcrt.get_osfhandle(conin.fileno())
kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
nread = (ctypes.c_uint * 1)()
ASCII-only input works:
>>> buf = (ctypes.c_char * 100)()
>>> kernel32.ReadFile(hConin, buf, 100, nread, None)
spam
1
>>> nread[0], buf.value
(6, b'spam\r\n')
But it returns EOF if "a" is replaced by Greek "α":
>>> buf = (ctypes.c_char * 100)()
>>> kernel32.ReadFile(hConin, buf, 100, nread, None)
spαm
1
>>> nread[0], buf.value
(0, b'')
Notice that the read is successful but nread is 0. That signifies EOF.
So the REPL will just silently quit as if you entered Ctrl+Z, and
input() will raise EOFError. This can't be worked around. The problem
is in conhost.exe, which assumes a request for N bytes wants N UTF-16
codes from the input buffer. This can only work with ASCII in UTF-8.
> For the Windows console: I played with all the Windows functions, tried all fonts and many code pages. I also read technical blog articles by Microsoft employees. I gave up on this issue. It doesn't seem possible to fully support Unicode in the Windows console (at least the last time I checked). By the way, it seems like the Windows functions have bugs, and code page 65001 fixes a few issues but introduces new ones...
The exception is the proposed console changes, because there you *do* perform all I/O with OS APIs. But I don't know anything about the Windows console except that nobody seems happy with it.
> The last point is correct: if you get bytes from a file system API, you should be able to pass them back in without losing information. CP_ACP (a.k.a. the *A API) does not allow this, so I'm proposing using the *W API everywhere and encoding to utf-8 when the user wants/gives bytes.
You get into trouble when the filename comes from a file, another application, a registry key, ... which is encoded in CP_ACP.
Do you plan to transcode all of this data? (decode from CP_ACP, encode back to UTF-8)
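For reference, the transcoding step being described looks like this (a sketch assuming a cp1252 ANSI codepage; the bytes value is made up):
>>> ansi_name = b'caf\xe9.txt'    # e.g. read from a registry key or legacy config
>>> ansi_name.decode('mbcs').encode('utf-8')
b'caf\xc3\xa9.txt'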
Almost -- from my perusing of discussions from the last few years,
there do seem to be some library developers and *nix aficionados
that DO think it's The Right Thing -- after all, a char* has always
worked, yes? But these folks also seem to think that a *nix system
with no way of knowing the encoding of the names in the file system
(which could even mix more than one encoding) is not "broken" in any way.
A note about "utf-8 everywhere": while maybe a good idea, it's my
understanding that *nix developers absolutely do not want utf-8 to be
assumed in the Python APIs. Rather, this is all about punting the
handling of encodings down to the application level, rather than the
OS and library level. Which is more backward compatible, but otherwise
a horrible idea. And very much in conflict with Python 3's approach.
So it seems odd to assume utf-8 on Windows, where it is less ubiquitous.
Back to "The Right Thing" -- it's clear to me that everyone supporting
this proposal is very much doing so because it's "The Pragmatic Thing".
But it seems folks porting from py2 need to explicitly convert the
calls from str to bytes anyway to get the bytes behavior. With
surrogate escapes, now you need to do nothing. So we're really
supporting code that was ported to py3 earlier in the game - but it
seems a bad idea to cement that hacked solution in place.
And if the filenames in question are coming from a byte stream
somehow, rather than file system API calls, then you really do need to
know the encoding -- yes really! If a developer wants to assume utf-8,
that's fine, but the developer should be making that decision, not
Python itself. And not on Windows only.
-CHB
I am going to mute this thread but I am worried about the outcome. Once there is agreement please check with me first.
--Guido (mobile)
The choices are:
* don't represent them at all (remove bytes API)
* convert and drop characters not in the (legacy) active code page
* convert and fail on characters not in the (legacy) active code page
* convert and fail on invalid surrogate pairs
* represent them as UTF-16-LE in bytes (with embedded '\0' everywhere)
The fifth option is the best for round-tripping within Windows APIs.
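For illustration, the fifth option would turn an arbitrary filename into bytes like this (note the embedded NUL bytes):
>>> 'αβγ.txt'.encode('utf-16-le')
b'\xb1\x03\xb2\x03\xb3\x03.\x00t\x00x\x00t\x00'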