On 2/4/21, Ben Rudiak-Gould <benr...@gmail.com> wrote:
>
> My proposal is to add a couple of single-character options to open()'s mode
> parameter. 'b' and 't' already exist, and the encoding parameter
> essentially selects subcategories of 't', but it's annoyingly verbose and
> so people often omit it.
>
> If '8' was equivalent to specifying encoding='UTF-8', and 'L' was
> equivalent to specifying encoding=(the real locale encoding, ignoring UTF-8
> mode), that would go a long way toward making open more convenient in the
> common cases on Windows, and I bet it would encourage at least some of
> those developing on Unixy platforms to write more portable code also.
A precedent for using the mode parameter is [_w]fopen in MSVC, which
supports a "ccs=<encoding>" flag, where "<encoding>" can be "UTF-8",
"UTF-16LE", or "UNICODE".
---
As for using the locale encoding, keep in mind that Python's
implementation in Windows doesn't use the current LC_CTYPE locale. It
only uses the default locale, which in turn uses the process active
(ANSI) code page. The latter is a system setting, unless overridden to
UTF-8 in the application manifest (e.g. the manifest that's embedded
in "python.exe").
I'd like to see support for a -X option and/or environment variable to
make Python in Windows actually use the current locale to get the
locale encoding (a real shocker, I know). For example,
setlocale(LC_CTYPE, "el_GR") would select "cp1253" (Greek) as the
locale encoding, while setlocale(LC_CTYPE, "el_GR.utf-8") would select
"utf-8" as the locale encoding.
(The CRT supports UTF-8 in locales starting with Windows 10 version
1803, build 17134, released on 2018-04-30.)
At startup, Python 3.8+ calls setlocale(LC_CTYPE, "") to use the
default locale, for use with C functions such as mbstowcs(). This
allows the default behavior to remain the same, unless the new option
also entails attempting locale coercion to UTF-8 via
setlocale(LC_CTYPE, ".utf-8").
The following gets the current locale's code page in C:
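#include <locale.h>
// ...
_locale_t loc = _get_current_locale();
__crt_locale_data_public *locinfo = (__crt_locale_data_public *)loc->locinfo;
unsigned int cp = locinfo->_locale_lc_codepage;
_free_locale(loc);  // release the locale returned by _get_current_locale()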
The "C" locale uses code page 0. C mbstowcs() and wcstombs() handle
this case as Latin-1. locale._get_locale_encoding() could instead map
it to the process ANSI code page, GetACP(). Also, the CRT displays
CP_UTF8 (65001) as "utf8". _get_locale_encoding() should map it to
"utf-8" instead of "cp65001".
Message archived at
https://mail.python.org/archives/list/python...@python.org/message/MZC4DDCTMOX25ZQVUGBNLE6VPVXHXNKU/