Re: [Python-ideas] [issue33865] [EASY] Missing code page aliases: "unknown encoding: 874"


Steven D'Aprano

Jun 16, 2018, 7:00:26 AM
to python...@python.org
> It is easy to test it. Encoding/decoding with '874' should give the
> same result as with 'cp874'.

I know it is too late to remove that feature, but why do we support
digit-only IDs for encodings? They can be ambiguous. If Wikipedia is
correct, cp874 (also known as ibm874) and Windows-874 (also known as
cp1162) are different:

https://en.wikipedia.org/wiki/ISO/IEC_8859-11#Code_page_874

https://en.wikipedia.org/wiki/ISO/IEC_8859-11#Code_page_1162
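
The check quoted above can be sketched as follows. `same_codec` is a hypothetical helper, not a stdlib function. It reports `None` when either name is unknown to the running interpreter, which is the case for "874" on the reporter's Python 3.6.

```python
import codecs


def same_codec(name_a, name_b):
    """Compare how two encoding names resolve.

    Returns True if both resolve to the same codec, False if they resolve
    to different codecs, and None if either name is unknown.
    """
    try:
        info_a = codecs.lookup(name_a)
        info_b = codecs.lookup(name_b)
    except LookupError:
        return None
    return info_a.name == info_b.name


print(same_codec("866", "cp866"))
print(same_codec("874", "cp874"))  # None on interpreters without the alias
```

Comparing the resolved codec names only shows that the two spellings reach the same codec in Python. It says nothing about whether that codec matches what other vendors mean by the number.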


--
Steve
_______________________________________________
Python-ideas mailing list
Python...@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Stephen J. Turnbull

Jun 17, 2018, 8:03:11 AM
to python...@python.org
Folks. There are standards. "1252" *is not* an alias for
"windows-1252" according to the IANA, while "866" *is* an alias for
"IBM866" according to the same authority. Most 3-digit "IBMxxx" ARE
aliased to both "cpxxx" and just "xxx", but not all. None of
"IBM874", "874", or "cp874" exists according to the IANA.

https://www.iana.org/assignments/character-sets/character-sets.xhtml

For the reasons Steven gave, I would say omit the digits-only aliases,
but if we must use them because "there's a standard" (or backward
compatibility), we should stick to those defined by standard, and only
those.

If we're following other standards that I'm unaware of, fine, but
let's cite them rather than randomly introduce a plethora of aliases
because they "look like" an existing (and unfortunate) standard.

There's also some other weirdness with "windows-874", see below. We
(somebody) should check other "windows-xxx" character sets to make
sure they're not misnamed "cpxxx".

Steven D'Aprano writes:
> > It is easy to test it. Encoding/decoding with '874' should give the
> > same result as with 'cp874'.
>
> I know it is too late to remove that feature, but why do we support
> digit-only IDs for encodings? They can be ambiguous. If Wikipedia is
> correct, cp874 (also known as ibm874) and Windows-874 (also known as
> cp1162) are different:

According to the IANA, they're not necessarily ambiguous. Here is
the entry for IBM866:

IBM866   2086   IBM NLDG Volume 2            cp866
                (SE09-8002-03) August 1994   866
                [Rick_Pond]                  csIBM866

where the entries in column 4 show the registered aliases. There are
at least a dozen IBMxxx character sets with 'xxx' aliases.
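
As a rough illustration, the names in the IBM866 entry above can be fed to Python's own codec lookup to see what each resolves to. The list of names is copied from that entry. `resolve_all` is a hypothetical helper, and the result reflects Python's alias table rather than the IANA registry itself.

```python
import codecs

# Name and aliases taken from the IANA entry for IBM866 quoted above.
IANA_IBM866_NAMES = ["IBM866", "cp866", "866", "csIBM866"]


def resolve_all(names):
    """Map each name to the codec name Python resolves it to, or None."""
    resolved = {}
    for name in names:
        try:
            resolved[name] = codecs.lookup(name).name
        except LookupError:
            resolved[name] = None
    return resolved


resolved = resolve_all(IANA_IBM866_NAMES)
for name, codec_name in resolved.items():
    print(f"{name!r} -> {codec_name!r}")
```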

I don't understand what's with "cp874", though. We can surely take
that one back, although we'd better hurry if it's in 3.7rc. We might
want to add "windows-874" (which doesn't seem to be present in Python
3.6), since that's the standard character set name per IANA.

The confusion between cp874 and windows-874 may be because in
VENDORS/MICSFT/WINDOWS it's in CP874.TXT (as are all the code pages
there).

> https://en.wikipedia.org/wiki/ISO/IEC_8859-11#Code_page_874
>
> https://en.wikipedia.org/wiki/ISO/IEC_8859-11#Code_page_1162

I don't know where Wikipedia's information comes from, but it's not
the IANA.


--
Associate Professor Division of Policy and Planning Science
http://turnbull.sk.tsukuba.ac.jp/ Faculty of Systems and Information
Email: turn...@sk.tsukuba.ac.jp University of Tsukuba
Tel: 029-853-5175 Tennodai 1-1-1, Tsukuba 305-8573 JAPAN

Ronald Oussoren

Jun 17, 2018, 2:31:01 PM
to Stephen J. Turnbull, python...@python.org


On 17 Jun 2018, at 14:02, Stephen J. Turnbull <turnbull....@u.tsukuba.ac.jp> wrote:

> Folks. There are standards. "1252" *is not* an alias for
> "windows-1252" according to the IANA, while "866" *is* an alias for
> "IBM866" according to the same authority. Most 3-digit "IBMxxx" ARE
> aliased to both "cpxxx" and just "xxx", but not all. None of
> "IBM874", "874", or "cp874" exists according to the IANA.

Sure, but for at least one user Python 3.6 fails to start because initialising the sys.std* streams fails due to not finding a “874” encoding.   

The user sadly enough didn’t provide more information on his machine, other than that it is running some version of Windows. 

BTW. “cp874” does exist according to the Unicode Consortium: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP874.TXT, and appears to be a code page for the Thai language. The user might therefore be running Windows with a Thai locale.

Ronald


Steven D'Aprano

Jun 17, 2018, 8:36:38 PM
to python...@python.org
> Sure, but for at least one user Python 3.6 fails to start because
> initialising the sys.std* streams fails due to not finding a “874”
> encoding.

That doesn't mean that the bug is best fixed by adding an alias.

If the error was failing to find encoding "ltain-1", would we add an
alias or fix the spelling? If 874 is not an official alias, we should
consider it a misspelling and fix the misspelling, not add an alias.

But either way, the point Stephen is making is that even if 874 is a
legitimate alias, that shouldn't give us carte blanche to add numeric
aliases for every encoding.

Karthikeyan

Jun 18, 2018, 11:07:46 AM
to python-ideas
> BTW. “cp874” does exist according to the unicode consortium: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP874.TXT, and appears to be a codepage for a (the?) Thai language.  The user might therefore be running Windows with a Thai locale.

This page also lists 874 along with windows-874 as a .NET name belonging to the Thai language and doesn't mention cp-874. I have no knowledge of .NET but wanted to add this as a reference.

Another disadvantage of patching the search function (or of adding aliases for digit-only names on the assumption that they mean cpXXXX) is that it prepends "cp" and is only reached when aliases.py, which takes precedence, has not already resolved the name. Some digit-only names, such as '936', which corresponds to 'gbk', are already in aliases.py, so for now they are not resolved as 'cp936'. But if new digit-only, non-cp encodings are added in the future, they will have to be added to that file so that precedence works; otherwise they will always resolve to a cpXXXX encoding. I think this is noted at https://bugs.python.org/issue33865#msg319617.
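
A minimal sketch of the precedence described above, assuming a resolver that consults the alias table first and only then prepends "cp". `resolve_digit_only` is a hypothetical function written for illustration. It is not the code in search_function.patch and not the stdlib search function.

```python
import codecs
from encodings import aliases


def resolve_digit_only(name):
    """Hypothetical resolver: alias table first, then a "cp" prefix.

    Returns the resolved codec name, or None if nothing matches.
    """
    key = name.lower()
    # Step 1: an explicit alias wins, so '936' resolves to 'gbk'.
    if key in aliases.aliases:
        return aliases.aliases[key]
    # Step 2: only digit-only names fall through to the "cp" guess.
    if key.isdigit():
        try:
            return codecs.lookup("cp" + key).name
        except LookupError:
            return None
    return None


print(resolve_digit_only("936"))
print(resolve_digit_only("874"))
```

The weakness is visible in step 2: any digit-only name missing from the alias table is guessed to be a cpXXXX code page, whether or not that is what the platform meant.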

It would be nice if the original poster provided more context, or an environment to reproduce the problem, beyond the screenshot, which has limited information. I am setting aside search_function.patch and look forward to the OP replying in the issue.

Thanks

PS: This is my first mailing list post. Kindly ignore it if I am using the wrong quoting mechanism.

Ronald Oussoren

Jun 18, 2018, 11:20:41 AM
to Steven D'Aprano, python...@python.org

> On 18 Jun 2018, at 02:34, Steven D'Aprano <st...@pearwood.info> wrote:
>
>> Sure, but for at least one user Python 3.6 fails to start because
>> initialising the sys.std* streams fails due to not finding a “874”
>> encoding.
>
> That doesn't mean that the bug is best fixed by adding an alias.

I agree, I’ve mentioned in the issue that I’d like to understand why python looks for an encoding with this name.

>
> If the error was failing to find encoding "ltain-1", would we add an
> alias or fix the spelling? If 874 is not an official alias, we should
> consider it a misspelling and fix the misspelling, not add an alias.

That depends: if a major platform ships with locales where the encoding is misspelled, we have little choice but to add an alias. To put it too bluntly: standards are fine until they conflict with reality.

>
> But either way, the point Stephen is making is that even if 874 is a
> legitimate alias, that shouldn't give us carte blanche to add numeric
> aliases for every encoding.

Possibly just for the “cp…” encodings, but IMHO only if we confirm that the code to look for the preferred encoding returns a codepage number on Windows and changing that code leads to worse results than adding numeric aliases for the “cp…” encodings.
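
One way to gather the information asked for here is a small diagnostic the reporter could run. This is only a sketch. `describe_io_encoding` is a hypothetical helper, and it has to be run from an interpreter that starts successfully on the affected machine.

```python
import codecs
import locale
import sys


def describe_io_encoding():
    """Report the preferred encoding and whether Python knows a codec for it."""
    preferred = locale.getpreferredencoding(False)
    # sys.stdout can be None (for example under pythonw), so be defensive.
    stdout_encoding = getattr(sys.stdout, "encoding", None)
    try:
        codecs.lookup(preferred)
        known = True
    except LookupError:
        known = False
    return {
        "preferred": preferred,
        "stdout": stdout_encoding,
        "preferred_is_known": known,
    }


info = describe_io_encoding()
for key, value in info.items():
    print(f"{key}: {value!r}")
```

If "preferred" came back as a bare number such as "874", that would support the theory that the preferred-encoding code is returning a code page number without the "cp" prefix.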

Ronald

Karthikeyan

Jun 19, 2018, 1:06:19 AM
to python-ideas
The user has confirmed that adding the alias to aliases.py fixes the problem. I have replied to ask whether it is reproducible on other machines with a Thai locale, and for any extra information that would help debug this.


Thanks

On Monday, June 18, 2018 at 12:01:01 AM UTC+5:30, Ronald Oussoren wrote:

Winvinc P. Phichitnitikorn

Jun 19, 2018, 2:56:46 AM
to python-ideas
Hi guys,

I'm the user who got the error. The error happens on 64-bit Windows 10 with the language for non-Unicode programs set to Thai (Thailand).
874 is the standard encoding for the Thai language in Microsoft environments.

Stephen J. Turnbull

Jun 21, 2018, 3:18:35 AM
to Ronald Oussoren, python...@python.org
Ronald Oussoren writes:

> Possibly just for the “cp…” encodings, but IMHO only if we confirm
> that the code to look for the preferred encoding returns a codepage
> number on Windows and changing that code leads to worse results
> than adding numeric aliases for the “cp…” encodings.

Almost all of the CPxxx encodings have multiple aliases[1], so I just
don't see the point unless numeric-only code page designations are
baked in to default "locales"[2] in official releases by major OS
vendors. And probably not even then, since it should be easy enough
to provide a proper "locale" and/or PYTHONIOENCODING setting.
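
For reference, the PYTHONIOENCODING workaround can be tried without touching the system configuration by setting the variable for a child interpreter only. This is a sketch. `child_stdout_encoding` is a hypothetical helper, and "cp874" is used as the value on the assumption that it is the codec the Thai locale needs.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Start a child interpreter with PYTHONIOENCODING set and report
    the encoding its sys.stdout ends up using."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if the child fails to start
    )
    return result.stdout.strip()


print(child_stdout_encoding("cp874"))
```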

Of course we should help the reporter figure out what's going on and
help them fix it with appropriate system configuration. If that
doesn't work, then (and *only then*) we could think about doing a
stupid thing.

Footnotes:
[1] Granted, "874" only has "windows-874" registered with the IANA,
so it's kind of salient. Still, if numeric-only aliases were a
"thing", surely we'd have heard about it by now---I first encountered
Thai encodings in 1990 (ok, that was TIS 620, but windows-874 is
basically TIS plus Microsoft punctuation extensions IIRC), Thais do
use computers in their native language a lot.

[2] Scare quotes to refer to appropriate platform facilities, as
neither Windows nor Mac OS is strictly conformant to POSIX on this.
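
To see how much Python's own "cp874" and "tis_620" codecs differ, one can decode every byte value with both and list the disagreements. This is a sketch. `differing_bytes` is a hypothetical helper, and it only compares the codecs shipped with Python, not the vendor definitions discussed in this thread.

```python
def differing_bytes(encoding_a, encoding_b):
    """List the byte values that two single-byte codecs decode differently."""

    def decode_byte(value, encoding):
        try:
            return bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            return None  # byte is undefined in this codec

    return [
        value
        for value in range(256)
        if decode_byte(value, encoding_a) != decode_byte(value, encoding_b)
    ]


diff = differing_bytes("cp874", "tis_620")
print(len(diff), [hex(value) for value in diff])
```

A byte that is undefined in one codec and defined in the other counts as a difference.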

Ronald Oussoren

Jun 22, 2018, 7:57:52 AM
to Stephen J. Turnbull, python...@python.org


On 21 Jun 2018, at 09:17, Stephen J. Turnbull <turnbull....@u.tsukuba.ac.jp> wrote:

> Ronald Oussoren writes:
>
>> Possibly just for the “cp…” encodings, but IMHO only if we confirm
>> that the code to look for the preferred encoding returns a codepage
>> number on Windows and changing that code leads to worse results
>> than adding numeric aliases for the “cp…” encodings.
>
> Almost all of the CPxxx encodings have multiple aliases[1], so I just
> don't see the point unless numeric-only code page designations are
> baked in to default "locales"[2] in official releases by major OS
> vendors. And probably not even then, since it should be easy enough
> to provide a proper "locale" and/or PYTHONIOENCODING setting.

The user shouldn’t have to do anything other than install Python. IMHO
we’re doing something wrong when the Python interpreter doesn’t start up
with a default system configuration (when the user explicitly sets a bogus
PYTHONIOENCODING or locale all bets are off, although even then
warning about and then ignoring bad settings would be more user-friendly
than the current behavior).



> Of course we should help the reporter figure out what's going on and
> help them fix it with appropriate system configuration. If that
> doesn't work, then (and *only then*) we could think about doing a
> stupid thing.

The issue is making slow progress. I’m not a Windows user myself and
therefore cannot easily experiment with what’s going on (other than by
reading the code).

Ronald

Stephen J. Turnbull

Jun 25, 2018, 10:51:15 AM
to Ronald Oussoren, python...@python.org
Ronald Oussoren writes:

> The user shouldn’t have to do anything other than install Python. IMHO
> we’re doing something wrong when the Python interpreter doesn’t start up
> with a default system configuration

There's no evidence in the issue that I can see that suggests that the
user installed Python into the default system configuration. I see a
bunch of Python developers who have no access to the OP's system
configuration demonstrating that something that shouldn't work and never
has worked doesn't work, then providing a patch to make it work. This
despite the fact that the OP hasn't provided any configuration details
that would confirm this is a system default setting.

I wouldn't object to making it work if there were any evidence that it
is a real problem that other users will encounter. But there isn't any
such evidence yet, it's a non-standard alias according to Microsoft's
own IANA registration, and Steven D'Aprano's argument that such aliases
may be ambiguous is plausible, though I haven't seen confirmation it
would be a problem in practice.

> (when the user explicitly sets a bogus PYTHONIOENCODING or locale all
> bets are off,

I'm assuming that is the case, based on the fact that none of my two
;-) Thai students ever had this problem, nor have I seen a report of
this problem for any encoding in either Emacs or Python contexts since
about 1990, nor has the OP posted anything about his/her
configuration.

> although even then warning about and then ignoring bad settings
> would be more user-friendly than the current behavior)

If Python is told to talk YTREWQ and it doesn't know how to talk YTREWQ,
ignoring the problem is not possible if any input or output in YTREWQ is
required. The program will crash with a much harder to understand error
message describing "undecodable input" in an encoding the user doesn't
expect. My own experience is that soldiering on is the least user-
friendly thing to do, as typically there's a trivial change that the
user can make to resolve the problem optimally.
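
The "undecodable input" failure described above is easy to reproduce in miniature. This is a sketch. `try_decode` is a hypothetical helper, and the Thai sample text is chosen only for illustration.

```python
# Thai text as a Thai-locale terminal or file would supply it.
thai_bytes = "สวัสดี".encode("cp874")


def try_decode(data, encoding):
    """Decode data, or return a short description of the failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__}: {exc.reason}"


print(try_decode(thai_bytes, "cp874"))  # round-trips to the original text
print(try_decode(thai_bytes, "ascii"))  # reports a UnicodeDecodeError
```

The second call shows what a program that soldiered on with the wrong encoding would hit as soon as it read real input.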

The obvious thing to do is to fall back to ASCII, which almost certainly
is compatible with the terminal, the log files, and the user's eyes and
brain, emit a warning, and quit. That is what we do. The warning seems
OK: the OP also diagnosed the missing alias, likely with little trouble.
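
The warn-and-fall-back idea can be sketched at the application level as below. This is not how the interpreter's startup code is written. `resolve_or_ascii` is a hypothetical helper that only illustrates reporting the unknown name before falling back to ASCII.

```python
import codecs
import warnings


def resolve_or_ascii(name):
    """Return the codec name for `name`, or warn and fall back to ASCII."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        warnings.warn(
            f"unknown encoding {name!r}; falling back to ASCII",
            RuntimeWarning,
            stacklevel=2,
        )
        return "ascii"


print(resolve_or_ascii("cp874"))
print(resolve_or_ascii("ytrewq"))  # emits a RuntimeWarning naming the encoding
```

Naming the offending encoding in the warning is what lets a user spot a missing alias quickly, as the OP did.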

Steve
