I stumbled on this situation: if I decode some string (below, just the empty string) using the mcbs encoding, it succeeds, but if I try to encode it back with the same encoding, it surprisingly fails with a LookupError. This seems like something to be corrected?
$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
> i stumbled on this situation, that is if I decode some string, below
> just the empty string, using the mcbs encoding, it succeeds, but if I
> try to encode it back with the same encoding it surprisingly fails
> with a LookupError. This seems like something to be corrected?
Indeed - in your code. It's not the same encoding.
On Jan 2, 9:30 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> Use "mbcs" in the second call, not "mcbs".
Oops, sorry about that: when I switched to testing it in the interpreter I mistyped "mbcs" as "mcbs". But note that I did it consistently ;-) I.e. it was still the same encoding, even if maybe non-existent...?
If I try again using "mbcs" consistently, I still get the same error:
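$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('', 'mbcs')
u''
>>> unicode('', 'mbcs').encode('mbcs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mbcs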
> On Jan 2, 9:30 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> > Use "mbcs" in the second call, not "mcbs".
> Ooops, sorry about that, when i switched to test it in the interpreter
> I mistyped "mbcs" with "mcbs". But remark I did it consistently ;-)
> I.e. it was still the same encoding, even if maybe non-existent.. ?
> If I try again using "mbcs" consistently, I still get the same error:
> $ python
> Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
> [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> unicode('', 'mbcs')
> u''
> >>> unicode('', 'mbcs').encode('mbcs')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> LookupError: unknown encoding: mbcs
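$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('', 'mbcs')
u''
>>> unicode('abc', 'mbcs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mbcs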
I do not know what the implications of encoding according to "ANSI codepage (CP_ACP)" are. That it is Windows-only seems clear, but why does it only complain when decoding a non-empty string (or when encoding the empty unicode string)?
> Do not know what the implications of encoding according to "ANSI
> codepage (CP_ACP)" are.
Neither do I. YAGNI (especially on darwin) so don't lose any sleep over it.
> Windows only seems clear, but why does it only
> complain when decoding a non-empty string (or when encoding the empty
> unicode string) ?
My presumption: because it doesn't need a codec to decode '' into u''; no failed codec look-up, so no complaint. Any realistic app will try to decode a non-empty string sooner or later.
On Jan 2, 12:28 pm, John Machin <sjmac...@lexicon.net> wrote:
> On Jan 2, 9:57 pm, mario <ma...@ruggier.org> wrote:
> > Do not know what the implications of encoding according to "ANSI
> > codepage (CP_ACP)" are.
> Neither do I. YAGNI (especially on darwin) so don't lose any sleep
> over it.
> > Windows only seems clear, but why does it only
> > complain when decoding a non-empty string (or when encoding the empty
> > unicode string) ?
> My presumption: because it doesn't need a codec to decode '' into u'';
> no failed codec look-up, so no complaint. Any realistic app will try
> to decode a non-empty string sooner or later.
Yes, I suspect I will never need it ;)
Incidentally, the situation is that in a script that tries to guess a file's encoding, it bombed on the file ".svn/empty-file" -- but why it was going so far with an empty string was really due to a bug elsewhere in the script, trivially fixed. Still, I was curious about this non-symmetric behaviour for the empty string by some encodings.
Anyhow, thanks a lot to both of you for the great feedback!
>>>>> mario <ma...@ruggier.org> (M) wrote:
>M> $ python
>M> Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
>M> [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
>M> Type "help", "copyright", "credits" or "license" for more information.
>>>>> unicode('', 'mbcs')
>M> u''
>>>>> unicode('abc', 'mbcs')
>M> Traceback (most recent call last):
>M>   File "<stdin>", line 1, in <module>
>M> LookupError: unknown encoding: mbcs
>M> Hmmn, strange. Same behaviour for "raboof".
Apparently for the empty string the encoding is irrelevant as it will not be used. I guess there is an early check for this special case in the code.
--
Piet van Oostrum <p...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
> Do not know what the implications of encoding according to "ANSI
> codepage (CP_ACP)" are. Windows only seems clear, but why does it only
> complain when decoding a non-empty string (or when encoding the empty
> unicode string) ?
It has no implications for this issue here. CP_ACP is a Microsoft invention of a specific encoding alias - the "ANSI code page" (as Microsoft calls it) is not a specific encoding where I could specify a mapping from bytes to characters, but instead a system-global indirection based on a language default. For example, in the Western-European/U.S. version of Windows, the default for CP_ACP is cp1252 (a local installation may change that default, system-wide).
The issue likely has the cause that Piet also guessed: If the input is an empty string, no attempt to actually perform an encoding is done, but the output is assumed to be an empty string again. This is correct behavior for all codecs that Python supports in its default installation, at least for the direction bytes->unicode. For the reverse direction, such an optimization would be incorrect; consider u"".encode("utf-16").
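Martin's counter-example can be checked directly. The sketch below is written in Python 3 syntax, which is an assumption on the editor's part (the thread itself uses Python 2, where the call would be u"".encode("utf-16")). It shows that encoding the empty string with the generic UTF-16 codec still emits a byte order mark, so the output is not empty.

```python
import codecs

# Encoding the empty string is not always a no-op: the generic "utf-16"
# codec prepends a byte order mark (BOM) in the platform's native order.
empty_utf16 = "".encode("utf-16")

# For comparison, UTF-8 adds nothing to an empty string.
empty_utf8 = "".encode("utf-8")

print(repr(empty_utf16))  # two BOM bytes; their order depends on the platform
print(repr(empty_utf8))   # b''
```

So a shortcut of the form "empty in, empty out" would be safe for decoding but would give the wrong bytes for this encoder.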
On Jan 2, 2:25 pm, Piet van Oostrum <p...@cs.uu.nl> wrote:
> Apparently for the empty string the encoding is irrelevant as it will not
> be used. I guess there is an early check for this special case in the code.
In the module I am working on [*] I am remembering a failed encoding to allow me, if necessary, to later re-process fewer encodings. In the case of an empty string AND an unknown encoding this strategy failed...
Anyhow, the question is, should the behaviour be the same for these operations, and if so what should it be:
[*] a module to decode heuristically, that imho is actually starting to look quite good, it is at http://gizmojo.org/code/decodeh/ and any comments very welcome.
On Jan 4, 8:03 am, mario <ma...@ruggier.org> wrote:
> On Jan 2, 2:25 pm, Piet van Oostrum <p...@cs.uu.nl> wrote:
> > Apparently for the empty string the encoding is irrelevant as it will not
> > be used. I guess there is an early check for this special case in the code.
> In the module I am working on [*] I am remembering a failed encoding
> to allow me, if necessary, to later re-process fewer encodings.
If you were in fact doing that, you would not have had a problem. What you appear to have been doing is (a) remembering a NON-failing encoding, and assuming that it would continue not to fail (b) not differentiating between failure reasons (codec doesn't exist, input not consistent with specified encoding).
A good strategy when dealing with encodings that are unknown (in the sense that they come from user input, or a list of encodings you got out of the manual, or are constructed on the fly (e.g. encoding = 'cp' + str(code_page_number) # old MS Excel files)) is to try to decode some vanilla ASCII alphabetic text, so that you can give an immediate in-context error message.
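A minimal sketch of that probing strategy, in Python 3 syntax (the thread uses Python 2, so bytes literals stand in for str here). The function name codec_exists, the probe text and the value 1252 are illustrative assumptions, not anything from the original posts.

```python
def codec_exists(encoding, probe=b"abcdefgh"):
    """Return True if `encoding` names a usable text codec.

    Decoding non-empty ASCII text forces the codec lookup to happen,
    so an unknown name is reported straight away.
    """
    try:
        probe.decode(encoding)
    except LookupError:
        # No codec is registered under this name.
        return False
    except UnicodeError:
        # The codec exists; it merely rejected the probe bytes.
        return True
    return True


code_page_number = 1252  # illustrative value only
for name in ("utf-8", "cp" + str(code_page_number), "raboof"):
    print(name, codec_exists(name))
```

Running the check once per candidate name, before any real data is decoded, keeps "codec does not exist" separate from "this input is not in that encoding".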
> In the
> case of an empty string AND an unknown encoding this strategy
> failed...
> Anyhow, the question is, should the behaviour be the same for these
> operations, and if so what should it be:
Perhaps you should make TWO comparisons:
(1) unistrg = strg.decode(encoding) with unistrg = unicode(strg, encoding) [the latter "optimises" the case where strg is ''; the former can't because its output may be '', not u'', depending on the encoding, so it must do the lookup]
(2) unistrg = strg.decode(encoding) with strg = unistrg.encode(encoding) [both always do the lookup]
In any case, a pointless question (IMHO); the behaviour is extremely unlikely to change, as the chance of breaking existing code outvotes any desire to clean up a minor inconsistency that is easily worked around.
On Jan 4, 12:02 am, John Machin <sjmac...@lexicon.net> wrote:
> On Jan 4, 8:03 am, mario <ma...@ruggier.org> wrote:
> > On Jan 2, 2:25 pm, Piet van Oostrum <p...@cs.uu.nl> wrote:
> > > Apparently for the empty string the encoding is irrelevant as it will not
> > > be used. I guess there is an early check for this special case in the code.
> > In the module I am working on [*] I am remembering a failed encoding
> > to allow me, if necessary, to later re-process fewer encodings.
> If you were in fact doing that, you would not have had a problem. What
> you appear to have been doing is (a) remembering a NON-failing
> encoding, and assuming that it would continue not to fail
Yes, exactly. But it makes no difference which ones I remember, as the two subsets always add up to the same set anyway. In this special case (empty string!) the unicode() call does not fail...
> (b) not
> differentiating between failure reasons (codec doesn't exist, input
> not consistent with specified encoding).
There is no failure in the first pass in this case... If I do as you suggest further down, that is, use s.decode(encoding) instead of unicode(s, encoding) to force the lookup, then I could remember the failure reason and use it to decide how to proceed. However, I am aiming at an automatic decision, so an in-context error message would need to be replaced with more rigorous information about how the guessing should proceed. I am also trying to keep this simple ;)
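One way to record the failure reason alongside the result is sketched below, again in Python 3 syntax. The function name try_decode and the reason strings are hypothetical; this is not taken from the decodeh module. The explicit codecs.lookup() call forces the lookup even for empty input, so the outcome does not depend on any empty-string shortcut.

```python
import codecs


def try_decode(data, encoding):
    """Return (text, reason); text is None unless reason is "ok"."""
    try:
        # Look the codec up explicitly so that an unknown name is
        # reported even when `data` is empty.
        codecs.lookup(encoding)
        return data.decode(encoding), "ok"
    except LookupError:
        # The codec does not exist (or is not a text encoding).
        return None, "unknown-codec"
    except UnicodeError:
        # The codec exists but the bytes are not valid in it.
        return None, "bad-input"


for data, name in ((b"", "raboof"), (b"\xff", "utf-8"), (b"abc", "ascii")):
    print(repr(data), name, try_decode(data, name))
```

A guesser could then drop an "unknown-codec" candidate permanently, while a "bad-input" candidate is only ruled out for the current input.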
<snip>
> In any case, a pointless question (IMHO); the behaviour is extremely
> unlikely to change, as the chance of breaking existing code outvotes
> any desire to clean up a minor inconsistency that is easily worked
> around.
Yes, I would agree. The workaround may not even be worth it though, as what I really want is a unicode object, so changing from calling unicode() to s.decode() is not quite right, and will anyway require a further check. Less clear code, and a small unnecessary performance hit for the 99.9% majority of cases... Anyhow, I have further improved the "post guess" checking/refining logic of the algorithm [*].
What I'd like to understand better is the "compatibility hierarchy" of known encodings, in the positive sense that if a string decodes successfully with encoding A, then it is also possible that it will encode with encodings B, C; and in the negative sense that if a string fails to decode with encoding A, then for sure it will also fail to decode with encodings B, C. Any ideas if such an analysis of the relationships between encodings exists?
> What I'd like to understand better is the "compatibility hierarchy" of
> known encodings, in the positive sense that if a string decodes
> successfully with encoding A, then it is also possible that it will
> encode with encodings B, C; and in the negative sense that if a
> string fails to decode with encoding A, then for sure it will also
> fail to decode with encodings B, C. Any ideas if such an analysis of
> the relationships between encodings exists?
Most certainly. You'll have to learn a lot about many encodings though to really understand the relationships.
Many encodings X are "ASCII supersets", in the sense that if you have only characters in the ASCII set, the encoding of the string in ASCII is the same as the encoding of the string in X. ISO-8859-X, ISO-2022-X, koi8-x, and UTF-8 fall in this category.
Other encodings are "ASCII supersets" only in the sense that they include all characters of ASCII, but encode them differently. EBCDIC and UCS-2/4, UTF-16/32 fall in that category.
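The two senses of "ASCII superset" can be seen by encoding the same ASCII-only text with a few codecs. This is a small sketch in Python 3 syntax. The sample text and the particular codec names are the editor's picks: cp500 stands in for EBCDIC, and the "-le" variants are used so that no byte order mark is involved.

```python
text = "plain ASCII text"
ascii_bytes = text.encode("ascii")

# Supersets in the strict sense: ASCII text encodes to the same bytes.
same_bytes = ("iso-8859-1", "koi8-r", "utf-8")

# Supersets only by repertoire: every ASCII character is representable,
# but the byte sequences differ from the ASCII ones.
different_bytes = ("cp500", "utf-16-le", "utf-32-le")

for name in same_bytes + different_bytes:
    encoded = text.encode(name)
    # Report whether the bytes match ASCII and whether the text survives
    # a round trip through the codec.
    print(name, encoded == ascii_bytes, encoded.decode(name) == text)
```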
Some encodings are 7-bit, so that they decode as ASCII (producing mojibake if the input wasn't ASCII). ISO-2022-X is an example.
Some encodings are 8-bit, so that they can decode arbitrary bytes (again producing mojibake if the input wasn't that encoding). ISO-8859-X are examples, as are some of the EBCDIC encodings, and koi8-x. Also, things will successfully (but meaninglessly) decode as UTF-16 if the number of bytes in the input is even (likewise for UTF-32).
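A short illustration of why a successful decode proves little, in Python 3 syntax. The helper name decodes is hypothetical. ISO-8859-1 is used as the 8-bit example, and utf-16-le is chosen (an assumption made for reproducibility) so that the result does not depend on a byte order mark or on the platform's byte order.

```python
def decodes(data, encoding):
    """Return True if `data` decodes without error in `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


every_byte = bytes(range(256))

# An 8-bit codec such as ISO-8859-1 maps each byte to one character,
# so it accepts any input, meaningful or not.
print(decodes(every_byte, "iso-8859-1"))

# UTF-8 is stricter: bytes such as 0x80 are invalid on their own.
print(decodes(every_byte, "utf-8"))

# UTF-16 accepts this even-length input but rejects the odd-length one.
print(decodes(b"abcd", "utf-16-le"))
print(decodes(b"abc", "utf-16-le"))
```

For a guessing module this suggests trying the strict encodings first and treating a successful 8-bit decode as weak evidence only.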