different encodings for unicode() and u''.encode(), bug?
Newsgroups: comp.lang.python
From: mario <ma...@ruggier.org>
Date: Wed, 2 Jan 2008 00:24:59 -0800 (PST)
Local: Wed, Jan 2 2008 3:24 am
Subject: different encodings for unicode() and u''.encode(), bug?
Hello!

I stumbled on this situation, that is if I decode some string, below
just the empty string, using the mcbs encoding, it succeeds, but if I
try to encode it back with the same encoding it surprisingly fails
with a LookupError. This seems like something to be corrected?

$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> s = ''
>>> unicode(s, 'mcbs')
u''
>>> unicode(s, 'mcbs').encode('mcbs')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mcbs

Best wishes to everyone for 2008!

mario


 
Newsgroups: comp.lang.python
From: "Martin v. Löwis" <mar...@v.loewis.de>
Date: Wed, 02 Jan 2008 09:30:00 +0100
Subject: Re: different encodings for unicode() and u''.encode(), bug?

> I stumbled on this situation, that is if I decode some string, below
> just the empty string, using the mcbs encoding, it succeeds, but if I
> try to encode it back with the same encoding it surprisingly fails
> with a LookupError. This seems like something to be corrected?

Indeed - in your code. It's not the same encoding.

> >>> unicode(s, 'mcbs')
> u''
> >>> unicode(s, 'mcbs').encode('mcbs')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> LookupError: unknown encoding: mcbs

Use "mbcs" in the second call, not "mcbs".

HTH,
Martin


 
Newsgroups: comp.lang.python
From: mario <ma...@ruggier.org>
Date: Wed, 2 Jan 2008 00:45:15 -0800 (PST)
Local: Wed, Jan 2 2008 3:45 am
Subject: Re: different encodings for unicode() and u''.encode(), bug?
On Jan 2, 9:30 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:

> Use "mbcs" in the second call, not "mcbs".

Oops, sorry about that; when I switched to test it in the interpreter
I mistyped "mbcs" as "mcbs". But note that I did it consistently ;-)
I.e. it was still the same encoding, even if maybe non-existent...?

If I try again using "mbcs" consistently, I still get the same error:

$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> unicode('', 'mbcs')
u''
>>> unicode('', 'mbcs').encode('mbcs')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mbcs


mario

 
Newsgroups: comp.lang.python
From: John Machin <sjmac...@lexicon.net>
Date: Wed, 2 Jan 2008 01:44:47 -0800 (PST)
Local: Wed, Jan 2 2008 4:44 am
Subject: Re: different encodings for unicode() and u''.encode(), bug?
On Jan 2, 7:45 pm, mario <ma...@ruggier.org> wrote:

Two things for you to do:

(1) Try these at the Python interactive prompt:

unicode('', 'latin1')
unicode('', 'mbcs')
unicode('', 'raboof')
unicode('abc', 'latin1')
unicode('abc', 'mbcs')
unicode('abc', 'raboof')

(2) Read what the manual (Library Reference -> codecs module ->
standard encodings) has to say about mbcs.


 
Newsgroups: comp.lang.python
From: John Machin <sjmac...@lexicon.net>
Date: Wed, 2 Jan 2008 02:47:01 -0800 (PST)
Local: Wed, Jan 2 2008 5:47 am
Subject: Re: different encodings for unicode() and u''.encode(), bug?
On Jan 2, 8:44 pm, John Machin <sjmac...@lexicon.net> wrote:

> (1) Try these at the Python interactive prompt:

> unicode('', 'latin1')

Also use those 6 cases to check out the difference in behaviour
between unicode(x, y) and x.decode(y)

 
Newsgroups: comp.lang.python
From: mario <ma...@ruggier.org>
Date: Wed, 2 Jan 2008 02:57:17 -0800 (PST)
Local: Wed, Jan 2 2008 5:57 am
Subject: Re: different encodings for unicode() and u''.encode(), bug?
On Jan 2, 10:44 am, John Machin <sjmac...@lexicon.net> wrote:

> Two things for you to do:

> (1) Try these at the Python interactive prompt:

> unicode('', 'latin1')
> unicode('', 'mbcs')
> unicode('', 'raboof')
> unicode('abc', 'latin1')
> unicode('abc', 'mbcs')
> unicode('abc', 'raboof')

$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('', 'mbcs')
u''
>>> unicode('abc', 'mbcs')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mbcs


Hmmn, strange. Same behaviour for "raboof".

> (2) Read what the manual (Library Reference -> codecs module ->
> standard encodings) has to say about mbcs.

Page at http://docs.python.org/lib/standard-encodings.html says that
mbcs "purpose":
Windows only: Encode operand according to the ANSI codepage (CP_ACP)

Do not know what the implications of encoding according to "ANSI
codepage (CP_ACP)" are. Windows only seems clear, but why does it only
complain when decoding a non-empty string (or when encoding the empty
unicode string) ?

mario


 
Newsgroups: comp.lang.python
From: John Machin <sjmac...@lexicon.net>
Date: Wed, 2 Jan 2008 03:28:38 -0800 (PST)
Local: Wed, Jan 2 2008 6:28 am
Subject: Re: different encodings for unicode() and u''.encode(), bug?
On Jan 2, 9:57 pm, mario <ma...@ruggier.org> wrote:

> Do not know what the implications of encoding according to "ANSI
> codepage (CP_ACP)" are.

Neither do I. YAGNI (especially on darwin) so don't lose any sleep
over it.

> Windows only seems clear, but why does it only
> complain when decoding a non-empty string (or when encoding the empty
> unicode string) ?

My presumption: because it doesn't need a codec to decode '' into u'';
no failed codec look-up, so no complaint. Any realistic app will try
to decode a non-empty string sooner or later.

 
Newsgroups: comp.lang.python
From: mario <ma...@ruggier.org>
Date: Wed, 2 Jan 2008 04:16:16 -0800 (PST)
Local: Wed, Jan 2 2008 7:16 am
Subject: Re: different encodings for unicode() and u''.encode(), bug?
On Jan 2, 12:28 pm, John Machin <sjmac...@lexicon.net> wrote:

> On Jan 2, 9:57 pm, mario <ma...@ruggier.org> wrote:

> > Do not know what the implications of encoding according to "ANSI
> > codepage (CP_ACP)" are.

> Neither do I. YAGNI (especially on darwin) so don't lose any sleep
> over it.

> > Windows only seems clear, but why does it only
> > complain when decoding a non-empty string (or when encoding the empty
> > unicode string) ?

> My presumption: because it doesn't need a codec to decode '' into u'';
> no failed codec look-up, so no complaint. Any realistic app will try
> to decode a non-empty string sooner or later.

Yes, I suspect I will never need it ;)

Incidentally, the situation is that in a script that tries to guess a
file's encoding, it bombed on the file ".svn/empty-file" -- but why it
was going so far with an empty string was really due to a bug
elsewhere in the script, trivially fixed. Still, I was curious about
this non-symmetric behaviour for the empty string by some encodings.

Anyhow, thanks a lot to both of you for the great feedback!

mario


 
Newsgroups: comp.lang.python
From: Piet van Oostrum <p...@cs.uu.nl>
Date: Wed, 02 Jan 2008 14:25:48 +0100
Local: Wed, Jan 2 2008 8:25 am
Subject: Re: different encodings for unicode() and u''.encode(), bug?

>>>>> mario <ma...@ruggier.org> (M) wrote:
>M> $ python
>M> Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
>M> [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
>M> Type "help", "copyright", "credits" or "license" for more information.
>M> >>> unicode('', 'mbcs')
>M> u''
>M> >>> unicode('abc', 'mbcs')
>M> Traceback (most recent call last):
>M>   File "<stdin>", line 1, in <module>
>M> LookupError: unknown encoding: mbcs

>M> Hmmn, strange. Same behaviour for "raboof".

Apparently for the empty string the encoding is irrelevant as it will not
be used. I guess there is an early check for this special case in the code.
--
Piet van Oostrum <p...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org

 
Newsgroups: comp.lang.python
From: "Martin v. Löwis" <mar...@v.loewis.de>
Date: Wed, 02 Jan 2008 21:48:30 +0100
Local: Wed, Jan 2 2008 3:48 pm
Subject: Re: different encodings for unicode() and u''.encode(), bug?

> Do not know what the implications of encoding according to "ANSI
> codepage (CP_ACP)" are. Windows only seems clear, but why does it only
> complain when decoding a non-empty string (or when encoding the empty
> unicode string) ?

It has no implications for this issue here. CP_ACP is a Microsoft
invention of a specific encoding alias - the "ANSI code page"
(as Microsoft calls it) is not a specific encoding where I could
specify a mapping from bytes to characters, but instead a
system-global indirection based on a language default. For example,
in the Western-European/U.S. version of Windows, the default for
CP_ACP is cp1252 (a local installation may change that default,
system-wide).
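
Since "mbcs" is only registered on Windows builds, a script can ask the codec registry up front whether a name resolves on the current platform. A minimal sketch for Python 3 (the helper name `encoding_available` is hypothetical, chosen for illustration); it relies only on `codecs.lookup`, which raises `LookupError` for names it cannot resolve:

```python
import codecs


def encoding_available(name):
    """Return True if the codec registry can resolve `name` on this platform."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


if __name__ == "__main__":
    # "mbcs" is expected to resolve only on Windows; "raboof" nowhere.
    for name in ("latin-1", "mbcs", "raboof"):
        print(name, encoding_available(name))
```

The check is independent of the data being converted, so it gives the same answer for empty and non-empty input.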

The likely cause is the one Piet also guessed: if the
input is an empty string, no attempt is made to actually run
the codec, and the output is assumed to be an empty
string again. This is correct behavior for all codecs that Python
supports in its default installation, at least for the direction
bytes->unicode. For the reverse direction, such an optimization
would be incorrect; consider u"".encode("utf-16").
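
The UTF-16 case can be checked directly. A small sketch in Python 3 syntax (where `u""` is simply `""`); the variable names are illustrative only. Encoding the empty string as UTF-8 gives no bytes at all, while the generic "utf-16" codec still emits a byte order mark:

```python
import codecs

empty_utf8 = "".encode("utf-8")
empty_utf16 = "".encode("utf-16")

print(repr(empty_utf8))   # no bytes
print(repr(empty_utf16))  # a 2-byte BOM, in the machine's native byte order

# The BOM is one of the two constants the codecs module defines.
assert empty_utf16 in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding consumes the BOM again, so the round trip still yields "".
assert empty_utf16.decode("utf-16") == ""
```

So "empty in, empty out" holds for decoding here, but not for encoding.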

HTH,
Martin


 
Newsgroups: comp.lang.python
From: mario <ma...@ruggier.org>
Date: Thu, 3 Jan 2008 13:03:08 -0800 (PST)
Local: Thurs, Jan 3 2008 4:03 pm
Subject: Re: different encodings for unicode() and u''.encode(), bug?
On Jan 2, 2:25 pm, Piet van Oostrum <p...@cs.uu.nl> wrote:

> Apparently for the empty string the encoding is irrelevant as it will not
> be used. I guess there is an early check for this special case in the code.

In the module I am working on [*] I am remembering a failed encoding
to allow me, if necessary, to later re-process fewer encodings. In the
case of an empty string AND an unknown encoding this strategy
failed...

Anyhow, the question is, should the behaviour be the same for these
operations, and if so what should it be:

u"".encode("non-existent")
unicode("", "non-existent")

mario

[*] a module to decode heuristically, that imho is actually starting
to look quite good, it is at http://gizmojo.org/code/decodeh/ and any
comments very welcome.


 
Newsgroups: comp.lang.python
From: John Machin <sjmac...@lexicon.net>
Date: Thu, 3 Jan 2008 15:02:46 -0800 (PST)
Local: Thurs, Jan 3 2008 6:02 pm
Subject: Re: different encodings for unicode() and u''.encode(), bug?
On Jan 4, 8:03 am, mario <ma...@ruggier.org> wrote:

> On Jan 2, 2:25 pm, Piet van Oostrum <p...@cs.uu.nl> wrote:

> > Apparently for the empty string the encoding is irrelevant as it will not
> > be used. I guess there is an early check for this special case in the code.

> In the module I am working on [*] I am remembering a failed encoding
> to allow me, if necessary, to later re-process fewer encodings.

If you were in fact doing that, you would not have had a problem. What
you appear to have been doing is (a) remembering a NON-failing
encoding, and assuming that it would continue not to fail (b) not
differentiating between failure reasons (codec doesn't exist, input
not consistent with specified encoding).

A good strategy when dealing with encodings that are unknown (in the
sense that they come from user input, or a list of encodings you got
out of the manual, or are constructed on the fly (e.g. encoding = 'cp'
+ str(code_page_number) # old MS Excel files)) is to try to decode
some vanilla ASCII alphabetic text, so that you can give an immediate
in-context error message.
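
One way to read that advice, sketched for Python 3. The function name `probe_encoding` and the three status strings are assumptions made for illustration, not part of any library. The point is to keep the two failure reasons apart: `LookupError` means the codec does not exist, while `UnicodeError` means the codec exists but rejects the sample:

```python
def probe_encoding(name, sample=b"abc"):
    """Classify an encoding name by decoding a small ASCII sample.

    Returns "ok", "unknown" (no such codec), or "mismatch" (the codec
    exists but cannot decode the sample).
    """
    try:
        sample.decode(name)
    except LookupError:
        return "unknown"
    except UnicodeError:
        return "mismatch"
    return "ok"


if __name__ == "__main__":
    for name in ("latin-1", "utf-16", "raboof"):
        print(name, probe_encoding(name))
```

A "mismatch" from the probe does not mean the codec is unusable: "utf-16" rejects the three-byte sample only because it needs an even number of bytes.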

> In the
> case of an empty string AND an unknown encoding this strategy
> failed...

> Anyhow, the question is, should the behaviour be the same for these
> operations, and if so what should it be:

> u"".encode("non-existent")
> unicode("", "non-existent")

Perhaps you should make TWO comparisons:
(1)
    unistrg = strg.decode(encoding)
with
    unistrg = unicode(strg, encoding)
[the latter "optimises" the case where strg is ''; the former can't
because its output may be '', not u'', depending on the encoding, so
it must do the lookup]
(2)
    unistrg = strg.decode(encoding)
with
    strg = unistrg.encode(encoding)
[both always do the lookup]
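
If the lookup should happen no matter what the input is, it can be forced explicitly. A minimal Python 3 sketch, assuming a hypothetical helper named `strict_decode`; it resolves the codec through `codecs.lookup` first, so an unknown name fails even for empty input:

```python
import codecs


def strict_decode(data, encoding, errors="strict"):
    """Decode bytes, always resolving the codec first.

    codecs.lookup() raises LookupError for an unknown name, so the
    outcome does not depend on whether `data` happens to be empty.
    """
    codec_info = codecs.lookup(encoding)
    # The stateless decoder returns (text, number_of_bytes_consumed).
    text, _consumed = codec_info.decode(data, errors)
    return text


if __name__ == "__main__":
    print(repr(strict_decode(b"", "latin-1")))
    try:
        strict_decode(b"", "raboof")
    except LookupError as exc:
        print("LookupError:", exc)
```

This costs one registry lookup per call, which Python caches after the first successful resolution of a name.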

In any case, a pointless question (IMHO); the behaviour is extremely
unlikely to change, as the chance of breaking existing code outvotes
any desire to clean up a minor inconsistency that is easily worked
around.


 
Newsgroups: comp.lang.python
From: mario <ma...@ruggier.org>
Date: Sat, 12 Jan 2008 00:58:42 -0800 (PST)
Local: Sat, Jan 12 2008 3:58 am
Subject: Re: different encodings for unicode() and u''.encode(), bug?
On Jan 4, 12:02 am, John Machin <sjmac...@lexicon.net> wrote:

> On Jan 4, 8:03 am, mario <ma...@ruggier.org> wrote:
> > On Jan 2, 2:25 pm, Piet van Oostrum <p...@cs.uu.nl> wrote:

> > > Apparently for the empty string the encoding is irrelevant as it will not
> > > be used. I guess there is an early check for this special case in the code.

> > In the module I am working on [*] I am remembering a failed encoding
> > to allow me, if necessary, to later re-process fewer encodings.

> If you were in fact doing that, you would not have had a problem. What
> you appear to have been doing is (a) remembering a NON-failing
> encoding, and assuming that it would continue not to fail

Yes, exactly. But there is no difference which ones I remember as the
two subsets will anyway add up to always the same thing. In this
special case (empty string!) the unicode() call does not fail...

> (b) not
> differentiating between failure reasons (codec doesn't exist, input
> not consistent with specified encoding).

There is no failure in the first pass in this case... if I do as you
suggest further down, that is to use s.decode(encoding) instead of
unicode(s, encoding) to force the lookup, then I could remember the
failure reason to be able to make a decision about how to proceed.
However I am aiming at an automatic decision, thus an in-context error
message would need to be replaced with more rigorous information about how
the guessing should proceed. I am also trying to keep this simple ;)

<snip>

> In any case, a pointless question (IMHO); the behaviour is extremely
> unlikely to change, as the chance of breaking existing code outvotes
> any desire to clean up a minor inconsistency that is easily worked
> around.

Yes, I would agree. The work around may not even be worth it though,
as what I really want is a unicode object, so changing from calling
unicode() to s.decode() is not quite right, and will anyway require a
further check. Less clear code, and a little unnecessary performance
hit for the 99.9% majority of cases... Anyhow, I have improved a little
further the "post guess" checking/refining logic of the algorithm [*].

What I'd like to understand better is the "compatibility hierarchy" of
known encodings, in the positive sense that if a string decodes
successfully with encoding A, then it is also possible that it will
encode with encodings B, C; and in the negative sense that is if a
string fails to decode with encoding A, then for sure it will also
fail to decode with encodings B, C. Any ideas if such an analysis of
the relationships between encodings exists?

Thanks! mario

[*] http://gizmojo.org/code/decodeh/


 
Newsgroups: comp.lang.python
From: "Martin v. Löwis" <mar...@v.loewis.de>
Date: Sun, 13 Jan 2008 00:19:05 +0100
Local: Sat, Jan 12 2008 6:19 pm
Subject: Re: different encodings for unicode() and u''.encode(), bug?

> What I'd like to understand better is the "compatibility hierarchy" of
> known encodings, in the positive sense that if a string decodes
> successfully with encoding A, then it is also possible that it will
> encode with encodings B, C; and in the negative sense that is if a
> string fails to decode with encoding A, then for sure it will also
> fail to decode with encodings B, C. Any ideas if such an analysis of
> the relationships between encodings exists?

Most certainly. You'll have to learn a lot about many encodings though
to really understand the relationships.

Many encodings X are "ASCII supersets", in the sense that if you have
only characters in the ASCII set, the encoding of the string in ASCII
is the same as the encoding of the string in X. ISO-8859-X, ISO-2022-X,
koi8-x, and UTF-8 fall in this category.

Other encodings are "ASCII supersets" only in the sense that they
include all characters of ASCII, but encode them differently. EBCDIC
and UCS-2/4, UTF-16/32 fall in that category.
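
The two kinds of "ASCII superset" can be told apart empirically. A small Python 3 sketch; `is_ascii_compatible` and `PRINTABLE_ASCII` are names made up for illustration, and testing one sample string is a spot check, not a proof about the whole encoding:

```python
# Printable ASCII, space through tilde.
PRINTABLE_ASCII = "".join(chr(c) for c in range(0x20, 0x7F))


def is_ascii_compatible(encoding, sample=PRINTABLE_ASCII):
    """True if `sample` encodes to the same bytes as it does in ASCII."""
    return sample.encode(encoding) == sample.encode("ascii")


if __name__ == "__main__":
    for name in ("utf-8", "latin-1", "koi8-r", "cp500", "utf-16", "utf-32"):
        print(name, is_ascii_compatible(name))
```

Here "cp500" stands in for the EBCDIC family; it can represent every ASCII character, but with different byte values.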

Some encodings are 7-bit, so that they decode as ASCII (producing
mojibake if the input wasn't ASCII). ISO-2022-X is an example.

Some encodings are 8-bit, so that they can decode arbitrary bytes
(again producing mojibake if the input wasn't that encoding).
ISO-8859-X are examples, as are some of the EBCDIC encodings, and
koi8-x. Also, things will successfully (but meaninglessly) decode
as UTF-16 if the number of bytes in the input is even (likewise
for UTF-32).
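
This is why a successful decode says little about which encoding the data was really in. A Python 3 sketch with a hypothetical helper `decodes`; Latin-1 stands in for the 8-bit case because it assigns a character to every byte value:

```python
# Every possible byte value, once.
ALL_BYTES = bytes(range(256))


def decodes(data, encoding):
    """True if `data` decodes without error under `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print("latin-1, all bytes:", decodes(ALL_BYTES, "latin-1"))
    print("utf-8, all bytes:", decodes(ALL_BYTES, "utf-8"))
    print("utf-16-le, b'abcd':", decodes(b"abcd", "utf-16-le"))
    print("utf-16-le, b'abc':", decodes(b"abc", "utf-16-le"))
```

Two caveats apply. Some 8-bit code pages leave byte values unassigned, so not every one of them accepts arbitrary input. Even-length input can also still fail as UTF-16 when it contains unpaired surrogate code units.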

HTH,
Martin


 