urllib.unquote and unicode

George Sakkis

Dec 18, 2006, 9:57:38 PM
The following snippet results in different outcome for (at least) the
last three major releases:

>>> import urllib
>>> urllib.unquote(u'%94')

# Python 2.3.4
u'%94'

# Python 2.4.2
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
ordinal not in range(128)

# Python 2.5
u'\x94'

Is the current version the "right" one, or is this function supposed to
change every other week?

George

Leo Kislov

Dec 19, 2006, 12:02:58 AM

IMHO, none of the results is right. Either the unicode string should be
rejected by raising ValueError, or it should be encoded with the ascii
codec, making the result the same as
urllib.unquote(u'%94'.encode('ascii')), that is, '\x94'. You can consider
the current behaviour undefined: just as when you pass a random object
into some function, you can get a different outcome in different Python
versions.
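
Spelled out as a Python 2 session, the byte-string variant Leo describes
gives:

>>> import urllib
>>> urllib.unquote(u'%94'.encode('ascii'))
'\x94'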

-- Leo

Peter Otten

Dec 19, 2006, 3:57:49 AM
George Sakkis wrote:

> The following snippet results in different outcome for (at least) the
> last three major releases:
>
>>>> import urllib
>>>> urllib.unquote(u'%94')

> # Python 2.4.2
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
> ordinal not in range(128)

Python 2.4.3 (#3, Aug 23 2006, 09:40:15)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.


>>> import urllib
>>> urllib.unquote(u"%94")

u'\x94'
>>>

From the above I infer that the 2.4.2 behaviour was considered a bug.

Peter

Fredrik Lundh

Dec 19, 2006, 4:05:45 AM
to pytho...@python.org
George Sakkis wrote:

why are you passing non-ASCII Unicode strings to a function designed for
fixing up 8-bit strings in the first place? if you do proper encoding
before you quote things, it'll work the same way in all Python releases.
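
A sketch of that encode-first approach on Python 2 (assuming UTF-8 is the
encoding the application has settled on):

>>> import urllib
>>> quoted = urllib.quote(u'caf\xe9'.encode('utf-8'))
>>> quoted
'caf%C3%A9'
>>> urllib.unquote(quoted).decode('utf-8')
u'caf\xe9'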

</F>

Duncan Booth

Dec 19, 2006, 4:08:59 AM
"Leo Kislov" <Leo.K...@gmail.com> wrote:

I agree with you that none of the results is right, but not that the
behaviour should be undefined.

The way that uri encoding is supposed to work is that first the input
string in unicode is encoded to UTF-8 and then each byte which is not in
the permitted range for characters is encoded as % followed by two hex
characters.

That means that the string u'\x94' should be encoded as %c2%94. The
string %94 should generate a unicode decode error, but it should be the
utf-8 codec raising the error, not the ascii codec.
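
A Python 2 sketch of the scheme Duncan describes, done by hand with
explicit encode/decode steps (traceback abbreviated):

>>> import urllib
>>> urllib.quote(u'\x94'.encode('utf-8'))   # u'\x94' becomes %c2%94
'%C2%94'
>>> urllib.unquote('%94').decode('utf-8')   # a lone 0x94 is invalid UTF-8
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 0: ...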

Unfortunately RFC3986 isn't entirely clear-cut on this issue:

> When a new URI scheme defines a component that represents textual
> data consisting of characters from the Universal Character Set [UCS],
> the data should first be encoded as octets according to the UTF-8
> character encoding [STD63]; then only those octets that do not
> correspond to characters in the unreserved set should be percent-
> encoded. For example, the character A would be represented as "A",
> the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
> as "%C3%80", and the character KATAKANA LETTER A would be represented
> as "%E3%82%A2".

I think it leaves open the possibility that existing URI schemes which do
not support unicode characters can use other encodings, but given that the
original posting started by decoding a unicode string I think that utf-8
should definitely be assumed in this case.

Also, urllib.quote() should encode into utf-8 instead of throwing KeyError
for a unicode string.

George Sakkis

Dec 19, 2006, 8:24:25 AM

I'm using BeautifulSoup, which from version 3 returns Unicode only, and
I stumbled on a page with such bogus char encodings; I have the
impression that whatever generated it used ord() to encode reserved
characters instead of the proper hex representation in latin-1. If
that's the case, unquote() won't do anyway, and I'd have to go with
chr() on the number part.
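
A rough sketch of that chr()-based fallback; the helper name and the
two-to-three-digit bound are hypothetical, and it assumes the broken
generator wrote decimal ord() values after the '%':

import re

def unquote_ordinals(s):
    # Hypothetical: treat the digits after '%' as a decimal ord() value
    # rather than two hex digits. The {2,3} bound is a guess; real pages
    # may be ambiguous about where the number ends.
    return re.sub(r'%(\d{2,3})', lambda m: unichr(int(m.group(1))), s)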

George

"Martin v. Löwis"

Dec 19, 2006, 3:50:06 PM
to duncan...@suttoncourtenay.org.uk
Duncan Booth schrieb:

> The way that uri encoding is supposed to work is that first the input
> string in unicode is encoded to UTF-8 and then each byte which is not in
> the permitted range for characters is encoded as % followed by two hex
> characters.

Can you back up this claim ("is supposed to work") by reference to
a specification (ideally, chapter and verse)?

In URIs, it is entirely unspecified what the encoding is of non-ASCII
characters, and whether % escapes denote characters in the first place.

> Unfortunately RFC3986 isn't entirely clear-cut on this issue:
>
>> When a new URI scheme defines a component that represents textual
>> data consisting of characters from the Universal Character Set [UCS],
>> the data should first be encoded as octets according to the UTF-8
>> character encoding [STD63]; then only those octets that do not
>> correspond to characters in the unreserved set should be percent-
>> encoded. For example, the character A would be represented as "A",
>> the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
>> as "%C3%80", and the character KATAKANA LETTER A would be represented
>> as "%E3%82%A2".

This is irrelevant: it talks about new URI schemes only.

> I think it leaves open the possibility that existing URI schemes which do
> not support unicode characters can use other encodings, but given that the
> original posting started by decoding a unicode string I think that utf-8
> should definitely be assumed in this case.

No, the http scheme is defined by RFC 2616 instead. It doesn't really
talk about encodings, but hints at an interpretation in 3.2.3:

# When comparing two URIs to decide if they match or not, a client
# SHOULD use a case-sensitive octet-by-octet comparison of the entire
# URIs, [...]
# Characters other than those in the "reserved" and "unsafe" sets (see
# RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

Now, RFC 2396 already says that URIs are sequences of characters,
not sequences of octets, yet RFC 2616 fails to recognize that issue
and refuses to specify a character set for its scheme (which
RFC 2396 says that it could).

The conventional wisdom is that the choice of URI encoding for HTTP
is a server-side decision; for that reason, IRIs were introduced.

Regards,
Martin

Duncan Booth

Dec 20, 2006, 4:12:17 AM
"Martin v. Löwis" <mar...@v.loewis.de> wrote:

> Duncan Booth schrieb:
>> The way that uri encoding is supposed to work is that first the input
>> string in unicode is encoded to UTF-8 and then each byte which is not
>> in the permitted range for characters is encoded as % followed by two
>> hex characters.
>
> Can you back up this claim ("is supposed to work") by reference to
> a specification (ideally, chapter and verse)?

I'm not sure I have time to read the various RFCs in depth right now,
so I may have to come back to this thread later. The one thing I'm
convinced of is that the current implementations of urllib.quote and
urllib.unquote are broken with respect to their handling of unicode. In
particular, % encoding is defined in terms of octets, so when given a
unicode string urllib.quote should either encode it or throw a suitable
exception (not KeyError, which is what it seems to throw now).

My objection to urllib.unquote is that urllib.unquote(u'%a3') returns
u'\xa3', which is a character, not an octet. I think it should always
return a byte string, or it should calculate a byte string and then decode
it according to some suitable encoding, or it should throw an exception
[choose any of the above].
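
Compare the byte-string path on Python 2, where the caller stays in
control of the decode step:

>>> import urllib
>>> urllib.unquote('%a3')                    # octets in, octets out
'\xa3'
>>> urllib.unquote('%a3').decode('latin1')   # explicit, caller-chosen decode
u'\xa3'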

Adding an optional encoding parameter to quote/unquote would be one
option, although since you can encode/decode the parameter yourself it
doesn't add much.

> No, the http scheme is defined by RFC 2616 instead. It doesn't really
> talk about encodings, but hints an interpretation in 3.2.3:

The applicable RFC is 3986. See RFC2616 section 3.2.1:
> For definitive information on URL syntax and semantics, see "Uniform
> Resource Identifiers (URI):
> Generic Syntax and Semantics," RFC 2396 [42] (which replaces RFCs
> 1738 [4] and RFC 1808 [11]).

and RFC 2396:
> Obsoleted by: 3986


> Now, RFC 2396 already says that URIs are sequences of characters,
> not sequences of octets, yet RFC 2616 fails to recognize that issue
> and refuses to specify a character set for its scheme (which
> RFC 2396 says that it could).

and RFC 2277, section 3.1, says that it MUST identify which charset is
used (although that's just a best-practice document, not a standard). (The
block capitals are the RFC's, not mine.)

> The conventional wisdom is that the choice of URI encoding for HTTP
> is a server-side decision; for that reason, IRIs were introduced.

Yes, I know that in practice some systems use other character sets.

Walter Dörwald

Dec 21, 2006, 7:12:08 AM
to "Martin v. Löwis", pytho...@python.org
Martin v. Löwis wrote:
> Duncan Booth schrieb:
>> The way that uri encoding is supposed to work is that first the input
>> string in unicode is encoded to UTF-8 and then each byte which is not in
>> the permitted range for characters is encoded as % followed by two hex
>> characters.
>
> Can you back up this claim ("is supposed to work") by reference to
> a specification (ideally, chapter and verse)?
>
> In URIs, it is entirely unspecified what the encoding is of non-ASCII
> characters, and whether % escapes denote characters in the first place.

http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

Servus,
Walter

"Martin v. Löwis"

Dec 21, 2006, 3:29:49 PM
to Walter Dörwald
>>> The way that uri encoding is supposed to work is that first the input
>>> string in unicode is encoded to UTF-8 and then each byte which is not in
>>> the permitted range for characters is encoded as % followed by two hex
>>> characters.
>> Can you back up this claim ("is supposed to work") by reference to
>> a specification (ideally, chapter and verse)?
> http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

Thanks. Unfortunately, this isn't normative, but "we recommend". In
addition, it talks only about URIs found in HTML. If somebody writes
a user agent in Python, they are certainly free to follow
this recommendation - but I think this is a case where Python should
refuse the temptation to guess.

If somebody implemented IRIs, that would be an entirely different
matter.

Regards,
Martin

Duncan Booth

Dec 22, 2006, 4:13:29 AM
"Martin v. Löwis" <mar...@v.loewis.de> wrote:

>>>> The way that uri encoding is supposed to work is that first the
>>>> input string in unicode is encoded to UTF-8 and then each byte
>>>> which is not in the permitted range for characters is encoded as %
>>>> followed by two hex characters.
>>> Can you back up this claim ("is supposed to work") by reference to
>>> a specification (ideally, chapter and verse)?
>> http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1
>
> Thanks.

and thanks from me too.

> Unfortunately, this isn't normative, but "we recommend". In
> addition, it talks about URIs found HTML only. If somebody writes
> a user agent written in Python, they are certainly free to follow
> this recommendation - but I think this is a case where Python should
> refuse the temptation to guess.

So you believe that because something is only recommended by a standard,
Python should refuse to implement it? This is the kind of thinking that in
the 1980s gave us a version of gcc where any attempt to use #pragma (which
according to the standard invokes undefined behaviour) would spawn a copy
of nethack or rogue.

You don't seem to have realised yet, but my objection to the behaviour of
urllib.unquote is precisely that it does guess, and it guesses wrongly. In
fact it guesses latin1 instead of utf8. If it threw an exception for non-
ascii values, then it would match the standard (in the sense of not
following a recommendation because it doesn't have to) and it would be
purely a quality of implementation issue.

If you don't believe me that it guesses latin1, try it. For all valid URIs
(i.e. ignoring those with non-ascii characters already in them) in the
current implementation where u is a unicode object:

unquote(u)==unquote(u.encode('ascii')).decode('latin1')

I generally agree that Python should avoid guessing, so I wouldn't really
object if it threw an exception or always returned a byte string even
though the html standard recommends using utf8 and the uri rfc requires it
for all new uri schemes. However, in this case I think it would be useful
behaviour: e.g. a decent xml parser is going to give me back the attributes
including encoded uris in unicode. To handle those correctly you must
encode to ascii before unquoting. This is an avoidable pitfall in the
standard library.
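
For example, with an attribute value an XML parser has already handed
back as unicode (the value itself is invented for illustration):

>>> import urllib
>>> u = u'/doc/%C2%A3'    # hypothetical href attribute, already unicode
>>> urllib.unquote(u.encode('ascii')).decode('utf8')
u'/doc/\xa3'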

On second thoughts, perhaps the current behaviour is actually closer to:

unquote(u)==unquote(u.encode('latin1')).decode('latin1')

as that also matches the current behaviour for uris which contain non-ascii
characters when the characters have a latin1 encoding. To fully conform
with the html standard's recommendation it should actually be equivalent
to:

unquote(u)==unquote(u.encode('utf8')).decode('utf8')

The catch with the current behaviour is that it doesn't exactly mimic any
sensible behaviour at all. It decodes the escaped octets as though they
were latin1 encoded, but it mixes them into a unicode string so there is no
way to correct its bad guess. In other words the current behaviour is
actively harmful.
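
A quick Python 2.5 session showing the latin1 guess: the two UTF-8 octets
for LATIN CAPITAL LETTER A WITH GRAVE come back as two separate
characters.

>>> import urllib
>>> urllib.unquote(u'%C3%80')   # should be u'\xc0' under a UTF-8 reading
u'\xc3\x80'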

"Martin v. Löwis"

Dec 22, 2006, 10:28:28 AM
to duncan...@suttoncourtenay.org.uk
Duncan Booth schrieb:

> So you believe that because something is only recommended by a standard
> Python should refuse to implement it?

Yes. In the face of ambiguity, refuse the temptation to guess.

This is *deeply* ambiguous; people have been using all kinds of
encodings in http URLs.

> You don't seem to have realised yet, but my objection to the behaviour of
> urllib.unquote is precisely that it does guess, and it guesses wrongly.

Yes, it seems that this was a bad move.

Regards,
Martin
