Fails to decode non-ASCII characters in URL parameters


Flier Lu

May 25, 2011, 11:53:51 PM
to Tornado Web Server
As you know, the browser percent-encodes the non-ASCII characters in a
URL, for example:

http://localhost:8080/tag/%E9%A3%8E%E9%99%A9%E7%AE%A1%E7%90%86

The non-ASCII characters are encoded as UTF-8 first, and the resulting
bytes are then converted to a percent-encoded string:

   When a new URI scheme defines a component that represents textual
   data consisting of characters from the Universal Character Set [UCS],
   the data should first be encoded as octets according to the UTF-8
   character encoding [STD63]; then only those octets that do not
   correspond to characters in the unreserved set should be percent-
   encoded.  For example, the character A would be represented as "A",
   the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
   as "%C3%80", and the character KATAKANA LETTER A would be represented
   as "%E3%82%A2".

http://tools.ietf.org/html/rfc3986
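
For illustration, the RFC's three examples and the tag URL above can be reproduced with the standard library. This is only a minimal sketch for Python 3 (urllib.parse), not the Python 2 urllib that Tornado used at the time; the variable names are purely illustrative.

```python
from urllib.parse import quote, unquote

# The three characters used as examples in the RFC text quoted above:
# "A", LATIN CAPITAL LETTER A WITH GRAVE, KATAKANA LETTER A.
rfc_examples = ["A", "\u00c0", "\u30a2"]

# quote() encodes to UTF-8 by default, then percent-encodes every byte
# that is not an unreserved character.
encoded = [quote(ch) for ch in rfc_examples]
print(encoded)  # ['A', '%C3%80', '%E3%82%A2']

# Going the other way: unquote() percent-decodes and then decodes the
# bytes as UTF-8 by default, which recovers the tag from the URL above.
tag = unquote("%E9%A3%8E%E9%99%A9%E7%AE%A1%E7%90%86")
print(tag == "风险管理")  # True
```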

So Tornado should reverse these steps and pass the decoded string to
the request handler.

But if we define a handler with a parameter, like this:

URLSpec(r'/tag/(?P<name>.+)', TagHandler, name='tag'),

Tornado treats the whole URL path and its parameters as Unicode, so the
non-ASCII bytes are unquoted to an invalid string:

# web.py:1198
for spec in handlers:
    match = spec.regex.match(request.path)  # path is a unicode string
    if match:
        # None-safe wrapper around urllib.unquote to handle
        # unmatched optional groups correctly
        def unquote(s):
            if s is None: return s
            # it should be urllib.unquote(str(s)).decode('utf-8')
            return urllib.unquote(s)
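
To make the failure concrete, here is a rough sketch in Python 3. The excerpt above is Python 2, so this is an approximation and not Tornado's actual code. Decoding as latin-1 stands in for what Python 2's urllib.unquote does to a unicode string, where every %XX turns into one code point; unquote_utf8 is a hypothetical helper that shows the suggested fix of unquoting to bytes first and decoding those bytes as UTF-8.

```python
from urllib.parse import unquote, unquote_to_bytes


def unquote_utf8(s):
    """None-safe helper (hypothetical): percent-decode to bytes, then decode as UTF-8."""
    if s is None:
        return s
    return unquote_to_bytes(s).decode("utf-8")


raw = "%E9%A3%8E%E9%99%A9%E7%AE%A1%E7%90%86"

# Approximation of the bug: each percent-escape becomes its own character,
# so the 12 UTF-8 bytes end up as 12 unrelated code points.
broken = unquote(raw, encoding="latin-1")

# Suggested behaviour: the 12 bytes are decoded together into 4 characters.
fixed = unquote_utf8(raw)

print(len(broken), len(fixed))  # 12 4
```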


Flier Lu

May 26, 2011, 6:48:25 AM
to Tornado Web Server
It seems IE behaves differently: Firefox and Chrome encode the
non-ASCII characters as UTF-8 first, while IE encodes them with the
native encoding.
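
The difference is easy to see by percent-encoding the same tag both ways. A small Python 3 sketch follows; GBK is only an assumed example of a "native encoding" (nothing here establishes which encoding IE actually picked), and try_utf8 is a hypothetical helper.

```python
from urllib.parse import quote, unquote_to_bytes

tag = "风险管理"

# UTF-8 first, then percent-encoding (the Firefox/Chrome behaviour described above).
utf8_form = quote(tag)

# Assumption: a system whose native encoding is GBK.
native_form = quote(tag, encoding="gbk")


def try_utf8(percent_encoded):
    """Return the UTF-8 decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return unquote_to_bytes(percent_encoded).decode("utf-8")
    except UnicodeDecodeError:
        return None


print(utf8_form)                   # %E9%A3%8E%E9%99%A9%E7%AE%A1%E7%90%86
print(try_utf8(utf8_form) == tag)  # True
print(try_utf8(native_form))       # None
```

So a server that always assumes UTF-8 would reject, or garble, the path sent in the native encoding.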

Ben Darnell

May 28, 2011, 5:27:51 PM
to python-...@googlegroups.com
On Thu, May 26, 2011 at 3:48 AM, Flier Lu <flier.lu@gmail.com> wrote:
> It seems IE has different behavior, Firefox and Chrome will encode the
> non-ascii with UTF-8 first, IE encode it with the native encoding.

Yeah, this sort of thing is why I generally prefer to deal with bytes as much as possible (although Python 3 makes that difficult). However, we already assume UTF-8 for query parameters (by decoding UTF-8 in get_argument()), so we should do the same for path components for consistency. If you're having problems with IE not sending UTF-8, you may be able to trick it with a hidden form field: http://railssnowman.info/

-Ben

Ben Darnell

May 30, 2011, 2:43:19 AM
to python-...@googlegroups.com
It's now possible to decode URL components and query parameters with an encoding other than UTF-8, by overriding RequestHandler.decode_argument.

-Ben
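
A rough sketch of what such an override might look like. It assumes the signature given in current Tornado documentation, decode_argument(value, name=None), where value is the already percent-decoded byte string. The mixin name and the GBK fallback are hypothetical. It is written as a plain mixin so that it runs without Tornado installed; in an application it would be combined with tornado.web.RequestHandler.

```python
class FallbackDecodeMixin:
    """Hypothetical mixin: try UTF-8 first, then fall back to another encoding."""

    fallback_encoding = "gbk"  # assumption: the clients' native encoding

    def decode_argument(self, value, name=None):
        # value is a byte string that has already been percent-decoded.
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            return value.decode(self.fallback_encoding)


# Hypothetical usage inside a Tornado application:
#   class TagHandler(FallbackDecodeMixin, tornado.web.RequestHandler): ...

decoder = FallbackDecodeMixin()
print(decoder.decode_argument("风险管理".encode("utf-8")) == "风险管理")  # True
print(decoder.decode_argument("风险管理".encode("gbk")) == "风险管理")    # True
```

Trying UTF-8 first is only a heuristic: some byte sequences are valid in both encodings, so a fixed per-handler encoding may be the safer choice when the clients are known.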