Unicode in URLs

Jim Fulton

Jul 23, 2009, 7:06:52 AM
to paste...@googlegroups.com
webob doesn't convert URLs to Unicode. RFC3986 specifies that UCS
characters should be encoded in URIs using a UTF-8 encoding followed
by a URL encoding, so the reverse decoding is straightforward.
Reading the RFC, I can see how the decision of whether to interpret
URLs (or URL path segments) as encoded UCS characters might be
application specific.
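
For illustration, the round trip described above can be sketched with the Python standard library alone. This is only a sketch: the helper names encode_segment and decode_segment are made up for this example and are not part of webob.

```python
from urllib.parse import quote, unquote_to_bytes

def encode_segment(text):
    # RFC 3986 style: encode the characters as UTF-8 first,
    # then percent-encode the resulting bytes.
    return quote(text.encode("utf-8"), safe="")

def decode_segment(raw):
    # Reverse order: undo the percent-encoding to get bytes,
    # then interpret those bytes as UTF-8.
    return unquote_to_bytes(raw).decode("utf-8")

encoded = encode_segment("bête")
decoded = decode_segment(encoded)
```

Note that decode_segment raises UnicodeDecodeError if the bytes are not valid UTF-8, which is where the application-specific question comes in.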

My question is whether it was a design decision to leave URLs
un-decoded, and, if so, what the rationale is. I'm not necessarily
disagreeing with such a decision. :)

Jim

--
Jim Fulton

Ian Bicking

Jul 23, 2009, 5:06:16 PM
to Jim Fulton, paste...@googlegroups.com
I have intended to decode them, specifically req.path_info and req.script_name, using the same encoding as req.GET etc. (req.charset). I just haven't gotten around to making that change. I am a little worried about breaking people's code, but waiting doesn't make that any easier either.

--
Ian Bicking  |  http://blog.ianbicking.org  |  http://topplabs.org/civichacker

Chris McDonough

Jul 23, 2009, 5:18:18 PM
to Ian Bicking, Jim Fulton, paste...@googlegroups.com

If you do this, could you expose the decoded values under new names
(unicode_path_info, unicode_script_name, etc.)? We've already solidified a lot
of code around the fact that the existing attributes are not decoded.
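
A rough sketch of what such a split could look like. RequestSketch is a hypothetical stand-in, not webob's request class, and it assumes a Python 3 WSGI environ (PEP 3333), where PATH_INFO arrives as a str whose code points are the raw bytes decoded as latin-1.

```python
class RequestSketch:
    """Hypothetical request wrapper; not WebOb's actual class."""

    def __init__(self, environ, charset="utf-8"):
        self.environ = environ
        self.charset = charset

    @property
    def path_info(self):
        # Existing attribute: returned exactly as the server supplied it,
        # so code written against the un-decoded value keeps working.
        return self.environ.get("PATH_INFO", "")

    @property
    def unicode_path_info(self):
        # New attribute: recover the raw bytes (latin-1 is lossless here
        # under the PEP 3333 assumption), then decode with the charset.
        return self.path_info.encode("latin-1").decode(self.charset)

req = RequestSketch({"PATH_INFO": "/b\xc3\xaate"})
raw = req.path_info
text = req.unicode_path_info
```

Existing callers of path_info see no change; only code that opts in to unicode_path_info gets decoded text.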

- C

Jim Fulton

Jul 23, 2009, 5:37:46 PM
to Ian Bicking, paste...@googlegroups.com
On Thu, Jul 23, 2009 at 5:06 PM, Ian Bicking<ianbi...@gmail.com> wrote:
> On Thu, Jul 23, 2009 at 4:06 AM, Jim Fulton <j...@zope.com> wrote:
>>
>> webob doesn't convert URLs to Unicode.  RFC3986 specifies that UCS
>> characters should be encoded in URIs using a UTF-8 encoding followed
>> by a URL encoding, so the reverse decoding is straightforward.
>> Reading the RFC, I can see how the decision of whether to interpret
>> URLs (or URL path segments) as encoded UCS characters might be
>> application specific.
>>
>> My question is whether it was a design decision to leave URLs
>> un-decoded, and, if so, what the rationale is.  I'm not necessarily
>> disagreeing with such a decision.  :)
>
> I have intended to decode them, specifically req.path_info and
> req.script_name, using the same encoding as req.GET etc. (req.charset).

That would be inconsistent with RFC3986, which specifies utf-8.

Jim

--
Jim Fulton

Ian Bicking

Jul 23, 2009, 5:46:13 PM
to Jim Fulton, paste...@googlegroups.com

I guess it really depends on What The World Actually Does, and I'm not
sure in this case.  For instance, I'm pretty sure QUERY_STRING is
encoded with the page encoding, so presumably a URL could look like
/UTF8-urlencoded-data?latin1-urlencoded-data -- which of course may
actually be the case (after all, the browser doesn't generate the
path).  Also, what happens when you have <a href="/bête"> or something
in a page? The browser encodes unsafe characters in these cases.
So... I'm hoping someone who has experience with the more challenging
situations with encodings could say what happens.
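
To make the mixed case concrete, here is a small sketch using only the standard library. The function name decode_url and the choice of latin-1 for the query are assumptions for illustration; real browsers may well behave differently.

```python
from urllib.parse import urlsplit, unquote, parse_qs

def decode_url(url, path_encoding="utf-8", query_encoding="latin-1"):
    # Decode the path and the query string independently, since they
    # may have been percent-encoded from different character encodings.
    parts = urlsplit(url)
    path = unquote(parts.path, encoding=path_encoding, errors="strict")
    query = parse_qs(parts.query, encoding=query_encoding, errors="strict")
    return path, query

# Path encoded from UTF-8 bytes, query encoded from latin-1 bytes.
path, query = decode_url("/b%C3%AAte?q=caf%E9")
```

With errors="strict", decoding the query as UTF-8 instead would raise UnicodeDecodeError on %E9, so a single encoding for the whole URL would not handle this input.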

Sergey Schetinin

Jul 23, 2009, 7:14:48 PM
to Ian Bicking, Jim Fulton, paste...@googlegroups.com
I think unicode versions of those attrs should be separate and would
like to suggest the names upath_info, uscript_name (and an alias ubody
for unicode_body). My experience with non-ASCII URIs and form data made
me stick to ASCII and, if that's not possible, UTF-8. Still, links from
other websites to non-ASCII URIs sometimes make the user-agent send
the request in some other encoding. So I try to keep script_name /
path_info in ASCII and use POST for forms. Google seems to use an
additional field in the search form to specify what encoding the form
used (ie=..., probably meaning "input encoding", and oe=... for the
encoding of the page returned) and I think they only use ASCII for the
path component.
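
The extra-field idea could be sketched roughly like this. The function parse_query and its fallback behaviour are assumptions for illustration only, not a description of what Google actually does with ie=.

```python
import codecs
from urllib.parse import parse_qs

def parse_query(query, default="utf-8"):
    # First pass with latin-1, which never fails, just to read the
    # "ie" hint (its value is expected to be plain ASCII).
    hint = parse_qs(query, encoding="latin-1").get("ie", [default])[0]
    try:
        codecs.lookup(hint)
    except LookupError:
        # Unknown encoding name: fall back to the default.
        hint = default
    # Second pass decodes every field with the declared encoding.
    return parse_qs(query, encoding=hint, errors="replace")

with_hint = parse_query("ie=ISO-8859-1&q=caf%E9")
without_hint = parse_query("q=caf%C3%A9")
```

The hint only works for forms the site itself generates; a link from another website would still arrive without it.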
--
Best Regards,
Sergey Schetinin

http://s3bk.com/ -- S3 Backup
http://word-to-html.com/ -- Word to HTML Converter