Charset for URL decoding (#19468)


Aymeric Augustin

Dec 18, 2012, 2:43:06 AM
to django-d...@googlegroups.com
Hello,

#19468 sparked an interesting debate; Claude and I would like some feedback before making a decision.

Here's a summary of the problem.

Django must decode environ['PATH_INFO'] to obtain request.path, where decoding means:
1 - URL-decoding to a bytestring
2 - "charset-decoding" to a Unicode string

The question is: what charset should be used in step 2?

Candidates are:
a)





--
Aymeric.

Aymeric Augustin

Dec 18, 2012, 3:34:01 AM
to django-d...@googlegroups.com
(complete version follows)

Hello,

I'm looking for some feedback on #19468 before making a decision. It's one of the tickets that currently block the 1.5 release.

Here's a summary of the problem.

Django must decode environ['PATH_INFO'] to obtain request.path, where decoding means:
  1 - URL-decoding to a bytestring
  2 - "charset-decoding" to a Unicode string

The question is: which charset should be used in step 2? settings.DEFAULT_CHARSET or utf-8 (hardcoded)?

Of course, since the default value of DEFAULT_CHARSET is utf-8, this only makes a difference for the websites where it's been changed. They're probably a minority.

Currently, Django uses utf-8. As far as I can tell, that's more a side-effect of (ab)using force_str than anything else. It also has the drawback of making it impossible to serve perfectly legit HTTP URLs such as /caf%E9/. Try it: https://www.djangoproject.com/caf%E9/ returns a 400 with no content. I think I once saw a ticket about this, but I can't locate it right now.
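
To make the two steps concrete, here is a minimal sketch using only the Python 3 standard library. It is not Django's actual code: decode_path is a hypothetical helper, and the charset argument stands in for whichever choice is made in step 2.

```python
from urllib.parse import unquote_to_bytes


def decode_path(raw_path, charset="utf-8"):
    """Hypothetical helper: URL-decode, then charset-decode a path."""
    path_bytes = unquote_to_bytes(raw_path)  # step 1: percent-decoding to bytes
    return path_bytes.decode(charset)        # step 2: charset-decoding to text


# A UTF-8 encoded path decodes with the hardcoded default.
print(decode_path("/caf%C3%A9/"))

# The same word encoded as ISO-8859-1 decodes only with the matching charset.
print(decode_path("/caf%E9/", "iso-8859-1"))

try:
    decode_path("/caf%E9/")  # 0xE9 followed by "/" is not valid UTF-8
except UnicodeDecodeError as exc:
    print("cannot decode as utf-8:", exc.reason)
```

The last call is the situation that currently ends in a 400 response.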

If we switch to DEFAULT_CHARSET, we'll also have to change the reverse() function and the {% url %} tag to honor DEFAULT_CHARSET when it encodes URLs, so that URLs round-trip properly.
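
As a rough illustration of the round-trip requirement, the sketch below uses urllib.parse from the Python 3 standard library rather than reverse() itself. CHARSET is an assumed stand-in for a non-default DEFAULT_CHARSET, and encode_path / decode_path are hypothetical names.

```python
from urllib.parse import quote, unquote

CHARSET = "iso-8859-1"  # assumption: a site that changed DEFAULT_CHARSET


def encode_path(path, charset=CHARSET):
    """Hypothetical stand-in for the encoding done when building a URL."""
    return quote(path, encoding=charset)


def decode_path(raw_path, charset=CHARSET):
    """Hypothetical stand-in for the decoding done when reading PATH_INFO."""
    return unquote(raw_path, encoding=charset, errors="strict")


path = "/caf\u00e9/"

# Same charset on both sides: the path survives the round trip.
assert decode_path(encode_path(path)) == path

# Encoding with utf-8 but decoding with CHARSET yields mojibake instead.
mismatched = decode_path(encode_path(path, "utf-8"))
print(repr(mismatched))
```

If only the decoding side honored the setting, the second case is what every generated non-ASCII URL would look like on such a site.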

Arguments for DEFAULT_CHARSET / against UTF-8:
- The query string is already decoded with DEFAULT_CHARSET; it's weird to decode different parts of the URL with different charsets (principle of least astonishment).
- It should be possible to serve any valid HTTP URL with Django (see example above).

Arguments for UTF-8 / against DEFAULT_CHARSET:
- Browsers default to UTF-8 when they open non-ASCII URLs.
- Everyone should use UTF-8 everywhere anyway; HTTP only allows non-ASCII URLs for legacy reasons.

Do you have experience on this topic? What do you think?

-- 
Aymeric.

Łukasz Rekucki

Dec 18, 2012, 5:10:03 AM
to django-developers
On 18 December 2012 09:34, Aymeric Augustin <aymeric....@polytechnique.org> wrote:
> (complete version follows)
>
> Hello,
>
> I'm looking for some feedback on #19468 before making a decision. It's one of the tickets that currently block the 1.5 release.
>
> Here's a summary of the problem.
>
> Django must decode environ['PATH_INFO'] to obtain request.path, where decoding means:
>   1 - URL-decoding to a bytestring
>   2 - "charset-decoding" to a Unicode string
>
> The question is: which charset should be used in step 2? settings.DEFAULT_CHARSET or utf-8 (hardcoded)?


I wonder if UTF-8 with the "surrogateescape" error handler makes sense here. Python 3 uses it for decoding file-system paths, where it's not always possible to determine the charset. I think it's pretty much the same case. After all, the %-encoded bytes can be some binary data that's not possible to reasonably decode with any charset.
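
A minimal sketch of what that would look like, using the Python 3 standard library only (decode_path_lossless is a hypothetical name, not an existing Django function):

```python
from urllib.parse import unquote_to_bytes


def decode_path_lossless(raw_path):
    """Hypothetical helper: decode as UTF-8, smuggling undecodable bytes
    through as lone surrogates instead of raising an error."""
    return unquote_to_bytes(raw_path).decode("utf-8", errors="surrogateescape")


text = decode_path_lossless("/caf%E9/")

# The original bytes can be recovered exactly with the same error handler.
assert text.encode("utf-8", errors="surrogateescape") == b"/caf\xe9/"

# The result is not a clean text string, though: strict encoding fails.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print("not encodable without surrogateescape:", exc.reason)
```

The decoding never fails, but the resulting string carries surrogates that any later strict encoding step would have to cope with.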


--
Łukasz Rekucki

Karen Tracey

Dec 18, 2012, 8:26:27 AM
to django-d...@googlegroups.com
On Tue, Dec 18, 2012 at 3:34 AM, Aymeric Augustin <aymeric....@polytechnique.org> wrote:

> Currently, Django uses utf-8. As far as I can tell, that's more a side-effect of (ab)using force_str than anything else. It also has the drawback of making it impossible to serve perfectly legit HTTP URLs such as /caf%E9/. Try it: https://www.djangoproject.com/caf%E9/ returns a 400 with no content. I think I once saw a ticket about this, but I can't locate it right now.


I think it is:

https://code.djangoproject.com/ticket/5738

Comment #10 notes that utf-8 is what Django will use. However, with the last fix noted against that ticket, it is easier to subclass the request class for an installation where a different charset for decoding is desired.

Karen

Aymeric Augustin

Dec 18, 2012, 10:08:25 AM
to django-d...@googlegroups.com
2012/12/18 Karen Tracey <kmtr...@gmail.com>
> https://code.djangoproject.com/ticket/5738
>
> Comment #10 notes that utf-8 is what Django will use. However, with the last fix noted against that ticket, it is easier to subclass the request class for an installation where a different charset for decoding is desired.

Hi Karen,

Indeed, thanks.

I reviewed that ticket and I agree with the final fix. Sure, it's a bit sloppy to use a protocol-level message for an application-level requirement (in other words, to reply to a well-formed HTTP request with 400 Bad Request), but I can't think of a better solution when the URL cannot be decoded with the "expected charset".

Now, I'm proposing something slightly different. I think that the "expected charset" should be DEFAULT_CHARSET rather than "utf-8". If the URL cannot be decoded, the error handling designed in #5738 would still kick in.

I consider that Django already "strongly recommends" utf-8, as it's the default value of DEFAULT_CHARSET. If a developer changes this setting, I believe it should apply to the HTTP side of the application as a whole, and specifically, to the entire URL.

-- 
Aymeric.

Aymeric Augustin

Dec 18, 2012, 10:19:03 AM
to django-d...@googlegroups.com
2012/12/18 Łukasz Rekucki <lrek...@gmail.com>
> I wonder if UTF-8 with the "surrogateescape" error handler makes sense here. Python 3 uses it for decoding file-system paths, where it's not always possible to determine the charset. I think it's pretty much the same case. After all, the %-encoded bytes can be some binary data that's not possible to reasonably decode with any charset.

Hi Łukasz,

This proposal is about error handling, which was settled in #5738.

(Off-topic: another fallback strategy is to use iso-8859-1, because it maps 1-to-1 to bytes. WSGI on Python 3 takes advantage of this.)
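
To illustrate the 1-to-1 mapping, here is a small sketch in Python 3. The environ dict is built by hand to mimic what a WSGI server would pass, and wsgi_path is a hypothetical helper, not a Django or wsgiref function.

```python
# Every byte value maps to exactly one code point in iso-8859-1 and back.
all_bytes = bytes(range(256))
assert all_bytes.decode("iso-8859-1").encode("iso-8859-1") == all_bytes


def wsgi_path(environ, charset="utf-8"):
    """Hypothetical helper: recover the raw bytes from a WSGI-style
    string, then decode them with the charset the application expects."""
    raw = environ["PATH_INFO"].encode("iso-8859-1")
    return raw.decode(charset)


# Hand-built environ: the bytes of a UTF-8 path exposed as an iso-8859-1 string.
environ = {"PATH_INFO": b"/caf\xc3\xa9/".decode("iso-8859-1")}
print(wsgi_path(environ))
```

Because no byte is ever rejected, the first decoding step cannot fail; the charset question only comes up in the second one.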

--
Aymeric.