Request data encoding

Daniel Brandt

Aug 2, 2007, 3:08:32 PM
to django-d...@googlegroups.com
I was cleaning up a django app to work with the unicode merge, when I
noticed a problem that has taken me a good while to debug, and I
still don't have the full picture.

I am receiving POST-data that is submitted to my application not via a
form or a browser, but from other web applications, according to a
known protocol. This data may or may not have the charset of the data
set in the Content-Type header.

When data comes this way, the Request object I handle in my view never
has its charset attribute set. This leads to a situation where the
POST data, when it is converted to unicode, will have all non-ascii
characters translated to \ufffd (the Unicode replacement character,
which stands in for bytes that could not be decoded). Not exactly great
when most of the data that comes this way is latin-1 encoded.

I solved it temporarily like this:

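    # Take the charset from the Content-Type header when one is given,
    # otherwise assume ISO-8859-1.
    content_type = request.META.get('CONTENT_TYPE', '')
    i = content_type.find('charset=')
    if i >= 0:
        request.encoding = content_type[i + 8:].split(';')[0].strip()
    else:
        request.encoding = 'ISO-8859-1'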

.. which is far from good or fail-safe, but handles the situation for
the moment at least.
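
For anyone reading along, here is one way the header parsing above could be made less fragile. It is only a sketch, written for Python 3 with the standard library's email.message module; the helper name charset_from_content_type and its default argument are made up for illustration and are not part of Django.

```python
from email.message import Message


def charset_from_content_type(content_type, default="ISO-8859-1"):
    """Return the charset parameter of a Content-Type value, or default.

    Hypothetical helper for illustration; not a Django API.
    """
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and strips quotes.
    # It returns None when the header has no charset parameter.
    return msg.get_content_charset() or default


print(charset_from_content_type("application/x-www-form-urlencoded; charset=UTF-8"))
print(charset_from_content_type("application/x-www-form-urlencoded"))
```

In a view this could be used as request.encoding = charset_from_content_type(request.META.get('CONTENT_TYPE')), which keeps the ISO-8859-1 fallback from the snippet above.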

Of course this needs a more elegant solution, but I'm unsure what
would solve this in a good way. I do feel though, that the framework
ought to handle the unicode conversion better in this case.

Any comments?

Regards,
Daniel

Jacob Kaplan-Moss

Aug 2, 2007, 4:39:16 PM
to django-d...@googlegroups.com
On 8/2/07, Daniel Brandt <daniel...@gmail.com> wrote:
> I am receiving POST-data that is submitted to my application not via a
> form or a browser, but from other web applications, according to a
> known protocol. This data may or may not have the charset of the data
> set in the Content-Type header.

Yuck, clients that don't speak HTTP correctly make me angry.

Reading the RFC, though, I see that since HTTP 1.0 made "charset"
optional, it remains so in HTTP 1.1, and we're supposed to "guess" and
use ISO-8859-1 like you're doing in your code snippet. I suppose that
means that Django's request object should do pretty much what you've
done in this snippet.

Care to try your hand at whipping up a patch?

Jacob

Gábor Farkas

Aug 2, 2007, 5:02:07 PM
to django-d...@googlegroups.com

i made some quick tests (on-disk html file + netcat),
and it seems that firefox does not send the charset at all when it
submits a form...

i admit it was only a quick test, so maybe i did it wrong, but if not,
then it's perhaps not always a good idea to guess the charset...

gabor

ludvig.ericson

Aug 2, 2007, 5:06:27 PM
to Django developers

On Aug 2, 11:02 pm, Gábor Farkas <ga...@nekomancer.net> wrote:
> Jacob Kaplan-Moss wrote:

> > On 8/2/07, Daniel Brandt <daniel.bra...@gmail.com> wrote:
> >> I am receiving POST-data that is submitted to my application not via a
> >> form or a browser, but from other web applications, according to a
> >> known protocol. This data may or may not have the charset of the data
> >> set in the Content-Type header.
>
> > Yuck, clients that don't speak HTTP correctly make me angry.
>
> > Reading the RFC, though, I see that since HTTP 1.0 made "charset"
> > optional, it remains so in HTTP 1.1, and we're supposed to "guess" and
> > use ISO-8859-1 like you're doing in your code snippet. I suppose that
> > means that Django's request object should do pretty much what you've
> > done in this snippet.
>
> > Care to try your hand at whipping up a patch?
>
> i made some quick tests (on-disk html file + netcat),
> and it seems that firefox does not send the charset at all when it
> submits a form...

Surely this is done if you set the enctype correctly?

There's also the possibility that Firefox looks for HTTP headers
telling it what charsets are acceptable -- though I forgot the name of
said header, it's one of the Accept-* ones.

> i admit it was only a quick test, so maybe i did it wrong, but if not,
> then it's perhaps not always a good idea to guess the charset...
>
> gabor

Ludvig Ericson

Daniel Brandt

Aug 2, 2007, 5:08:44 PM
to django-d...@googlegroups.com

Sure.. Will do when I get to work in the morning. I'll notify you
people here on the list when I'm done creating a ticket with a first
stab at a patch.

// D

Craig Ogg

Aug 2, 2007, 6:14:55 PM
to django-d...@googlegroups.com
On 8/2/07, ludvig.ericson <ludvig....@gmail.com> wrote:
> On Aug 2, 11:02 pm, Gábor Farkas <ga...@nekomancer.net> wrote:
> > Jacob Kaplan-Moss wrote:
> > > On 8/2/07, Daniel Brandt <daniel.bra...@gmail.com> wrote:
> > >> I am receiving POST-data that is submitted to my application not via a
> > >> form or a browser, but from other web applications, according to a
> > >> known protocol. This data may or may not have the charset of the data
> > >> set in the Content-Type header.
> >
> There's also the possibility that Firefox looks for HTTP headers
> telling it what charsets are acceptable -- though I forgot the name of
> said header, it's one of the Accept-* ones.
>

There is actually very inconsistent browser behavior when it comes to
letting an app know the charset of a form.

I hit this when developing a bookmarklet that would submit forms from
web pages on arbitrary servers. From my experience, here is the short
version:

1. There is no way to reliably know the charset of a url-encoded form
across all browsers from the content of the submission, but the
charset of the submitted form will be the same charset used to render
the form.

2. You can use a hidden field _charset_ on IE and Firefox (but not
Safari) to reliably get the charset.

3. W3C now recommends using multipart/form-data for non-ASCII data
(essentially all forms) [1]:

"The content type "application/x-www-form-urlencoded" is inefficient
for sending large quantities of binary data or text containing
non-ASCII characters. The content type "multipart/form-data" should be
used for submitting forms that contain files, non-ASCII data, and
binary data."

(For a decent overview of the issue, but a little dated, see [2])

IMO, the best route forward for django would be to assume that the
decoding should be done using the same charset the site is using to
render pages. If the developer has special needs, they can use
_charset_ or other means to determine the charset and handle the
encoding.
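
To illustrate the _charset_ idea from point 2, here is a rough Python 3 sketch that decodes a url-encoded body, using the _charset_ field when the client filled it in and a site default otherwise. The function name decode_form and its default_charset argument are invented for this example, and it assumes the client percent-encoded every non-ASCII byte.

```python
import codecs
from urllib.parse import parse_qsl


def decode_form(body, default_charset="utf-8"):
    """Decode a urlencoded body, honouring a _charset_ field if present.

    Hypothetical helper; repeated field names collapse to the last value.
    """
    # Assumption: all non-ASCII bytes arrive percent-encoded.
    text = body.decode("ascii")
    # First pass with ISO-8859-1, which never fails, just to read _charset_.
    fields = dict(parse_qsl(text, keep_blank_values=True, encoding="iso-8859-1"))
    charset = fields.get("_charset_") or default_charset
    try:
        codecs.lookup(charset)
    except LookupError:
        charset = default_charset  # unknown name: fall back to the site default
    # Second pass with the chosen charset.
    return dict(parse_qsl(text, keep_blank_values=True,
                          encoding=charset, errors="replace"))


print(decode_form(b"name=caf%E9&_charset_=ISO-8859-1"))
```

The two-pass approach is a design choice for the sketch: the field names and the _charset_ value are ASCII, so they read the same under any of the encodings discussed here.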

As an aside, I also found that virtually all browsers actually use
Windows-1252 when they say they are using Latin-1 (on Windows, Mac and
Linux at least). The easiest test for this is the trademark symbol
(tm), which doesn't exist in ISO-8859-1. This is described in
Wikipedia [3] and can be seen by setting the encoding in your browser
while viewing this page [4] for the Palm Treo, which has a literal (tm)
in it that renders fine when the browser is set to ISO-8859-1. For the
greatest compatibility with browsers, ISO-8859-1 should also be treated
as Windows-1252.
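
The trademark test is easy to reproduce in Python 3. The byte string below is a made-up sample; the only assumption is that 0x99 is the byte a browser would send for (tm) when it claims Latin-1 but really uses Windows-1252.

```python
raw = b"Treo\x99"  # 0x99 is the trademark sign in Windows-1252

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

print(repr(as_cp1252))  # the trademark sign, U+2122
print(repr(as_latin1))  # a C1 control character, U+0099
```

Both decodes succeed, which is the awkward part: ISO-8859-1 silently yields a control character instead of raising an error.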

I am new to django and this list, so I hope this email is constructive
and helpful.

Craig

[1] http://www.w3.org/TR/html40/interact/forms.html#submit-format
[2] http://www.crazysquirrel.com/computing/general/form-encoding.jspx
[3] http://en.wikipedia.org/wiki/ISO_8859-1
[4] http://www.palm.com/us/products/smartphones/treo650/

Jacob Kaplan-Moss

Aug 2, 2007, 7:01:03 PM
to django-d...@googlegroups.com
On 8/2/07, Craig Ogg <crai...@gmail.com> wrote:
> I am new to django and this list, so I hope this email is constructive
> and helpful.

Amazingly useful!

I'm still digesting all that tasty informational goodness, but I'm
pretty sure you're right that we should assume the current default
charset instead of hardcoding ISO-whatever-1.

Jacob

Simon Willison

Aug 2, 2007, 8:19:51 PM
to Django developers
On Aug 2, 9:39 pm, "Jacob Kaplan-Moss" <jacob.kaplanm...@gmail.com>
wrote:

> Yuck, clients that don't speak HTTP correctly make me angry.
>
> Reading the RFC, though, I see that since HTTP 1.0 made "charset"
> optional, it remains so in HTTP 1.1, and we're supposed to "guess" and
> use ISO-8859-1 like you're doing in your code snippet. I suppose that
> means that Django's request object should do pretty much what you've
> done in this snippet.

This is a totally ridiculous flaw with the HTTP spec - you literally
have no reliable way of telling what encoding a request coming in to
your site uses, since you can't be absolutely sure that the user-agent
read a page from your site to find out your character encoding!

One really smart trick you can do is this: attempt to decode as UTF-8
(which is nice and strict and will fail noisily for pretty much
anything that isn't either UTF-8 or ASCII, a UTF-8 subset). If
decoding fails, assume ISO-8859-1 which will decode absolutely
anything without ever throwing an error (although if the content isn't
ISO-8859-1 you'll end up with garbage). I tend to call this the Flickr
trick, because of the lovely big letters here:
http://www.flickr.com/services/api/misc.encoding.html

If it really matters, you can use Mark Pilgrim's chardet library to
detect the most likely encoding based on statistical analysis:
http://chardet.feedparser.org/
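
A minimal Python 3 sketch of the UTF-8 first trick described above, with a made-up function name (decode_lenient):

```python
def decode_lenient(raw):
    """Try strict UTF-8 first, then fall back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail,
        # but the result is garbage if the data was in some other encoding.
        return raw.decode("iso-8859-1")


print(decode_lenient("café".encode("utf-8")))
print(decode_lenient(b"caf\xe9"))
```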

Jacob Kaplan-Moss

Aug 2, 2007, 8:33:30 PM
to django-d...@googlegroups.com
On 8/2/07, Simon Willison <swil...@gmail.com> wrote:
> This is a totally ridiculous flaw with the HTTP spec - you literally
> have no reliable way of telling what encoding a request coming in to
> your site uses, since you can't be absolutely sure that the user-agent
> read a page from your site to find out your character encoding!

W3C FTW!

> One really smart trick you can do is this: attempt to decode as UTF-8
> (which is nice and strict and will fail noisily for pretty much
> anything that isn't either UTF-8 or ASCII, a UTF-8 subset). If
> decoding fails, assume ISO-8859-1 which will decode absolutely
> anything without ever throwing an error (although if the content isn't
> ISO-8859-1 you'll end up with garbage). I tend to call this the Flickr
> trick, because of the lovely big letters here:
> http://www.flickr.com/services/api/misc.encoding.html

Yeah, fooling around with it that's been pretty much the conclusion
I've come to.

I'd like to wait for Malcolm to weigh in since he wrote much of this
code (and I think he's on his way back to AU so it might be a bit
before he's over jetlag and back on the list), but I think this is the
right approach:

* Try to decode the form data using ``settings.DEFAULT_CHARSET``. In
most cases this'll be UTF-8, but when it's not we can try to assume
that data's being POSTed back in the same encoding we're serving it up
in.
* If that fails and ``DEFAULT_CHARSET`` isn't UTF-8, try UTF-8.
That'll deal with relatively sane automated clients (i.e.
``WWW::Mechanize`` and all its clones).
* If that fails, use ISO-WTFBBQNAMBLA-1.

How's that sound?
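
The three steps proposed above could look roughly like this in Python 3. This is a sketch of the proposal as worded in this message, not of anything in Django; decode_with_fallbacks is a hypothetical name and default_charset stands in for settings.DEFAULT_CHARSET.

```python
import codecs


def decode_with_fallbacks(raw, default_charset="utf-8"):
    """Return (text, encoding_used) following the three proposed steps."""
    candidates = [default_charset]
    # Step 2 only applies when the default is not already UTF-8.
    if codecs.lookup(default_charset).name != "utf-8":
        candidates.append("utf-8")
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Step 3: ISO-8859-1 accepts any byte sequence.
    return raw.decode("iso-8859-1"), "iso-8859-1"


print(decode_with_fallbacks(b"caf\xe9"))
```

Returning the encoding that was actually used makes it visible to the caller when a fallback kicked in.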

Craig Ogg

Aug 2, 2007, 9:10:11 PM
to django-d...@googlegroups.com
On 8/2/07, Jacob Kaplan-Moss <jacob.ka...@gmail.com> wrote:
> * Try to decode the form data using ``settings.DEFAULT_CHARSET``. In
> most cases this'll be UTF-8, but when it's not we can try to assume
> that data's being POSTed back in the same encoding we're serving it up
> in.
> * If that fails and ``DEFAULT_CHARSET`` isn't UTF-8, try UTF-8.
> That'll deal with relatively sane automated clients (i.e.
> ``WWW::Mechanize`` and all its clones).
> * If that fails, use ISO-WTFBBQNAMBLA-1.
>
> How's that sound?
>

As long as ISO-WTFBBQNAMBLA-1 is actually Windows-1252, sounds great
:-) FWIW, only difference between Windows-1252 and ISO-8859-1 is that
the first has characters (like tm) where the other has control
characters that are meaningless on the web.

Craig

James Tauber

Aug 3, 2007, 4:52:02 AM
to django-d...@googlegroups.com

Coincidentally, I most recently encountered this problem with form
character encoding an hour ago when I tried to use the International
Phonetic Alphabet in a comment on Ian Bicking's blog post[1] about
how to pronounce Django :-)

(note: at time of writing, Ian hadn't approved my comment yet)

James

[1] http://blog.ianbicking.org/2007/08/02/pronouncing-django/


--
James Tauber http://jtauber.com/
journeyman of some http://jtauber.com/blog/


Daniel Brandt

Aug 3, 2007, 2:43:42 PM
to django-d...@googlegroups.com
OK.. I gave it a shot.. not really satisfied with the patch yet, but
it's a start. Feel free to suggest improvements (or submit a better,
or extended, patch) if need be.

Here's the ticket http://code.djangoproject.com/ticket/5076

Hope everyone has a great Friday!
Regards,
Daniel

Malcolm Tredinnick

Aug 10, 2007, 11:09:04 PM
to django-d...@googlegroups.com

I realise this is now an old thread, but I wanted to point out that the
above paragraphs are precisely the reasoning behind why we do things the
current way in Django. Because there is no reliable way to know the
submission encoding, we assume it is what Django uses by default and
provide a way to set it (via request.encoding) inside the view (which is
important, so that it can be set by the client code) on a per-view
basis. The logic is that the client code is in a much better position
than Django to know what the encoding is when it's talking to a legacy
application.

When I wrote the docs for form encoding handling, I made it simple and
wrote "there is no reliable way to tell", but Craig has laid out all the
reasoning behind the decision.

Regards,
Malcolm

--
How many of you believe in telekinesis? Raise my hand...
http://www.pointy-stick.com/blog/

Malcolm Tredinnick

Aug 10, 2007, 11:16:18 PM
to django-d...@googlegroups.com

I dislike it. Various fallbacks were discussed in past threads and I
read them all again when doing that work. They all sounded flawed for
the same reason: Different people have different expectations about how
fallbacks should work (after all, if you're cursed to receive data from
Windows clients, cp-1252 is the best first fallback). Setting false
expectations feels wrong here.

We make one attempt at a default -- using the commonly applicable case
that Django will have generated the form. We provide a one-line way to
change the encoding if this isn't the case. Note that you can set
request.encoding *at any point* in the process, even after you've
already tried to access GET and POST. All it does is reset those
properties and redecode the data with your new encoding settings the
next time you try to access them. So it's not a burden, in practice.
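
To make the reset-and-redecode behaviour concrete, here is a toy Python 3 model of it. SketchRequest is a hypothetical stand-in written for this example; it is not Django's HttpRequest and only mimics the behaviour described in the paragraph above.

```python
from urllib.parse import parse_qsl


class SketchRequest:
    """Toy stand-in for a request object; not Django's implementation."""

    def __init__(self, body, encoding="utf-8"):
        self._body = body
        self._encoding = encoding
        self._post = None  # decoded lazily on first access

    @property
    def encoding(self):
        return self._encoding

    @encoding.setter
    def encoding(self, value):
        self._encoding = value
        self._post = None  # throw away anything decoded with the old setting

    @property
    def POST(self):
        if self._post is None:
            text = self._body.decode("ascii")  # assumes a percent-encoded body
            self._post = dict(parse_qsl(text, keep_blank_values=True,
                                        encoding=self._encoding, errors="replace"))
        return self._post


request = SketchRequest(b"name=caf%E9")
first = request.POST["name"]       # decoded as UTF-8: ends in U+FFFD
request.encoding = "iso-8859-1"    # set after POST was already accessed
second = request.POST["name"]      # redecoded with the new encoding
print(repr(first), repr(second))
```

The first access reproduces the \ufffd symptom from the start of the thread, and the second shows that changing the encoding afterwards is enough to fix it.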

Receiving genuinely bad/invalid data is not uncommon either, as is
obvious as soon as you start running a really anal comment sanitisation
feature or looking at uploads from corporate systems. Trying to silently
change the encoding just to minimise the errors isn't a solution here --
you'll often end up in the wrong encoding altogether, when you should
have been ignoring bad data (because things like cp-1252 and iso-8859-1
understand more single byte values in all contexts than UTF-8, for
example). Change the encoding deliberately or not at all.

I'm -1 on the proposed patch and the change in general.

Regards,
Malcolm

--
If Barbie is so popular, why do you have to buy her friends?
http://www.pointy-stick.com/blog/

Malcolm Tredinnick

Aug 10, 2007, 11:22:51 PM
to django-d...@googlegroups.com
On Sat, 2007-08-11 at 13:16 +1000, Malcolm Tredinnick wrote:
[...]

> Receiving genuinely bad/invalid data is not uncommon either, as is
> obvious as soon as you start running a really anal comment sanitisation
> feature or looking at uploads from corporate systems. Trying to silently
> change the encoding just to minimise the errors isn't a solution here --
> you'll often end up in the wrong encoding altogether, when you should
> have been ignoring bad data (because things like cp-1252 and iso-8859-1
> understand more single byte values in all contexts than UTF-8, for
> example). Change the encoding deliberately or not at all.
>
> I'm -1 on the proposed patch and the change in general.

Addendum: I'm not against including a method that, when called, tries to
guess the encoding of raw_data and even sets the HttpRequest encoding
attribute based on that. It could even conditionally import and use Mark
Pilgrim's charset detection code if you wanted (not particularly fast,
but reasonably thorough). My argument is that we shouldn't make this the
default behaviour, though.
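
A rough Python 3 sketch of what such an opt-in helper might look like. The name guess_charset and its default argument are hypothetical, the strict UTF-8 check up front is an assumption added for this example, and chardet is used only if it happens to be installed.

```python
try:
    import chardet  # optional third-party detector
except ImportError:
    chardet = None


def guess_charset(raw, default="iso-8859-1"):
    """Best-effort guess at the encoding of raw bytes; call it explicitly."""
    try:
        raw.decode("utf-8")
        return "utf-8"  # strict UTF-8 rarely succeeds by accident
    except UnicodeDecodeError:
        pass
    if chardet is not None:
        detected = chardet.detect(raw).get("encoding")
        if detected:
            return detected
    return default  # nothing better to offer


print(guess_charset("café".encode("utf-8")))
```

A view that needs it would then set request.encoding from the result itself, so nothing changes for views that never call it.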

Malcolm

--
Tolkien is hobbit-forming.
http://www.pointy-stick.com/blog/

Daniel Brandt

Aug 11, 2007, 2:04:45 PM
to django-d...@googlegroups.com
My only problem with this is that I feel I'm writing framework-code in
my view. I'm clearly breaking the separation between application and
framework. Of course, no design will ever be perfect and you will
always have corner cases like this.

Instead of ditching the idea altogether I think making some kind of
code available that can guess the charset would be the better
solution. The view coder may then execute it at his/her own option,
should they need it.

Regards,
Daniel

Malcolm Tredinnick

Aug 11, 2007, 8:26:37 PM
to django-d...@googlegroups.com

Yes, absolutely. Sorry if I wasn't clear, but that was the approach I
was advocating in my addendum email. Even happy with it being a method
on HttpRequest so that request.guess_charset() works.

My slight concern was with character set "guessing" (a.k.a. botching up,
in some cases) being automatic behaviour. Nothing more.
