I was cleaning up a django app to work with the unicode merge, when I noticed a problem that has taken me a good while to debug, and I still don't have the full picture.
I am receiving POST-data that is submitted to my application not via a form or a browser, but from other web applications, according to a known protocol. This data may or may not have the charset of the data set in the Content-Type header.
When data comes this way, the Request object I handle in my view never has its charset attribute set. This leads to a situation where the POST data, when it is converted to unicode, will have all non-ascii characters translated to \ufffd (the codepoint for an unknown character). Not exactly great when most of the data that comes this way is latin-1 encoded.
I solved it temporarily like this:
    i = request.META['CONTENT_TYPE'].find('charset')
    if i > 0:
        request.encoding = request.META['CONTENT_TYPE'][i+8:]
    else:
        request.encoding = 'ISO-8859-1'
.. which is far from good or fail-safe, but handles the situation for the moment at least.
Of course this needs a more elegant solution, but I'm unsure what would solve this in a good way. I do feel though, that the framework ought to handle the unicode conversion better in this case.
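As a rough illustration of what a less fragile version of the snippet above could look like, here is a minimal sketch. It assumes Python 3 and uses the standard library's email.message.Message to parse the header parameters instead of slicing the string by hand. The helper name charset_from_content_type and the ISO-8859-1 default are illustrative choices, not something Django provides.

```python
from email.message import Message


def charset_from_content_type(content_type, default="ISO-8859-1"):
    """Return the charset parameter of a Content-Type value, or `default`.

    Hypothetical helper: handles quoted values and extra parameters,
    which plain string slicing does not.
    """
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset, or None
    # when the header carries no charset parameter.
    return msg.get_content_charset() or default
```

In a view this might be used as request.encoding = charset_from_content_type(request.META.get('CONTENT_TYPE', '')), which also avoids a KeyError when the header is missing altogether.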
On 8/2/07, Daniel Brandt <daniel.bra...@gmail.com> wrote:
> I am recieving POST-data that is submitted to my application not via a > form or a browser, but from other web applications, according to a > known protocol. This data may or may not have the charset of the data > set in the Content-Type header.
Yuck, clients that don't speak HTTP correctly make me angry.
Reading the RFC, though, I see that since HTTP 1.0 made "charset" optional, it remains so in HTTP 1.1, and we're supposed to "guess" and use ISO-8859-1 like you're doing in your code snippet. I suppose that means that Django's request object should do pretty much what you've done in this snippet.
Jacob Kaplan-Moss wrote: > On 8/2/07, Daniel Brandt <daniel.bra...@gmail.com> wrote: >> I am recieving POST-data that is submitted to my application not via a >> form or a browser, but from other web applications, according to a >> known protocol. This data may or may not have the charset of the data >> set in the Content-Type header.
> Yuck, clients that don't speak HTTP correctly make me angry.
> Reading the RFC, though, I see that since HTTP 1.0 made "charset" > optional, it remains so in HTTP 1.1, and we're supposed to "guess" and > use ISO-8859-1 like you're doing in your code snippet. I suppose that > means that Django's request object should do pretty much what you've > done in this snippet.
> Care to try your hand at whipping up a patch?
I made some quick tests (on-disk html file + netcat), and it seems that Firefox does not send the charset at all when it submits a form...
I admit it was only a quick test, so maybe I did it wrong, but if not, then it's perhaps not always a good idea to guess the charset...
> Jacob Kaplan-Moss wrote: > > On 8/2/07, Daniel Brandt <daniel.bra...@gmail.com> wrote: > >> I am recieving POST-data that is submitted to my application not via a > >> form or a browser, but from other web applications, according to a > >> known protocol. This data may or may not have the charset of the data > >> set in the Content-Type header.
> > Yuck, clients that don't speak HTTP correctly make me angry.
> > Reading the RFC, though, I see that since HTTP 1.0 made "charset" > > optional, it remains so in HTTP 1.1, and we're supposed to "guess" and > > use ISO-8859-1 like you're doing in your code snippet. I suppose that > > means that Django's request object should do pretty much what you've > > done in this snippet.
> > Care to try your hand at whipping up a patch?
> i made some quick tests (on-disk html file + netcat), > and it seems that firefox does not send the charset at all when it > submits a form...
Surely this is done if you set the enctype correctly?
There's also the possibility that Firefox looks for HTTP headers telling it what charsets are acceptable -- though I forgot the name of said header, it's one of the Accept-* ones.
> i admit it was only a quick test, so maybe i did it wrong, but if not, > then it's perhaps not always a good idea to guess the charset...
> On 8/2/07, Daniel Brandt <daniel.bra...@gmail.com> wrote: > > I am recieving POST-data that is submitted to my application not via a > > form or a browser, but from other web applications, according to a > > known protocol. This data may or may not have the charset of the data > > set in the Content-Type header.
> Yuck, clients that don't speak HTTP correctly make me angry.
> Reading the RFC, though, I see that since HTTP 1.0 made "charset" > optional, it remains so in HTTP 1.1, and we're supposed to "guess" and > use ISO-8859-1 like you're doing in your code snippet. I suppose that > means that Django's request object should do pretty much what you've > done in this snippet.
> Care to try your hand at whipping up a patch?
Sure.. Will do when I get to work in the morning. I'll notify you people here on the list when I'm done creating a ticket with a first stab at a patch.
On 8/2/07, ludvig.ericson <ludvig.eric...@gmail.com> wrote:
> On Aug 2, 11:02 pm, Gábor Farkas <ga...@nekomancer.net> wrote: > > Jacob Kaplan-Moss wrote: > > > On 8/2/07, Daniel Brandt <daniel.bra...@gmail.com> wrote: > > >> I am recieving POST-data that is submitted to my application not via a > > >> form or a browser, but from other web applications, according to a > > >> known protocol. This data may or may not have the charset of the data > > >> set in the Content-Type header.
> There's also the possibility that Firefox looks for HTTP headers > telling it what charsets are acceptable -- though I forgot the name of > said header, it's one of the Accept-* ones.
There is actually very inconsistent browser behavior when it comes to letting an app know the charset of a form.
I hit this when developing a bookmarklet that would submit forms from web pages on arbitrary servers. From my experience, here is the short version:
1. There is no way to reliably know the charset of a url-encoded form across all browsers from the content of the submission, but the charset of the submitted form will be the same charset used to render the form.
2. You can use a hidden field _charset_ on IE and Firefox (but not Safari) to reliably get the charset.
3. W3C now recommends using multipart/form-data for non-ASCII data (essentially all forms) [1]:
"The content type "application/x-www-form-urlencoded" is inefficient for sending large quantities of binary data or text containing non-ASCII characters. The content type "multipart/form-data" should be used for submitting forms that contain files, non-ASCII data, and binary data."
(For a decent overview of the issue, but a little dated, see [2])
IMO, the best route forward for django would be to assume that the decoding should be done using the same charset the site is using to render pages. If the developer has special needs, they can use _charset_ or other means to determine the charset and handle the encoding.
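To make the _charset_ idea concrete, here is a minimal sketch of how a developer with "special needs" might honour that hidden field when decoding a url-encoded body by hand. It assumes Python 3 and the standard library's urllib.parse.parse_qs; the function name parse_form and the UTF-8 default are assumptions for illustration, and nothing here reflects how Django itself parses forms.

```python
import codecs
from urllib.parse import parse_qs


def parse_form(body, default_charset="utf-8"):
    """Decode a url-encoded form body, honouring a `_charset_` field if present.

    Hypothetical helper. `body` is the percent-encoded payload as text.
    """
    # First pass: ISO-8859-1 never fails, and the charset name is ASCII,
    # so the _charset_ value can be read regardless of the real encoding.
    probe = parse_qs(body, encoding="iso-8859-1")
    charset = probe.get("_charset_", [default_charset])[0]
    try:
        codecs.lookup(charset)
    except LookupError:
        # Unknown or bogus charset name: fall back to the site default.
        charset = default_charset
    # Second pass: decode the percent-escapes with the charset we settled on.
    return parse_qs(body, encoding=charset, errors="replace")
```

For example, parse_form("name=%E9&_charset_=iso-8859-1") decodes the name field as Latin-1, while the same call without the _charset_ field would use the default.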
As an aside, I also found that virtually all browsers actually use Windows-1252 when they say they are using Latin-1 (across all Windows, Mac and Linux at least). The easiest test for this is the trademark symbol (tm), which doesn't exist in ISO-8859-1. This is described in Wikipedia [3] and can be seen by setting the encoding on your browser while viewing this page [4] for the Palm Treo, which has a literal (tm) in it that renders fine when the browser is set to ISO-8859-1. For the greatest compatibility with browsers, ISO-8859-1 should also be treated as Windows-1252.
I am new to django and this list, so I hope this email is constructive and helpful.
> I am new to django and this list, so I hope this email is constructive > and helpful.
Amazingly useful!
I'm still digesting all that tasty informational goodness, but I'm pretty sure you're right that we should assume the current default charset instead of hardcoding ISO-whatever-1.
On Aug 2, 9:39 pm, "Jacob Kaplan-Moss" <jacob.kaplanm...@gmail.com> wrote:
> Yuck, clients that don't speak HTTP correctly make me angry.
> Reading the RFC, though, I see that since HTTP 1.0 made "charset" > optional, it remains so in HTTP 1.1, and we're supposed to "guess" and > use ISO-8859-1 like you're doing in your code snippet. I suppose that > means that Django's request object should do pretty much what you've > done in this snippet.
This is a totally ridiculous flaw with the HTTP spec - you literally have no reliable way of telling what encoding a request coming in to your site uses, since you can't be absolutely sure that the user-agent read a page from your site to find out your character encoding!
One really smart trick you can do is this: attempt to decode as UTF-8 (which is nice and strict and will fail noisily for pretty much anything that isn't either UTF-8 or ASCII, a UTF-8 subset). If decoding fails, assume ISO-8859-1 which will decode absolutely anything without ever throwing an error (although if the content isn't ISO-8859-1 you'll end up with garbage). I tend to call this the Flickr trick, because of the lovely big letters here: http://www.flickr.com/services/api/misc.encoding.html
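A minimal sketch of that trick in Python 3, assuming the raw request body is available as bytes (the function name decode_utf8_or_latin1 is just an illustrative choice):

```python
def decode_utf8_or_latin1(raw):
    """Decode bytes as UTF-8 if they are valid UTF-8, else as ISO-8859-1."""
    try:
        # UTF-8 is strict: most non-UTF-8 byte sequences raise here.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail,
        # but the result is garbage if the data was in some other encoding.
        return raw.decode("iso-8859-1")
```

Note that the fallback never raises, so a caller cannot tell from the result alone whether the guess was right.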
If it really matters, you can use Mark Pilgrim's chardet library to detect the most likely encoding based on statistical analysis: http://chardet.feedparser.org/
On 8/2/07, Simon Willison <swilli...@gmail.com> wrote:
> This is a totally ridiculous flaw with the HTTP spec - you literally > have no reliable way of telling what encoding a request coming in to > your site uses, since you can't be absolutely sure that the user-agent > read a page from your site to find out your character encoding!
W3C FTW!
> One really smart trick you can do is this: attempt to decode as UTF-8 > (which is nice and strict and will fail noisily for pretty much > anything that isn't either UTF-8 or ASCII, a UTF-8 subset). If > decoding fails, assume ISO-8859-1 which will decode absolutely > anything without ever throwing an error (although if the content isn't > ISO-8859-1 you'll end up with garbage). I tend to call this the Flickr > trick, because of the lovely big letters here: > http://www.flickr.com/services/api/misc.encoding.html
Yeah, fooling around with it that's been pretty much the conclusion I've come to.
I'd like to wait for Malcolm to weigh in since he wrote much of this code (and I think he's on his way back to AU so it might be a bit before he's over jetlag and back on the list), but I think this is the right approach:
* Try to decode the form data using ``settings.DEFAULT_CHARSET``. In most cases this'll be UTF-8, but when it's not we can try to assume that data's being POSTed back in the same encoding we're serving it up in.
* If that fails and ``DEFAULT_CHARSET`` isn't UTF-8, try UTF-8. That'll deal with relatively sane automated clients (i.e. ``WWW::Mechanize`` and all its clones).
* If that fails, use ISO-WTFBBQNAMBLA-1.
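The three steps above could be sketched roughly as follows. This is only an illustration of the proposal, not a patch: it assumes Python 3, takes the default charset as a plain argument rather than reading ``settings.DEFAULT_CHARSET``, and the function name decode_with_fallbacks is made up for the example.

```python
import codecs


def decode_with_fallbacks(raw, default_charset="utf-8"):
    """Return (text, encoding_used) following the proposed fallback order."""
    candidates = [default_charset]
    try:
        default_is_utf8 = codecs.lookup(default_charset).name == "utf-8"
    except LookupError:
        default_is_utf8 = False
    if not default_is_utf8:
        # Step 2: only worth trying UTF-8 if it was not already step 1.
        candidates.append("utf-8")
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    # Step 3: ISO-8859-1 accepts any byte sequence, so this always succeeds.
    return raw.decode("iso-8859-1"), "iso-8859-1"
```

Returning the encoding alongside the text lets the caller see which step actually matched, which matters because the last step cannot fail.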
On 8/2/07, Jacob Kaplan-Moss <jacob.kaplanm...@gmail.com> wrote:
> * Try to decode the form data using ``settings.DEFAULT_CHARSET``. In > most cases this'll be UTF-8, but when it's not we can try to assume > that data's being POSTed back in the same encoding we're serving it up > in. > * If that fails and ``DEFAULT_CHARSET`` isn't UTF-8, try UTF-8. > That'll deal with relatively sane automated clients (i.e. > ``WWW::Mechanize`` and all its clones). > * If that fails, use ISO-WTFBBQNAMBLA-1.
> How's that sound?
As long as ISO-WTFBBQNAMBLA-1 is actually Windows-1252, sounds great :-) FWIW, only difference between Windows-1252 and ISO-8859-1 is that the first has characters (like tm) where the other has control characters that are meaningless on the web.
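The difference is easy to see from Python 3, using the byte that Windows-1252 assigns to the trademark sign. This is a small demonstration only; the variable names are arbitrary.

```python
TRADEMARK_BYTE = b"\x99"

# Windows-1252 maps 0x99 to U+2122 TRADE MARK SIGN.
as_cp1252 = TRADEMARK_BYTE.decode("cp1252")

# ISO-8859-1 maps 0x99 to U+0099, a C1 control character.
as_latin1 = TRADEMARK_BYTE.decode("iso-8859-1")

# Outside the 0x80-0x9F range the two encodings agree, e.g. for 0xE9.
same_elsewhere = b"\xe9".decode("cp1252") == b"\xe9".decode("iso-8859-1")
```

One caveat if cp1252 is used as a last-resort fallback: Python's cp1252 codec leaves a handful of bytes in that range undefined (0x81, for instance) and raises UnicodeDecodeError for them, whereas ISO-8859-1 never raises.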
Coincidentally, I most recently encountered this problem with form character encoding an hour ago when I tried to use the International Phonetic Alphabet in a comment on Ian Bicking's blog post[1] about how to pronounce Django :-)
(note: at time of writing, Ian hadn't approved my comment yet)
OK.. I gave it a shot.. not really satisfied with the patch yet, but it's a start. Feel free to suggest improvements (or submit a better, or extended, patch) if need be.
On Thu, 2007-08-02 at 15:14 -0700, Craig Ogg wrote: > On 8/2/07, ludvig.ericson <ludvig.eric...@gmail.com> wrote: > > On Aug 2, 11:02 pm, Gábor Farkas <ga...@nekomancer.net> wrote: > > > Jacob Kaplan-Moss wrote: > > > > On 8/2/07, Daniel Brandt <daniel.bra...@gmail.com> wrote: > > > >> I am recieving POST-data that is submitted to my application not via a > > > >> form or a browser, but from other web applications, according to a > > > >> known protocol. This data may or may not have the charset of the data > > > >> set in the Content-Type header.
> > There's also the possibility that Firefox looks for HTTP headers > > telling it what charsets are acceptable -- though I forgot the name of > > said header, it's one of the Accept-* ones.
> There is actually very inconsistent browser behavior when it comes to > letting an app know the charset of a form.
> I hit this when developing a bookmarklet that would submit forms from > web pages on arbitrary servers. From my experience, here is the short > version:
> 1. There is no way to reliably know the charset of a url-encoded form > across all browsers from the content of the submission, but the > charset of the submitted form will be the same charset used to render > the form.
> 2. You can use a hidden field _charset_ on IE and Firefox (but not > Safari) to reliably get the charset.
> 3. W3C now recommends using multipart/form-data for non-ASCII data > (essentially all forms) [1]:
> "The content type "application/x-www-form-urlencoded" is inefficient > for sending large quantities of binary data or text containing > non-ASCII characters. The content type "multipart/form-data" should be > used for submitting forms that contain files, non-ASCII data, and > binary data."
> (For a decent overview of the issue, but a little dated, see [2])
> IMO, the best route forward for django would be to assume that the > decoding should be done using the same charset the site is using to > render pages. If the developer has special needs, they can use > _charset_ or other means to determine the charset and handle the > encoding.
I realise this is now an old thread, but I wanted to point out that the above paragraphs are precisely the reasoning behind why we do things the current way in Django. Because there is no reliable way to know the submission encoding, we assume it is what Django uses by default and provide a way to set it (via request.encoding) inside the view (which is important, so that it can be set by the client code) on a per-view basis. The logic is that the client code is in a much better position than Django to know what the encoding is when it's talking to a legacy application.
When I wrote the docs for form encoding handling, I made it simple and wrote "there is no reliable way to tell", but Craig has laid out all the reasoning behind the decision.
On Thu, 2007-08-02 at 19:33 -0500, Jacob Kaplan-Moss wrote: > On 8/2/07, Simon Willison <swilli...@gmail.com> wrote: > > This is a totally ridiculous flaw with the HTTP spec - you literally > > have no reliable way of telling what encoding a request coming in to > > your site uses, since you can't be absolutely sure that the user-agent > > read a page from your site to find out your character encoding!
> W3C FTW!
> > One really smart trick you can do is this: attempt to decode as UTF-8 > > (which is nice and strict and will fail noisily for pretty much > > anything that isn't either UTF-8 or ASCII, a UTF-8 subset). If > > decoding fails, assume ISO-8859-1 which will decode absolutely > > anything without ever throwing an error (although if the content isn't > > ISO-8859-1 you'll end up with garbage). I tend to call this the Flickr > > trick, because of the lovely big letters here: > > http://www.flickr.com/services/api/misc.encoding.html
> Yeah, fooling around with it that's been pretty much the conclusion > I've come to.
> I'd like to wait for Malcolm to weigh in since he wrote much of this > code (and I think he's on his way back to AU so it might be a bit > before he's over jetlag and back on the list), but I think this is the > right approach:
> * Try to decode the form data using ``settings.DEFAULT_CHARSET``. In > most cases this'll be UTF-8, but when it's not we can try to assume > that data's being POSTed back in the same encoding we're serving it up > in. > * If that fails and ``DEFAULT_CHARSET`` isn't UTF-8, try UTF-8. > That'll deal with relatively sane automated clients (i.e. > ``WWW::Mechanize`` and all its clones). > * If that fails, use ISO-WTFBBQNAMBLA-1.
> How's that sound?
I dislike it. Various fallbacks were discussed in past threads and I read them all again when doing that work. They all sounded flawed for the same reason: Different people have different expectations about how fallbacks should work (after all, if you're cursed to receive data from Windows clients, cp-1252 is the best first fallback). Setting false expectations feels wrong here.
We make one attempt at a default -- using the commonly applicable case that Django will have generated the form. We provide a one-line way to change the encoding if this isn't the case. Note that you can set request.encoding *at any point* in the process, even after you've already tried to access GET and POST. All it does is reset those properties and redecode the data with your new encoding settings the next time you try to access them. So it's not a burden, in practice.
Receiving genuinely bad/invalid data is not uncommon either, as is obvious as soon as you start running a really anal comment sanitisation feature or looking at uploads from corporate systems. Trying to silently change the encoding just to minimise the errors isn't a solution here -- you'll often end up in the wrong encoding altogether, when you should have been ignoring bad data (because things like cp-1252 and iso-8859-1 understand more single byte values in all contexts than UTF-8, for example). Change the encoding deliberately or not at all.
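For views that always talk to the same legacy peer, that one line can be wrapped up so it isn't repeated in each view body. The decorator below is a hypothetical sketch, not part of Django: it relies only on the request.encoding attribute described above, and it is kept free of framework imports so it runs standalone (a real view would return an HttpResponse).

```python
import functools


def force_request_encoding(encoding):
    """Hypothetical decorator: set request.encoding before the view runs."""
    def decorator(view):
        @functools.wraps(view)
        def wrapper(request, *args, **kwargs):
            # The one-line override: GET and POST are decoded with this
            # encoding the next time the view accesses them.
            request.encoding = encoding
            return view(request, *args, **kwargs)
        return wrapper
    return decorator


@force_request_encoding("iso-8859-1")
def legacy_endpoint(request):
    # Illustrative view body; by now request.encoding is already set.
    return request.encoding
```

This keeps the decision explicit and per-view, which is the point being made: the code that knows the peer's encoding is the code that states it.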
I'm -1 on the proposed patch and the change in general.
On Sat, 2007-08-11 at 13:16 +1000, Malcolm Tredinnick wrote:
[...]
> Receiving genuinely bad/invalid data is not uncommon either, as is > obvious as soon as you start running a really anal comment sanitisation > feature or looking at uploads from corporate systems. Trying to silently > change the encoding just to minimise the errors isn't a solution here -- > you'll often end up in the wrong encoding altogether, when you should > have been ignoring bad data (because things like cp-1252 and iso-8859-1 > understand more single byte values in all contexts than UTF-8, for > example). Change the encoding deliberately or not all.
> I'm -1 on the proposed patch and the change in general.
Addendum: I'm not against including a method that, when called, tries to guess the encoding of raw_data and even sets the HttpRequest encoding attribute based on that. It could even conditionally import and use Mark Pilgrim's charset detection code if you wanted (not particularly fast, but reasonably thorough). My argument is that we shouldn't make this the default behaviour, though.
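A rough sketch of what such an opt-in helper might look like, assuming Python 3 and that the chardet package may or may not be installed. The function name guess_charset and the fallback argument are assumptions for illustration; the only chardet call used is chardet.detect(), which returns a dict with an "encoding" key.

```python
try:
    import chardet  # optional third-party dependency
except ImportError:
    chardet = None


def guess_charset(raw, fallback="iso-8859-1"):
    """Best-effort guess at the encoding of `raw` bytes. Opt-in only."""
    try:
        # Cheap, strict check first: valid UTF-8 (including plain ASCII).
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    if chardet is not None:
        # Statistical detection; "encoding" may be None if undecided.
        detected = chardet.detect(raw).get("encoding")
        if detected:
            return detected
    return fallback
```

A view that wants the guess would then call request.encoding = guess_charset(raw_bytes) itself, so the guessing never happens behind the developer's back.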
My only problem with this is that I feel I'm writing framework-code in my view. I'm clearly breaking the separation between application and framework. Of course, no design will ever be perfect and you will always have corner cases like this.
Instead of ditching the idea altogether I think making some kind of code available that can guess the charset would be the better solution. The view coder may then execute it at his/her own option, should they need it.
On Sat, 2007-08-11 at 20:04 +0200, Daniel Brandt wrote: > My only problem with this is that I feel I'm writing framework-code in > my view. I'm clearly breaking the separation between application and > framework. Of course, no design will ever be perfect and you will > always have corner cases like this.
> Instead of ditching the idea alltogether I think making some kind of > code available that can guess the charset would be the better > solution. The view coder may then execute it at his/her own option, > should they need it.
Yes, absolutely. Sorry if I wasn't clear, but that was the approach I was advocating in my addendum email. Even happy with it being a method on HttpRequest so that request.guess_charset() works.
My slight concern was with character set "guessing" (a.k.a. botching up, in some cases) being automatic behaviour. Nothing more.