forcing UTF8 data inside django

Victor Ng

Dec 10, 2006, 11:02:11 PM
to django...@googlegroups.com
Hi all,

The unicode problem seems to creep up in this list a lot, so here's
what I've done to solve my problems.

My particular problem is that I need to be able to deal with Unicode
data in the URLs as well as the regular request GET/POST data.

This is a piece of middleware that I'm using to force all incoming
data to be UTF-8. If you also add a meta tag to the head section
of your template to declare utf-8, I think IE will actually do the
right thing and not do its weird charset guessing.

Setting that meta tag, along with explicitly setting
settings.DEFAULT_CHARSET to 'utf-8', and then applying this middleware
layer seems to get all my string data inside Django to be clean UTF8
data.
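
For reference, the settings side of this might look roughly like the
fragment below. DEFAULT_CHARSET is the setting named above and
MIDDLEWARE_CLASSES is the middleware setting in Django releases of this
era; the module path 'myproject.middleware.UTF8Filter' is only a
hypothetical location for the filter class. The meta tag itself would be
something like <meta http-equiv="Content-Type" content="text/html;
charset=utf-8"> in the template head.

```python
# settings.py (fragment, a sketch rather than a complete settings file)

# Charset Django should use for the responses it generates.
DEFAULT_CHARSET = 'utf-8'

MIDDLEWARE_CLASSES = (
    # Hypothetical module path: point this at wherever UTF8Filter lives.
    'myproject.middleware.UTF8Filter',
    'django.middleware.common.CommonMiddleware',
)
```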

The advantage of this over a 'full' unicode conversion of Django is
that you don't have to touch any existing Django code, and you don't
have to enforce non-obvious rules like implementing "__unicode__" and
using "unicode()" instead of "str()" everywhere.

Anyway, I hope this is of use to people.

The utf8_encode function is probably overly paranoid, but well... I
really don't trust IE to send properly encoded data.

vic

 1  import types
 2
 3  '''
 4  This filter will force any incoming GET or POST data to become UTF8 data for
 5  processing inside of Django.
 6  '''
 7
 8  class UTF8Filter(object):
 9      def process_request(self, request):
10          get_parms = request.GET
11          post_parms = request.POST
12
13          request.GET._mutable = True
14          request.POST._mutable = True
15
16          for cgiargs in [request.GET, request.POST]:
17              for key, vallist in cgiargs.items():
18                  tmp_values = []
19                  if isinstance(vallist, types.ListType):
20                      for i, val in enumerate(vallist):
21                          tmp_values.append(utf8_encode(val))
22                  else:
23                      tmp_values = [utf8_encode(vallist),]
24
25                  cgiargs.setlist(key, tmp_values)
26
27          # Rewrite the request path as UTF8 data for Ajax calls
28          request.path = utf8_encode(request.path)
29
30          request.GET._mutable = False
31          request.POST._mutable = False
32
33          return None
34
35  def utf8_encode(val):
36      try:
37          tmp = val.decode('utf8')
38      except:
39          try:
40              tmp = val.decode('latin1')
41          except:
42              tmp = val.decode('ascii', 'ignore')
43      tmp = tmp.encode('utf8')
44      return tmp
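
As a side note, the fallback chain in utf8_encode can be sketched in
Python 3 terms as below. This is an illustration under assumptions, not
the code from the post: it takes a bytes value, and the name to_utf8 is
made up for the example. For byte-string input, decoding as latin1
accepts every byte value, so a further 'ascii' fallback should never be
reached; the sketch leaves it out.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8, guessing the source encoding."""
    try:
        # Valid UTF-8 input passes through unchanged.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps each byte 0x00-0xFF to one code point, so this
        # decode cannot fail (it may still guess the wrong characters).
        text = raw.decode("latin-1")
    return text.encode("utf-8")


already_utf8 = to_utf8("caf\u00e9".encode("utf-8"))
from_latin1 = to_utf8("caf\u00e9".encode("latin-1"))
print(already_utf8 == from_latin1)
```

The trade-off is that a latin1 fallback never raises, so input in some
third encoding is silently converted into the wrong characters rather
than rejected.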

mezhaka

Dec 11, 2006, 5:44:07 AM
to Django users

What was your motivation to create all this?
The reason I am asking, I suppose my problem
(http://groups-beta.google.com/group/django-users/browse_thread/thread/a9b53db451aa4590)
is somehow related to these issues.

Gábor Farkas

Dec 11, 2006, 7:30:56 AM
to django...@googlegroups.com
Victor Ng wrote:
> Hi all,
>
> The unicode problem seems to creep up in this list a lot, so here's
> what I've done to solve my problems.
>
> My particular problem is that I need to be able to deal with Unicode
> data in the URLs as well as the regular request GET/POST data.
>
> This is a piece of middleware that I'm using to force all incoming
> data to be UTF-8. If you also add in a meta tag in your head section
> of your template to declare utf-8, I think IE will actually do the
> right thing and not do its weird charset guessing.
>

hi,

well, in my experience, the most important thing is the content-type
http header. if you explicitly specify the charset there, then the
browser will use that, and completely ignore the charset specification
in the html file.

also, may i ask, why such a paranoid way of working with GET/POST?
because (also, only my experience, no big testing), browsers submit
their form data in the charset of the page that contained the form.

so if you send the browser a utf-8 page, its submitted data is
going to be utf-8.
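
to make the header point concrete, here is a minimal sketch of a plain
WSGI application (not django code) that declares the charset in the
content-type header. the function name app and the page body are made up
for the example.

```python
def app(environ, start_response):
    """Tiny WSGI app that serves one UTF-8 page."""
    body = "<p>caf\u00e9</p>".encode("utf-8")
    headers = [
        # The charset parameter tells the browser how to decode the body.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

it can be tried locally with wsgiref.simple_server.make_server from the
standard library.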


gabor

Victor Ng

Dec 11, 2006, 11:08:32 AM
to django...@googlegroups.com
Hi Gabor,

First off, I just realized that the code I posted earlier has a small bug.

Line 17 should've read:

17 for key, vallist in cgiargs.lists():

the old code used 'items()' which only pulls a single value out of
multivaluedict.
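
The same single-value versus all-values distinction can be seen outside
Django with the standard library. This is only an analogy in Python 3,
not Django's multi-value dict itself, and the query string is made up
for the example:

```python
from urllib.parse import parse_qs, parse_qsl

query = "tag=red&tag=blue&page=2"

# One value per key: a later duplicate overwrites the earlier one.
single = dict(parse_qsl(query))

# Every value per key, collected into lists.
multi = parse_qs(query)

print(single)
print(multi)
```

Iterating over the single-value view would re-encode only one "tag"
value and drop the other, which is the kind of data loss the items()
version risks.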

On to unicode....

The reason I'm paranoid about handling GET/POST data is that MSIE's
charset handling is unreliable.

Here are two good references:

http://ln.hixie.ch/?start=1144794177&count=1
http://www.joelonsoftware.com/articles/Unicode.html

Basically, IE ignores the content-type header and figures out the
content type by doing content sniffing.

So sometimes, IE guesses wrong - and you get garbage if you just use
the Content-Type header. If you use the meta tag, it forces UTF8 in
almost all browsers.

victor "MSIE is a four letter word" ng


--
"Never attribute to malice that which can be adequately explained by
stupidity." - Hanlon's Razor

Victor Ng

Dec 11, 2006, 11:17:31 AM
to django...@googlegroups.com
Hi Anton,

I don't have mysql5 to test with right now, but I have tested my stuff
against sqlite and it seems to work there, so I can't imagine that
this will cause you problems on mysql.

My use case is probably like yours - I need multilingual support since
I have to handle names of countries and people from all over the
world.

vic

favo

Dec 12, 2006, 11:08:28 AM
to Django users
I think the middleware should decode and encode using
settings.DEFAULT_CHARSET rather than hardcoding utf8.

Victor Ng

Dec 12, 2006, 5:27:35 PM
to django...@googlegroups.com
Unfortunately, not all charsets support all unicode characters,
so really, the fact that DEFAULT_CHARSET is configurable is mostly a
moot point for me. For example, latin1 won't let me encode Asian
characters.
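
A quick way to see the latin1 limitation, as a small Python 3 sketch
(the sample string is just two CJK characters picked for illustration):

```python
text = "\u6771\u4eac"  # two CJK characters

try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    # latin-1 only covers code points up to U+00FF.
    latin1_ok = False

# UTF-8 can encode any unicode character.
utf8_bytes = text.encode("utf-8")

print(latin1_ok)
print(utf8_bytes.decode("utf-8") == text)
```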

I honestly can't think of a good reason to do anything other than UTF8
unless you've got some weird requirement to do otherwise.

vic
