add support for unicode-normalizing get/post-data?

Gábor Farkas

Apr 10, 2008, 2:48:41 AM

to django-d...@googlegroups.com

hi,

would it be a good idea to add support to django to unicode-normalize
incoming get/post-data?

the normalization problem is basically this:

for example my name, 'gábor' can be written in 2 different ways in
unicode: u'g\xe1bor' and u'ga\u0301bor'.
the first one uses the 'LATIN SMALL LETTER A WITH ACUTE' character, and
the second one uses two characters to describe the same info: 'LATIN
SMALL LETTER A' + 'COMBINING ACUTE ACCENT'.

both strings are more or less equal from a 'unicode' perspective, and also
usually from an end-user's perspective.

as you can imagine, this can cause problems when, in a web-app, the user
searches using the first format, but the data is stored in the second
format.

the issue can be solved by normalizing the text. it means that you
convert all your strings into the 'same format'. there are several
normalization forms, some interesting ones are "NFC" (the most compact
representation) and "NFD" (the most decomposed representation).
(in my name-examples the first one is in NFC, and the second one is in NFD)

it's easy to normalize in python:

import unicodedata

norm1 = unicodedata.normalize('NFC', text)
norm2 = unicodedata.normalize('NFD', text)
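As a rough illustration of the point above, here is a minimal sketch that compares the two spellings of the name before and after normalization. It is written for Python 3, where plain string literals are already unicode (the thread itself predates that and uses u'' literals); the variable names are made up for the example.

```python
import unicodedata

composed = 'g\xe1bor'        # LATIN SMALL LETTER A WITH ACUTE
decomposed = 'ga\u0301bor'   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

# the raw strings differ, both in content and in length
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 5 6

# once both are brought to the same normalization form, they compare equal
nfc_equal = (unicodedata.normalize('NFC', composed)
             == unicodedata.normalize('NFC', decomposed))
nfd_equal = (unicodedata.normalize('NFD', composed)
             == unicodedata.normalize('NFD', decomposed))
print(nfc_equal, nfd_equal)             # True True
```

Which form is chosen matters less than using the same one for stored data and for incoming search terms.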

so i wanted to implement this with django in a way where i do not have
to normalize the get/post-data manually in every view. unfortunately,
the only way i found involved patching django:

it could be implemented like:

1. a new setting in settings.py called something like
'NORMALIZE_REQUEST_DATA', defaults to None, and can be set to 'NFC' or
'NFD' or to the other available normalization-forms.

2. do the normalization when the request-data is converted from
binary-strings to unicode-strings. this can be achieved either by adding
a new optional parameter to http.QueryDict, or by creating a
helper-function that converts a QueryDict into a unicode-normalized
QueryDict, and modifying the mod-python/wsgi handlers to call that code.
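The two points above could be sketched roughly as follows. This is only an illustration, not Django code: a plain dict mapping keys to lists of strings stands in for QueryDict, the module-level NORMALIZE_REQUEST_DATA stands in for the proposed setting, and the helper name normalize_request_data is hypothetical.

```python
import unicodedata

# stand-in for the proposed setting; None would mean "leave the data alone"
NORMALIZE_REQUEST_DATA = 'NFC'


def normalize_request_data(data, form=NORMALIZE_REQUEST_DATA):
    """Return a copy of `data` (key -> list of strings) with normalized values.

    `data` is a plain dict here, used as a stand-in for a QueryDict.
    """
    if form is None:
        # normalization disabled: copy the data unchanged
        return {key: list(values) for key, values in data.items()}
    return {
        key: [unicodedata.normalize(form, value) for value in values]
        for key, values in data.items()
    }


query = {'name': ['ga\u0301bor'], 'tag': ['a', 'b']}
normalized = normalize_request_data(query)
print(normalized['name'] == ['g\xe1bor'])   # True
```

The sketch normalizes values only; whether keys should be normalized as well is left open here.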

it's pretty simple to implement, but, before i submit an
enhancement-ticket... is there a chance for such a change to be accepted
into django?

p.s: if there is a way to achieve this without touching
django-internals, please tell me :)

p.s.2: the other interesting question of course is normalizing all
writes to the db in the django-orm, but that can be implemented
relatively simply in userland-code (signals and/or save() methods), and
maybe even more simply when model-inheritance arrives.

thanks,
gabor

Amit Upadhyay

Apr 10, 2008, 3:01:13 AM

to django-d...@googlegroups.com

On Thu, Apr 10, 2008 at 12:18 PM, Gábor Farkas <ga...@nekomancer.net> wrote:

p.s: if there is a way to achieve this without touching
django-internals, please tell me :)

I would have written a view decorator that takes request, and normalizes the request.GET/POST before calling the real view.
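A minimal sketch of what such a decorator might look like. Everything here is an assumption made for illustration: the decorator name normalize_request is hypothetical, and FakeRequest with plain dicts of lists stands in for HttpRequest and its QueryDicts, so the question of how to replace GET/POST on a real request object is not addressed.

```python
import functools
import unicodedata


def normalize_values(data, form):
    """Normalize every string value in a key -> list-of-strings mapping."""
    return {key: [unicodedata.normalize(form, value) for value in values]
            for key, values in data.items()}


def normalize_request(view, form='NFC'):
    """Hypothetical decorator: normalize request.GET/POST, then call the view."""
    @functools.wraps(view)
    def wrapper(request, *args, **kwargs):
        request.GET = normalize_values(request.GET, form)
        request.POST = normalize_values(request.POST, form)
        return view(request, *args, **kwargs)
    return wrapper


class FakeRequest:
    """Stand-in for HttpRequest, holding plain dicts instead of QueryDicts."""
    def __init__(self, get=None, post=None):
        self.GET = get or {}
        self.POST = post or {}


@normalize_request
def search(request):
    # the view only ever sees normalized data
    return request.GET['q'][0]


result = search(FakeRequest(get={'q': ['ga\u0301bor']}))
print(result == 'g\xe1bor')   # True
```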

--
Amit Upadhyay
Vakow! www.vakow.com
+91-9820-295-512

Gábor Farkas

Apr 10, 2008, 3:17:40 AM

to django-d...@googlegroups.com

Amit Upadhyay wrote:
> On Thu, Apr 10, 2008 at 12:18 PM, Gábor Farkas <ga...@nekomancer.net> wrote:
>
> p.s: if there is a way to achieve this without touching
> django-internals, please tell me :)
>
>
> I would have written a view decorator that takes request, and normalizes
> the request.GET/POST before calling the real view.
>

that would work, but there are 2 problems with it:

1. i have to do it for every request (this could be solved maybe with a
middleware-based approach)
2. it would be very hard to support things like setting the encoding of
the request (in other words, if the encoding is changed in the view, the
normalized data is lost)

or would you do it differently?

thanks,
gabor

simonb

Apr 10, 2008, 9:44:35 AM

to Django developers

On Apr 10, 2:48 pm, Gábor Farkas <ga...@nekomancer.net> wrote:
> hi,
>
> would it be a good idea to add support to django to unicode-normalize
> incoming get/post-data?

class NormCharField(forms.CharField):
    def clean(self, value):
        value = super(NormCharField, self).clean(value)
        return unicodedata.normalize('NFC',text)

Or am I missing something...

Simon

Luke Plant

Apr 10, 2008, 10:52:15 AM

to django-d...@googlegroups.com

On Thursday 10 April 2008 08:17:40 Gábor Farkas wrote:

> that would work, but there are 2 problems with it:
>
> 1. i have to do it for every request (this could be solved maybe
> with a middleware-based approach)

You could also do it in your urls.py e.g.:

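urls = [
    ('/foo/', fooviewfunc),
    ('/bar/', barviewfunc),
]

# wrap every view with the normalising decorator; patterns() takes the
# url tuples as separate arguments, hence the *
urlpatterns = patterns('', *[(regex, normalise_request_unicode(func))
                             for (regex, func) in urls])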

Luke

--
"Pessimism: Every dark cloud has a silver lining, but lightning kills
hundreds of people each year trying to find it." (despair.com)

Luke Plant || http://lukeplant.me.uk/

simonb

Apr 10, 2008, 12:57:45 PM

to Django developers

On Apr 10, 9:44 pm, simonb <bno...@gmail.com> wrote:
> return unicodedata.normalize('NFC',text)

That should be "return unicodedata.normalize('NFC',value)"

It's late!

Simon

Gábor Farkas

Apr 10, 2008, 2:26:53 PM

to django-d...@googlegroups.com

the idea is nice, but what if i don't want to use forms for this?

the point is, that in my opinion, 99% of all developers want to have
their unicode-data normalized, before they process it. (maybe they
don't know yet that they want it, but they want it :-)

there's simply no need to be able to differentiate between NFC and NFD
strings except when you are writing an application especially for
unicode-related-work. if there are any other situations, please tell me.

so, because the general case is the developer-does-not-care-about-it,
we should make it easy to do in django. imho, of course.

gabor

David Cramer

Apr 10, 2008, 5:00:39 PM

to Django developers

Why wouldn't you just do this in a middleware? A decorator, or a
clean_X on forms would not handle every incoming request like he
wants.

Gábor Farkas

Apr 11, 2008, 1:19:44 AM

to django-d...@googlegroups.com

David Cramer wrote:
> Why wouldn't you just do this in a middleware? A decorator, or a
> clean_X on forms would not handle every incoming request like he
> wants.

maybe i wasn't clear in my original response to Amit.. the thing is,
even a middleware cannot nicely solve it imho.. or at least i'm
not sure how..

the problem is that the GET/POST dictionaries of the HttpRequest
are dynamically calculated...

so i'd have to wrap the HttpRequest object into my own object, which
would delegate everything into HttpRequest, except _get_get, _get_post,
etc... and even so i'm not sure if i could wrap it in a good-enough way
(not to mention view-code that maybe assumes that isinstance(request,
HttpRequest) holds...)
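To make the wrapping concern concrete, here is a rough sketch under stated assumptions: the HttpRequest class below is a tiny stand-in written for this example (not Django's class), its GET holds single string values for brevity, and NormalizingRequest is a hypothetical wrapper name.

```python
import unicodedata


class HttpRequest:
    """Stand-in for django's HttpRequest (not the real class)."""
    def __init__(self, get, path='/'):
        self._get = get
        self.path = path

    @property
    def GET(self):
        # computed on access, loosely mimicking a dynamically calculated GET
        return dict(self._get)


class NormalizingRequest:
    """Hypothetical wrapper: overrides GET, delegates everything else."""
    def __init__(self, request, form='NFC'):
        self._request = request
        self._form = form

    @property
    def GET(self):
        return {key: unicodedata.normalize(self._form, value)
                for key, value in self._request.GET.items()}

    def __getattr__(self, name):
        # only called when normal lookup fails, so GET above takes precedence
        return getattr(self._request, name)


wrapped = NormalizingRequest(HttpRequest({'q': 'ga\u0301bor'}, path='/search/'))
print(wrapped.GET['q'] == 'g\xe1bor')     # True
print(wrapped.path)                       # /search/
print(isinstance(wrapped, HttpRequest))   # False
```

The last line is the problem mentioned above: delegation keeps attribute access working, but any code that checks isinstance(request, HttpRequest) no longer accepts the wrapper.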

another approach is to (in the middleware) modify the "inside" of the
request (request.raw_post_data and request.environ['QUERY_STRING']), but
even that would require that i copy the convert-get-post-data-to-unicode
code from django. which seems to be quite ugly and can cause
errors/incompatibilities in the future.

gabor

J. Cliff Dyer

Apr 11, 2008, 11:06:25 AM

to django-d...@googlegroups.com

On Thu, 2008-04-10 at 20:26 +0200, Gábor Farkas wrote:
> On Thu, Apr 10, 2008 at 06:44:35AM -0700, simonb wrote:
> >
> > On Apr 10, 2:48 pm, Gábor Farkas <ga...@nekomancer.net> wrote:
> > > hi,
> > >
> > > would it be a good idea to add support to django to unicode-normalize
> > > incoming get/post-data?
> >
> > class NormCharField(forms.CharField):
> >     def clean(self, value):
> >         value = super(NormCharField, self).clean(value)
> >         return unicodedata.normalize('NFC',text)
> >
> > Or am I missing something...
>
> the idea is nice, but what if i don't want to use forms for this?
>
> the point is, that in my opinion, 99% of all developers want to have
> their unicode-data normalized, before they process it. (maybe they
> don't know yet that they want it, but they want it :-)
>

I'd say it's more like 70% of all developers. The other thirty percent
want their bits to pass through in an unmutilated condition. My number
is pulled out of just as much thin air as yours, I realize. All I mean
to say is that there are legitimate and not-ridiculously-uncommon
reasons to want your data left alone.

> there's simply no need to be able to differentiate between NFC and NFD
> strings except when you are writing an application especially for
> unicode-related-work. if there are any other situations, please tell me.
>
> so, because the general case is the developer-does-not-care-about-it,
> we should make it easy to do in django. imho, of course.
>

I agree, but it should also be reasonably straightforward to *not* do.

> gabor

Cheers,
Cliff

Malcolm Tredinnick

Apr 12, 2008, 6:03:34 AM

to django-d...@googlegroups.com

On Thu, 2008-04-10 at 20:26 +0200, Gábor Farkas wrote:

[...]

> the point is, that in my opinion, 99% of all developers want to have
> their unicode-data normalized, before they process it. (maybe they
> don't know yet that they want it, but they want it :-)

87.35% of all statistics are just made up on the spot. :-)

A large bunch of the time you just won't care. If you're just serving
back the data that was entered and you don't need to search over that
data, or you know that your input data is very unlikely to carry
anything in the ambiguous encoding sections, you might well choose to
forgo the extra processing overhead. Normalisation isn't a free
operation. It involves a reasonable amount of table lookups and,
ultimately, another linear pass through all the input data for
decomposition and recomposition. We should be careful about the
extra overhead being introduced.
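One way to get a feel for that overhead is a quick timing run like the sketch below. The sample text and the number of runs are arbitrary choices made for this example, and the figures will vary by machine and Python version, so no results are claimed here.

```python
import timeit
import unicodedata

# arbitrary sample: roughly 10,000 characters of mostly-ASCII text
# containing both the composed and the decomposed spelling
sample = 'g\xe1bor searched for ga\u0301bor. ' * 400

runs = 200
nfc_seconds = timeit.timeit(
    lambda: unicodedata.normalize('NFC', sample), number=runs)
nfd_seconds = timeit.timeit(
    lambda: unicodedata.normalize('NFD', sample), number=runs)

# report the average cost of one call in microseconds
print('NFC: %.1f us per call' % (nfc_seconds / runs * 1e6))
print('NFD: %.1f us per call' % (nfd_seconds / runs * 1e6))
```

Comparing these numbers against the total time spent handling a typical request would show whether the cost matters for a given site.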

I think it's not a bad idea to add this as something that is possible,
but it's probably a bit costly to do on every request (based on some
simple timings I've done just now -- although I want to look at it
further). Don't blow things out of proportion by trying to claim it's a
no-brainer, though. That just hurts your argument.

Definitely worth looking at as an option on an HttpRequest, though.

Regards,
Malcolm

--
Telepath required. You know where to apply...
http://www.pointy-stick.com/blog/

Gábor Farkas

Apr 15, 2008, 3:37:45 AM

to django-d...@googlegroups.com

J. Cliff Dyer wrote:
> On Thu, 2008-04-10 at 20:26 +0200, Gábor Farkas wrote:
>> the point is, that in my opinion, 99% of all developers want to have
>> their unicode-data normalized, before they process it. (maybe they
>> don't know yet that they want it, but they want it :-)
>>
>
> I'd say it's more like 70% of all developers. The other thirty percent
> want their bits to pass through in an unmutilated condition.

please note, that if they want their bits, even now they have to use
request.META['QUERY_STRING'] and request.raw_post_data (i wrote this
from memory, so maybe the exact names are wrong), because request.GET
and request.POST are already unicode-decoded.
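The distinction can be sketched with the standard library alone. This is not Django's decoding code: urllib.parse.parse_qs (Python 3) stands in for the step that turns the raw query string into unicode strings, and the query string itself is a made-up example.

```python
import unicodedata
from urllib.parse import parse_qs

# raw query string as received: percent-encoded UTF-8, decomposed spelling
raw_query_string = 'name=ga%CC%81bor'

# decoding step: percent-encoded bytes become unicode strings
decoded = parse_qs(raw_query_string)
name = decoded['name'][0]
print(name == 'ga\u0301bor')            # True

# normalization would be one more step applied to the decoded value
normalized_name = unicodedata.normalize('NFC', name)
print(normalized_name == 'g\xe1bor')    # True

# the raw string is untouched, so the original bits remain available
print(raw_query_string)                 # name=ga%CC%81bor
```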

>>
>> so, because the general case is the developer-does-not-care-about-it,
>> we should make it easy to do in django. imho, of course.
>>
>
> I agree, but it should also be reasonably straightforward to *not* do.

of course.

gabor
