German umlauts in search queries


Hinnack

Nov 24, 2009, 5:57:52 AM
to django...@googlegroups.com
Hi,

I have Django 1.1 and a MySQL database created in utf-8.
My tables are also utf8, as is the collation.

Displaying data in a view works well with German umlauts, but a search
using filter and icontains always returns an empty queryset...

If I convert the search back to iso-8859, I get results...

What am I doing wrong?

-- Hinnack

Karen Tracey

Nov 24, 2009, 7:49:00 PM
to django...@googlegroups.com

What does "convert the search back to iso-8859" mean?

Karen

Hinnack

Nov 25, 2009, 1:54:46 AM
to django...@googlegroups.com
Hi Karen,

Thanks for your reply.

It means that, so far, I have to do this:
qs = search[query].encode('iso-8859-1')

before I add qs to a Q object of a queryset. Only then do I get results.

The full code looks like this:

decoder = simplejson.JSONDecoder()
search = decoder.decode(request.POST['search'])
qs = search['caption'].encode('iso-8859-1')
searchstr = urllib.unquote_plus(qs).strip('=!~')
basic.filter(Q(evid__caption__icontains=searchstr))

I have no DATABASE_OPTIONS set. Maybe that's it?


-- Hinnack


--

You received this message because you are subscribed to the Google Groups "Django users" group.
To post to this group, send email to django...@googlegroups.com.
To unsubscribe from this group, send email to django-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-users?hl=en.

Karen Tracey

Nov 25, 2009, 11:24:52 PM
to django...@googlegroups.com
On Wed, Nov 25, 2009 at 1:54 AM, Hinnack <henrik....@googlemail.com> wrote:
> I have no DATABASE_OPTIONS set. Maybe that's it?


No, there's nothing you need to set specially to get this to work.

It would be interesting to see the repr of search['caption'] before and after you do the encode('iso-8859-1'). 

I suspect search['caption'] is originally a Unicode object that was (incorrectly) constructed from a utf-8 bytestring assuming iso-8859-1 encoding.  That would explain the results you are getting, since encoding such an incorrectly constructed object to iso-8859-1 will restore it to a utf-8 bytestring. Django, when handed a bytestring, assumes it is utf-8 encoded and speaks utf-8 to the database, so all works.  If on the other hand you pass such an incorrectly built unicode object as a unicode object, Django will encode it using utf-8, which results in two levels of utf-8 encoding having been done, and the result won't match actual utf-8 data in the database.

But it's also possible that the problem is the data values in the database.  Confirmation of where the problem is would come from knowing the repr of search['caption'] before and after the encode to iso-8859-1.  Then we'd be sure whether to look more closely at the way in which search['caption'] is getting built or at the database itself.
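
As a rough illustration of the suspected mix-up, here is a sketch in Python 3, where str stands in for Python 2's unicode type. The variable names are made up for the example; nothing here is taken from the poster's application.

```python
# Python 3 sketch: str plays the role of Python 2's unicode type.
good = "f\u00fcr"                                  # correctly built text: für
bad = good.encode("utf-8").decode("iso-8859-1")    # utf-8 bytes misread as iso-8859-1

# ascii() prints escapes only, so the terminal encoding cannot hide anything.
print(ascii(good))   # 'f\xfcr'
print(ascii(bad))    # 'f\xc3\xbcr'

# Encoding the mis-decoded text as utf-8 gives doubly encoded bytes,
# which will not match the plain utf-8 bytes stored in the database.
double_encoded = bad.encode("utf-8")
print(double_encoded)   # b'f\xc3\x83\xc2\xbcr'

# Encoding it as iso-8859-1 instead restores the original utf-8 bytes.
restored = bad.encode("iso-8859-1")
print(restored)         # b'f\xc3\xbcr'
```

If the repr of search['caption'] looks like the second form, the string was mis-decoded before it ever reached the queryset.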

Karen

Hinnack

Nov 26, 2009, 7:03:36 AM
to django...@googlegroups.com
Hi Karen,

thanks again for your reply.
I use Aptana with pydev extension.
Debugging the app shows the following for search:
dict: {u'caption': u'f\\xfcr', u'showold': False} 

and for qs:
str: für
although it seems to be &#65533; instead of ASCII 252 - but this could be, because I am sitting on a MAC
while debugging.
(the search problem itself stays the same on MAC and LINUX debian)

regards
-- Hinnack


Karen Tracey

Nov 26, 2009, 10:38:36 AM
to django...@googlegroups.com
On Thu, Nov 26, 2009 at 7:03 AM, Hinnack <henrik....@googlemail.com> wrote:
> Debugging the app shows the following for search:
> dict: {u'caption': u'f\\xfcr', u'showold': False}


That's confusing to me, because other than having an extra \ (which could be an artifact of how it's being displayed), that looks like a correctly-built unicode object für.

and for qs:
str: für
although it seems to be &#65533; instead of ASCII 252 - but this could be, because I am sitting on a MAC
while debugging.

Using python manage.py shell might shed more light; I fear the tool here is assuming an incorrect bytestring encoding and getting in the way.

I cannot recreate anything like what you are seeing.  I have a model Thing stored in a MySQL DB (using a utf-8 encoded table) with CharField name.  There are two instances of this Thing in the DB that contain für in the name.  From a python manage.py shell, using Django 1.1.1:

>>> from ttt.models import Thing
>>> import django
>>> django.get_version()
'1.1.1'
>>> ufur = u'f\u00fcr'
>>> print ufur
für
>>> ufur
u'f\xfcr'
>>> ufur.encode('utf-8')
'f\xc3\xbcr'
>>> ufur.encode('iso-8859-1')
'f\xfcr'

Small u with umlaut is U+00FC. Encoded in utf-8 it takes the 2 bytes C3 BC; encoded in iso-8859-1 it is the single byte FC.

Filtering with icontains, using either the Unicode object or the utf-8 encoded bytestring version, works properly:

>>> Thing.objects.filter(name__icontains=ufur)
[<Thing: für inserted as unicode>, <Thing: für inserted as utf8 bytestring>]
>>> Thing.objects.filter(name__icontains=ufur.encode('utf-8'))
[<Thing: für inserted as unicode>, <Thing: für inserted as utf8 bytestring>]

Attempting to filter with an iso-8859-1 encoded bytestring raises an error:

>>> Thing.objects.filter(name__icontains=ufur.encode('iso-8859-1'))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/django/db/models/manager.py", line 129, in filter
    return self.get_query_set().filter(*args, **kwargs)
  File "/usr/lib/python2.5/site-packages/django/db/models/query.py", line 498, in filter
    return self._filter_or_exclude(False, *args, **kwargs)
  File "/usr/lib/python2.5/site-packages/django/db/models/query.py", line 516, in _filter_or_exclude
    clone.query.add_q(Q(*args, **kwargs))
  File "/usr/lib/python2.5/site-packages/django/db/models/sql/query.py", line 1675, in add_q
    can_reuse=used_aliases)
  File "/usr/lib/python2.5/site-packages/django/db/models/sql/query.py", line 1614, in add_filter
    connector)
  File "/usr/lib/python2.5/site-packages/django/db/models/sql/where.py", line 56, in add
    obj, params = obj.process(lookup_type, value)
  File "/usr/lib/python2.5/site-packages/django/db/models/sql/where.py", line 269, in process
    params = self.field.get_db_prep_lookup(lookup_type, value)
  File "/usr/lib/python2.5/site-packages/django/db/models/fields/__init__.py", line 214, in get_db_prep_lookup
    return ["%%%s%%" % connection.ops.prep_for_like_query(value)]
  File "/usr/lib/python2.5/site-packages/django/db/backends/__init__.py", line 364, in prep_for_like_query
    return smart_unicode(x).replace("\\", "\\\\").replace("%", "\%").replace("_", "\_")
  File "/usr/lib/python2.5/site-packages/django/utils/encoding.py", line 44, in smart_unicode
    return force_unicode(s, encoding, strings_only, errors)
  File "/usr/lib/python2.5/site-packages/django/utils/encoding.py", line 92, in force_unicode
    raise DjangoUnicodeDecodeError(s, *e.args)
DjangoUnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected end of data. You passed in 'f\xfcr' (<type 'str'>)

This is because Django assumes the bytestring is utf-8 encoded, and runs into trouble attempting to convert to unicode specifying utf-8 as the string's encoding, since it is not valid utf-8 data.
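
The same decoding failure can be reproduced without Django. The following is a minimal sketch in Python 3 (not Django's own code), and the exact error message may differ from the Python 2.5 one in the traceback above:

```python
# An iso-8859-1 bytestring for "für": the ü is the single byte 0xFC.
latin1_bytes = "f\u00fcr".encode("iso-8859-1")

try:
    latin1_bytes.decode("utf-8")      # what a utf-8 assumption amounts to
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print("not valid utf-8:", exc.reason)

# Decoding with the encoding the bytes were actually written in works.
text = latin1_bytes.decode("iso-8859-1")
print(ascii(text))                    # 'f\xfcr'
```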

The only way I have been able to recreate anything like what you are describing is to incorrectly construct the original unicode object from a utf-8 bytestring assuming an iso-8859-1 encoding:

>>> badufur = ufur.encode('utf-8').decode('iso-8859-1')
>>> badufur
u'f\xc3\xbcr'
>>> print badufur
fÃ¼r
>>> print badufur.encode('utf-8')
fÃ¼r
>>> print badufur.encode('iso-8859-1')
für

Using that unicode object doesn't produce any hits in the DB:

>>> Thing.objects.filter(name__icontains=badufur)
[]

But encoding it to iso-8859-1 does, because that has the effect of restoring the original utf-8 bytestring:

>>> Thing.objects.filter(name__icontains=badufur.encode('iso-8859-1'))
[<Thing: für inserted as unicode>, <Thing: für inserted as utf8 bytestring>]
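
The iso-8859-1 round trip can also be wrapped in a guarded function. This is only a sketch in Python 3: repair_mojibake is a hypothetical helper, not part of Django or of the poster's code, and it is a workaround rather than a fix, since the real cure is to decode the input correctly where it enters the application.

```python
def repair_mojibake(text):
    """Hypothetical helper: undo one 'utf-8 read as iso-8859-1' mistake.

    Returns the text unchanged when the round trip is not possible,
    which is the case for correctly decoded text such as 'für'.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(ascii(repair_mojibake("f\u00c3\u00bcr")))   # 'f\xfcr'
print(ascii(repair_mojibake("f\u00fcr")))         # 'f\xfcr'
```

A caveat: text that legitimately contains a sequence such as "Ã¼" would be altered as well, so a helper like this is best limited to data known to be damaged.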

However, the debug info you show above doesn't show an incorrectly-built unicode object, so I'm very confused by it.

Karen

Hinnack

Nov 29, 2009, 8:06:54 AM
to django...@googlegroups.com
Hi Karen,

Thanks for investigating...
I solved the problem.
There were 2 reasons:
- PHP code not passing a correctly encoded POST
- urllib.unquote_plus not working as expected

and, last but not least, raw_post_data is not decoded whereas a POST variable is...
(my blindness)
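
The thread does not say exactly how unquote_plus misbehaved. One plausible failure mode, shown here as an assumption, is percent-escapes being decoded with the wrong charset. The sketch uses Python 3's urllib.parse (the thread itself used Python 2's urllib), and the sample value is invented:

```python
from urllib.parse import quote_plus, unquote_plus

# A client that percent-encodes utf-8 bytes sends this for "für wen".
encoded = quote_plus("f\u00fcr wen")
print(encoded)                         # f%C3%BCr+wen

# Decoding the escapes as utf-8 (the default) gives the intended text.
right = unquote_plus(encoded)
print(ascii(right))                    # 'f\xfcr wen'

# Decoding the same escapes as iso-8859-1 gives the mis-built string
# that only matches after an encode('iso-8859-1') round trip.
wrong = unquote_plus(encoded, encoding="iso-8859-1")
print(ascii(wrong))                    # 'f\xc3\xbcr wen'
```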

Thanks again for your help.

-- Hinnack
