Converting "François Pétillant" to UTF-8


Tom

Jan 13, 2009, 11:25:25 AM
to Google App Engine
I need some suggestions/guidance on how to handle strings that contain
characters like "François Pétillant".

I have several problems.

First, I do want to store these names in their original encoding
(which I think I can do in a db.Text object).

I also want to be able to search their names with or without the
special characters (e.g. "François" and "Francois" would both be in the
SearchableModel's tags). Is there a utility that does this conversion
sensibly?

Finally, I want to be able to send the name back in UTF-8, in a format
that can be converted back to the original on the user's side. (In this
specific case the wine name is being sent back to my Android phone
app.)

Tom

Jan 13, 2009, 11:38:37 AM
to Google App Engine
It does seem like the name "François Pétillant" is being stored
successfully on the server and is queryable. However, when I use the
Datastore Viewer I get the error below.

Any help in guiding me along the do's/don'ts of character encodings
would be most appreciated.
Thanks!
Tom

Traceback (most recent call last):
  File "/cygdrive/c/Program Files/Google/google_appengine/google/appengine/ext/webapp/__init__.py", line 499, in __call__
    handler.get(*groups)
  File "/cygdrive/c/Program Files/Google/google_appengine/google/appengine/ext/admin/__init__.py", line 520, in get
    value = DataType.get(raw_value).format(raw_value)
  File "/cygdrive/c/Program Files/Google/google_appengine/google/appengine/ext/admin/__init__.py", line 852, in format
    writer.writerow(value)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 4: ordinal not in range(128)
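
For context, this is the error Python raises whenever a unicode string
containing a non-ASCII character gets encoded with the ascii codec. The
sketch below is only an illustration, not the dev server's code, and its
variable names are made up. It reproduces the same failure for the name
in this thread and shows that an explicit UTF-8 encode round-trips
cleanly.

```python
# Illustration only: variable names are hypothetical, not from the SDK.
name = u"Fran\xe7ois P\xe9tillant"  # "François Pétillant"

# Encoding to ASCII fails at the first non-ASCII character (index 4, u'\xe7').
try:
    name.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding to UTF-8 succeeds, and decoding restores the original text.
utf8_bytes = name.encode("utf-8")
round_tripped = utf8_bytes.decode("utf-8")
```

A client that receives utf8_bytes only has to decode them as UTF-8 to
get the original name back.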

Geoffrey Spear

Jan 13, 2009, 12:52:16 PM
to Google App Engine
That sounds like a bug in the dev server datastore viewer. The
production datastore viewer handles Unicode just fine.

As for searching, my suggestion would be to store a normalized version
of whatever strings you want to be able to search, then apply the same
normalization to search strings before searching. I have no idea
whether a tool to do this normalization already exists; my experience
with encodings is almost entirely Perl-based.

Chris Tan

Jan 13, 2009, 1:18:40 PM
to Google App Engine
I've been using the code below for normalization. Any characters
without ASCII equivalents are stripped:

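import unicodedata

# data must be a unicode string, not a byte string
nfkd = unicodedata.normalize('NFKD', data)
# drop everything non-ASCII, including the accents split off by NFKD
normalized = nfkd.encode('ascii', 'ignore').lower()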

It has worked well so far for prefix suggest, but it fails when the
datastore query contains spaces. Ideas, anyone?
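
Wrapped up as a function, the snippet above might look like the sketch
below. The name fold_to_ascii is made up for illustration. The idea is
to apply the same function to the stored value and to the search
string, as suggested earlier in the thread. It assumes the input is
already a unicode string, and it does nothing about the problem with
spaces.

```python
import unicodedata

def fold_to_ascii(text):
    """Return a lowercase, ASCII-only form of a unicode string (hypothetical helper)."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Ignoring non-ASCII code points drops the combining marks, and also
    # any character that has no ASCII decomposition at all.
    ascii_bytes = decomposed.encode("ascii", "ignore")
    return ascii_bytes.decode("ascii").lower()

stored = fold_to_ascii(u"Fran\xe7ois P\xe9tillant")  # value saved for searching
query = fold_to_ascii(u"FRANCOIS")                   # what the user typed
```

Here stored comes out as "francois petillant" and query as "francois",
so a prefix comparison between the two succeeds.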

Tom

Jan 13, 2009, 3:55:52 PM
to Google App Engine
Very helpful!

Tom

Jan 13, 2009, 7:08:25 PM
to Google App Engine
SearchableModel source
http://code.google.com/p/googleappengine/source/browse/trunk/google/appengine/ext/search/__init__.py?r=27

I am going to attempt to make the SearchableModel handle adding the
normalized forms of words automatically. I'll report back with my
progress. Wish me luck!
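
One possible shape for this, as a sketch only: split the name into
words and index each word in both its original and its accent-stripped
form. The function names below (strip_accents, search_keywords) are
invented for the example and are not part of SearchableModel; wiring
this into the model itself is left out.

```python
import unicodedata

def strip_accents(word):
    # Hypothetical helper: NFKD-decompose, then drop anything non-ASCII.
    decomposed = unicodedata.normalize("NFKD", word)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def search_keywords(text):
    """Return the lowercase words of text plus their ASCII-folded forms."""
    keywords = set()
    for word in text.lower().split():
        keywords.add(word)
        folded = strip_accents(word)
        if folded:  # skip words that fold to an empty string
            keywords.add(folded)
    return keywords

keywords = search_keywords(u"Fran\xe7ois P\xe9tillant")
```

With both forms in the keyword set, a search for either "françois" or
"francois" would find the same entity.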