UnicodeEncodeError problems

Julien Phalip

Aug 7, 2008, 3:28:24 AM
to Django users
Hi,

There is a recurring problem that I can never get my head around,
probably because of my limited understanding of how Unicode works.

I fetch some text from the database and try to convert it to
Unicode, using force_unicode, before processing it further. It
generally works, but occasionally it chokes on some special
characters. The errors I get look like the following:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 15: ordinal not in range(128)

Before it was ported to MySQL, the database was built in MS Access.
As a result, some of the text contains Microsoft-specific encodings,
which I try to work around with the following:

    try:
        encoded_item = force_unicode(item, "utf-8")
    except UnicodeDecodeError:
        encoded_item = force_unicode(item, "cp1252")

Anyway, I don't know how to deal with these UnicodeEncodeErrors, and
I'd appreciate any pointers or hints to help me fix them. Any ideas?
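
For what it's worth, here is a minimal sketch of the same try-utf-8-then-cp1252 fallback as a standalone helper. It is written for Python 3, which is an assumption (the thread predates it), and `to_text` is a hypothetical name, not part of Django. Input that is already text is returned untouched: in Python 2, calling .decode() on an object that is already unicode makes Python encode it with the ascii codec first, which is one way to end up with a UnicodeEncodeError like the one quoted above.

```python
def to_text(raw, encodings=("utf-8", "cp1252")):
    """Return text for `raw`, trying each candidate encoding in order.

    `to_text` is a hypothetical helper for illustration only.
    """
    # Already decoded: decoding again is what must be avoided.
    if isinstance(raw, str):
        return raw
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            # Not valid in this encoding; try the next candidate.
            continue
    raise ValueError("could not decode with any of %r" % (encodings,))
```

utf-8 goes first on purpose: it is strict and rejects most cp1252 byte sequences, whereas cp1252 accepts almost any bytes, so trying it first would hide genuine utf-8 data.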

Thanks a lot,

Julien

Erik Allik

Aug 7, 2008, 8:18:45 AM
to django...@googlegroups.com
Hey,

I also occasionally get UnicodeDecodeError's. Right now I'm getting a
traceback when posting a form in the admin area. Everything works
perfectly when the form does not contain any non-ascii letters, but
when it does, it gives me a traceback (http://dpaste.com/69963/). I
have existing model objects in the database whose charfields contain
unicode data and they display just fine. It's just when I save unicode
data that I get this error. The error occurs before any SQL UPDATE
statements because the object isn't modified.

Erik

Karen Tracey

Aug 7, 2008, 9:32:48 AM
to django...@googlegroups.com
On Thu, Aug 7, 2008 at 8:18 AM, Erik Allik <eal...@gmail.com> wrote:

> Hey,
>
> I also occasionally get UnicodeDecodeError's. Right now I'm getting a
> traceback when posting a form in the admin area. Everything works
> perfectly when the form does not contain any non-ascii letters, but
> when it does, it gives me a traceback (http://dpaste.com/69963/). I
> have existing model objects in the database whose charfields contain
> unicode data and they display just fine. It's just when I save unicode
> data that I get this error. The error occurs before any SQL UPDATE
> statements because the object isn't modified.


Erik, this is a bug in Django.  I've opened ticket #8151 to get it fixed.  Basically some of the Unicode changes got lost on the way from trunk to the newforms-admin branch due to code moving from one file to another.


On 07.08.2008, at 10:28, Julien Phalip wrote:

>
> Hi,
>
> There is a recurrent problem that I can never get my head around,
> probably because of my limited understanding of how unicode works.
>
> There is some text fetched from the database that I try to convert to
> unicode, using force_unicode, before processing it further. It
> generally works but occasionally it chokes with some special
> characters. The type of errors I get is like the following:
>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in
> position 15: ordinal not in range(128)
>
> Before it was ported to MySQL, the database was made with MS Access.
> Therefore some text contains microsoft-proprietary encoding stuff,
> which I try to go around with the following:
>
>        try:
>            encoded_item = force_unicode(item, "utf-8")
>        except UnicodeDecodeError:
>            encoded_item = force_unicode(item, "cp1252")
>
> Anyway, I don't know how to deal with those UnicodeEncodeError, and
> I'd appreciate any pointers or hints to help me fix it. Any idea?
>


This is likely not a bug in Django; it looks more like you've got a bit of a mess in your DB data. You should not have to guess at the encoding of something pulled from the database. Actually, you should not have to manually force it to unicode at all. What is the output of 'show create table' in a mysql shell for the table in question?

(The fact that the DB started out as MS Access and was migrated to MySQL should not cause a problem. MySQL's default latin1 encoding is the same as Microsoft's cp1252, so unless you migrated it to a utf-8 table without actually doing a cp1252->utf-8 conversion, there should be no problem.)
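
To illustrate the kind of mismatch described above, here is a small Python 3 sketch (an assumption; the thread itself is Python 2). The sample string is made up for the example. It shows what happens when cp1252 bytes land somewhere that expects utf-8 without being converted, and what the opposite mistake looks like.

```python
# Hypothetical sample value; any text with a non-ASCII character would do.
original = "Rua do Pa\u00e7o, 67"

# Bytes as a cp1252 (MySQL "latin1") column would hold them.
stored = original.encode("cp1252")

# Reading those bytes as utf-8 without converting them first fails.
try:
    stored.decode("utf-8")
    readable_as_utf8 = True
except UnicodeDecodeError:
    readable_as_utf8 = False

# A real conversion decodes with the source encoding, then re-encodes.
converted = stored.decode("cp1252").encode("utf-8")

# The opposite mistake (utf-8 bytes read as cp1252) raises nothing;
# it silently yields two wrong characters in place of the one original.
garbled = converted.decode("cp1252")
```

The second case is the nastier one in practice, since nothing raises and the damage only shows up when someone looks at the text.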

Karen

Erik Allik

Aug 7, 2008, 12:59:12 PM
to django...@googlegroups.com
Thanks for the reply, Karen!

Does anyone know of a workaround until it gets fixed? It's really a blocker right now.

Erik

Karen Tracey

Aug 7, 2008, 1:01:26 PM
to django...@googlegroups.com
On Thu, Aug 7, 2008 at 12:59 PM, Erik Allik <eal...@gmail.com> wrote:
> Thanks for the reply, Karen!
>
> Does anyone know of a workaround until it gets fixed? It's really a blocker right now.

Try using the patch provided in the ticket noted as a dup.

Karen

n00m

Aug 8, 2008, 6:34:42 AM
to Django users

I think there is a bit of a misunderstanding here about encoding and
decoding. Here is how I do it. Below is a simple web utility for
interacting with a Sqlite3 db, namely Northwind.sl3. All its text
fields are in cp1252. The web page itself AND its Python code
(def sqlite3(request)) are in utf-8.
Note the 2 lines with decode().encode(). It all works perfectly.
Try playing with those lines, commenting/uncommenting them, and you'll
see the difference.


# -*- coding: utf-8 -*-
import os

from django.http import HttpResponse

def sqlite3(request):
    res = ''
    que = "select ShipAddress from Orders where ShipAddress='Rua do Paço, 67';"
    if request.method == 'POST':
        pth = 'D:/"Program Files"/"Apache Software Foundation"/Apache2.2/htdocs/deksite'
        q = request.POST['query']
        que = q
        q = q.replace('\n', ' ')
        q = q.strip()
        # The page submits utf-8; the db text is cp1252.
        q = q.decode('utf-8').encode('cp1252')
        res = os.popen(pth + '/sqlite3.exe ' + pth + '/Northwind.sl3 "' + q + '"', 'r').readlines()
        res = ''.join(res)
        # The db output is cp1252; the page is served as utf-8.
        res = res.decode('cp1252').encode('utf-8')
    return HttpResponse(
'''
<html>
<head>
<title>sqlite3</title>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
</head>
<body>
<center>
<form name="fm" method="post" action="">
<textarea name="query" cols="80" rows="15">
%s
</textarea>
<br>
<input type="submit" name="sbt" value="Послать запрос!">
<br>
</form>
<textarea name="res" wrap="off" style="font-size:8pt;" cols="160" rows="30">
%s
</textarea>
</center>
</body>
</html>
''' % (que, res,))

n00m

Aug 8, 2008, 6:57:03 AM
to Django users

If anyone needs the Northwind.sl3 file, I can email it.

The sqlite3.exe shell utility can be downloaded from the sqlite.org site.

n00m

Aug 8, 2008, 7:02:27 AM
to Django users

Note the hardcoded SQL query and its non-ASCII ç:

select ShipAddress from Orders where ShipAddress='Rua do Paço, 67';

Now it's in utf-8. Before sending it to the db, it should be converted
to cp1252:

q = q.decode('utf-8').encode('cp1252')

And vice versa: the result set coming back from the db is in cp1252, so
to show it properly in a utf-8 browser it should be converted to utf-8:

res = res.decode('cp1252').encode('utf-8')
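
The two conversions described above can be sketched in Python 3 as follows. This is an assumption-laden illustration, not the code from the view: in Python 3 only bytes have .decode(), so the sketch works on bytes throughout, and `transcode` is a hypothetical helper name.

```python
def transcode(data, source, target):
    """Re-encode bytes from one named encoding to another."""
    return data.decode(source).encode(target)

# The query as the utf-8 page would submit it (ç takes two bytes in utf-8).
query_utf8 = (
    "select ShipAddress from Orders "
    "where ShipAddress='Rua do Pa\u00e7o, 67';"
).encode("utf-8")

# utf-8 -> cp1252 before handing the query to the db (ç becomes one byte).
query_cp1252 = transcode(query_utf8, "utf-8", "cp1252")

# cp1252 -> utf-8 for anything that comes back and is shown on the page.
back_to_utf8 = transcode(query_cp1252, "cp1252", "utf-8")
```

Both directions go through decoded text in the middle; there is no direct bytes-to-bytes shortcut, which is why each conversion names two encodings.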

n00m

Aug 8, 2008, 7:51:01 AM
to Django users

Self-quote:
<<All its text fields are in cp1252.>>

That's not quite a correct statement.
From the machine's point of view it's meaningless.

Better to say:
I, a human, know that those bytes should be interpreted
according to code page 1252, i.e. decoded as cp1252.

That's why we must *explicitly* name the proper code page
when converting:
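
A small sketch of that point, in Python 3 (an assumption; the snippets in this thread are Python 2): the bytes themselves carry no label, so the same byte string turns into different text depending on which code page is named. The sample bytes are made up for the example.

```python
# The same raw bytes; 0xE7 is the only non-ASCII byte.
raw = b"Rua do Pa\xe7o, 67"

# Interpreted as Western European cp1252, 0xE7 is "ç".
as_cp1252 = raw.decode("cp1252")

# Interpreted as Cyrillic cp1251, the very same byte is "з".
as_cp1251 = raw.decode("cp1251")
```

Neither call is wrong as far as Python is concerned; only knowledge of where the data came from tells you which result is the intended one.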