Django, MySQL, unicode

Tracy Reed

Sep 7, 2009, 5:03:58 AM
to django...@googlegroups.com
I have a Django app which processes emails. It is often handed emails
with unicode characters in them. My understanding is that Python and
Django handle unicode just fine and somewhat transparently. I was,
however, told that I need to set my database tables to UTF-8
encoding. I have done this. Yet I still frequently get errors such as
this when my app encounters unicode:

Traceback (most recent call last):

File "/usr/lib/python2.4/site-packages/django/core/handlers/base.py", line 92, in get_response
response = callback(request, *callback_args, **callback_kwargs)

File "/var/spool/filter/email_archive/store_emails/views.py", line 84, in mail_detail
return render_to_response('mail_detail.html', {'mail': ourmail,

File "/usr/lib/python2.4/site-packages/django/shortcuts/__init__.py", line 20, in render_to_response
return HttpResponse(loader.render_to_string(*args, **kwargs), **httpresponse_kwargs)

File "/usr/lib/python2.4/site-packages/django/template/loader.py", line 108, in render_to_string
return t.render(context_instance)

File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 178, in render
return self.nodelist.render(context)

File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 779, in render
bits.append(self.render_node(node, context))

File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 792, in render_node
return node.render(context)

File "/usr/lib/python2.4/site-packages/django/template/loader_tags.py", line 97, in render
return compiled_parent.render(context)

File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 178, in render
return self.nodelist.render(context)

File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 779, in render
bits.append(self.render_node(node, context))

File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 792, in render_node
return node.render(context)

File "/usr/lib/python2.4/site-packages/django/template/loader_tags.py", line 24, in render
result = self.nodelist.render(context)

File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 779, in render
bits.append(self.render_node(node, context))

File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 792, in render_node
return node.render(context)

File "/usr/lib/python2.4/site-packages/django/template/defaulttags.py", line 243, in render
return self.nodelist_true.render(context)

File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 779, in render
bits.append(self.render_node(node, context))

File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 792, in render_node
return node.render(context)

File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 831, in render
return _render_value_in_context(output, context)

File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 811, in _render_value_in_context
value = force_unicode(value)

File "/usr/lib/python2.4/site-packages/django/utils/encoding.py", line 92, in force_unicode
raise DjangoUnicodeDecodeError(s, *e.args)

DjangoUnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in
position 1468: unexpected code byte. You passed in "\nGood
Day,\n\n\n\nWe offer a part time job on your computer.

<text of spam containing unicode deleted>

There is a 0x92 in position 1468 just as the error says.

Do I need to be doing a .encode('utf-8') before putting anything into
the db? I cannot seem to get a clear answer on this. Some say no, some
say yes. Do I need to do any decoding or anything on data pulled out
of the db? I have been told that MySQL should be handling all of this
for me.

I have been banging my head on this particular error off and on for a
couple of weeks and cannot seem to find the solution.

Any pointers appreciated.

--
Tracy Reed
http://tracyreed.org

ray

Sep 7, 2009, 5:33:44 AM
to Django users
Hi there

We have an app that processes XML from a third-party web service.

We were constantly getting decoding errors, so we now use BeautifulSoup
to tidy the XML up before any processing.

best wishes

Ray

Tracy Reed

Sep 7, 2009, 11:11:31 PM
to django...@googlegroups.com
On Mon, Sep 07, 2009 at 02:33:44AM -0700, ray spake thusly:

> We were constantly getting decoding errors so we now use BeautifulSoup
> to tidy the xml up before any processing.

Wish I could implement a solution like that but I am just pulling
times out of emails, not processing (X)HTML. I am amazed that over the
course of almost three weeks poking around on IRC, in the docs, and on
this mailing list nobody knows the answer. The most useful info I have
received so far is "set your database to use utf-8", "django handles
unicode for you", and "your code is broken". Indeed it is. If only I
could figure out why and what the proper way to handle unicode with
django and MySQL is.

Jan Ostrochovsky

Sep 8, 2009, 2:47:09 AM
to Django users
Hi Tracy,

Less than two months ago, we were trying to use MySQL in our Django
project (because we had not managed to install psycopg2, the
PostgreSQL adapter for Python, on Mac OS X).

We had problems similar to yours, and the only solution we found, ugly
as it was, was to add encode (or decode? I do not remember exactly)
calls in many places in the code, usually in __unicode__() definitions.

There were also other problems with MySQL, e.g. Django Evolution
does not work as well with MySQL as with PostgreSQL.

Happily, in the end we were successful with the psycopg2 installation,
and we are very satisfied now that we use PostgreSQL instead of MySQL.

Really, my personal experience is: Django and Django Evolution run
better on PostgreSQL. If you decide to go this way and need some
support with PostgreSQL, do not hesitate to contact me directly.
Good luck!

Jano

Karen Tracey

Sep 8, 2009, 7:32:08 AM
to django...@googlegroups.com
On Mon, Sep 7, 2009 at 5:03 AM, Tracy Reed <tr...@ultraviolet.org> wrote:
> I have a Django app which processes emails. It is often handed emails
> with unicode characters in them. My understanding is that Python and
> Django handle unicode just fine and somewhat transparently. I was,
> however, told that I need to set my database tables to UTF-8
> encoding. I have done this. Yet I still frequently get errors such as
> this when my app encounters unicode:


In some places here you are using the term 'unicode' where non-ASCII would be more correct.  The emails your code is handed, for example, contain non-ASCII characters.  These emails are not packaged as Python unicode strings (they cannot be, if they are coming from outside Python), they are bytestrings.  In order to successfully turn them into unicode objects the correct encoding of the email bytestring must be known.  The exception you include below shows Django attempting to convert an email bytestring into a unicode object, assuming the bytestring is utf-8 encoded.  This is failing, so apparently the email bytestring is using some other encoding.
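As a small illustration of why the decode fails, here is a sketch written for Python 3, where `bytes` is the bytestring type and `str` is the unicode type (under the Python 2.4 in this thread the equivalents are `str` and `unicode`). The sample bytes are made up. The choice of cp1252 is only an assumption: a 0x92 byte is a common sign of Windows-1252 text, where it is the right single quotation mark, but nothing in the thread confirms what encoding the emails actually use.

```python
# Made-up sample: an apostrophe stored as the single byte 0x92.
raw = b"It\x92s a part time job"

# 0x92 on its own is not a valid UTF-8 sequence, so a UTF-8 decode fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("utf-8 failed:", exc)

# Assumption: the bytes are really Windows-1252 (cp1252).
# In that encoding 0x92 is U+2019, RIGHT SINGLE QUOTATION MARK.
text = raw.decode("cp1252")
print(repr(text))
```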

Python/Django handling unicode "transparently" is a bit of an optimistic hope.  Python has unicode support, and what Django attempts to do is take bytestrings at boundary points and convert to and from unicode objects so that your application code never has to deal with bytestrings but rather always has unicode strings.  So Django will take bytestrings from the database and bytestrings from web clients and convert them to unicode before handing them to your application code.  Similarly it will accept unicode from your application and convert to bytestrings for sending outside the boundary (back to the DB or out as a client response).

Django does not require, however, that your application only use unicode strings -- you are free to hand Django functions bytestrings.  When given a bytestring, though, Django has to make some assumption about what encoding the bytestring is using.  The problem with bytestrings is that they do not carry around with them any encoding information.  What Django does when handed a bytestring is assume it is utf-8 encoded.  If your application hands Django a bytestring that is not utf-8 encoded, you'll get errors like the one you include below.

Someone else responding on this thread mentioned BeautifulSoup fixing problems like this.  My understanding (I don't have time to verify at the moment) is that BeautifulSoup either detects the encoding by examining the bytes and guessing what the proper encoding may be, or tries different encodings until one works.  Django does not do this -- it simply assumes if your code hands in a bytestring that it is utf-8 encoded.  Thus if you have non-utf8 encoded bytestrings you are dealing with (as you apparently do) you will need to convert them to unicode before handing them to Django.
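The "try different encodings until one works" idea can be sketched by hand. This is not BeautifulSoup's detection logic and not anything Django provides; `decode_with_fallback` and its default candidate list are hypothetical names chosen for illustration, and the code is written for Python 3 (`bytes` in, `str` out).

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    The candidate list is an assumption; order matters, because a later
    encoding is only tried when every earlier one raised an error.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so it never
    # fails, though the result may be wrong for text in another encoding.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_with_fallback(b"It\x92s a part time job")
print(used, repr(text))
```

A clean decode only shows that the bytes are valid in that encoding, not that the guess is right, so an encoding declared by the data source should take priority over guessing whenever one is available.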

That, of course, just pushes the problem back onto you, and you will have to now figure out what encoding these things are using.  Perhaps someone on this list can help with that, but you haven't provided enough information to really help here.  All you have said is that your app is "handed emails".  All I can tell you about those emails, based on the traceback below, is that they are bytestrings and they are not utf-8 encoded.  If you show some of your code that is receiving the emails perhaps someone can provide more guidance on how to transform the email bytestrings into unicode.
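One place to look for the encoding is the message itself: a MIME email usually declares a charset in its Content-Type header. The sketch below uses the standard library `email` package in Python 3 to read that declaration and decode the body with it. The sample message, the `body_as_text` helper, and the UTF-8 fallback are all assumptions made for illustration; the code also assumes a single-part message, since a multipart message has no payload of its own to decode.

```python
import email

# Made-up single-part message that declares its charset.
RAW = (
    b"From: sender@example.com\r\n"
    b'Content-Type: text/plain; charset="windows-1252"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"It\x92s a part time job.\r\n"
)


def body_as_text(raw_message, fallback="utf-8"):
    """Decode a single-part message body using its declared charset."""
    msg = email.message_from_bytes(raw_message)
    # decode=True undoes the transfer encoding (base64, quoted-printable)
    # and returns the body as bytes.
    payload = msg.get_payload(decode=True)
    # get_content_charset() returns None when no charset is declared.
    charset = msg.get_content_charset() or fallback
    try:
        return payload.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: fall back, marking bad bytes
        # with U+FFFD instead of raising.
        return payload.decode(fallback, "replace")


print(repr(body_as_text(RAW)))
```

Spam in particular often carries a missing or wrong charset label, so the fallback branch is likely to be exercised in practice.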

 
> Traceback (most recent call last):
>
>   File "/usr/lib/python2.4/site-packages/django/core/handlers/base.py", line 92, in get_response
>     response = callback(request, *callback_args, **callback_kwargs)
>
>   File "/var/spool/filter/email_archive/store_emails/views.py", line 84, in mail_detail
>     return render_to_response('mail_detail.html', {'mail': ourmail,
>
>   File "/usr/lib/python2.4/site-packages/django/shortcuts/__init__.py", line 20, in render_to_response
>     return HttpResponse(loader.render_to_string(*args, **kwargs), **httpresponse_kwargs)
>
>   [snip bunches of template context traceback]
>
>   File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 831, in render
>     return _render_value_in_context(output, context)
>
>   File "/usr/lib/python2.4/site-packages/django/template/__init__.py", line 811, in _render_value_in_context
>     value = force_unicode(value)
>
>   File "/usr/lib/python2.4/site-packages/django/utils/encoding.py", line 92, in force_unicode
>     raise DjangoUnicodeDecodeError(s, *e.args)
>
> DjangoUnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in
> position 1468: unexpected code byte. You passed in "\nGood
> Day,\n\n\n\nWe offer a part time job on your computer.
>
> <text of spam containing unicode deleted>
>
> There is a 0x92 in position 1468 just as the error says.
>
> Do I need to be doing a .encode('utf-8') before putting anything into
> the db? I cannot seem to get a clear answer on this. Some say no, some
> say yes. Do I need to do any decoding or anything on data pulled out
> of the db? I have been told that MySQL should be handling all of this
> for me.


Note the boundary your traceback is dealing with here is not the database; the traceback shows Django trying to render something in a template for a response.  Whatever code path you are following here involves your template trying to render a non-utf8 bytestring.  It's running into trouble because Django is attempting to convert the bytestring to unicode assuming utf-8 encoding.  You are not dealing with the database boundary here.

But to answer the database question: no, you do not have to encode/decode anything at the database boundary.  Django handles that for you.  The only exception here is if you are using a binary collation on MySQL then Django is not able to do the bytestring/unicode conversion.  See: http://docs.djangoproject.com/en/dev/ref/databases/#collation-settings
 
> I have been banging my head on this particular error off and on for a
> couple of weeks and cannot seem to find the solution.
>
> Any pointers appreciated.


For further help you will need to give some more information about how your code is getting these emails. 

Karen

mechanix

Sep 22, 2009, 10:27:22 AM
to Django users
I recently faced the same thing while developing a Django app
that also processes emails.
I found a quick fix: unicode() accepts an errors argument that tells it
what to do when it stumbles upon an invalid character:

email.mail_from = unicode(email['From'], errors = 'ignore')

Possible values for "errors" are 'strict' (the default, which raises an
exception), 'ignore' (which just removes the invalid character from the
string) and 'replace' (which replaces the invalid character with U+FFFD).
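To make the difference between the three modes concrete, here is a sketch in Python 3, where the same errors argument is accepted by `bytes.decode`. The sample bytes are made up and UTF-8 is named explicitly as the codec. Note that the Python 2 call above passes no encoding, so it decodes with the interpreter's default encoding (normally ASCII), which means 'ignore' would drop every non-ASCII byte, not just malformed ones.

```python
# Made-up sample containing one byte (0x92) that is invalid in UTF-8.
raw = b"It\x92s a job"

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD (REPLACEMENT CHARACTER) for it.
replaced = raw.decode("utf-8", errors="replace")

# 'strict' (the default) raises UnicodeDecodeError.
try:
    raw.decode("utf-8", errors="strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

print(repr(ignored), repr(replaced), strict_raised)
```

Both 'ignore' and 'replace' lose information, so they trade the exception for silently altered text; decoding with the right encoding keeps the original character.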