Unicode vs SQLAlchemy vs MySQL and the dreaded UnicodeDecodeError

617 views
Skip to first unread message

Bob Tanner

unread,
Mar 31, 2011, 12:51:20 AM3/31/11
to turbo...@googlegroups.com
I must be missing something simple or fundamental when it comes to TG2 and Unicode. Every time I step out of the ASCII world the UnicodeDecodeError hits me.

The table as dumped from MySQL

--
-- Table structure for table `resourceitem`
--

DROP TABLE IF EXISTS `resourceitem`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `resourceitem` (
  `resourceid` binary(16) NOT NULL,
  `item` int(11) NOT NULL,
  `description` text,
  `title` varchar(2000) DEFAULT NULL,
  `publisher` varchar(2000) DEFAULT NULL,
  `creator` varchar(2000) DEFAULT NULL,
  `language` varchar(10) DEFAULT NULL,
  `identifier` varchar(127) DEFAULT NULL,
  `thumbnailurl` varchar(254) DEFAULT NULL,
  `format` varchar(40) DEFAULT NULL,
  `src` varchar(254) DEFAULT NULL,
  `downloadtype` varchar(20) DEFAULT NULL,
  UNIQUE KEY `resourceitem_resourceid` (`resourceid`,`item`),
  KEY `resourceitem_creator` (`creator`(255)),
  KEY `resourceitem_title` (`title`(255)),
  KEY `resourceitem_thumbnailurl` (`thumbnailurl`),
  KEY `resourceitem_identifier` (`identifier`),
  KEY `resourceitem_src` (`src`),
  KEY `resourceitem_publisher` (`publisher`(255)),
  CONSTRAINT `resourceitem_ibfk_1` FOREIGN KEY (`resourceid`) REFERENCES `resourcekey` (`resourceid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;

The TG2 model

# -*- coding: utf-8 -*-
"""ResourceItem model module."""

from sqlalchemy import Column, Integer, String, Text, ForeignKey, LargeBinary, Unicode, UnicodeText, Table
from sqlalchemy.orm import relation
from reports.model import DeclarativeBase

__all__ = ['ResourceItemMode']

class ResourceItemModel(DeclarativeBase):
    __tablename__ = 'resourceitem'
    
    resourceid = Column(LargeBinary(16), nullable=False, primary_key=True)
    item = Column(Integer(11), nullable=False, primary_key=True)
    description = Column(Text, nullable=True)
    title = Column(Unicode(2000), nullable=True)
    publisher = Column(Unicode(2000), nullable=True)
    creator = Column(Unicode(2000), nullable=True)
    language = Column(Unicode(10), nullable=True)
    identifier = Column(Unicode(127), nullable=True)
    thumbnailurl = Column(Unicode(254), nullable=True)
    format = Column(Unicode(40), nullable=True)
    src = Column(String(254), nullable=True)
    downloadtype = Column(Unicode(20), nullable=True)

One of the creators is Antoine de Saint-Exupéry <http://en.wikipedia.org/wiki/Antoine_de_Saint-Exup%C3%A9ry> and when I do a mysql comment line query I get this:

mysql> select creator   from resourceitem where creator like '%Antoine%' ;
+---------------------------+
| creator                   |
+---------------------------+
| Saint-Exupéry, Antoine de |
+---------------------------+
1 row in set (0.00 sec)

Looks properly decoded and display. I set up a simple CrudRestController and did the boiler plate TG2 code:

class ResourceItemController(CrudRestController):
    model = ResourceItemModel
    
    class new_form_type(AddRecordForm):
        __model__ = ResourceItemModel

    class edit_form_type(EditableForm):
        __model__ = ResourceItemModel
        
    class edit_filler_type(EditFormFiller):
        __model__ = ResourceItemModel
        
    class table_type(TableBase):
        __model__ = ResourceItemModel
        
    class table_filler_type(TableFiller):
        __model__ = ResourceItemModel
        __omit_fields__ = ['__actions__']       

class RootController(BaseController):
    secc = SecureController()
    admin = AdminController(model, DBSession, config_type=TGAdminConfig)
    error = ErrorController()

    resource_items = ResourceItemController(DBSession)

Open web browser to http://127.0.0.1:8080/resource_items/ I get a Server Error, the dreaded UnicodeDecodeError on Antoine de Saint-Exupéry name.

I'm not sure what I did wrong.

Is the default template from sprox not <meta content="text/html; charset=UTF-8"/> by chance?

Any recommendations on how I can resolve my UnicodeDecodeError?

Thanks.

--
Bob Tanner <tan...@real-time.com>                 | Phone : 952-943-8700
http://www.real-time.com, Linux, OSX, VMware | Fax      : 952-943-8500
Key fingerprint = F785 DDFC CF94 7CE8 AA87 3A9D 3895 26F1 0DDB E378

PGP.sig

basicth...@gmail.com

unread,
Mar 31, 2011, 3:29:13 AM3/31/11
to TurboGears
> Open web browser tohttp://127.0.0.1:8080/resource_items/I get a Server Error, the dreaded UnicodeDecodeError on Antoine de Saint-Exupéry name.
>

Screenshot of error message that didn't make it here when I sent the
email:

https://img.skitch.com/20110331-tbcjjfxbwamxqprss426tumisc.jpg

Toshio Kuratomi

unread,
Mar 31, 2011, 11:03:45 PM3/31/11
to turbo...@googlegroups.com
From the error there, the data in the database may not be utf-8.
'\xe9' is the byte sequence for é in latin-1, not in utf-8.

If you can retrieve the data from the database and look at it in a hex editor
and it has the byte sequence 0xe9 there then the problem is the data stored
in the db. If you retrieve it that way and the hex sequence is 0xc3 0xa9
then the data in the database is correct and somewhere in the mysql python
bindings or sqlalchemy the data is going through a series of decode and
encode operations that result in it being encoded as latin-1 instead of
utf-8.

-Toshio

basicth...@gmail.com

unread,
Apr 3, 2011, 1:18:58 AM4/3/11
to TurboGears
> From the error there, the data in the database may not be utf-8.  
> '\xe9' is the byte sequence for é in latin-1, not in utf-8.
>
> If you can retrieve the data from the database and look at it in a hex editor
> and it has the byte sequence 0xe9 there then the problem is the data stored
> in the db.  If you retrieve it that way and the hex sequence is 0xc3 0xa9
> then the data in the database is correct and somewhere in the mysql python
> bindings or sqlalchemy the data is going through a series of decode and
> encode operations that result in it being encoded as latin-1 instead of
> utf-8.

Thank you very much for a response!

S a i n t - E x u p r y
53 61 69 6e 74 2d 45 78 75 70 c3 a9 72 79

0xc3 0xa9 in the database as viewed in a hex editor.

As I understand it the database is correct. Where do I go from here?

mysql python people?
sqlalchemy people?

What else can I do to make sure this isn't my code, my database?

Thanks.

basicth...@gmail.com

unread,
Apr 3, 2011, 2:28:56 AM4/3/11
to TurboGears
> Thank you very much for a response!
>
> S   a   i    n   t    -    E  x    u   p             r   y
> 53 61 69 6e 74 2d 45 78 75 70 c3 a9 72 79
>
> 0xc3 0xa9 in the database as viewed in a hex editor.

Made some progress.

Change development.ini to this:

sqlalchemy.url=mysql+mysqldb://user:pass...@127.0.0.1:3306/mydb?
charset=utf8

I get some warnings:

python2.6/site-packages/SQLAlchemy-0.6.6-py2.6.egg/sqlalchemy/engine/
url.py:195: PendingDeprecationWarning: cgi.parse_qsl is deprecated,
use urlparse.parse_qsl instead query = (len(tokens) > 1 and
dict(cgi.parse_qsl(tokens[1]))) or None

But I no longer get the UnicodeDecodeError. Still have other problems
but that is tracked in another post.

Carlos Daniel Ruvalcaba Valenzuela

unread,
Apr 5, 2011, 8:44:55 AM4/5/11
to turbo...@googlegroups.com
Funny, I discussed this same problem on IRC on the sqlalchemy channel
the other day, the solution was to add the options charset=utf8 and
use_unicode=0 to sqlalchemy.url, then force sqlalchemy to convert to
unicode (this fixes Genshi problems), with the option
sqlalchemy.convert_unicode = True.

The problem seems to be mysqldb connector, it has problems when
working with unicode, setting it explicitly to use utf8 helps to
keeping it from converting to ascii, and then you need to force
mysqldb to not use it's internal unicode functions (they have
problems, leaks, exceptions and whatnot), it will return a bytecode
string, which with genshi it gives trouble as is not a regular string,
you' ll need to convert it to unicode (something genshi can work with)
with something else, sqlalchemy can do the work for you, adding the
option convert_unicode.

It was a blast drilling though the documentation, mailing list,
traceback and code to find this. Bottom line, use another DB with a
proper DBAPI driver or use another driver for MySQL.

Regards,
Carlos Daniel Ruvalcaba Valenzuela

Moritz Schlarb

unread,
Aug 24, 2012, 3:13:04 AM8/24/12
to turbo...@googlegroups.com
Ahem...
I don't think changing a distributed Python library file is a solution in any way...

You can change the default encoding programatically, although it is not considered the best way:

You should try to find a better solution, which has to exist, since everyone is struggling with unicode...

Am Freitag, 24. August 2012 01:24:27 UTC+2 schrieb enc0de:
> I'm not sure what I did wrong.
>
>
> Is the default template from sprox not <meta content="text/html; charset=UTF-8"/> by chance?
>
>
> Any recommendations on how I can resolve my UnicodeDecodeError?
>
>
> Thanks.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> --
> Bob Tanner <tan...@real-time.com>                 | Phone : 952-943-8700
> http://www.real-time.com, Linux, OSX, VMware | Fax      : 952-943-8500
> Key fingerprint = F785 DDFC CF94 7CE8 AA87 3A9D 3895 26F1 0DDB E378

Listen I have been on this same problem it's not genshi or anython it's python itself just find the site.py in the python/Lib/site.py and look for

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()


you see the encoding = "ascii" change to look like this encoding = "utf-8" and the whole problem is gone.

Yes the solution is that EASY
Reply all
Reply to author
Forward
0 new messages