Problem storing utf-8 in MySQL

454 views
Skip to first unread message

Niklas Collin

unread,
Mar 21, 2008, 12:55:32 PM3/21/08
to django...@googlegroups.com
Hi, I'm trying to store a large utf-8 string with many different language
chars into a MySQL utf-8 table but get a strange exception: Incorrect string
value.

Does anyone know why this would happen? I'm really at my wits end about this
one. It seems rather standard to me?

Best regards,
Niklas

-----------------------------------
The code:

class Resource(models.Model):
text = models.TextField(default=u'', max_length=200, core=True)

f = open('%s/testdata/utf8.html' % os.path.dirname(__file__), 'rb') # Open
utf8-encoded file in binary mode (for window's benefit)

# utf8.html is a saved html dump from
http://www.columbia.edu/~kermit/utf8.html

utf8_string = f.read()
f.close()
exotic_text = unicode(utf8_string,'utf-8')

r = Resource(text = exotic_text )
r.save()

--------------------------------------
The exception

Exception raised:
Traceback (most recent call last):
File
"S:\nc\Projects\Inventive\Ticmate\svn\django-trunk\django\test\_doctest.py",
line 1267, in __run
compileflags, 1) in test.globs
File "<doctest ticmate.djangorelated.apps.resources.tests[84]>", line
1, in <module>
(r, is_created) = o.get_or_create_resource(u'os',
languages.by_iso['RU'].id, type = Resource.TYPE_HTML, text = exotic_text)
File
"S:\nc\Projects\Inventive\Ticmate\svn\ticmate\trunk\source\python\ticmate\dj
angorelated\apps\resources\models.py", line 159, in get_or_create_resource
return Resource.objects.get_or_create(language_id = language_id,
ownership = p, defaults=defaults)
File
"S:\nc\Projects\Inventive\Ticmate\svn\django-trunk\django\db\models\manager.
py", line 72, in get_or_create
return self.get_query_set().get_or_create(**kwargs)
File
"S:\nc\Projects\Inventive\Ticmate\svn\django-trunk\django\db\models\query.py
", line 291, in get_or_create
obj.save()
File
"S:\nc\Projects\Inventive\Ticmate\svn\django-trunk\django\db\models\base.py"
, line 264, in save
','.join(placeholders)), db_values)
File "C:\Python25\Lib\site-packages\MySQLdb\cursors.py", line 168, in
execute
if not self._defer_warnings: self._warning_check()
File "C:\Python25\Lib\site-packages\MySQLdb\cursors.py", line 82, in
_warning_check
warn(w[-1], self.Warning, 3)
File "c:\python25\lib\warnings.py", line 62, in warn
globals)
File "c:\python25\lib\warnings.py", line 102, in warn_explicit
raise message
Warning: Incorrect string value: '\xF0\x90\x8C\xBC\xF0\x90...' for
column 'text' at row 1


Karen Tracey

unread,
Mar 21, 2008, 3:48:43 PM3/21/08
to django...@googlegroups.com
On Fri, Mar 21, 2008 at 12:55 PM, Niklas Collin <nik...@inventive.se> wrote:

Hi, I'm trying to store a large utf-8 string with many different language
chars into a MySQL utf-8 table but get a strange exception: Incorrect string
value.

You're sure the table you are using has a utf8 encoding, verified via a 'show create table your_table_name" from a mysql session?  Maybe post the output of that 'show create table' just to see if anyone notices anything odd about it.


Does anyone know why this would happen? I'm really at my wits end about this
one. It seems rather standard to me?

Best regards,
Niklas

-----------------------------------
The code:

class Resource(models.Model):
   text = models.TextField(default=u'', max_length=200, core=True)

f = open('%s/testdata/utf8.html' % os.path.dirname(__file__), 'rb') # Open
utf8-encoded file in binary mode (for window's benefit)

# utf8.html is a saved html dump from
http://www.columbia.edu/~kermit/utf8.html

That's a big page.  Much bigger than will fit into a TextField of max_length 200.  So I'm confused -- do you just have a portion of this page saved to a file?  If you've got the whole thing I'd expect a "Data too long for column" error, not what you are getting.


utf8_string = f.read()
f.close()
exotic_text = unicode(utf8_string,'utf-8')

r = Resource(text = exotic_text )
r.save()

I can do exactly what you've coded for portions (that will fit in my text field) of the page you mention, and it works fine.  So I'm missing some essential info needed to recreate your error. 

Karen

Niklas Collin

unread,
Mar 21, 2008, 5:38:35 PM3/21/08
to django...@googlegroups.com
Hi, Karen
 
It's definitely a utf-8 table and I've increased the max_length but still seem to get the same problem. I've compiled information on the platform and a docstring test to easily reproduce this problem on which is included below or more easily browsed at:
 
 
Please let me know if I can provide you with better information and thanks a lot for looking at this since I'm really stumped at the moment :-)
 
Best regards,
Niklas
 
 
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
# -*- coding: utf-8 -*-
from django.db import models

class Utf8Test(models.Model):
    '''
    I'm developing on Windows but the db host is on FreeBSD:
    
    * My max_length of text field is now max_length=256000 
      and the loaded file is below 64k 
      
    * The SQL shows a utf-8 table    
    
    platform._sys_version()
        ('2.5.1', 
        'r251:54863', 
        'Apr 18 2007 08:51:08', 
        'MSC v.1310 32 bit (Intel)')
    platform._syscmd_ver()
        'Microsoft Windows', 'XP', '5.1.2600')        
    MySQLdb
        MySQL-python-1.2.2.win32-py2.5.exe
    MySQL server
        Server version: 5.0.45 FreeBSD port: mysql-server-5.0.45_1
    Django
        svn

     This is the SQL output by cursor.execute("show create table foo_utf8test;") during a unit test
    'CREATE TABLE `foo_utf8test` 
     (`id` int(11) NOT NULL auto_increment,
      `text` longtext NOT NULL,
      PRIMARY KEY  (`id`)) 
     ENGINE=MyISAM AUTO_INCREMENT=2 DEFAULT CHARSET=utf8')
    
    >>> import types
    >>> import os
    
    LOAD UTF-8 ENCODED FILE 
    
    >>> f = open('%s/../resources/testdata/utf8.html' % os.path.dirname(__file__), 'rb')     # Open utf8-encoded file in binary mode (for window's benefit)     # utf8.html is a saved html dump from http://www.columbia.edu/~kermit/utf8.html
    >>> utf8_string = f.read()
    >>> f.close()
    >>> stored_unicode_text = unicode(utf8_string,'utf-8')
    
    CREATE MODEL OBJECT AND SAVE
    
    >>> u = Utf8Test()
    >>> u.text = stored_unicode_text
    >>> u.save()

    RELOAD OBJECT FROM DB

    >>> u2 = Utf8Test.objects.get(id = 1)
    >>> u2.id == u.id    
    True
    >>> retrieved_text = u2.text
    
    COMPARE DATA

    >>> type(retrieved_text) == types.UnicodeType
    True
    >>> type(stored_unicode_text) == types.UnicodeType    
    True
    >>> retrieved_text == stored_unicode_text    
    True   
    '''
    text = models.TextField(default=u'', max_length=256000)

#======================================================================
#FAIL: Doctest: ticmate.djangorelated.apps.foo.models.Utf8Test
#----------------------------------------------------------------------
#Traceback (most recent call last):
#  File "S:\nc\Projects\Inventive\Ticmate\svn\django-trunk\django\test\_doctest.py", line 2180, in runTest
#    raise self.failureException(self.format_failure(new.getvalue()))
#AssertionError: Failed doctest test for ticmate.djangorelated.apps.foo.models.Utf8Test
#  File "S:\nc\Projects\Inventive\Ticmate\svn\ticmate\trunk\source\python\ticmate\djangorelated\apps\foo\models.py", line 10, in Utf8Test
#
#----------------------------------------------------------------------
#File "S:\nc\Projects\Inventive\Ticmate\svn\ticmate\trunk\source\python\ticmate\djangorelated\apps\foo\models.py", line 42, in ticmate.djangorelated.apps.foo.models.Utf8Test
#Failed example:
#    u.save()
#Exception raised:
#    Traceback (most recent call last):
#      File "S:\nc\Projects\Inventive\Ticmate\svn\django-trunk\django\test\_doctest.py", line 1267, in __run
#        compileflags, 1) in test.globs
#      File "<doctest ticmate.djangorelated.apps.foo.models.Utf8Test[8]>", line 1, in <module>
#   Destroying test database...
#     u.save()
#      File "S:\nc\Projects\Inventive\Ticmate\svn\django-trunk\django\db\models\base.py", line 264, in save
#        ','.join(placeholders)), db_values)
#      File "C:\Python25\Lib\site-packages\MySQLdb\cursors.py", line 168, in execute
#        if not self._defer_warnings: self._warning_check()
#      File "C:\Python25\Lib\site-packages\MySQLdb\cursors.py", line 82, in _warning_check
#        warn(w[-1], self.Warning, 3)
#      File "c:\python25\lib\warnings.py", line 62, in warn
#        globals)
#      File "c:\python25\lib\warnings.py", line 102, in warn_explicit
#        raise message
#    Warning: Incorrect string value: '\xF0\x90\x8C\xBC\xF0\x90...' for column 'text' at row 1
#----------------------------------------------------------------------
#File "S:\nc\Projects\Inventive\Ticmate\svn\ticmate\trunk\source\python\ticmate\djangorelated\apps\foo\models.py", line 47, in ticmate.djangorelated.apps.foo.models.Utf8Test
#Failed example:
#    u2.id == u.id    
#Expected:
#    True
#Got:
#    False
#----------------------------------------------------------------------
#File "S:\nc\Projects\Inventive\Ticmate\svn\ticmate\trunk\source\python\ticmate\djangorelated\apps\foo\models.py", line 57, in ticmate.djangorelated.apps.foo.models.Utf8Test
#Failed example:
#    retrieved_text == stored_unicode_text    
#Expected:
#    True   
#Got:
#    False
#
#
#----------------------------------------------------------------------
#Ran 7 tests in 3.141s
#
#FAILED (failures=1)
 


From: django...@googlegroups.com [mailto:django...@googlegroups.com] On Behalf Of Karen Tracey
Sent: den 21 mars 2008 20:49
To: django...@googlegroups.com
Subject: Re: Problem storing utf-8 in MySQL

Karen Tracey

unread,
Mar 21, 2008, 7:19:31 PM3/21/08
to django...@googlegroups.com
On Fri, Mar 21, 2008 at 5:38 PM, Niklas Collin <nik...@inventive.se> wrote:
Hi, Karen
 
It's definitely a utf-8 table and I've increased the max_length but still seem to get the same problem. I've compiled information on the platform and a docstring test to easily reproduce this problem on which is included below or more easily browsed at:
 
 
Please let me know if I can provide you with better information and thanks a lot for looking at this since I'm really stumped at the moment :-)

OK, got it now.  Turns out the answer is in the clear-as-mud error message, sort of:


Warning: Incorrect string value: '\xF0\x90\x8C\xBC\xF0\x90...'

x'f0908cbc' is a valid 4-byte UTF-8 value, only MySQL doesn't support 4-byte UTF-8.  Its 'utf8' character set only supports up to 3-byte UTF-8, see:

http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html

Support for 4-byte UTF-8 is listed near the top of what's coming in 6.0:

http://dev.mysql.com/doc/refman/6.0/en/mysql-nutshell.html

I've seen references to turning off strict mode in MySQL to get rid of problems like this, but haven't played with that at all.  I'm not sure what MySQL will do with the problem data if you turn off strict mode.

If you want to just remove the problematic 4-byte UTF-8 codes from the file, they're only present in the Gothic (line 47) and 2nd Vietnamese (line 123) of the "I Can Eat Glass" translations.  Without those two translations MySQL accepts the data.

Karen

Niklas Collin

unread,
Mar 23, 2008, 8:47:04 AM3/23/08
to django...@googlegroups.com
Thanks a million, Karen.
 
I can't believe how quickly you found the problem!
 
We're going to support japanese and russian but we'll see later if they pose a problem and upgrade the db in that case.
 
Thanks for being so helpful. I wish I could buy you a beer sometime but you probably live nowhere near Sweden :-)
 
Best regards,
Niklas


From: django...@googlegroups.com [mailto:django...@googlegroups.com] On Behalf Of Karen Tracey
Sent: den 21 mars 2008 20:49
To: django...@googlegroups.com
Subject: Re: Problem storing utf-8 in MySQL

Malcolm Tredinnick

unread,
Mar 23, 2008, 9:07:40 AM3/23/08
to django...@googlegroups.com

On Sun, 2008-03-23 at 13:47 +0100, Niklas Collin wrote:
> Thanks a million, Karen.
>
> I can't believe how quickly you found the problem!
>
> We're going to support japanese and russian but we'll see later if
> they pose a problem and upgrade the db in that case.

One of the reasons MySQL doesn't support 4-byte UTF-8 codepoints is
because they're not needed for pretty much all practical purposes.
Except for some very ancient scripts and a few of the very rare CJK
characters, everything you'll need is inside the Basic Multilingual
Plane (which requires only three bytes in UTF-8). Not having access to
those rare CJK codepoints isn't going to be a showstopper; they're rare
for a reason.

Malcolm

--
Save the whales. Collect the whole set.
http://www.pointy-stick.com/blog/

Rob Hudson

unread,
Apr 13, 2008, 4:28:46 PM4/13/08
to Django users
On Mar 21, 4:19 pm, "Karen Tracey" <kmtra...@gmail.com> wrote:
> OK, got it now.  Turns out the answer is in the clear-as-mud error message,
> sort of:
>
> Warning: Incorrect string value: '\xF0\x90\x8C\xBC\xF0\x90...'
>
> x'f0908cbc' is a valid 4-byte UTF-8 value, only MySQL doesn't support 4-byte
> UTF-8.  Its 'utf8' character set only supports up to 3-byte UTF-8, see:
>
> http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html
>
> Support for 4-byte UTF-8 is listed near the top of what's coming in 6.0:
>
> http://dev.mysql.com/doc/refman/6.0/en/mysql-nutshell.html
>
> I've seen references to turning off strict mode in MySQL to get rid of
> problems like this, but haven't played with that at all.  I'm not sure what
> MySQL will do with the problem data if you turn off strict mode.
>
> If you want to just remove the problematic 4-byte UTF-8 codes from the file,
> they're only present in the Gothic (line 47) and 2nd Vietnamese (line 123)
> of the "I Can Eat Glass" translations.  Without those two translations MySQL
> accepts the data.
>
> Karen

For those of us having a very similar problem elsewhere, what's a good
general solution?

1. Check that the table is indeed utf8.
2. ???

Does attempting to re-encode the UTF8 string work? Try the mysql
strict mode thing you mentioned?

Thanks,
Rob

Malcolm Tredinnick

unread,
Apr 13, 2008, 9:26:12 PM4/13/08
to django...@googlegroups.com
Rob,

On Sun, 2008-04-13 at 13:28 -0700, Rob Hudson wrote:
[...]


> For those of us having a very similar problem elsewhere, what's a good
> general solution?

Depends on the general problem. :-)

Do you mean that you are coming across a case of trying to store 4-byte
UTF-8 sequences in MySQL? Or do you mean that you've managed to create a
database table with the incorrect encoding and only realised this after
you've stored data in the table? Or is there a third possibility?

Regards,
Malcolm

--
Why be difficult when, with a little bit of effort, you could be
impossible.
http://www.pointy-stick.com/blog/

Rob Hudson

unread,
Apr 14, 2008, 12:30:06 AM4/14/08
to Django users
On Apr 13, 6:26 pm, Malcolm Tredinnick <malc...@pointy-stick.com>
wrote:
> Do you mean that you are coming across a case of trying to store 4-byte
> UTF-8 sequences in MySQL? Or do you mean that you've managed to create a
> database table with the incorrect encoding and only realised this after
> you've stored data in the table? Or is there a third possibility?

Thanks for the follow up...

In my particular case I have an old database with content that has
built up over the years from a PHP website. I'm re-building the site
in Django and am working on a database migration script. For various
text fields in the old tables I was hitting this same error,
"Incorrect string value: '\x[something or other]' for column
'[something]' at row 1".

I actually ended solving it by telling MySQLdb to "use_unicode", e.g.:
db = MySQLdb.connect(host=OLD_DB_HOST, user=OLD_DB_USER,
passwd=OLD_DB_PASS, db=OLD_DB_NAME, use_unicode=True)

I believe the old tables were latin-1 encoded so it seems like this
forces them to be unicode encoded which made pushing into the new
tables as utf8 work correctly?

Thanks,
Rob

Malcolm Tredinnick

unread,
Apr 14, 2008, 12:40:42 AM4/14/08
to django...@googlegroups.com

Might work. Depends if it really is all latin1 encoded or really UTF-8
that has been shoved into a latin1 table (PHP has known problems with
trying to do that, apparently).

This thread -- [1] -- might be of some interest to you if you're
looking to re-encode your data. Both the suggestions there are
essentially the same process.

[1]
http://groups.google.com/group/django-users/browse_frm/thread/7f7b48dae87f38b4/73d0954bc8ea8673#73d0954bc8ea8673

Regards,
Malcolm

--
Remember that you are unique. Just like everyone else.
http://www.pointy-stick.com/blog/

Reply all
Reply to author
Forward
0 new messages