UNICODE database API How to


yml

Apr 17, 2006, 9:01:00 AM
to Django users
Hello Djangonauts,

I am trying to import non-ASCII data stored in a file. To do so, I am
using a regular expression to parse the lines of the file in order to
separate the "department number" from the department name. This is
working fine. My problem comes when I try to use Django's DB API
to save the data in the model described below.
The problem happens when matches[1] contains non-ASCII characters like
éàçàè...
r=Department(department=matches[1],department_number=matches[0],country="France")
r.save()
I guess that I should add something to my script,
loader_departments.py, in order to support this.
It would be great if someone could explain what.
Thank you for your help.


=========
models
=========
class Department(meta.Model):
    department = meta.CharField(maxlength=30)
    department_number = meta.IntegerField()
    country = meta.CharField(maxlength=30)

    def __repr__(self):
        return self.department + " " + str(self.department_number)

    class META:
        admin = meta.Admin(
            list_display = ('department','department_number','country'),
        )


The file where my data is stored looks like this:
==========
departments.txt
============
89 Yonne

90 Territoire de Belfort

91 Essone

92 Hauts de Seine

93 Seine Saint-Denis

94 Val de Marne

95 Val d'Oise

971 Guadeloupe

972 Martinique

973 Guyane

974 Réunion

=============
loader_departments.py
==============

import os,codecs
from django.models.announceManager import *

os.chdir(os.path.abspath("E:\\instal\\django\\view_servicealapersonne\\votreservice\\_initialLoad"))
f = codecs.open("departments.txt",encoding='utf-8')

import re


regexobj = re.compile("([0-9]+)\s+([\w\s?]+)",re.UNICODE)

for l in f.xreadlines():
    print "the current line is :"+l
    try:
        matches = regexobj.search(l).groups()
        print "matches " +str(matches)

        r=Department(department=matches[1],department_number=matches[0],country="France")
        r.save()
        print "ok"
    except:
        print "ko"

f.close()

Ivan Sagalaev

Apr 17, 2006, 10:10:23 AM
to django...@googlegroups.com
yml wrote:

> I guess that I should add something to my script called
>loader_departments.py in order to support this.
>
>

Yes. You are getting the contents of your file as unicode, but Django's ORM
expects byte strings. So you have to encode them into
whatever encoding your DB uses right before you pass them to a model
constructor.

This:

>r=Department(department=matches[1],department_number=matches[0],country="France")
> r.save()
>
>

becomes this (assuming your DB is in utf-8):

r=Department(
    department=matches[1].encode('utf-8'),
    department_number=matches[0].encode('utf-8'),
    country="France"
)
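
For anyone reading this on Python 3, where text is `str` and encoded data is `bytes`, the encode step described above can be sketched roughly like this. It only illustrates the conversion itself; `to_db_bytes` is a hypothetical helper name, not part of Django, and utf-8 is assumed as the database encoding.

```python
# Python 3 sketch: text is `str`, encoded data is `bytes`.
# `to_db_bytes` is a hypothetical helper, not a Django function.

def to_db_bytes(text, db_encoding="utf-8"):
    """Encode a text string into the bytes a utf-8 database would store."""
    return text.encode(db_encoding)


name = "R\u00e9union"          # "Réunion" as text
encoded = to_db_bytes(name)    # the é becomes the two bytes 0xC3 0xA9
print(repr(encoded))
```

Decoding the result with the same encoding gives back the original text, which is a quick way to check that nothing was lost.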

yml

Apr 17, 2006, 10:51:43 AM
to Django users
Hi Ivan,
I tried your recommendation but unfortunately it is not working. I am
getting the following error message:

In [38]: r=Department(department=matches[1].encode('utf-8'),
department_number=matches[0].encode('utf-8'), country="France")

---------------------------------------------------------------------------
exceptions.UnicodeDecodeError Traceback (most recent call last)

E:\instal\django\view_servicealapersonne\votreservice\_initialLoad\<console>

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1:
ordinal not in range(128)

I am using MySQL as the DB. I assumed that it supports Unicode, but
I am not sure. How can I make sure that it does?
Thank you

fre...@pythonware.com

Apr 17, 2006, 11:03:25 AM
to Django users
yml wrote:

> I try your recommandation but unfortunatly it is not working I am
> getting the following error message:
>
> In [38]: r=Department(department=matches[1].encode('utf-8'),
> department_number=matches[0].encode('utf-8'), country="France")
> ---------------------------------------------------------------------------
> exceptions.UnicodeDecodeError Traceback (most
> recent call last)
>
> E:\instal\django\view_servicealapersonne\votreservice\_initialLoad\<console>
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1:
> ordinal not in range(128)

that usually means that the data you're encoding is an 8-bit string
with non-ASCII characters in it, rather than a Unicode string.

make sure you're using the right source encoding (iso-8859-1), and that
you pass Unicode strings to the database.
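
A rough illustration of that diagnosis, written for Python 3 (an assumption; the thread itself is Python 2). In Python 2, calling .encode() on an 8-bit string makes Python decode it as ASCII first, which is where the 'ascii' codec in the traceback comes from. Python 3 has no such implicit step, so the sketch decodes explicitly to show the same failure and the fix.

```python
# Bytes as they would appear in an ISO-8859-1 file: 0xE9 is "é".
raw = b"R\xe9union"

# Decoding as ASCII fails on byte 0xE9, like the error in the traceback.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the encoding the data was actually written in succeeds.
text = raw.decode("iso-8859-1")
print(ascii_ok, text)
```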

(I haven't used mysql with Django, but I'm quite sure that the Django
ORM handles Unicode correctly).

</F>

Ivan Sagalaev

Apr 17, 2006, 2:30:33 PM
to django...@googlegroups.com
yml wrote:

>Ivan hi,
>I try your recommandation but unfortunatly it is not working I am
>getting the following error message:
>
>In [38]: r=Department(department=matches[1].encode('utf-8'),
>department_number=matches[0].encode('utf-8'), country="France")
>---------------------------------------------------------------------------
>exceptions.UnicodeDecodeError Traceback (most
>recent call last)
>
>E:\instal\django\view_servicealapersonne\votreservice\_initialLoad\<console>
>
>UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1:
>ordinal not in range(128)
>
>

Hm... Since you are doing this in the console, this may be caused by
double-encoding your input. Do you read the actual file or enter a test
string from the keyboard like this:

l = u'974 Réunion'

?

In this case you get byte-encoded symbols treated as unicode and
byte-encoded again (in short: a mess :-) )
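
To make the "mess" concrete, here is a small Python 3 sketch (an assumption, since the thread is about Python 2) of what double-encoding does to a single é. The variable names are only for illustration.

```python
original = "R\u00e9union"                    # "Réunion"

utf8_bytes = original.encode("utf-8")        # é -> 0xC3 0xA9
# The mistake: each byte is taken as one character instead of being decoded.
misread = utf8_bytes.decode("iso-8859-1")    # é shows up as two characters
double_encoded = misread.encode("utf-8")     # and is encoded a second time

# The damage can be undone by reversing the two steps in order.
recovered = double_encoded.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(repr(misread), repr(double_encoded), recovered)
```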

Ivan Sagalaev

Apr 17, 2006, 2:33:29 PM
to django...@googlegroups.com
fre...@pythonware.com wrote:

>(I haven't used mysql with Django, but I'm quite sure that the Django
>ORM handles Unicode correctly).
>
>

It depends very much on the definition of "correctly". Django's ORM
expects 8-bit (byte) strings, which for unicode means utf-8.

fre...@pythonware.com

Apr 17, 2006, 2:46:26 PM
to Django users
Ivan Sagalaev wrote:

> It depends very much on the definition of "correctly". Django's ORM
> expect single-byte strings which for unicode means utf-8.

>>> from django.models.page import pages
>>> p = pages.Page(path=u"/föö", source=u"/bär")
>>> p.save()

>>> q = pages.get_object(path__exact=u"/föö")
>>> q.path
u'/f\xf6\xf6'

hmm...

</F>

Ivan Sagalaev

Apr 17, 2006, 3:49:10 PM
to django...@googlegroups.com
fre...@pythonware.com wrote:

>>>>from django.models.page import pages
>>>>p = pages.Page(path=u"/föö", source=u"/bär")
>>>>p.save()
>>>>
>>>>

Interesting. I get a 'Segmentation fault' here. Both in m-r and trunk...

>>>>q = pages.get_object(path__exact=u"/föö")
>>>>q.path
>>>>
>>>>
>u'/f\xf6\xf6'
>
>

This is more interesting. I never get unicode strings from queries. This
may well depend on what the underlying db lib returns. I use psycopg and my
DB is in UTF-8. What is yours?

yml

Apr 18, 2006, 4:34:10 AM
to Django users
Hello, thank you for your help, but so far I have not had any success.
I am reading the lines from a file.

Here is the kind of error I am getting:

the current line is :07 ArdFche

matches ('07', 'Ard\xe8che\r\n')
Traceback (most recent call last):
  File "E:\instal\django\view_servicealapersonne\votreservice\_initialLoad\loader_departements.py", line 19, in ?
    r=Department(department=matches[1].encode('utf-8'),department_number=matches[0].encode('utf-8'),country="France")
  File "C:\Python24\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data

The word that is causing trouble is Ardèche. To parse my data
I am using the following script:

===============
My script to parse the data


=============
import os,codecs
from django.models.announceManager import *

os.chdir(os.path.abspath("E:\\instal\\django\\view_servicealapersonne\\votreservice\\_initialLoad"))
f = codecs.open("departments.txt",encoding='utf-8')

import re
regexobj = re.compile("([0-9]+)\s+([\w\s?]+)",re.UNICODE)

for l in f.xreadlines():
    print "the current line is :"+l
    try:
        matches = regexobj.search(l).groups()

    except:
        print "this is an empty line"
    print "matches " +str(matches)
    r=Department(department=matches[1].encode('utf-8'),department_number=matches[0].encode('utf-8'),country="France")
    r.save()
    print "ok"
    print r

f.close()


=================
my data
=================

01 Ain

02 Aisne

03 Allier

04 Alpes de Haute Provence

05 Hautes Alpes

06 Alpes Maritimes

07 Ardèche <============ this line crash

08 Ardennes

09 Ariège


thank you for your help

yml

Apr 18, 2006, 6:11:12 AM
to Django users
Here is the script that is working for me.
I got there thanks to this great page
(http://effbot.org/zone/unicode-objects.htm); it took me 3 days to find
it on the internet.
Thank you Google and, of course, thank you to the author :-)


# -*- coding: utf8 -*-


import os,codecs
from django.models.announceManager import *

import re


os.chdir(os.path.abspath("E:\\instal\\django\\view_servicealapersonne\\votreservice\\_initialLoad"))

f = open("departments.txt")


regexobj = re.compile("([0-9]+)\s+([\w\s?]+)",re.UNICODE)


for l in f.readlines():
    fileencoding = "iso-8859-1"
    txt = l.decode(fileencoding)
    print "the current line is :"+txt.encode(fileencoding, "replace")
    try:
        matches = regexobj.search(txt).groups()
    except:
        print "this is an empty line"
        # skip lines that do not match, so stale matches are not reused
        continue
    print "matches " +str(matches)
    r=Department(department=matches[1].encode('utf-8','replace'),department_number=matches[0].encode('utf-8'),country="France")

Gábor Farkas

Apr 18, 2006, 6:55:23 AM
to django...@googlegroups.com

> =Department(department=matches[1].encode('utf-8','replace')

hi,

you can remove that ['replace']. there shouldn't be any unicode
character that cannot be represented in utf-8, so the error-condition
for which you specify the behaviour is never going to happen.
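
A quick way to check this claim, sketched for Python 3 (an assumption; the thread is Python 2). One caveat I am adding myself: Python 3's strict utf-8 codec rejects lone surrogates (U+D800 to U+DFFF), which are not valid characters on their own, so the sketch leaves them out. The ASCII example at the end shows the kind of target encoding where 'replace' does make a difference.

```python
# Every Unicode scalar value: all code points except the surrogate range.
scalars = [cp for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF]
all_text = "".join(map(chr, scalars))

# Strict mode, no error handler: this does not raise for any of them.
encoded = all_text.encode("utf-8")
round_trip_ok = encoded.decode("utf-8") == all_text

# By contrast, a limited target encoding is where "replace" matters.
lossy = "Ard\u00e8che".encode("ascii", "replace")
print(round_trip_ok, lossy)
```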

gabor

Ivan Sagalaev

Apr 18, 2006, 8:00:38 AM
to django...@googlegroups.com
yml wrote:

>f = open("departments.txt")
>regexobj = re.compile("([0-9]+)\s+([\w\s?]+)",re.UNICODE)
>
>
>for l in f.readlines():
> fileencoding = "iso-8859-1"
>
>

Ah! I was just about to suggest this. Your file does not actually seem to
be in utf-8. And since you were opening it with codecs.open("...",
encoding='utf-8'), this was causing the error.
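
For reference, a minimal Python 3 sketch of the same situation (an assumption; the original script is Python 2). It builds a small ISO-8859-1 file standing in for departments.txt, shows that reading it as utf-8 fails, and parses it once the right encoding is given. `parse_departments` and `LINE_RE` are hypothetical names, and the pattern is a simplified stand-in for the one used in the thread.

```python
import os
import re
import tempfile

# Simplified pattern: number, whitespace, then the rest of the line as name.
LINE_RE = re.compile(r"([0-9]+)\s+(.+)")


def parse_departments(path, encoding):
    """Return (number, name) pairs from a file decoded with `encoding`."""
    rows = []
    with open(path, encoding=encoding) as f:
        for line in f:
            match = LINE_RE.match(line.strip())
            if match:                      # blank lines do not match
                rows.append(match.groups())
    return rows


# Build a small ISO-8859-1 sample file to stand in for departments.txt.
sample = "07 Ard\u00e8che\n\n08 Ardennes\n\n09 Ari\u00e8ge\n"
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("iso-8859-1"))

# Reading with the wrong encoding fails on the first accented byte.
try:
    parse_departments(path, "utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Reading with the encoding the file was written in works.
rows = parse_departments(path, "iso-8859-1")
os.remove(path)
print(utf8_failed, rows)
```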

yml

Apr 18, 2006, 12:10:25 PM
to Django users
Thank you Ivan,
Happy to see that in any case my problem would have been solved today.
;-)

cla...@gmail.com

Apr 18, 2006, 8:33:04 PM
to Django users
An easier way would be to use the codecs module to open the file for
reading or writing in a particular encoding (the system default
encoding is assumed when using open() or file(), which is apparently
different from iso-8859-1 in the case above). I have to deal with
encoded files quite frequently, so I have a class that allows me to
pass an optional encoding value when getting file objects. The
getFile() classmethod that returns a file object looks something like
this (don't forget to import codecs):

def getFile(self, filename, mode='r', enc=None):
    """Return a file object"""
    f = None
    if enc:
        try:
            f = codecs.open(filename, mode, enc)
        except:
            raise
    else:
        try:
            f = file(filename, mode)
        except:
            raise
    return f
getFile = classmethod(getFile)


You can also change the encoding used on an open file object. Here's
that method:

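def setFileEncoding(self, fileobj, enc):
    """Set a file obj's encoding"""
    # codecs.lookup(enc)[-1] is the codec's StreamWriter class.
    # Return the wrapped object; rebinding the local name alone has no effect.
    return codecs.lookup(enc)[-1](fileobj)
setFileEncoding = classmethod(setFileEncoding)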
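
For comparison, here is roughly how the same two helpers might look on Python 3 (an assumption; the methods above are Python 2). The built-in open() accepts an encoding argument there, and codecs.getreader() can wrap an already open binary stream for reading. `get_file` and `wrap_reader` are hypothetical names used only for this sketch.

```python
import codecs
import io
import os
import tempfile


def get_file(filename, mode="r", enc=None):
    """Open `filename`; with enc=None the platform default applies."""
    return open(filename, mode, encoding=enc)


def wrap_reader(binary_fileobj, enc):
    """Return a reader that decodes bytes from `binary_fileobj` as `enc`."""
    return codecs.getreader(enc)(binary_fileobj)


# Write text through get_file, then look at the raw bytes on disk.
fd, path = tempfile.mkstemp()
os.close(fd)
with get_file(path, "w", enc="iso-8859-1") as f:
    f.write("974 R\u00e9union")
with get_file(path, "rb") as f:
    on_disk = f.read()
os.remove(path)

# Wrap an in-memory binary stream and read decoded text from it.
raw = io.BytesIO("974 R\u00e9union\n".encode("iso-8859-1"))
line = wrap_reader(raw, "iso-8859-1").readline()
print(on_disk, repr(line))
```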
