I am trying to import non-ASCII data stored in a file. To do so, I am
using a regular expression to parse the lines of the file and
separate the "department number" from the department name. This is
working fine. My problem comes when I try to use Django's DB API to
save the data in the model described below.
The problem happens when matches[1] contains non-ASCII characters like
éàçàè...
r=Department(department=matches[1],department_number=matches[0],country="France")
r.save()
I guess that I should add something to my script,
loader_departments.py, in order to support this.
It would be great if someone could explain to me what.
Thank you for your help.
=========
models
=========
class Department(meta.Model):
    department = meta.CharField(maxlength=30)
    department_number = meta.IntegerField()
    country = meta.CharField(maxlength=30)

    def __repr__(self):
        return self.department + " " + str(self.department_number)

    class META:
        admin = meta.Admin(
            list_display = ('department', 'department_number', 'country'),
        )
The file where my data is stored looks like this:
==========
departments.txt
============
89 Yonne
90 Territoire de Belfort
91 Essone
92 Hauts de Seine
93 Seine Saint-Denis
94 Val de Marne
95 Val d'Oise
971 Guadeloupe
972 Martinique
973 Guyane
974 Réunion
=============
loader_departments.py
==============
import os,codecs
from django.models.announceManager import *
os.chdir(os.path.abspath("E:\\instal\\django\\view_servicealapersonne\\votreservice\\_initialLoad"))
f = codecs.open("departments.txt",encoding='utf-8')
import re
regexobj = re.compile("([0-9]+)\s+([\w\s?]+)",re.UNICODE)
for l in f.xreadlines():
    print "the current line is :"+l
    try:
        matches = regexobj.search(l).groups()
        print "matches " +str(matches)
        r=Department(department=matches[1],department_number=matches[0],country="France")
        r.save()
        print "ok"
    except:
        print "ko"
f.close()
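
Separately from the encoding question, the pattern above keeps the trailing
line ending in the second group and stops at characters such as "-" or "'"
("Seine Saint-Denis", "Val d'Oise"). A minimal sketch of an alternative,
assuming every data line has the form "<number> <name>". The names LINE_RE
and parse_department_line are hypothetical and not part of the original
script:

```python
import re

# Digits, whitespace, then the rest of the line without surrounding whitespace.
LINE_RE = re.compile(r"^\s*([0-9]+)\s+(.+?)\s*$", re.UNICODE)

def parse_department_line(line):
    """Return (number, name) for a text line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    number, name = match.groups()
    # The number stays a string so that a leading zero ("07") is not lost here.
    return number, name

print(parse_department_line(u"93 Seine Saint-Denis\r\n"))
```

Returning None for blank lines lets the caller skip them explicitly instead
of relying on a bare except.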
> I guess that I should add something to my script called
>loader_departments.py in order to support this.
>
>
Yes. You are getting the contents of your file as unicode, but Django's
ORM expects byte strings. So you have to encode them into whatever
encoding your DB uses right before you pass them to a model
constructor.
This:
>r=Department(department=matches[1],department_number=matches[0],country="France")
> r.save()
>
>
becomes this (assuming your DB is in utf-8):
    r=Department(
        department=matches[1].encode('utf-8'),
        department_number=matches[0].encode('utf-8'),
        country="France"
    )
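
The encoding step can be tried on its own, without Django. This is only a
sketch: to_db_bytes and db_encoding are hypothetical names, and UTF-8 is an
assumption about the database. It guards against calling .encode() on a
value that is already a byte string:

```python
TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3

def to_db_bytes(value, db_encoding="utf-8"):
    """Encode text to bytes; leave values that are already bytes untouched."""
    if isinstance(value, TEXT_TYPE):
        return value.encode(db_encoding)
    return value

name = to_db_bytes(u"R\xe9union")   # text, so it gets encoded
country = to_db_bytes(b"France")    # already bytes, returned as is
print(repr(name))
```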
Ivan hi,
I tried your recommendation but unfortunately it is not working. I am
getting the following error message:

In [38]: r=Department(department=matches[1].encode('utf-8'),
department_number=matches[0].encode('utf-8'), country="France")
---------------------------------------------------------------------------
exceptions.UnicodeDecodeError                Traceback (most recent call last)

E:\instal\django\view_servicealapersonne\votreservice\_initialLoad\<console>

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1:
ordinal not in range(128)

I am using MySQL as my DB. I assume that it supports Unicode, but
I am not sure. How can I make sure that it does?
Thank you
> I tried your recommendation but unfortunately it is not working. I am
> getting the following error message:
>
> In [38]: r=Department(department=matches[1].encode('utf-8'),
> department_number=matches[0].encode('utf-8'), country="France")
> ---------------------------------------------------------------------------
> exceptions.UnicodeDecodeError                Traceback (most recent call last)
>
> E:\instal\django\view_servicealapersonne\votreservice\_initialLoad\<console>
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1:
> ordinal not in range(128)
that usually means that the data you're encoding is an 8-bit string
with non-ASCII characters in it, rather than a Unicode string.
make sure you're using the right source encoding (iso-8859-1), and that
you pass Unicode strings to the database.
(I haven't used mysql with Django, but I'm quite sure that the Django
ORM handles Unicode correctly).
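
The error can be reproduced without Django. In Python 2, calling .encode()
on an 8-bit string first decodes it with the default 'ascii' codec. The
sketch below performs that hidden step explicitly, so that it also runs on
Python 3, and assumes the name arrived as ISO-8859-1 bytes:

```python
raw = b"R\xe9union"  # "Réunion" as ISO-8859-1 bytes, not a Unicode string

try:
    # The implicit step behind str.encode() on an 8-bit string in Python 2.
    raw.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)  # byte 0xe9 sits at index 1 of raw

# Decoding with the real source encoding first gives a Unicode string,
# which can then be encoded for storage if needed.
text = raw.decode("iso-8859-1")
utf8_bytes = text.encode("utf-8")
```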
</F>
>Ivan hi,
>I tried your recommendation but unfortunately it is not working. I am
>getting the following error message:
>
>In [38]: r=Department(department=matches[1].encode('utf-8'),
>department_number=matches[0].encode('utf-8'), country="France")
>---------------------------------------------------------------------------
>exceptions.UnicodeDecodeError                Traceback (most recent call last)
>
>E:\instal\django\view_servicealapersonne\votreservice\_initialLoad\<console>
>
>UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1:
>ordinal not in range(128)
>
>
Hm... Since you are doing this in the console, this may be caused by
double-encoding your input. Do you read the actual file, or do you enter
a test string from the keyboard like this:
    l = u'974 Réunion'
?
In this case you get byte-encoded characters treated as unicode and then
byte-encoded again (in short: a mess :-) )
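
What such a double encoding does to the data can be sketched as follows.
The encodings are assumptions (UTF-8 bytes misread as ISO-8859-1); which
ones are really involved depends on the console:

```python
text = u"R\xe9union"

encoded_once = text.encode("utf-8")           # correct UTF-8 bytes
misread = encoded_once.decode("iso-8859-1")   # bytes wrongly treated as Latin-1 text
encoded_twice = misread.encode("utf-8")       # encoded a second time

print(repr(encoded_once))
print(repr(encoded_twice))
```

The single "é" ends up as four bytes instead of two, which is why the
stored value no longer reads back as the original name.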
>(I haven't used mysql with Django, but I'm quite sure that the Django
>ORM handles Unicode correctly).
>
>
It depends very much on the definition of "correctly". Django's ORM
expects 8-bit (byte) strings, which for unicode means utf-8.
> It depends very much on the definition of "correctly". Django's ORM
> expects 8-bit (byte) strings, which for unicode means utf-8.
>>> from django.models.page import pages
>>> p = pages.Page(path=u"/föö", source=u"/bär")
>>> p.save()
>>> q = pages.get_object(path__exact=u"/föö")
>>> q.path
u'/f\xf6\xf6'
hmm...
</F>
>>>>from django.models.page import pages
>>>>p = pages.Page(path=u"/föö", source=u"/bär")
>>>>p.save()
>>>>
>>>>
Interesting. I get a 'Segmentation fault' here. Both in m-r and trunk...
>>>>q = pages.get_object(path__exact=u"/föö")
>>>>q.path
>>>>
>>>>
>u'/f\xf6\xf6'
>
>
This is more interesting. I never get unicode strings from queries. This
may well depend on what the underlying DB library returns. I use psycopg
and my DB is in UTF-8. What is yours?
Here is the kind of error I am getting:
the current line is :07 ArdFche
matches ('07', 'Ard\xe8che\r\n')
Traceback (most recent call last):
  File "E:\instal\django\view_servicealapersonne\votreservice\_initialLoad\loader_departements.py", line 19, in ?
    r =Department(department=matches[1].encode('utf-8'),department_number=matches[0].encode('utf-8'),country="France")
  File "C:\Python24\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
The word that is causing trouble is Ardèche. To parse my data
I am using the following script:
===============
My script to parse the data
=============
import os,codecs
from django.models.announceManager import *
os.chdir(os.path.abspath("E:\\instal\\django\\view_servicealapersonne\\votreservice\\_initialLoad"))
f = codecs.open("departments.txt",encoding='utf-8')
import re
regexobj = re.compile("([0-9]+)\s+([\w\s?]+)",re.UNICODE)
for l in f.xreadlines():
    print "the current line is :"+l
    try:
        matches = regexobj.search(l).groups()
    except:
        print "this is an empty line"
    print "matches " +str(matches)
    r =Department(department=matches[1].encode('utf-8'),department_number=matches[0].encode('utf-8'),country="France")
    r.save()
    print "ok"
    print r
f.close()
=================
my data
=================
01 Ain
02 Aisne
03 Allier
04 Alpes de Haute Provence
05 Hautes Alpes
06 Alpes Maritimes
07 Ardèche <============ this line crashes
08 Ardennes
09 Ariège
thank you for your help
# -*- coding: utf8 -*-
import os,codecs
from django.models.announceManager import *
import re
os.chdir(os.path.abspath("E:\\instal\\django\\view_servicealapersonne\\votreservice\\_initialLoad"))
f = open("departments.txt")
regexobj = re.compile("([0-9]+)\s+([\w\s?]+)",re.UNICODE)
for l in f.readlines():
    fileencoding = "iso-8859-1"
    txt = l.decode(fileencoding)
    print "the current line is :"+txt.encode(fileencoding, "replace")
    try:
        matches = regexobj.search(txt).groups()
    except:
        print "this is an empty line"
    print "matches " +str(matches)
    r =Department(department=matches[1].encode('utf-8','replace'),department_number=matches[0].encode('utf-8'),country="France")
hi,
you can remove that ['replace']. there shouldn't be any unicode
character that cannot be represented in utf-8, so the error-condition
for which you specify the behaviour is never going to happen.
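
A small sketch of the point, for ordinary characters such as the accented
letters in this data: with UTF-8 the 'replace' handler has nothing to do,
while with a narrower codec (ASCII is used here purely as a contrast) it
changes the result.

```python
name = u"Ard\xe8che"

strict_utf8 = name.encode("utf-8")
replace_utf8 = name.encode("utf-8", "replace")
print(strict_utf8 == replace_utf8)

# With a codec that cannot represent the character, the handler kicks in.
replace_ascii = name.encode("ascii", "replace")
print(repr(replace_ascii))
```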
gabor
>f = open("departments.txt")
>regexobj = re.compile("([0-9]+)\s+([\w\s?]+)",re.UNICODE)
>
>
>for l in f.readlines():
> fileencoding = "iso-8859-1"
>
>
Ah! I was just about to suggest this. Your file does not actually seem
to be in utf-8. Since you were opening it with codecs.open("...",
encoding='utf-8'), this was causing the error.
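
When the encoding of a file is in doubt, one option is to try UTF-8 first
and fall back to ISO-8859-1. This is only a sketch: decode_line is a
hypothetical helper, and the candidate list is an assumption about which
encodings are plausible for this file.

```python
def decode_line(raw, encodings=("utf-8", "iso-8859-1")):
    """Try each encoding in turn and return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the encodings fit: %r" % (encodings,))

latin1_line = b"07 Ard\xe8che\r\n"                  # as stored in the file
utf8_line = u"07 Ard\xe8che\r\n".encode("utf-8")    # the same text in UTF-8

print(decode_line(latin1_line)[1])
print(decode_line(utf8_line)[1])
```

ISO-8859-1 has to come last in the list because it accepts any byte
sequence, so it never raises and would otherwise mask real UTF-8 input.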
def getFile(self, filename, mode='r', enc=None):
    """Return a file object"""
    if enc:
        # decode/encode transparently with the given encoding
        return codecs.open(filename, mode, enc)
    return file(filename, mode)
getFile = classmethod(getFile)
You can also change the encoding used on an open file object. Here's
that method:
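def setFileEncoding(self, fileobj, enc):
    """Return fileobj wrapped in the stream writer for enc"""
    # codecs.lookup(enc)[-1] is the codec's StreamWriter class. The wrapper
    # has to be returned: rebinding the local name alone has no effect.
    return codecs.lookup(enc)[-1](fileobj)
setFileEncoding = classmethod(setFileEncoding)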
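
The method above picks the last item returned by codecs.lookup(), which is
the stream writer. For reading an already open file, a comparable sketch can
use codecs.getreader() instead. wrap_for_reading is a hypothetical name, and
the in-memory stream only stands in for a file opened in binary mode:

```python
import codecs
import io

def wrap_for_reading(fileobj, enc):
    """Return a reader that decodes the bytes of fileobj using enc."""
    return codecs.getreader(enc)(fileobj)

# Stand-in for open("departments.txt", "rb"), with ISO-8859-1 content.
raw_file = io.BytesIO(b"07 Ard\xe8che\n09 Ari\xe8ge\n")
reader = wrap_for_reading(raw_file, "iso-8859-1")

# Iterating over the reader yields decoded lines.
lines = [line.rstrip() for line in reader]
print(len(lines))
```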