ORM, Oracle and UTF-8 encoding problem.

Jani Tiainen

Jan 8, 2013, 10:34:51 AM
to django...@googlegroups.com
Hi,

I've been trying to save UTF-8 characters to an Oracle database without
success.

I've verified that the database is indeed UTF-8 capable.

I can insert UTF-8 characters directly using cx_Oracle.

But when I use the ORM, it mangles the characters.

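The model I use:

    from django.db import models

    class MyTest(models.Model):
        txt = models.CharField(max_length=128)

Test code:

    s = u'0 \u0442\u0435\u0441\u0442 test'

    i = MyTest()
    i.txt = s
    i.save()

    # Read the row back and show the raw value that was stored.
    i2 = MyTest.objects.get(id=i.id)
    print repr(i2.txt)

Output:

    u'0 \xbf\xbf\xbf\xbf test'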


So what happens here? It looks like Django mangles my unicode string at
some (unknown) point.

Additional note:

If I use cursor() from the Django connection object, the strings get
broken as well. So it must be the Django Oracle backend doing something
evil to me.

--
Jani Tiainen

- Well planned is half done and a half done has been sufficient before...

akaariai

Jan 8, 2013, 2:00:39 PM
to Django users
I added the following test case to Django's test suite, in
modeltests/basic/tests.py:

    def test_unicode(self):
        # Note: from __future__ import unicode_literals is in effect...
        a = Article.objects.create(
            headline='0 \u0442\u0435\u0441\u0442 test',
            pub_date=datetime.now())
        self.assertEqual(
            Article.objects.get(pk=a.pk).headline,
            '0 \u0442\u0435\u0441\u0442 test')

This does pass on Oracle when using Django's master branch, both with
Python 2.7 and 3.3.

Django's backend is doing all sorts of trickery behind the scenes to
get correct unicode handling. I am not sure where the problem is. What
Django version are you using?

- Anssi

Jani Tiainen

Jan 9, 2013, 1:56:49 AM
to django...@googlegroups.com
On 8.1.2013 21:00, akaariai wrote:
> Django's backend is doing all sorts of trickery behind the scenes to
> get correct unicode handling. I am not sure where the problem is. What
> Django version are you using?

Sorry about forgetting the version info. I tested with Django 1.3.1 and
1.4.1, and both showed the same behaviour.

And I know that there is quite a lot of trickery going on. I'll try to
figure out what causes the problem.

Jani Tiainen

Jan 9, 2013, 2:38:28 AM
to django...@googlegroups.com
Tested against latest master. Same behaviour.

In the Oracle backend's base.py there is the following piece of code:

    # Check whether cx_Oracle was compiled with the WITH_UNICODE option.  This will
    # also be True in Python 3.0.
    if int(Database.version.split('.', 1)[0]) >= 5 and not hasattr(Database, 'UNICODE'):
        convert_unicode = force_text
    else:
        convert_unicode = force_bytes

Which was added in <https://github.com/django/django/commit/dcf3be7a62>

The thing is that my cx_Oracle is version 5.1.2, and it does have the
cx_Oracle.UNICODE definition.

So Django uses smart_str / force_bytes.

If I remove that check and set convert_unicode to force_text /
force_unicode, everything works as expected.
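
For readers following along, here is a rough sketch of the difference between the two converters named above. The functions to_text and to_bytes are hypothetical, simplified stand-ins written for Python 3; they are not Django's actual force_text / force_bytes implementations, which handle more cases.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text (unicode) string, decoding bytes if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


def to_bytes(value, encoding="utf-8"):
    """Return value as a byte string, encoding text if needed."""
    if isinstance(value, bytes):
        return value
    return str(value).encode(encoding)


# The test string from the first message in this thread.
s = "0 \u0442\u0435\u0441\u0442 test"
print(repr(to_text(s)))
print(repr(to_bytes(s)))
```

With the text path the database driver receives a unicode string and does any encoding itself; with the bytes path it receives UTF-8 bytes and presumably has to be told how to interpret them, which is where the client configuration would come in.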

Jani Tiainen

Jan 9, 2013, 3:53:50 AM
to django...@googlegroups.com
OK, I found the source of the problem, but I don't know the solution.

I'm using Oracle client 10.2.0.3.0. It seems that unicode doesn't work
there.

I compiled cx_Oracle against the 11g Instant Client 11.2 and it worked
just fine.

So it must be something that Django assumes about Oracle and its unicode
capability.

cx_Oracle.UNICODE, which the code checks for, was defined in every case.
I don't really know why.



Ian

Jan 9, 2013, 5:28:12 AM
to django...@googlegroups.com
On Wednesday, January 9, 2013 12:38:28 AM UTC-7, Jani Tiainen wrote:
> Tested against latest master. Same behaviour.
>
> In Oracle backend base.py is following piece of code:
>
> # Check whether cx_Oracle was compiled with the WITH_UNICODE option.  This will
> # also be True in Python 3.0.
> if int(Database.version.split('.', 1)[0]) >= 5 and not hasattr(Database, 'UNICODE'):
>     convert_unicode = force_text
> else:
>     convert_unicode = force_bytes
>
> Which was added in <https://github.com/django/django/commit/dcf3be7a62>
>
> Thing is that my cx_Oracle is version 5.1.2, it has cx_Oracle.UNICODE
> definition.

That sounds correct. The cx_Oracle.UNICODE type constant is present when
cx_Oracle is compiled *without* the WITH_UNICODE option (which no longer
exists in 5.1 anyway).

> And Django uses smart_str / force_bytes.
>
> If I remove that and use convert_unicode as force_text / force_unicode
> everything works as expected.

Strange, in 5.1 it shouldn't make any difference which is used, as long
as your NLS_LANG is getting set properly in the backend. What is your
server setup? It seems that sometimes that can get interfered with if
you have other services using Oracle in the same process. It shouldn't
hurt anything, though, for us to do an additional check for cx_Oracle
5.1+ and always use force_text in that case.
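
The extra check suggested above could look roughly like the following. This is only a sketch: pick_converter and its parameters are hypothetical names, it takes the version string and the presence of the UNICODE constant as plain arguments instead of inspecting the cx_Oracle module, and it returns a label rather than the real force_text / force_bytes functions. It also assumes a purely numeric major.minor version string.

```python
def pick_converter(version, has_unicode_constant):
    """Return 'force_text' or 'force_bytes' for a given cx_Oracle build.

    version: the cx_Oracle version string, e.g. '5.1.2'.
    has_unicode_constant: whether the module exposes cx_Oracle.UNICODE.
    """
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0

    # Proposed addition: from 5.1 on, always pass text strings.
    if (major, minor) >= (5, 1):
        return "force_text"
    # Existing check: a 5.x build without the UNICODE constant was
    # compiled WITH_UNICODE and accepts only text strings.
    if major >= 5 and not has_unicode_constant:
        return "force_text"
    # Everything else gets byte strings, as before.
    return "force_bytes"


print(pick_converter("5.1.2", True))
print(pick_converter("5.0.4", True))
```

Under this sketch a 5.1.2 build would get text strings even though it exposes the UNICODE constant, while a 5.0.4 build keeps the behaviour of the existing check.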

Jani Tiainen

Jan 9, 2013, 5:55:07 AM
to django...@googlegroups.com
Server is running Oracle Database 10g Release 10.2.0.5.0 - 64bit
Production. (EE edition)

and charset info:
NLS_CHARACTERSET WE8ISO8859P1
NLS_NCHAR_CHARACTERSET AL16UTF16

When cx_Oracle (version 5.1.2) is compiled against the 10.2.0.3 client:

I can insert unicode characters directly using cx_Oracle.
I can't insert unicode characters using the ORM.
I can't insert unicode characters using Django's connection.cursor().

When cx_Oracle is compiled against Instant Client 11.2 (multinational
version), I can do all of the above without problems.
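
One thing the charset info makes easy to check is that WE8ISO8859P1 (ISO 8859-1) has no Cyrillic letters, so any conversion into that character set has to substitute something. The sketch below only reproduces the symptom from the first message; it assumes the substitute is U+00BF, as seen in the output there, and to_latin1_lossy is a hypothetical helper. It says nothing about where in the client or server such a conversion actually happens.

```python
def to_latin1_lossy(text, substitute="\u00bf"):
    """Replace every character that ISO 8859-1 cannot encode."""
    out = []
    for ch in text:
        try:
            ch.encode("iso-8859-1")
            out.append(ch)
        except UnicodeEncodeError:
            # Assumed substitute character, taken from the reported output.
            out.append(substitute)
    return "".join(out)


# The test string from the first message in this thread.
original = "0 \u0442\u0435\u0441\u0442 test"
print(repr(to_latin1_lossy(original)))
```

The digits, spaces and Latin letters survive, and only the four Cyrillic letters are replaced, which matches the pattern of the broken value read back through the ORM.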

Ian Kelly

Jan 9, 2013, 12:21:02 PM
to django...@googlegroups.com
On Wed, Jan 9, 2013 at 3:55 AM, Jani Tiainen <red...@gmail.com> wrote:
> Server is running Oracle Database 10g Release 10.2.0.5.0 - 64bit Production.
> (EE edition)
>
> and charset info:
> NLS_CHARACTERSET WE8ISO8859P1
> NLS_NCHAR_CHARACTERSET AL16UTF16

Sorry, I meant your web server setup.

Jani Tiainen

Jan 10, 2013, 1:40:42 AM
to django...@googlegroups.com
Windows 7, development server.

Staging server: Ubuntu something (probably 10.04 LTS), 64-bit.

And the symptoms were consistent. For some reason Django does something
bad when it uses smart_str (or whatever its equivalent is in 1.5).

If we just force the use of force_unicode, everything works, except that
with older versions of cx_Oracle (our server had 5.0.4 or something)
connection strings can't be unicode for some reason.

Ian Kelly

Jan 10, 2013, 1:59:40 AM
to django...@googlegroups.com
On Wed, Jan 9, 2013 at 11:40 PM, Jani Tiainen <red...@gmail.com> wrote:
> If we just force using force_unicode everything works except in older
> versions of cx_Oracle (our server had 5.0.4 or something) connection strings
> can't be unicode for some reason.

Sure, that's why the check exists in the first place. Prior to 5.1
cx_Oracle could be built either with Unicode or without. If the
former, it would accept only unicode strings and would raise an
exception on byte strings. If the latter, it would be exactly the
opposite.

Does it work for you using force_bytes with 5.0.4?

Jani Tiainen

Jan 10, 2013, 3:14:10 AM
to django...@googlegroups.com
That's on my production server, which runs Django 1.3.x. smart_str
(which the detection selects) does not work.

Using force_unicode works (except for the connection string).

The result also varies depending on which client is used when compiling
cx_Oracle, OCI client 10.2.0.5 or Instant Client 11.2: 10.2.0.5 doesn't
work with smart_str, while 11.2 does.

Both can take plain unicode (u'<some unicode stuff here>') without any
problems when using just cx_Oracle commands.

Note:

If I manually add some unicode to the database, Django can read it
without any problems.