Inserting Chinese characters in Oracle database

470 views
Skip to first unread message

Stefan Schwarzer

unread,
Jan 12, 2018, 7:14:35 AM1/12/18
to sqlalchemy
I have trouble inserting Chinese characters into an Oracle database with SQLAlchemy (and to some extent with cx_Oracle). I don't specifically need Chinese characters but I used them in a unit test to see if my code (hopefully) can handle "any" unicode characters.

The setup:

- SQLAlchemy 1.1.15 and 1.2
- Oracle database 11.2
- cx_Oracle 6.1
- environment variable set from the shell ( export NLS_LANG="German_Germany.UTF8" ) and in the code

Database table:

CREATE TABLE UTF8_TEST (
  TEXT NVARCHAR2(10)
)



1) SQLAlchemy, version 1.1.15 and 1.2

import os
os.environ["NLS_LANG"] = "German_Germany.UTF8"

import sqlalchemy as sa


metadata = sa.MetaData()

test_table = sa.Table("utf8_test", metadata, sa.Column("text", sa.Unicode(10)))


def test(host, port, service_name, user, password):
    connect_string = "oracle+cx_oracle://{}:{}@{}:{}/?service_name={}".format(user, password, host, port, service_name)
    engine = sa.engine.create_engine(connect_string, encoding="utf8")
    insert = test_table.insert().values(text="\u7535\u8111")
    with engine.connect() as connection:
        connection.execute(insert)


if __name__ == "__main__":
    db_config = ...
    test(db_config.host, db_config.port, db_config.service_name, db_config.user, db_config.password)


When this code is executed, it inserts something into the database, but it shows up as two upside-down question marks when viewed with Oracle SQL Developer. This doesn't seem to be an encoding/rendering problem in SQL Developer (see below for code snippet 3).


2) cx_Oracle, without `Cursor.setinputsizes`

import os
os.environ["NLS_LANG"] = "German_Germany.UTF8"

import cx_Oracle


def test(host, port, service_name, user, password):
    dsn = cx_Oracle.makedsn(host, port, service_name=service_name)
    conn = cx_Oracle.connect(user, password, dsn=dsn, encoding="utf8", nencoding="utf8")
    cur = conn.cursor()
    cur.execute("INSERT INTO UTF8_TEST (TEXT) VALUES (:t)", {"t": "\u7535\u8111"})
    cur.close()
    conn.commit()


if __name__ == "__main__":
    db_config = ...
    test(db_config.host, db_config.port, db_config.service_name, db_config.user, db_config.password)


This code has the same effect as the SQLAlchemy code above.


3) cx_Oracle with `Cursor.setinputsizes`

import os
os.environ["NLS_LANG"] = "German_Germany.UTF8"

import cx_Oracle


def test(host, port, service_name, user, password):
    dsn = cx_Oracle.makedsn(host, port, service_name=service_name)
    conn = cx_Oracle.connect(user, password, dsn=dsn, encoding="utf8", nencoding="utf8")
    cur = conn.cursor()
    # Using `setinputsizes` makes the difference.
    cur.execute("SELECT TEXT FROM UTF8_TEST")
    description = cur.description
    # `t` is name of bind parameter below.
    cur.setinputsizes(t=description[0][1])

    cur.execute("INSERT INTO UTF8_TEST (TEXT) VALUES (:t)", {"t": "\u7535\u8111"})
    cur.close()
    conn.commit()


if __name__ == "__main__":
    db_config = ...
    test(db_config.host, db_config.port, db_config.service_name, db_config.user, db_config.password)


With the `setinputsizes` call, the Chinese characters appear in Oracle SQL Developer as they should.


In case it matters, I tried to use some German special characters (the string "äß") and this works for all three variants. These two characters are encoded in UTF-8 to one byte each whereas the Chinese characters are encoded to several bytes each.


Should the above code for SQLAlchemy work as-is and insert the Chinese charaters (I expected it)?

Is there a workaround for now to have SQLAlchemy insert the correct characters? For example, is there a way to tell SQLAlchemy to use `Cursor.setinputsizes` for inserts? (I assume this would apply to updates, too.)

Best regards,
Stefan

Mike Bayer

unread,
Jan 12, 2018, 10:33:21 AM1/12/18
to sqlal...@googlegroups.com
On Fri, Jan 12, 2018 at 7:14 AM, Stefan Schwarzer
<sschw...@sschwarzer.net> wrote:
> I have trouble inserting Chinese characters into an Oracle database with
> SQLAlchemy (and to some extent with cx_Oracle). I don't specifically need
> Chinese characters but I used them in a unit test to see if my code
> (hopefully) can handle "any" unicode characters.

I can't reproduce your result, I'm not using any SQL tool I am instead
just getting the same data back and asserting it matches. Per the
author of cx_Oracle the setinputsizes call is no longer needed in
most cases in the 6.x series, but also in the 1.x series, the
autosetinputsizes call for UNICODE and STRING was omitted because I
still got poorer results. Round-tripping of unicode characters is
widely tested with cx_Oracle / SQLAlchemy.

In SQLAlchemy 1.1 series and earlier, you can specify
exclude_setinputsizes=() to have STRING be part of the automatic
setinputsizes call. In SQLAlchemy 1.2 these features were all removed
as there was never any reason to pass most datatypes to setinputsizes.
in 1.2 you can still say
engine.dialect._include_setinputsizes.add(cx_Oracle.STRING) to re-add
it, but this should not be necessary (my test works with or without
it).

Also which version of Python you're using matters, however I get a
round trip under both pythons.

To round trip it, do this - note I had to expand the size of the
VARCHAR to fit your string, it was giving me a "data too large" error
before, so that might be a clue:

import os
os.environ["NLS_LANG"] = "German_Germany.UTF8"

import sqlalchemy as sa


metadata = sa.MetaData()

test_table = sa.Table("utf8_test", metadata, sa.Column("text", sa.Unicode(20)))

e = sa.create_engine("oracle+cx_oracle://scott:tiger@oracle1120/xe",
echo='debug')

metadata.drop_all(e)
metadata.create_all(e)

data = "\u7535\u8111"
insert = test_table.insert().values(text=data)
with e.connect() as connection:
connection.execute(insert)

sqla_result = connection.scalar(sa.select([test_table.c.text]))

assert sqla_result == data

please try that out and see if you aren't getting a true result.
note also that the table is being dropped and created within the
script itself.

when I run in Python 3 I even see the correct characters displayed:

2018-01-12 10:21:37,658 INFO sqlalchemy.engine.base.Engine INSERT INTO
utf8_test (text) VALUES (:text)
2018-01-12 10:21:37,658 INFO sqlalchemy.engine.base.Engine {'text': '电脑'}
2018-01-12 10:21:37,663 INFO sqlalchemy.engine.base.Engine COMMIT
2018-01-12 10:21:37,668 INFO sqlalchemy.engine.base.Engine SELECT utf8_test.text
FROM utf8_test
2018-01-12 10:21:37,668 INFO sqlalchemy.engine.base.Engine {}
2018-01-12 10:21:37,672 DEBUG sqlalchemy.engine.base.Engine Col ('TEXT',)
2018-01-12 10:21:37,672 DEBUG sqlalchemy.engine.base.Engine Row ('电脑',)

The next step would be to raise this on cx_Oracle's tracker
(https://github.com/oracle/python-cx_Oracle/issues) to understand why
your setup needs setinputsizes to be called and mine doesn't (and has
not for years).
> --
> SQLAlchemy -
> The Python SQL Toolkit and Object Relational Mapper
>
> http://www.sqlalchemy.org/
>
> To post example code, please provide an MCVE: Minimal, Complete, and
> Verifiable Example. See http://stackoverflow.com/help/mcve for a full
> description.
> ---
> You received this message because you are subscribed to the Google Groups
> "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to sqlalchemy+...@googlegroups.com.
> To post to this group, send email to sqlal...@googlegroups.com.
> Visit this group at https://groups.google.com/group/sqlalchemy.
> For more options, visit https://groups.google.com/d/optout.

Mike Bayer

unread,
Jan 12, 2018, 10:34:37 AM1/12/18
to sqlal...@googlegroups.com
here's "data too large" for the size of 10:

sqlalchemy.exc.DatabaseError: (cx_Oracle.DatabaseError) ORA-12899:
value too large for column "SCOTT"."UTF8_TEST"."TEXT" (actual: 12,
maximum: 10) [SQL: u'INSERT INTO utf8_test (text) VALUES (:text)']
[parameters: {'text': '\\u7535\\u8111'}] (Background on this error at:
http://sqlalche.me/e/4xp6)

Stefan Schwarzer

unread,
Jan 12, 2018, 2:23:40 PM1/12/18
to sqlal...@googlegroups.com
Mike, many thanks for your feedback!

On 2018-01-12 16:33, Mike Bayer wrote:
> On Fri, Jan 12, 2018 at 7:14 AM, Stefan Schwarzer
> <sschw...@sschwarzer.net> wrote:
>> I have trouble inserting Chinese characters into an Oracle database with
>> SQLAlchemy (and to some extent with cx_Oracle). I don't specifically need
>> Chinese characters but I used them in a unit test to see if my code
>> (hopefully) can handle "any" unicode characters.
>
> I can't reproduce your result, I'm not using any SQL tool I am instead
> just getting the same data back and asserting it matches.

I'm doing the same in the unit test. I saw different characters
from what I had supposedly inserted and checked with the SQL tool
to get a clue whether the insert or the select was the problem.

> Per the
> author of cx_Oracle the setinputsizes call is no longer needed in
> most cases in the 6.x series, but also in the 1.x series, the
> autosetinputsizes call for UNICODE and STRING was omitted because I
> still got poorer results. Round-tripping of unicode characters is
> widely tested with cx_Oracle / SQLAlchemy.
>
> In SQLAlchemy 1.1 series and earlier, you can specify
> exclude_setinputsizes=() to have STRING be part of the automatic
> setinputsizes call.

At the moment, I'm using SQLAlchemy 1.1.15 at the moment due to
the fractional-second problem (described in my other mail), but
would like to switch to SQLAlchemy 1.2.x when the fractional-second
problem is fixed there.

> In SQLAlchemy 1.2 these features were all removed
> as there was never any reason to pass most datatypes to setinputsizes.
> in 1.2 you can still say
> engine.dialect._include_setinputsizes.add(cx_Oracle.STRING) to re-add
> it, but this should not be necessary (my test works with or without
> it).
>
> Also which version of Python you're using matters, however I get a
> round trip under both pythons.

Sorry, I forgot to mention this. The Python version was 3.6.3,
if I remember correctly.

> To round trip it, do this - note I had to expand the size of the
> VARCHAR to fit your string, it was giving me a "data too large" error
> before, so that might be a clue:

I'll try your script next week when I have access to the database
system.

Best regards,
Stefan

Stefan Schwarzer

unread,
Jan 15, 2018, 3:18:58 PM1/15/18
to sqlal...@googlegroups.com
On 2018-01-12 16:33, Mike Bayer wrote:> On Fri, Jan 12, 2018 at 7:14 AM, Stefan Schwarzer
> <sschw...@sschwarzer.net> wrote:
> In SQLAlchemy 1.1 series and earlier, you can specify
> exclude_setinputsizes=() to have STRING be part of the automatic
> setinputsizes call. In SQLAlchemy 1.2 these features were all removed
> as there was never any reason to pass most datatypes to setinputsizes.
> in 1.2 you can still say
> engine.dialect._include_setinputsizes.add(cx_Oracle.STRING) to re-add
> it, but this should not be necessary (my test works with or without
> it).
>
> Also which version of Python you're using matters, however I get a
> round trip under both pythons.
>
> To round trip it, do this - note I had to expand the size of the
> VARCHAR to fit your string, it was giving me a "data too large" error
> before, so that might be a clue:
> [...]

I tried your example and got the same mismatch between
original and actual value as before. Also your workaround
that I tried for SQLAlchemy 1.2 didn't work for me. I
haven't tried the workaround for SQLAlchemy 1.1.15. As
described below, I (partially) found out why the workaround
for version 1.2 didn't work.

Since your code worked for you, but not for me, I concluded
that probably the setup of your database is different from
"mine" (I don't control the server). I searched for
information on encodings in Oracle and found this document:

https://docs.oracle.com/en/database/oracle/oracle-database/12.2/nlspg/supporting-multilingual-databases-with-unicode.html#GUID-AA09A60E-123E-457C-ACE1-89E4634E492C

So far I had assumed that if I use `NVARCHAR2`, the column
data would be "unicode" and the Oracle driver would do all
the necessary conversions.

However, with `SELECT * FROM V$NLS_PARAMETERS` I got

NLS_CHARACTERSET WE8MSWIN1252
NLS_NCHAR_CHARACTERSET AL16UTF16

What are the values for your database?

As far as I understand the above Oracle document, the first
encoding, `NLS_CHARACTERSET`, is used for the names of SQL
entity names like table and column names, and also for
columns defined with `VARCHAR2`, `CLOB` etc. (character data
types without the `N` prefix). The second encoding,
`NLS_NCHAR_CHARACTERSET` is used for `NVARCHAR2`, `NCLOB`
etc. (character data types with the `N` prefix).

According to the document, Oracle nowadays recommends
`AL32UTF8` for the database encoding and not using
`NVARCHAR2` etc. for columns because `AL32UTF8` for
`VARCHAR2` can generally encode unicode.

I assume that I won't be able to have the above database
encodings changed. In this case, what do you recommend when
using the database with SQLAlchemy? As far as I understand,
if I use `VARCHAR` or `VARCHAR2` (as other tables in the
database so far), I won't be able to store anything that
doesn't belong in CP1252. If I use `NVARCHAR2`, I _should_
be able to use Chinese characters.

Why not? I changed your suggested code for SQLAlchemy 1.2
to include

e.dialect._include_setinputsizes.add(cx_Oracle.STRING)
e.dialect._include_setinputsizes.add(cx_Oracle.NCHAR)
e.dialect._include_setinputsizes.add(cx_Oracle.NCLOB)

but still inserting the Chinese characters failed.

With some debugging, I noticed that in
`engine.default.DefaultExecutionContext.set_input_sizes`
before the call to `cursor.setinputsizes`, the `text` column
has a corresponding `cx_Oracle.STRING` in the `inputsizes`
dictionary. However, the type of `text` (an `NVARCHAR2`
column) should be `cx_Oracle.NCHAR`. This is also what I get
in the cursor description after the `SELECT` invocation
before the `cursor.setinputsizes` in my working pure-cx_Oracle
example in my original posting. If I change `cx_Oracle.STRING`
to `cx_Oracle.NCHAR` in
`DefaultExecutionContext.set_input_sizes` in the debugger,
the Chinese characters appear in the database as expected.

See also
https://github.com/oracle/python-cx_Oracle/issues/119

If you don't have a suspicion why the wrong type for
the `NVARCHAR2` column is used, I could try to do more
debugging and hopefully find out why.

What do you think?

Best regards,
Stefan

Mike Bayer

unread,
Jan 15, 2018, 5:17:36 PM1/15/18
to sqlal...@googlegroups.com
the issue you refer to with Anthony refers to a new use case for the
cx_Oracle DBAPI, where we use setinputsizes() again and use a
different datatype for strings. So that would need to be added to
SQLAlchemy's cx_Oracle dialect as an option, which can either be on a
per-datatype basis or engine-wide (engine-wide is easier to implement
and use however I'm not sure which is more appropriate).

You can probably get this to work right now doing this:

from sqlalchemy.dialects.oracle import _OracleString
class NCHAR(_OracleNVarChar):
def get_dbapi_type(self, dbapi):
return cx_Oracle.NCHAR

and then add cx_Oracle.NCHAR to the dialect._include_setinputsizes list.

I've proposed a few ways we might be able to add this API here:

https://bitbucket.org/zzzeek/sqlalchemy/issues/4163/cx_oracle-requires-method-to-force-use-of





>
> What do you think?
>
> Best regards,
> Stefan
>

Stefan Schwarzer

unread,
Jan 16, 2018, 8:50:59 AM1/16/18
to sqlal...@googlegroups.com
Hi Mike,

thanks again!

On 2018-01-15 23:17, Mike Bayer wrote:
> On Mon, Jan 15, 2018 at 3:18 PM, Stefan Schwarzer
> <sschw...@sschwarzer.net> wrote:
>> On 2018-01-12 16:33, Mike Bayer wrote:> On Fri, Jan 12, 2018 at 7:14 AM, Stefan Schwarzer
>>> <sschw...@sschwarzer.net> wrote:
> the issue you refer to with Anthony refers to a new use case for the
> cx_Oracle DBAPI, where we use setinputsizes() again and use a
> different datatype for strings. So that would need to be added to
> SQLAlchemy's cx_Oracle dialect as an option, which can either be on a
> per-datatype basis or engine-wide (engine-wide is easier to implement
> and use however I'm not sure which is more appropriate).
>
> You can probably get this to work right now doing this:
>
> from sqlalchemy.dialects.oracle import _OracleString
> class NCHAR(_OracleNVarChar):
> def get_dbapi_type(self, dbapi):
> return cx_Oracle.NCHAR
>
> and then add cx_Oracle.NCHAR to the dialect._include_setinputsizes list.

The following works for me:


import sqlalchemy.types
import sqlalchemy.dialects.oracle.cx_oracle as cx_oracle_dialect


class OracleNVarChar(cx_oracle_dialect._OracleNVarChar):
def get_dbapi_type(self, dbapi):
return cx_Oracle.NCHAR

cx_oracle_dialect.OracleDialect_cx_oracle.colspecs[sqlalchemy.types.Unicode] = OracleNVarChar


metadata = sa.MetaData()

test_table = sa.Table("utf8_test", metadata, sa.Column("text", sa.Unicode(20)))

connect_string = "oracle+cx_oracle://..."
engine = sa.create_engine(connect_string, echo='debug')
engine.dialect._include_setinputsizes.add(cx_Oracle.NCHAR)


Changes to your version:

- Changed import. I wasn't sure whether your import or
the base class was wrong.

- Added assignment to `colspecs` dictionary

> I've proposed a few ways we might be able to add this API here:
>
> https://bitbucket.org/zzzeek/sqlalchemy/issues/4163/cx_oracle-requires-method-to-force-use-of

I added a comment to this ticket.

Best regards,
Stefan
Reply all
Reply to author
Forward
0 new messages