WE8ISO8859P1 convert to AL32UTF8 unicode character set question

lsllcm

unread,

Apr 6, 2009, 8:45:23 AM4/6/09

to

Hi Friends,

I have one question about CSSCAN when I do conversion from
WE8ISO8859P1 to AL32UTF8 unicode character set.

create table aaa (c1 varchar2(10));
insert into aaa values ('sys.…Med');

After csscan the data, get the application exception data as below

---------------------------------------------------------------------------------------------------------
[Application data individual exceptions]
User : JACKY
Table : AAA
Column: C1
Type : VARCHAR2(10)
Number of Exceptions : 1
Max Post Conversion Data Size: 9
ROWID Exception Type Size Cell Data(first 30 bytes)
------------------ ------------------ -----
------------------------------
AAAYtgAAEAAAEKPAAA lossy conversion sys.…Med
------------------ ------------------ -----
------------------------------

But after use convert method and check it, the data can be converted
successfully after use below sql

select case when convert(convert(C1, 'AL32UTF8',
'WE8ISO8859P1') ,'WE8ISO8859P1', 'AL32UTF8') = C1 then 0 else
-999999+111111 END from aaa;

Is report of cssacn not correct?

Thanks
Jacky

lsllcm

unread,

Apr 6, 2009, 9:30:32 PM4/6/09

to

On Apr 6, 8:45 pm, lsllcm <lsl...@gmail.com> wrote:
> Hi Friends,
>
> I have one question about CSSCAN when I do conversion from
> WE8ISO8859P1 to AL32UTF8 unicode character set.
>
> create table aaa (c1 varchar2(10));

> insert into aaa values ('sys....Med');

>
> After csscan the data, get the application exception data as below
>

> ----------------------------------------------------------------------------------------------------------

> [Application data individual exceptions]
> User : JACKY
> Table : AAA
> Column: C1
> Type : VARCHAR2(10)
> Number of Exceptions : 1
> Max Post Conversion Data Size: 9
> ROWID Exception Type Size Cell Data(first 30 bytes)
> ------------------ ------------------ -----
> ------------------------------

> AAAYtgAAEAAAEKPAAA lossy conversion sys....Med

> ------------------ ------------------ -----
> ------------------------------
>
> But after use convert method and check it, the data can be converted
> successfully after use below sql
>
> select case when convert(convert(C1, 'AL32UTF8',
> 'WE8ISO8859P1') ,'WE8ISO8859P1', 'AL32UTF8') = C1 then 0 else
> -999999+111111 END from aaa;
>
> Is report of cssacn not correct?
>
> Thanks
> Jacky

The string is "sys.Â Med"

Thanks

lsllcm

unread,

Apr 6, 2009, 11:35:54 PM4/6/09

to

I have one question about CSSCAN when I do conversion from
WE8ISO8859P1 to AL32UTF8 unicode character set.

create table aaa (c1 varchar2(10));

insert into aaa values ('sys.…Med');

After csscan the data, get the application exception data as below

------------------------------------------------

[Application data individual exceptions]
User : JACKY
Table : AAA
Column: C1
Type : VARCHAR2(10)
Number of Exceptions : 1
Max Post Conversion Data Size: 9
ROWID Exception Type Size Cell Data(first 30 bytes)
------------------ ------------------ -----
------------------------------

AAAYtgAAEAAAEKPAAA lossy conversion sys.…Med
------------------ ------------------ -----
------------------------------

But after use convert method and check it, the data can be converted
successfully after use below sql

select case when convert(convert(C1, 'AL32UTF8',
'WE8ISO8859P1') ,'WE8ISO8859P1', 'AL32UTF8') = C1 then 0 else
-999999+111111 END from aaa;

Below is dump of data

1. dump in AL32UTF8 character set
select dump(convert(c1,'AL32UTF8')) from aaa
DUMP(CONVERT(C1,'AL32UTF8'))
----------------------------------------------------
Typ=1 Len=9:
115,121,115,46,194,133,77,101,100
2. dump in WE8ISO8859P1 character set
select dump(c1) from aaa
DUMP(C1)
----------------------------------------------------
Typ=1 Len=8:
115,121,115,46,133,77,101,100

Laurenz Albe

unread,

Apr 7, 2009, 4:34:49 AM4/7/09

to

lsllcm wrote:
> I have one question about CSSCAN when I do conversion from
> WE8ISO8859P1 to AL32UTF8 unicode character set.
>
> create table aaa (c1 varchar2(10));

> insert into aaa values ('sys..Med');

>
> After csscan the data, get the application exception data as below
>
> ------------------------------------------------
> [Application data individual exceptions]
> User : JACKY
> Table : AAA
> Column: C1
> Type : VARCHAR2(10)
> Number of Exceptions : 1
> Max Post Conversion Data Size: 9
> ROWID Exception Type Size Cell Data(first 30 bytes)
> ------------------ ------------------ -----------------------------------

>AAAYtgAAEAAAEKPAAA lossy conversion sys..Med
> ------------------ ------------------ -----------------------------------
>
[...]

>
> Below is dump of data
>
> 1. dump in AL32UTF8 character set
> select dump(convert(c1,'AL32UTF8')) from aaa
> DUMP(CONVERT(C1,'AL32UTF8'))
> ----------------------------------------------------
> Typ=1 Len=9:
> 115,121,115,46,194,133,77,101,100
> 2. dump in WE8ISO8859P1 character set
> select dump(c1) from aaa
> DUMP(C1)
> ----------------------------------------------------
> Typ=1 Len=8:
> 115,121,115,46,133,77,101,100

> Is report of cssacn not correct?

The report of csscan is correct.

The problem is the fifth character. In the insert statement above it looks
like an ellipsis (three dots), and it ends up as 133 or hex 85 in your
database, so I assume that you mean UNICODE hex 2026, the horizontal
ellipsis, which is hex 85 in the Windows character set WE8MSWIN1252.

The problem is that your database character set is not WE8MSWIN1252,
where everything would be fine, but WE8ISO8859P1, where hex 85
does not exist.

So csscan warns you that if you convert this to AL32UTF8, hex 85,
which is an illegal value that has no meaning in WE8ISO8859P1,
will be converted to the default replacement character ¿

Your mistake is probably a wrong client character set.
Your NLS_LANG contains WE8ISO8859P1, and your client operating
system is Windows, correct?
That would mean that you are suffering from an Oracle "bug", namely
that character input is not checked for validity when client character set
and server character set are the same.
That way you end up with currupt data in your database.

Yours,
Laurenz Albe

lsllcm

unread,

Apr 7, 2009, 10:19:26 PM4/7/09

to

> Laurenz Albe- Hide quoted text -
>
> - Show quoted text -

Thanks for your comments:

Below is my test case

1. in WE8ISO8859P1 characterset db on linux platform,
1.1 create table aaa (c1 varchar2(10));
insert into aaa values ('sys.…Med');

1.2 select dump(c1) from aaa

DUMP(C1)
----------------------------------------------------
Typ=1 Len=8:
115,121,115,46,133,77,101,100

1.3 exp userid=jacky/jacky file=aaa.dmp tables=aaa

2. in AL32UTF characterset db on another linux platform
2.1 set NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
2.2 imp userid=system file=aaa.dmp fromuser=jacky touser=jacky
ignore=y
2.3 select dump(convert(c1,'AL32UTF8')) from aaa

DUMP(CONVERT(C1,'AL32UTF8'))
----------------------------------------------------
Typ=1 Len=9:
115,121,115,46,194,133,77,101,100

After converted, the char with dump number "133" is converted to char
with dump number "194,133". It is not same as the char default
replacement character ¿.

I am not sure if "133" in WE8ISO8859P1 can map to "194,133" in
AL32UTF8.
OR
if "133" in WE8MSWIN1252 can map to "194,133" in AL32UTF8.

But from result in UI, they are same.

Could you help review it again?

Laurenz Albe

unread,

Apr 8, 2009, 4:31:31 AM4/8/09

to

lsllcm wrote:
>>> I have one question about CSSCAN when I do conversion from
>>> WE8ISO8859P1 to AL32UTF8 unicode character set.
>>>
>>> create table aaa (c1 varchar2(10));

>>> insert into aaa values ('sys.…Med');

>>>
>>> After csscan the data, get the application exception data as below
>>>

[...]
>>>AAAYtgAAEAAAEKPAAA lossy conversion sys..Med

[...]
>>> Below is dump of data
>>>

[...]

>>> select dump(c1) from aaa
>>> DUMP(C1)
>>> ----------------------------------------------------
>>> Typ=1 Len=8:
>>> 115,121,115,46,133,77,101,100
>>>
>>> Is report of cssacn not correct?
>>
>> The report of csscan is correct.
>>
>> The problem is the fifth character. In the insert statement above it looks
>> like an ellipsis (three dots), and it ends up as 133 or hex 85 in your
>> database, so I assume that you mean UNICODE hex 2026, the horizontal
>> ellipsis, which is hex 85 in the Windows character set WE8MSWIN1252.
>>
>> The problem is that your database character set is not WE8MSWIN1252,
>> where everything would be fine, but WE8ISO8859P1, where hex 85
>> does not exist.
>>
>> So csscan warns you that if you convert this to AL32UTF8, hex 85,
>> which is an illegal value that has no meaning in WE8ISO8859P1,
>> will be converted to the default replacement character ¿
>>
>> Your mistake is probably a wrong client character set.
>> Your NLS_LANG contains WE8ISO8859P1, and your client operating
>> system is Windows, correct?
>> That would mean that you are suffering from an Oracle "bug", namely
>> that character input is not checked for validity when client character set
>> and server character set are the same.
>> That way you end up with currupt data in your database.
>
>

> Thanks for your comments:
>
> Below is my test case
>

[...]

> 1.3 exp userid=jacky/jacky file=aaa.dmp tables=aaa
>
> 2. in AL32UTF characterset db on another linux platform
> 2.1 set NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
> 2.2 imp userid=system file=aaa.dmp fromuser=jacky touser=jacky ignore=y
> 2.3 select dump(convert(c1,'AL32UTF8')) from aaa
> DUMP(CONVERT(C1,'AL32UTF8'))
> ----------------------------------------------------
> Typ=1 Len=9:
> 115,121,115,46,194,133,77,101,100

Hmm, why convert()?
If the database where you imported is AL32UTF8, the data should
already be in UTF-8 encoding.
Do you get the same result if you omit convert()?
I would expect so.

> After converted, the char with dump number "133" is converted to char
> with dump number "194,133". It is not same as the char default
> replacement character ¿.

You are right (I admit that I didn't test it).
Obviously Oracle interprets hex 85 in WE8ISO8859P1 as
UNICODE hex 85, which is a control character ("next line").

> I am not sure if "133" in WE8ISO8859P1 can map to "194,133" in
> AL32UTF8.
> OR
> if "133" in WE8MSWIN1252 can map to "194,133" in AL32UTF8.

UNICODE hex 85 can be converted to AL32UTF8, and the result is
hex C285 or binary 194,133.

The problem is that this is NOT the "horizontal ellipsis"
character that you want.
And this is what csscan complains about:
Hex 85 is not a correct ISO8859-1 character.
In the light of your exp/imp experiment, I would say that you
are right to be suspicious of the csscan warning, because
Oracle contradicts itself here: on the one hand it warns you
(correctly) that hex 85 is not a valid character, on the other
hand it happily converts it into UNICODE "next line".

> But from result in UI, they are same.
>
> Could you help review it again?

Ok, the case hase become a little more convoluted, but the basic
things have not changed.

You see "correct" results in your user interface, but the
data in the database are wrong.

You store a Windows "horizontal ellipsis" character in a
WE8ISO8859P1 database, but your client character set is not
set to the (correct) WE8MSWIN1252 character set, so Oracle
interprets the hex 85 that it gets from Windows as
a WE8ISO8859P1 character, that is (depending on viewpoint)
either garbage or a control character, but certainly not
a horizontal ellipsis.

When csscan sees this, it judges the character as garbage.
exp/imp, however, interprets it as control character and
translates it to AL32UTF8 as such.

As long as you keep your Oracle client character set to
WE8ISO8859P1 *and* interpret the resulting character as
Windows character, everything will appear to be fine.
Basically you're doing the reverse of the above.

But it is a sleeping time bomb.
For example, you will never be able to use this database
with JDBC, because Java cannot use WE8ISO8859P1.

The character you want (horizontal ellipsis) is
hex E280A6 (or decimal 226,128,166) in AL32UTF8, and
that is what should be in your database if everything had
gone correctly.

I recommend that you do the following (provided you have
only Windows characters in your database:

- Backup the original database.
- Change the database character set from WE8ISO8859P1
to WE8MSWIN1252 using CSALTER as described in
http://download-uk.oracle.com/docs/cd/B19306_01/server.102/b14225/ch12scanner.htm#sthref1702
- Verify that csscan does not report any problems now.
- Now use exp/imp to transfer the data into your
AL32UTF8 database.
- Change your client character set to WE8MSWIN1252.

Then everything should look ok again, and in addition
it *is* ok.

Yours,
Laurenz Albe

lsllcm

unread,

Apr 8, 2009, 11:24:25 AM4/8/09

to

Thanks for your comments:

I use convert function is to test becuase the result of convert is the
same as importing the same string from a WE8ISO8859P1 database to a
AL32UTF8 database. (From oracle document)

If the varchar2 can be converted to target characterset and can be
converted back too. That means that the varchar2 can be converted
successfully.

But as you said that when the char does not exist in db character set.
It is one problem. It can cause unpredictable conversion.

Fortunately, we use jdbc program to operate on oracle db. Java is
UTF-16 encoded, Data retrieved from or inserted into the database must
be converted from UTF-16 to the database character set or the national
character set and vice versa. (From oracle document)

Each character is converted from UTF-16 to WE8ISO8859P1. It also can
be converted from WE8ISO8859P1 to UTF-16. Because UTF-16 can be
converted to AL32UTF8, so all characters can be converted to AL32UTF8
characterset.

What do you think about?

I will test the way you metioned.

1. create one db with characterset WE8MSWIN1252, and insert data, and
convert it.
2. create one db with characterset WE8ISO8859P1, and insert data, and
alter database characterset to WE8MSWIN1252. Then convert it.

Thanks
Jacky

lsllcm

unread,

Apr 9, 2009, 1:19:39 AM4/9/09

to

After tested, csscan also report application data exception when
prepare to change character set from WE8ISO8859P1 to WE8MSWIN1252, and
we cannot use csalter.plb to change the database character set.

Database character set
WE8ISO8859P1
FROMCHAR
WE8ISO8859P1
TOCHAR
WE8MSWIN1252
Scan NCHAR data?
NO
Array fetch buffer size
1024000
Number of processes
1
Capture convertible data?
NO
------------------------------
------------------------------------------------

[Data Dictionary individual exceptions]

[Application data individual exceptions]

User : JACKY
Table : AAA
Column: C1

Type : VARCHAR2(1000)

Number of Exceptions : 1
Max Post Conversion Data Size: 9

ROWID Exception Type Size Cell Data(first 30
bytes)
------------------ ------------------ -----
------------------------------

AAALaRAAEAAAAAQAAA lossy conversion sys.匨
ed
------------------ ------------------ -----
------------------------------

Laurenz Albe

unread,

Apr 9, 2009, 5:38:24 AM4/9/09

to

lsllcm wrote:
> I use convert function is to test becuase the result of convert is the
> same as importing the same string from a WE8ISO8859P1 database to a
> AL32UTF8 database. (From oracle document)

Ah, good to know.

> If the varchar2 can be converted to target characterset and can be
> converted back too. That means that the varchar2 can be converted
> successfully.
>
> But as you said that when the char does not exist in db character set.
> It is one problem. It can cause unpredictable conversion.

Right. Not "unpredictable", but the wrong data that are in your
database now will remain wrong.

> Fortunately, we use jdbc program to operate on oracle db. Java is
> UTF-16 encoded, Data retrieved from or inserted into the database must
> be converted from UTF-16 to the database character set or the national
> character set and vice versa. (From oracle document)

Java does not use UTF-16, but UCS-2.
However, this is probably not important in your case.

> Each character is converted from UTF-16 to WE8ISO8859P1. It also can
> be converted from WE8ISO8859P1 to UTF-16. Because UTF-16 can be
> converted to AL32UTF8, so all characters can be converted to AL32UTF8
> characterset.
>
> What do you think about?

I don't get it. What do you plan to do?

> I will test the way you metioned.
>
> 1. create one db with characterset WE8MSWIN1252, and insert data, and
> convert it.
> 2. create one db with characterset WE8ISO8859P1, and insert data, and
> alter database characterset to WE8MSWIN1252. Then convert it.

You must also set the correct client character set with NLS_LANG.
If the client character set is WE8ISO8859P1 and the database is WE8MSWIN1252,
you will not be able to insert character hex 85 (question mark syndrome).
Also if the client character set is WE8MSWIN1252 and the database is
WE8ISO8859P1, it will not work.
If both client and database character set are WE8ISO8859P1 (as it is
now), you can insert hex 85 but it is wrong.
Only if both client and database character set are WE8MSWIN1252,
the inserted hex 85 will be the correct character.

Yours,
Laurenz Albe

Laurenz Albe

unread,

Apr 9, 2009, 5:44:40 AM4/9/09

to

> AAALaRAAEAAAAAQAAA lossy conversion sys.…Med
> ------------------ ------------------ -----------------------------------

There's mot to changing the database character set, Metalink Note 555823.1
covers it in detail.

First, you have to be certain that all the non-ASCII characters in the database
are WE8MSWIN1252.

Then you would have to run csscan with FROMCHAR=WE8MSWIN1252 TOCHAR=WE8MSWIN1252.
There should be no errors.

Then you can convert the database.

Yours,
Laurenz Albe

lsllcm

unread,

Apr 9, 2009, 7:55:13 AM4/9/09

to

On Apr 9, 5:44 pm, "Laurenz Albe" <inv...@spam.to.invalid> wrote:
> lsllcm wrote:
> > After tested, csscan also report application data exception when
> > prepare to change character set from WE8ISO8859P1 to WE8MSWIN1252, and
> > we cannot use csalter.plb to change the database character set.
>
> > Database character set
> > WE8ISO8859P1
> > FROMCHAR
> > WE8ISO8859P1
> > TOCHAR
> > WE8MSWIN1252
> > Scan NCHAR data?
> > NO
> > Array fetch buffer size
> > 1024000
> > Number of processes
> > 1
> > Capture convertible data?
> > NO

> > ------------------------------------------------------------------------------

>
> > [Data Dictionary individual exceptions]
>
> > [Application data individual exceptions]
>
> > User : JACKY
> > Table : AAA
> > Column: C1
> > Type : VARCHAR2(1000)
> > Number of Exceptions : 1
> > Max Post Conversion Data Size: 9
>
> > ROWID Exception Type Size Cell Data(first 30
> > bytes)
> > ------------------ ------------------ -----------------------------------
> > AAALaRAAEAAAAAQAAA lossy conversion sys.…Med
> > ------------------ ------------------ -----------------------------------
>
> There's mot to changing the database character set, Metalink Note 555823.1
> covers it in detail.
>
> First, you have to be certain that all the non-ASCII characters in the database
> are WE8MSWIN1252.
>
> Then you would have to run csscan with FROMCHAR=WE8MSWIN1252 TOCHAR=WE8MSWIN1252.
> There should be no errors.
>
> Then you can convert the database.
>
> Yours,

> Laurenz Albe- Hide quoted text -
>
> - Show quoted text -

On Apr 9, 5:44 pm, "Laurenz Albe" <inv...@spam.to.invalid> wrote:
> lsllcm wrote:
> > After tested, csscan also report application data exception when
> > prepare to change character set from WE8ISO8859P1 to WE8MSWIN1252, and
> > we cannot use csalter.plb to change the database character set.
>
> > Database character set
> > WE8ISO8859P1
> > FROMCHAR
> > WE8ISO8859P1
> > TOCHAR
> > WE8MSWIN1252
> > Scan NCHAR data?
> > NO
> > Array fetch buffer size
> > 1024000
> > Number of processes
> > 1
> > Capture convertible data?
> > NO

> > ------------------------------------------------------------------------------

>
> > [Data Dictionary individual exceptions]
>
> > [Application data individual exceptions]
>
> > User : JACKY
> > Table : AAA
> > Column: C1
> > Type : VARCHAR2(1000)
> > Number of Exceptions : 1
> > Max Post Conversion Data Size: 9
>
> > ROWID Exception Type Size Cell Data(first 30
> > bytes)
> > ------------------ ------------------ -----------------------------------
> > AAALaRAAEAAAAAQAAA lossy conversion sys.…Med
> > ------------------ ------------------ -----------------------------------
>
> There's mot to changing the database character set, Metalink Note 555823.1
> covers it in detail.
>
> First, you have to be certain that all the non-ASCII characters in the database
> are WE8MSWIN1252.
>
> Then you would have to run csscan with FROMCHAR=WE8MSWIN1252 TOCHAR=WE8MSWIN1252.
> There should be no errors.
>
> Then you can convert the database.
>
> Yours,

> Laurenz Albe- Hide quoted text -
>
> - Show quoted text -

yes, Java used UCS-2 initially, and added UTF-16 supplementary
character support in J2SE 5.0. I refer to the document.
http://en.wikipedia.org/wiki/UTF-16

Below command does work.

csscan \"sys as sysdba\" full=y fromchar=WE8MSWIN1252
tochar=WE8MSWIN1252

It can alter database characterset.

But a little more complex.

I use java to read the both WE8ISO8859P1 and WE8MSWIN1252 dbs

1 in WE8ISO8859P1 db, I print rs.getBytes("c1") array, the result is
as below, so it should be already unicode, and it does not do any
conversion.
125
115
121
115
46
-123 ===========>> as same as 256-123 = 133
77
101
100
0

2 in WE8MSWIN1252 db, I print rs.getBytes("c1") array, the result is
as below, looks it is converted after it is read.
125
115
121
115
46
-65 ===========>> as same as 256-65 = 191
77
101
100
0

If we use item 1 to convert, they are all wrong, but the UI are same
even they are wrong.
If we use item 2 to convert, the before is wrong, but after convert,
it is correct. but the UI will be different.

To be consistent, I choose item 1. At least, the data is not lost from
UI in both before and after..

If client cannot afford it, he/she should correct it at very early
time.

This is not one technical question, it is choose question.

Thank your help, it is helpful

Michael Austin

unread,

Apr 9, 2009, 10:48:20 AM4/9/09

to

lsllcm wrote:
> On Apr 9, 5:44 pm, "Laurenz Albe" <inv...@spam.to.invalid> wrote:
>> lsllcm wrote:
>>> After tested, csscan also report application data exception when
>>> prepare to change character set from WE8ISO8859P1 to WE8MSWIN1252, and
>>> we cannot use csalter.plb to change the database character set.
>>> Database character set
>>> WE8ISO8859P1
>>> FROMCHAR
>>> WE8ISO8859P1
>>> TOCHAR
>>> WE8MSWIN1252
>>> Scan NCHAR data?
>>> NO
>>> Array fetch buffer size
>>> 1024000
>>> Number of processes
>>> 1
>>> Capture convertible data?
>>> NO

>>> ---------------------------------------------------------------------------ï¿½---

>>> [Data Dictionary individual exceptions]
>>> [Application data individual exceptions]
>>> User : JACKY
>>> Table : AAA
>>> Column: C1
>>> Type : VARCHAR2(1000)
>>> Number of Exceptions : 1
>>> Max Post Conversion Data Size: 9
>>> ROWID Exception Type Size Cell Data(first 30
>>> bytes)
>>> ------------------ ------------------ -----------------------------------

>>> AAALaRAAEAAAAAQAAA lossy conversion sys.ï¿½Med

>>> ---------------------------------------------------------------------------ï¿½---

>>> [Data Dictionary individual exceptions]
>>> [Application data individual exceptions]
>>> User : JACKY
>>> Table : AAA
>>> Column: C1
>>> Type : VARCHAR2(1000)
>>> Number of Exceptions : 1
>>> Max Post Conversion Data Size: 9
>>> ROWID Exception Type Size Cell Data(first 30
>>> bytes)
>>> ------------------ ------------------ -----------------------------------

>>> AAALaRAAEAAAAAQAAA lossy conversion sys.ï¿½Med

I liked it when there was only one character set ASCII to deal with... :)

joel garry

unread,

Apr 9, 2009, 12:33:19 PM4/9/09

to

On Apr 9, 7:48 am, Michael Austin <maus...@firstdbasource.com> wrote:
> lsllcm wrote:

>
> I liked it when there was only one character set ASCII to deal with... :)

You mean, you aren't old enough to remember EBCDIC? :-)

(I was sooooo happy to get my first PDP-11/34 job with VT-52's...
Think fast, what's 8224?)

jg
--
@home.com is bogus. Now causing a flap in Old Town San Diego
gallery: http://media.signonsandiego.com/img/photos/2009/04/02/4_2-b4-bell_t350.jpg
Older classics: http://culturepopped.blogspot.com/2007/04/suddenly-last-supper.html

Michael Austin

unread,

Apr 9, 2009, 1:46:52 PM4/9/09

to

joel garry wrote:
> On Apr 9, 7:48 am, Michael Austin <maus...@firstdbasource.com> wrote:
>> lsllcm wrote:
>
>> I liked it when there was only one character set ASCII to deal with... :)
>
> You mean, you aren't old enough to remember EBCDIC? :-)
>

Yes, but it was generally only used (still is) on IBM mainframes. :)
Working at DEC, we had EBCDIC2ASCII translators as well as bidirectional
DECNET-2-SNAgateway translators. :)

lsllcm

unread,

Apr 10, 2009, 12:13:49 AM4/10/09

to

On Apr 10, 1:46 am, Michael Austin <maus...@firstdbasource.com> wrote:
> joel garry wrote:
> > On Apr 9, 7:48 am, Michael Austin <maus...@firstdbasource.com> wrote:
> >>lsllcmwrote:
>
> >> I liked it when there was only one character set ASCII to deal with... :)
>
> > You mean, you aren't old enough to remember EBCDIC? :-)
>
> Yes, but it was generally only used (still is) on IBM mainframes. :)
> Working at DEC, we had EBCDIC2ASCII translators as well as bidirectional
> DECNET-2-SNAgateway translators. :)
>
>
>
> > (I was sooooo happy to get my first PDP-11/34 job with VT-52's...
> > Think fast, what's 8224?)
>
> > jg
> > --
> > @home.com is bogus. Now causing a flap in Old Town San Diego

> > gallery: http://media.signonsandiego.com/img/photos/2009/04/02/4_2-b4-bell_t35...
> > Older classics:http://culturepopped.blogspot.com/2007/04/suddenly-last-supper.html- Hide quoted text -

>
> - Show quoted text -

That is good idea to only has US7ASCII character.

Do you mean that convert current all non-US7ASCII character into
US7ASCII char, then do the conversion?

Where can I find this kind of tool such as EBCDIC2ASCII translators?

Thanks

Laurenz Albe

unread,

Apr 10, 2009, 8:23:43 AM4/10/09

to

lsllcm wrote:
> But a little more complex.
>
> I use java to read the both WE8ISO8859P1 and WE8MSWIN1252 dbs
>
> 1 in WE8ISO8859P1 db, I print rs.getBytes("c1") array, the result is
> as below, so it should be already unicode, and it does not do any
> conversion.
> 125
> 115
> 121
> 115
> 46
> -123 ===========>> as same as 256-123 = 133
> 77
> 101
> 100
> 0

rs.getBytes()?? Why would you select a string as Bytes?
Can you show your code?

I inserted your byte sequence into a WE8ISO8859P1 database,
and selected it with getString() with JDBC, and I got what I expected:

sys.¿Med

> 2 in WE8MSWIN1252 db, I print rs.getBytes("c1") array, the result is
> as below, looks it is converted after it is read.
> 125
> 115
> 121
> 115
> 46
> -65 ===========>> as same as 256-65 = 191
> 77
> 101
> 100
> 0
>
> If we use item 1 to convert, they are all wrong, but the UI are same
> even they are wrong.
> If we use item 2 to convert, the before is wrong, but after convert,
> it is correct. but the UI will be different.

I do not understand.
Your example 2 does not seem to be correct, but I don't know what you did.

> To be consistent, I choose item 1. At least, the data is not lost from
> UI in both before and after..
>
> If client cannot afford it, he/she should correct it at very early
> time.
>
> This is not one technical question, it is choose question.

As I said, it's a time bomb.
Choose to either fix it now or maybe blow up later.

For example as soon as the customer wishes to insert a character that
is not in the Windows character set.

You can choose to ignore it, but I would at least tell my customer
that there are corrupt data in the database.

Yours,
Laurenz Albe

Michael Austin

unread,

Apr 10, 2009, 9:32:09 AM4/10/09

to

http://lmgtfy.com/?q=ebcdic2ascii

joel garry

unread,

Apr 10, 2009, 2:16:59 PM4/10/09

to

On Apr 9, 9:13 pm, lsllcm <lsl...@gmail.com> wrote:
> On Apr 10, 1:46 am, Michael Austin <maus...@firstdbasource.com> wrote:
>
>
>
>
>
> > joel garry wrote:
> > > On Apr 9, 7:48 am, Michael Austin <maus...@firstdbasource.com> wrote:
> > >>lsllcmwrote:
>
> > >> I liked it when there was only one character set ASCII to deal with... :)
>
> > > You mean, you aren't old enough to remember EBCDIC? :-)
>
> > Yes, but it was generally only used (still is) on IBM mainframes. :)
> > Working at DEC, we had EBCDIC2ASCII translators as well as bidirectional
> > DECNET-2-SNAgateway translators. :)
>
> > > (I was sooooo happy to get my first PDP-11/34 job with VT-52's...
> > > Think fast, what's 8224?)
>
> > > jg
> > > --
> > > @home.com is bogus. Now causing a flap in Old Town San Diego
> > > gallery: http://media.signonsandiego.com/img/photos/2009/04/02/4_2-b4-bell_t35...

> > > Older classics:http://culturepopped.blogspot.com/2007/04/suddenly-last-supper.html-Hide quoted text -

>
> > - Show quoted text -
>
> That is good idea to only has US7ASCII character.
>
> Do you mean that convert current all non-US7ASCII character into
> US7ASCII char, then do the conversion?
>
> Where can I find this kind of tool such as EBCDIC2ASCII translators?
>
> Thanks

Since I (and anyone happening upon this later) can't tell if you are
joking, please understand that for many years, us7ascii was the
default for Oracle databases, often causing no end of trouble with the
sort of issue you have, ie, the client software allows entering
characters that are not in the db characterset. The trouble would
often happen when exp/imp would be used to import into an 8 bit
characterset, losing the 8th bit as Oracle oh-so-helpfully
mistranslates the characters. I'm sure this is one of the reasons for
csscan.

Laurenz seems to have good advice here, also dig into the
globalization manual, and check metalink for the notes on nls.

Ignore us old folk reminiscing about days long gone.

jg
--
@home.com is bogus.

"Remember rivers, Nellie?" - Firesign Theatre

lsllcm

unread,

Apr 11, 2009, 8:04:32 AM4/11/09

to

On Apr 10, 8:23 pm, "Laurenz Albe" <inv...@spam.to.invalid> wrote:
> lsllcmwrote:

> Laurenz Albe- Hide quoted text -
>

> - Show quoted text -- Hide quoted text -

>
> - Show quoted text -

Thanks for your comments:

1. I use rs.getBytes() to get raw data from db, not after converted.
The test on WE8MSWIN1252 is not correct. It occurs when I use 10.2.0.1
jdbc driver with db 10.2.0.4. After I use 10.2.0.4 jdbc driver, it
prints the bytes as same as data in db.

2. The rs.getString looks does some conversion. But I am clear how it
converts between oracle db and java client.

3. Because using System.out.println() should also do some conversion,
I choose jsp to test the case.
<-----------------------------------code begin
------------------------------>
HttpServletRequest h_req = (HttpServletRequest) request;
HttpServletResponse h_res = (HttpServletResponse) response;

h_res.setContentType("text/html");
h_res.setHeader("Pragma", "No-cache");
h_res.addHeader("Cache-Control", "no-cache");
h_res.addHeader("Expires", "Thu, 01 Jan 1970 00:00:01 GMT");

h_req.setCharacterEncoding("utf-8");
h_res.setCharacterEncoding("utf-8");

Context initContext = new InitialContext();
Context envContext = (Context)initContext.lookup("java:/comp/
env");
DataSource ds = (DataSource)envContext.lookup("jdbc/myoracle");
Connection conn = ds.getConnection();

PreparedStatement ps = null;
ResultSet rs = null;
String sql = "SELECT * from aaa ";

ps = conn.prepareStatement(sql);

rs = ps.executeQuery();

byte[] aaa = new byte[1024];
String bb = "";
while (rs.next())
{
aaa = rs.getBytes("c1");
bb = rs.getString("c1");
out.println(bb);
}
<-----------------------------------code end
------------------------------>

3.1 to connect to WE8ISO8859P1 db, it prints below in web page, the
hex 85 is displayed like space char.
sys. Med

3.2 to connect to WE8MSWIN1252 db, it prints below in web page, the
hex 83 is displayed as correct char
sys....Med

3.3 to connect to AL32UTF8 db, I insert two rows, one is from
WE8ISO8859P1 and another is from WE8MSWIN1252. it prints below in web
page,
SQL> select dump (c1) from aaa;

DUMP(C1)
--------------------------------------------------------------------------------
Typ=1 Len=9: 115,121,115,46,194,133,77,101,100
Typ=1 Len=10: 115,121,115,46,226,128,166,77,101,100

print in web page---------------------------------------------
sys. Med sys....Med

Your suggest is right, we need to make data correct even if it is
wrong currectly.

4. The rs.getString looks does some conversion. But I am clear how it
converts between oracle db and java client. From document B19306_01/
server.102/b14225/ch7progrunicode.htm#sthref924.

document
========================================================================================
Data Conversion for Thin Drivers
SQL statements are always converted to either the database character
set or to UTF-8 by the driver before they are sent to the database for
processing. When the database character set is either US7ASCII or
WE8ISO8859P1, the driver converts the SQL statement to the database
character set. Otherwise, the driver converts the SQL statement to
UTF-8 and notifies the database that a SQL statement requires further
conversion before being processed. The database, in turn, converts the
SQL statements from UTF-8 to the database character set. The database,
in turn, converts the SQL statement to the database character set.
========================================================================================

from document, I write one java application program and want to verify
it

==============================================Code begin
=====================================
byte[] aaa = new byte[1024];
while (rs.next())
{
bb = rs.getString("c1");
}

aaa = bb.getBytes();
for (int mm=0;mm<aaa.length;mm++)
{
System.out.println(aaa[mm]);
}
//print default java character set
System.out.println(Charset.defaultCharset() );
===================================code end
=============================================

4.1 test on WE8ISO8859P1 db
115
121
115
46
63 ---- looks convert to US7ASCII char, after verify convert
(c1,'US7ASCII')
77
101
100
--default java character set
windows-1252

4.2 test on WE8MSWIN1252 db.
115
121
115
46
-123 -- does not do any conversion
77
101
100
windows-1252

4.3 test on AL32UTF8 db
--first row
115
121
115
46
63 ---- looks convert to US7ASCII char, after verify convert
(c1,'US7ASCII')
77
101
100
--second row
115
121
115
46
-123 -- does not do any conversion
77
101
100
15
--default java character set
windows-1252

I cannot understand two sides

java uses UCS-2 (UTF-16) encoding, one char should have two bytes, but
when use String.getBytes, it converts to windows-1252. The result is
to print one byte for one char.

I will do the test again, and post here

Thanks for your suggestion again --:)

lsllcm

unread,

Apr 11, 2009, 8:07:06 AM4/11/09

to

On Apr 11, 2:16 am, joel garry <joel-ga...@home.com> wrote:

> On Apr 9, 9:13 pm,lsllcm<lsl...@gmail.com> wrote:
>
>
>
>
>
> > On Apr 10, 1:46 am, Michael Austin <maus...@firstdbasource.com> wrote:
>
> > > joel garry wrote:
> > > > On Apr 9, 7:48 am, Michael Austin <maus...@firstdbasource.com> wrote:
> > > >>lsllcmwrote:
>
> > > >> I liked it when there was only one character set ASCII to deal with... :)
>
> > > > You mean, you aren't old enough to remember EBCDIC? :-)
>
> > > Yes, but it was generally only used (still is) on IBM mainframes. :)
> > > Working at DEC, we had EBCDIC2ASCII translators as well as bidirectional
> > > DECNET-2-SNAgateway translators. :)
>
> > > > (I was sooooo happy to get my first PDP-11/34 job with VT-52's...
> > > > Think fast, what's 8224?)
>
> > > > jg
> > > > --
> > > > @home.com is bogus. Now causing a flap in Old Town San Diego
> > > > gallery: http://media.signonsandiego.com/img/photos/2009/04/02/4_2-b4-bell_t35...

> > > > Older classics:http://culturepopped.blogspot.com/2007/04/suddenly-last-supper.html-Hidequoted text -

>
> > > - Show quoted text -
>
> > That is good idea to only has US7ASCII character.
>
> > Do you mean that convert current all non-US7ASCII character into
> > US7ASCII char, then do the conversion?
>
> > Where can I find this kind of tool such as EBCDIC2ASCII translators?
>
> > Thanks
>
> Since I (and anyone happening upon this later) can't tell if you are
> joking, please understand that for many years, us7ascii was the
> default for Oracle databases, often causing no end of trouble with the
> sort of issue you have, ie, the client software allows entering
> characters that are not in the db characterset. The trouble would
> often happen when exp/imp would be used to import into an 8 bit
> characterset, losing the 8th bit as Oracle oh-so-helpfully
> mistranslates the characters. I'm sure this is one of the reasons for
> csscan.
>
> Laurenz seems to have good advice here, also dig into the
> globalization manual, and check metalink for the notes on nls.
>
> Ignore us old folk reminiscing about days long gone.
>
> jg
> --
> @home.com is bogus.

> "Remember rivers, Nellie?" - Firesign Theatre- Hide quoted text -

>
> - Show quoted text -

Yes, really thanks

lsllcm

unread,

Apr 12, 2009, 1:14:31 AM4/12/09

to

I have done more test on conversion between java client and oracle db
server. Below are test code.

-----------Code begin -------------------------

byte[] bRay = null;
char quote = '"';
int ndx;
// String sInitial = "Enc" + "\u4E94";
String sInitial = rs.getString("c1");

try { bRay = sInitial.getBytes("UTF-16"); }
catch( UnsupportedEncodingException uee )
{
System.out.println( "Exception: " + uee);
}

System.out.println( quote + sInitial + quote +
" String as UTF-16, " +
"bRay length: " + bRay.length + "." );
for( ndx = 0; ndx < bRay.length; ndx++ )
{
System.out.print( Integer.toHexString( bRay[ ndx++ ] ) + "
" );
System.out.print( Integer.toHexString( bRay[ ndx ] ) + "
" );
}

System.out.println("\n");

OutputStreamWriter osw = new OutputStreamWriter( System.out );

bRay = sInitial.getBytes();
System.out.println( quote + sInitial + quote +
" String as platform default - " +
osw.getEncoding() +
", bRay length: " + bRay.length + "." );
for( ndx = 0; ndx < bRay.length; ndx++ )
{
System.out.print( Integer.toHexString( bRay[ ndx ] ) + "
" );
}

System.out.println("\n");

try
{
sInitial = new String( sInitial.getBytes("UTF-8"), "UTF-8");
bRay = sInitial.getBytes("UTF-8");
}
catch( UnsupportedEncodingException uee )
{
System.out.println( "Exception: " + uee);
}

System.out.println( quote + sInitial + quote +
" String as UTF-8, " +
"bRay length: " + bRay.length + "." );

for( ndx = 0; ndx < bRay.length; ndx++ )
{
System.out.print( Integer.toHexString( bRay[ ndx ] ) + "
" );
}
-----------Code begin -------------------------

1. Below are test result when the db is AL32UTF8 when the row has data
"sys.…Med ."

-----------test result begin -------------------
"sys.匨ed" String as UTF-16, bRay length: 18..
fffffffe ffffffff 0 73 0 79 0 73 0 2e 20 26 0 4d 0 65
0 64

"sys.匨ed" String as platform default - Cp1252, bRay length: 8..
73 79 73 2e ffffff85 4d 65 64

"sys.匨ed" String as UTF-8, bRay length: 10..
73 79 73 2e ffffffe2 ffffff80 ffffffa6 4d 65 64
windows
-----------test result end ----------------------

2. Below are test result when the db is WE8ISO8859P1 when the row has
data "sys. Med ."

-----------test result begin -------------------
"sys.?Med" String as UTF-16, bRay length: 18.
fffffffe ffffffff 0 73 0 79 0 73 0 2e 0 ffffff85 0 4d 0
65 0 64

"sys.?Med" String as platform default - Cp1252, bRay length: 8.
73 79 73 2e 3f 4d 65 64

"sys.?Med" String as UTF-8, bRay length: 9.
73 79 73 2e ffffffc2 ffffff85 4d 65 64 windows-1252
-----------test result end ----------------------

4. I use FileOutputStream to print the data into file and open it by
using IE, the result is same as in JSP.

--------code begin------------
FileOutputStream fs=new FileOutputStream( "aa.htm");

OutputStreamWriter oswaa = new OutputStreamWriter( fs, "UTF-8" );

bb = rs.getString("c1");

oswaa.write(bb,0,bb.length());
--------code end------------

Conclusion:

The java client read data from oracle and translate it into UTF-16,
but how to print the data from java applicaiton, we need to do
encoding-decoding work.

Laurenz Albe

unread,

Apr 14, 2009, 4:14:15 AM4/14/09

to

lsllcm wrote:
[...]

> 3.3 to connect to AL32UTF8 db, I insert two rows, one is from
> WE8ISO8859P1 and another is from WE8MSWIN1252. it prints below in web
> page,
> SQL> select dump (c1) from aaa;
>
> DUMP(C1)
> --------------------------------------------------------------------------------
> Typ=1 Len=9: 115,121,115,46,194,133,77,101,100

194,133 is the control character hex 85 from WE8ISO8859P1,
converted to UTF-8, right.

> Typ=1 Len=10: 115,121,115,46,226,128,166,77,101,100

And 226,128,166 is the correct UTF-8 representation of
the "horizontal ellipsis" character.

> print in web page---------------------------------------------
> sys. Med sys....Med
>
> Your suggest is right, we need to make data correct even if it is
> wrong currectly.

I think that we have finally understood each other completly.

What I had been missing out on is that there is some post-processing
of the characters to display them in your web application.
That is of course also a place where things can go wrong,
even if you get the right characters out from the database.

Now that I have seen your Java code, I also understand how
you got the results you displayed:
You were using java.lang.String.getBytes(), which uses the
platform's default character set, in your case WINDOWS-1252.

So the correct result for your test case would be "-123"
for the weird character, which is the hex 85 you started with.

I tried with your string and the following Java code on Windows:

ResultSet rs = stmt.executeQuery("....");
rs.next();
System.out.println("Using getBytes():");
byte[] arr = rs.getBytes(1);
for (int i=0; i<arr.length; ++i)
System.out.println("Byte " + i + " = " + arr[i]);
System.out.println("Using getString().getBytes():");
arr = rs.getString(1).getBytes();
for (int i=0; i<arr.length; ++i)
System.out.println("Byte " + i + " = " + arr[i]);
System.out.println(Charset.defaultCharset());

and the result was:

Using getBytes():
Byte 0 = 115
Byte 1 = 121
Byte 2 = 115
Byte 3 = 46
Byte 4 = -30
Byte 5 = -128
Byte 6 = -90
Byte 7 = 77
Byte 8 = 101
Byte 9 = 100
Using getString().getBytes():
Byte 0 = 115
Byte 1 = 121
Byte 2 = 115
Byte 3 = 46
Byte 4 = -123
Byte 5 = 77
Byte 6 = 101
Byte 7 = 100
windows-1252

Right.

The java.sql.ResultSet.getBytes() method did not convert
anything and got the bytes in database encoding (AL32UTF8).
java.sql.ResultSet.getBytes() got the characters in UCS-2,
and java.lang.String.getBytes() converts that into
Windows encoding.

That will of course only work if the character exists
in Windows encoding. If you have anything else, e.g.
a "r" (UNICODE hex 0159, Czech r with hacek), the result
would be a question mark because Java cannot convert the
character to Windows 1252.

Yours,
Laurenz Albe

lsllcm

unread,

Apr 14, 2009, 9:35:00 PM4/14/09

to

On Apr 14, 4:14 pm, "Laurenz Albe" <inv...@spam.to.invalid> wrote:
> lsllcmwrote:
>

> [...]
>
> > 3.3 to connect to AL32UTF8 db, I insert two rows, one is from
> > WE8ISO8859P1 and another is from WE8MSWIN1252. it prints below in web
> > page,
> > SQL> select dump (c1) from aaa;
>
> > DUMP(C1)

> > --------------------------------------------------------------------------------

Yes, it is clear now.

Thanks
Jacky

Frank van Bortel

unread,

Apr 15, 2009, 5:01:31 AM4/15/09

to

lsllcm wrote:
[snip]

>
> 1. create one db with characterset WE8MSWIN1252, and insert data, and
> convert it.
> 2. create one db with characterset WE8ISO8859P1, and insert data, and
> alter database characterset to WE8MSWIN1252. Then convert it.
>
> Thanks
> Jacky

No need for that - a database stores whatever you throw at it.
I can store "special" characters in a US7ASCII database.
And retrieve them.

Provided the tool set cat actually display your "special"
characters...
--

Regards,
Frank van Bortel

lsllcm

unread,

Apr 15, 2009, 9:25:39 AM4/15/09

to

On Apr 15, 2:01 am, Frank van Bortel <frank.van.bor...@gmail.com>
wrote:
> lsllcmwrote:

The AL32UTF8 is used to support multiple languages. It is multiple
bytes character set.

Thanks

Laurenz Albe

unread,

Apr 15, 2009, 10:38:21 AM4/15/09

to

Frank van Bortel wrote:
> No need for that - a database stores whatever you throw at it.
> I can store "special" characters in a US7ASCII database.
> And retrieve them.
>
> Provided the tool set cat actually display your "special"
> characters...

You can choose to ignore and circumvent the character set
awareness of a database system and have your application handle
these issues. Oracle is pretty lax in that respect and will
let you do what you want.

Three disadvantages:
- You might to reinvent the wheel.
- You can get into trouble if the requirements change and
the database will need to be accessed and used in new
ways (JDBC, database link to AL32UTF8 database, ...).
- Oracle utilities may scream at you or behave in
surprising ways if the database does not contain what
they think it does.

Yours,
Laurenz Albe

Frank van Bortel

unread,

Apr 20, 2009, 2:21:04 PM4/20/09

to

I wanted to share this
http://www.oracle.com/technology/tech/globalization/htdocs/nls_lang%20faq.htm