I have one question about CSSCAN when I do conversion from
WE8ISO8859P1 to AL32UTF8 unicode character set.
create table aaa (c1 varchar2(10));
insert into aaa values ('sys.…Med');
After csscan the data, get the application exception data as below
---------------------------------------------------------------------------------------------------------
[Application data individual exceptions]
User : JACKY
Table : AAA
Column: C1
Type : VARCHAR2(10)
Number of Exceptions : 1
Max Post Conversion Data Size: 9
ROWID Exception Type Size Cell Data(first 30 bytes)
------------------ ------------------ -----
------------------------------
AAAYtgAAEAAAEKPAAA lossy conversion sys.…Med
------------------ ------------------ -----
------------------------------
But after use convert method and check it, the data can be converted
successfully after use below sql
select case when convert(convert(C1, 'AL32UTF8',
'WE8ISO8859P1') ,'WE8ISO8859P1', 'AL32UTF8') = C1 then 0 else
-999999+111111 END from aaa;
Is report of cssacn not correct?
Thanks
Jacky
The string is "sys. Med"
Thanks
create table aaa (c1 varchar2(10));
insert into aaa values ('sys.…Med');
After csscan the data, get the application exception data as below
------------------------------------------------
[Application data individual exceptions]
User : JACKY
Table : AAA
Column: C1
Type : VARCHAR2(10)
Number of Exceptions : 1
Max Post Conversion Data Size: 9
ROWID Exception Type Size Cell Data(first 30 bytes)
------------------ ------------------ -----
------------------------------
AAAYtgAAEAAAEKPAAA lossy conversion sys.…Med
------------------ ------------------ -----
------------------------------
But after use convert method and check it, the data can be converted
successfully after use below sql
select case when convert(convert(C1, 'AL32UTF8',
'WE8ISO8859P1') ,'WE8ISO8859P1', 'AL32UTF8') = C1 then 0 else
-999999+111111 END from aaa;
Below is dump of data
1. dump in AL32UTF8 character set
select dump(convert(c1,'AL32UTF8')) from aaa
DUMP(CONVERT(C1,'AL32UTF8'))
----------------------------------------------------
Typ=1 Len=9:
115,121,115,46,194,133,77,101,100
2. dump in WE8ISO8859P1 character set
select dump(c1) from aaa
DUMP(C1)
----------------------------------------------------
Typ=1 Len=8:
115,121,115,46,133,77,101,100
> Is report of cssacn not correct?
The report of csscan is correct.
The problem is the fifth character. In the insert statement above it looks
like an ellipsis (three dots), and it ends up as 133 or hex 85 in your
database, so I assume that you mean UNICODE hex 2026, the horizontal
ellipsis, which is hex 85 in the Windows character set WE8MSWIN1252.
The problem is that your database character set is not WE8MSWIN1252,
where everything would be fine, but WE8ISO8859P1, where hex 85
does not exist.
So csscan warns you that if you convert this to AL32UTF8, hex 85,
which is an illegal value that has no meaning in WE8ISO8859P1,
will be converted to the default replacement character ¿
Your mistake is probably a wrong client character set.
Your NLS_LANG contains WE8ISO8859P1, and your client operating
system is Windows, correct?
That would mean that you are suffering from an Oracle "bug", namely
that character input is not checked for validity when client character set
and server character set are the same.
That way you end up with currupt data in your database.
Yours,
Laurenz Albe
Thanks for your comments:
Below is my test case
1. in WE8ISO8859P1 characterset db on linux platform,
1.1 create table aaa (c1 varchar2(10));
insert into aaa values ('sys.…Med');
1.2 select dump(c1) from aaa
DUMP(C1)
----------------------------------------------------
Typ=1 Len=8:
115,121,115,46,133,77,101,100
1.3 exp userid=jacky/jacky file=aaa.dmp tables=aaa
2. in AL32UTF characterset db on another linux platform
2.1 set NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
2.2 imp userid=system file=aaa.dmp fromuser=jacky touser=jacky
ignore=y
2.3 select dump(convert(c1,'AL32UTF8')) from aaa
DUMP(CONVERT(C1,'AL32UTF8'))
----------------------------------------------------
Typ=1 Len=9:
115,121,115,46,194,133,77,101,100
After converted, the char with dump number "133" is converted to char
with dump number "194,133". It is not same as the char default
replacement character ¿.
I am not sure if "133" in WE8ISO8859P1 can map to "194,133" in
AL32UTF8.
OR
if "133" in WE8MSWIN1252 can map to "194,133" in AL32UTF8.
But from result in UI, they are same.
Could you help review it again?
Hmm, why convert()?
If the database where you imported is AL32UTF8, the data should
already be in UTF-8 encoding.
Do you get the same result if you omit convert()?
I would expect so.
> After converted, the char with dump number "133" is converted to char
> with dump number "194,133". It is not same as the char default
> replacement character ¿.
You are right (I admit that I didn't test it).
Obviously Oracle interprets hex 85 in WE8ISO8859P1 as
UNICODE hex 85, which is a control character ("next line").
> I am not sure if "133" in WE8ISO8859P1 can map to "194,133" in
> AL32UTF8.
> OR
> if "133" in WE8MSWIN1252 can map to "194,133" in AL32UTF8.
UNICODE hex 85 can be converted to AL32UTF8, and the result is
hex C285 or binary 194,133.
The problem is that this is NOT the "horizontal ellipsis"
character that you want.
And this is what csscan complains about:
Hex 85 is not a correct ISO8859-1 character.
In the light of your exp/imp experiment, I would say that you
are right to be suspicious of the csscan warning, because
Oracle contradicts itself here: on the one hand it warns you
(correctly) that hex 85 is not a valid character, on the other
hand it happily converts it into UNICODE "next line".
> But from result in UI, they are same.
>
> Could you help review it again?
Ok, the case hase become a little more convoluted, but the basic
things have not changed.
You see "correct" results in your user interface, but the
data in the database are wrong.
You store a Windows "horizontal ellipsis" character in a
WE8ISO8859P1 database, but your client character set is not
set to the (correct) WE8MSWIN1252 character set, so Oracle
interprets the hex 85 that it gets from Windows as
a WE8ISO8859P1 character, that is (depending on viewpoint)
either garbage or a control character, but certainly not
a horizontal ellipsis.
When csscan sees this, it judges the character as garbage.
exp/imp, however, interprets it as control character and
translates it to AL32UTF8 as such.
As long as you keep your Oracle client character set to
WE8ISO8859P1 *and* interpret the resulting character as
Windows character, everything will appear to be fine.
Basically you're doing the reverse of the above.
But it is a sleeping time bomb.
For example, you will never be able to use this database
with JDBC, because Java cannot use WE8ISO8859P1.
The character you want (horizontal ellipsis) is
hex E280A6 (or decimal 226,128,166) in AL32UTF8, and
that is what should be in your database if everything had
gone correctly.
I recommend that you do the following (provided you have
only Windows characters in your database:
- Backup the original database.
- Change the database character set from WE8ISO8859P1
to WE8MSWIN1252 using CSALTER as described in
http://download-uk.oracle.com/docs/cd/B19306_01/server.102/b14225/ch12scanner.htm#sthref1702
- Verify that csscan does not report any problems now.
- Now use exp/imp to transfer the data into your
AL32UTF8 database.
- Change your client character set to WE8MSWIN1252.
Then everything should look ok again, and in addition
it *is* ok.
Yours,
Laurenz Albe
I use convert function is to test becuase the result of convert is the
same as importing the same string from a WE8ISO8859P1 database to a
AL32UTF8 database. (From oracle document)
If the varchar2 can be converted to target characterset and can be
converted back too. That means that the varchar2 can be converted
successfully.
But as you said that when the char does not exist in db character set.
It is one problem. It can cause unpredictable conversion.
Fortunately, we use jdbc program to operate on oracle db. Java is
UTF-16 encoded, Data retrieved from or inserted into the database must
be converted from UTF-16 to the database character set or the national
character set and vice versa. (From oracle document)
Each character is converted from UTF-16 to WE8ISO8859P1. It also can
be converted from WE8ISO8859P1 to UTF-16. Because UTF-16 can be
converted to AL32UTF8, so all characters can be converted to AL32UTF8
characterset.
What do you think about?
I will test the way you metioned.
1. create one db with characterset WE8MSWIN1252, and insert data, and
convert it.
2. create one db with characterset WE8ISO8859P1, and insert data, and
alter database characterset to WE8MSWIN1252. Then convert it.
Thanks
Jacky
After tested, csscan also report application data exception when
prepare to change character set from WE8ISO8859P1 to WE8MSWIN1252, and
we cannot use csalter.plb to change the database character set.
Database character set
WE8ISO8859P1
FROMCHAR
WE8ISO8859P1
TOCHAR
WE8MSWIN1252
Scan NCHAR data?
NO
Array fetch buffer size
1024000
Number of processes
1
Capture convertible data?
NO
------------------------------
------------------------------------------------
[Data Dictionary individual exceptions]
[Application data individual exceptions]
User : JACKY
Table : AAA
Column: C1
Type : VARCHAR2(1000)
Number of Exceptions : 1
Max Post Conversion Data Size: 9
ROWID Exception Type Size Cell Data(first 30
bytes)
------------------ ------------------ -----
------------------------------
AAALaRAAEAAAAAQAAA lossy conversion sys.匨
ed
------------------ ------------------ -----
------------------------------
Ah, good to know.
> If the varchar2 can be converted to target characterset and can be
> converted back too. That means that the varchar2 can be converted
> successfully.
>
> But as you said that when the char does not exist in db character set.
> It is one problem. It can cause unpredictable conversion.
Right. Not "unpredictable", but the wrong data that are in your
database now will remain wrong.
> Fortunately, we use jdbc program to operate on oracle db. Java is
> UTF-16 encoded, Data retrieved from or inserted into the database must
> be converted from UTF-16 to the database character set or the national
> character set and vice versa. (From oracle document)
Java does not use UTF-16, but UCS-2.
However, this is probably not important in your case.
> Each character is converted from UTF-16 to WE8ISO8859P1. It also can
> be converted from WE8ISO8859P1 to UTF-16. Because UTF-16 can be
> converted to AL32UTF8, so all characters can be converted to AL32UTF8
> characterset.
>
> What do you think about?
I don't get it. What do you plan to do?
> I will test the way you metioned.
>
> 1. create one db with characterset WE8MSWIN1252, and insert data, and
> convert it.
> 2. create one db with characterset WE8ISO8859P1, and insert data, and
> alter database characterset to WE8MSWIN1252. Then convert it.
You must also set the correct client character set with NLS_LANG.
If the client character set is WE8ISO8859P1 and the database is WE8MSWIN1252,
you will not be able to insert character hex 85 (question mark syndrome).
Also if the client character set is WE8MSWIN1252 and the database is
WE8ISO8859P1, it will not work.
If both client and database character set are WE8ISO8859P1 (as it is
now), you can insert hex 85 but it is wrong.
Only if both client and database character set are WE8MSWIN1252,
the inserted hex 85 will be the correct character.
Yours,
Laurenz Albe
There's mot to changing the database character set, Metalink Note 555823.1
covers it in detail.
First, you have to be certain that all the non-ASCII characters in the database
are WE8MSWIN1252.
Then you would have to run csscan with FROMCHAR=WE8MSWIN1252 TOCHAR=WE8MSWIN1252.
There should be no errors.
Then you can convert the database.
Yours,
Laurenz Albe
On Apr 9, 5:44 pm, "Laurenz Albe" <inv...@spam.to.invalid> wrote:
> lsllcm wrote:
> > After tested, csscan also report application data exception when
> > prepare to change character set from WE8ISO8859P1 to WE8MSWIN1252, and
> > we cannot use csalter.plb to change the database character set.
>
> > Database character set
> > WE8ISO8859P1
> > FROMCHAR
> > WE8ISO8859P1
> > TOCHAR
> > WE8MSWIN1252
> > Scan NCHAR data?
> > NO
> > Array fetch buffer size
> > 1024000
> > Number of processes
> > 1
> > Capture convertible data?
> > NO
> > ------------------------------------------------------------------------------
>
> > [Data Dictionary individual exceptions]
>
> > [Application data individual exceptions]
>
> > User : JACKY
> > Table : AAA
> > Column: C1
> > Type : VARCHAR2(1000)
> > Number of Exceptions : 1
> > Max Post Conversion Data Size: 9
>
> > ROWID Exception Type Size Cell Data(first 30
> > bytes)
> > ------------------ ------------------ -----------------------------------
> > AAALaRAAEAAAAAQAAA lossy conversion sys.…Med
> > ------------------ ------------------ -----------------------------------
>
> There's mot to changing the database character set, Metalink Note 555823.1
> covers it in detail.
>
> First, you have to be certain that all the non-ASCII characters in the database
> are WE8MSWIN1252.
>
> Then you would have to run csscan with FROMCHAR=WE8MSWIN1252 TOCHAR=WE8MSWIN1252.
> There should be no errors.
>
> Then you can convert the database.
>
> Yours,
> Laurenz Albe- Hide quoted text -
>
> - Show quoted text -
yes, Java used UCS-2 initially, and added UTF-16 supplementary
character support in J2SE 5.0. I refer to the document.
http://en.wikipedia.org/wiki/UTF-16
Below command does work.
csscan \"sys as sysdba\" full=y fromchar=WE8MSWIN1252
tochar=WE8MSWIN1252
It can alter database characterset.
But a little more complex.
I use java to read the both WE8ISO8859P1 and WE8MSWIN1252 dbs
1 in WE8ISO8859P1 db, I print rs.getBytes("c1") array, the result is
as below, so it should be already unicode, and it does not do any
conversion.
125
115
121
115
46
-123 ===========>> as same as 256-123 = 133
77
101
100
0
2 in WE8MSWIN1252 db, I print rs.getBytes("c1") array, the result is
as below, looks it is converted after it is read.
125
115
121
115
46
-65 ===========>> as same as 256-65 = 191
77
101
100
0
If we use item 1 to convert, they are all wrong, but the UI are same
even they are wrong.
If we use item 2 to convert, the before is wrong, but after convert,
it is correct. but the UI will be different.
To be consistent, I choose item 1. At least, the data is not lost from
UI in both before and after..
If client cannot afford it, he/she should correct it at very early
time.
This is not one technical question, it is choose question.
Thank your help, it is helpful
I liked it when there was only one character set ASCII to deal with... :)
>
> I liked it when there was only one character set ASCII to deal with... :)
You mean, you aren't old enough to remember EBCDIC? :-)
(I was sooooo happy to get my first PDP-11/34 job with VT-52's...
Think fast, what's 8224?)
jg
--
@home.com is bogus. Now causing a flap in Old Town San Diego
gallery: http://media.signonsandiego.com/img/photos/2009/04/02/4_2-b4-bell_t350.jpg
Older classics: http://culturepopped.blogspot.com/2007/04/suddenly-last-supper.html
Yes, but it was generally only used (still is) on IBM mainframes. :)
Working at DEC, we had EBCDIC2ASCII translators as well as bidirectional
DECNET-2-SNAgateway translators. :)
That is good idea to only has US7ASCII character.
Do you mean that convert current all non-US7ASCII character into
US7ASCII char, then do the conversion?
Where can I find this kind of tool such as EBCDIC2ASCII translators?
Thanks
rs.getBytes()?? Why would you select a string as Bytes?
Can you show your code?
I inserted your byte sequence into a WE8ISO8859P1 database,
and selected it with getString() with JDBC, and I got what I expected:
sys.¿Med
> 2 in WE8MSWIN1252 db, I print rs.getBytes("c1") array, the result is
> as below, looks it is converted after it is read.
> 125
> 115
> 121
> 115
> 46
> -65 ===========>> as same as 256-65 = 191
> 77
> 101
> 100
> 0
>
> If we use item 1 to convert, they are all wrong, but the UI are same
> even they are wrong.
> If we use item 2 to convert, the before is wrong, but after convert,
> it is correct. but the UI will be different.
I do not understand.
Your example 2 does not seem to be correct, but I don't know what you did.
> To be consistent, I choose item 1. At least, the data is not lost from
> UI in both before and after..
>
> If client cannot afford it, he/she should correct it at very early
> time.
>
> This is not one technical question, it is choose question.
As I said, it's a time bomb.
Choose to either fix it now or maybe blow up later.
For example as soon as the customer wishes to insert a character that
is not in the Windows character set.
You can choose to ignore it, but I would at least tell my customer
that there are corrupt data in the database.
Yours,
Laurenz Albe
Since I (and anyone happening upon this later) can't tell if you are
joking, please understand that for many years, us7ascii was the
default for Oracle databases, often causing no end of trouble with the
sort of issue you have, ie, the client software allows entering
characters that are not in the db characterset. The trouble would
often happen when exp/imp would be used to import into an 8 bit
characterset, losing the 8th bit as Oracle oh-so-helpfully
mistranslates the characters. I'm sure this is one of the reasons for
csscan.
Laurenz seems to have good advice here, also dig into the
globalization manual, and check metalink for the notes on nls.
Ignore us old folk reminiscing about days long gone.
jg
--
@home.com is bogus.
"Remember rivers, Nellie?" - Firesign Theatre
Thanks for your comments:
1. I use rs.getBytes() to get raw data from db, not after converted.
The test on WE8MSWIN1252 is not correct. It occurs when I use 10.2.0.1
jdbc driver with db 10.2.0.4. After I use 10.2.0.4 jdbc driver, it
prints the bytes as same as data in db.
2. The rs.getString looks does some conversion. But I am clear how it
converts between oracle db and java client.
3. Because using System.out.println() should also do some conversion,
I choose jsp to test the case.
<-----------------------------------code begin
------------------------------>
HttpServletRequest h_req = (HttpServletRequest) request;
HttpServletResponse h_res = (HttpServletResponse) response;
h_res.setContentType("text/html");
h_res.setHeader("Pragma", "No-cache");
h_res.addHeader("Cache-Control", "no-cache");
h_res.addHeader("Expires", "Thu, 01 Jan 1970 00:00:01 GMT");
h_req.setCharacterEncoding("utf-8");
h_res.setCharacterEncoding("utf-8");
Context initContext = new InitialContext();
Context envContext = (Context)initContext.lookup("java:/comp/
env");
DataSource ds = (DataSource)envContext.lookup("jdbc/myoracle");
Connection conn = ds.getConnection();
PreparedStatement ps = null;
ResultSet rs = null;
String sql = "SELECT * from aaa ";
ps = conn.prepareStatement(sql);
rs = ps.executeQuery();
byte[] aaa = new byte[1024];
String bb = "";
while (rs.next())
{
aaa = rs.getBytes("c1");
bb = rs.getString("c1");
out.println(bb);
}
<-----------------------------------code end
------------------------------>
3.1 to connect to WE8ISO8859P1 db, it prints below in web page, the
hex 85 is displayed like space char.
sys. Med
3.2 to connect to WE8MSWIN1252 db, it prints below in web page, the
hex 83 is displayed as correct char
sys....Med
3.3 to connect to AL32UTF8 db, I insert two rows, one is from
WE8ISO8859P1 and another is from WE8MSWIN1252. it prints below in web
page,
SQL> select dump (c1) from aaa;
DUMP(C1)
--------------------------------------------------------------------------------
Typ=1 Len=9: 115,121,115,46,194,133,77,101,100
Typ=1 Len=10: 115,121,115,46,226,128,166,77,101,100
print in web page---------------------------------------------
sys. Med sys....Med
Your suggest is right, we need to make data correct even if it is
wrong currectly.
4. The rs.getString looks does some conversion. But I am clear how it
converts between oracle db and java client. From document B19306_01/
server.102/b14225/ch7progrunicode.htm#sthref924.
document
========================================================================================
Data Conversion for Thin Drivers
SQL statements are always converted to either the database character
set or to UTF-8 by the driver before they are sent to the database for
processing. When the database character set is either US7ASCII or
WE8ISO8859P1, the driver converts the SQL statement to the database
character set. Otherwise, the driver converts the SQL statement to
UTF-8 and notifies the database that a SQL statement requires further
conversion before being processed. The database, in turn, converts the
SQL statements from UTF-8 to the database character set. The database,
in turn, converts the SQL statement to the database character set.
========================================================================================
from document, I write one java application program and want to verify
it
==============================================Code begin
=====================================
byte[] aaa = new byte[1024];
while (rs.next())
{
bb = rs.getString("c1");
}
aaa = bb.getBytes();
for (int mm=0;mm<aaa.length;mm++)
{
System.out.println(aaa[mm]);
}
//print default java character set
System.out.println(Charset.defaultCharset() );
===================================code end
=============================================
4.1 test on WE8ISO8859P1 db
115
121
115
46
63 ---- looks convert to US7ASCII char, after verify convert
(c1,'US7ASCII')
77
101
100
--default java character set
windows-1252
4.2 test on WE8MSWIN1252 db.
115
121
115
46
-123 -- does not do any conversion
77
101
100
windows-1252
4.3 test on AL32UTF8 db
--first row
115
121
115
46
63 ---- looks convert to US7ASCII char, after verify convert
(c1,'US7ASCII')
77
101
100
--second row
115
121
115
46
-123 -- does not do any conversion
77
101
100
15
--default java character set
windows-1252
I cannot understand two sides
java uses UCS-2 (UTF-16) encoding, one char should have two bytes, but
when use String.getBytes, it converts to windows-1252. The result is
to print one byte for one char.
I will do the test again, and post here
Thanks for your suggestion again --:)
Yes, really thanks
-----------Code begin -------------------------
byte[] bRay = null;
char quote = '"';
int ndx;
// String sInitial = "Enc" + "\u4E94";
String sInitial = rs.getString("c1");
try { bRay = sInitial.getBytes("UTF-16"); }
catch( UnsupportedEncodingException uee )
{
System.out.println( "Exception: " + uee);
}
System.out.println( quote + sInitial + quote +
" String as UTF-16, " +
"bRay length: " + bRay.length + "." );
for( ndx = 0; ndx < bRay.length; ndx++ )
{
System.out.print( Integer.toHexString( bRay[ ndx++ ] ) + "
" );
System.out.print( Integer.toHexString( bRay[ ndx ] ) + "
" );
}
System.out.println("\n");
OutputStreamWriter osw = new OutputStreamWriter( System.out );
bRay = sInitial.getBytes();
System.out.println( quote + sInitial + quote +
" String as platform default - " +
osw.getEncoding() +
", bRay length: " + bRay.length + "." );
for( ndx = 0; ndx < bRay.length; ndx++ )
{
System.out.print( Integer.toHexString( bRay[ ndx ] ) + "
" );
}
System.out.println("\n");
try
{
sInitial = new String( sInitial.getBytes("UTF-8"), "UTF-8");
bRay = sInitial.getBytes("UTF-8");
}
catch( UnsupportedEncodingException uee )
{
System.out.println( "Exception: " + uee);
}
System.out.println( quote + sInitial + quote +
" String as UTF-8, " +
"bRay length: " + bRay.length + "." );
for( ndx = 0; ndx < bRay.length; ndx++ )
{
System.out.print( Integer.toHexString( bRay[ ndx ] ) + "
" );
}
-----------Code begin -------------------------
1. Below are test result when the db is AL32UTF8 when the row has data
"sys.…Med ."
-----------test result begin -------------------
"sys.匨ed" String as UTF-16, bRay length: 18..
fffffffe ffffffff 0 73 0 79 0 73 0 2e 20 26 0 4d 0 65
0 64
"sys.匨ed" String as platform default - Cp1252, bRay length: 8..
73 79 73 2e ffffff85 4d 65 64
"sys.匨ed" String as UTF-8, bRay length: 10..
73 79 73 2e ffffffe2 ffffff80 ffffffa6 4d 65 64
windows
-----------test result end ----------------------
2. Below are test result when the db is WE8ISO8859P1 when the row has
data "sys. Med ."
-----------test result begin -------------------
"sys.?Med" String as UTF-16, bRay length: 18.
fffffffe ffffffff 0 73 0 79 0 73 0 2e 0 ffffff85 0 4d 0
65 0 64
"sys.?Med" String as platform default - Cp1252, bRay length: 8.
73 79 73 2e 3f 4d 65 64
"sys.?Med" String as UTF-8, bRay length: 9.
73 79 73 2e ffffffc2 ffffff85 4d 65 64 windows-1252
-----------test result end ----------------------
4. I use FileOutputStream to print the data into file and open it by
using IE, the result is same as in JSP.
--------code begin------------
FileOutputStream fs=new FileOutputStream( "aa.htm");
OutputStreamWriter oswaa = new OutputStreamWriter( fs, "UTF-8" );
bb = rs.getString("c1");
oswaa.write(bb,0,bb.length());
--------code end------------
Conclusion:
The java client read data from oracle and translate it into UTF-16,
but how to print the data from java applicaiton, we need to do
encoding-decoding work.
194,133 is the control character hex 85 from WE8ISO8859P1,
converted to UTF-8, right.
> Typ=1 Len=10: 115,121,115,46,226,128,166,77,101,100
And 226,128,166 is the correct UTF-8 representation of
the "horizontal ellipsis" character.
> print in web page---------------------------------------------
> sys. Med sys....Med
>
> Your suggest is right, we need to make data correct even if it is
> wrong currectly.
I think that we have finally understood each other completly.
What I had been missing out on is that there is some post-processing
of the characters to display them in your web application.
That is of course also a place where things can go wrong,
even if you get the right characters out from the database.
Now that I have seen your Java code, I also understand how
you got the results you displayed:
You were using java.lang.String.getBytes(), which uses the
platform's default character set, in your case WINDOWS-1252.
So the correct result for your test case would be "-123"
for the weird character, which is the hex 85 you started with.
I tried with your string and the following Java code on Windows:
ResultSet rs = stmt.executeQuery("....");
rs.next();
System.out.println("Using getBytes():");
byte[] arr = rs.getBytes(1);
for (int i=0; i<arr.length; ++i)
System.out.println("Byte " + i + " = " + arr[i]);
System.out.println("Using getString().getBytes():");
arr = rs.getString(1).getBytes();
for (int i=0; i<arr.length; ++i)
System.out.println("Byte " + i + " = " + arr[i]);
System.out.println(Charset.defaultCharset());
and the result was:
Using getBytes():
Byte 0 = 115
Byte 1 = 121
Byte 2 = 115
Byte 3 = 46
Byte 4 = -30
Byte 5 = -128
Byte 6 = -90
Byte 7 = 77
Byte 8 = 101
Byte 9 = 100
Using getString().getBytes():
Byte 0 = 115
Byte 1 = 121
Byte 2 = 115
Byte 3 = 46
Byte 4 = -123
Byte 5 = 77
Byte 6 = 101
Byte 7 = 100
windows-1252
Right.
The java.sql.ResultSet.getBytes() method did not convert
anything and got the bytes in database encoding (AL32UTF8).
java.sql.ResultSet.getBytes() got the characters in UCS-2,
and java.lang.String.getBytes() converts that into
Windows encoding.
That will of course only work if the character exists
in Windows encoding. If you have anything else, e.g.
a "r" (UNICODE hex 0159, Czech r with hacek), the result
would be a question mark because Java cannot convert the
character to Windows 1252.
Yours,
Laurenz Albe
Yes, it is clear now.
Thanks
Jacky
No need for that - a database stores whatever you throw at it.
I can store "special" characters in a US7ASCII database.
And retrieve them.
Provided the tool set cat actually display your "special"
characters...
--
Regards,
Frank van Bortel
The AL32UTF8 is used to support multiple languages. It is multiple
bytes character set.
Thanks
You can choose to ignore and circumvent the character set
awareness of a database system and have your application handle
these issues. Oracle is pretty lax in that respect and will
let you do what you want.
Three disadvantages:
- You might to reinvent the wheel.
- You can get into trouble if the requirements change and
the database will need to be accessed and used in new
ways (JDBC, database link to AL32UTF8 database, ...).
- Oracle utilities may scream at you or behave in
surprising ways if the database does not contain what
they think it does.
Yours,
Laurenz Albe