
finding non-ASCII characters in the database


Floyd Wellershaus

Aug 20, 2009, 6:59:58 AM
to inform...@iiug.org
We're having some issues with ASCII characters when migrating tables over to Oracle. What would be the best way to determine which tables and columns contain non-ASCII data?

All I can think of right now is either to unload the whole database and run the Unix od command (or some grep for non-ASCII characters) over the output, or to run a query against every table with a WHERE clause that selects rows containing non-ASCII characters.
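A minimal sketch of the unload-and-scan idea in Java (standing in for od/grep); the unload file name is a placeholder:

```java
import java.io.FileInputStream;
import java.io.IOException;

// Scan an unload flat file byte by byte and report anything outside the
// 7-bit ASCII range. "customers.unl" is a placeholder file name.
public class ScanUnloadFile {
    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "customers.unl";
        try (FileInputStream in = new FileInputStream(path)) {
            int b;
            long offset = 0;
            int line = 1;
            while ((b = in.read()) != -1) {
                if (b > 0x7F) {
                    System.out.printf("line %d, offset %d: byte 0x%02X%n", line, offset, b);
                }
                if (b == '\n') line++;
                offset++;
            }
        }
    }
}
```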

Any ideas would be appreciated.

Thanks,
floyd

Ian Michael Gumby

unread,
Aug 20, 2009, 11:08:16 AM8/20/09
to fl...@fwellers.com, inform...@iiug.org
Floyd,

First, may I offer you my condolences on migrating to Oracle. Yeah, I support Oracle, but it's like comparing an Abrams tank to a T-72, and if you watched the Iraq wars, you saw how well the T-72 did. :-P

But to answer your question...

You can't do what you want to do directly. If you dump the database, I don't think you can get od to work for you (although it may...).

What I was thinking was writing an ESQL/C program.

Starting at your system tables, for each table in a given database, find the columns that are character fields (char, varchar, etc.). Write and prepare SELECT statements that pull all of those columns from each table, then parse the returned fields looking for non-ASCII characters. Dump the row to a flat-file log if you find any. I believe you can use the isalnum() function from ctype.h, or write your own check.
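A sketch of that catalog-driven scan, swapping JDBC metadata in for the ESQL/C system-table queries to keep it short; the connection URL and credentials are placeholders:

```java
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

// Find every CHAR/VARCHAR column via JDBC metadata, then select each one
// and log any value containing a character above the 7-bit ASCII range.
// JDBC stands in here for the ESQL/C program described above; the URL and
// credentials are placeholders, and system tables are not filtered out.
public class FindNonAscii {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:informix-sqli://host:9088/mydb:INFORMIXSERVER=myserver";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            List<String[]> charColumns = new ArrayList<>();
            try (ResultSet cols = conn.getMetaData().getColumns(null, null, "%", "%")) {
                while (cols.next()) {
                    int type = cols.getInt("DATA_TYPE");
                    if (type == Types.CHAR || type == Types.VARCHAR) {
                        charColumns.add(new String[] {
                            cols.getString("TABLE_NAME"), cols.getString("COLUMN_NAME") });
                    }
                }
            }
            for (String[] tc : charColumns) {
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SELECT " + tc[1] + " FROM " + tc[0])) {
                    while (rs.next()) {
                        String value = rs.getString(1);
                        if (value != null && !value.chars().allMatch(c -> c < 128)) {
                            System.out.println(tc[0] + "." + tc[1] + ": " + value);
                        }
                    }
                }
            }
        }
    }
}
```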

Shouldn't be too hard. (Dare I say trivial?)

But the larger question is: why should you be having issues with characters above the ASCII range at all?

Internally I believe Oracle supports UTF-8. So how are you exporting and importing the data?

-G




Floyd Wellershaus

Aug 20, 2009, 12:25:37 PM
to Ian Michael Gumby, inform...@iiug.org
Yeah Gumby, thanks for the condolences. It'll take us roughly five years to be fully off of Informix, so I have a while to grow to, er, love (cough) Oracle.

It seems that with our en_US.819 (ISO 8859-1) character set, accented characters are stored as a single byte each. In other words, a 25-character name containing some accented characters stores fine in a char(25) field. But when we bring that data over with our Oracle ODI ETL process (basically a table-to-table select), it won't fit in a char(25) field on Oracle, because those characters become multi-byte there.
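The mismatch is easy to see with a worked example; the name below is made up, but it is exactly 25 characters, 25 bytes in ISO 8859-1, and 28 bytes in UTF-8:

```java
import java.nio.charset.StandardCharsets;

// A 25-character name with three accented letters fits char(25) when each
// character is one byte (ISO 8859-1 / code page 819), but grows to 28 bytes
// once re-encoded as UTF-8, overflowing a char(25) measured in bytes.
public class LengthDemo {
    public static void main(String[] args) {
        String name = "Müller-Lüdenscheidt, Jörg";
        System.out.println(name.length());                                     // 25 characters
        System.out.println(name.getBytes(StandardCharsets.ISO_8859_1).length); // 25 bytes
        System.out.println(name.getBytes(StandardCharsets.UTF_8).length);      // 28 bytes
    }
}
```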

I'll check out isalpha() and isalnum().
thanks,
floyd




Ian Michael Gumby

Aug 20, 2009, 2:23:27 PM
to fl...@fwellers.com, inform...@iiug.org
Floyd,

That's not right.
Arrgh! Now I have to think back about internationalization...

I'm going from memory and you may want to ping this off Mark Townsend ...

You should be able to set up the correct character set. As I said, internally Oracle should support UTF-8, so a char(25) should hold 25 UTF-8 characters and not just 25 bytes. (With Oracle that means character-length semantics: declare the column char(25 char) or set NLS_LENGTH_SEMANTICS=CHAR, since the default length semantics are bytes.)

You could go to the OTN (Oracle Technology Network) site; there are a couple of open forums there too.

It sounds like you have a homegrown ETL tool. Using the Oracle load process, you should be able to have Oracle divert the rows it can't load into a reject file, or you could take the error message and then vi the flat file being loaded to find and remove the 'bad' rows.

But even so, if you're doing a select * from Informix table A and then inserting into Oracle's table A, you could write a very trivial Java application to do this.

Java strings are Unicode internally. Fernando can correct me, but when you select data from Informix via the JDBC driver you should get a Unicode representation of the data, and when you insert into Oracle the driver should hand Oracle Unicode as well. (You may have to check some settings, but the basic idea should work.)
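A minimal sketch of that select-and-insert loop, assuming both vendors' JDBC drivers are on the classpath; the URLs, credentials, and table layout are all placeholders:

```java
import java.sql.*;

// Copy one table from Informix to Oracle through JDBC. The application only
// ever sees Java (Unicode) strings, so each driver handles its own database's
// encoding. URLs, credentials, and the column list are placeholders; batching
// and error handling are omitted for clarity.
public class CopyTable {
    public static void main(String[] args) throws SQLException {
        try (Connection ifx = DriverManager.getConnection(
                 "jdbc:informix-sqli://ifxhost:9088/mydb:INFORMIXSERVER=myserver",
                 "user", "password");
             Connection ora = DriverManager.getConnection(
                 "jdbc:oracle:thin:@orahost:1521/ORCL", "user", "password");
             Statement sel = ifx.createStatement();
             ResultSet rs = sel.executeQuery("SELECT id, name FROM customer");
             PreparedStatement ins = ora.prepareStatement(
                 "INSERT INTO customer (id, name) VALUES (?, ?)")) {
            while (rs.next()) {
                ins.setInt(1, rs.getInt("id"));
                ins.setString(2, rs.getString("name"));
                ins.executeUpdate();
            }
        }
    }
}
```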

My guess is that your current process is not handling each multi-byte character as a single character, but as multiple characters. So instead of a string of 25 characters, you're seeing more.

Does that make sense?

The point is that I don't think you should be feeling all of this pain. The solution should be pretty straightforward.

-G




Floyd Wellershaus

Aug 20, 2009, 4:24:37 PM
to Ian Michael Gumby, inform...@iiug.org
Thanks again, Michael. I don't fully understand all of this, but my big takeaway from your post is that something in our settings or environment may be handling the multi-byte characters coming from Informix as multiple characters. That seems logical.

I am not doing the ETL, but I think we're just using Oracle's ODI product. I think it connects via JDBC and runs an Oracle SQL*Loader process to get the rows into Oracle. We're letting ODI do all of this because we want to capture changes during the migration, etc. ODI puts triggers on all the tables selected for the migration of a given app, stores any DML into its own temp schema, and processes it based on our mappings and whatever else we tell it to do. It's a pretty cool product.

Thanks,
floyd

