
Loading a data file containing character fields with different encodings


gerrit....@gmail.com

Sep 12, 2008, 8:21:30 AM
Given a DB created on IDS 11.50 running on XP with

DB_LOCALE=en_us.utf8

and given a table t

create table t(
s1 TEXT in blobdbs,
s2 VARCHAR(254,100)
)

and given an import file containing the hexadecimal sequence

C4 7C C3 84 7C 0D 0A

in other words, given an import file that contains the German umlaut 'Ä' in
both 8859-1 (C4) and UTF-8 (C3 84) encoding in the same line of data (same
record), I do not manage to get the data loaded using the 'load' or the
'dbimport' command. Both commands rightfully break.

In reality the data file is large (> 1 GB), thus impossible for me to inspect
manually, and it contains all sorts of character data in languages like
Maltese, thus impossible for me to understand and modify.

I know the file has been produced by some IDS from a table as above, but I
lack a line of communication to the producers and certainly lack knowledge and
experience with the IDS product. To me it is astonishing to find two encodings
in one record. May I ask the experts for a suggestion on what to do, or where
to read up, to get this data loaded as is?

Thank you in advance, Gerrit

Fernando Nunes

Sep 12, 2008, 9:32:10 AM

Two different codesets in the same record is a dead end... well... almost...
Let's see...

- The data could be considered "corrupt" as it's impossible for the DB to
handle it as it should be
- I can't think of an application that can handle that...

But you say it was unloaded from an IDS instance... Well, this is a long story,
but it's possible...
Now you want to "load as is"... Can you tell us what errors you are getting?
And which CLIENT_LOCALE and DB_LOCALE are you using?
If you load the data with DB_LOCALE = CLIENT_LOCALE it will not do codeset
conversion. But depending on the client tool you may get some errors...

Regards

--
Fernando Nunes
Portugal

http://informix-technology.blogspot.com
My email works... but I don't check it frequently...

Ian Michael Gumby

Sep 12, 2008, 10:00:17 AM
On Sep 12, 8:32 am, Fernando Nunes <domusonl...@gmail.com> wrote:

If you're storing UTF-8 in the database, then going from database to database
with no outside application, the data should be OK.

If you're unloading, then the database will/should convert to the CLIENT_LOCALE
character set, or in Oracle to whatever NLS_LANG is set to.
This can cause problems: if it's set to a character set encoding that can't
represent your characters, data is lost and replaced with a strange character.
You can't convert it back.

On Unix, we have set our encoding to .AL32UTF8, which should allow you to
properly show 'latin-1' character sets.

In writing applications to manipulate or move data, we've used Python.
The trick is to read from the data source, convert via
unicode(string,'utf8') to get a generic unicode object representation,
then write using the proper encoding. You will have trouble seeing
certain encoded sets like Thai, but if you write the file out as UTF8,
there are a couple of tools out there that will let you see it.

If you're starting with your file and it's in UTF-8, write a small Python
script using the codecs module and open the file to read UTF-8; then you can
look at it.
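A minimal sketch of that kind of check (Python 3 syntax here, reading the raw
bytes rather than going through the codecs module; the file name is a
placeholder):

# Scan an unload file and report which lines are not valid UTF-8,
# instead of stopping at the first bad byte.
with open("export.unl", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            print("line", lineno, ": not valid UTF-8 -", exc)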

If you're really good, you can open a Python interpreter shell and do this by
hand; then you can manipulate what you see.
And yeah, Cyrillic (Russian), Thai, Greek, etc. are all a royal pain...
why can't everyone learn the Queen's English? Then we'd sort of speak
the same language, only some will have a nicer accent than
others... :-)

HTH

-G

Unicode is a royal pain to work with and there are a couple of things
that can trip you up in Python.

HTH

-Mike

Fernando Nunes

Sep 12, 2008, 10:22:35 AM
Ian Michael Gumby wrote:

> If you're unloading, then the database will/should convert to the CLIENT_LOCALE
> character set, or in Oracle to whatever NLS_LANG is set to.
> This can cause problems: if it's set to a character set encoding that can't
> represent your characters, data is lost and replaced with a strange character.
> You can't convert it back.
>

In recent IDS versions, if there is no possible conversion you'll get an error
instead of a "strange character"... This is part of the "long story".
There is a parameter to reinstate the old (wrong) behavior.
Regards.

gerrit....@gmail.com

Sep 12, 2008, 10:19:45 AM
Fernando-

yes, sure. Loading it with

DB_LOCALE=en_us.utf8
CLIENT_LOCALE=en_us.utf8

I get:

34389: Illegal character has been found in the input string
847: Error in load file line 1.

It breaks at the first occurrence of a non-ASCII character (the C4) in the
field s1 of type TEXT. If there is only ASCII in field s1, it accurately loads
any UTF-8 characters given in the field s2 of type VARCHAR.

However when loading it with

DB_LOCALE=en_us.utf8
CLIENT_LOCALE=EN_US.CP1252

no error is reported. However, the data encoded as UTF-8 (the C3 84) in the
field s2 of type VARCHAR is not imported accurately: you get the typical
'Ã„' mojibake at display.


Thanks



Fernando Nunes

Sep 12, 2008, 11:07:53 AM

Let's go back a little...
You're loading two columns, TEXT and VARCHAR.
When you say you have both codesets in the same record, you mean you have one
codeset in one column and another in the other column?
The error just happens on the TEXT column?

I'm afraid I didn't quite understand your problem...
Regards,

Ian Michael Gumby

Sep 12, 2008, 12:18:30 PM
to Fernando Nunes, inform...@iiug.org
Huh?

Not at all.

It's very possible to have data encoded in two different code sets.
You can have column 1 containing a 'latin-1' transliteration of a street name
while column 2 contains the street name in the character set of the country
(Thai, Cyrillic, Greek, Arabic, Hebrew, whatever).

Perhaps the standard database tools will have issues in loading, but you can
write very simple Python scripts to do this.
(And yes, I have done exactly this to solve this problem.)

The python solution is as follows:

1) Using the codecs module, open the file as a UTF-8 encoded file.
2) Read the line in using the utf-8 encoding.
3) Try to use the unicode() method to convert from utf-8.
3a) If that fails, try the unicode() method to convert from 'latin1' (I'm
assuming that the ISO character set is OK under latin-1).
4) Now you have the column data in a generic unicode format.
5) Insert the data using the database's encoding.

This works using cx_Oracle and Oracle.
I can use pretty much the same script against IDS, of course using the correct
Python adapter from Carsten Haese.
It will load your data and work in roughly the same time as the standard SQL loader.

Pretty straightforward.
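A rough sketch of the idea (Python 3 syntax; it reads raw bytes rather than
opening through codecs so that invalid UTF-8 can be caught per field; the file
name and the pipe-delimited two-field layout are assumptions taken from the
thread, with field 1 in latin-1/CP1252 and field 2 in UTF-8):

def to_unicode(field):
    # Step 3/3a: try UTF-8 first, fall back to latin-1 if the bytes are not valid UTF-8.
    try:
        return field.decode("utf-8")
    except UnicodeDecodeError:
        return field.decode("latin-1")

rows = []
with open("export.unl", "rb") as f:          # hypothetical file name
    for raw in f:
        fields = raw.rstrip(b"\r\n").split(b"|")
        rows.append(tuple(to_unicode(x) for x in fields if x != b""))

# Step 5 would then hand `rows` to whatever DB-API adapter you use
# (cx_Oracle, Carsten Haese's InformixDB, ...), e.g. via cursor.executemany(),
# with the connection/database encoding set to UTF-8.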

Fernando Nunes

Sep 12, 2008, 12:28:30 PM

You're assuming the Database was created with UTF8?

Ian Michael Gumby

Sep 12, 2008, 12:53:28 PM
to Fernando Nunes, inform...@iiug.org
I don't think that it's an invalid assumption.
The data is coming from one database that contains UTF-8 characters and it appears that he's attempting to load
UTF-8 characters along with Latin-1 characters.

From what was described, it seems that the loader utility barfs on the fact that there are both latin-1 and utf-8 encodings within the same record.

I'm going from memory and a fast read of the e-mails, but I thought the OP had said that he could try different locales and either one or the other field would read correctly. Which would imply it's the fact that there are two encodings in the same file.

He could write a simple script to split the file into two files: one containing the latin-1 character set column, the second containing the utf-8 column, and of course both files containing the primary key information. Then use the loader tool.

However, it would be just as easy to write the loader script that converts the encoding to a "unicode" intermediate format and then load with the correct database encoding.

Depending on how fancy you want the script to get, you can do a lot of different things. You can even write a TCL/TK front end to do some GUI input work...

The point is that it's not a major deal, at least as described.

HTH

-G




Fernando Nunes

Sep 12, 2008, 1:13:05 PM

I was asking you... after reading your post... You say it's normal to have
different codesets in the same record... I want to understand how you save it
in the database.

Ian Michael Gumby

Sep 12, 2008, 1:55:20 PM
to Fernando Nunes, inform...@iiug.org

> I was asking you... after reading your post... You say it's normal to have
> different codesets in the same record... I want to understand how you save it
> in the database.
>

Well in a perfect world, everything in the database would be stored as UTF-8.
Latin-1 character sets would be re-encoded to utf-8. And we all live happily ever after.

However, we don't live in a perfect world. It's possible that data could get stored as 'latin-1' in the database or as 'utf-8'.
In input files, it's possible to have one column in 'latin-1' and another column in 'utf-8'.

Note: "normal to have different code sets in the record" was in reference to the input file.
When you have an issue of transliteration, it's possible that the individual who's entering the "English" version has their configuration wrong, or they could be updating a spreadsheet.

The point is that even in a database, it's possible that you have column A containing unicode characters, and then when a user updates a phonetic or transliteration field they have the wrong encoding, so the data gets stored as latin-1 and not utf-8.

Normal also in the sense that it happens on a routine basis.

-G




gerrit....@gmail.com

Sep 12, 2008, 2:19:06 PM

Ian, Fernando -

thank you both for your help.

Fernando -

Yes, you got it. Within one line of data there is one column encoded as
non-UTF-8; this column goes to a field of type TEXT. And there is another
column within the same line of data that is encoded as UTF-8; this column
goes to a field of type VARCHAR.

I am told the source of the data file is a 'UTF-8 DB' on an older Informix
version (7.31, I am told) running on UNIX (Solaris, I assume). It would not be
surprising if the data in the TEXT column is not kept accurately at the source
DB already, since it is not made use of. But it is kept!

Could you please elaborate a little on how to enforce the 'old (wrong)
behaviour' you
mentioned?


Ian -

Yes, your suggestions are convincing. However, they do not make me happy yet
because

1) I can only guess what the non-UTF-8 codesets in the (huge) file are. I can
see from the data that it refers to 22 languages, and there are 'funny'
encodings other than 8859-1 among them.

2) I am reluctant to modify the data files since I am trying to set up
a system as close as
possible to the source system the data was exported from.


Fernando Nunes

Sep 12, 2008, 2:29:50 PM

OK. When I said that having more than one codeset in one record was a dead end,
I was thinking of the database side (the OP said that the file was unloaded
from an IDS instance). Then you said "Huh? Not at all", but you were thinking
of the input file.

So now I'm OK... :)

Fernando Nunes

Sep 12, 2008, 2:34:36 PM

Then again... you say:

"The point is that even in a database, its possible that you have column A
containing unicode characters and then when a user updates a phonetic or
transliteration field, they have the wrong encoding so that the data gets
stores as latin1 and not utf-8."

And this doesn't look good to me... In recent versions of IDS this will not
happen. In older versions it would. That's the "long story".

Currently you can't connect to a database if you have an incorrect DB_LOCALE.
So the codeset in the database should be OK. Of course, if you have an
incorrect CLIENT_LOCALE and you get your input from a different codeset you
will have problems, but whatever is stored in the database will be in the
codeset specified by DB_LOCALE.

Regards.

Fernando Nunes

Sep 12, 2008, 3:23:00 PM

The problem you're having will probably not be solved by what I was thinking...
At least not completely. Let's take the long story path:

Older versions of IDS would do two bad things:
1- It would accept a connection from a client with the wrong DB_LOCALE setting.
This causes a very common issue: you create a DB on Unix with the default
codeset (8859-1, AKA latin-1). Then you use Windows applications and you don't
configure the correct DB_LOCALE and CLIENT_LOCALE settings
(DB_LOCALE=en_us.819; CLIENT_LOCALE=en_us.CP1252), so it used
DB_LOCALE=CLIENT_LOCALE=en_us.CP1252. This meant that no conversion was done,
but you were putting CP1252 characters into an 819 database!

2- Whenever DB_LOCALE != CLIENT_LOCALE codeset conversion must happen. But
there are cases when there is no possible conversion (several CP1252 characters
don't have 819 representation). In these cases, depending on the codeset
conversion tables (specific from/to each codeset pair), it could do one of
these options (I believe I don't remember them all):
a) All characters without possible conversion will be converted into a
substitute character. So it's not possible to convert them back (!)
b) Each character without possible conversion will be converted to a
different "unused" character in the destination codeset. This would allow
conversion back
c) The character would be converted to a "similar" character
d) ?... I think there were four... but I may be wrong ;)

So in recent versions this was changed. AFAIK all 11.x versions don't behave
like this. In v10, issue 1 was fixed probably in 10.00.xC4 (but I'm not sure)
and issue 2 was fixed in 10.00.xC7.
In v9 I would have to look...

Note that CSDK also has implications on issue 1, and I think 2.90.TC6 was
changed so that it looks at the database's real LOCALE in order to choose the
default value for DB_LOCALE.

How can we revert this new (and correct) behavior...
For issue 1, set IFMX_UNDOC_B168163 to anything in the engine environment and
restart it.

For the second issue, if you want the old behavior you have to set an option in
the ONCONFIG. I would have to check the name (it was referenced under one name
but apparently it was changed), but YOU DON'T WANT TO DO THIS ;)

So... From scratch. You have a file with two columns, one using UTF8 and the
other latin-1. And you want this data in a table within a UTF8 database.

I think you have two options:
1- Insert your data as UTF8 and get a correct database.
2- Insert your data "as is" and keep having a... err... corrupted (?) database

Solution for option 1:

1- Generate a third column in your file. You'll use it to uniquely identify
your rows
2- Create a table temp_utf8_col with two columns (integer, column using UTF8)
3- Create a table temp_CP1252_col with two columns (integer, column using CP1252)
4- Split your file in two (sequential column + utf8 and sequential column +
8859-1); see the Python sketch below
5- Set DB_LOCALE=CLIENT_LOCALE=en_us.utf8 and load the utf8 file into temp_utf8_col
6- Set DB_LOCALE=en_us.utf8 and CLIENT_LOCALE=en_us.CP1252 and load the CP1252
file into temp_CP1252_col
7- connect to the database and:
INSERT INTO final_table
SELECT col_utf8, col_88591
FROM temp_utf8_col utf8, temp_CP1252_col cp1252
WHERE utf8.col1 = cp1252.col1

You may have problems with this if there are CP1252 characters that don't have
representation in UTF8. This may seem impossible, but if you look closely at
some CP1252 characters... try writing a "-" in MS Word, or "quote... text...
quote" and you'll see what I mean

Solution for option 2:
- Set the variable mentioned above for issue 1
- restart the engine
- Follow the steps as for option 1, but when setting DB_LOCALE/CLIENT_LOCALE
use the same value for both. In other words, change step 6 to:
- Set DB_LOCALE=CLIENT_LOCALE=en_us.CP1252 and load the CP1252 file into
temp_CP1252_col

Obviously this serves as an example. You can do the steps in a different way,
create indexes on col1 on one of the tables for efficiency etc.

But:
- Option 2 is a dirty trick... Your data will be "corrupted" in some way
- Option 1 may mean that you won't be able to load the file with CP1252 unless
you change some data (Ian's idea of using a script language to convert to
unicode may be good, but if there are codes without conversion I don't know how
it handles them...)
- The variable to revert to old behavior may not be available on IDS 11.50. I'd
have to check...
- It would help to check the original environment taking into consideration the
"long story". If you understand how the data got "mixed up" it may help you to
fix it...

Well... I think that's it... I should put this in the blog...

Regards,

Ian Michael Gumby

Sep 12, 2008, 3:45:54 PM
On Sep 12, 1:34 pm, Fernando Nunes <domusonl...@gmail.com> wrote:

> Currently you can't connect to a database if you have an incorrect "DB_LOCALE".
> So, the codeset in the database should be ok. Of course if you have an
> incorrect CLIENT_LOCALE and you get your input from a different codeset you
> will have problems, but whatever is stored in the database will be in the codeset
> specified by DB_LOCALE.
>
> Regards.
>

Correct. This is also somewhat true of Oracle.

In dealing with Brazilian road names, it's possible that the data was entered
into a spreadsheet and contains 'latin-1' characters (instead of utf-8 data).
If the environment variables are not set correctly, you can get latin-1
characters entered into the database, which in Oracle should be storing the
columns as UTF-8 characters.

If you're getting your data from a third-party vendor, it's possible that
while you have a UTF-8 character set in one column (Thai, Cyrillic, etc.) and
another column of transliteration data, the transliteration data could be in
latin-1. So you have to be careful!

This is why I really hate Unicode and wish that everyone wrote and
spoke the Queen's English. :-)

Obnoxio The Clown

Sep 12, 2008, 4:11:21 PM
to informix-list@iiug.org
Ian Michael Gumby wrote:
> This is why I really hate Unicode and wish that everyone wrote and
> spoke the Queen's English. :-)

The ironing is delicious.

--
Cheers,
Obnoxio the Clown

http://obotheclown.blogspot.com

Fernando Nunes

Sep 12, 2008, 4:33:09 PM

Yes... but Bossa Nova in English would sound horrible ;)

Jonathan Leffler

Sep 13, 2008, 8:57:10 AM
On Sep 12, 5:21 am, gerrit.schu...@gmail.com wrote:
> Given a DB created on IDS 11.50 running on XP with
>
> DB_LOCALE=en_us.utf8
>
> and given a table t
>
> create table t(
>     s1 TEXT in blobdbs,
>     s2 VARCHAR(254,100)
> )
>
> and given an import file containing the hexadecimal sequence
>
> C4 7C C3 84 7C 0D 0A
>
> in other words given an import file that contains the german Umlaut 'Ä'
> in both 8859-1 (C4) as well as UTF-8 encoding (C3 84) in the same
> line of data (same record) I do not manage to get the data loaded using
> the 'load' or the 'dbimport' command. Both commands rightfully break.

There's a lot of discussion below, but there isn't much in the way of
fundamental analysis.

The DB_LOCALE is shown as en_us.utf8; what is not mentioned is the
CLIENT_LOCALE. There are two options, roughly:

1) CLIENT_LOCALE=en_us.cp1252 (since you are on Windows XP -- I'd use
8859-1 or -15 on Unix)
2) CLIENT_LOCALE=en_us.utf8

In the first case (CP1252), the load file data is interpreted as:

C4 U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS
7C U+007C VERTICAL LINE
C3 U+00C3 LATIN CAPITAL LETTER A WITH TILDE
84 U+201E DOUBLE LOW-9 QUOTATION MARK
7C U+007C VERTICAL LINE
0D U+000D CARRIAGE RETURN
0A U+000A LINE FEED

Since you're running on Windows XP, the CRLF line ending is OK; it
would cause some issues on Unix. Anyway, the data provided has a
meaning in CP1252. [See http://www.unicode.org/charts/PDF/U2000.pdf,
http://en.wikipedia.org/wiki/Windows-1252, and
http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT]

In the second case (UTF8), the file is simply badly formed - it
contains an invalid sequence of bytes in UTF8. Specifically, the byte
0xC4 must be followed by a byte in the range 0x80..0xBF, but is
followed by 0x7C. That is invalid UTF8 data and the only thing any
conformant Unicode program can do is reject the data as malformed.
(Side note: valid UTF8 files cannot contain bytes 0xC0 or 0xC1 at all;
the only characters that can be encoded with those are in the range
0x00..0x7F and the UTF8 format requires that the data be formatted
using the shortest byte sequence that will serve. There are security
risks, in particular, in allowing 0xC0 0x80 as a synonym for 0x00, for
example, so the Unicode consortium defines that 0xC0 0x80 is malformed
and must be rejected by a conformant application.)
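A quick way to see both interpretations (a minimal Python 3 sketch; the byte
values are the ones from the original post):

# Decode the posted byte sequence as UTF-8 (fails) and as CP1252 (succeeds).
data = bytes([0xC4, 0x7C, 0xC3, 0x84, 0x7C, 0x0D, 0x0A])

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc)        # 0xC4 is not followed by a continuation byte

print(repr(data.decode("cp1252")))      # 'Ä|Ã„|\r\n' -- every byte has a CP1252 meaning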

As far as IDS (GLS) is concerned, the data in the file is in one code
set -- either CP1252 or UTF8. You can choose which code set it is,
but all the data is in that one code set, and if you choose to treat
it as UTF8, then you have mal-formed data. Unfortunately, a single
file cannot contain data in two code sets -- or, more precisely, IDS
(GLS) will treat each file as containing a single code set, even if
you want to interpret some parts of the file as being in one code set
and others as being in another. The same applies to the database -
the character data in the database is in a single code set as far as
IDS is concerned - the DB_LOCALE determines which code set.

> In reality the data file is large (> 1 GB), thus impossible for me to inspect
> manually, and it contains all sorts of character data in languages like
> Maltese, thus impossible for me to understand and modify.

Someone or something is going to have to disambiguate the data. If
you want the data to be treated as UTF8, then you will have to get
hold of a rather carefully written parser that can handle the Informix
load format rules, work with separating the data into fields, and then
translate the CP1252 portion (first field) into UTF8 while leaving the
second field unchanged (assuming it is valid UTF8 -- it should flag
invalid UTF8 sequences). The other transform - converting the UTF8
data to a SBCS - is not really possible unless the repertoire of UTF8
characters used is in fact a subset of some SBCS encoding.
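For the simple two-field layout from the original post (and ignoring the
Informix unload format's backslash-escaping rules for embedded pipes, which a
real converter would have to honour), a rough Python 3 sketch of that
per-field transcoding might look like this; the file names are placeholders:

# Re-encode field 1 from CP1252 to UTF-8 and leave field 2 untouched,
# flagging any field that is supposed to be UTF-8 but is not.
with open("dump_mixed.unl", "rb") as src, open("dump_utf8.unl", "wb") as dst:
    for raw in src:
        fields = raw.rstrip(b"\r\n").split(b"|")
        out = [fields[0].decode("cp1252").encode("utf-8")]  # first field: CP1252 -> UTF-8
        for field in fields[1:]:
            field.decode("utf-8")                           # raises if not valid UTF-8
            out.append(field)
        dst.write(b"|".join(out) + b"\n")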

I have bits and pieces that could be assembled to do the job.
Specifically, I have a program sbcs2utf8 for mapping data encoded in
an arbitrary single-byte code set (SBCS) to UTF8 (hence the name).
The CP1252.TXT file referenced above would almost do as the input file
- the program currently requires a simple 2-digit hex number such as
84 in the first column and the U+201E notation in the second column
where the CP1252.TXT file contains 0x84 0x201E. I suppose I should
modify the program so that the Unicode mapping files are directly
usable, though all it takes is a simple sed script to convert the
Unicode file to my format. The code in that program would handle the
SBCS to UTF8 mapping of the first column. There is code in SQLCMD
that would handle the data splitting. However, it would be a non-trivial
exercise in hacking to extract and mangle the code. If you
can find some code somewhere else to do the job, it will probably be
easier. However, most programs are designed to work with one code set
at a time.

> I know the file has been produced by some IDS from a table as above.

Probably by working with a database where the code set was a single-
byte code set (CP1252 or 8859-1) and in that database, IDS thinks
everything is in that code set, and only the applications know or care
that some parts are in UTF8 and other parts in CP1252. IDS certainly
has no knowledge of the 'UTF8 coding' in the second column, and cannot
enforce the correctness of the data, therefore. The fact that some
application knows to treat the second column as UTF8 data is
coincidental.

> But I lack a line of communication to the producers and certainly lack
> knowledge and experience with the IDS product. To me it is astonishing to
> find two encodings in one record. May I ask the experts for a suggestion on
> what to do or where to read up to get this data loaded as is?

Good luck...

-=JL=-
