Given a DB created on IDS 11.50 running on XP with
DB_LOCALE=en_us.utf8
and given a table t
create table t(
s1 TEXT in blobdbs,
s2 VARCHAR(254,100)
)
and given an import file containing the hexadecimal sequence
C4 7C C3 84 7C 0D 0A
in other words, given an import file that contains the German umlaut
'Ä'
in both 8859-1 (C4) and UTF-8 (C3 84) encoding in the same
line of data (same record), I do not manage to get the data loaded
using
the 'load' or the 'dbimport' command. Both commands rightly fail.
In reality the data file is large ( > 1GB ), thus impossible for me to
inspect manually,
and it contains all sorts of character data in languages like
Maltese,
thus impossible for me to understand and modify.
I know the file has been produced by some IDS from a table as above.
But I lack the line of communication to the producers and certainly
lack
knowledge and experience with the IDS product. To me it is
astonishing to find
two encodings in one record. May I ask the experts for a suggestion on
what to do,
or where to read up, to get this data loaded as is?
Thank you in advance, Gerrit
Two different codesets in the same record is a dead end... well... almost...
Let's see...
- The data could be considered "corrupt" as it's impossible for the DB to
handle it as it should be
- I can't think of an application that can handle that...
But you say it was unloaded from an IDS instance... Well, this is a long story,
but it's possible...
Now you want to "load as is"... Can you tell us what errors you are getting?
And which CLIENT_LOCALE and DB_LOCALE are you using?
If you load the data with DB_LOCALE = CLIENT_LOCALE it will not do codeset
conversion. But depending on the client tool you may get some errors...
Regards
--
Fernando Nunes
Portugal
http://informix-technology.blogspot.com
My email works... but I don't check it frequently...
If you're storing UTF-8 in the database, then going from the database
to the database with no outside application, the data should be ok.
If you're unloading, then the database will/should convert to the
CLIENT_LOCALE character set, or in Oracle to whatever NLS_LANG is set
to.
This can cause problems: if it's set to a character set encoding
that can't represent your characters, data is lost and is replaced with
a strange character. You can't convert it back.
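(Just to illustrate that one-way loss, a tiny Python sketch; the ASCII target is only an example:)
print('Ä'.encode('ascii', errors='replace'))  # b'?'  -- the umlaut is gone
print(b'?'.decode('ascii'))                   # '?'   -- no way to get 'Ä' back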
On Unix, we have set our encoding to .AL32UTF8 which should allow you
to properly show 'latin-1' character sets.
In writing applications to manipulate or move data, we've used Python.
The trick is to read from the data source, convert via
unicode(string,'utf8') to get a generic unicode object representation,
then write using the proper encoding. You will have trouble seeing
certain encoded sets like Thai, but if you write the file out as UTF8,
there are a couple of tools out there that will let you see it.
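A minimal sketch of that round trip (unicode(s, 'utf8') is Python 2; bytes.decode()
is the Python 3 equivalent -- the file names and the cp1252 source encoding below
are just placeholders):
# Read bytes in the source encoding, go through a generic unicode object,
# then write the result out in the encoding you actually want.
with open('source.unl', 'rb') as src, open('converted.unl', 'wb') as dst:
    for raw in src:
        text = raw.decode('cp1252')       # source bytes -> unicode object
        dst.write(text.encode('utf-8'))   # unicode object -> target encoding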
If you're starting with your file and it's in UTF-8, write a small
Python script using the codecs module and open the file to read UTF-8;
then you can look at it.
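Something like this (the file name is a placeholder):
import codecs

# Each line is decoded as UTF-8 on the fly; a UnicodeDecodeError will
# pinpoint the first line that is not valid UTF-8.
with codecs.open('data.unl', 'r', encoding='utf-8') as f:
    for lineno, line in enumerate(f, 1):
        print(lineno, repr(line))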
If you're really good, you can open a Python interpreter shell and do
this by hand; then you can manipulate what you see.
And yeah, Cyrillic (Russian), Thai, Greek, etc. are all a royal pain...
why can't everyone learn the Queen's English? Then we'd sort of speak
the same language, only some would have a nicer accent than
others... :-)
HTH
-G
Unicode is a royal pain to work with and there are a couple of things
that can trip you up in Python.
HTH
-Mike
> If you're unloading, then the database will/should, convert to the
> CLIENT_LOCALE characterset or in Oracle whatever the NLS_LANG is set
> to.
> This can cause problems because if its set to a character set encoding
> that cant display your characterset, data is lost and is replaced with
> a strange character. You can't convert it back.
>
In recent IDS versions, if there is no possible conversion you'll get an error
instead of a "strange character"... This is part of the "long story".
There is a parameter to implement the old (wrong) behavior.
Regards.
yes, sure. Loading it with
DB_LOCALE=en_us.utf8
CLIENT_LOCALE=en_us.utf8
I get:
34389: Illegal character has been found in the input string
847: Error in load file line 1.
It breaks at the first occurrence of a non-ASCII character (the C4) in
the field
s1 of type TEXT. If there is only ASCII in this field s1, it accurately loads
any
UTF-8 characters given in the field s2 of type VARCHAR.
However when loading it with
DB_LOCALE=en_us.utf8
CLIENT_LOCALE=EN_US.CP1252
no error is reported. However, the data encoded as UTF-8 (the C3 84)
in the field
s2 of type VARCHAR is not imported accurately. You get the typical
'Ã„' on display.
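(For illustration, the same two bytes decoded both ways in Python:)
b'\xc3\x84'.decode('utf-8')    # 'Ä'  -- what was intended
b'\xc3\x84'.decode('cp1252')   # 'Ã„' -- what you see after the CP1252 detour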
Thanks
gerrit.schu...@gmail.com wrote:
> Given a DB created on IDS 11.50 running on XP with
>
> DB_LOCALE=en_us.utf8
>
> and given a table t
>
> create table t(
> s1 TEXT in blobdbs,
> s2 VARCHAR(254,100)
> )
>
> and given an import file containing the hexadecimal sequence
>
> C4 7C C3 84 7C 0D 0A
>
> in other words given an import file that contains the german Umlaut
> 'Ä'
Let's go back a little...
You're loading two columns, TEXT and VARCHAR.
When you say you have both codesets in the same record, you mean you have one
codeset in one column and another in the other column?
The error just happens on the TEXT column?
I'm afraid I didn't quite understand your problem...
Regards,
You're assuming the Database was created with UTF8?
I was asking you... after reading your post... You say it's normal to have
different codesets in the same record... I want to understand how you save it
in the database.
Ian, Fernando -
thank you both for your help.
Fernando -
yes, you got it. Within one line of data there is one column
encoded as non-UTF-8. This column goes to a field of type TEXT.
And there is another column within the same line of data
that is encoded as UTF-8. This column goes to a field of type VARCHAR.
I am told the source of the data file is a 'UTF-8 DB' on an older
Informix version
(7.31, I am told) running on UNIX (Solaris, I assume). It would not
be surprising if
the data in the TEXT column is not kept accurately at the source DB
already, since
it is not actually used. But it is kept!
Could you please elaborate a little on how to enforce the 'old (wrong)
behaviour' you
mentioned?
Ian -
yes, your suggestions are convincing. However they do not
make me happy yet because
1) I can only guess what the non-UTF-8 codesets in the (huge) file
are. I can see from the
data that it refers to 22 languages, and there are 'funny' encodings other
than 8859-1 among them.
2) I am reluctant to modify the data files since I am trying to set up
a system as close as
possible to the source system the data was exported from.
Ok. When I said that having more than one codeset in one record is a dead end, I
was thinking of the database side (the OP said that the file was unloaded from
an IDS instance).
Then you said "Uh, not at all", but you were thinking of the input file.
So now, I'm ok... :)
Then again... you say:
"The point is that even in a database, its possible that you have column A
containing unicode characters and then when a user updates a phonetic or
transliteration field, they have the wrong encoding so that the data gets
stores as latin1 and not utf-8."
And this doesn't look good to me... In recent versions of IDS this will not
happen. In older versions it would. That's the "long story".
Currently you can't connect to a database if you have an incorrect "DB_LOCALE".
So, the codeset in the database should be ok. Of course, if you have an
incorrect CLIENT_LOCALE and you get your input from a different codeset you
will have problems, but whatever is stored in the database will be in the
codeset specified by DB_LOCALE.
Regards.
The problem you're having will probably not be solved by what I was thinking...
At least not completely. Let's take the long story path:
Older versions of IDS would do two bad things:
1- Accept a connection from a client with the wrong DB_LOCALE setting.
This causes a very common issue: you create a DB on Unix with the default
codeset (8859-1 AKA latin-1). Then you use Windows applications and you don't
configure the correct DB_LOCALE and CLIENT_LOCALE settings
(DB_LOCALE=en_us.819;CLIENT_LOCALE=en_us.CP1252), so it uses
DB_LOCALE=CLIENT_LOCALE=en_us.CP1252. This meant that no conversion was done,
but you were putting CP1252 characters into an 819 database!
2- Whenever DB_LOCALE != CLIENT_LOCALE, codeset conversion must happen. But
there are cases where there is no possible conversion (several CP1252 characters
don't have an 819 representation). In these cases, depending on the codeset
conversion tables (specific to each from/to codeset pair), it could do one of
these options (I don't think I remember them all):
a) All characters without a possible conversion are converted into a
substitute character, so it's not possible to convert them back (!)
b) Each character without a possible conversion is converted to a
different "unused" character in the destination codeset. This would allow
conversion back
c) The character is converted to a "similar" character
d) ?... I think there were four... but I may be wrong ;)
So in recent versions this was changed. AFAIK no 11.x version behaves like this.
In v10, issue 1 was fixed probably in 10.00.xC4 (but I'm not sure) and issue
2 was fixed in 10.00.xC7.
For v9 I would have to look...
Note that CSDK also has implications on issue 1, and I think 2.90.TC6 was
changed so that it looks at the database's real locale in order to choose the
default value for DB_LOCALE.
How can we revert this new (and correct) behavior?
For issue 1, set IFMX_UNDOC_B168163 to anything in the engine environment and
restart it.
For the second issue, if you want the old behavior you have to set an option in
the ONCONFIG. I would have to check the name (it was referenced as one name but
apparently it was changed), but YOU DON'T WANT TO DO THIS ;)
So... From scratch. You have a file with two columns, one using UTF8 and the
other latin-1. And you want this data in a table within a UTF8 database.
I think you have two options:
1- Insert your data as UTF8 and get a correct database.
2- Insert your data "as is" and keep having a... err... corrupted (?) database
Solution for option 1:
1- Generate a third column in your file. You'll use it to uniquely identify
your rows
2- Create a table temp_utf8_col with two columns (integer, column using UTF8)
3- Create a table temp_CP1252_col with two columns (integer, column using CP1252)
4- Split your file in two (sequential column + utf8 and sequential column + 88591);
see the sketch after these steps
5- Set DB_LOCALE=CLIENT_LOCALE=en_us.utf8 and load the utf8 file into temp_utf8_col
6- Set DB_LOCALE=en_us.utf8 and CLIENT_LOCALE=en_us.CP1252 and load the CP1252
file into temp_CP1252_col
7- connect to the database and:
INSERT INTO final_table
SELECT col_utf8, col_88591
FROM temp_utf8_col utf8, temp_CP1252_col cp1252
WHERE utf8.col1 = cp1252.col1
You may have problems with this if there are CP1252 characters that don't have
representation in UTF8. This may seem impossible, but if you look closely at
some CP1252 characters... try writing a "-" in MS Word, or "quote... text...
quote" and you'll see what I mean
Solution for option 2:
- Set the variable mentioned above for issue 1
- restart the engine
- Follow the steps as for option 1, but when setting DB_LOCALE/CLIENT_LOCALE
use the same value for both. In other words, change step 6 to:
- Set DB_LOCALE=CLIENT_LOCALE=en_us.CP1252 and load the CP1252 file into
temp_CP1252_col
Obviously this serves as an example. You can do the steps in a different way,
create indexes on col1 on one of the tables for efficiency etc.
But:
- Option 2 is a dirty trick... Your data will be "corrupted" in some way
- Option 1 may mean that you won't be able to load the file with CP1252 unless
you change some data (Ian's idea of using a scripting language to convert to
Unicode may be good, but if there are codes without a conversion I don't know
how it handles them...)
- The variable to revert to old behavior may not be available on IDS 11.50. I'd
have to check...
- It would help to check the original environment taking into consideration the
"long story". If you understand how the data got "mixed up" it may help you to
fix it...
Well... I think that's it... I should put this in the blog...
Regards,
> Currently you can't connect to a database if you have an incorrect "DB_LOCALE".
> So, the codeset in the database should be ok. Of course if you have an
> incorrect CLIENT_LOCALE and you get your input from a different codeset you
> will have problems, but whatever is stored in the database will be in codeset
> specified by DB_LOCALE.
>
> Regards.
>
Correct. This is also somewhat true of Oracle.
In dealing with Brazilian road names, it's possible that the data was
entered into a spreadsheet and it contains 'latin-1' characters
(instead of UTF-8 data).
If the environment variables are not set correctly, you can get
latin-1 characters entered into the database, which in Oracle should
be storing the columns as UTF-8 characters.
If you're getting your data from a third-party vendor, it's possible
that while you have a UTF-8 character set in one column like Thai,
Cyrillic, etc., and another column of transliteration data, the
transliteration data could be in latin-1. So you have to be careful!
This is why I really hate Unicode and wish that everyone wrote and
spoke the Queen's English. :-)
The ironing is delicious.
--
Cheers,
Obnoxio the Clown
Yes... but Bossa Nova in English would sound horrible ;)
There's a lot of discussion below, but there isn't much in the way of
fundamental analysis.
The DB_LOCALE is shown as en_us.utf8; what is not mentioned is the
CLIENT_LOCALE. There are two options, roughly:
1) CLIENT_LOCALE=en_us.cp1252 (since you are on Windows XP -- I'd use
8859-1 or -15 on Unix)
2) CLIENT_LOCALE=en_us.utf8
In the first case (CP1252), the load file data is interpreted as:
C4 U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS
7C U+007C VERTICAL LINE
C3 U+00C3 LATIN CAPITAL LETTER A WITH TILDE
84 U+201E DOUBLE LOW-9 QUOTATION MARK
7C U+007C VERTICAL LINE
0D U+000D CARRIAGE RETURN
0A U+000A LINE FEED
Since you're running on Windows XP, the CRLF line ending is OK; it
would cause some issues on Unix. Anyway, the data provided has a
meaning in CP1252. [See http://www.unicode.org/charts/PDF/U2000.pdf,
http://en.wikipedia.org/wiki/Windows-1252, and
http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT]
In the second case (UTF8), the file is simply badly formed - it
contains an invalid sequence of bytes in UTF8. Specifically, the byte
0xC4 must be followed by a byte in the range 0x80..0xBF, but is
followed by 0x7C. That is invalid UTF8 data, and the only thing any
conformant Unicode program can do is reject the data as malformed.
(Side note: valid UTF8 files cannot contain bytes 0xC0 or 0xC1 at all;
the only characters that can be encoded with those are in the range
0x00..0x7F and the UTF8 format requires that the data be formatted
using the shortest byte sequence that will serve. There are security
risks, in particular, in allowing 0xC0 0x80 as a synonym for 0x00, for
example, so the Unicode consortium defines that 0xC0 0x80 is malformed
and must be rejected by a conformant application.)
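(The bytes from the original post make this easy to verify -- a quick Python check:)
data = b'\xc4\x7c\xc3\x84\x7c\x0d\x0a'
data.decode('cp1252')   # works: 'Ä|Ã„|\r\n'
data.decode('utf-8')    # UnicodeDecodeError: the lead byte 0xC4 must be
                        # followed by 0x80..0xBF, not 0x7C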
As far as IDS (GLS) is concerned, the data in the file is in one code
set -- either CP1252 or UTF8. You can choose which code set it is,
but all the data is in that one code set, and if you choose to treat
it as UTF8, then you have mal-formed data. Unfortunately, a single
file cannot contain data in two code sets -- or, more precisely, IDS
(GLS) will treat each file as containing a single code set, even if
you want to interpret some parts of the file as being in one code set
and others as being in another. The same applies to the database -
the character data in the database is in a single code set as far as
IDS is concerned - the DB_LOCALE determines which code set.
> In reality the data file is large ( > 1GB ) thus impossible for me to manually
> inspect and containing all sorts of character data in languages like Maltese
> thus impossible for me to understand and modify.
Someone or something is going to have to disambiguate the data. If
you want the data to be treated as UTF8, then you will have to get
hold of a rather carefully written parser that can handle the Informix
load format rules, work with separating the data into fields, and then
translate the CP1252 portion (first field) into UTF8 while leaving the
second field unchanged (assuming it is valid UTF8 -- it should flag
invalid UTF8 sequences). The other transform - converting the UTF8
data to a SBCS - is not really possible unless the repertoire of UTF8
characters used is in fact a subset of some SBCS encoding.
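A rough Python sketch of that parser (it ignores the load-format escape rules
mentioned above, so it only copes with simple records that have no embedded '|'
or newlines; the file names are placeholders):
# First field: CP1252 -> UTF-8. Second field: validate as UTF-8, keep as is.
with open('dump.unl', 'rb') as src, open('dump_utf8.unl', 'wb') as dst:
    for lineno, raw in enumerate(src, 1):
        fields = raw.rstrip(b'\r\n').split(b'|')
        fields[0] = fields[0].decode('cp1252').encode('utf-8')
        try:
            fields[1].decode('utf-8')
        except UnicodeDecodeError as exc:
            print('invalid UTF-8 on line', lineno, ':', exc)
        dst.write(b'|'.join(fields) + b'\n')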
I have bits and pieces that could be assembled to do the job.
Specifically, I have a program sbcs2utf8 for mapping data encoded in
an arbitrary single-byte code set (SBCS) to UTF8 (hence the name).
The CP1252.TXT file referenced above would almost do as the input file
- the program currently requires a simple 2-digit hex number such as
84 in the first column and the U+201E notation in the second column
where the CP1252.TXT file contains 0x84 0x201E. I suppose I should
modify the program so that the Unicode mapping files are directly
usable, though all it takes is a simple sed script to convert the
Unicode file to my format. The code in that program would handle the
SBCS to UTF8 mapping of the first column. There is code in SQLCMD
that would handle the data splitting. However, it would be a non-trivial
exercise in hacking to extract and mangle the code. If you
can find some code somewhere else to do the job, it will probably be
easier. However, most programs are designed to work with one code set
at a time.
> I know the file has been produced by some IDS from a table as above.
Probably by working with a database where the code set was a
single-byte code set (CP1252 or 8859-1) and in that database, IDS thinks
everything is in that code set, and only the applications know or care
that some parts are in UTF8 and other parts in CP1252. IDS certainly
has no knowledge of the 'UTF8 coding' in the second column, and cannot
enforce the correctness of the data, therefore. The fact that some
application knows to treat the second column as UTF8 data is
coincidental.
> But I lack the line of communication to the producers and certainly lack
> knowledge and experience with the IDS product. To me it is all
> astonishing to find two encodings in one record. May I ask the experts on a suggestion on
> what to do or where to read up to get this data loaded as is?
Good luck...
-=JL=-