Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Loading UTF8 fixed length file using SQL Loader 8i

465 views
Skip to first unread message

Nicolas Mayoraz

unread,
Jan 4, 2005, 8:18:23 AM1/4/05
to
Hi all,

I'm trying to load an UTF8 file with fixed length record using SQL
Loader for Oracle 8i.

Unfortunately, SQL Loader does not make the difference between bytes
and characters. i.e. that if two bytes are used to code an accentuated
character (like é) SQL loader will consider that two caracters has
been read meaning that beginning of next field will be shifted.

Example:

LOAD DATA
CHARACTERSET UTF8
INFILE 'XXX.txt'
REPLACE

INTO TABLE tmp.MY_TABLE
(
LASTN POSITION (1:20), -- Last name
FIRSN POSITION (21:40), -- First name
STATU POSITION (41:41), -- Status
BIRDT POSITION (42:49) -- Date of birth
)

With this data :
Thévenat Jacques 119460508

I'll get in my DB :

Field 1 : Thévenat
Field 2 : Jacques (note the leading space)
Field 3 : (blank)
Field 4 : 11946050 (wrong date)

This because the "é" is coded like C3 A9 (= 2 bytes) in UTF8.

I know that there is a LENGTH SEMANTICS CHAR option in Oracle 9i but
I haven't found anything similar in 8i.

Does anybody know a work around ?

Thanks in advance.

Malcolm Dew-Jones

unread,
Jan 4, 2005, 1:35:59 PM1/4/05
to
Nicolas Mayoraz (nm...@dkweb.ch) wrote:
: Hi all,

: I'm trying to load an UTF8 file with fixed length record using SQL
: Loader for Oracle 8i.

But utf-8 isn't fixed length.

: Unfortunately, SQL Loader does not make the difference between bytes


: and characters. i.e. that if two bytes are used to code an accentuated
: character (like é) SQL loader will consider that two caracters has
: been read meaning that beginning of next field will be shifted.

: Example:

: LOAD DATA
: CHARACTERSET UTF8
: INFILE 'XXX.txt'
: REPLACE

: INTO TABLE tmp.MY_TABLE
: (
: LASTN POSITION (1:20), -- Last name
: FIRSN POSITION (21:40), -- First name
: STATU POSITION (41:41), -- Status
: BIRDT POSITION (42:49) -- Date of birth
: )

: With this data :
: Thévenat Jacques 119460508

: I'll get in my DB :

: Field 1 : Thévenat
: Field 2 : Jacques (note the leading space)
: Field 3 : (blank)
: Field 4 : 11946050 (wrong date)

: This because the "é" is coded like C3 A9 (= 2 bytes) in UTF8.

As a said, not fixed length data.

: I know that there is a LENGTH SEMANTICS CHAR option in Oracle 9i but


: I haven't found anything similar in 8i.

: Does anybody know a work around ?

One possibility...

Make the fields in the load file long enough to contain the maximum length
of utf-8 data for that field, and then pad the bytes of each field to be
that length when you create the load file, and then trim off the padding
as the field is inserted into the table.


--

This space not for rent.

HansF

unread,
Jan 4, 2005, 2:09:15 PM1/4/05
to
On Tue, 04 Jan 2005 05:18:23 -0800, Nicolas Mayoraz wrote:

> Hi all,
>
> I'm trying to load an UTF8 file with fixed length record using SQL
> Loader for Oracle 8i.
>
> Unfortunately, SQL Loader does not make the difference between bytes
> and characters. i.e. that if two bytes are used to code an accentuated
> character (like é) SQL loader will consider that two caracters has
> been read meaning that beginning of next field will be shifted.
>

(Just pointers, as I haven't tried this yet ...)

You might find Case Study 11 "Loading Data in the Unicode Character Set"
and Chapter 5 "Handling Different Character Encoding Schemes" of interest.
For Oracle9i R2 Look up Chapter 10 of the Utilities manual at
http://www.tahiti.com

HTH
/Hans

Nicolas Mayoraz

unread,
Jan 5, 2005, 8:58:06 AM1/5/05
to
Malcolm Dew-Jones (yf...@vtn1.victoria.tc.ca) wrote :

This post is a followup of previous thread with same name, but I had a
problem when replying :-(

>Nicolas Mayoraz (nm...@dkweb.ch) wrote:
>
>: I'm trying to load an UTF8 file with fixed length record using SQL
>: Loader for Oracle 8i.
>


>But utf-8 isn't fixed length.
>

I know utf-8 is not fixed length. I'm not referring to the byte length
(number of bytes needed to encode) but to the characters length
(number of characters seen when displayed on the screen) = number of
characters seized by the end-user


>One possibility...
>
>Make the fields in the load file long enough to contain the maximum
length
>of utf-8 data for that field, and then pad the bytes of each field to
be
>that length when you create the load file, and then trim off the
padding
>as the field is inserted into the table.

Unfortunately, I have no way to change the received file :-( But
thanks for the idea.

Anurag Varma

unread,
Jan 5, 2005, 7:35:48 PM1/5/05
to

"Nicolas Mayoraz" <nm...@dkweb.ch> wrote in message news:1f9a8c6.05010...@posting.google.com...

What happens if you define the ctl file as?:

LOAD DATA
CHARACTERSET UTF8
INFILE 'XXX.txt'
REPLACE

INTO TABLE tmp.MY_TABLE
(
LASTN char(20), -- Last name
FIRSN char(20), -- First name
STATU char(1), -- Status
BIRDT char(8) -- Date of birth
)

?

Anurag


Nicolas Mayoraz

unread,
Jan 7, 2005, 5:02:41 AM1/7/05
to
Anurag Varma (av...@hotmail.com) wrote

>What happens if you define the ctl file as?:
>
>LOAD DATA
>CHARACTERSET UTF8
> INFILE 'XXX.txt'
>REPLACE
>
> INTO TABLE tmp.MY_TABLE
> (
> LASTN char(20), -- Last name
> FIRSN char(20), -- First name
> STATU char(1), -- Status
> BIRDT char(8) -- Date of birth
>)
>
>?
>
>Anurag

With SQL Loader 8i it's not working properly. But I managed to install
SQL Loader 9i and it's working well with the LENGTH SEMANTICS CHAR
option and using the syntax you proposed above.

Best regards,

Nik

Anurag Varma

unread,
Jan 7, 2005, 6:02:16 PM1/7/05
to

"Nicolas Mayoraz" <nm...@dkweb.ch> wrote in message news:1f9a8c6.05010...@posting.google.com...

hmm .. well I see different results
.. I did a test on a 8.1.7.4 database and found that it works with the syntax I posted above.
I tested it with greek and french data.
Although POSITION(x:y) refers to bytes in 8i, the char (afaik) refers to characters.

Anurag


Anurag Varma

unread,
Jan 7, 2005, 6:03:41 PM1/7/05
to

"Anurag Varma" <av...@hotmail.com> wrote in message news:YxEDd.133509$AL5.1...@twister.nyroc.rr.com...
--snip--

> hmm .. well I see different results
> .. I did a test on a 8.1.7.4 database and found that it works with the syntax I posted above.
> I tested it with greek and french data.
> Although POSITION(x:y) refers to bytes in 8i, the char (afaik) refers to characters.
>
> Anurag
>
>
Just to confirm ... I did use 8i's sql*loader


0 new messages