How to move Unicode data from Excel/csv file to MySQL table?

Kai Schaetzl

unread,

Aug 5, 2009, 8:31:31 AM8/5/09

to

I'm hitting a problem with using LOAD DATA to import csv files with non-ASCII
languages to MySQL.
The data is from Excel sheets which I exported with the "Unicode (txt)" option.
This produces a tab-delimited file that is encoded UCS-2 Little Endian (according
to Notepad++). I convert this in Notepad++ either to ANSI or to ANSI/UTF-8
without BOM and then import to MySQL. The result is the same.
- only the first line gets imported
- there seems to be a problem with the detection of the line-end as the last
field also gets the first field of the next line attached to it (and then it
stops)
- non-ASCII characters seem to display correctly (the first line has only two,
though), so the way of encoding the characters is the right one

Example:
a_saefte;10;;nahrungsmittel;10_obst;S�fte;juices;Jus;Succhi;Zumos;(Meyve) sulari
is imported as:
a_saefte 10 nahrungsmittel 10_obst S�fte juices Jus Succhi Zumos (Meyve) sulari
a_zitrusfruechte

(copied from PHPMyAdmin, international characters may not get encoded correctly
in this posting, forget about that)
a_zitrusfruechte is from the next line but is included in the last field *with*
the linebreak!

Command:
mysql> LOAD DATA LOCAL INFILE '/tmp/Allergene_2.1.ansi.txt' INTO TABLE
`allergene` FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\' LINES
TERMINATED BY '\r\n';
Query OK, 1 row affected (0.00 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 0

The lines *are* terminated by \r\n. Field order and count match exactly. The
problem happens only when I import "correctly" encoded characters. When I import
a csv delimited file (produced by Excel) all lines get imported but the encoding
is not correct, so that characters and the parts after them are missing.

I don't see any other way to export these data from Excel. Exporting to csv
spoils all characters which are not in the current Windows codepage.

MYSQL ist 5.0.45 on CentOS 5. Kollation of the table cells with international
characters is utf8_unicode_ci.

Can this problem be easily solved or what am I doing wrong? Thanks!

Kai
--
Conactive Internet Services, Berlin, Germany

Jonathan

unread,

Aug 5, 2009, 9:17:46 AM8/5/09

to

On 5-8-2009 14:31, Kai Schaetzl wrote:
> Example:
> a_saefte;10;;nahrungsmittel;10_obst;S�fte;juices;Jus;Succhi;Zumos;(Meyve) sulari
> is imported as:
> a_saefte 10 nahrungsmittel 10_obst S�fte juices Jus Succhi Zumos (Meyve) sulari
> a_zitrusfruechte

If your data is ecoded like above I am not sure this is the proper command:

> Command:
> mysql> LOAD DATA LOCAL INFILE '/tmp/Allergene_2.1.ansi.txt' INTO TABLE
> `allergene` FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\' LINES
> TERMINATED BY '\r\n';
> Query OK, 1 row affected (0.00 sec)
> Records: 1 Deleted: 0 Skipped: 0 Warnings: 0

Since you do not enclose or escape your fields at all, perhaps you need
to adjust the ENCLOSED BY and ESCAPED BY sequences more like this?

LOAD DATA LOCAL INFILE
'/tmp/Allergene_2.1.ansi.txt'
INTO TABLE
`allergene`
FIELDS TERMINATED BY ';'
ENCLOSED BY ''
ESCAPED BY ''
LINES TERMINATED BY '\r\n';

> The lines *are* terminated by \r\n.

BTW Are you really sure that the line ending after conversion still is \r\n?

Kind regards,

Jonathan

Peter H. Coffin

unread,

Aug 5, 2009, 9:42:16 AM8/5/09

to

On Wed, 05 Aug 2009 14:31:31 +0200, Kai Schaetzl wrote:
>
> Command:
> mysql> LOAD DATA LOCAL INFILE '/tmp/Allergene_2.1.ansi.txt' INTO TABLE
> `allergene` FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\' LINES
> TERMINATED BY '\r\n';
> Query OK, 1 row affected (0.00 sec)
> Records: 1 Deleted: 0 Skipped: 0 Warnings: 0
>
> The lines *are* terminated by \r\n. Field order and count match exactly. The
> problem happens only when I import "correctly" encoded characters. When I import
> a csv delimited file (produced by Excel) all lines get imported but the encoding
> is not correct, so that characters and the parts after them are missing.
>
> I don't see any other way to export these data from Excel. Exporting to csv
> spoils all characters which are not in the current Windows codepage.
>
> MYSQL ist 5.0.45 on CentOS 5. Kollation of the table cells with international
> characters is utf8_unicode_ci.
>
> Can this problem be easily solved or what am I doing wrong? Thanks!

So, table's utf8, file's utf8. Is your database connection utf8?

--
I picked up a Magic 8-Ball the other day and it said 'Outlook not so good.'
I said 'Sure, but Microsoft still ships it.'
-- Anonymous

Kai Schaetzl

unread,

Aug 5, 2009, 11:18:33 AM8/5/09

to

Jonathan schrieb am Wed, 05 Aug 2009 15:17:46 +0200:

> ENCLOSED BY ''
> ESCAPED BY ''

Thanks for the tip. Unfortunately: same result.

> > The lines *are* terminated by \r\n.
>
> BTW Are you really sure that the line ending after conversion still is \r\n?

Yes. Notepad++ shows CRLF at the end and a hex editor shows 0d 0a. It looks
like it cannot identify the \r\n sequence, right?

I've just looked at http://dev.mysql.com/doc/refman/5.0/en/load-data.html
and found the CHARACTER SET option. Unfortunately, it doesn't help either. It
clearly changes the encoding of the characters in the fields, but it doesn't
seem to affect the identification of the \r\n line-break characters. I also
tried CHARACTER SET usc2 on the original file, but this fails completely with
a lot of warnings (not shown).

Kai Schaetzl

unread,

Aug 5, 2009, 11:18:34 AM8/5/09

to

Peter H. Coffin schrieb am Wed, 5 Aug 2009 08:42:16 -0500:

> So, table's utf8, file's utf8. Is your database connection utf8?

I have now temporarily changed character_set_client, _connection and -
server. No change :-( I think you cannot expect one, anyway. I think this
only affects *data* and apparently there's only that one line considered
data. It doesn't affect \r\n. \r\n should always be \r\n, right? I wonder
if I just add ...

Ok, I added a "|" before the \r\n

LOAD DATA LOCAL INFILE '/tmp/Allergene_2.1.ansi.break.txt' INTO TABLE

`allergene` FIELDS TERMINATED BY ';' ENCLOSED BY '' ESCAPED BY '' LINES

TERMINATED BY '|';
Query OK, 105 rows affected, 30 warnings (0.00 sec)
Records: 105 Deleted: 0 Skipped: 0 Warnings: 30

And it's all showing up correctly. So, \r\n is not the correct syntax if
the character data is unicode or utf-8. I tried \x0d\x0a, \u000d\u000a,
\000d\000a. Nothing fits. What#s the correct way?

Btw, how can I enable output of those warnings? (Probably field too
small.)

Kai Schaetzl

unread,

Aug 5, 2009, 11:57:17 AM8/5/09

to

Kai Schaetzl schrieb am Wed, 05 Aug 2009 17:18:34 +0200:

> And it's all showing up correctly. So, \r\n is not the correct syntax if
> the character data is unicode or utf-8. I tried \x0d\x0a, \u000d\u000a,
> \000d\000a. Nothing fits. What#s the correct way?

It seems that character sequence gets completely ignored. If I use "|" to
terminate a line then the next record should contain the \r\n plus the real
data in the first field, shouldn't it? But it doesn't, it's just gone.
Another possibility is that it uses the "|" to terminate the line for data
input, then ignores everythign after it and starts in a new line with the
next record.

Kai Schaetzl

unread,

Aug 5, 2009, 1:16:06 PM8/5/09

to

Kai Schaetzl schrieb am Wed, 05 Aug 2009 17:57:17 +0200:

> It seems that character sequence gets completely ignored. If I use "|" to
> terminate a line then the next record should contain the \r\n plus the real
> data in the first field, shouldn't it? But it doesn't, it's just gone.

I spoke too soon, it wasn't visible in PHPMyAdmin. It's visible in my output
and in direct selects. The \r\n gets put in the first field. So, I really need
to know how I can match \r\n in this file.

Kai Schaetzl

unread,

Aug 5, 2009, 1:29:54 PM8/5/09

to

Kai Schaetzl schrieb am Wed, 05 Aug 2009 19:16:06 +0200:

> So, I really need
> to know how I can match \r\n in this file.

Ok, got it. First, you have to convert to unix format, so only one
character ends the line, then you use

LOAD DATA LOCAL INFILE '/tmp/Allergene_2.1.ansi.break.txt' INTO TABLE
`allergene` FIELDS TERMINATED BY ';' ENCLOSED BY '' ESCAPED BY '' LINES

TERMINATED BY 0x0a;

(note the missing quotes around the termination character!)
Hope this helps someone else in the same situation.

Peter H. Coffin

unread,

Aug 5, 2009, 4:37:37 PM8/5/09

to

Line termination defaults to line terminated by newline characters, so
you may more reliably get what you want by omitting it entirely.

--
Don't use this code for realtime control, for weapons systems, or for
anything else that may put life or limb at hazard. It isn't man-rated,
it isn't really thing-rated, and we don't claim that it's worth a good
G*dDamn for anything at all, at all.
-- Mike Andrews, on Java compilers

Kai Schaetzl

unread,

Aug 6, 2009, 1:31:21 PM8/6/09

to

Peter H. Coffin schrieb am Wed, 5 Aug 2009 15:37:37 -0500:

> Line termination defaults to line terminated by newline characters,

Oh, yes? I was under the impression that it's not optional. Yeah, you are
right, it's optional. However, it doesn't help much in this case.

It defaults to '\n' only, so I have to remove the \r, anyway. If I then
use nothing or LINES TERMINATED BY '\n' or LINES TERMINATED BY 0x0a it all
works.
Just '\r\n' doesn't work (with \r not stripped yet).
Looks like a bug when utf8/doublebyte characters are present: you can use
only one termination character. Maybe it's already fixed in a newer
version of MySQL.

Anyway, thanks for your help and making me think. :-)