Example:
a_saefte;10;;nahrungsmittel;10_obst;S�fte;juices;Jus;Succhi;Zumos;(Meyve) sulari
is imported as:
a_saefte 10 nahrungsmittel 10_obst S�fte juices Jus Succhi Zumos (Meyve) sulari
a_zitrusfruechte
(copied from PHPMyAdmin, international characters may not get encoded correctly
in this posting, forget about that)
a_zitrusfruechte is from the next line but is included in the last field *with*
the linebreak!
Command:
mysql> LOAD DATA LOCAL INFILE '/tmp/Allergene_2.1.ansi.txt' INTO TABLE
`allergene` FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\' LINES
TERMINATED BY '\r\n';
Query OK, 1 row affected (0.00 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 0
The lines *are* terminated by \r\n. Field order and count match exactly. The
problem happens only when I import "correctly" encoded characters. When I import
a csv delimited file (produced by Excel) all lines get imported but the encoding
is not correct, so that characters and the parts after them are missing.
I don't see any other way to export these data from Excel. Exporting to csv
spoils all characters which are not in the current Windows codepage.
MYSQL ist 5.0.45 on CentOS 5. Kollation of the table cells with international
characters is utf8_unicode_ci.
Can this problem be easily solved or what am I doing wrong? Thanks!
Kai
--
Conactive Internet Services, Berlin, Germany
If your data is ecoded like above I am not sure this is the proper command:
> Command:
> mysql> LOAD DATA LOCAL INFILE '/tmp/Allergene_2.1.ansi.txt' INTO TABLE
> `allergene` FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\' LINES
> TERMINATED BY '\r\n';
> Query OK, 1 row affected (0.00 sec)
> Records: 1 Deleted: 0 Skipped: 0 Warnings: 0
Since you do not enclose or escape your fields at all, perhaps you need
to adjust the ENCLOSED BY and ESCAPED BY sequences more like this?
LOAD DATA LOCAL INFILE
'/tmp/Allergene_2.1.ansi.txt'
INTO TABLE
`allergene`
FIELDS TERMINATED BY ';'
ENCLOSED BY ''
ESCAPED BY ''
LINES TERMINATED BY '\r\n';
> The lines *are* terminated by \r\n.
BTW Are you really sure that the line ending after conversion still is \r\n?
Kind regards,
Jonathan
So, table's utf8, file's utf8. Is your database connection utf8?
--
I picked up a Magic 8-Ball the other day and it said 'Outlook not so good.'
I said 'Sure, but Microsoft still ships it.'
-- Anonymous
> ENCLOSED BY ''
> ESCAPED BY ''
Thanks for the tip. Unfortunately: same result.
> > The lines *are* terminated by \r\n.
>
> BTW Are you really sure that the line ending after conversion still is \r\n?
Yes. Notepad++ shows CRLF at the end and a hex editor shows 0d 0a. It looks
like it cannot identify the \r\n sequence, right?
I've just looked at http://dev.mysql.com/doc/refman/5.0/en/load-data.html
and found the CHARACTER SET option. Unfortunately, it doesn't help either. It
clearly changes the encoding of the characters in the fields, but it doesn't
seem to affect the identification of the \r\n line-break characters. I also
tried CHARACTER SET usc2 on the original file, but this fails completely with
a lot of warnings (not shown).
> So, table's utf8, file's utf8. Is your database connection utf8?
No.
| character_set_client | latin1
| character_set_connection | latin1
| character_set_database | utf8
| character_set_filesystem | binary
| character_set_results | latin1
| character_set_server | latin1
| character_set_system | utf8
| character_sets_dir | /usr/share/mysql/charsets/
| collation_connection | latin1_swedish_ci
| collation_database | utf8_unicode_ci
| collation_server | latin1_swedish_ci
I have now temporarily changed character_set_client, _connection and -
server. No change :-( I think you cannot expect one, anyway. I think this
only affects *data* and apparently there's only that one line considered
data. It doesn't affect \r\n. \r\n should always be \r\n, right? I wonder
if I just add ...
Ok, I added a "|" before the \r\n
LOAD DATA LOCAL INFILE '/tmp/Allergene_2.1.ansi.break.txt' INTO TABLE
`allergene` FIELDS TERMINATED BY ';' ENCLOSED BY '' ESCAPED BY '' LINES
TERMINATED BY '|';
Query OK, 105 rows affected, 30 warnings (0.00 sec)
Records: 105 Deleted: 0 Skipped: 0 Warnings: 30
And it's all showing up correctly. So, \r\n is not the correct syntax if
the character data is unicode or utf-8. I tried \x0d\x0a, \u000d\u000a,
\000d\000a. Nothing fits. What#s the correct way?
Btw, how can I enable output of those warnings? (Probably field too
small.)
> And it's all showing up correctly. So, \r\n is not the correct syntax if
> the character data is unicode or utf-8. I tried \x0d\x0a, \u000d\u000a,
> \000d\000a. Nothing fits. What#s the correct way?
It seems that character sequence gets completely ignored. If I use "|" to
terminate a line then the next record should contain the \r\n plus the real
data in the first field, shouldn't it? But it doesn't, it's just gone.
Another possibility is that it uses the "|" to terminate the line for data
input, then ignores everythign after it and starts in a new line with the
next record.
> It seems that character sequence gets completely ignored. If I use "|" to
> terminate a line then the next record should contain the \r\n plus the real
> data in the first field, shouldn't it? But it doesn't, it's just gone.
I spoke too soon, it wasn't visible in PHPMyAdmin. It's visible in my output
and in direct selects. The \r\n gets put in the first field. So, I really need
to know how I can match \r\n in this file.
> So, I really need
> to know how I can match \r\n in this file.
Ok, got it. First, you have to convert to unix format, so only one
character ends the line, then you use
LOAD DATA LOCAL INFILE '/tmp/Allergene_2.1.ansi.break.txt' INTO TABLE
`allergene` FIELDS TERMINATED BY ';' ENCLOSED BY '' ESCAPED BY '' LINES
TERMINATED BY 0x0a;
(note the missing quotes around the termination character!)
Hope this helps someone else in the same situation.
Line termination defaults to line terminated by newline characters, so
you may more reliably get what you want by omitting it entirely.
--
Don't use this code for realtime control, for weapons systems, or for
anything else that may put life or limb at hazard. It isn't man-rated,
it isn't really thing-rated, and we don't claim that it's worth a good
G*dDamn for anything at all, at all.
-- Mike Andrews, on Java compilers
> Line termination defaults to line terminated by newline characters,
Oh, yes? I was under the impression that it's not optional. Yeah, you are
right, it's optional. However, it doesn't help much in this case.
It defaults to '\n' only, so I have to remove the \r, anyway. If I then
use nothing or LINES TERMINATED BY '\n' or LINES TERMINATED BY 0x0a it all
works.
Just '\r\n' doesn't work (with \r not stripped yet).
Looks like a bug when utf8/doublebyte characters are present: you can use
only one termination character. Maybe it's already fixed in a newer
version of MySQL.
Anyway, thanks for your help and making me think. :-)