Invalid utf8 character string error while importing into mysql


Lasse Stelzer

Jun 21, 2016, 3:24:44 AM
to jwpl-users, braeu...@ukp.informatik.tu-darmstadt.de
Hi all,

I tried importing the .txt files created by the DataMachine into MySQL and got this error with Page.txt: Error Code: 1300. Invalid utf8 character string: '{{dieser Artikel|behandelt den Begriff aus dem Buddhismus. Für '. The error occurs when I use LOAD DATA LOCAL INFILE 'C:\\Users\\Lasse\\Desktop\\wikipedia\\output\\Page.txt' INTO TABLE page to import.
I used the following command for the DataMachine: java -Xmx4g -Dfile.encoding=utf8 -jar JWPLDatamachine.jar german !Hauptkategorie Begriffserklärung C:\Users\Lasse\Desktop\wikipedia
My Wikipedia dump files are dewiki-latest-categorylinks.sql.gz, dewiki-latest-pagelinks.sql.gz and dewiki-latest-pages-articles.xml.bz2.

Thanks in advance,
Lasse

Johannes Daxenberger

Jun 21, 2016, 5:10:10 AM
to jw...@googlegroups.com, Lasse Stelzer, Sven Bräutigam

Hi Lasse,

 

Which command did you use to create the database you are importing into? See https://dkpro.github.io/dkpro-jwpl/DataMachine/

Did you make sure that the character set and collation are set to UTF-8? If you executed the right command, please re-check the settings on the database itself.

Also, for the import command, did you make sure that the "--default-character-set=utf8" flag is set?

 

Best,

Johannes


Lasse Stelzer

Jun 21, 2016, 9:10:45 AM
to jwpl-users, lasse....@web.de, braeu...@ukp.informatik.tu-darmstadt.de
Hi Johannes,

I used "CREATE DATABASE wiki DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci" to create the database.
The database is set to utf8:
+--------------------------+---------------------------------------------------------+
| Variable_name            | Value                                                   |
+--------------------------+---------------------------------------------------------+
| character_set_client     | utf8                                                    |
| character_set_connection | utf8                                                    |
| character_set_database   | utf8                                                    |
| character_set_filesystem | binary                                                  |
| character_set_results    | utf8                                                    |
| character_set_server     | utf8                                                    |
| character_set_system     | utf8                                                    |
| character_sets_dir       | C:\Program Files\MySQL\MySQL Server 5.7\share\charsets\ |
+--------------------------+---------------------------------------------------------+

Using "LOAD DATA LOCAL INFILE 'C:\\Users\\Lasse\\Desktop\\wikipedia\\output\\Page.txt' INTO TABLE page CHARACTER SET UTF8" yields the same result. I do not know where to put the "--default-character-set=utf8" flag.

Regards,
Lasse

Johannes Daxenberger

Jun 21, 2016, 9:33:16 AM
to jw...@googlegroups.com, Lasse Stelzer, Sven Bräutigam
"--default-character-set=utf8" works with the mysqlimport command only. Try that one instead of LOAD DATA INFILE.
If that does not work, the only idea I have is to use a different (mysql) client.
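
For readers following along, here is a minimal sketch of what such a mysqlimport call could look like, written as a small Python helper that assembles the argument list. The database name `wiki` and the file path are taken from earlier in this thread; the user name `root` and the helper name `build_mysqlimport_args` are assumptions made for illustration only. mysqlimport takes the target table name from the file name, so `Page.txt` is loaded into the table `Page`.

```python
def build_mysqlimport_args(database, text_file, user, charset="utf8"):
    """Return the argument list for a mysqlimport call that loads one file."""
    return [
        "mysqlimport",
        f"--user={user}",                      # assumed account name, adjust as needed
        "--password",                          # no value given: prompt for the password
        "--local",                             # read the file from the client machine
        f"--default-character-set={charset}",  # the flag discussed in this thread
        database,
        text_file,                             # table name is derived from the file name
    ]


# Values from the thread; "root" is a placeholder user.
args = build_mysqlimport_args(
    "wiki",
    r"C:\Users\Lasse\Desktop\wikipedia\output\Page.txt",
    user="root",
)
print(" ".join(args))
```

Passing the resulting list to `subprocess.run(args, check=True)` would execute the import, provided mysqlimport is on the PATH and the server accepts local file loading.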

Tristan Miller

Jun 21, 2016, 10:00:53 AM
to jw...@googlegroups.com
Greetings.

On 21/06/16 03:33 PM, Johannes Daxenberger wrote:
> "--default-character-set=utf8" works with the mysqlimport command only. Try that one instead of LOAD DATA INFILE.
> If that does not work, the only idea I have is to use a different (mysql) client.

Is it possible that some non-Unicode characters may have crept into the
Wikipedia dump that Lasse is using? I know that a long time ago (about
ten years ago), Wikipedia didn't prevent people from entering control
characters, and characters outside the Unicode ranges, into articles. I
reported this problem at the time but I'm not sure if it was ever fixed.

Regards,
Tristan

--
Tristan Miller, Research Scientist
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science, Technische Universität Darmstadt
Tel: +49 6151 162 5296 | Web: https://www.ukp.tu-darmstadt.de/
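
One way to test this hypothesis without involving MySQL at all is to scan the file directly. The sketch below is not part of JWPL; the function name `find_utf8_problems` and its parameters are made up for illustration. It reports lines that are not valid UTF-8, and also lines containing characters above U+FFFF, because MySQL's `utf8` character set stores at most three bytes per character and cannot hold those. Whether either condition applies to the dump in this thread is not established here.

```python
def find_utf8_problems(path, max_reports=10):
    """Return (line_number, description) pairs for lines MySQL's utf8 may reject."""
    problems = []
    with open(path, "rb") as handle:  # read raw bytes and decode them ourselves
        for line_number, raw in enumerate(handle, start=1):
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError as err:
                problems.append((line_number, f"invalid UTF-8 at byte {err.start}"))
            else:
                # Characters above U+FFFF need four bytes in UTF-8, which
                # MySQL's 3-byte "utf8" (utf8mb3) character set cannot store.
                wide = [ch for ch in text if ord(ch) > 0xFFFF]
                if wide:
                    problems.append(
                        (line_number, f"4-byte character U+{ord(wide[0]):04X}")
                    )
            if len(problems) >= max_reports:
                break  # stop early so huge files do not flood the output
    return problems
```

Calling `find_utf8_problems("Page.txt")` on the DataMachine output returns an empty list if the file is clean by both criteria.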

Johannes Daxenberger

Jun 21, 2016, 10:17:19 AM
to jw...@googlegroups.com
> Is it possible that some non-Unicode characters may have crept into the Wikipedia dump that Lasse is using?
I have not had such a problem in the last year, but to be sure, you could simply try to reproduce the issue with an older dump. If the problem does not occur with older dumps and the suspicion turns out to be valid, you should probably inform Xmldata...@lists.wikimedia.org about it.

Best,
Johannes

-----Original Message-----
From: jw...@googlegroups.com [mailto:jw...@googlegroups.com] On Behalf Of Tristan Miller
Sent: Tuesday, June 21, 2016, 16:01
To: jw...@googlegroups.com
Subject: Re: [jwpl-users] Invalid utf8 character string error while importing into mysql

Lasse Stelzer

Jun 22, 2016, 10:34:20 AM
to jwpl-users
Hi,

The mysqlimport method worked. Thank you for the help.
I also tried LOAD DATA LOCAL INFILE with an older dump but got the same error, so the problem Tristan suggested was not the cause.

Regards,
Lasse