Problem While Importing an XML Wikipedia Dump on MYSQL

cana...@sabanciuniv.edu

Jul 3, 2016, 7:47:23 PM
to jwpl-users
Hi everyone,

I have run into a problem importing a Wikipedia XML dump into MySQL. I am currently using mwdumper, a JAR file that converts XML dumps into SQL. This is the command I ran:

 java -jar mwdumper.jar --format=sql:1.5 --output=file:sample_name.sql sample_wikipedia_dump_file.xml |
    mysql -u <username> -p <my_database_name>

The first thing I get is an error saying that my schema lacks a table named text. Despite the error, it partially converts the file, up to 147 pages (it turns out to be deprecated):

mysql: Unknown OS character set 'cp857'.
mysql: Switching to the default character set 'latin1'.
ERROR 1146 (42S02) at line 46: Table 'wiki_test.text' doesn't exist
14 pages (10,566/sec), 1.000 revs (754,717/sec)
26 pages (13,098/sec), 2.000 revs (1.007,557/sec)
32 pages (11,228/sec), 3.000 revs (1.052,632/sec)
44 pages (12,032/sec), 4.000 revs (1.093,793/sec)
52 pages (12,929/sec), 5.000 revs (1.243,163/sec)
56 pages (12,009/sec), 6.000 revs (1.286,725/sec)
...
Exception in thread "main" java.lang.IllegalArgumentException: Invalid contributor
        at org.mediawiki.importer.XmlDumpReader.closeContributor(Unknown Source)
        at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
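
ERROR 1146 in the output above says that the statements sent to mysql write to a table, text, that wiki_test does not contain. One way to spot this before running an import is to compare the tables that a SQL file inserts into with the tables the database already has. The following is a minimal sketch of that comparison, not part of mwdumper or MySQL: the function names and sample_sql are hypothetical, and the list of existing tables is assumed to come from something like SHOW TABLES.

```python
import re

# Matches the target table of an INSERT statement, with or without backticks.
INSERT_TARGET = re.compile(r"INSERT\s+(?:IGNORE\s+)?INTO\s+`?(\w+)`?", re.IGNORECASE)


def tables_referenced(sql_text):
    """Return the set of table names that INSERT statements in sql_text write to."""
    return {match.group(1) for match in INSERT_TARGET.finditer(sql_text)}


def missing_tables(sql_text, existing_tables):
    """Return, sorted, the tables the SQL writes to that are not in existing_tables."""
    return sorted(tables_referenced(sql_text) - set(existing_tables))


# Hypothetical sample, made up for illustration; it is not real mwdumper output.
sample_sql = (
    "INSERT INTO page (page_id) VALUES (1);\n"
    "INSERT INTO revision (rev_id) VALUES (1);\n"
    "INSERT INTO text (old_id) VALUES (1);\n"
)

# Assumed result of listing the tables in a database that lacks 'text'.
print(missing_tables(sample_sql, ["page", "revision"]))
```

A non-empty result suggests that the schema needs to be created in the target database before the INSERT statements can succeed.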

I then tried to import the generated SQL file into a MySQL database, but I ended up getting this error:

ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '<?xml version="1.0" encoding="utf-8" ?>
<mediawiki xmlns="http://www.mediawiki.o' at line 1

My question is: is this related to the default database collation and character encoding, or is it a problem with the mwdumper version? I have checked that both the XML and the schema have the same encoding (UTF-8). Also, how have you imported a Wikipedia XML dump into MySQL before? Is this the best way to achieve it?

Thank you,

Torsten Zesch

Jul 4, 2016, 3:34:38 AM
to jw...@googlegroups.com
This question seems largely unrelated to JWPL.
If you want to use JWPL, please follow the instructions on this page:

I am sorry that we cannot help here with the other problems.

-Torsten
