Gzip error, trying to run DataMachine.


xavier sumba

Feb 4, 2016, 12:10:31 PM
to jwpl-users

Hi,


I have downloaded the Wikipedia dumps, but I ran into some trouble when trying to get a Page.


This is my wp-download configuration file:

[Configuration]
base_url = http://download.wikimedia.org

[Templates]
file_format = ${langcode}wiki-${date}-${filename}.${filetype}
language_dir_format = ${langcode}wiki

[Files]
pages-articles = True
categorylinks = True
pagelinks = True

[Filetypes]
pages-articles = xml.bz2
categorylinks = sql.gz
pagelinks = sql.gz

[Languages]
en = True


This is the command I used to download the dumps:

wp-download --resume -v /Users/cuent/Desktop/wikipedia/dumps

And I got these files (the downloads took a long time):

enwiki-20160113-categorylinks.sql.gz 1.6G
enwiki-20160113-pagelinks.sql.gz 4.7G
enwiki-20160113-pages-articles.xml.bz2 12G


After that I ran the transformation, which took about four and a half hours. I used this command:

java -Xmx14G -jar de.tudarmstadt.ukp.wikipedia.datamachine-1.0.0-jar-with-dependencies.jar english Contents Disambiguation_pages dumps/en/20160113/

I got the following files:

page.bin 435M
revision
revision.bin 103M
text.bin 31G
output
|---Category.txt 61M
|---category_inlinks.txt 65M
|---category_outlinks.txt 65M
|---category_pages.txt 636M
|---page_categories.txt 636M

The documentation says I should get 11 files ("you should get 11 txt files in an “output” subfolder") in my output folder, but I only got 5.

This is my log; I got the following message:

"Date/Time","Total Memory","Free Memory","Message"
"2016.02.03 23:47:09","257425408","241131656","parse input dumps..."
"2016.02.03 23:47:09","257425408","241131656","Discussions are unavailable"
"2016.02.04 03:29:39","100139008","98329328","processing table page..."
"2016.02.04 03:29:40","100139008","96359680","Pages 10000"
........
"2016.02.04 03:35:48","2435317760","240240824","Pages 13530000"
"2016.02.04 03:35:48","2435317760","238847264","processing table categorylinks..."
"2016.02.04 03:35:49","2438463488","244665240","Categorylinks 10000"
........
"2016.02.04 03:46:37","2822242304","293641336","Categorylinks 97890000"
"2016.02.04 03:46:41","2823815168","820690024","processing table pagelinks..."
"2016.02.04 03:46:41","2823815168","820690024","Not in GZIP format

java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:78)
java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:90)
de.tudarmstadt.ukp.wikipedia.wikimachine.decompression.GZipDecompressor.getInputStream(GZipDecompressor.java:31)
de.tudarmstadt.ukp.wikipedia.wikimachine.decompression.UniversalDecompressor.getInputStream(UniversalDecompressor.java:204)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.createPagelinksParser(DataMachineGenerator.java:140)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:78)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:59)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:57)"




Then I tried to import my data into MySQL, but obviously not all of the tables are going to load successfully after that.

mysqlimport -uroot -p --local --default-character-set=utf8 wiki_dumps /Users/cuent/Desktop/wikipedia/dumps/en/20160113/output/*.txt
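Note: mysqlimport derives the table names from the .txt file names and only loads into tables that already exist, so the wiki_dumps database and its tables have to be created beforehand. A minimal sketch for the database itself, assuming the utf8 character set to match the import:

mysql -uroot -p -e "CREATE DATABASE wiki_dumps DEFAULT CHARACTER SET utf8;"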
 

I want to run this example:

        DatabaseConfiguration dbConfig = new DatabaseConfiguration();
        dbConfig.setHost("localhost");
        dbConfig.setDatabase("wiki_dumps");
        dbConfig.setUser("root");
        dbConfig.setPassword("password");
        dbConfig.setLanguage(Language.english);

        Wikipedia wiki = new Wikipedia(dbConfig);

        Page page = wiki.getPage("Hello World");
        System.out.println(page.getText());



So, what am I doing wrong? Why do I get 5 files instead of 11? How can I fix this error?


Cheers.

Torsten Zesch

Feb 4, 2016, 2:01:56 PM
to jw...@googlegroups.com
Maybe the pagelinks file is corrupted?
Can you unzip it on the command line?
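For example, something like this (gzip -t only tests the archive without extracting it; it prints nothing if the file is intact):

gzip -t enwiki-20160113-pagelinks.sql.gz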

-Torsten


xavier sumba

Feb 4, 2016, 3:48:59 PM
to jwpl-users
Yes, I ran it from the source code. The error is that when the program tries to uncompress enwiki-20160113-pagelinks.sql.gz, I get enwiki-20160113-pagelinks.sql.gz.cpgz, and this behaviour repeats in a cycle. So it's not an error in JWPL; it is out of its scope. I would just like to know whether there is another way to download that pagelinks file?

xavier sumba

Feb 4, 2016, 3:49:43 PM
to jwpl-users
And are the sizes of the files OK?

Cheers.

Oliver Ferschke

Feb 4, 2016, 3:55:07 PM
to jw...@googlegroups.com
Hi, 

you can download the files manually with wget, a browser, or a download tool.

The download page also contains md5 and sha1 checksums, so you can make sure that the downloaded files are not corrupted.
Afaik, you can only download one (or two?)  files at a time from any given IP.
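For example, with GNU coreutils (on OS X, md5 and shasum are the rough equivalents) you could compare the published digest against your own; the file names here are just an example:

md5sum enwiki-20160113-pagelinks.sql.gz
grep pagelinks enwiki-20160113-md5sums.txt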

-Oliver 

xavier sumba

Feb 4, 2016, 4:24:58 PM
to jwpl-users
Hi Oliver,

Thank you so much; I spent weeks trying to download those files. I will share the commands I am going to use, in case anyone else is facing this problem.
I am using the latest snapshot: https://dumps.wikimedia.org/enwiki/latest/
I am using curl to download the following files:
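For example, downloads along these lines (hypothetical invocations; -O keeps the server-side file name and -C - resumes an interrupted transfer):

curl -C - -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
curl -C - -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pagelinks.sql.gz
curl -C - -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-categorylinks.sql.gz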


And you can check that your files are not corrupted using the checksums in enwiki-latest-md5sums.txt and enwiki-latest-sha1sums.txt.

Hope it helps; I will keep you informed.

Cheers.

xavier sumba

Feb 11, 2016, 1:35:33 AM
to jwpl-users
Hi,

I have downloaded all the files, and the file sizes are OK.

1.75GB / enwiki-20160204-categorylinks.sql.gz
5.05GB / enwiki-20160204-pagelinks.sql.gz
12.7GB / enwiki-20160204-pages-articles.xml.bz2

After that I tried many times, but I could not fix the error to get my transformations. I executed this:

java -Xmx14G -jar de.tudarmstadt.ukp.wikipedia.datamachine-1.0.0-jar-with-dependencies.jar english Contents Disambiguation_pages dumps/en/20160204/

This is the error: 

21:15:22,638  INFO XmlBeanDefinitionReader:315 - Loading XML bean definitions from class path resource [context/applicationContext.xml]
21:15:22,888  INFO Log4jLogger:21 - parse input dumps...
21:15:22,890  INFO Log4jLogger:21 - Discussions are unavailable
00:17:58,140  INFO Log4jLogger:21 - 8192
com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:546)
com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1735)
com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1606)
com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1644)
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1748)
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2973)
com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:648)
com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:332)
javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:208)
de.tudarmstadt.ukp.wikipedia.datamachine.dump.xml.XML2Binary.<init>(XML2Binary.java:44)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:65)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:59)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:57)


Why is this happening?

Cheers.

Torsten Zesch

Feb 11, 2016, 4:21:08 AM
to jw...@googlegroups.com
Have you checked the integrity of the downloaded files using the provided checksums?

-Torsten

Johannes Daxenberger

Feb 11, 2016, 5:36:24 AM
to jw...@googlegroups.com, Dr. Torsten Zesch
Hi Xavier,

this is a known (but not yet solved) problem. See my post from April last year (and a previous one from September 25, 2014):

"I ran into this exact problem on a recent (2015) en_wiki dump. It seems to be a known issue with Xerces 2.7.1 used by OpenJDK 1.7 (http://stackoverflow.com/questions/22891411/java-xerces-java-lang-arrayindexoutofboundsexception-8192https://bugs.openjdk.java.net/browse/JDK-7156085). 

I could solve this problem by using a newer version of Xerces (e.g. 2.11, following this guide: http://a-sirenko.blogspot.de/2013/08/jdk-7-sax-parser-produces.html): 
java -Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl -cp de.tudarmstadt.ukp.wikipedia.datamachine-1.0.0-jar-with-dependencies.jar:xercesImpl.jar:xml-apis.jar de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine

However, this solution didn’t make me happy for too long. After a couple of minutes/hours of processing, the new XML parser seems to have a different problem:

15:17:24,501  INFO Log4jLogger:21 - org.xml.sax.SAXParseException; lineNumber: 467866063; columnNumber: 343; Invalid byte 2 of 4-byte UTF-8 sequence.
de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:212)
de.tudarmstadt.ukp.wikipedia.datamachine.dump.xml.XML2Binary.<init>(XML2Binary.java:44)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:65)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:59)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:57)

So far, I couldn’t figure out why this happens; the dump seems to be fine (no suspicious characters in the given line, as far as I can tell – the uncompressed file has 120GB), -Dfile.encoding is UTF8."

You can try the above solution and see if it works for you. Which Java version are you using?
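If you want to experiment outside of JWPL, here is a minimal, hypothetical sketch of a more lenient parse (not JWPL code, just the bare idea): an InputStreamReader constructed with a charset name replaces malformed UTF-8 sequences with U+FFFD instead of throwing, so handing the parser a Reader bypasses Xerces' own byte-level UTF-8 decoding.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class LenientUtf8Parse {
    // args[0]: path to the (uncompressed) XML dump
    public static void main(String[] args) throws Exception {
        // The String-charset constructor replaces malformed byte
        // sequences with U+FFFD instead of aborting the parse.
        try (Reader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(args[0]), "UTF-8"))) {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(reader), new DefaultHandler());
        }
    }
}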

Best,
Johannes

xavier sumba

Feb 11, 2016, 3:35:15 PM
to jwpl-users
Yes, my checksums are the same. I was so happy, because the download took so long.

Cheers.

...

xavier sumba

Feb 11, 2016, 3:39:01 PM
to jwpl-users, torste...@uni-due.de
Hi Johannes,

I tried your solution.

java -Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl -cp de.tudarmstadt.ukp.wikipedia.datamachine-1.0.0-jar-with-dependencies.jar:xercesImpl.jar:xml-apis.jar de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine english Contents Disambiguation_pages ../dumps/en/20160204/


It seems I am getting the same error that you were facing.


10:44:15,746  INFO XmlBeanDefinitionReader:315 - Loading XML bean definitions from class path resource [context/applicationContext.xml]
10:44:16,048  INFO Log4jLogger:21 - parse input dumps...
10:44:16,048  INFO Log4jLogger:21 - Discussions are unavailable
14:25:47,052  INFO Log4jLogger:21 - org.xml.sax.SAXParseException; lineNumber: 373897168; columnNumber: 281; Invalid byte 2 of 4-byte UTF-8 sequence.
de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:212)
de.tudarmstadt.ukp.wikipedia.datamachine.dump.xml.XML2Binary.<init>(XML2Binary.java:44)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:65)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:59)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:57)



Cheers.