Gzip error, trying to run DataMachine.


xavier sumba

Feb 4, 2016, 12:10:31 PM
to jwpl-users

Hi,


I have downloaded the Wikipedia dumps, but I ran into some trouble when trying to get a Page.


This is my wp-download configuration file:

[Configuration]
base_url = http://download.wikimedia.org

[Templates]
file_format = ${langcode}wiki-${date}-${filename}.${filetype}
language_dir_format = ${langcode}wiki

[Files]
pages-articles = True
categorylinks = True
pagelinks = True

[Filetypes]
pages-articles = xml.bz2
categorylinks = sql.gz
pagelinks = sql.gz

[Languages]
en = True


This is the command I used to download the dumps:

wp-download --resume -v /Users/cuent/Desktop/wikipedia/dumps

And I got these files (the downloads took a long time):

enwiki-20160113-categorylinks.sql.gz 1.6G
enwiki-20160113-pagelinks.sql.gz 4.7G
enwiki-20160113-pages-articles.xml.bz2 12G


After that I ran the transformation, which took about four and a half hours. I used this command:

java -Xmx14G -jar de.tudarmstadt.ukp.wikipedia.datamachine-1.0.0-jar-with-dependencies.jar english Contents Disambiguation_pages dumps/en/20160113/

I got the following files:

page.bin 435M
revision
revision.bin 103M
text.bin 31G
output
|---Category.txt 61M
|---category_inlinks.txt 65M
|---category_outlinks.txt 65M
|---category_pages.txt 636M
|---page_categories.txt 636M

The documentation says I should get 11 files ("you should get 11 txt files in an “output” subfolder") in my output folder, but I only got 5.

This is my log; I got the following message:

"Date/Time","Total Memory","Free Memory","Message"
"2016.02.03 23:47:09","257425408","241131656","parse input dumps..."
"2016.02.03 23:47:09","257425408","241131656","Discussions are unavailable"
"2016.02.04 03:29:39","100139008","98329328","processing table page..."
"2016.02.04 03:29:40","100139008","96359680","Pages 10000"
........
"2016.02.04 03:35:48","2435317760","240240824","Pages 13530000"
"2016.02.04 03:35:48","2435317760","238847264","processing table categorylinks..."
"2016.02.04 03:35:49","2438463488","244665240","Categorylinks 10000"
........
"2016.02.04 03:46:37","2822242304","293641336","Categorylinks 97890000"
"2016.02.04 03:46:41","2823815168","820690024","processing table pagelinks..."
"2016.02.04 03:46:41","2823815168","820690024","Not in GZIP format

java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:78)
java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:90)
de.tudarmstadt.ukp.wikipedia.wikimachine.decompression.GZipDecompressor.getInputStream(GZipDecompressor.java:31)
de.tudarmstadt.ukp.wikipedia.wikimachine.decompression.UniversalDecompressor.getInputStream(UniversalDecompressor.java:204)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.createPagelinksParser(DataMachineGenerator.java:140)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:78)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:59)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:57)"




Then I tried to import my data into MySQL, but obviously not all of the tables are going to load successfully after that.

mysqlimport -uroot -p --local --default-character-set=utf8 wiki_dumps /Users/cuent/Desktop/wikipedia/dumps/en/20160113/output/*.txt
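Note: mysqlimport derives the table names from the .txt file names and only loads into tables that already exist, so the wiki_dumps database and its tables have to be created beforehand. A minimal sketch for the database itself, assuming the utf8 character set to match the import:

mysql -uroot -p -e "CREATE DATABASE wiki_dumps DEFAULT CHARACTER SET utf8;"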
 

I want to run this example:

        DatabaseConfiguration dbConfig = new DatabaseConfiguration();
        dbConfig.setHost("localhost");
        dbConfig.setDatabase("wiki_dumps");
        dbConfig.setUser("root");
        dbConfig.setPassword("password");
        dbConfig.setLanguage(Language.english);

        Wikipedia wiki = new Wikipedia(dbConfig);

        Page page = wiki.getPage("Hello World");
        System.out.println(page.getText());



So, what am I doing wrong? Why do I get 5 files instead of 11? How can I fix this error?


Cheers.

Torsten Zesch

Feb 4, 2016, 2:01:56 PM
to jw...@googlegroups.com
Maybe the pagelinks file is corrupted?
Can you unzip it on the command line?
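For example, something like this (gzip -t only tests the archive without extracting it; it prints nothing if the file is intact):

gzip -t enwiki-20160113-pagelinks.sql.gz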

-Torsten


xavier sumba

Feb 4, 2016, 3:48:59 PM
to jwpl-users
Yes, I ran it from the source code. The error is that when the program tries to uncompress enwiki-20160113-pagelinks.sql.gz, I get enwiki-20160113-pagelinks.sql.gz.cpgz, and this behaviour repeats in a cycle. So it's not an error in JWPL; it is out of its scope. I would just like to know whether there is another way to download that pagelinks file?

xavier sumba

Feb 4, 2016, 3:49:43 PM
to jwpl-users
And are the sizes of the files OK?

Cheers.

Oliver Ferschke

Feb 4, 2016, 3:55:07 PM
to jw...@googlegroups.com
Hi, 

you can download the files manually with wget, a browser, or a download tool.

The download page also contains md5 and sha1 checksums, so you can make sure that the downloaded files are not corrupted.
Afaik, you can only download one (or two?)  files at a time from any given IP.
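For example, with GNU coreutils (on OS X, md5 and shasum are the rough equivalents) you could compare the published digest against your own; the file names here are just an example:

md5sum enwiki-20160113-pagelinks.sql.gz
grep pagelinks enwiki-20160113-md5sums.txt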

-Oliver 

xavier sumba

Feb 4, 2016, 4:24:58 PM
to jwpl-users
Hi Oliver,

Thank you so much; I spent weeks trying to download those files. I will share the commands I am going to use, in case anyone else is facing this problem.
I am using the latest snapshot: https://dumps.wikimedia.org/enwiki/latest/
I am using curl to download the following files:
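For example, downloads along these lines (hypothetical invocations; -O keeps the server-side file name and -C - resumes an interrupted transfer):

curl -C - -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
curl -C - -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pagelinks.sql.gz
curl -C - -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-categorylinks.sql.gz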


And you can check that your files are not corrupted using the checksums in enwiki-latest-md5sums.txt and enwiki-latest-sha1sums.txt.

Hope it helps; I will keep you informed.

Cheers.

xavier sumba

Feb 11, 2016, 1:35:33 AM
to jwpl-users
Hi,

I have downloaded all the files, and the file sizes are OK.

1.75GB / enwiki-20160204-categorylinks.sql.gz
5.05GB / enwiki-20160204-pagelinks.sql.gz
12.7GB / enwiki-20160204-pages-articles.xml.bz2

After that I tried many times, but I could not fix the error to get my transformations. I executed this:

java -Xmx14G -jar de.tudarmstadt.ukp.wikipedia.datamachine-1.0.0-jar-with-dependencies.jar english Contents Disambiguation_pages dumps/en/20160204/

This is the error: 

21:15:22,638  INFO XmlBeanDefinitionReader:315 - Loading XML bean definitions from class path resource [context/applicationContext.xml]
21:15:22,888  INFO Log4jLogger:21 - parse input dumps...
21:15:22,890  INFO Log4jLogger:21 - Discussions are unavailable
00:17:58,140  INFO Log4jLogger:21 - 8192
com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:546)
com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1735)
com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1606)
com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1644)
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1748)
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2973)
com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:648)
com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:332)
javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:208)
de.tudarmstadt.ukp.wikipedia.datamachine.dump.xml.XML2Binary.<init>(XML2Binary.java:44)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:65)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:59)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:57)


Why is this happening?

Cheers.

Torsten Zesch

Feb 11, 2016, 4:21:08 AM
to jw...@googlegroups.com
Have you checked the integrity of the downloaded files using the provided checksums?

-Torsten

Johannes Daxenberger

Feb 11, 2016, 5:36:24 AM
to jw...@googlegroups.com, Dr. Torsten Zesch
Hi Xavier,

this is a known (but not yet solved) problem. See my post from April last year (and a previous one from September 25, 2014):

"I ran into this exact problem on a recent (2015) en_wiki dump. It seems to be a known issue with Xerces 2.7.1 used by OpenJDK 1.7 (http://stackoverflow.com/questions/22891411/java-xerces-java-lang-arrayindexoutofboundsexception-8192https://bugs.openjdk.java.net/browse/JDK-7156085). 

I could solve this problem by using a newer version of Xerces (e.g. 2.11, following this guide: http://a-sirenko.blogspot.de/2013/08/jdk-7-sax-parser-produces.html): 
java -Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl -cp de.tudarmstadt.ukp.wikipedia.datamachine-1.0.0-jar-with-dependencies.jar:xercesImpl.jar:xml-apis.jar de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine

However, this solution didn’t make me happy for too long. After a couple of minutes/hours of processing, the new XML parser seems to have a different problem:

15:17:24,501  INFO Log4jLogger:21 - org.xml.sax.SAXParseException; lineNumber: 467866063; columnNumber: 343; Invalid byte 2 of 4-byte UTF-8 sequence.
de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:212)
de.tudarmstadt.ukp.wikipedia.datamachine.dump.xml.XML2Binary.<init>(XML2Binary.java:44)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:65)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:59)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:57)

So far, I couldn’t figure out why this happens; the dump seems to be fine (no suspicious characters in the given line, as far as I can tell – the uncompressed file has 120GB), -Dfile.encoding is UTF8."

You can try the above solution and see if it works for you. Which Java version are you using?
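If you want to experiment outside of JWPL, here is a minimal, hypothetical sketch of a more lenient parse (not JWPL code, just the bare idea): an InputStreamReader constructed with a charset name replaces malformed UTF-8 sequences with U+FFFD instead of throwing, so handing the parser a Reader bypasses Xerces' own byte-level UTF-8 decoding.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class LenientUtf8Parse {
    // args[0]: path to the (uncompressed) XML dump
    public static void main(String[] args) throws Exception {
        // The String-charset constructor replaces malformed byte
        // sequences with U+FFFD instead of aborting the parse.
        try (Reader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(args[0]), "UTF-8"))) {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(reader), new DefaultHandler());
        }
    }
}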

Best,
Johannes

xavier sumba

Feb 11, 2016, 3:35:15 PM
to jwpl-users
Yes, my checksums are the same. I was so happy, because the download took so long.

Cheers.

...

xavier sumba

Feb 11, 2016, 3:39:01 PM
to jwpl-users, torste...@uni-due.de
Hi Johannes,

I tried your solution.

java -Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl -cp de.tudarmstadt.ukp.wikipedia.datamachine-1.0.0-jar-with-dependencies.jar:xercesImpl.jar:xml-apis.jar de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine english Contents Disambiguation_pages ../dumps/en/20160204/


It seems I am getting the same error that you were facing.


10:44:15,746  INFO XmlBeanDefinitionReader:315 - Loading XML bean definitions from class path resource [context/applicationContext.xml]
10:44:16,048  INFO Log4jLogger:21 - parse input dumps...
10:44:16,048  INFO Log4jLogger:21 - Discussions are unavailable
14:25:47,052  INFO Log4jLogger:21 - org.xml.sax.SAXParseException; lineNumber: 373897168; columnNumber: 281; Invalid byte 2 of 4-byte UTF-8 sequence.
de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:212)
de.tudarmstadt.ukp.wikipedia.datamachine.dump.xml.XML2Binary.<init>(XML2Binary.java:44)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:65)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:59)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:57)



Cheers.