Hi,
I have downloaded the Wikipedia dumps, but I am having trouble when I try to get a Page.
This is my wp-download configuration file:
[Configuration]
base_url = http://download.wikimedia.org
[Templates]
file_format = ${langcode}wiki-${date}-${filename}.${filetype}
language_dir_format = ${langcode}wiki
[Files]
pages-articles = True
categorylinks = True
pagelinks = True
[Filetypes]
pages-articles = xml.bz2
categorylinks = sql.gz
pagelinks = sql.gz
[Languages]
en = True
This is the command I used to download the dumps:
wp-download --resume -v /Users/cuent/Desktop/wikipedia/dumps
And I got these files (it took a long time):
enwiki-20160113-categorylinks.sql.gz 1.6G
enwiki-20160113-pagelinks.sql.gz 4.7G
enwiki-20160113-pages-articles.xml.bz2 12G
After that I ran the transformation, which took about four and a half hours. I used this command:
java -Xmx14G -jar de.tudarmstadt.ukp.wikipedia.datamachine-1.0.0-jar-with-dependencies.jar english Contents Disambiguation_pages dumps/en/20160113/
I got the following files:
page.bin 435M
revision
revision.bin 103M
text.bin 31G
output
|---Category.txt 61M
|---category_inlinks.txt 65M
|---category_outlinks.txt 65M
|---category_pages.txt 636M
|---page_categories.txt 636M
The documentation says I should get 11 files ("you should get 11 txt files in an “output” subfolder") in my output folder, but I only got 5.
This is my log; it ends with the following error:
"Date/Time","Total Memory","Free Memory","Message"
"2016.02.03 23:47:09","257425408","241131656","parse input dumps..."
"2016.02.03 23:47:09","257425408","241131656","Discussions are unavailable"
"2016.02.04 03:29:39","100139008","98329328","processing table page..."
"2016.02.04 03:29:40","100139008","96359680","Pages 10000"
........
"2016.02.04 03:35:48","2435317760","240240824","Pages 13530000"
"2016.02.04 03:35:48","2435317760","238847264","processing table categorylinks..."
"2016.02.04 03:35:49","2438463488","244665240","Categorylinks 10000"
........
"2016.02.04 03:46:37","2822242304","293641336","Categorylinks 97890000"
"2016.02.04 03:46:41","2823815168","820690024","processing table pagelinks..."
"2016.02.04 03:46:41","2823815168","820690024","Not in GZIP format
java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:78)
java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:90)
de.tudarmstadt.ukp.wikipedia.wikimachine.decompression.GZipDecompressor.getInputStream(GZipDecompressor.java:31)
de.tudarmstadt.ukp.wikipedia.wikimachine.decompression.UniversalDecompressor.getInputStream(UniversalDecompressor.java:204)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.createPagelinksParser(DataMachineGenerator.java:140)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:78)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:59)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:57)"
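The "Not in GZIP format" message in the log above is what java.util.zip.GZIPInputStream reports when a stream does not begin with the gzip magic bytes (0x1f 0x8b). One thing that may be worth checking is therefore whether the downloaded pagelinks file really starts with a gzip header. The following is only a minimal sketch of such a check, not part of JWPL; the class name GzipHeaderCheck and the method looksLikeGzip are made up for illustration, and a valid header does not prove the rest of the archive is intact.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of JWPL: inspects only the first two bytes of each file.
class GzipHeaderCheck {

    // Returns true when the file starts with the gzip magic bytes 0x1f 0x8b.
    static boolean looksLikeGzip(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            int first = in.read();   // -1 if the file is empty
            int second = in.read();
            return first == 0x1f && second == 0x8b;
        }
    }

    public static void main(String[] args) throws IOException {
        // Each argument is a path to a file that is supposed to be gzip-compressed.
        for (String arg : args) {
            Path file = Paths.get(arg);
            String verdict = looksLikeGzip(file) ? "gzip header found" : "no gzip header";
            System.out.println(file + ": " + verdict);
        }
    }
}
```

After compiling, it could be run as `java GzipHeaderCheck enwiki-20160113-pagelinks.sql.gz`. If it reports no gzip header, the file on disk is not what the DataMachine expects and re-downloading that one file is probably the next step.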
Then I tried to import my data into MySQL, but obviously not all of my tables will load successfully with files missing.
mysqlimport -uroot -p --local --default-character-set=utf8 wiki_dumps /Users/cuent/Desktop/wikipedia/dumps/en/20160113/output/*.txt
I want to run this example:
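import de.tudarmstadt.ukp.wikipedia.api.DatabaseConfiguration;
import de.tudarmstadt.ukp.wikipedia.api.Page;
import de.tudarmstadt.ukp.wikipedia.api.WikiConstants.Language;
import de.tudarmstadt.ukp.wikipedia.api.Wikipedia;
public class GetPageExample {
    // The JWPL calls throw checked exceptions, so main simply declares them.
    public static void main(String[] args) throws Exception {
        DatabaseConfiguration dbConfig = new DatabaseConfiguration();
        dbConfig.setHost("localhost");
        dbConfig.setDatabase("wiki_dumps");
        dbConfig.setUser("root");
        dbConfig.setPassword("password");
        dbConfig.setLanguage(Language.english);
        Wikipedia wiki = new Wikipedia(dbConfig);
        Page page = wiki.getPage("Hello World");
        System.out.println(page.getText());
    }
}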
So, what am I doing wrong? Why do I get 5 files instead of 11? How can I fix this?
Cheers.
--
You received this message because you are subscribed to the Google Groups "jwpl-users" group.
1.75GB / enwiki-20160204-categorylinks.sql.gz
5.05GB / enwiki-20160204-pagelinks.sql.gz
12.7GB / enwiki-20160204-pages-articles.xml.bz2
java -Xmx14G -jar de.tudarmstadt.ukp.wikipedia.datamachine-1.0.0-jar-with-dependencies.jar english Contents Disambiguation_pages dumps/en/20160204/
21:15:22,638 INFO XmlBeanDefinitionReader:315 - Loading XML bean definitions from class path resource [context/applicationContext.xml]
21:15:22,888 INFO Log4jLogger:21 - parse input dumps...
21:15:22,890 INFO Log4jLogger:21 - Discussions are unavailable
00:17:58,140 INFO Log4jLogger:21 - 8192
com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:546)
com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1735)
com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1606)
com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1644)
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1748)
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2973)
com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:648)
com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:332)
javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:208)
de.tudarmstadt.ukp.wikipedia.datamachine.dump.xml.XML2Binary.<init>(XML2Binary.java:44)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:65)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:59)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:57)
...
java -Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl -cp de.tudarmstadt.ukp.wikipedia.datamachine-1.0.0-jar-with-dependencies.jar:xercesImpl.jar:xml-apis.jar de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine english Contents Disambiguation_pages ../dumps/en/20160204/
It seems that I am hitting the same error that you are facing.
10:44:15,746 INFO XmlBeanDefinitionReader:315 - Loading XML bean definitions from class path resource [context/applicationContext.xml]
10:44:16,048 INFO Log4jLogger:21 - parse input dumps...
10:44:16,048 INFO Log4jLogger:21 - Discussions are unavailable
14:25:47,052 INFO Log4jLogger:21 - org.xml.sax.SAXParseException; lineNumber: 373897168; columnNumber: 281; Invalid byte 2 of 4-byte UTF-8 sequence.
de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:212)
de.tudarmstadt.ukp.wikipedia.datamachine.dump.xml.XML2Binary.<init>(XML2Binary.java:44)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:65)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:59)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:57)
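To find out whether the decompressed dump itself contains malformed UTF-8 (as the SAXParseException above suggests) or whether the problem lies elsewhere, the byte stream can be run through a strict decoder outside of any XML parser. The following is only a sketch under that assumption; the class name Utf8StreamCheck and the method countChars are hypothetical and not part of JWPL. It reads standard input, so the dump has to be decompressed by an external tool and piped in.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JWPL: strict UTF-8 validation of a byte stream.
class Utf8StreamCheck {

    // Decodes the whole stream as UTF-8 and returns the number of UTF-16 code units read.
    // Throws CharacterCodingException at the first malformed byte sequence.
    static long countChars(InputStream in) throws IOException {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)        // fail instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        long total = 0;
        char[] buffer = new char[1 << 16];
        try (Reader reader = new InputStreamReader(in, strict)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        try {
            long chars = countChars(System.in);
            System.out.println("Input is valid UTF-8 (" + chars + " UTF-16 code units)");
        } catch (CharacterCodingException e) {
            System.out.println("Malformed UTF-8 in input: " + e);
            System.exit(1);
        }
    }
}
```

A possible invocation is `bzcat enwiki-20160204-pages-articles.xml.bz2 | java Utf8StreamCheck`. If the strict decoder accepts the whole stream, the data is probably fine and the parser setup deserves a closer look; if it fails, the dump file is likely damaged and should be downloaded again.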