No *.txt files found in the output subfolder


nusn...@gmail.com

Sep 14, 2016, 8:54:56 PM
to jwpl-users
Hello,

I have downloaded the Wikipedia dump files and the DataMachine JAR with dependencies into the same folder, as follows:

de.tudarmstadt.ukp.wikipedia.datamachine-1.1.0-jar-with-dependencies.jar
enwiki-20160820-categorylinks.sql.gz
enwiki-20160820-pagelinks.sql.gz
enwiki-20160820-pages-articles.xml.bz2

I ran this command:

java -Xmx4g -jar de.tudarmstadt.ukp.wikipedia.datamachine-1.1.0-jar-with-dependencies.jar english Contents Disambiguation_pages ./


After it finishes, I see an output folder and three *.bin files:

page.bin
revision.bin
text.bin

But the output folder itself is empty.

Could you please let me know what is wrong with my command? I have tried several times, but the result is always the same.

Thank you,
Hoa

Johannes Daxenberger

Sep 19, 2016, 9:35:29 AM
to jw...@googlegroups.com

Hi,

Did you make sure that the program actually terminated (by itself)? The transformation can take a while (possibly days).

If so, you could try a smaller dump (e.g. a different language) to see whether that works; if it does, there may be a problem with the particular dump you are trying to process.
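The call is the same for any language; only the dump files and the wiki-specific main and disambiguation category titles change (the category names below are placeholders, since these differ per wiki):

java -Xmx4g -jar de.tudarmstadt.ukp.wikipedia.datamachine-1.1.0-jar-with-dependencies.jar <language> <MainCategory> <DisambiguationCategory> ./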


Best,

Johannes


nusn...@gmail.com

Sep 20, 2016, 3:07:21 AM
to jwpl-users

Thanks, Johannes, for the reply.

The program did actually terminate by itself.

I tested with a smaller dump (ukwiki) and it worked: I saw 11 text files in the output folder.

I then downloaded another dump (enwiki-20160901) and ran it again. It produced the 3 *.bin files and terminated without writing any text files to the output folder. There was no error message in the bash window.

Do you have any idea what could cause this problem?

Regards,
Hoa

Johannes Daxenberger

Sep 20, 2016, 5:29:30 AM
to jw...@googlegroups.com

Did you make sure there is enough space on your hard drive?

nusn...@gmail.com

Sep 20, 2016, 6:59:18 AM
to jwpl-users
Hi Johannes,

My hard disk still has 105 GB free.

Here are the 3 *.bin files I got after the process stopped:

-rw-rw-r-- 1 ngo010 ngo010    19160397 Sep 20 18:08 page.bin
-rw-rw-r-- 1 ngo010 ngo010     5087776 Sep 20 18:08 revision.bin
-rw-rw-r-- 1 ngo010 ngo010  5435083894 Sep 20 18:08 text.bin

It is strange that the dump file is 12 GB but text.bin is only 5 GB. I guess the process hits some break condition and terminates itself.

I have tested on two PCs running Ubuntu 14.04 and 16.04 (64-bit, 16 GB RAM).

I wonder whether other users have run into the same problem. Which dump files have you tested with version 1.1.0?

Thank you.
Hoa

nusn...@gmail.com

Sep 20, 2016, 10:29:53 PM
to jwpl-users
Hello,

I ran the JWPLDataMachine class in Eclipse, and when the process stopped, I saw the following log message:

12:10:23,792  INFO main Log4jLogger:logObject:28 - org.xml.sax.SAXParseException; lineNumber: 66640650; columnNumber: 2321; JAXP00010004: The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING".

de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:209)
de.tudarmstadt.ukp.wikipedia.datamachine.dump.xml.XML2Binary.<init>(XML2Binary.java:49)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:70)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:64)

I tried to set the following JVM parameter:

 -DentityExpansionLimit=100000000

But the same error occurred again.

I also tried to disable the secure-processing feature:

SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, false);
// note: this factory is created locally, so presumably it is never the one the DataMachine's reader actually uses

But it still does not work.

Have you encountered this situation? Could you please advise how to overcome it?

Regards,
Hoa

Nitish Gupta

Oct 20, 2016, 8:17:09 PM
to jwpl-users
Any updates? I am facing the same issue and have tried the same things Hoa suggested.

It doesn't seem to work. Any workarounds?

Torsten Zesch

Oct 21, 2016, 8:08:26 AM
to jw...@googlegroups.com
Sorry, I am not aware of anyone having come up with a workaround for this so far.

-Torsten


Nitish Gupta

Oct 21, 2016, 10:13:28 PM
to jwpl-users
I got it to work. I put the DataMachine Maven dependency into an empty Java project and ran this small program:
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine;

public class RunDataMachine {
    public static void main(String[] args) throws Exception {
        // raise the JAXP accumulated-entity-size limit before any parser is created
        System.setProperty("jdk.xml.totalEntitySizeLimit", "500000000");
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, false);
        // language, main category, disambiguation category, path where the wiki dump is stored
        String[] arg = {"english", "Contents", "Disambiguation_pages", "/save/ngupta19/enwiki/20160501/"};
        JWPLDataMachine.main(arg);
    }
}

This processed the dump into the 11 .txt files; it took around 7 hours. Just for bookkeeping: creating the SQL database and loading the tables took around 2-3 hours, and the initial indexing in Java around 1 hour.
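For reference, jdk.xml.totalEntitySizeLimit is a standard JAXP system property, so the same fix should also work from the command line without a wrapper project (untested sketch, reusing the original invocation from this thread):

java -Djdk.xml.totalEntitySizeLimit=500000000 -Xmx4g -jar de.tudarmstadt.ukp.wikipedia.datamachine-1.1.0-jar-with-dependencies.jar english Contents Disambiguation_pages ./

According to the JAXP documentation, setting the property to 0 removes the limit entirely.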
Thanks,
Nitish