DataMachine Transformation of WikipediaDump

Shadow

May 10, 2020, 8:36:00 AM
to jwpl-users
When using the DataMachine to transform the Wikipedia dump, I executed the following command in my terminal: java -jar JWPLDataMachine.jar [LANGUAGE] [MAIN_CATEGORY_NAME] [DISAMBIGUATION_CATEGORY_NAME] [SOURCE_DIRECTORY]

I added -Xmx4g to assign additional memory.
When the DataMachine runs, it produces 3 bin files and an output folder, which should contain 11 txt files after a successful run.
After one and a half hours of execution, the DataMachine created a separate txt file containing the following message:

/*

"Date/Time","Total Memory","Free Memory","Message"
"2020.05.09 16:47:15","257425408","245268928","parse input dumps..."
"2020.05.09 16:47:15","257425408","245268928","Discussions are unavailable"
"2020.05.09 18:25:07","209190912","141272304","org.xml.sax.SAXParseException; lineNumber: 57156821; columnNumber: 399; JAXP00010004: The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING".

de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:209)
de.tudarmstadt.ukp.wikipedia.datamachine.dump.xml.XML2Binary.<init>(XML2Binary.java:47)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.processInputDump(DataMachineGenerator.java:70)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.DataMachineGenerator.start(DataMachineGenerator.java:64)
de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(JWPLDataMachine.java:64)"

*/

The output folder remains empty while the DataMachine is still running. I assume this message means that something was interrupted. To me it looks like a lack of available memory, but that is just a guess, and I had already assigned additional memory with the -Xmx4g flag. Can somebody explain what the problem is and how the DataMachine can be run successfully when such a problem occurs?

Thanks in advance

Torsten Zesch

May 10, 2020, 2:52:03 PM
to jw...@googlegroups.com
Please try
System.setProperty("jdk.xml.totalEntitySizeLimit", String.valueOf(Integer.MAX_VALUE));
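A minimal sketch of where a call like this could live, assuming you start the DataMachine from your own small launcher class instead of directly with java -jar. The names EntityLimitLauncher and raiseEntityLimit are made up for illustration. The property has to be set before the XML parser is created, and the hand-off to JWPLDataMachine.main (the entry point named in the stack trace above) is left commented out so the sketch compiles without the JWPL jar on the classpath.

```java
// Hypothetical launcher: sets the JAXP entity-size limit, then would hand off to the DataMachine.
class EntityLimitLauncher {
    static final String LIMIT_PROPERTY = "jdk.xml.totalEntitySizeLimit";

    /** Sets the limit to Integer.MAX_VALUE unless a value was already supplied (e.g. via -D). */
    static String raiseEntityLimit() {
        if (System.getProperty(LIMIT_PROPERTY) == null) {
            System.setProperty(LIMIT_PROPERTY, String.valueOf(Integer.MAX_VALUE));
        }
        return System.getProperty(LIMIT_PROPERTY);
    }

    public static void main(String[] args) throws Exception {
        // Must happen before any XML parsing starts.
        System.out.println(LIMIT_PROPERTY + " = " + raiseEntityLimit());

        // Assumption: with the JWPL jar on the classpath, the original entry point
        // could be called here with the same four arguments as on the command line:
        // de.tudarmstadt.ukp.wikipedia.datamachine.domain.JWPLDataMachine.main(args);
    }
}
```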
--
You received this message because you are subscribed to the Google Groups "jwpl-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
jwpl+uns...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/jwpl/4eb300df-38b2-45f6-82c1-1f834bb8652f%40googlegroups.com.

Shadow

May 10, 2020, 3:40:21 PM
to jwpl-users
Thanks for your quick response, but where and how do I apply this? Is there a document or file I can put it in?

Torsten Zesch

May 10, 2020, 3:46:48 PM
to jw...@googlegroups.com

Shadow

May 11, 2020, 6:24:26 AM
to jwpl-users
Thanks. Roughly how long does it take to transform the Wikipedia dump? I started the process yesterday evening and it is still running, with my laptop switching to standby from time to time.

Shadow

May 11, 2020, 8:30:52 AM
to jwpl-users
And is it necessary to put the number for totalEntitySizeLimit in quotation marks?



Torsten Zesch

May 11, 2020, 8:40:33 AM
to jw...@googlegroups.com
You can try
java -Djdk.xml.totalEntitySizeLimit=PUT_LARGE_VALUE_HERE
or
java "-Djdk.xml.totalEntitySizeLimit=PUT_LARGE_VALUE_HERE"

The process will run quite a while depending on your system and which language you are trying to process. Possibly a day or two.

Shadow

May 11, 2020, 10:56:20 AM
to jwpl-users
This morning my project supervisor suggested this: -Djdk.xml.totalEntitySizeLimit="2147483647"

Shadow

May 12, 2020, 2:04:54 AM
to jwpl-users
This morning I ended up with one txt file in my output folder, Category.txt, and nearly 3 GB of RAM in use by the DataMachine. The latest message in my txt log file is "2020.05.12 07:36:04","2919759872","142793144","Pages 17130000". According to the Task Manager the process has stopped running, but the other 10 txt files are still missing from my output folder.



Torsten Zesch

May 12, 2020, 3:36:50 AM
to jw...@googlegroups.com
Depending on the language version you are trying to process, 3GB of RAM is probably not enough.

Any other error messages?

Shadow

May 12, 2020, 4:33:00 AM
to jwpl-users
I have 16 GB of RAM on my current laptop, with 51% in use in total by the DataMachine and other applications. There were no other error messages in the txt log file. I am using the latest English Wikipedia dump (roughly 25 GB in size). The console only shows the log4j warnings, and according to the console the process is still running.

These are my Wikipedia dump files:
enwiki-latest-pages-articles.xml.bz2
enwiki-latest-pagelinks.sql.gz
enwiki-latest-categorylinks.sql.gz

This is what the txt file looks like:

"Date/Time","Total Memory","Free Memory","Message"
"2020.05.11 14:28:45","257425408","245268800","parse input dumps..."
"2020.05.11 14:28:45","257425408","245268800","Discussions are unavailable"
"2020.05.12 07:10:26","93847552","91990040","processing table page..."
"2020.05.12 07:10:27","93847552","89760360","Pages 10000"
"2020.05.12 07:10:27","93847552","87259704","Pages 20000"
"2020.05.12 07:10:28","93847552","85759304","Pages 30000"
(......)
"2020.05.12 07:36:02","2919759872","150920712","Pages 17110000"
"2020.05.12 07:36:03","2919759872","147250480","Pages 17120000"
"2020.05.12 07:36:04","2919759872","142793144","Pages 17130000"
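
To read the memory columns in a log like this, the following sketch parses one line and computes how much of the allocated heap is in use. It assumes the two columns are the values of Runtime.totalMemory() and Runtime.freeMemory() in bytes, which the log itself does not state. MemoryLogLine and its methods are hypothetical names.

```java
// Hypothetical helper for reading one line of the DataMachine memory log.
class MemoryLogLine {
    /** Splits a fully quoted CSV line into its fields (assumes no "," sequence inside a field). */
    static String[] fields(String line) {
        String trimmed = line.trim();
        // Drop the first and last quote, then split on the quote-comma-quote separators.
        return trimmed.substring(1, trimmed.length() - 1).split("\",\"");
    }

    /** Used bytes = "Total Memory" column minus "Free Memory" column. */
    static long usedBytes(String line) {
        String[] f = fields(line);
        return Long.parseLong(f[1]) - Long.parseLong(f[2]);
    }

    public static void main(String[] args) {
        String last = "\"2020.05.12 07:36:04\",\"2919759872\",\"142793144\",\"Pages 17130000\"";
        long used = usedBytes(last);
        System.out.printf("used: %d bytes (%.2f GiB)%n", used, used / (1024.0 * 1024 * 1024));
    }
}
```

For the last line above this comes to 2776966728 bytes in use out of 2919759872 allocated, so under that assumption the heap was almost full when the log stopped.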

Shadow

May 12, 2020, 8:24:14 AM
to jwpl-users
Is it a problem that I didn't add the -Xmx4g flag to the command line? I only executed java -Djdk.xml.totalEntitySizeLimit=PUT_LARGE_VALUE_HERE -jar [rest of command].



Torsten Zesch

May 12, 2020, 8:28:47 AM
to jw...@googlegroups.com
Without the flag, it takes the default (whatever that is).

If you didn't assign enough memory, processing will get slower and slower as you approach the memory limit and then fail at some point.
If you have 16GB, try as much as you can spare (or move processing to a dedicated server).
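
To see what heap limit the JVM actually ends up with, with or without -Xmx, a small check like the following may help. It only relies on Runtime.getRuntime().maxMemory(); the class name HeapCheck and its methods are made up for illustration.

```java
import java.util.Locale;

// Hypothetical check: prints the maximum heap size the current JVM will use.
class HeapCheck {
    /** Maximum heap in bytes, as reported by the running JVM. */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Formats a byte count as GiB with two decimals. */
    static String describe(long bytes) {
        return String.format(Locale.ROOT, "%.2f GiB", bytes / (1024.0 * 1024 * 1024));
    }

    public static void main(String[] args) {
        System.out.println("max heap: " + describe(maxHeapBytes()));
    }
}
```

Running it once as java HeapCheck.java and once as java -Xmx8g HeapCheck.java (single-file launch, Java 11 or newer) should show the difference between the default and an explicit setting.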

Shadow

May 13, 2020, 3:03:06 PM
to jwpl-users
To import the txt files into the database I used the following command line: mysqlimport -u root -p --local --default-character-set=utf8 wikipediadump.category "C:\Users\jacqu\OneDrive\Desktop\Projekt KI\GameRecommender\Output\Category.txt"

Normally I would expect the console to ask for my password and then, once the import has finished, to show something like this: "wikipedia.category: Records: [number] Deleted: [number] Skipped: [number] Warnings: [number]". Instead the console returns ->, so as far as I can tell nothing happened at this point.

Judging by the mysqlimport specification, I don't see what could be wrong with the command line shown above. -u is the username, -p is the password, --local and --default-character-set are mysqlimport options given by the JWPL documentation, wikipediadump is the name of my database, and category is the name of the table for the data contained in the category.txt file.