Multithreaded Page parsing


Nitish Gupta

Oct 25, 2016, 3:11:44 PM
to jwpl-users
I am trying to parse pages in a multithreaded fashion to speed things up.

An org.hibernate.TransactionException is thrown when I call page.getPlainText() from a multi-threaded ExecutorService.

The exception originates in the Page.getCompiledPage() method.

It looks like concurrent queries to the SQL database are not allowed. Is there any way to make this process faster?
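For reference, this is roughly the pattern I am using (a simplified sketch; the database settings and the list of titles are just placeholders):

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import de.tudarmstadt.ukp.wikipedia.api.DatabaseConfiguration;
import de.tudarmstadt.ukp.wikipedia.api.Page;
import de.tudarmstadt.ukp.wikipedia.api.Wikipedia;
import de.tudarmstadt.ukp.wikipedia.api.WikiConstants.Language;

public class ParallelParseSketch {
    public static void main(String[] args) throws Exception {
        DatabaseConfiguration dbConfig = new DatabaseConfiguration();
        dbConfig.setHost("localhost");          // placeholder connection settings
        dbConfig.setDatabase("wikidb");
        dbConfig.setUser("user");
        dbConfig.setPassword("password");
        dbConfig.setLanguage(Language.english);

        // a single shared Wikipedia instance, used from every worker thread
        Wikipedia wiki = new Wikipedia(dbConfig);
        List<String> titles = Arrays.asList("Alan Turing", "Hilbert space"); // placeholder

        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (String title : titles) {
            pool.submit(() -> {
                Page page = wiki.getPage(title);    // concurrent access to the shared session
                return page.getPlainText();         // the TransactionException shows up here
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}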

Torsten Zesch

Oct 25, 2016, 3:25:05 PM
to jw...@googlegroups.com
As far as I know, we have never tried this with multithreading.
So sorry, no idea how to fix this.


Nitish Gupta

Oct 25, 2016, 3:32:37 PM
to jwpl-users
Okay, thanks. Do you have an estimate of how long it takes to parse all Wikipedia pages, i.e. just calling the page.getPlainText() function on each one?


Torsten Zesch

Oct 25, 2016, 3:41:39 PM
to jw...@googlegroups.com
It depends on what language version you are parsing and on your machine/setup.
For the larger Wikipedias, it could take more than a day.
Processing time is nearly linear, so you could time how long it takes to process e.g. 1000 pages and then extrapolate.
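Something along these lines should do (a rough sketch; it assumes you already have an initialized Wikipedia instance and know the approximate article count of your dump):

import de.tudarmstadt.ukp.wikipedia.api.Page;
import de.tudarmstadt.ukp.wikipedia.api.Wikipedia;

public class RuntimeEstimate {
    // time getPlainText() on a small sample and extrapolate to the full article count
    static void estimate(Wikipedia wiki, long totalArticles) throws Exception {
        int sample = 1000;
        int done = 0;
        long start = System.currentTimeMillis();
        for (Page page : wiki.getArticles()) {
            page.getPlainText();
            if (++done >= sample) {
                break;
            }
        }
        long elapsedMs = System.currentTimeMillis() - start;
        double estimatedHours = elapsedMs / 1000.0 / 3600.0 * totalArticles / done;
        System.out.printf("%d pages in %d ms -> roughly %.1f hours for %d pages%n",
                done, elapsedMs, estimatedHours, totalArticles);
    }
}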


Nitish Gupta

Oct 25, 2016, 8:09:05 PM
to jwpl-users
At the current rate, parsing the English Wikipedia looks like it will take about 5 days.

I have boiled the original issue down to the fact that only one Session is created in the code, and methods like Page.getText()/getTitle() etc. all use that same session.
Are there any plans to support multi-threaded operations using the API?
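In the meantime I am experimenting with giving each worker thread its own Wikipedia instance (and hence its own session) via a ThreadLocal. This is an untested sketch and I have not yet verified that it actually avoids the TransactionException:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import de.tudarmstadt.ukp.wikipedia.api.DatabaseConfiguration;
import de.tudarmstadt.ukp.wikipedia.api.Wikipedia;

public class PerThreadSessionSketch {
    static List<Future<String>> parse(DatabaseConfiguration dbConfig, List<String> titles) {
        // one Wikipedia instance (and thus one session) per worker thread;
        // the instances stay open for as long as the pool threads live
        ThreadLocal<Wikipedia> threadWiki = ThreadLocal.withInitial(() -> {
            try {
                return new Wikipedia(dbConfig);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });

        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<String>> results = new ArrayList<>();
        for (String title : titles) {
            results.add(pool.submit(() -> threadWiki.get().getPage(title).getPlainText()));
        }
        pool.shutdown();
        return results;
    }
}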

Torsten Zesch

Oct 26, 2016, 3:15:50 AM
to jw...@googlegroups.com
It is rather low on the list, but we accept patches :)


Nitish Gupta

Oct 26, 2016, 3:24:45 PM
to jwpl-users
Sure. I plan to add multi-threaded operations on a fork of your repo. I will send a pull request once I figure it out and am relatively free from my crazy schedule. You've been very helpful. Thanks :)