Using OpenNlpTextAnalyzer.pear with UIMA : Performance

kiran

Mar 20, 2013, 3:23:23 AM

to digita...@googlegroups.com

Hi,

Has anyone used the OpenNlpTextAnalyzer.pear with UIMA?

I tried it with the latest Behemoth trunk, but the tasks fail in Hadoop with this error:

Task attempt_201303200315_0006_m_000014_0 failed to report status for 611 seconds. Killing!

My document set is quite small (8 WARC files of around 100 MB). I will try increasing the TaskTracker timeout.

Does anyone have an idea why UIMA processing is this slow?

Thank you,

Kiran.

DigitalPebble

Mar 20, 2013, 6:00:40 AM

to digita...@googlegroups.com

Hi Kiran

I don't think the UIMA module in general is to blame but there is probably something specific to the PEAR you are using.

Why don't you use jstack to see what it is busy with? Is there anything interesting in the logs? You've used Tika to extract the text from the WARC files, right? Does it get stuck on a specific document?

Julien

--
You received this message because you are subscribed to the Google Groups "DigitalPebble" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digitalpebbl...@googlegroups.com.
To post to this group, send an email to digita...@googlegroups.com.
Visit this group at http://groups.google.com/group/digitalpebble?hl=en-GB.
For more options, visit https://groups.google.com/groups/opt_out.

--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com
http://www.digitalpebble.com

kiran chitturi

Mar 20, 2013, 9:34:01 AM

to digita...@googlegroups.com

Hi Julien,

I was able to run jstack (jstack -l $pid), and the output shows a lot of threads in the WAITING and TIMED_WAITING states. Please see the part of the thread dump below.

As you said, this is something very specific to the PEAR file, and I am wondering where it got stuck. I am running LucidWorks Big Data Software, and the 'UIMA' job using that software is still running from last night (7+ hours) for 128k documents (16k HTML rows and other empty content).

I will try UIMA with the latest code in Behemoth and check how long it takes to run.

I've used Tika to process the HTML files that come from the WARC files (both with Behemoth and with the LucidWorks software) and it is working great. I was able to get more metadata from the files.

Last night I was looking at the annotations that Tika produces and wondering how to filter them by the tags we are interested in. An example would be extracting all images, or links to images, from the HTML files. Can we modify the Tika annotations so that only the 'img' tags are kept and none of the other markup annotations? Do you know if this is possible?

Thank you,

Kiran.

Full thread dump OpenJDK 64-Bit Server VM (20.0-b12 mixed mode):

"Attach Listener" daemon prio=10 tid=0x00007f4694002000 nid=0x60ca waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"LeaseChecker" daemon prio=10 tid=0x00007f46b05a6800 nid=0x4ed1 waiting on condition [0x00007f46a9d7a000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:1379)
        at java.lang.Thread.run(Thread.java:679)

"IPC Client (47) connection to LucidN1/192.168.201.1:50002 from hadoop" daemon prio=10 tid=0x00007f46b059d000 nid=0x4ecc in Object.wait() [0x00007f46a9f7c000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:706)
        - locked <0x00000000eb07c248> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:748)
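
The threads in this excerpt are all Hadoop housekeeping threads, so they say little about what the mapper itself is doing. A minimal sketch for grouping the threads of a full jstack dump by state, to make the busy (RUNNABLE) ones easier to spot, might look like this. The class name ThreadDumpSummary is made up for illustration, and the parsing assumes the usual jstack layout (a quoted thread name line followed by a java.lang.Thread.State line).

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper, not part of Behemoth or Hadoop.
public class ThreadDumpSummary {
    // A thread header line starts with the thread name in double quotes.
    private static final Pattern NAME = Pattern.compile("^\"([^\"]+)\"");
    // The state line follows the header, e.g. "java.lang.Thread.State: TIMED_WAITING (sleeping)".
    private static final Pattern STATE = Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");

    /** Maps each thread state to the names of the threads reported in that state. */
    public static Map<String, List<String>> byState(String dump) {
        Map<String, List<String>> result = new TreeMap<>();
        String current = null;
        for (String raw : dump.split("\\R")) {
            String line = raw.trim();
            Matcher name = NAME.matcher(line);
            if (name.find()) {
                current = name.group(1);
                continue;
            }
            Matcher state = STATE.matcher(line);
            if (current != null && state.find()) {
                result.computeIfAbsent(state.group(1), k -> new ArrayList<>()).add(current);
                current = null; // wait for the next thread header
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to a file holding the output of "jstack -l <pid>"
        String dump = new String(Files.readAllBytes(Paths.get(args[0])));
        byState(dump).forEach((state, names) -> System.out.println(state + " " + names));
    }
}
```

Any RUNNABLE thread whose stack frames sit in UIMA or OpenNLP classes would be the first place to look.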

--

Kiran Chitturi

DigitalPebble

Mar 20, 2013, 9:47:27 AM

to digita...@googlegroups.com

Hi Kiran

> I will try UIMA with the latest code in Behemoth and check how long it takes to run.

That would be a good comparison to have. The problem could come from LucidWorks.

> Last night I was looking at the annotations that Tika produces and wondering how to filter them by the tags we are interested in. An example would be extracting all images, or links to images, from the HTML files. Can we modify the Tika annotations so that only the 'img' tags are kept and none of the other markup annotations? Do you know if this is possible?

There is currently no filter to do that, but the place to add it would be https://github.com/DigitalPebble/behemoth/blob/master/tika/src/main/java/com/digitalpebble/behemoth/tika/TikaProcessor.java#L257, just like we do with GATE or UIMA annotations.

You could also implement your own TikaProcessor, specify it with the -t parameter, and put the filtering logic there, but doing it in the generic TikaProcessor would make more sense. Contributions and patches welcome as usual ;-)
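
As a rough illustration of the filtering logic only, here is a minimal sketch. MarkupAnnotation is a hypothetical stand-in and not Behemoth's own annotation class, and keepTypes is an invented name; the real change would be applied to whatever annotation list TikaProcessor builds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch; none of these names come from Behemoth or Tika.
public class TagFilterSketch {
    /** Hypothetical stand-in for a markup annotation: a tag name plus character offsets. */
    public static final class MarkupAnnotation {
        public final String type;
        public final long start;
        public final long end;

        public MarkupAnnotation(String type, long start, long end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns only the annotations whose type is in the wanted set.
     * The wanted set must hold lower-case tag names; annotation types
     * are lower-cased before the lookup, so "IMG" matches "img".
     */
    public static List<MarkupAnnotation> keepTypes(List<MarkupAnnotation> all, Set<String> wanted) {
        List<MarkupAnnotation> kept = new ArrayList<>();
        for (MarkupAnnotation a : all) {
            if (wanted.contains(a.type.toLowerCase(Locale.ROOT))) {
                kept.add(a);
            }
        }
        return kept;
    }
}
```

Calling keepTypes(annotations, Collections.singleton("img")) would then drop everything except the image tags, and the wanted set could be read from a job configuration property.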

Julien

kiran

Mar 21, 2013, 3:06:03 AM

to digita...@googlegroups.com, jul...@digitalpebble.com

On Wednesday, March 20, 2013 9:47:27 AM UTC-4, DigitalPebble wrote:

> > I will try UIMA with the latest code in Behemoth and check how long it takes to run.
>
> That would be a good comparison to have. The problem could come from LucidWorks.

The LucidWorks job kept running for 18+ hours, and the one I am running now with Behemoth alone has been running for 4+ hours and is 54% complete. A few tasks keep getting killed because of the TaskTracker timeout. I think the processing with the OpenNLP PEAR file gets stuck somewhere in the data and cannot make progress, so the TaskTracker kills the task after a while. I am not sure how this can be fixed.

DigitalPebble

Mar 21, 2013, 4:46:58 AM

to digita...@googlegroups.com

Hi

You can specify a larger timeout for the MapReduce tasks, which should help. It would be good to understand why that particular PEAR is so slow though.

I've never used OpenNLP very much, but knowing that it is based mostly on ML models, it could be that the memory required is pretty large. Check the memory usage on your machine and maybe try adding more memory to the MapReduce JVMs. VisualVM is a good tool to use locally to see how much time is spent garbage collecting, etc. It could also be that your machine is swapping madly if it doesn't have enough RAM.
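
If attaching VisualVM to a task JVM is awkward, a similar reading can be taken from inside the process with the standard java.lang.management beans. This is only a sketch: the GcSnapshot class is invented for illustration, and where to call it (for example, every few thousand documents in the mapper) is an assumption.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Illustrative helper built on the JDK's management beans.
public class GcSnapshot {
    /** Total milliseconds spent in garbage collection so far, summed over all collectors. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** One-line summary of heap usage and GC time relative to JVM uptime. */
    public static String describe() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        long mb = 1024L * 1024L;
        // heap.getMax() is -1 when no maximum is defined; it then prints as 0 MB.
        return String.format("heap used=%d MB, max=%d MB, gc=%d ms of %d ms uptime",
                heap.getUsed() / mb, heap.getMax() / mb, totalGcMillis(), uptimeMs);
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

A GC time that is a large fraction of uptime, with used heap close to the maximum, would point towards giving the task JVMs more memory (on Hadoop 1.x that is normally done through mapred.child.java.opts).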

It would also be good to add some info in the logs about how much time is spent initialising the tasks, i.e. in the case of UIMA, loading the pipeline from the PEAR files. It's probably not the reason why your processes are running out of time, but I'd expect loading the ML models within the PEAR to be quite time-consuming.

BTW https://github.com/DigitalPebble/behemoth/issues/40 would allow reusing the UIMA pipelines via mapred.job.reuse.jvm.num.tasks and would limit the time spent initialising. Again, I think your problem is probably elsewhere.
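
The timing itself needs very little code. A minimal sketch, assuming the pipeline loading can be wrapped in a single call: InitTimer and timed are invented names, and loadPipeline in the usage note is a placeholder for whatever actually loads the PEAR.

```java
import java.util.function.Supplier;

// Illustrative helper; not part of Behemoth or UIMA.
public class InitTimer {
    /** Runs the step, prints how long it took to stderr, and returns the step's result. */
    public static <T> T timed(String label, Supplier<T> step) {
        long start = System.nanoTime();
        T result = step.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.err.println(label + " took " + elapsedMs + " ms");
        return result;
    }
}
```

In a mapper's setup this could be used as InitTimer.timed("load PEAR", () -> loadPipeline(pearPath)), with the message going to the task's stderr log; a real patch would more likely use the project's existing logger.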