--
You received this message because you are subscribed to the Google Groups "DigitalPebble" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digitalpebbl...@googlegroups.com.
To post to this group, send an email to digita...@googlegroups.com.
Visit this group at http://groups.google.com/group/digitalpebble?hl=en-GB.
For more options, visit https://groups.google.com/groups/opt_out.
"Attach Listener" daemon prio=10 tid=0x00007f4694002000 nid=0x60ca waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"LeaseChecker" daemon prio=10 tid=0x00007f46b05a6800 nid=0x4ed1 waiting on condition [0x00007f46a9d7a000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:1379)
        at java.lang.Thread.run(Thread.java:679)

"IPC Client (47) connection to LucidN1/192.168.201.1:50002 from hadoop" daemon prio=10 tid=0x00007f46b059d000 nid=0x4ecc in Object.wait() [0x00007f46a9f7c000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:706)
        - locked <0x00000000eb07c248> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:748)
I was able to run (jstack -l $pid), and the output shows a lot of threads in the WAITING and TIMED_WAITING states. Please see the partial thread dump included with this message. As you said, this is something very specific to the PEAR file, and I am wondering where it got stuck. I am running the LucidWorks Big Data software, and the 'UIMA' job using that software has been running since last night (7+ hours) for 128k documents (16k HTML rows and other empty content). I will try UIMA with the latest code in Behemoth and check how long it takes to execute.
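For a longer dump, it can help to count the threads per state before reading individual stacks. The following is only a minimal sketch using the JDK alone; the class name ThreadStateSummary is made up for illustration, and it assumes the dump contains the usual "java.lang.Thread.State: ..." lines that jstack prints.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Counts threads per state in the text produced by jstack (hypothetical helper). */
class ThreadStateSummary {

    // Matches lines such as "java.lang.Thread.State: TIMED_WAITING (sleeping)".
    private static final Pattern STATE =
            Pattern.compile("java\\.lang\\.Thread\\.State:\\s+([A-Z_]+)");

    static Map<String, Integer> countStates(String dump) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by state name
        Matcher m = STATE.matcher(dump);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The three state lines from the dump in this thread.
        String dump = String.join("\n",
                "   java.lang.Thread.State: RUNNABLE",
                "   java.lang.Thread.State: TIMED_WAITING (sleeping)",
                "   java.lang.Thread.State: TIMED_WAITING (on object monitor)");
        System.out.println(countStates(dump)); // prints {RUNNABLE=1, TIMED_WAITING=2}
    }
}
```

The LeaseChecker and IPC Client threads shown in the dump are idle Hadoop housekeeping threads, so the stack worth reading is most likely that of the thread actually running the UIMA pipeline.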
I've used Tika to process the HTML files that come from the WARC (both with Behemoth and with the LucidWorks software), and it is working great. I was able to get more metadata from the files. Last night I was looking at the annotations that Tika produces and wondering how to filter them based on the tags we are interested in. For example, suppose I want to extract all images, or links to images, from the HTML files. Can we modify the Tika annotations so that only the 'img' tags are parsed and none of the other markup annotations? Do you know if this is possible?
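To illustrate the kind of filtering being asked about: Tika parsers report markup as SAX events to a ContentHandler, so one possible approach is a handler that ignores everything except img elements. The sketch below is not Behemoth's annotation API; the class ImgSrcCollector is hypothetical, and to stay self-contained it is driven by the JDK's own SAX parser on a small well-formed XHTML string. With Tika, the same handler would be passed to the parser's parse(stream, handler, metadata, context) call, and whether img elements reach it depends on the HTML mapper in use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** SAX handler that ignores all markup except img and records each src value (hypothetical). */
class ImgSrcCollector extends DefaultHandler {
    private final List<String> sources = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the parser is not namespace-aware, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("img".equalsIgnoreCase(name)) {
            String src = atts.getValue("src");
            if (src != null) {
                sources.add(src);
            }
        }
    }

    List<String> getSources() {
        return sources;
    }

    /** Demo driver: feeds well-formed XHTML through the JDK SAX parser. */
    static List<String> extract(String xhtml) throws Exception {
        ImgSrcCollector handler = new ImgSrcCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getSources();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>text</p><img src=\"a.png\"/>"
                + "<a href=\"x.html\">link</a><img src=\"b.jpg\" alt=\"b\"/></body></html>";
        System.out.println(extract(xhtml)); // prints [a.png, b.jpg]
    }
}
```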
Hi Kiran, on your plan to try UIMA with the latest code in Behemoth and check how long it takes: that would be a good comparison to have. The problem could come from LucidWorks.
The job with LucidWorks kept running for 18+ hours, and the one I am running now with Behemoth alone has been running for 4+ hours and is at 54% completion. A few tasks keep getting killed because of the task tracker timeout. I think the processing with the OpenNLP PEAR file gets stuck somewhere in the data and cannot make progress, so the task tracker kills the task after a while. I am not sure how this can be fixed.
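For context, Hadoop kills a task that reports no progress within the task timeout (mapred.task.timeout in Hadoop 1.x, 10 minutes by default). One possible way to stop a single problematic document from stalling a whole task is to give each document a time budget and skip it when the budget runs out. The sketch below uses only the JDK; TimeBoxedProcessor and runWithTimeout are hypothetical names, not part of Behemoth or UIMA, and the lambdas stand in for the real per-document processing call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Runs one unit of work with a time budget; returns a fallback when the budget is exceeded. */
class TimeBoxedProcessor {

    static <T> T runWithTimeout(ExecutorService pool, Callable<T> work,
                                long timeoutMillis, T fallback)
            throws InterruptedException, ExecutionException {
        Future<T> future = pool.submit(work);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker thread; code that ignores
            // interrupts keeps running in the background.
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        // Daemon threads, so a stuck worker cannot keep the JVM alive.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
        // Stand-ins for processing one document: one fast, one that "hangs".
        String fast = runWithTimeout(pool, () -> "annotated", 1000, "skipped");
        String slow = runWithTimeout(pool, () -> {
            Thread.sleep(5000);
            return "annotated";
        }, 100, "skipped");
        System.out.println(fast + " " + slow); // prints "annotated skipped"
        pool.shutdownNow();
    }
}
```

Logging the identifier of each skipped document would also show which inputs the PEAR annotator gets stuck on.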