Behemoth Beginner's Guide?

alexmc

Dec 6, 2010, 11:23:10 AM
to DigitalPebble
May I ask a dumb question? What would be the best way of getting me up
to speed with Behemoth?

I am trying to do some focussed crawling with Nutch or Bixo, and want to
do some Named Entity Recognition (which sounds to me like UIMA or
GATE). I have been aware of Behemoth for some time as a tool which
helps do UIMA stuff on Hadoop, but am only just getting round to
installing it. However, the documentation is a bit thin on the ground.

Should I learn everything I need about connecting Nutch to UIMA now -
or will Behemoth help me with that? Do I just need to delve in and
understand all the code?

Cheers

Alex

DigitalPebble

Dec 6, 2010, 12:38:15 PM
to digita...@googlegroups.com
Hi Alex,

> May I ask a dumb question? What would be the best way of getting me up
> to speed with Behemoth?
>
> I am trying to do some focussed crawling with Nutch or Bixo, and want to
> do some Named Entity Recognition (which sounds to me like UIMA or
> GATE). I have been aware of Behemoth for some time as a tool which
> helps do UIMA stuff on Hadoop, but am only just getting round to
> installing it. However, the documentation is a bit thin on the ground.
 
Have you had a look at http://github.com/jnioche/behemoth/wiki/howto ?
I've changed quite a few things in the way Behemoth's code is managed and will update the wiki soon. However, with some knowledge of Hadoop you should be able to run the examples.
 

> Should I learn everything I need about connecting Nutch to UIMA now -
> or will Behemoth help me with that?

Behemoth will convert the Nutch segments into its own representation which will then be used as an input for GATE or UIMA (or whatever). It does not connect them as such.
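
To illustrate that decoupling, here is a minimal conceptual sketch. It is not Behemoth's API (Behemoth is Java and runs on Hadoop); `Doc`, `Annotation`, `from_crawl_record`, `toy_ner` and `process` are hypothetical names made up for this example, and the "NER" step is a toy stand-in for a GATE or UIMA pipeline. The point is only that the crawler output is converted once into a neutral document record, and any engine can then consume that record.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Annotation:
    """A typed span over the document text (hypothetical, for illustration)."""
    type: str
    start: int
    end: int


@dataclass
class Doc:
    """Engine-neutral document record (a stand-in, not Behemoth's own class)."""
    url: str
    text: str
    annotations: List[Annotation] = field(default_factory=list)


def from_crawl_record(record: Dict[str, str]) -> Doc:
    """Convert one crawler output record into the neutral representation."""
    return Doc(url=record["url"], text=record["content"])


def toy_ner(doc: Doc) -> Doc:
    """Toy stand-in for a GATE or UIMA pipeline: tags capitalised words."""
    pos = 0
    for token in doc.text.split():
        start = doc.text.index(token, pos)  # locate the token in the raw text
        end = start + len(token)
        if token[0].isupper():
            doc.annotations.append(Annotation("Entity", start, end))
        pos = end
    return doc


def process(records: Iterable[Dict[str, str]],
            engine: Callable[[Doc], Doc]) -> List[Doc]:
    """Convert every record, then hand each document to the chosen engine."""
    return [engine(from_crawl_record(r)) for r in records]
```

Swapping `toy_ner` for another function with the same signature changes the processing engine without touching the conversion step, which is the sense in which the crawler and the NLP tool are not connected directly.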
 
> Do I just need to delve in and
> understand all the code?

Re NER: I'd suggest that you look at GATE (http://gate.ac.uk), play with the GUI a bit and follow the tutorials there. It has far more available resources than UIMA and is IMHO more flexible. In particular it comes with ANNIE, which is a simple application for NER that is often used as a starting point by GATE users to build their own pipeline.

On the Behemoth front: a good way to start would be to look at the way Behemoth converts the Nutch segments into a SequenceFile of BehemothDocument, then use the CorpusReader to see what the content looks like. After that, try processing your corpus with the GATE app included in the tests, following the instructions from the wiki.

Now that the main refactoring of the code is finished I'll probably spend more time on the documentation. Any contributions, suggestions or questions are welcome.

HTH

Julien


--
 
Open Source Solutions for Text Engineering
 
http://digitalpebble.blogspot.com
http://www.digitalpebble.com
