> May I ask a dumb question? What would be the best way of getting me up
> to speed with Behemoth?
introductions / overview
> I am trying to do some focussed crawling with Nutch or Bixo, want to
> do some Named Entity Recognition (which sounds to me like UIMA or
> Gate). Now I have been aware of Behemoth for some time as a tool which
> helps do UIMA stuff on Hadoop but am only just getting round to
> installing it. However the documentation is a bit thin on the ground?
Did you have a look at http://github.com/jnioche/behemoth/wiki/howto ?
I've changed quite a few things in the way Behemoth's code is managed and
will update the wiki soon. However, with some knowledge of Hadoop you should
be able to run the examples.
> Should I learn everything I need about connecting Nutch to UIMA now -
> or will Behemoth help me with that?
Behemoth will convert the Nutch segments into its own representation which
will then be used as an input for GATE or UIMA (or whatever). It does not
connect them as such.
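
To make the "own representation" point a bit more concrete, here is a rough sketch of the idea, not Behemoth's actual code. All the names in it (PipelineSketch, Doc, Annotation, Annotator, termLookup) are made up for illustration, and the term lookup is only a toy stand-in for a real engine such as GATE or UIMA. The point is that the crawler output is converted once into a neutral document form, and whatever does the processing only ever sees that form.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical names throughout: this is NOT Behemoth's actual API.
final class PipelineSketch {

    /** A span of text labelled with a type, e.g. "Location". */
    static final class Annotation {
        final String type;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset

        Annotation(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Neutral document form that a converter would produce from crawl data. */
    static final class Doc {
        final String url;
        final String text;
        final List<Annotation> annotations = new ArrayList<>();

        Doc(String url, String text) {
            this.url = url;
            this.text = text;
        }
    }

    /** Any processing engine only ever sees a Doc, never the crawler's own format. */
    interface Annotator {
        void process(Doc doc);
    }

    /** Toy annotator: marks every occurrence of each listed term with the given type. */
    static Annotator termLookup(final String type, final List<String> terms) {
        return doc -> {
            for (String term : terms) {
                if (term.isEmpty()) {
                    continue; // an empty term would match everywhere
                }
                int from = doc.text.indexOf(term);
                while (from >= 0) {
                    doc.annotations.add(new Annotation(type, from, from + term.length()));
                    from = doc.text.indexOf(term, from + term.length());
                }
            }
        };
    }

    public static void main(String[] args) {
        Doc doc = new Doc("http://example.org/page", "Sheffield is north of London.");
        termLookup("Location", Arrays.asList("Sheffield", "London")).process(doc);
        for (Annotation a : doc.annotations) {
            System.out.println(a.type + " [" + a.start + "," + a.end + ") "
                    + doc.text.substring(a.start, a.end));
        }
    }
}
```

Swapping the toy lookup for a real engine would not change the document side at all, which is roughly what "it does not connect them as such" means in practice.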
> Do I just need to delve in and
> understand all the code?
Re NER: I'd suggest that you look at GATE (http://gate.ac.uk), play with the
GUI a bit and follow the tutorials there. It has far more available
resources than UIMA and is IMHO more flexible. In particular it comes with
ANNIE, which is a simple application for NER that is often used as a
starting point by GATE users to build their own pipeline.
On the Behemoth front: a good way to start would be to look at the way
Behemoth converts the Nutch segments into a SequenceFile of
BehemothDocument, read it back to see what the content looks like, then
try processing your corpus with the GATE app included in the tests,
following the instructions from the wiki.
Now that the main refactoring of the code is finished I'll probably spend
more time on the documentation. Any contributions, suggestions or questions
are welcome.
Open Source Solutions for Text Engineering