Re NER: I'd suggest that you look at GATE (http://gate.ac.uk), play with the GUI a bit and follow the tutorials there. It has far more available resources than UIMA and is IMHO more flexible. In particular, it comes with ANNIE, a simple NER application that GATE users often take as a starting point for building their own pipelines.
On the Behemoth front: a good way to start would be to look at how Behemoth converts the Nutch segments into a SequenceFile of BehemothDocument objects, use the CorpusReader to see what the content looks like, then try processing your corpus with the GATE app included in the tests, following the instructions on the Wiki.
Now that the main refactoring of the code is finished, I'll probably spend more time on the documentation. Any contributions, suggestions or questions are welcome.
HTH
Julien