> Can someone tell me how to:
>
> 1) Configure OCROpus layout analysis to only allow a single column of
> text, ignoring the amount of white space within any given line, and
> returning a single text string for each line.
Use make_SegmentPageBy1CP or make_SegmentPageBySmear instead of
make_SegmentPageByRAST. We also plan to add a trainable layout
analysis engine sometime this year.
> 2) Create a custom language model which only recognizes a fixed set of
> sequences of words from a custom dictionary, falling back to a
> standard language model if no match is found.
This will be supported in the beta release, which we're planning to
ship in April. Alternatively, you can try loading a new dictionary
into the Tesseract component.
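Until language models are supported, one stopgap is to constrain the
recognized text after the fact, outside OCRopus. What follows is only a
rough sketch in plain Python, not OCRopus or Tesseract code: the phrase
list, the function names, and the 0.8 similarity cutoff are all
assumptions made up for illustration. It snaps each recognized line to
the closest entry in a fixed phrase list and falls back to the raw
recognizer output when nothing is close enough.

```python
import difflib

# Hypothetical fixed set of allowed word sequences; replace with your own.
ALLOWED_PHRASES = [
    "total amount due",
    "invoice number",
    "date of issue",
]


def normalize(text):
    """Lowercase the text and collapse runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def constrain_line(ocr_line, phrases=ALLOWED_PHRASES, cutoff=0.8):
    """Return the closest allowed phrase, or the raw line if none is close.

    `cutoff` is the minimum difflib similarity ratio (0..1) required for
    a phrase to count as a match; 0.8 is an arbitrary starting point.
    """
    candidate = normalize(ocr_line)
    matches = difflib.get_close_matches(candidate, phrases, n=1, cutoff=cutoff)
    # Fall back to the unmodified recognizer output when nothing matches.
    return matches[0] if matches else ocr_line
```

You would call constrain_line() on each text line the recognizer
returns. This is fuzzy string matching rather than a real language
model, so it cannot influence the recognizer's own character
hypotheses; it only cleans up the final string.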
> I apologize if documentation already exists explaining how to do this.
> If so, please just point me to it.
> If I need to read some C++ code or example Lua scripts, that is fine,
> if you could kindly suggest which
> files I should look at.
If you want to use Tesseract with a different layout engine, you can
start with ocropus/ocroscript/scripts/rec-ltess.lua and replace the
layout engine as described above. You can also initialize Tesseract
with a different dictionary. This may require a patch to tess.pkg to
add whatever function is necessary to set the Tesseract dictionary;
if you write one, please send it to us.
If you want to use a different language model, rec-bpnet.lua and
build-ngram-model.lua sort of show you how to do it, but that isn't
supported yet and the code will be changing somewhat.
Cheers,
Thomas.