Stylized layout and "language"

ken...@gmail.com

Feb 16, 2008, 3:44:44 PM

to ocropus

I want to use OCRopus to recognize store receipt slips, returning a
single line of text for each item on the receipt. The content is
mostly numbers and odd abbreviations that are not in an English
dictionary. However, the set of possible item descriptions is finite
for any given store, so very good recognition accuracy should be
possible given a suitable language definition and layout analysis.

Can someone tell me how to:

1) Configure OCRopus layout analysis to only allow a single column of
text, ignoring the amount of white space within any given line, and
returning a single text string for each line.

2) Create a custom language model which only recognizes a fixed set of
sequences of words from a custom dictionary, falling back to a
standard language model if no match is found.

I apologize if documentation already exists explaining how to do this.
If so, please just point me to it. If I need to read some C++ code or
example Lua scripts, that is fine, if you could kindly suggest which
files I should look at.

Ken Aird

Tom

Feb 16, 2008, 6:11:35 PM

to ocropus

> Can someone tell me how to:
>
> 1) Configure OCRopus layout analysis to only allow a single column of
> text, ignoring the amount of white space within any given line, and
> returning a single text string for each line.

Use make_SegmentPageBy1CP or make_SegmentPageBySmear instead of
make_SegmentPageByRAST. Sometime this year, we'll add another layout
analysis engine that's trainable.
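
To illustrate what a single-column segmentation is expected to return for
a receipt, here is a minimal Python sketch. It is not OCRopus code and does
not call any OCRopus function. It assumes a simple horizontal projection
profile, and the function name find_text_lines and the toy image are
hypothetical. Rows are grouped into lines using only the per-row ink count,
so white space inside a line can never split it into columns.

```python
from typing import List, Tuple


def find_text_lines(binary: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    `binary` is a row-major image where 1 means ink and 0 means background.
    Only the ink count of each row is used, so horizontal gaps inside a
    line are ignored.
    """
    lines = []
    start = None  # first row of the line currently being collected
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                  # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))   # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(binary)))  # line runs to the image edge
    return lines


# Toy image: two text lines separated by one blank row.
# The first line has a wide horizontal gap that must not split it.
page = [
    [1, 1, 0, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
]
print(find_text_lines(page))  # [(0, 2), (3, 4)]
```

Each returned row range would then be cropped and passed to the line
recognizer as one unit, giving one text string per receipt line.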

> 2) Create a custom language model which only recognizes a fixed set of
> sequences of words from a custom dictionary, falling back to a
> standard language model if no match is found.

This will be supported in the beta release; we're planning on
releasing that in April. Alternatively, you can try to load a new
dictionary into the Tesseract component.
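
The "fixed set of item descriptions, with a fallback" idea can also be
approximated outside the recognizer by post-correcting each recognized
line. The following is a minimal Python sketch under that assumption. It is
not an OCRopus or Tesseract language model; the function correct_line, the
item list, and the cutoff value are hypothetical, and only the standard
library difflib module is used.

```python
import difflib
from typing import List


def correct_line(raw: str, known_items: List[str], cutoff: float = 0.75) -> str:
    """Snap an OCR'd line to the closest known item description.

    If no known item is similar enough, return the raw text unchanged so
    that a general-purpose model (or a human) can deal with it.
    """
    # Collapse runs of white space and ignore case before comparing.
    normalized = " ".join(raw.upper().split())
    matches = difflib.get_close_matches(normalized, known_items, n=1, cutoff=cutoff)
    return matches[0] if matches else raw


# Hypothetical item descriptions for one store.
KNOWN_ITEMS = ["ORG BANANAS", "2% MILK 1GAL", "WHL WHEAT BRD"]

print(correct_line("0RG   BANANA5", KNOWN_ITEMS))  # ORG BANANAS
print(correct_line("TOTAL 12.50", KNOWN_ITEMS))    # TOTAL 12.50 (fallback)
```

The cutoff trades false corrections against missed ones. A real language
model inside the recognizer would use the character-level alternatives
rather than only the final string, so this is a stopgap at best.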

> I apologize if documentation already exists explaining how to do this.
> If so, please just point me to it. If I need to read some C++ code or
> example Lua scripts, that is fine, if you could kindly suggest which
> files I should look at.

If you want to use Tesseract with a different layout engine, you can
start with ocropus/ocroscript/scripts/rec-ltess.lua and replace the
layout engine as described above. You can also initialize Tesseract
with a different dictionary (this may require a patch to tess.pkg in
order to add whatever function is necessary to set the Tesseract
dictionary; if you do it, please send it to us.)

If you want to use a different language model, rec-bpnet.lua and
build-ngram-model.lua sort of show you how to do it, but that isn't
supported yet and the code will be changing somewhat.

Cheers,
Thomas.

Kurt Hardin

Oct 18, 2013, 1:20:49 PM

to ocr...@googlegroups.com

I realize this is a very old thread, but I'd like to know whether, and how, this type of layout analysis can be accomplished with OCRopus 0.7.
