Better segmentation of text columns?


juliusrain

May 31, 2011, 6:49:08 PM
to ocr...@googlegroups.com
Hello,

I am trying to process a large number of pages from a book, preprocessed so that they contain only text. I ran ocropus-pseg (with the default RAST segmenter) on the attached page, but it did not recognize the two text columns under the word "Collect." It did, however, correctly recognize the two columns under "Lesson II." It looks as if the segmenter only recognizes columns when they span many lines of text. Does anyone know how the segmentation can be improved?

Thanks
sample.pseg.png

Tom

Jun 29, 2011, 5:47:37 PM
to ocr...@googlegroups.com
Short columns are generally a problem, and there is no simple, general-purpose solution that works automatically for arbitrary documents. In some cases, the only way to tell is to check which combination of lines makes the most sense at the textual level.
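
For the narrow case where the region under a heading has already been isolated, a whitespace-gutter check on the line bounding boxes can sometimes split a short two-column block. The following is only a minimal sketch of that heuristic in Python. It is not the RAST algorithm and not part of the OCRopus API; the names `Box`, `find_gutter`, `split_columns` and the `min_gap` threshold are hypothetical, and the boxes are assumed to come from whatever line detector you already use.

```python
from typing import List, Optional, Tuple

# Hypothetical line bounding box: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


def find_gutter(boxes: List[Box], min_gap: int = 20) -> Optional[Tuple[int, int]]:
    """Return the widest x-interval that no box covers and that has boxes
    on both sides, or None if no such interval is at least min_gap wide."""
    if not boxes:
        return None
    # Sort the horizontal extents and merge the ones that overlap.
    spans = sorted((x0, x1) for x0, _, x1, _ in boxes)
    merged = [list(spans[0])]
    for x0, x1 in spans[1:]:
        if x0 <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])
    # Each gap between consecutive merged extents is a candidate gutter.
    best = None
    for (_, left_end), (right_start, _) in zip(merged, merged[1:]):
        width = right_start - left_end
        if width >= min_gap and (best is None or width > best[1] - best[0]):
            best = (left_end, right_start)
    return best


def split_columns(boxes: List[Box], min_gap: int = 20) -> List[List[Box]]:
    """Split boxes into left/right columns at the gutter, each sorted top to
    bottom. Returns a single column if no gutter is found."""
    gutter = find_gutter(boxes, min_gap)
    if gutter is None:
        return [sorted(boxes, key=lambda b: b[1])]
    mid = (gutter[0] + gutter[1]) / 2
    left = sorted((b for b in boxes if b[2] <= mid), key=lambda b: b[1])
    right = sorted((b for b in boxes if b[0] >= mid), key=lambda b: b[1])
    return [left, right]


# Made-up coordinates for a two-line, two-column block.
short_block = [
    (50, 100, 400, 130), (450, 100, 800, 130),
    (50, 140, 390, 170), (450, 140, 790, 170),
]
columns = split_columns(short_block)
```

Note that a single full-width line inside the region (such as the heading itself) covers the gutter and makes the check fail, so the heading has to be excluded first. Ragged or narrow gutters will also defeat it, which is consistent with the point that no simple rule works for arbitrary documents.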

If you have a collection of pages, you can train layout analysis models on it. We've published a couple of papers on trainable layout analysis, but that code hasn't been integrated into OCRopus yet.