Text-wrap recognition

Ajg

Aug 18, 2024, 12:48:13 PM
to tesseract-ocr
I have scans from various old (1920) magazines that have multiple blocks of text on a single page. Right now, OCR reads the text lines straight across the page, so the output interleaves the columns rather than continuing from the end of one column into the next. Is there any way to get the OCR text concatenated with one block following the next? Note: these are not all fixed-size columns. I tried all the page segmentation modes, but the best I get is interleaved text.

Ger Hobbelt

Aug 19, 2024, 5:25:32 AM
to tesser...@googlegroups.com

Regrettably, the only way I know with current tesseract is to work around the issue: create a column mask and apply it in a preprocessing step, feeding tesseract several images for a single page, one per column. In each image the other columns are tipp-exed (whited out, i.e. replaced by background-color rectangles) rather than cropped away, so tesseract's hOCR and TSV outputs still produce coordinates matching the entire page. Then collect the tesseract results for each image and stitch them together in a postprocessing step to reflow the text.
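
A minimal Python sketch of that mask-and-stitch workaround, under stated assumptions: the page is a grayscale image held as a plain list of pixel rows, the column boxes are already known, and `ocr` is a hypothetical callable standing in for whatever actually invokes tesseract on an image. None of these names come from tesseract itself.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive
Image = List[List[int]]          # grayscale rows: 0 = black ink, 255 = white paper


def mask_all_but(image: Image, keep: Box, background: int = 255) -> Image:
    """Return a copy of the page with everything outside `keep` whited out.

    The page size is unchanged, so coordinates reported for the masked
    image still refer to the full page.
    """
    left, top, right, bottom = keep
    return [
        [px if (top <= y < bottom and left <= x < right) else background
         for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]


def ocr_by_column(image: Image, columns: Sequence[Box],
                  ocr: Callable[[Image], str]) -> str:
    """Run `ocr` once per masked page and join the texts in column order.

    `ocr` is an assumed stand-in for the real tesseract call.
    """
    texts = [ocr(mask_all_but(image, box)).strip() for box in columns]
    return "\n\n".join(t for t in texts if t)
```

The order of `columns` is the reading order of the stitched output, so the caller decides how the blocks are reflowed. With real scans one would more likely do the masking with an imaging library; the nested-list form is only there to keep the sketch self-contained.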

Tesseract doesn't have a sophisticated page layout analysis module on board, so one is forced to use external means for that.
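
As one illustration of such an external step (an assumption for this sketch, not something tesseract provides and not a full layout analysis), column boxes can be estimated from a vertical projection profile: find the runs of pixel columns that contain ink, and split wherever an ink-free gutter is at least `min_gutter` pixels wide. The function name, thresholds, and list-of-rows image format are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive
Image = List[List[int]]          # grayscale rows: 0 = black ink, 255 = white paper


def find_columns(image: Image, ink_threshold: int = 128,
                 min_gutter: int = 2) -> List[Box]:
    """Split a page into full-height column boxes at wide ink-free gutters."""
    height = len(image)
    width = len(image[0]) if image else 0
    # has_ink[x] is True when pixel column x holds at least one dark pixel.
    has_ink = [any(row[x] < ink_threshold for row in image) for x in range(width)]

    boxes: List[Box] = []
    start = None  # left edge of the text run currently being scanned
    gap = 0       # consecutive ink-free pixel columns since the last ink
    for x in range(width + 1):  # x == width acts as a closing sentinel
        if x < width and has_ink[x]:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gutter or x == width:
                # The last inked pixel column is x - gap; right edge is exclusive.
                boxes.append((start, 0, x - gap + 1, height))
                start, gap = None, 0
    return boxes
```

This copes with columns of varying width, but it assumes every column runs the full height of the page and that `min_gutter` is wider than the gaps between words. Pages with headlines spanning several columns, or blocks stacked vertically, would need the same idea applied per horizontal band, or a proper layout analysis tool.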

HTH,

Ger