Re: [tesseract-ocr] Digest for tesseract-ocr@googlegroups.com - 2 updates in 1 topic


Ron Young

Aug 19, 2024, 11:17:05 AM
to tesser...@googlegroups.com
Take a look at the Python layout-parser package. It works very well for me.

<https://layout-parser.github.io/>


On August 19, 2024 8:03:53 AM PDT, tesser...@googlegroups.com wrote:
Ajg <ajg7...@gmail.com>: Aug 18 09:48AM -0700

I have scans of various old (1920) magazines with multiple blocks of text
on a single page. Right now, OCR reads each text line straight across the
page, so the output interleaves the columns instead of continuing down one
column and then moving to the next. Is there any way to get the OCR text
concatenated with one block following the next? Note: these are not all
fixed-size columns. I tried all the page segmentation modes (--psm), but
the best I get is interleaved text.
Ger Hobbelt <ger.h...@gmail.com>: Aug 19 11:25AM +0200

Regrettably, the only way I know with current tesseract is to work around
the issue: create a column mask and apply it in a preprocessing step,
feeding tesseract several images for a single page, one per column, in
which the other columns are Tipp-Exed (whited out, i.e. replaced by
background-colored rectangles). That way tesseract's hOCR and TSV outputs
still produce coordinates matching the entire page. Then collect the
tesseract results for each image and stitch them together in a
postprocessing step to reflow the text.
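
A minimal Python sketch of that workaround, assuming Pillow and pytesseract are installed and that the column rectangles are already known (drawn by hand or produced by some external layout tool). The names `Box`, `boxes_to_blank`, `mask_page` and `ocr_columns` are made up for illustration, and the white background and left-to-right column order are assumptions that will not hold for every magazine page:

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in page pixels (right/bottom are exclusive)."""
    left: int
    top: int
    right: int
    bottom: int


def boxes_to_blank(columns: list[Box], keep: int) -> list[Box]:
    """Return every column box except the one tesseract should read."""
    return [box for i, box in enumerate(columns) if i != keep]


def mask_page(page, columns: list[Box], keep: int, background="white"):
    """Copy a PIL image and paint over all columns except columns[keep]."""
    from PIL import ImageDraw  # Pillow, imported lazily

    masked = page.copy()
    draw = ImageDraw.Draw(masked)
    for box in boxes_to_blank(columns, keep):
        # ImageDraw rectangles include the bottom-right corner, hence the -1.
        draw.rectangle(
            (box.left, box.top, box.right - 1, box.bottom - 1),
            fill=background,
        )
    return masked


def ocr_columns(page, columns: list[Box]) -> str:
    """OCR one masked image per column and join the results in column order."""
    import pytesseract  # needs the tesseract binary on PATH

    # Assumption: reading order is simply left to right, then top to bottom.
    ordered = sorted(columns, key=lambda b: (b.left, b.top))
    texts = []
    for i in range(len(ordered)):
        masked = mask_page(page, ordered, keep=i)
        # --psm 6 treats what is left as a single uniform block of text.
        texts.append(pytesseract.image_to_string(masked, config="--psm 6").strip())
    return "\n\n".join(texts)
```

Usage would be along the lines of `ocr_columns(Image.open("page.png"), [Box(...), Box(...)])`. Anything outside the given boxes (a headline, say) stays visible in every masked image and so shows up once per column. Because each image keeps the full page size, swapping `image_to_string` for `pytesseract.image_to_data` should give word coordinates in page space, which is the point of masking rather than cropping.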
 
Tesseract doesn't have a sophisticated page layout analysis module on board
so one is forced to use external means for that.
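
One very crude "external means", sketched here only as an illustration: a vertical projection profile that looks for empty gutters between columns. This is not what tesseract or layout-parser does internally, the function name `find_columns` and the `min_gap` parameter are made up, and it only works when columns run the full height of the region with a clear gutter between them. Pages with blocks of varying size would need a real layout detector, or this applied to one horizontal band at a time.

```python
def find_columns(ink, min_gap=3):
    """Find column x-ranges in a binarized page.

    ink: list of pixel rows, each a list of 0/1 where 1 means a dark pixel.
    Returns a list of (left, right) x-ranges, with right exclusive.
    A run of at least min_gap empty pixel columns counts as a gutter.
    """
    if not ink:
        return []
    width = len(ink[0])
    # Vertical projection profile: dark-pixel count for each x position.
    profile = [sum(row[x] for row in ink) for x in range(width)]

    columns = []
    start = None  # x where the current text column began
    gap = 0       # length of the current run of empty x positions
    for x, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The gutter is wide enough: close the column where ink ended.
                columns.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close the last column, trimming any trailing empty margin.
        columns.append((start, width - gap))
    return columns
```

The resulting x-ranges can be turned into full-height rectangles and used as the column mask. On real scans the page would have to be deskewed and denoised first, since a slight tilt or a few specks in the gutter are enough to hide it from the profile.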
 
HTH,
 
Ger
 