Strange bounding boxes from image_to_boxes


Jürgen Uhl

Dec 12, 2025, 9:54:11 AM
to tesseract-ocr
I'm trying to build a scan postprocessor that lets me correct or improve scanned pages (mainly text) when a page is smeared or characters were not recognized properly.

I'm using pytesseract 5.5.0.20241111 on Windows, calling both image_to_pdf_or_hocr (mainly to see the character recognition confidence) and image_to_boxes (to get the bounding box of each character). The two methods differ in very few places per page (most often none), but in many cases image_to_boxes returns strange box data. I have attached the boxes returned for "development", where all character boxes look OK, and for "adhered", "have", and "transferred", where one character box is completely off and another box spans two characters, even though each box line contains a single, correctly recognized character.

Any idea where this could come from or how to avoid it?
transferred.tif
have.tif
development.tif
adhered.tif
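
For anyone comparing the two outputs, here is a minimal sketch of how the box lines can be parsed and put into the same coordinate system as hOCR. It relies on one documented difference: the Tesseract box format is "symbol left bottom right top page" with y measured from the bottom edge of the image, while hOCR bbox values measure y from the top. The names CharBox, parse_boxes and flag_wide_boxes are hypothetical helpers, and the "wider than twice the median" rule is only an assumed heuristic for spotting boxes that span two characters. It is not a diagnosis of the problem described above.

```python
from statistics import median
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    top: int      # y from the top edge, as in hOCR
    right: int
    bottom: int   # y from the top edge, as in hOCR


def parse_boxes(box_text: str, image_height: int) -> List[CharBox]:
    """Parse box-format text and flip y so the origin is the top-left corner."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.rsplit(maxsplit=5)
        if len(parts) != 6:
            continue  # skip empty or malformed lines
        char, left, bottom, right, top, _page = parts
        boxes.append(CharBox(
            char=char,
            left=int(left),
            top=image_height - int(top),
            right=int(right),
            bottom=image_height - int(bottom),
        ))
    return boxes


def flag_wide_boxes(boxes: List[CharBox], factor: float = 2.0) -> List[CharBox]:
    """Return boxes wider than `factor` times the median width (assumed heuristic)."""
    if not boxes:
        return []
    typical = median(b.right - b.left for b in boxes)
    return [b for b in boxes if (b.right - b.left) > factor * typical]
```

With pytesseract and a PIL image `img`, this would be called as `parse_boxes(pytesseract.image_to_boxes(img), img.height)`; the resulting boxes can then be drawn over the image or compared directly with the hOCR bbox values.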

Zdenko Podobny

Dec 12, 2025, 12:34:57 PM
to tesser...@googlegroups.com
Hi there,

I’m afraid our crystal balls are on permanent vacation, so we have no way of magically knowing which Tesseract version or language model you used (and I'm certain you remember that pytesseract is merely a wrapper for the Tesseract executable).

Typically, unless you provide detailed steps and input images to reproduce your issue, the best we can do is wish you luck and watch you wrestle with the problem on your own.

Zdenko
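
The details asked for above (Tesseract version and installed language models) can be collected with a short script. This is a minimal sketch that assumes the executable is named `tesseract` and is on PATH; the function name tesseract_report is hypothetical. It uses only the standard `--version` and `--list-langs` command-line options.

```python
import shutil
import subprocess


def tesseract_report(executable: str = "tesseract") -> str:
    """Return the version and language list of the Tesseract executable."""
    path = shutil.which(executable)
    if path is None:
        return f"{executable}: not found on PATH"
    sections = []
    for option in ("--version", "--list-langs"):
        result = subprocess.run([path, option], capture_output=True, text=True)
        # Some builds print this information to stderr instead of stdout.
        sections.append((result.stdout or result.stderr).strip())
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(tesseract_report())
```

Posting this output together with the exact pytesseract call (including `lang` and `config`) and the input image makes the issue reproducible.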


