Strange bbox'es in image_to_boxes


Jürgen Uhl

Dec 12, 2025, 9:54:11 AM

to tesseract-ocr

I'm trying to build a scan postprocessor that corrects or improves scanned pages (mainly text) when they are smeared or characters were not properly recognized.

I'm using pytesseract 5.5.0.20241111 on Windows with both image_to_pdf_or_hocr (mainly to see the character recognition confidence) and image_to_boxes (to get the bounding box of each character). The two methods agree except for very few (most often zero) deviations per page, but image_to_boxes returns strange boxes in many cases. I have attached the returned boxes for "development", where all character boxes look OK, and for "adhered", "have", and "transferred", where one character box is completely off and another box spans two characters, even though each box line contains a single, correctly recognized character.
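For reference, this is roughly how the box lines can be turned into rectangles for drawing. It is a minimal sketch, not the actual postprocessor code: `CharBox` and `parse_boxes` are hypothetical helper names, and it assumes the usual Tesseract box-file layout (`char left bottom right top page`, with y measured from the bottom edge of the image), so the y values are flipped to a top-left origin before drawing.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    top: int      # distance from the top edge of the image
    right: int
    bottom: int   # distance from the top edge of the image


def parse_boxes(box_text: str, image_height: int) -> List[CharBox]:
    """Parse box-format lines and flip y to a top-left origin.

    Assumes each line reads "char left bottom right top page",
    with y counted upward from the bottom edge of the image.
    """
    boxes = []
    for line in box_text.splitlines():
        # Split from the right so the character itself is kept intact.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, left, bottom, right, top, _page = parts
        boxes.append(
            CharBox(
                char=char,
                left=int(left),
                top=image_height - int(top),
                right=int(right),
                bottom=image_height - int(bottom),
            )
        )
    return boxes


# Usage with pytesseract (requires Pillow, pytesseract and a Tesseract install):
#   from PIL import Image
#   import pytesseract
#   img = Image.open("adhered.tif")
#   boxes = parse_boxes(pytesseract.image_to_boxes(img), img.height)
```

Drawing the boxes with the wrong origin would shift every box, not just one per word, so this alone would not explain the cases above; it only rules out one possible source of confusion.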

Any idea where this could come from, or how to avoid it?

transferred.tif

have.tif

development.tif

adhered.tif

Zdenko Podobny

Dec 12, 2025, 12:34:57 PM

to tesser...@googlegroups.com

Hi there,

I’m afraid our crystal balls are on permanent vacation, so we have no way of magically knowing which Tesseract version or language model you used (and I'm certain you remember that pytesseract is merely a wrapper for the Tesseract executable).

Typically, unless you provide detailed steps and input images to reproduce your issue, the best we can do is wish you luck and watch you wrestle with the problem on your own.
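As an illustration of the details meant here, the version and installed language models can be collected straight from the executable that pytesseract wraps. This is only a sketch: `tesseract_report` is a hypothetical helper name, and it assumes the `tesseract` binary is on the PATH (on Windows you may need to pass the full path to `tesseract.exe` instead).

```python
import shutil
import subprocess
from typing import Optional


def tesseract_report(executable: str = "tesseract") -> Optional[str]:
    """Return the output of `--version` and `--list-langs`, or None if not found."""
    path = shutil.which(executable)
    if path is None:
        return None  # executable not on PATH
    parts = []
    for flag in ("--version", "--list-langs"):
        result = subprocess.run(
            [path, flag], capture_output=True, text=True, check=False
        )
        # Some builds print this information to stderr, so keep both streams.
        parts.append((result.stdout + result.stderr).strip())
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(tesseract_report() or "tesseract executable not found on PATH")
```

Posting this output together with the exact call (language, config string, psm) and the input images makes the problem reproducible for others.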

Zdenko

