I'm trying to build a scan postprocessor that corrects or improves scanned pages (mainly text) when pages are smeared or characters were not recognized properly.
I'm using pytesseract (Tesseract 5.5.0.20241111) on Windows with both image_to_pdf_or_hocr (mainly to see the character recognition confidence) and image_to_boxes to get the bounding box of each character. The text recognized by the two methods differs very little (most often zero deviations per page), but image_to_boxes returns strange boxes in many cases. I have attached the returned boxes for "development", where all character boxes look fine, and for "adhered", "have", and "transferred", where one character box is completely off and another box spans two characters, even though each line of the box output contains a single, correctly recognized character.
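To make the problem easier to reproduce, here is a minimal sketch of how the image_to_boxes output could be parsed and checked for boxes that look too wide. It assumes the usual Tesseract box-file layout ("char left bottom right top page", with the origin at the bottom-left corner of the image). The names `CharBox`, `parse_boxes`, `flag_wide_boxes`, and `boxes_for_image`, as well as the width factor of 1.8, are hypothetical and only serve as illustration; this is not necessarily how the attached boxes were produced.

```python
from statistics import median
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    top: int      # measured from the top edge of the image
    right: int
    bottom: int   # measured from the top edge of the image


def parse_boxes(box_text: str, image_height: int) -> List[CharBox]:
    """Parse box-file lines of the form "char left bottom right top page".

    Box-file y values count from the bottom of the image, so they are
    flipped here to the top-left origin that PIL uses for drawing.
    """
    boxes = []
    for line in box_text.splitlines():
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip empty or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append(
            CharBox(parts[0], left, image_height - top, right, image_height - bottom)
        )
    return boxes


def flag_wide_boxes(boxes: List[CharBox], factor: float = 1.8) -> List[CharBox]:
    """Return boxes wider than `factor` times the median box width.

    A crude heuristic (assumption): a box that spans two characters is
    usually much wider than its neighbours.
    """
    if not boxes:
        return []
    typical = median(b.right - b.left for b in boxes)
    return [b for b in boxes if (b.right - b.left) > factor * typical]


def boxes_for_image(path: str) -> List[CharBox]:
    """Run Tesseract on an image file and return the parsed character boxes."""
    # Imported here so the parsing helpers work without Tesseract installed.
    import pytesseract
    from PIL import Image

    image = Image.open(path)
    return parse_boxes(pytesseract.image_to_boxes(image), image.height)
```

Calling `flag_wide_boxes(boxes_for_image("page.png"))` (the file name is a placeholder) would list the candidates for merged boxes on a page, which can then be compared against the character positions in the hOCR output.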
Any idea where this could come from or how to avoid it?