Tesseract returns exotic characters while processing standard latin-script document

Pavel Hanák

Aug 21, 2025, 10:20:56 AM

to tesseract-ocr

Short version: Ghostscript uses Tesseract, but the interface through which they exchange data may contain a bug. The Ghostscript developers are not convinced it's really a bug, so I'm trying to find more evidence here.

Long version: Ghostscript can now perform OCR on documents via Tesseract. It has a really nifty feature: you don't have to flatten the document to a bitmap first, which is generally undesirable. Instead, Ghostscript takes the vector text (its glyphs), renders a small portion of it to a bitmap and feeds that to Tesseract. It then takes the resulting character codes and assigns them to the original vector glyphs, thus preserving the vector content of the document. I tried to use this feature to fix old PDF files whose text encoding is completely garbled, i.e. the text looks fine on screen, but total garbage ("mojibake") comes out when I try to copy and paste from them.

It works surprisingly well, but I noticed one oddity: sometimes Tesseract returns characters from very exotic scripts, even though the document's language is specified. In my case the document is Czech, but certain characters are consistently returned as Ol Chiki or Hangul (the Korean alphabet). My original bug report contains concrete examples and a surprisingly detailed reply from one of the Ghostscript developers. It would be pointless to repeat it all, so please look here:

https://bugs.ghostscript.com/show_bug.cgi?id=708548

Do you think he is right? I checked the GS source code, but couldn't work out which --psm setting they use. I assume it's 7 (single line) or 8 (single word). Can Tesseract return characters from totally different alphabets with this setting? I tried to google it, of course, but found nothing conclusive. Thanks.
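In case it helps anyone reproduce or quantify the symptom, here is a minimal Python sketch for flagging letters that do not belong to the expected script in text copied out of the OCR'd PDF. It is not part of Ghostscript or Tesseract; the function names `script_of` and `foreign_letters` are my own, and the script is only approximated from the first word of each character's Unicode name (with a special case for the two-word name "OL CHIKI"), so treat it as a rough filter rather than a proper script lookup.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Approximate a character's script from its Unicode name.

    Assumption: the first word of the name is the script (e.g. LATIN,
    HANGUL). Two-word script names other than OL CHIKI are truncated,
    which is still enough to tell them apart from LATIN.
    """
    try:
        name = unicodedata.name(ch)
    except ValueError:  # unassigned or unnamed code point
        return "UNKNOWN"
    if name.startswith("OL CHIKI"):
        return "OL CHIKI"
    return name.split(" ")[0]


def foreign_letters(text, expected="LATIN"):
    """Count letters in `text` whose approximate script is not `expected`.

    Only characters in the Unicode letter categories (L*) are checked,
    so digits, punctuation and whitespace are ignored.
    """
    hits = Counter()
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            script = script_of(ch)
            if script != expected:
                hits[(ch, script)] += 1
    return hits


if __name__ == "__main__":
    # Hypothetical sample: Czech text with one Ol Chiki and one Hangul letter mixed in.
    sample = "P\u0159\u00edli\u0161 \u017elu\u1c5aou\u010dk\u00fd k\uac00\u0148"
    for (ch, script), count in foreign_letters(sample).items():
        print(f"U+{ord(ch):04X} {script} x{count}")
```

Running this over the text extracted from an affected page should show whether the stray characters are always the same few code points, which is what I am seeing.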
