Invented Characters In Output Stream

Dave Wood

Dec 28, 2019, 5:05:42 PM
to tesseract-ocr

Using Tesseract 5.0 on Windows with psm=6, oem=1, and the 2018 eng.traineddata (LSTM + legacy).
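
For anyone who wants to reproduce this setup, here is a minimal sketch of how those settings could map onto the Tesseract command line. It assumes the `tesseract` executable is on the PATH and that the default language data directory holds the same eng.traineddata. The helper names `build_tesseract_command` and `run_tesseract` are hypothetical, made up for illustration, and are not part of Tesseract.

```python
import subprocess


def build_tesseract_command(image_path, output_base, psm=6, oem=1, lang="eng"):
    """Build the argument list for the Tesseract CLI (hypothetical helper).

    psm=6 treats the page as a single uniform block of text;
    oem=1 selects the LSTM engine only.
    """
    return [
        "tesseract",      # assumed to be on the PATH
        image_path,       # input image
        output_base,      # Tesseract appends ".txt" to this base name
        "-l", lang,
        "--psm", str(psm),
        "--oem", str(oem),
    ]


def run_tesseract(cmd):
    """Run the command and raise if Tesseract exits with an error."""
    subprocess.run(cmd, check=True)


# Example usage (not executed here):
# cmd = build_tesseract_command("Invented Characters Input.png",
#                               "Invented Characters Output")
# run_tesseract(cmd)
```

Passing the arguments as a list avoids quoting problems with the spaces in the file names.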


As shown in the attached example files, Tesseract sometimes just adds characters out of thin air into the output stream. Attached are:


Invented Characters Input.png - file input to Tesseract
Invented Characters Output.txt - Tesseract text output


If you look at the sixth non-blank line of the output, which begins with "STYLE", you will see that "PRODUCTION DATE" on that line is followed by two tildes ("~~") and then the date "06/21/15".


If you look at the input .png file, you will see that the image is entirely blank between "PRODUCTION DATE" and the date. So why, and how, is Tesseract inventing these in-between characters out of thin air?


I have seen other cases like this, more frequently when using the FAST version of the eng.traineddata.
