Using Tesseract 5.0 on Windows, psm=6, oem=1, eng.traineddata (2018, LSTM + legacy).
As the attached example files show, Tesseract sometimes inserts characters into the output that do not correspond to anything in the input image. Attached are:
Invented Characters Input.png - file input to Tesseract
Invented Characters Output.txt - Tesseract text output
On the sixth non-blank line of the output, which begins with "STYLE", there are two tildes "~~" after "PRODUCTION DATE", followed by the date "06/21/15".
In the input .png file, the region between "PRODUCTION DATE" and the date is completely blank. So why, and how, is Tesseract inventing these in-between characters?
I have seen other cases like this, more frequently when using the FAST version of eng.traineddata.
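
For anyone trying to reproduce this, here is a rough Python sketch of how the run could be repeated with the same settings and how punctuation-only words such as "~~" could be located in Tesseract's TSV output, together with their confidence and bounding box. The helper names (`build_command`, `run_tesseract`, `non_alnum_words`) and the output base name are my own placeholders, not part of Tesseract; the sketch assumes the `tesseract` executable is on PATH and uses the standard `tsv` config to get per-word rows.

```python
import csv
import io
import subprocess


def build_command(image_path, output_base, psm=6, oem=1, lang="eng", tsv=False):
    """Build the tesseract CLI argument list for the settings in this report."""
    cmd = ["tesseract", image_path, output_base,
           "--psm", str(psm), "--oem", str(oem), "-l", lang]
    if tsv:
        # The "tsv" config makes tesseract write <output_base>.tsv
        # with one row per page/block/paragraph/line/word.
        cmd.append("tsv")
    return cmd


def run_tesseract(image_path, output_base, **kwargs):
    """Run tesseract; assumes the executable is installed and on PATH."""
    subprocess.run(build_command(image_path, output_base, **kwargs), check=True)


def non_alnum_words(tsv_text):
    """Return (text, conf, left, top, width, height) for every word row
    whose text contains no letter or digit (for example "~~")."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    hits = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are words; other levels carry no recognized text.
        if row.get("level") != "5" or not text:
            continue
        if not any(ch.isalnum() for ch in text):
            hits.append((text, float(row["conf"]),
                         int(row["left"]), int(row["top"]),
                         int(row["width"]), int(row["height"])))
    return hits
```

Calling `run_tesseract("Invented Characters Input.png", "out", tsv=True)` and then passing the contents of `out.tsv` to `non_alnum_words` should list the "~~" entry with its box, which can be compared against the blank region of the image. This only helps locate and measure the spurious output; it does not explain or fix it.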