Hi,
On 13/06/2022 10:21, 'Yunlong Liu' via tesseract-ocr wrote:
> Dear developers,
>
> I have carefully read the online material on how to use Tesseract for
> OCR tasks. It works well for most of my data. However, I found one odd
> behavior that confuses me quite a lot. Here are the details.
>
> 1. Below is the image I am using. I have already binarized it so that
> every pixel value is either 0 or 255, and the letter height is about
> 30 pixels.
> [image: TesseractInputImageSingle.png]
> 2. I compiled the main branch locally. Here is the version info on my side:
> [image: Tesseract Version Info.png]
> 3. After running the command
> "Tesseract TesseractInputImageSingle.png - --oem 1 --psm 7",
> I got "DOT *0*O4N 6VHPPC", which contains an extra ZERO (marked with
> asterisks here, shown in red in the original message).
>
> *Could anyone kindly explain why it happens and how to avoid the
> confusing ZERO during OCR?*
>
> 4. I also tried oem = 0, because some users recommended this mode for
> code recognition. The result is "DOT O4N *G*VHPPC", with "6" wrongly
> recognized as "G".
You might want to take a look at this pull request and see if it helps:
https://github.com/tesseract-ocr/tesseract/pull/3476
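If it helps to compare the two engine modes side by side while testing, here is a minimal Python sketch that reruns the same command line from the report for --oem 0 and --oem 1. The helper names build_cmd and compare_engines are made up for illustration, and the sketch assumes a "tesseract" binary on PATH and the image file name from the report. Keep in mind that --oem 0 needs traineddata files that include the legacy model (those from the tessdata repository, not tessdata_fast or tessdata_best).

```python
import shutil
import subprocess


def build_cmd(image_path, oem, psm=7, exe="tesseract"):
    """Return the argv list for one Tesseract run that prints to stdout.

    Hypothetical helper: it only mirrors the command line quoted above.
    """
    # "-" as the output base sends the recognized text to stdout.
    return [exe, image_path, "-", "--oem", str(oem), "--psm", str(psm)]


def compare_engines(image_path, oems=(0, 1)):
    """Run the same image through each engine mode and collect the text."""
    results = {}
    for oem in oems:
        proc = subprocess.run(
            build_cmd(image_path, oem), capture_output=True, text=True
        )
        # A non-zero exit code usually means a missing image or a missing
        # traineddata model; record None instead of raising.
        results[oem] = proc.stdout.strip() if proc.returncode == 0 else None
    return results


if __name__ == "__main__":
    if shutil.which("tesseract") is None:
        print("tesseract not found on PATH")
    else:
        runs = compare_engines("TesseractInputImageSingle.png")
        for oem, text in runs.items():
            print(f"--oem {oem}: {text!r}")
```

Printing the results with repr makes stray characters and whitespace easy to spot when comparing the output before and after a change.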
Cheers,
Merlijn