Recognition contains an extra letter

Yunlong Liu

Jun 13, 2022, 4:57:17 AM

to tesseract-ocr

Dear developers,

I have carefully read the online material on using Tesseract for OCR tasks, and it works well for most of my data. However, I found one odd result that confuses me. Here are the details.

1. The image I am using is attached (TesseractInputImageSingle.png). I have already binarized it so that every pixel value is either 0 or 255. The letters are ~30 pixels high.

2. I compiled the main branch locally. The version info on my side is attached (Tesseract Version Info.png).

3. After running the command "Tesseract TesseractInputImageSingle.png - --oem 1 --psm 7", I got "DOT 0O4N 6VHPPC", which contains an extra ZERO (the "0" immediately before "O4N").

Could anyone kindly explain why this happens and how to avoid the spurious ZERO during OCR?

4. I also tried oem = 0, because some users recommend this mode for code recognition. The result is "DOT O4N GVHPPC", with "6" wrongly recognized as "G".
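To make the difference between the two runs explicit, here is a minimal sketch that compares the two output strings character by character using Python's standard `difflib`. The helper name `diff_ocr` is hypothetical and only for illustration. It does not call Tesseract; it only works on the two strings reported above.

```python
from difflib import SequenceMatcher


def diff_ocr(a: str, b: str):
    """Return the edits (tag, text_in_a, text_in_b) that turn a into b.

    Hypothetical helper for illustration; matching spans are skipped.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Outputs reported above for the same image.
lstm_output = "DOT 0O4N 6VHPPC"    # --oem 1 --psm 7
legacy_output = "DOT O4N GVHPPC"   # --oem 0

for tag, in_lstm, in_legacy in diff_ocr(lstm_output, legacy_output):
    print(f"{tag}: oem 1 has {in_lstm!r}, oem 0 has {in_legacy!r}")
```

The two runs differ in exactly two places: the "0" that appears only in the oem 1 output, and the "6" versus "G" substitution.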

Merlijn B.W. Wajer

Jun 13, 2022, 5:03:34 AM

to tesser...@googlegroups.com

Hi,

On 13/06/2022 10:21, 'Yunlong Liu' via tesseract-ocr wrote:
> Dear developers,
>
> I had read carefully the online material about how to use Tesseract for
> OCR tasks. It works well for most of the data on my side. However, I
> found one weird thing which confuses me quite a lot. Here are the details.
>
> 1. Below is the image I am using. Basically, I have already binarized it
> to make all the pixel values either 0s or 255s. And the letter's height
> is ~30 pixels.
>
> TesseractInputImageSingle.png
> 2. I compiled the main branch locally. Here is the version info on my side
>
> Tesseract Version Info.png
> 3. After running the command "Tesseract TesseractInputImageSingle.png -
> --oem 1 --psm 7", I got "DOT *0*O4N 6VHPPC" which contains an extra ZERO
> in red.
>
> *Could anyone kindly explain why it happens and how to avoid the
> confusing ZERO during OCR?*
>
> 4. I also tried with oem = 0 because some users recommended to use this
> mode for code recognition, the result shows "DOT O4N *G*VHPPC" with "6"
> wrongly recognized as "G".

You might want to take a look at this pull request and see if it helps:
https://github.com/tesseract-ocr/tesseract/pull/3476

Cheers,
Merlijn
