Why does Tesseract fail on this image?

Tariq Ahmad

Jun 10, 2020, 12:50:38 PM

to tesseract-ocr

I cannot understand why Tesseract fails on this (cropped) image:

Yet if I add a white border of arbitrary width, it works:

Can anyone shed any light please?

Zdenko Podobny

Jun 11, 2020, 2:30:50 PM

to tesser...@googlegroups.com

https://github.com/tesseract-ocr/tessdoc/blob/master/ImproveQuality.md#missing-borders

Zdenko
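The linked section deals with text that is cropped right up to the image edge, which matches the observation that padding the crop makes recognition work. A minimal sketch of that padding step in Python, assuming Pillow is installed; `add_white_border`, `padded_size`, and the 10-pixel default are illustrative names and values, not something taken from the code discussed in this thread:

```python
def padded_size(width, height, border_px=10):
    """Return the image size after adding border_px pixels on every side."""
    return (width + 2 * border_px, height + 2 * border_px)


def add_white_border(src_path, dst_path, border_px=10):
    """Save a copy of the image at src_path with a white margin on every side."""
    # Third-party dependency, imported lazily: pip install Pillow
    from PIL import Image, ImageOps

    with Image.open(src_path) as img:
        # Convert to RGB so that the named fill colour is valid for any input mode.
        padded = ImageOps.expand(img.convert("RGB"), border=border_px, fill="white")
        padded.save(dst_path)
    return dst_path
```

Any coordinates Tesseract reports for the padded file are relative to the padded image, so subtract `border_px` from each value to map them back onto the original crop.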

On Wed, Jun 10, 2020 at 18:50, 'Tariq Ahmad' via tesseract-ocr <tesser...@googlegroups.com> wrote:

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/280cee80-aad1-4245-8346-25d87d447730o%40googlegroups.com.

Tariq Ahmad

Jun 12, 2020, 6:31:42 AM

to tesseract-ocr

Many thanks for your reply - useful to know.

I now find that pytesseract returns the wrong coordinates for individual characters. For example, for this image (which has a 10-pixel border):

image_to_boxes returns:

A: 17 32 10 22

L: 17 32 24 33

etc

These, I believe, are interpreted as (left, bottom, right, top), and when I extract the image for the letter A, I get:

However, the same code works correctly for:
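One thing worth checking when crops come out wrong: Tesseract's box output uses lines of the form `char left bottom right top page`, measured from the bottom-left corner of the image, whereas Pillow's `crop` expects `(left, upper, right, lower)` measured from the top-left. A minimal sketch of the conversion, assuming that is the mismatch here; `CharBox`, `parse_boxes`, and `to_pillow_crop` are hypothetical helper names:

```python
from typing import NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_boxes(box_text):
    """Parse box text with one 'char left bottom right top page' entry per line."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        char, left, bottom, right, top = parts[:5]
        boxes.append(CharBox(char, int(left), int(bottom), int(right), int(top)))
    return boxes


def to_pillow_crop(box, image_height):
    """Flip the y axis: box origin is bottom-left, Pillow's origin is top-left."""
    return (box.left, image_height - box.top, box.right, image_height - box.bottom)
```

With Pillow, the result can be passed straight to `img.crop(to_pillow_crop(box, img.height))`, where `box` is one entry from `parse_boxes(pytesseract.image_to_boxes(img))`.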

Zdenko Podobny

Jun 12, 2020, 9:09:39 AM

to tesser...@googlegroups.com

Search the forum and the issue tracker: there is an explanation of why the LSTM engine cannot give exact character box coordinates.

If you need exact character boxes, IMO you need to use the legacy engine (but it could have other problems).

Zdenko
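For reference, the legacy engine is selected with the `--oem 0` option, which pytesseract passes through via its `config` argument. A minimal sketch, assuming pytesseract and Pillow are installed and that the installed traineddata file actually contains the legacy model (the `tessdata_fast` and `tessdata_best` files are LSTM-only); `build_config` and `legacy_char_boxes` are hypothetical helper names:

```python
def build_config(oem, psm=None):
    """Build a Tesseract option string such as '--oem 0 --psm 7'."""
    parts = [f"--oem {oem}"]
    if psm is not None:
        parts.append(f"--psm {psm}")
    return " ".join(parts)


def legacy_char_boxes(image_path, lang="eng", psm=None):
    """Return character boxes from the legacy (non-LSTM) engine as box text."""
    # Third-party dependencies, imported lazily: pip install pytesseract Pillow
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        # --oem 0 means "legacy engine only"; Tesseract reports an error if the
        # traineddata for `lang` does not include the legacy model.
        return pytesseract.image_to_boxes(img, lang=lang, config=build_config(0, psm))
```

As the reply notes, the legacy engine may bring other problems, so it is worth comparing the recognized text from both engines before switching.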

On Fri, Jun 12, 2020 at 12:31, 'Tariq Ahmad' via tesseract-ocr <tesser...@googlegroups.com> wrote:
