Simple image FAIL fails

Jason

Apr 29, 2019, 1:11:40 PM

to tesseract-ocr

Apologies for such a simple question but this is a super simple test case and I don't understand why it isn't working. This simple image contains the words "PASS" and "FAIL". "PASS" is recognized but "FAIL" comes out as "wee". What can I do to get it to detect "FAIL" properly?

I'm using the demo C++ code. Ideally I would like to provide an "approved" word list, or be able to exclude "wee" so that the result hopefully matches "FAIL". I saw https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data but that looks like it's only for the command-line utility. How would I go about that in the C++ API?

fixed.png

Shree Devi Kumar

Apr 29, 2019, 1:50:53 PM

to tesser...@googlegroups.com

ubuntu@tesseract-ocr:~/TEST$ tesseract fixed.png - --psm 6 --dpi 300 --tessdata-dir ~/tessdata_fast

PASS wee

ubuntu@tesseract-ocr:~/TEST$ tesseract fixed.png - --psm 6 --dpi 300 --tessdata-dir ~/tessdata_best

PASS AYE

ubuntu@tesseract-ocr:~/TEST$ tesseract fixed.png - --psm 6 --dpi 300 --tessdata-dir ~/tessdata

PASS A\ 8

ubuntu@tesseract-ocr:~/TEST$ tesseract fixed.png - --psm 6 --dpi 300 --tessdata-dir ~/tessdata --oem 1

PASS AYE

ubuntu@tesseract-ocr:~/TEST$ tesseract fixed.png - --psm 6 --dpi 300 --tessdata-dir ~/tessdata --oem 0

PASS FAIL

Looks like the neural-net (LSTM) engine (`--oem 1`) performs worse than the legacy engine (`--oem 0`) in this case.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/71bab02c-ba21-49dc-8e99-710d52075207%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Jason

Apr 29, 2019, 2:56:58 PM

to tesseract-ocr

Thank you for looking into this and confirming I am not crazy.

Lorenzo Bolzani

Apr 29, 2019, 4:30:32 PM

to tesser...@googlegroups.com

Hi,

Inverting the image gives the correct result. Cropping the image tightly around the text also works.
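Here is a minimal sketch of those two preprocessing steps in plain C++, in case it helps to try them before handing the image to the API. The `GrayImage` struct and the `Invert` and `CropToInk` helpers are hypothetical names made up for illustration, not part of Tesseract or Leptonica. The sketch assumes an 8-bit grayscale buffer with no row padding, and the cropping step assumes dark text on a light background (so invert first if the polarity is the other way round).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: 8-bit grayscale, row-major, no row padding.
struct GrayImage {
    int width = 0;
    int height = 0;
    std::vector<std::uint8_t> pixels;  // size == width * height
};

// Flip the polarity of every pixel (0 <-> 255, 10 <-> 245, ...).
GrayImage Invert(const GrayImage& src) {
    GrayImage out = src;
    for (auto& p : out.pixels) {
        p = static_cast<std::uint8_t>(255 - p);
    }
    return out;
}

// Crop to the bounding box of "ink" pixels (value < threshold),
// padded by `margin` pixels and clamped to the image borders.
// Returns the input unchanged if no ink pixel is found.
GrayImage CropToInk(const GrayImage& src, std::uint8_t threshold, int margin) {
    assert(src.pixels.size() ==
           static_cast<std::size_t>(src.width) * src.height);

    int x0 = src.width, y0 = src.height, x1 = -1, y1 = -1;
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            if (src.pixels[static_cast<std::size_t>(y) * src.width + x] < threshold) {
                x0 = std::min(x0, x);
                x1 = std::max(x1, x);
                y0 = std::min(y0, y);
                y1 = std::max(y1, y);
            }
        }
    }
    if (x1 < 0) return src;  // nothing darker than the threshold

    x0 = std::max(0, x0 - margin);
    y0 = std::max(0, y0 - margin);
    x1 = std::min(src.width - 1, x1 + margin);
    y1 = std::min(src.height - 1, y1 + margin);

    GrayImage out;
    out.width = x1 - x0 + 1;
    out.height = y1 - y0 + 1;
    out.pixels.reserve(static_cast<std::size_t>(out.width) * out.height);
    for (int y = y0; y <= y1; ++y) {
        // Copy the columns x0..x1 of row y.
        auto row = src.pixels.begin() + static_cast<std::ptrdiff_t>(y) * src.width;
        out.pixels.insert(out.pixels.end(), row + x0, row + x1 + 1);
    }
    return out;
}
```

The resulting buffer could then be passed to `tesseract::TessBaseAPI::SetImage(const unsigned char* imagedata, int width, int height, int bytes_per_pixel, int bytes_per_line)` with `bytes_per_pixel = 1` and `bytes_per_line = width`. The threshold and margin values are things to experiment with, not recommended settings.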

Lorenzo

Jason

Apr 30, 2019, 4:19:41 AM

to tesseract-ocr

That's interesting, because everything I've read about Tesseract says that polarity (white-on-black or black-on-white text) doesn't matter because it uses edge detection (outlines).

https://research.google.com/pubs/archive/33418.pdf
"by inspection of the nesting of outlines, and the number of child and grandchild outlines, it is simple to detect inverse text and recognize it as easily as black-on-white text"

Zdenko Podobny

Apr 30, 2019, 4:22:46 AM

to tesser...@googlegroups.com

That is valid for version 3.05 and older (a.k.a. the legacy engine) ...

Zdenko
