Simple image FAIL fails


Jason

Apr 29, 2019, 1:11:40 PM
to tesseract-ocr
Apologies for such a simple question but this is a super simple test case and I don't understand why it isn't working. This simple image contains the words "PASS" and "FAIL". "PASS" is recognized but "FAIL" comes out as "wee". What can I do to get it to detect "FAIL" properly?

I'm using the demo C++ code. Ideally I would like to provide an "approved" word list, or be able to remove "wee" so that, hopefully, it matches "FAIL" instead. I saw https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data but that looks like it's just for the command-line utility. How would I go about that in the C++ API?

fixed.png
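
One way to approach the "approved word list" idea without touching the engine is to filter the recognized text afterwards. The sketch below is only an illustration under that assumption: it does not use Tesseract's user-words mechanism, and `ToUpperAscii` and `KeepApprovedWords` are hypothetical helper names. It drops unknown tokens such as "wee" but cannot turn them into "FAIL".

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Upper-case a token so that matching is case-insensitive (ASCII only).
std::string ToUpperAscii(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::toupper(c));
  });
  return s;
}

// Split OCR output on whitespace and keep only the tokens found in
// `approved`. The approved set is expected to hold upper-case words.
std::vector<std::string> KeepApprovedWords(
    const std::string& ocr_text, const std::set<std::string>& approved) {
  std::vector<std::string> kept;
  std::istringstream tokens(ocr_text);
  std::string token;
  while (tokens >> token) {
    const std::string upper = ToUpperAscii(token);
    if (approved.count(upper) > 0) kept.push_back(upper);
  }
  return kept;
}
```

Calling `KeepApprovedWords("PASS wee", {"PASS", "FAIL"})` keeps only "PASS", so a missing second word can at least be reported as unreadable rather than as a wrong result.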

Shree Devi Kumar

Apr 29, 2019, 1:50:53 PM
to tesser...@googlegroups.com
ubuntu@tesseract-ocr:~/TEST$ tesseract fixed.png - --psm 6 --dpi 300 --tessdata-dir ~/tessdata_fast
PASS wee
ubuntu@tesseract-ocr:~/TEST$ tesseract fixed.png - --psm 6 --dpi 300 --tessdata-dir ~/tessdata_best
PASS AYE
ubuntu@tesseract-ocr:~/TEST$ tesseract fixed.png - --psm 6 --dpi 300 --tessdata-dir ~/tessdata
PASS A\ 8
ubuntu@tesseract-ocr:~/TEST$ tesseract fixed.png - --psm 6 --dpi 300 --tessdata-dir ~/tessdata --oem 1
PASS AYE
ubuntu@tesseract-ocr:~/TEST$ tesseract fixed.png - --psm 6 --dpi 300 --tessdata-dir ~/tessdata --oem 0
PASS FAIL

Looks like `neural net tesseract` performs worse than `base tesseract` in this case.



--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/71bab02c-ba21-49dc-8e99-710d52075207%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



Jason

Apr 29, 2019, 2:56:58 PM
to tesseract-ocr
Thank you for looking into this and confirming I am not crazy.

Lorenzo Bolzani

Apr 29, 2019, 4:30:32 PM
to tesser...@googlegroups.com
Hi,
Inverting the image gives the correct result. Cropping the image to just the area around the text also works.


Lorenzo
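
Both workarounds can be tried in code before the image is handed to the OCR engine. The following is a minimal sketch, assuming an 8-bit grayscale buffer with dark text on a light background for the cropping step. `GrayImage`, `Invert`, and `CropToDarkPixels` are hypothetical names for illustration, not part of Tesseract or Leptonica.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal image type: 8-bit grayscale, row-major, no padding.
struct GrayImage {
  int width;
  int height;
  std::vector<std::uint8_t> pixels;  // size must be width * height
};

// Flip the polarity of every pixel (0 <-> 255).
GrayImage Invert(const GrayImage& src) {
  assert(src.pixels.size() ==
         static_cast<std::size_t>(src.width) * src.height);
  GrayImage dst = src;
  for (std::uint8_t& p : dst.pixels) p = static_cast<std::uint8_t>(255 - p);
  return dst;
}

// Crop to the bounding box of all pixels darker than `threshold`, padded by
// `margin` pixels on each side (clamped to the image). If no pixel is dark,
// the source is returned unchanged.
GrayImage CropToDarkPixels(const GrayImage& src, std::uint8_t threshold,
                           int margin) {
  assert(margin >= 0);
  assert(src.pixels.size() ==
         static_cast<std::size_t>(src.width) * src.height);
  int min_x = src.width, min_y = src.height, max_x = -1, max_y = -1;
  for (int y = 0; y < src.height; ++y) {
    for (int x = 0; x < src.width; ++x) {
      if (src.pixels[static_cast<std::size_t>(y) * src.width + x] < threshold) {
        min_x = std::min(min_x, x);
        max_x = std::max(max_x, x);
        min_y = std::min(min_y, y);
        max_y = std::max(max_y, y);
      }
    }
  }
  if (max_x < 0) return src;  // nothing darker than the threshold

  min_x = std::max(0, min_x - margin);
  min_y = std::max(0, min_y - margin);
  max_x = std::min(src.width - 1, max_x + margin);
  max_y = std::min(src.height - 1, max_y + margin);

  GrayImage dst{max_x - min_x + 1, max_y - min_y + 1, {}};
  dst.pixels.reserve(static_cast<std::size_t>(dst.width) * dst.height);
  for (int y = min_y; y <= max_y; ++y) {
    const auto row =
        src.pixels.begin() + static_cast<std::ptrdiff_t>(y) * src.width;
    dst.pixels.insert(dst.pixels.end(), row + min_x, row + max_x + 1);
  }
  return dst;
}
```

A small non-zero margin is probably sensible in practice, since text that touches the image border tends to be harder to recognize.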


Jason

Apr 30, 2019, 4:19:41 AM
to tesseract-ocr
That's interesting, because everything I've read about Tesseract says that white-on-black versus black-on-white (foreground/background) doesn't matter because it uses edge detection (outlines).

https://research.google.com/pubs/archive/33418.pdf
"by inspection of the nesting of outlines, and the number of child and grandchild outlines, it is simple to detect inverse text and recognize it as easily as black-on-white text"

Zdenko Podobny

Apr 30, 2019, 4:22:46 AM
to tesser...@googlegroups.com
That is valid for version 3.05 and older (a.k.a. the legacy engine) ...

Zdenko
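
If the newer engine cannot be relied on to handle inverse text, the polarity can be normalized before recognition. The sketch below uses a simple border-brightness heuristic, which is an assumption of this example and not the outline-nesting method from the paper quoted above. `HasDarkBackground` and `NormalizeToDarkOnLight` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when the image border is, on average, darker than mid-gray.
// This sketch treats that as "light text on a dark background".
// `pixels` is 8-bit grayscale, row-major, with width * height entries.
bool HasDarkBackground(const std::vector<std::uint8_t>& pixels, int width,
                       int height) {
  assert(width > 0 && height > 0);
  assert(pixels.size() == static_cast<std::size_t>(width) * height);
  unsigned long long sum = 0;
  unsigned long long count = 0;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      const bool on_border =
          (x == 0 || y == 0 || x == width - 1 || y == height - 1);
      if (on_border) {
        sum += pixels[static_cast<std::size_t>(y) * width + x];
        ++count;
      }
    }
  }
  return sum < 128ULL * count;  // mean border value below 128
}

// Inverts the image in place only when the background looks dark.
// Returns true if the inversion was applied.
bool NormalizeToDarkOnLight(std::vector<std::uint8_t>& pixels, int width,
                            int height) {
  if (!HasDarkBackground(pixels, width, height)) return false;
  for (std::uint8_t& p : pixels) p = static_cast<std::uint8_t>(255 - p);
  return true;
}
```

The heuristic assumes the text does not touch the border; an image whose border is mostly covered by text or graphics would need a different background estimate.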

