My data looks clean, why is it not recognised properly


Soul Green

Apr 20, 2021, 3:11:48 AM
to tesseract-ocr
I am very new to coding so forgive me.

I have been having an extremely low success rate with tesseract.
Here are three examples, each shown before and after my pre-processing:

[Images: red1.jpg, croppedred1.jpg; yellow1.jpg, croppedyellow1.jpg; blue1.jpg, croppedblue1.jpg]
These were read as "a", "Ss30", and "moh" respectively.
I consider the yellow one a success, as I can just regex the 30 out of the result, but I still don't understand how it could be so off for the rest.
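For reference, pulling the number out of a noisy result such as "Ss30" could look roughly like the sketch below. The helper name `extractNumber` is hypothetical, and it assumes the value of interest is the first run of digits in the OCR output.

```javascript
// Hypothetical helper: return the first run of digits in an OCR result
// as a number, or null when the text contains no digits at all.
function extractNumber(ocrText) {
  const match = /\d+/.exec(ocrText);
  return match === null ? null : Number.parseInt(match[0], 10);
}

console.log(extractNumber("Ss30")); // 30
console.log(extractNumber("moh"));  // null
```

This only rescues results that already contain the right digits; it does nothing for outputs like "a" or "moh", where the digits were never recognised.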

I've tried different traineddata files, including one that I trained myself on over 200 examples.

I have three theories as to why I couldn't train it:
1. The different colours are processed differently, causing differently shaped characters. (Red looks bold and yellow looks thin)
2. The different sizes of the images cause the characters to be slightly differently shaped when cropped.
3. Tesseract assumes that the two lines of text are one, and reads them together.
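One way to test theory 1 without retraining is to binarise every image the same way before OCR, so that red, yellow, and blue text all end up as black on white. The sketch below is an illustration under stated assumptions, not the poster's pipeline: it assumes the pixels are available as an RGBA byte array (for example from a canvas `ImageData`), uses Otsu's method as the thresholding technique, and all function names are hypothetical.

```javascript
// Convert RGBA pixels to 8-bit luminance using Rec. 601 weights.
function toGrayscale(rgba) {
  const gray = new Uint8Array(rgba.length / 4);
  for (let i = 0; i < gray.length; i++) {
    const r = rgba[4 * i];
    const g = rgba[4 * i + 1];
    const b = rgba[4 * i + 2];
    gray[i] = Math.round(0.299 * r + 0.587 * g + 0.114 * b);
  }
  return gray;
}

// Otsu's method: choose the threshold that maximises the
// between-class variance of the "dark" and "light" pixel groups.
function otsuThreshold(gray) {
  const hist = new Array(256).fill(0);
  for (const v of gray) hist[v]++;

  let sumAll = 0;
  for (let t = 0; t < 256; t++) sumAll += t * hist[t];

  let sumDark = 0;
  let countDark = 0;
  let bestVariance = -1;
  let best = 0;
  for (let t = 0; t < 256; t++) {
    countDark += hist[t];
    if (countDark === 0) continue;       // no dark pixels yet
    const countLight = gray.length - countDark;
    if (countLight === 0) break;         // everything is already "dark"
    sumDark += t * hist[t];
    const meanDark = sumDark / countDark;
    const meanLight = (sumAll - sumDark) / countLight;
    const variance = countDark * countLight * (meanDark - meanLight) ** 2;
    if (variance > bestVariance) {
      bestVariance = variance;
      best = t;
    }
  }
  return best;
}

// Pixels above the threshold become white (255), the rest black (0).
function binarize(rgba) {
  const gray = toGrayscale(rgba);
  const threshold = otsuThreshold(gray);
  return gray.map((v) => (v > threshold ? 255 : 0));
}
```

If the binarised versions of the three colours look alike but recognition is still poor, colour is probably not the main problem, and the other theories (image size, line layout) are worth checking first.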
 
Could someone please give me a hint on what to try? I don't want to spend another day training it on just blue ones (for example) only to find that colour isn't the problem.
Thanks

Zdenko Podobny

Apr 20, 2021, 3:14:56 AM
to tesser...@googlegroups.com
Hint: read the documentation and stop guessing. You can start here: https://github.com/tesseract-ocr/tessdoc/blob/master/ImproveQuality.md

Zdenko



Soul Green

Apr 20, 2021, 4:40:29 AM
to tesseract-ocr
Omg thanks.
I hadn't thought about checking that documentation. I've been using tesseract.js with node so I completely forgot that it was based on something else. How amateur.
I also didn't know that tesseract did its own processing as well.
Thanks again, I'll try everything there.

Zdenko Podobny

Apr 20, 2021, 5:10:28 AM
to tesser...@googlegroups.com
Tesseract is an OCR engine, not a general image analyser, so remove graphical elements yourself and send only the text areas to OCR.
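A minimal sketch of "send only text areas", assuming the image is held as a row-major RGBA typed array and that the text rectangle is already known: the function name `cropRgba` and the `rect` shape are hypothetical, chosen for illustration only.

```javascript
// Hypothetical helper: copy one rectangle out of a row-major RGBA image.
// rgba must be a typed array (e.g. Uint8ClampedArray), 4 bytes per pixel.
function cropRgba(rgba, imageWidth, rect) {
  const { left, top, width, height } = rect;
  const imageHeight = rgba.length / 4 / imageWidth;
  if (left < 0 || top < 0 || width <= 0 || height <= 0 ||
      left + width > imageWidth || top + height > imageHeight) {
    throw new RangeError("crop rectangle lies outside the image");
  }

  const out = new Uint8ClampedArray(width * height * 4);
  for (let row = 0; row < height; row++) {
    // Byte offset of the first pixel of this row inside the source image.
    const srcStart = ((top + row) * imageWidth + left) * 4;
    out.set(rgba.subarray(srcStart, srcStart + width * 4), row * width * 4);
  }
  return out;
}
```

tesseract.js also appears to document a `rectangle` option on `recognize` that restricts OCR to part of an image; check the documentation for the version you use before relying on it.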

Zdenko

