My data looks clean, why is it not recognised properly?

Soul Green

Apr 20, 2021, 3:11:48 AM

to tesseract-ocr

I am very new to coding so forgive me.

I have been having an extremely low success rate with tesseract.

Here are three examples, both pre- and post-processing:

These were recognised as "a", "Ss30", and "moh" respectively.
I consider the yellow one a success, as I can just regex the 30 out of the result, but I still don't understand how it could be so off for the rest.
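
For the "Ss30" case, the regex step could look something like the following. This is a minimal sketch in JavaScript, assuming the wanted value is the first run of digits in the OCR text; `extractFirstNumber` is a hypothetical helper name, not part of tesseract.js.

```javascript
// Hypothetical helper: return the first run of digits in an OCR result
// as a number, or null if the text contains no digits at all.
function extractFirstNumber(ocrText) {
  const match = /\d+/.exec(ocrText);
  return match ? Number(match[0]) : null;
}

console.log(extractFirstNumber("Ss30")); // 30
console.log(extractFirstNumber("moh"));  // null
```

This only rescues results where the digits themselves were read correctly; it does nothing for outputs like "a" or "moh".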

I've tried different traineddata files, including one that I trained myself on over 200 examples.

I have three theories as to why I couldn't train it:

1. The different colours are processed differently, causing differently shaped characters. (Red looks bold and yellow looks thin)
2. The different sizes of the images cause the characters to be slightly differently shaped when cropped.
3. Tesseract assumes that the two lines of text are one, and reads them together.
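
Theory 1 can probably be tested without retraining: if every image is reduced to pure black and white before OCR, colour should no longer make a difference. Below is a minimal sketch, assuming the pixels are available as a flat RGBA byte array (the layout a canvas `ImageData.data` array uses) and that a fixed threshold of 128 is good enough. `binarize` and the threshold value are assumptions for illustration, not something taken from Tesseract.

```javascript
// Hypothetical helper: convert RGBA pixels to pure black or white.
// Assumes 4 bytes per pixel in R, G, B, A order.
function binarize(rgba, threshold = 128) {
  const out = new Uint8ClampedArray(rgba.length);
  for (let i = 0; i < rgba.length; i += 4) {
    // Perceptual brightness using the common Rec. 601 weights.
    const luma = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
    const value = luma < threshold ? 0 : 255;
    out[i] = value;      // R
    out[i + 1] = value;  // G
    out[i + 2] = value;  // B
    out[i + 3] = 255;    // fully opaque
  }
  return out;
}
```

With these weights pure red falls below the threshold and pure yellow above it, so red and yellow text would end up with opposite polarity. That may be related to why they look bold and thin, and one of them might need inverting so that all text ends up dark on a light background.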

Could someone please give me a hint on what to try? I don't want to spend another day training it on just blue ones (for example) only to find that colour isn't the problem.
Thanks

Zdenko Podobny

Apr 20, 2021, 3:14:56 AM

to tesser...@googlegroups.com

Hint: read the documentation and stop guessing. You can start here: https://github.com/tesseract-ocr/tessdoc/blob/master/ImproveQuality.md

Zdenko


Soul Green

Apr 20, 2021, 4:40:29 AM

to tesseract-ocr

Omg thanks.

I hadn't thought about checking that documentation. I've been using tesseract.js with Node, so I completely forgot that it was based on something else. How amateur.

I also didn't know that Tesseract did its own processing.

Thanks again, I'll try everything there.

Zdenko Podobny

Apr 20, 2021, 5:10:28 AM

to tesser...@googlegroups.com

Tesseract is an OCR engine, so try to eliminate graphic elements yourself and send only the text areas to OCR.
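
One possible way to find a text area is to take the bounding box of the dark pixels in an already binarised image. The following is a minimal sketch, assuming a row-major array with one value per pixel (0 for black, 255 for white) and a known width; `textBoundingBox` is a hypothetical helper, not a Tesseract or tesseract.js function.

```javascript
// Hypothetical helper: bounding box of all black (0) pixels.
// `pixels` holds one value per pixel, row by row; `width` is the row length.
function textBoundingBox(pixels, width) {
  const height = pixels.length / width;
  let minX = width, minY = height, maxX = -1, maxY = -1;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      if (pixels[y * width + x] === 0) {
        if (x < minX) minX = x;
        if (x > maxX) maxX = x;
        if (y < minY) minY = y;
        if (y > maxY) maxY = y;
      }
    }
  }
  if (maxX < 0) return null; // no dark pixels found
  return { left: minX, top: minY, width: maxX - minX + 1, height: maxY - minY + 1 };
}
```

This only trims empty margins. Dark graphics such as icons or frames still count as "text" here, so removing them needs knowledge of the layout. tesseract.js documents a `rectangle` option for `recognize` with the same `left`, `top`, `width`, `height` shape, but that should be checked against the installed version.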

Zdenko

