Recommendation on how to best train Tesseract for new UTF-8 symbols


Rafay Kalim

May 21, 2019, 10:37:57 AM
to tesseract-ocr
Hey, I am trying to train a new Tesseract model that recognizes only certain UTF-8 symbols, and not English letters or anything else. I see two ways to do this: fine-tune Tesseract on top of the standard English model and then blacklist the English characters, or train a completely new model that recognizes only these symbols. I would appreciate input on which of these, or some other method, is better in terms of ease, training time, and accuracy.

The context is that I will have various pieces of text on a board, and I want to find the locations of the symbols. However, I don't want to recognize any of the English text or anything else, as this may interfere with my post-processing. I have tried a few approaches (such as restricting where the symbols can appear on the board and then scanning only the text in those strips), but I am not satisfied with the results. I can also control the font, the text size, and everything else on the board, except the actual codes.

Thanks for the help!

Lorenzo Bolzani

May 21, 2019, 10:49:17 AM
to tesser...@googlegroups.com
Hi,
When you fine-tune the model (for example with ocrd-train), you can choose to restrict the model's output to a smaller set of characters. There is no need to blacklist anything.

If you just want to locate the symbols, something like OpenCV's matchTemplate, or training an OpenCV/dlib HOG detector, may be more appropriate. Using Tesseract looks like a very convoluted way to do it.

If you have multiple symbols, use multiple patterns or train multiple detectors.


Bye

Lorenzo


