Icon recognition using Tesseract OCR


Rabie Sadoq

Jan 29, 2021, 11:30:01 AM

to tesser...@googlegroups.com

Hello, I am using Tesseract in my QA team to run tests by taking pictures of our products' screens.

We are having difficulty with some recognition tasks, especially with icons.

I have tried training on icon data from our proprietary font using jTessBoxEditor-2.3.1. It gave some promising results; however, some icons are recognized as Latin characters, and we suspect that the issue is introduced in the step that generates the unicharset file.

The following are screenshots of the first 3 commands of the jTessBoxEditor Trainer log for the icons file. We would like you to take a look at them and see whether there is a correction to be made:

Command 1:

There are 2 icons that fail to match a blob.

Command 2:

The generated unicharset file does not recognize some characters?

Command 3:

The Tesseract documentation says: --script-dir should point to a directory containing the relevant .unicharset file(s) for your training character set. These can be downloaded from https://github.com/tesseract-ocr/langdata.

Since the font containing the icons is proprietary, I didn't know where to point --script-dir. The directory used in the command above was set automatically when running the training commands, hence the failure to load the script unicharset? Do you have any suggestions for where to point --script-dir?

Unicharset file:

The above is part of the generated unicharset file, where you can see the 4 icons being recognized as Latin characters. These icons are always recognized as the corresponding Latin characters when the tesseract command is run with the generated traineddata file. Do you have any suggestions as to the origin of the issue and how it could be corrected?
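
In case it helps with checking the generated unicharset file by hand, here is a minimal Python sketch that lists which entries carry a given script name. It assumes the usual text layout (an entry count on the first line, then one entry per line starting with the character and a hexadecimal properties field, with the script name either right after the properties or after a comma-separated metrics field). The names parse_unicharset, entries_with_script and SAMPLE are made up for illustration, and SAMPLE is invented data, not our real file.

```python
from typing import List, NamedTuple


class UnicharEntry(NamedTuple):
    index: int       # position of the entry in the file (0-based, after the count line)
    unichar: str     # the character (or "NULL" placeholder) as written in the file
    properties: int  # properties bitmask, parsed from hexadecimal
    script: str      # script name as written in the file, e.g. "Latin" or "Common"


def parse_unicharset(text: str) -> List[UnicharEntry]:
    """Parse unicharset text; assumes the first line is the entry count."""
    entries = []
    for index, line in enumerate(text.splitlines()[1:]):
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or unexpected lines
        # Assumption: if the third field contains commas it is the metrics
        # field, and the script name is the field after it.
        if "," in fields[2]:
            if len(fields) < 4:
                continue
            script = fields[3]
        else:
            script = fields[2]
        entries.append(UnicharEntry(index, fields[0], int(fields[1], 16), script))
    return entries


def entries_with_script(entries: List[UnicharEntry], script: str) -> List[UnicharEntry]:
    """Return the entries whose script field equals the given name."""
    return [e for e in entries if e.script == script]


# Invented sample for illustration only (not the real generated file).
SAMPLE = (
    "4\n"
    "NULL 0 NULL 0\n"
    "a 3 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 a\n"
    "\ue001 0 0,255,0,255,0,0,0,0,0,0 Common 2 0 2 \ue001\n"
    "1 8 0,255,0,255,0,0,0,0,0,0 Common 3 0 3 1\n"
)

if __name__ == "__main__":
    for entry in entries_with_script(parse_unicharset(SAMPLE), "Latin"):
        print(entry.index, repr(entry.unichar), hex(entry.properties))
```

Running something like this on the real file and looking for icon entries listed under "Latin" would at least show whether the wrong script is already present in the unicharset itself or only appears later in the traineddata.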