Hello, I am using Tesseract in my QA team to run tests on pictures taken of our products' screens.
We are having difficulty with some recognitions, especially with icons.
I have tried to train on icon data from our proprietary font using jTessBoxEditor-2.3.1. It gave some promising results; however, some icons are recognized as Latin characters, and we suspect that the issue is introduced in the step that generates the unicharset file.
The following are screenshots of the jTessBoxEditor Trainer log for the first 3 commands run on the icons file. We would like you to take a look at them and see if there is a correction to be made:
Command 1:
There are 2 icons that fail to match a blob.
Command 2:
The generated unicharset file does not seem to recognize some characters.
Command 3:
The Tesseract documentation says: "--script-dir should point to a directory containing the relevant .unicharset file(s) for your training character set. These can be downloaded from https://github.com/tesseract-ocr/langdata".
Since the font containing the icons is proprietary, I didn't know where to point --script-dir. The directory used in the command above was set automatically when running the training commands, hence the failure to load the script unicharset. Do you have any suggestions for where --script-dir should point?
Unicharset file:
The above is part of the generated unicharset file, where you can see the 4 icons being recognized as Latin characters. These icons are always recognized as the corresponding Latin characters when the tesseract command is run using the generated traineddata file. Do you have any suggestions as to the origin of the issue and how it could be corrected?
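To make the script assignments in a unicharset file easier to inspect, here is a minimal Python sketch that lists each entry with its property mask and script name. It assumes the usual text layout: the first line holds the entry count, and each following line starts with the character and a hexadecimal property mask, optionally followed by a comma-separated glyph-metrics field and then the script name. The function names and the sample entries are hypothetical and are not taken from our file. The sketch only reports what the file contains; it does not by itself explain the misrecognition.

```python
def parse_unicharset(text):
    """Return (unichar, properties, script) tuples from unicharset text.

    Assumed layout: line 1 is the entry count; each later line is
    '<unichar> <hex properties> [<glyph metrics>] <script> ...'.
    """
    entries = []
    for line in text.splitlines()[1:]:  # skip the entry-count line
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        unichar = fields[0]
        props = int(fields[1], 16)  # property mask is stored as hex
        rest = fields[2:]
        # Newer files carry a comma-separated glyph-metrics field
        # before the script name; skip it when present.
        if rest and "," in rest[0]:
            rest = rest[1:]
        script = rest[0] if rest else "NULL"
        entries.append((unichar, props, script))
    return entries


def latin_entries(entries):
    """Entries whose script field is 'Latin'."""
    return [e for e in entries if e[2] == "Latin"]


# Hypothetical sample, for illustration only (not our real file).
SAMPLE = """4
NULL 0 Common 0
a 3 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 a
\ue001 0 0,255,0,255,0,0,0,0,0,0 Common 2 0 2 \ue001
; 10 0,255,0,255,0,0,0,0,0,0 Common 3 10 3 ;
"""

if __name__ == "__main__":
    for unichar, props, script in parse_unicharset(SAMPLE):
        print(repr(unichar), hex(props), script)
```

Running the same functions on the real generated file (read as UTF-8) would show which entries carry the Latin script and whether the icon code points appear in the file at all.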
Thank you in advance for your reply.
Cheers,
Rabie