Hello,
I need to read a fixed-format time and date from images.
The task should be fairly trivial, but tesseract behaves strangely.
I am using the Danish trained model. The date string has the format dd.mm.yy and the time has the format hh:mm.
Very often the ':' in the time is recognized as '1', but this is not difficult to correct.
In the date I have seen the letters 'U' and 'O' instead of the digit '0' (this is not very difficult to post-process either) and the letters 'U' and 'H' instead of the number '11'.
This is harder ...
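To make the post-processing concrete, here is a minimal Python sketch of what I have in mind. It is not part of tesseract: the function names are made up, and the substitution table only covers the confusions listed above. Since 'U' can stand for either '0' or '11', the sketch tries both and keeps only readings that still fit dd.mm.yy and are real calendar dates.

```python
import itertools
import re
from datetime import datetime

# Confusions seen so far (assumption: this list is incomplete).
ALTERNATIVES = {"O": ["0"], "U": ["0", "11"], "H": ["11"]}


def date_candidates(raw):
    """Return every dd.mm.yy reading of raw that is a valid date."""
    text = "".join(raw.split())  # drop spaces and the trailing newline
    options = [ALTERNATIVES.get(ch, [ch]) for ch in text]
    found = []
    for combo in itertools.product(*options):
        candidate = "".join(combo)
        if not re.fullmatch(r"\d\d\.\d\d\.\d\d", candidate):
            continue
        try:
            datetime.strptime(candidate, "%d.%m.%y")
        except ValueError:
            continue  # e.g. day 00 or month 13
        if candidate not in found:
            found.append(candidate)
    return found


def fix_time(raw):
    """Return hh:mm, repairing a ':' that was read as '1', else None."""
    text = "".join(raw.split())
    if re.fullmatch(r"\d{5}", text) and text[2] == "1":
        text = text[:2] + ":" + text[3:]
    if not re.fullmatch(r"\d\d:\d\d", text):
        return None
    hours, minutes = int(text[:2]), int(text[3:])
    return text if hours < 24 and minutes < 60 else None
```

An empty list from date_candidates (or None from fix_time) means no consistent reading was found, so the image would have to be rejected or retried.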
The English pretrained model works perfectly on the examples I checked (but I can't use it because our embedded system does not have enough memory).
I can build a whitelist containing only digits and separators, but the precision doesn't increase much ...
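For reference, this is roughly how such a whitelist can be passed, sketched in Python around the command-line tool. The helper names are hypothetical, and I am assuming the legacy engine (--oem 0) and a single line of text per image (--psm 7); adjust both if your setup differs.

```python
import subprocess

WHITELIST = "0123456789.:"  # digits plus the two separators


def build_command(image_path, lang="dan"):
    """Build the tesseract argv for one text line with a whitelist."""
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--oem", "0",  # assumption: legacy engine, not LSTM
        "--psm", "7",  # assumption: the image holds a single text line
        "-c", "tessedit_char_whitelist=" + WHITELIST,
    ]


def read_text(image_path, lang="dan"):
    """Run tesseract (must be on PATH) and return its raw output."""
    result = subprocess.run(
        build_command(image_path, lang),
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    return result.stdout
```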
Because the format is fixed, I tried to use patterns: \d\d.\d\d.\d\d for the date and \d\d:\d\d for the time.
With the English model the pattern file is accepted and is evidently used, but the accuracy drops (it starts to confuse ':' with '1', puts spaces between day, month and year, ...).
With the Danish model I get an error message with the _same_ pattern file (sorry, I can't quote it because I am on another computer, but it says it can't recognize the format of the regexp, or something similar).
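So that the setup can be reproduced, here is a small Python sketch that writes the two patterns to a file and builds the call. The file name and helper names are made up, and I am assuming the file is handed over with the --user-patterns command-line option; only the -l argument differs between the English and the Danish run.

```python
import os
import tempfile

# One pattern per line, in tesseract's user-pattern syntax (\d = digit).
PATTERNS = [r"\d\d.\d\d.\d\d", r"\d\d:\d\d"]


def write_patterns(directory):
    """Write the pattern file and return its path."""
    path = os.path.join(directory, "datetime.patterns")  # hypothetical name
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        for pattern in PATTERNS:
            handle.write(pattern + "\n")
    return path


def pattern_command(image_path, patterns_path, lang):
    """Build the tesseract argv that uses the pattern file."""
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--oem", "0",  # assumption: legacy engine, not LSTM
        "--user-patterns", patterns_path,
    ]


if __name__ == "__main__":
    patterns_file = write_patterns(tempfile.mkdtemp())
    for language in ("eng", "dan"):
        print(" ".join(pattern_command("clock.png", patterns_file, language)))
```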
How does the pattern file depend on the language?
What other ways are there to improve the results with my model?
I am _not_ using LSTM; this is tesseract 4.0.0 on Linux.
Thanks in advance,
Karoly