Non-English user-patterns


Karoly Makonyi


Apr 8, 2020, 1:37:04 PM

to tesseract-ocr

Hello,

I need to read fixed-format times and dates from images.

The task should be fairly trivial, but tesseract behaves oddly.

I am using the Danish trained model. The format of the date string is dd.mm.yy and the format of the time is hh:mm.
Very often the ':' in the time is recognized as '1', but this is not difficult to correct.
In the date I have seen the letters 'U' and 'O' instead of the digit '0' (this is not very difficult to postprocess either), and the letters 'U' and 'H' instead of the number '11'.

This is harder ...
The English pretrained model works perfectly on the examples I checked, but I can't use it because our embedded system does not have enough memory.
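
The two easy corrections described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names `fix_time` and `fix_date` are hypothetical, and each field is assumed to arrive as a separate string. Since 'U' was seen both for '0' and for '11', the sketch does not try to resolve 'U' or 'H'; such readings simply fail validation.

```python
import re
from typing import Optional

TIME_RE = re.compile(r"\d\d:\d\d")          # hh:mm
DATE_RE = re.compile(r"\d\d\.\d\d\.\d\d")   # dd.mm.yy


def fix_time(raw: str) -> Optional[str]:
    """Repair an hh:mm reading in which ':' was recognized as '1'."""
    text = raw.strip()
    # Five digits with '1' in the middle: assume the middle '1' was the ':'.
    if len(text) == 5 and text.isdigit() and text[2] == "1":
        text = text[:2] + ":" + text[3:]
    return text if TIME_RE.fullmatch(text) else None


def fix_date(raw: str) -> Optional[str]:
    """Repair a dd.mm.yy reading in which '0' was recognized as 'O'."""
    text = raw.strip().replace("O", "0")
    # 'U' and 'H' are ambiguous, so they are left alone and rejected here.
    return text if DATE_RE.fullmatch(text) else None
```

A result of `None` marks a reading that still does not fit the fixed format and needs some other treatment.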

I can build a whitelist of characters containing only digits and separators, but the precision doesn't increase much ...
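
For reference, a whitelist like this can be passed on the command line through the `tessedit_char_whitelist` variable. The following is a minimal sketch that builds such a call from Python; the helper names `build_command` and `run_ocr`, the file name in the usage note, and the choice of `--psm 7` (single text line) are assumptions for illustration, not settings taken from the original setup.

```python
import subprocess
from typing import List


def build_command(image_path: str, whitelist: str, lang: str = "dan") -> List[str]:
    """Build a tesseract command line that restricts output to `whitelist`."""
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--oem", "0",   # legacy (non-LSTM) engine
        "--psm", "7",   # assumption: each image crop holds one text line
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_ocr(image_path: str, whitelist: str) -> str:
    """Run tesseract (must be on PATH) and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, whitelist),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

For example, `run_ocr("date.png", "0123456789.:")` would allow only digits, '.' and ':' in the output.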

Because the format is fixed, I tried to use patterns: \d\d.\d\d.\d\d for the date and \d\d:\d\d for the time.
With the English model the pattern file is accepted and is evidently used, but the accuracy drops (it starts to confuse ':' with '1', puts spaces between day, month and year, ...).
With the Danish model I get an error message with the _same_ pattern file (sorry, I can't quote it because I am on another computer, but it can't recognize the format of the regexp, or something similar ...).
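
To make the setup reproducible, here is a minimal sketch of how such a pattern file could be generated and handed to tesseract. It assumes the file holds one pattern per line in UTF-8 and is passed with the `--user-patterns` command-line option; the file name `datetime.patterns` and the helper names are hypothetical.

```python
import os
import tempfile
from typing import List

# The two patterns quoted above, one per line in the pattern file.
PATTERNS = [r"\d\d.\d\d.\d\d", r"\d\d:\d\d"]


def write_patterns(path: str, patterns: List[str]) -> None:
    """Write one pattern per line, UTF-8, with plain '\\n' line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for pattern in patterns:
            handle.write(pattern + "\n")


def user_patterns_args(path: str) -> List[str]:
    """Extra command-line arguments pointing tesseract at the pattern file."""
    return ["--user-patterns", path]


# Hypothetical location for the generated file.
patterns_path = os.path.join(tempfile.gettempdir(), "datetime.patterns")
write_patterns(patterns_path, PATTERNS)
```

The returned arguments would be appended to the usual call, for example `tesseract date.png stdout -l dan --user-patterns <path>`.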

How does the pattern file depend on the language?

What other ways are there to improve my results ...

I am _not_ using LSTM, just tesseract 4.0.0 on Linux.

Thanks in advance,
Karoly

Karoly Makonyi


Apr 9, 2020, 12:58:23 AM

to tesseract-ocr

OK, here is the error that the Danish model generates:
Error: failed to insert pattern '\d\d.\d\d.\d\d'
Error: failed to insert pattern '\d\d:\d\d'

regards,
Karoly
