As good as Latin.traineddata (fast integer) but faster


O CR

Apr 8, 2020, 11:10:28 AM
to tesseract-ocr
Hi all,

I'm trying to read names in images with the Tesseract LSTM engine, names like:

Śerena Kovitch

ŁAGUNA EVREIST

Äna Optici

Orğu Moninck


(I don't need to recognize regular dictionary words.)


Latin.traineddata (fast integer) handles the diacritics well, but it also covers many characters I don't need: numbers, %, ﹕, ﹖, ﹗, ﹙, ﹚, ﹛, ﹜, ﹝, ﹞, ﹟, ﹠, ﹡, ﹢, ﹣, ﹤, ﹥, ﹦, ﹨, ﹩, ﹪, ﹫, and many more. As a result, Latin.traineddata is too slow.

So I thought I would take eng.traineddata (best float, for LSTM) and fine-tune it for the diacritics. But there are almost 400 diacritic characters, so I don't know whether fine-tuning for that many characters is a good idea.

I tried it anyway, but the quality is very poor.

I trained with eng.training_text (an English text of 72 lines) to which I added all the diacritics several times. The character error rate reported by lstmeval is around 0.1. In a test on 80 documents (one name per document), I read 30 names correctly. (Recognition time is similar to Latin.traineddata.)


What can I do to get a model that is as good as Latin.traineddata on diacritics but much faster at recognition?


Thank you.

Shree Devi Kumar

Apr 8, 2020, 12:27:15 PM
to tesseract-ocr
I suggest you fine-tune Latin.traineddata using text of the kind you expect. The result will have a smaller unicharset, and when you convert it to a fast integer model, it should be smaller in size.
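A rough sketch of that fine-tuning flow with the standard training tools (all paths, the iteration count, and the list-file name are placeholders to adapt to your setup):

```shell
# Extract the LSTM component from the *best* (float) Latin model:
combine_tessdata -e tessdata_best/script/Latin.traineddata Latin.lstm

# Continue training from it on your own line data (names with diacritics):
lstmtraining \
  --model_output output/latin_names \
  --continue_from Latin.lstm \
  --traineddata tessdata_best/script/Latin.traineddata \
  --train_listfile output/latin.training_files.txt \
  --max_iterations 3000
```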

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b9ddf333-1229-45d3-9a02-809973294a47%40googlegroups.com.

O CR

Apr 10, 2020, 7:29:03 AM
to tesseract-ocr
Which language do I have to use? Latin doesn't seem to be supported:
./tesstrain.sh --fonts_dir "/usr/share/fonts" --lang Latin --linedata_only  --noextract_font_properties --langdata_dir ./langdata --tessdata_dir ./tessdata  --output_dir ./output

On Wednesday, April 8, 2020 at 18:27:15 UTC+2, shree wrote:

Shree Devi Kumar

Apr 10, 2020, 8:17:55 AM
to tesseract-ocr
The file is probably there as script/Latin.traineddata.
You can copy it to wherever you keep your traineddata files.
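For example, assuming the usual tessdata repository layout (the directory paths here are placeholders):

```shell
# Copy the script-level model into the directory tesseract searches,
# then select it with -l as usual:
cp tessdata/script/Latin.traineddata ./tessdata/Latin.traineddata
tesseract image.png stdout -l Latin --tessdata-dir ./tessdata
```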


O CR

Apr 10, 2020, 10:24:04 AM
to tesseract-ocr
Thank you for responding.
I did the fine-tuning on the best (float) Latin model and converted the result to integer, but it's still slower than the fast integer Latin model.
Any other ideas to make it faster?
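For reference, the float-to-integer conversion step usually looks something like this (the checkpoint name and paths are placeholders):

```shell
# Finalize a fine-tuned checkpoint as a fast integer traineddata:
lstmtraining \
  --stop_training \
  --convert_to_int \
  --continue_from output/latin_names_checkpoint \
  --traineddata tessdata_best/script/Latin.traineddata \
  --model_output Latin_fast.traineddata
```

Conversion alone quantizes the weights but keeps the same network architecture, so the speedup it can give is limited.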

On Friday, April 10, 2020 at 14:17:55 UTC+2, shree wrote:

Lorenzo Bolzani

Apr 10, 2020, 11:27:26 AM
to tesser...@googlegroups.com
Hi,
I started writing this email thinking that removing some characters should not make any real difference: the model parameters do not change with fine-tuning, and even when removing a few layers the bulk of the model remains the same.

I decided to test it, and I found something very strange: of my 14 custom-trained models, about two thirds are twice as slow as the others.

The slow ones are as slow as the standard ones, "eng", "spa", etc.

I do not remember ever converting them to fast models. All the models are about 6.4 MB, and all were trained with tesstrain (OCR-D).

The speed difference is visible both from Python code (via the tesserocr API wrapper) and from the command line (I repeat the same recognition 100 times, with one run as warmup).

The oldest ones (maybe trained with 4.0.0-beta?), from 2018, are generally faster, except for one. All use a reduced charset, but the size of the charset makes no difference.

Any ideas?


Bye

Lorenzo


Lorenzo Bolzani

Apr 10, 2020, 1:35:20 PM
to tesser...@googlegroups.com
I thought this might lead to some insights useful for the OP, but as the matter gets more mysterious, I'm opening a new thread so as not to hijack this one.


Lorenzo

Shree Devi Kumar

Apr 10, 2020, 9:36:15 PM
to tesseract-ocr

It seems that Ray used a smaller network spec for many languages when training the tessdata_fast models, to speed them up. However, since the float versions of those fast models are not available, fine-tuning has to start from the tessdata_best models. That might explain the result you got.

Fine-tuning "for impact" does not change the network. Fine-tuning with plus/minus characters, or replacing the top layer, may do that.
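As a sketch, replacing the top layer starts from the best float model and appends a new, smaller layer; the index, the net_spec, and the output size here are placeholder values (the output size must match your unicharset):

```shell
combine_tessdata -e tessdata_best/script/Latin.traineddata Latin.lstm

# Cut the network at the given index and append a freshly initialized,
# smaller LSTM layer plus output layer in its place:
lstmtraining \
  --model_output output/latin_small \
  --continue_from Latin.lstm \
  --traineddata tessdata_best/script/Latin.traineddata \
  --append_index 5 \
  --net_spec '[Lfx192 O1c111]' \
  --train_listfile output/latin.training_files.txt \
  --max_iterations 5000
```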

