What does updatesubtrainer mean?


Pndaza

Jun 18, 2019, 9:57:16 AM
to tesseract-ocr
I am fine-tuning for Myanmar with 20k text lines in 5 fonts.
The base traineddata is not from best, but from this mya-layer.zip by shreeshrii.

After 80k iterations, the error rate is 0.546.

Screenshot_6.jpg


After that I get no improvement.

Screenshot_7.jpg

updatesubtrainer is running periodically.

Is this normal?

Shree Devi Kumar

Jun 18, 2019, 11:41:09 AM
to tesser...@googlegroups.com
Convert a few checkpoints to trained data / run lstmeval on them. 

You don't want to overfit the model.
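
For anyone following along, here is a minimal sketch of that suggestion in Python. It only builds the two command lines for one checkpoint (convert with `lstmtraining --stop_training`, then score with `lstmeval`) so they can be inspected before running. The helper name `checkpoint_commands`, the `eval_models` output directory, and every path in the usage lines are hypothetical placeholders, not files from this thread.

```python
import shlex
from pathlib import Path


def checkpoint_commands(checkpoint, starter_traineddata, eval_listfile,
                        out_dir="eval_models"):
    """Build the convert and evaluate command lines for one checkpoint.

    All paths are placeholders supplied by the caller (assumption: the
    starter traineddata is the same one that was used for training).
    """
    # e.g. "mya_0.546_1234_80000.checkpoint" -> "mya_0.546_1234_80000.traineddata"
    model_out = (Path(out_dir) / (Path(checkpoint).stem + ".traineddata")).as_posix()
    convert = [
        "lstmtraining", "--stop_training",
        "--continue_from", str(checkpoint),
        "--traineddata", str(starter_traineddata),
        "--model_output", model_out,
    ]
    evaluate = [
        "lstmeval",
        "--model", model_out,
        "--eval_listfile", str(eval_listfile),
    ]
    return convert, evaluate


# Hypothetical paths, for illustration only.
convert, evaluate = checkpoint_commands(
    "checkpoints/mya_0.546_1234_80000.checkpoint",
    "mya/mya.traineddata",
    "mya/list.eval",
)
print(shlex.join(convert))
print(shlex.join(evaluate))
```

Running the printed commands for several checkpoints and comparing the eval error rates shows whether later checkpoints still generalize or have started to overfit.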

Pndaza

Jun 18, 2019, 1:04:08 PM
to tesseract-ocr
Eval result:

At iteration 0, stage 0, Eval Char error rate=0.44462951, Word error rate=2.4380774

How can I tune it to get an error rate lower than 0.1%?
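
When several checkpoints are evaluated, the reported numbers are easier to compare once pulled out of the log. The sketch below assumes the line format is exactly the one quoted above; `parse_eval` is a hypothetical helper, and it returns the numbers as printed without interpreting their units.

```python
import re

# Pattern taken from the eval line quoted in this message (assumed stable).
EVAL_LINE = re.compile(
    r"Eval Char error rate=([0-9.]+), Word error rate=([0-9.]+)"
)


def parse_eval(line):
    """Return (char_error, word_error) as floats, or None if no match."""
    match = EVAL_LINE.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


line = ("At iteration 0, stage 0, Eval Char error rate=0.44462951, "
        "Word error rate=2.4380774")
print(parse_eval(line))
```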

Pndaza

Jun 19, 2019, 12:58:00 AM
to tesseract-ocr
I wrongly gave the old traineddata (mya-layer.traineddata) to lstmtraining --traineddata instead of the starter traineddata.
So I am retraining.
I will report the result again.

While extracting the unicharset, unicharset_extractor says:

Two grapheme links in a row:0x103a 0x1039
Invalid start of Myanmar syllable:0x103a
Normalization failed for string '၁။  ‘‘မနောပုဗ္ဗင်္ဂမာ ဓမ္မာ၊ မနောသေဋ္ဌာ မနောမယာ။'
The 0x103a 0x1039 combination is valid if it is preceded by 0x1004.
This is called kinzi.


When the first consonant in a consonant cluster is a non-word-final  [U+1004 MYANMAR LETTER NGA] it rises over the following letter and keeps its virama, rather than pushing the following consonant below it, eg. အင်္ဂလန် ʔŋˣ͓glnˣ ʔɪ̀ɴga̰làɴ England. This is called 'kinzi' (ကင်းစီး kɪ́ɴzí). To achieve this, use the sequence  +  ် +  ္ [U+1004 MYANMAR LETTER NGA + U+103A MYANMAR SIGN ASAT + U+1039 MYANMAR SIGN VIRAMA​] , then continue with the next letter.

 
Is it a bug?
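
To see which lines are affected, the kinzi rule described above can be checked mechanically. This is a minimal sketch that only encodes the rule as stated in this message (U+103A U+1039 is acceptable when preceded by U+1004). It is not the validator that Tesseract itself uses, and `asat_virama_report` is a hypothetical helper name.

```python
NGA = "\u1004"     # MYANMAR LETTER NGA
ASAT = "\u103a"    # MYANMAR SIGN ASAT
VIRAMA = "\u1039"  # MYANMAR SIGN VIRAMA


def asat_virama_report(text):
    """Return (index, is_kinzi) for every ASAT+VIRAMA pair in text.

    index is the position of ASAT; is_kinzi is True only when the
    character just before it is NGA, per the rule quoted above.
    """
    report = []
    start = text.find(ASAT + VIRAMA)
    while start != -1:
        is_kinzi = start > 0 and text[start - 1] == NGA
        report.append((start, is_kinzi))
        start = text.find(ASAT + VIRAMA, start + 1)
    return report


# NGA + ASAT + VIRAMA followed by a consonant: a kinzi.
print(asat_virama_report("\u1021\u1004\u103a\u1039\u1002"))
# The same pair after a different consonant: not a kinzi under this rule.
print(asat_virama_report("\u1000\u103a\u1039\u1002"))
```

Running this over the ground-truth lines gives a quick count of how many contain kinzi and whether any ASAT+VIRAMA pair is not a kinzi.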




Pndaza

Jun 19, 2019, 7:43:10 AM
to tesseract-ocr
During the lstmtraining process, it says

Can't encode transcription: 'ကင်းသည် ဖြစ်ရာ၏၊ မင်းမြတ် ခြင်္သေ့၏ ရှေးဦးစွာသော ဤအင်္ဂါကို ယူအပ်၏။' in language '

when the string contains kinzi.

Pndaza

Jun 21, 2019, 7:30:39 AM
to tesseract-ocr
I tried a different learning rate (0.005), but the result is still not OK.
Should I increase the data set?

// best, so go back to the best model and try a different learning rate.


On Tuesday, 18 June 2019 22:11:09 UTC+6:30, shree wrote: