Replace top layers, output class count and recoder (chi_sim)

Shiming He

Jan 24, 2019, 9:55:37 AM
to tesseract-ocr
Hi group,

I'm trying to retrain the top layers of the chi_sim tessdata_best model using Tesseract 4.0.0. combine_tessdata reports the following about the network: Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]. I noticed that the O1c1 at the end specifies just 1 output class, yet the unpacked unicharset contains 4022 characters. Why doesn't the unicharset match the number of outputs?

I'm adding characters as well as new fonts. When retraining with '--append_index 5 --net_spec [Lfx512 O1c1]', the training tool complains about the output class count:

 Appending a new network to an old one!!Warning: given outputs 1 not equal to unicharset of 5077.

Then it insisted on another structure: Built network:[1,48,0,1[C3,3Ft16]Mp3,3Lfys64Lfx96Lrx96Lfx512Fc5077] from request [Lfx512 O1c1]. 
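
To make the comparison between the requested and the built network concrete, here is a minimal Python sketch that pulls the output layer out of the spec strings quoted above. It assumes the VGSL output-layer syntax O(0|1|2)(l|s|c)<n> and assumes that the built-network printout ends in an F-type layer such as Fc5077; the helper names nominal_output_classes and built_output_width are made up for illustration and are not part of Tesseract.

```python
import re

# Assumed VGSL output-layer syntax: "O", dimensionality (0/1/2),
# nonlinearity (l/s/c), then the nominal class count, e.g. "O1c1".
OUTPUT_LAYER = re.compile(r"O([012])([lsc])(\d+)")

# Assumed shape of the final layer in a "Built network" printout, e.g. "Fc5077]".
BUILT_FINAL_LAYER = re.compile(r"F[a-z](\d+)\]\s*$")


def nominal_output_classes(net_spec):
    """Return the class count written in the spec's output layer, or None."""
    matches = OUTPUT_LAYER.findall(net_spec)
    if not matches:
        return None
    return int(matches[-1][2])


def built_output_width(built_spec):
    """Return the width of the final F-type layer in a built-network string, or None."""
    match = BUILT_FINAL_LAYER.search(built_spec)
    return int(match.group(1)) if match else None


# Strings copied from the messages in this thread.
old_spec = "[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]"
requested = "[Lfx512 O1c1]"
built = "[1,48,0,1[C3,3Ft16]Mp3,3Lfys64Lfx96Lrx96Lfx512Fc5077]"

if __name__ == "__main__":
    print("old model, nominal classes:", nominal_output_classes(old_spec))
    print("request, nominal classes:  ", nominal_output_classes(requested))
    print("built network output width:", built_output_width(built))
```

The request carries a nominal count of 1, while the built network ends in a layer of width 5077, which is the unicharset size named in the warning.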

The starter traineddata is created this way:

combine_lang_model --input_unicharset model/custom/custom.lstm-unicharset --script_dir data/langdata_lstm --words data/langdata_lstm/chi_sim/chi_sim.wordlist --puncs data/langdata_lstm/chi_sim/chi_sim.punc --numbers data/langdata_lstm/chi_sim/chi_sim.numbers --output_dir model --lang chi_sim --pass_through_recoder

The .lstm-unicharset is generated from the box files with 'unicharset_extractor --norm_mode 1'.

Where did I go wrong?

Thanks in advance,
He Shiming

易鑫

Mar 27, 2019, 4:34:59 AM
to tesseract-ocr
Hello,
Did you fix this problem? I am running into it now. I have tried many approaches, including your method.
Thanks.

On Thursday, January 24, 2019 at 10:55:37 PM UTC+8, Shiming He wrote:

易鑫

Mar 27, 2019, 4:47:34 AM
to tesseract-ocr
Sorry, my problem is different from yours. I misread it just now.
Regarding your question:
 "I noticed at the end: O1c1 says just 1 output class. When unpacked, its unicharset contains 4022 characters, why unicharset doesn't match outputs?"

You can find this in the wiki: "The number of classes is ignored (only there for compatibility with TensorFlow) as the actual number is taken from the unicharset."

So I think the number in the spec is simply ignored.
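
If you want to check which number applies to your own files, a small Python sketch like the one below may help. It assumes that the first line of a unicharset file holds the number of entries, and it only reports the two numbers side by side; it does not reproduce what the training tool does internally. The function names and the example path are hypothetical.

```python
def unicharset_size(path):
    """Read the entry count from the first line of a unicharset file.

    Assumption: the first line of the file is the number of entries.
    """
    with open(path, encoding="utf-8") as handle:
        return int(handle.readline().strip())


def describe_counts(nominal_classes, unicharset_count):
    """Summarise the nominal spec count next to the unicharset count."""
    if nominal_classes == unicharset_count:
        return "spec and unicharset agree on %d classes" % unicharset_count
    return "spec says %d, unicharset has %d entries" % (
        nominal_classes,
        unicharset_count,
    )


if __name__ == "__main__":
    # Hypothetical path, taken from the combine_lang_model command in this thread.
    size = unicharset_size("model/custom/custom.lstm-unicharset")
    print(describe_counts(1, size))
```

With a spec of O1c1 and a unicharset of 5077 entries, this prints the same mismatch that the warning reports.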
