Hi group,
I'm trying to retrain the top layers of the chi_sim tessdata_best model using Tesseract 4.0.0. combine_tessdata reports this about the network: Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]. I noticed that the spec ends with O1c1, which seems to declare just 1 output class. But when I unpack the model, its unicharset contains 4022 characters. Why doesn't the unicharset size match the number of outputs?
I'm adding characters as well as providing new fonts. When I retrain with '--append_index 5 --net_spec [Lfx512 O1c1]', the training tool complains about the output class count:
Appending a new network to an old one!!Warning: given outputs 1 not equal to unicharset of 5077.
It then built a different structure from the one I requested: Built network:[1,48,0,1[C3,3Ft16]Mp3,3Lfys64Lfx96Lrx96Lfx512Fc5077] from request [Lfx512 O1c1].
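To make the mismatch concrete, here is a minimal Python sketch of the comparison I have in mind. The helper names output_classes and unicharset_size are my own and not part of Tesseract. The sketch assumes that the output layer is written as O, a dimension digit, a nonlinearity letter, and a class count, and that the first line of a unicharset file holds the number of entries. The sample unicharset text is a made-up stand-in, not the real file.

```python
import re

# Matches an output layer such as "O1c1" or "O1c105":
# "O", a dimensionality digit, a nonlinearity letter, then the class count.
OUTPUT_LAYER = re.compile(r"O(\d)([a-z])(\d+)")


def output_classes(net_spec):
    """Return the class count declared by the last O-layer, or None if absent."""
    matches = OUTPUT_LAYER.findall(net_spec)
    if not matches:
        return None
    return int(matches[-1][2])


def unicharset_size(text):
    """Return the entry count stored on the first line of a unicharset file."""
    return int(text.splitlines()[0].strip())


# Spec taken from the version string reported by combine_tessdata.
spec = "[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]"

# Hypothetical stand-in for the contents of the custom unicharset file.
sample_unicharset = "5077\nNULL 0 Common 0\n"

declared = output_classes(spec)
actual = unicharset_size(sample_unicharset)
if declared != actual:
    print(f"spec declares {declared} outputs, unicharset has {actual} entries")
```

With the values above, the declared count (1) and the unicharset size (5077) differ, which is the same discrepancy the warning reports.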
The starter traineddata was created this way:
combine_lang_model --input_unicharset model/custom/custom.lstm-unicharset --script_dir data/langdata_lstm --words data/langdata_lstm/chi_sim/chi_sim.wordlist --puncs data/langdata_lstm/chi_sim/chi_sim.punc --numbers data/langdata_lstm/chi_sim/chi_sim.numbers --output_dir model --lang chi_sim --pass_through_recoder
The .lstm-unicharset file was generated by running 'unicharset_extractor --norm_mode 1' on the box files.
Where did I go wrong?
Thanks in advance,
He Shiming