combine_lang_model makes no dawg file


Hosein Khoshdel

Sep 17, 2018, 6:48:09 AM
to tesseract-ocr
I used combine_lang_model like this:

combine_lang_model    --input_unicharset     ../combinelangmodel/fas.lstm-unicharset   \
--script_dir    ../combinelangmodel/sdir   \
--outputdir    outputdir \
--lang    fas  \
--lang_is_rtl    true \
--words    ..\lists\fas.wordlist  \
--puncs    ..\lists\fas.punc  \
--numbers     ..\lists\fas.numbers  \

BTW, I got fas.lstm-unicharset by running combine_tessdata -u on the official fas.traineddata, and took fas.wordlist, fas.punc and fas.numbers from the langdata repo. Almost everything works, except that when I unpack the resulting traineddata there is no dawg file in it, although the help says that if the three word lists are provided, the dawg files are also added to the traineddata file.
Can you please help me and show me what I am doing wrong?
Also, the extra spaces in the command are only there for readability.

Shree Devi Kumar

Sep 17, 2018, 12:25:03 PM
to tesser...@googlegroups.com
I use it as follows and it works. Please check that you are using the correct paths for the files.

combine_lang_model \
--input_unicharset ./layersan/san.unicharset \
--script_dir ~/langdata \
--words ~/langdata/san/san.wordlist \
--numbers ~/langdata/san/san.numbers \
--puncs ~/langdata/san/san.punc \
--output_dir ./layersan \
--lang san \
--pass_through_recoder \
--version_str ` cat ./layersan/san.new.version`
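One thing worth ruling out before rerunning is that the list files are really where the command points. In a POSIX shell, an unquoted path written with backslashes, such as ..\lists\fas.wordlist, loses the backslashes and is passed on as ..listsfas.wordlist, so forward slashes are the safer choice there. The sketch below is only an illustration: check_lists is a hypothetical helper, not part of Tesseract, and it simply reports any argument that is missing or empty.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): report every list file that
# is missing or empty, and return non-zero if any such file was found.
check_lists() {
    status=0
    for f in "$@"; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example call with the paths from the command above (left commented out):
# check_lists ~/langdata/san/san.wordlist ~/langdata/san/san.numbers ~/langdata/san/san.punc
```

If the helper returns non-zero, the paths passed to --words, --puncs and --numbers are the first thing to fix.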

And here is the result of unpacking this traineddata file:

~/tesstutorial-deva/layersan/san$ combine_tessdata -u san.traineddata ./san.

Extracting tessdata components from san.traineddata
Wrote ./san.config
Wrote ./san.lstm-punc-dawg
Wrote ./san.lstm-word-dawg
Wrote ./san.lstm-number-dawg
Wrote ./san.lstm-unicharset
Wrote ./san.lstm-recoder
Wrote ./san.version
Version string:4.0.0-beta.4-138-g2093:san:shreeshrii20180917:from:4.00.00alpha:Devanagari:synth20170629test
0:config:size=1013, offset=192
18:lstm-punc-dawg:size=5306, offset=1205
19:lstm-word-dawg:size=15123986, offset=6511
20:lstm-number-dawg:size=450, offset=15130497
21:lstm-unicharset:size=12621, offset=15130947
22:lstm-recoder:size=1552, offset=15143568
23:version:size=92, offset=15145120
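To compare another unpack log against the listing above, a small filter can report which of the three dawg components are absent. This is a rough sketch: missing_dawgs is a hypothetical helper, and it only looks for the component names lstm-punc-dawg, lstm-word-dawg and lstm-number-dawg as they appear in this listing.

```shell
#!/bin/sh
# Hypothetical helper: read a combine_tessdata listing on stdin and print
# the names of the LSTM dawg components that do not appear in it.
missing_dawgs() {
    listing=$(cat)
    missing=""
    for name in lstm-punc-dawg lstm-word-dawg lstm-number-dawg; do
        case "$listing" in
            *"$name"*) ;;                      # component is listed
            *) missing="$missing $name" ;;     # component is absent
        esac
    done
    # Strip the leading space; prints an empty line when nothing is missing
    echo "${missing# }"
}

# Example (left commented out); 2>&1 is an assumption in case the tool
# writes its log to stderr:
# combine_tessdata -u san.traineddata ./san. 2>&1 | missing_dawgs
```

An empty result means all three dawgs were packed. Any name printed points at the list file that should be checked first.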



