Traineddata always ended up the same size and did not match the wordlist


easyma...@gmail.com

Jan 8, 2018, 7:24:47 AM
to tesseract-ocr
Hi all,

I am working on a project with Tesseract v4.00, and after training with my own data I always get a traineddata output of the same size.
I suppose that I did not do the steps correctly.

The only data that I provided were:
1. training_text
2. puncs (I just reduced the general punc file as provided in the tesseract GitHub repo)
3. numbers
4. wordlists (I made various wordlists for the different training runs, ranging between 100,000 and 2,000,000 words)
5. font names (I also used different font sets for the different training runs, ranging between 1 and 20 fonts)

The steps that I did were (a rough outline of the commands follows the list):
1. Made the tiff files, unicharset and other supporting data using tesstrain.sh
2. Made the tiff files, unicharset and other supporting data using tesstrain.sh for evaluation
3. Combined the unicharset, wordlists, puncs, numbers and version_str to create the starter traineddata using combine_lang_model (I am still not sure about the value of version_str, though)
4. Trained the model using lstmtraining
5. Combined all output files using lstmtraining --continue_from ...
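
In outline, the commands looked roughly like this (placeholder paths, "mylang" as the language code, and illustrative flag values, not my exact invocations):

# 1-2. Generate line images, lstmf files and unicharset with tesstrain.sh
#      (run once more with the evaluation text for the eval set)
bash tesstrain.sh --fonts_dir /usr/share/fonts --lang mylang --linedata_only \
  --langdata_dir ../langdata --tessdata_dir ../tessdata --output_dir ../train_output

# 3. Build the starter traineddata from unicharset + wordlist + puncs + numbers
#    (--version_str is optional and sets the version string stored in the file)
combine_lang_model --input_unicharset ../train_output/mylang.unicharset \
  --script_dir ../langdata --words ../langdata/mylang/mylang.wordlist \
  --numbers ../langdata/mylang/mylang.numbers --puncs ../langdata/mylang/mylang.punc \
  --version_str "mylang-v1" --output_dir ../result/mylangcombine --lang mylang

# 4. Train; a network definition is needed up front, either via --net_spec
#    (from scratch) or --continue_from an existing model (fine-tuning).
#    The net spec here is copied from the example later in this thread.
lstmtraining --traineddata ../result/mylangcombine/mylang/mylang.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --train_listfile ../train_output/mylang.training_files.txt \
  --model_output ../result/mylangoutput/base --max_iterations 10000

# 5. Convert the best checkpoint into the final traineddata
lstmtraining --stop_training \
  --continue_from ../result/mylangoutput/base_checkpoint \
  --traineddata ../result/mylangcombine/mylang/mylang.traineddata \
  --model_output ../result/mylangoutput/mylang.traineddata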

Yet, all of my training runs ended with a traineddata of the same size, 10.5 MB.
Did I do all the steps correctly?

Once, I also trained after modifying WORD_DAWG_FACTOR in language_specific.sh to 0 and to 1, because I want the recognized text to match my wordlists 100%. But the result did not satisfy me either; some recognized words, such as "USISUSISU", are not in my wordlists.
Do you know what the cause is?

I would really appreciate it if anyone can help or suggest a solution.
Thank you!!

ShreeDevi Kumar

Jan 8, 2018, 7:36:50 AM
to tesser...@googlegroups.com
Did you use the --stop_training flag at the end?

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


easyma...@gmail.com

Jan 9, 2018, 1:31:53 AM
to tesseract-ocr
Yes, I ran the following command in the tesseract/training directory:

lstmtraining --stop_training --continue_from ../result/mylangoutput/base_checkpoint --traineddata ../result/mylangcombine/mylang/mylang.traineddata --model_output ../result/mylangoutput/mylang.traineddata

ShreeDevi Kumar

Jan 9, 2018, 6:17:40 AM
to tesser...@googlegroups.com
1. If you use tesstrain.sh, it will create the starter traineddata; you do NOT need to run combine_lang_model. If you want to change the version string, look at tesstrain_utils.sh and modify the command in it (a rough sketch of that call is below).

2. If you are always getting a file of the same size, it looks like you are probably copying some old file as the traineddata as part of your script. It could be copying from the wrong folder or something like that.
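
The combine_lang_model call inside tesstrain_utils.sh is roughly shaped like the sketch below (the variable names are approximate and differ between versions); adding or editing --version_str there is what controls the version string in the starter traineddata:

# Approximate shape of the call in tesstrain_utils.sh (names may differ by version)
combine_lang_model \
  --input_unicharset "${TRAINING_DIR}/${LANG_CODE}.unicharset" \
  --script_dir "${LANGDATA_ROOT}" \
  --words "${LANGDATA_ROOT}/${LANG_CODE}/${LANG_CODE}.wordlist" \
  --numbers "${LANGDATA_ROOT}/${LANG_CODE}/${LANG_CODE}.numbers" \
  --puncs "${LANGDATA_ROOT}/${LANG_CODE}/${LANG_CODE}.punc" \
  --output_dir "${OUTPUT_DIR}" --lang "${LANG_CODE}" \
  --version_str "your version string here"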

I am attaching a bash script; you can modify it for your setup and see if that helps.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

finetune.sh.txt
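
A typical fine-tuning sequence of this kind looks roughly as follows (placeholder paths, font and language code, and arbitrary iteration count; the attached script's details may differ):

#!/bin/bash
# Rough outline of fine-tuning from an existing tessdata_best model.
LANG=eng
BESTDATA=../tessdata_best      # contains eng.traineddata from tessdata_best
LANGDATA=../langdata
OUT=../finetune_output

# Extract the LSTM model from the existing best traineddata
combine_tessdata -e "$BESTDATA/$LANG.traineddata" "$OUT/$LANG.lstm"

# Generate line training data from your fonts and training_text
bash tesstrain.sh --fonts_dir /usr/share/fonts --fontlist "Arial" \
  --lang $LANG --linedata_only --noextract_font_properties \
  --langdata_dir "$LANGDATA" --tessdata_dir "$BESTDATA" --output_dir "$OUT"

# Fine-tune, starting from the extracted model
lstmtraining --continue_from "$OUT/$LANG.lstm" \
  --old_traineddata "$BESTDATA/$LANG.traineddata" \
  --traineddata "$OUT/$LANG/$LANG.traineddata" \
  --train_listfile "$OUT/$LANG.training_files.txt" \
  --model_output "$OUT/finetuned" --max_iterations 3000

# Convert the checkpoint into the final traineddata
lstmtraining --stop_training --continue_from "$OUT/finetuned_checkpoint" \
  --traineddata "$OUT/$LANG/$LANG.traineddata" \
  --model_output "$OUT/$LANG-finetuned.traineddata"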

easyma...@gmail.com

Jan 9, 2018, 6:45:26 AM
to tesseract-ocr
Wow, thank you for your time and response!
I really appreciate that.

My reason for using combine_lang_model is to make my puncs, wordlist, and numbers affect the trained data. Or does it not work like that?

Now I will try your shell script for training, and will share the result when it is done.

ShreeDevi Kumar

Jan 9, 2018, 7:36:08 AM
to tesser...@googlegroups.com

My reason for using combine_lang_model is to make my puncs, wordlist, and numbers affect the trained data. Or does it not work like that?

If you update the files in the langdata folder and then run tesstrain.sh, it will automatically use your files.
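
For a language code "mylang", the files that tesstrain.sh picks up from the langdata folder are laid out along these lines (following the langdata repository naming convention):

langdata/mylang/mylang.training_text   # text rendered into the training images
langdata/mylang/mylang.wordlist        # dictionary words
langdata/mylang/mylang.punc            # punctuation patterns
langdata/mylang/mylang.numbers         # number patterns
langdata/mylang/mylang.config          # optional extra config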

Now I will try your shell script for training, and will share the result when it is done.

You will need to modify it according to the location of your files.

Also, update the fonts list as per your requirements.

easyma...@gmail.com

Jan 9, 2018, 8:30:12 AM
to tesseract-ocr
Yup, I am still training the data now.

By the way, if I want to set WORD_DAWG_FACTOR so that Tesseract reads the text and matches it 100% against my wordlist, where should I edit the code? Is it in language_specific.sh? (As far as I know, those parameters won't work in Tesseract 4.00.)

easyma...@gmail.com

Jan 10, 2018, 5:26:37 AM
to tesseract-ocr
It works!!
I modified your bash script and executed it. Finally I get a different traineddata size.

But can I train from scratch?
That needs a starter traineddata, which I can get from combine_lang_model, doesn't it?

 

ShreeDevi Kumar

Jan 10, 2018, 6:16:14 AM
to tesser...@googlegroups.com
On Wed, Jan 10, 2018 at 3:56 PM, <easyma...@gmail.com> wrote:
It works!!
I modified your bash script and executed it. Finally I get a different traineddata size.

But can I train from scratch?
That needs a starter traineddata, which I can get from combine_lang_model, doesn't it?


Starter traineddata will be generated by tesstrain.sh; change the files in the langdata folder.

To train from scratch, you need to change the lstmtraining command. It will not need --continue_from and --old_traineddata.

You will need to add a network specification - such as

 --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \

Usually the best traineddata will have the network spec used for training by Ray as part of the version string.
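
For example, listing the contents of a best traineddata should show that version string (the -d listing option of combine_tessdata; the path here is a placeholder):

# Print the contents and version string of an existing best traineddata
combine_tessdata -d tessdata_best/eng.traineddata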



easyma...@gmail.com

Jan 26, 2018, 3:38:33 AM
to tesseract-ocr
Yep, I have done it :)
Thank you for your help, Shree.

Currently, I have trained with my modified wordlist of 1 million words, my puncs, numbers and net spec, but the result is not as good as fine-tuning from tessdata_best.
If anyone can suggest any tips for getting better results when training Tesseract from scratch, please share them with me :))