Some questions about LSTM training


易鑫

Jan 24, 2019, 10:34:36 PM
to tesseract-ocr
Hello everyone,
I am a new user of Tesseract 4.0. I am following the instructions (https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00) to train an LSTM model.

My environment is Ubuntu 16.04, and I compiled Tesseract 4.0 myself. I have run into some problems.

I followed these steps.
1. I ran this command:
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "Impact Condensed" --output_dir ~/tesstutorial/engeval

It ran without problems.

2. I ran this command:
mkdir -p ~/tesstutorial/engoutput
training/lstmtraining --debug_interval 100 \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
Here I am confused: I am currently in the tesseract directory, but I cannot find a training folder under it.
I assumed that once Tesseract is installed successfully the system recognizes the lstmtraining command, so I used this command instead:
lstmtraining --debug_interval 100 \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 5000
It fails with this error:
mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110
Segmentation fault (core dumped)
I looked at the source code in lstmtrainer.h:
102   // assumed that the character set is to be re-mapped from old_traineddata to
103   // the new, with consequent change in weight matrices etc.
104   bool TryLoadingCheckpoint(const char* filename, const char* old_traineddata);
105 
106   // Initializes the character set encode/decode mechanism directly from a
107   // previously setup traineddata containing dawgs, UNICHARSET and
108   // UnicharCompress. Note: Call before InitNetwork!
109   void InitCharSet(const std::string& traineddata_path) {
110     ASSERT_HOST(mgr_.Init(traineddata_path.c_str()));
111     InitCharSet();
112   }
113   void InitCharSet(const TessdataManager& mgr) {
114     mgr_ = mgr;
115     InitCharSet();
116   }
I don't know how to solve this problem. Can anyone help me? Thanks in advance, and sorry for my poor English.





Aodren BARY

Jan 25, 2019, 12:22:57 AM
to tesseract-ocr
Hi, I had a similar issue. Did you try adding a debug print to the source code, something like:

std::cerr << traineddata_path.c_str() << std::endl;

A printf should do it as well.
Did you run this command from the wiki?
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

By the way, I am stuck at this point myself: tesseract seems to loop infinitely here.

易鑫

Jan 25, 2019, 12:52:48 AM
to tesser...@googlegroups.com
I did not run this command:
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

From the wiki, I thought it was optional. Now I think this command is mandatory. I ran it, but an error occurred:
ERROR: /tmp/eng-2019-01-25.FUB/eng.Arial_Bold_Italic.exp0.box does not exist or is not readable
I think this is the key reason.

 



--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6e11f4c3-3142-45a1-9f31-9a9f86504a93%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Aodren BARY

Jan 25, 2019, 1:03:37 AM
to tesseract-ocr
Yes, you need to install some fonts.
You can find a tutorial here: http://www.linuxandubuntu.com/home/how-to-install-microsoft-fonts-in-ubuntu-linux
If I remember correctly, the fonts that tesseract uses for this command are listed in the script language_specific.sh. To find the location of this script, run a simple whereis tesstrain.sh.
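
As a rough sketch of checking for missing fonts before running tesstrain.sh, the helper below compares a list of required font names against a list of available ones and prints whatever is missing. The function name missing_fonts and the sample names are illustrative assumptions, not part of Tesseract. The real list of required fonts is whatever language_specific.sh defines for your language, and how you produce the list of available fonts is up to you (I believe text2image --list_available_fonts can print one, though its output may need an index prefix stripped first).

```shell
# Hypothetical helper: print every required font that is not in the available list.
# $1 = required font names, one per line
# $2 = available font names, one per line
missing_fonts() {
  printf '%s\n' "$1" | while IFS= read -r font; do
    # Skip empty lines in the required list.
    [ -n "$font" ] || continue
    # -F fixed string, -x whole-line match, -q quiet, -i ignore case
    if ! printf '%s\n' "$2" | grep -Fxqi -- "$font"; then
      printf '%s\n' "$font"
    fi
  done
}
```

Called with "Arial Bold Italic" in the required list and no such entry in the available list, it prints that name, which corresponds to the eng.Arial_Bold_Italic.exp0.box error reported above.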

易鑫

Jan 25, 2019, 1:13:25 AM
to tesser...@googlegroups.com
Thank you so much, I will try.


Marziye Rahmati

Jan 25, 2019, 1:14:06 AM
to tesseract-ocr
Hi,
I had this problem before.
I think you made a mistake in the path to the traineddata. You must give the path of the traineddata file that was created by tesstrain.sh.
Good luck.

Shree Devi Kumar

Jan 25, 2019, 1:15:03 AM
to tesser...@googlegroups.com
>currently I am in the tesseract directory, I can not find training folder under this directory.

All source files were moved to tesseract/src. You will find the training folder under it. src/training/lstmtraining should work without installing.
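
A minimal sketch of that lookup, assuming a POSIX shell: prefer the binary built inside the source tree and fall back to an installed lstmtraining on the PATH. The function name find_lstmtraining and its argument (the path to your tesseract checkout) are hypothetical, introduced only for illustration.

```shell
# Hypothetical helper: locate an lstmtraining binary.
# $1 = path to the tesseract source checkout
find_lstmtraining() {
  if [ -x "$1/src/training/lstmtraining" ]; then
    # Built in the source tree, usable without installing.
    printf '%s\n' "$1/src/training/lstmtraining"
  elif command -v lstmtraining >/dev/null 2>&1; then
    # Installed copy found on the PATH.
    command -v lstmtraining
  else
    echo "lstmtraining not found" >&2
    return 1
  fi
}
```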

>mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

This means that the traineddata file was NOT found.

Your first command creates its output in one directory, and your second command references a different directory. Both should be the same.
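
A simple pre-flight check can catch this before the assert fires. The sketch below only tests that the file you intend to pass to --traineddata exists and is readable. The function name check_traineddata is hypothetical and not part of the Tesseract tools.

```shell
# Hypothetical pre-flight check for the file passed to --traineddata.
# $1 = path to the traineddata file
check_traineddata() {
  if [ -r "$1" ]; then
    echo "ok: $1"
  else
    # A missing file usually means this path and the --output_dir
    # given to tesstrain.sh do not point at the same place.
    echo "missing or unreadable: $1" >&2
    return 1
  fi
}
```

For example, run check_traineddata ~/tesstutorial/engtrain/eng/eng.traineddata first and start lstmtraining only if it succeeds.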

> --output_dir ~/tesstutorial/engeval

> --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata



易鑫

Jan 25, 2019, 2:14:25 AM
to tesser...@googlegroups.com
Thank you all. This problem was solved after I installed the Microsoft fonts. Now there is another issue.
When I run
training/lstmtraining --debug_interval 100 \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
I see this in the basetrain.log file:
Warning: given outputs 111 not equal to unicharset of 110.
Num outputs,weights in Series:
  1,36,0,1:1, 0
Num outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys48:48, 12480
  Lfx96:96, 55680
  Lrx96:96, 74112
  Lfx256:256, 361472
  Fc110:110, 28270
Total weights = 532174
Built network:[1,36,0,1[C3,3Ft16]Mp3,3Lfys48Lfx96Lrx96Lfx256Fc110] from request [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]
Training parameters:
  Debug interval = 100, weights = 0.1, learning rate = 0.002, momentum=0.5
null char=109
Loaded 72/72 pages (1-72) of document /home/yixin/tesstutorial/engtrain/eng.Arial_Bold.exp0.lstmf
Loaded 72/72 pages (1-72) of document /home/yixin/tesstutorial/engeval/eng.Impact_Condensed.exp0.lstmf
Loaded 72/72 pages (1-72) of document /home/yixin/tesstutorial/engtrain/eng.Century_Schoolbook_L_Medium.exp0.lstmf
Loaded 72/72 pages (1-72) of document /home/yixin/tesstutorial/engtrain/eng.Century_Schoolbook_L_Italic.exp0.lstmf
Loaded 72/72 pages (1-72) of document /home/yixin/tesstutorial/engtrain/eng.Arial_Italic.exp0.lstmf
Loaded 72/72 pages (1-72) of document /home/yixin/tesstutorial/engtrain/eng.Courier_New_Bold.exp0.lstmf
Loaded 72/72 pages (1-72) of document /home/yixin/tesstutorial/engtrain/eng.Century_Schoolbook_L_Bold.exp0.lstmf
Loaded 72/72 pages (1-72) of document /home/yixin/tesstutorial/engtrain/eng.Century_Schoolbook_L_Bold_Italic.exp0.lstmf
Loaded 72/72 pages (1-72) of document /home/yixin/tesstutorial/engtrain/eng.Arial_Bold_Italic.exp0.lstmf
Loaded 72/72 pages (1-72) of document /home/yixin/tesstutorial/engtrain/eng.Arial.exp0.lstmf
Starting sh -c "trap 'kill %1' 0 1 2 ; java -Xms1024m -Xmx2048m -jar ./ScrollView.jar & wait"
Error: Unable to access jarfile ./ScrollView.jar
sh: 1: kill: No such process

ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...
ScrollView: Waiting for server...

It seems that ScrollView.jar is needed, but the wiki says: "It is also useful, but not required, to build ScrollView.jar:"

So should I build ScrollView.jar, or have I made a mistake somewhere? Thank you.



Shree Devi Kumar

Jan 25, 2019, 2:16:50 AM
to tesser...@googlegroups.com
ScrollView.jar is useful for the visual display of training. If you are not interested in that, you can change --debug_interval 100 to --debug_interval -1.
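
One way to automate that choice, as a rough sketch: use a positive interval only when ScrollView.jar is actually present, and fall back to -1 otherwise. The helper name pick_debug_interval is hypothetical. Its directory argument defaults to the current directory because the log above shows the trainer trying ./ScrollView.jar.

```shell
# Hypothetical helper: choose a value for --debug_interval.
# $1 = directory expected to contain ScrollView.jar (defaults to ".")
pick_debug_interval() {
  if [ -f "${1:-.}/ScrollView.jar" ]; then
    # Jar is present, so the visual debugger can start.
    printf '%s\n' 100
  else
    # No jar: use the value suggested above to avoid waiting on ScrollView.
    printf '%s\n' -1
  fi
}
```

The result can then be passed as --debug_interval "$(pick_debug_interval)" with the remaining lstmtraining flags unchanged.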


Aodren BARY

Jan 25, 2019, 2:17:28 AM
to tesseract-ocr
Yes, change --debug_interval 100 to --debug_interval 0.

Aodren BARY

Jan 25, 2019, 2:18:41 AM
to tesseract-ocr


On Friday, January 25, 2019 at 08:17:28 UTC+1, Aodren BARY wrote:
 change --debug_interval 100 to --debug_interval 0
    It's not mandatory to install ScrollView.

易鑫

Jan 25, 2019, 2:39:50 AM
to tesser...@googlegroups.com
Thank you, Aodren BARY.

I did not read the wiki carefully enough.


"NOTE that to use --debug_interval > 0 you must build ScrollView.jar as well as the other training tools."
