Help: lstmtraining not found


minh...@gmail.com

Aug 6, 2020, 3:40:06 PM
to tesseract-ocr
Dear friends,

I tried to train Tesseract by following the guide at https://github.com/tesseract-ocr/tesseract/issues/1453

Everything worked up to step 10:

SCROLLVIEW_PATH=~/tesseract/java \
~/tesseract/src/training/lstmtraining \
--debug_interval 100 \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base \
--learning_rate 20e-4 \
--debug_interval -1 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log


Then nothing happens. The basetrain.log file contains:
zsh: no such file or directory: /Users/minhtupham/tesseract/src/training/lstmtraining

Is the lstmtraining file missing?
I checked the training folder and found only a source file named "lstmtraining.cpp".

What do I need to do?

Many thanks,

TuPM

minh...@gmail.com

Aug 6, 2020, 3:43:49 PM
to tesseract-ocr
Sorry, I forgot to mention:

I use macOS 10.15.6 Catalina.

The tesseract version: tesseract 5.0.0-alpha-773-gd33ed

Also, tesseract was installed via MacPorts, since installing via brew produced a lot of errors.

Thanks,

Shree Devi Kumar

Aug 6, 2020, 9:43:02 PM
to tesseract-ocr
If you have tesseract and all the training tools installed, you should be able to run tesseract, lstmtraining, etc. without giving the path.

What's the output of

which tesseract
tesseract -v
which lstmtraining
lstmtraining -v
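
The same check can be scripted. This is a minimal sketch, not part of the original reply: the helper name check_tools is hypothetical, and it only assumes a POSIX-style shell where `command -v` is available (the variable is deliberately not called `path`, which zsh ties to PATH).

```shell
# Report where each tool resolves on PATH, or flag it as missing.
# check_tools is a hypothetical helper name used only for this sketch.
check_tools() {
  missing=0
  for tool in "$@"; do
    if location=$(command -v "$tool" 2>/dev/null); then
      printf '%s: %s\n' "$tool" "$location"
    else
      printf '%s: not found on PATH\n' "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# The function returns non-zero when at least one tool is missing.
check_tools tesseract lstmtraining combine_tessdata \
  || echo "At least one training tool is missing from PATH."
```

If lstmtraining resolves to a path here, it can be called by name, and the hard-coded ~/tesseract/src/training/ prefix from the guide is not needed.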




minh...@gmail.com

Aug 6, 2020, 10:30:33 PM
to tesseract-ocr
Many thanks Shree,

As you suggested, I removed the path, and it works now.

By the way, here are my tesseract and lstmtraining versions:

tesseract 5.0.0-alpha-773-gd33ed
 leptonica-1.78.0

~ % lstmtraining -v
5.0.0-alpha-773-gd33ed

minh...@gmail.com

Aug 7, 2020, 4:54:20 AM
to tesseract-ocr
Could you also please share some advice on training?

I am training Vietnamese for the Times New Roman font only at this time.

The best traineddata gives good results, but it is big (it covers all fonts) and takes quite a long time to process.

I plan to train from scratch,
...
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log

After 5000 iterations, the error rate is 76.676, which is very high.

What should I do next?
Will there be any improvement if I rerun the above training a second or third time (with the same data in --train_listfile)? My understanding is that the traineddata is updated on each run.
Is there a way to extract a traineddata for only some selected fonts from the best traineddata?

Thanks,

TuPM

Shree Devi Kumar

Aug 7, 2020, 5:09:42 AM
to tesseract-ocr
The number of iterations for training from scratch needs to be much larger: hundreds of thousands.

5000 is used in the tutorial only to give an idea of the training process. You need to train until the error rate is close to 0.01.
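
A longer run might be set up along the following lines. This is only a sketch under stated assumptions: it resumes from the checkpoint that lstmtraining writes next to --model_output (base_checkpoint for the "base" prefix used in this thread), the 400000 cap is an illustrative guess at "hundreds of thousands", and the helper name resume_training_cmd is hypothetical. Check the flags against `lstmtraining --help` for your build before relying on them.

```shell
# Sketch: resume LSTM training from the last checkpoint instead of starting over.
# Paths follow the thread; the iteration cap is an assumed, illustrative value.
OUT_DIR="$HOME/tesstutorial/engoutput"
MAX_ITERATIONS=400000     # assumption: "hundreds of thousands" of iterations
TARGET_ERROR_RATE=0.01    # stop once the error rate (in percent) reaches this

# Hypothetical helper: prints the command on one line without running it.
resume_training_cmd() {
  printf '%s ' lstmtraining \
    --continue_from "$OUT_DIR/base_checkpoint" \
    --traineddata "$HOME/tesstutorial/engtrain/eng/eng.traineddata" \
    --model_output "$OUT_DIR/base" \
    --train_listfile "$HOME/tesstutorial/engtrain/eng.training_files.txt" \
    --eval_listfile "$HOME/tesstutorial/engeval/eng.training_files.txt" \
    --target_error_rate "$TARGET_ERROR_RATE" \
    --max_iterations "$MAX_ITERATIONS"
  printf '\n'
}

# Dry run: review the printed command, then paste it into the shell to start training.
resume_training_cmd
```

Printing the command first keeps the sketch harmless to run. Training stops at whichever limit is reached first, the iteration cap or the target error rate.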

minh...@gmail.com

Aug 7, 2020, 11:04:38 AM
to tesseract-ocr
The training stops when the error rate is about 0.5.
How could I change the code to continue training until the error rate is close to 0.01? Thanks.
