Help: lstmtraining not found

minh...@gmail.com

Aug 6, 2020, 3:40:06 PM

to tesseract-ocr

Dear friends,

I tried to run Tesseract training by following the guide at: https://github.com/tesseract-ocr/tesseract/issues/1453

Everything worked until step 10:

SCROLLVIEW_PATH=~/tesseract/java \
~/tesseract/src/training/lstmtraining \
--debug_interval 100 \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base \
--learning_rate 20e-4 \
--debug_interval -1 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log

Then nothing happens, and basetrain.log contains:

zsh: no such file or directory: /Users/minhtupham/tesseract/src/training/lstmtraining

Is the lstmtraining file missing?

I checked the training folder, and there is a file named "lstmtraining.cpp".

Could you please tell me what I need to do?

Many thanks,

TuPM

minh...@gmail.com

Aug 6, 2020, 3:43:49 PM

to tesseract-ocr

Sorry, I forgot to mention:

I use macOS 10.15.6 Catalina.

The Tesseract version: tesseract 5.0.0-alpha-773-gd33ed

Also, Tesseract is installed via MacPorts, since installing via brew produced a lot of errors.

Thanks,

Shree Devi Kumar

Aug 6, 2020, 9:43:02 PM

to tesseract-ocr

If you have tesseract and all the training tools installed, you should be able to use tesseract, lstmtraining, etc. without giving the path.

What's the output of

which tesseract

tesseract -v

which lstmtraining

lstmtraining -v
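
The same check can be scripted. This is only a minimal sketch, assuming a POSIX-compatible shell; `check_tools` is a hypothetical helper name and not part of Tesseract. It prints where each tool resolves on PATH and returns non-zero if any of them is missing:

```shell
# Report where each named tool resolves on PATH, or flag it as missing.
# check_tools is a hypothetical helper, not something Tesseract ships.
check_tools() {
  missing=0
  for tool in "$@"; do
    if resolved=$(command -v "$tool" 2>/dev/null); then
      printf '%s: %s\n' "$tool" "$resolved"
    else
      printf '%s: not found on PATH\n' "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example call:
#   check_tools tesseract lstmtraining
```

If a tool is reported as missing, the training tools are probably not installed (or not on PATH), and pointing at a path inside the source tree will not help unless the tools were actually built there.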

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b45b1f8d-4e84-482b-b0f1-03670a14055en%40googlegroups.com.

minh...@gmail.com

Aug 6, 2020, 10:30:33 PM

to tesseract-ocr

Many thanks Shree,

As you suggested, I removed the path, and now it works.

By the way, my tesseract and lstmtraining versions:

tesseract 5.0.0-alpha-773-gd33ed
leptonica-1.78.0

~ % lstmtraining -v
5.0.0-alpha-773-gd33ed

minh...@gmail.com

Aug 7, 2020, 4:54:20 AM

to tesseract-ocr

Could you also please give me some advice on training?

I am training Vietnamese for only Times New Roman at this time.

The best traineddata is good, but it is big (it covers all fonts) and takes quite a long time to process.

I plan to train from scratch,

...

--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log

After 5000 iterations, the error rate is 76.676, which is very high.

What should I do next?

Is there any improvement if I rerun the above training a second or third time (with the same data in --train_listfile ~)? As I understand it, the traineddata is updated each time.

Is there a way to extract traineddata from best_traineddata for only some selected fonts?

Thanks,

TuPM

Shree Devi Kumar

Aug 7, 2020, 5:09:42 AM

to tesseract-ocr

The number of iterations for training from scratch needs to be much larger: hundreds of thousands.

5000 is used in the tutorial to give an idea of the training process. You need to train until the error rate is close to 0.01.
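
As a rough sketch of what a longer run could look like: the wrapper below resumes from the checkpoint that lstmtraining writes next to --model_output (here assumed to be ~/tesstutorial/engoutput/base_checkpoint) and raises the iteration cap. The function name `resume_training`, the `OUT_DIR` and `LSTMTRAINING` variables, and the value 400000 are illustrative assumptions, not part of the tutorial; the paths are the ones from the step 10 command.

```shell
# Hypothetical wrapper: continue training from the last checkpoint.
#   $1 = new iteration cap (e.g. 400000, an illustrative value)
# LSTMTRAINING may be overridden, e.g. with "echo" for a dry run.
resume_training() {
  max_iter=$1
  out_dir=${OUT_DIR:-"$HOME/tesstutorial/engoutput"}
  "${LSTMTRAINING:-lstmtraining}" \
    --continue_from "$out_dir/base_checkpoint" \
    --traineddata "$HOME/tesstutorial/engtrain/eng/eng.traineddata" \
    --model_output "$out_dir/base" \
    --train_listfile "$HOME/tesstutorial/engtrain/eng.training_files.txt" \
    --eval_listfile "$HOME/tesstutorial/engeval/eng.training_files.txt" \
    --target_error_rate 0.01 \
    --max_iterations "$max_iter"
}

# Dry run (prints the arguments instead of training):
#   LSTMTRAINING=echo; resume_training 400000
```

Training should stop when either the target error rate or the iteration cap is reached, whichever comes first, so both may need adjusting.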

minh...@gmail.com

Aug 7, 2020, 11:04:38 AM

to tesseract-ocr

The training stops when the error is about 0.5.

How could I change the code to continue training until the error rate is close to 0.01? Thanks.
