Estimating duration of the train data creation

Sim Tov

Aug 21, 2021, 5:00:47 PM
to tesser...@googlegroups.com
Hello,

I want to train Tesseract 4 (LSTM) from scratch to recognize a certain font family, and I ran this command:

/usr/share/tesseract-ocr/tesstrain.sh --fonts_dir Fonts --lang heb --linedata_only --noextract_font_properties --langdata_dir ./langdata  --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/ --output_dir output/train --fontlist <list> <of> <eight> <fonts>

My training_text file is 26M and my wordlist is 6.3M. I launched the command above 2 days ago and the process is still running. I get output like this:

Page 3357
Loaded 171819/171819 pages (1-171819) of document /tmp/tmp.M6Ams42Ik5/...

1. Is there a way to estimate how long all this will take or how many pages are going to be loaded?
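
For reference, the kind of estimate I have in mind is a plain linear extrapolation from the log output. Below is a minimal sketch, not anything tesstrain.sh provides. It assumes that the "Page N" counter runs up to the total shown in the "Loaded .../... pages" line, which I have not confirmed, and that pages are processed at a roughly constant rate. The function names `parse_progress` and `estimate_remaining_hours` are hypothetical helpers of my own.

```python
import re


def parse_progress(log_text):
    """Return (current_page, total_pages) found in the log text.

    Assumption: the last "Page N" line is the current position and the
    last "Loaded X/Y pages" line gives the total Y. Either value is None
    if the corresponding line is not present.
    """
    pages = re.findall(r"^Page (\d+)\s*$", log_text, flags=re.MULTILINE)
    totals = re.findall(r"Loaded \d+/(\d+) pages", log_text)
    current = int(pages[-1]) if pages else None
    total = int(totals[-1]) if totals else None
    return current, total


def estimate_remaining_hours(current, total, elapsed_hours):
    """Linear extrapolation: remaining pages divided by pages per hour so far."""
    if not current or not total or elapsed_hours <= 0:
        raise ValueError("need a positive page count, total, and elapsed time")
    pages_per_hour = current / elapsed_hours
    return (total - current) / pages_per_hour


if __name__ == "__main__":
    sample = (
        "Page 3357\n"
        "Loaded 171819/171819 pages (1-171819) of document /tmp/example\n"
    )
    current, total = parse_progress(sample)
    # 48 hours corresponds to the "2 days" mentioned above.
    remaining = estimate_remaining_hours(current, total, elapsed_hours=48.0)
    print(f"{current}/{total} pages, roughly {remaining:.0f} hours remaining")
```

If the "Page N" counter means something else, the estimate from this sketch would of course be meaningless, which is part of what I am asking.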


In the previous stage, the text was rendered with output like this:

Rendered page 1796 to file /tmp/tmp.vmJd24cTIt/...

2. Is there a way to estimate how many pages are going to be rendered?

Thank you!
