Generating Arabic PLUS traineddata gives an error

Essam Zaky

Mar 28, 2020, 1:26:13 PM
to tesseract-ocr
Dear @Shreeshrii
I followed your bash script to add the Andalus font to the Arabic language; here is the script URL:

All steps work except the last one, which generates the traineddata. Here is the error:

osboxes@osboxes:~/tesstutorial/tesseract$ time lstmtraining \
>   --stop_training \
>   --continue_from ~/tesstutorial/ara_from_full/PLUS_checkpoint \
>   --traineddata ~/tesstutorial/tesseract/tessdata/best/ara.traineddata \
>   --model_output ~/tesstutorial/ara_from_full/ara.Andalus.PLUS.traineddata
Loaded file /home/osboxes/tesstutorial/ara_from_full/PLUS_checkpoint, unpacking...
Code range changed from 74 to 85!
Must supply the old traineddata for code conversion!
Failed to read continue from: /home/osboxes/tesstutorial/ara_from_full/PLUS_checkpoint


Best Regards
Essam

Shree Devi Kumar

Mar 28, 2020, 9:16:09 PM
to tesseract-ocr
Please check that you have used the correct path for the traineddata file.

Please share the lstmtraining command that you used before this for training.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/0c9123f5-8e80-447c-9bf1-2c6ec9831238%40googlegroups.com.

Shree Devi Kumar

Mar 28, 2020, 11:06:16 PM
to tesseract-ocr

lstmtraining --model_output ../tesstutorial/trainplusminus/plusminus \
  --continue_from ../tesstutorial/trainplusminus/eng.lstm \
  --traineddata ../tesstutorial/trainplusminus/eng/eng.traineddata \
  --old_traineddata tessdata/best/eng.traineddata \
  --train_listfile ../tesstutorial/trainplusminus/eng.training_files.txt \
  --max_iterations 3600

...


lstmtraining \
  --stop_training \
  --continue_from ../tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ../tesstutorial/trainplusminus/eng/eng.traineddata \
  --model_output ../tesstutorial/trainplusminus/eng_plusminus.traineddata

--traineddata needs to be the same in both commands.
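
One way to keep the two commands consistent is to store the --traineddata path in a single variable and reuse it. The sketch below is illustrative only. It uses the Arabic paths mentioned in this thread, but the `ara.lstm` and `ara.training_files.txt` locations, the function names, and the `LSTMTRAINING` override are assumptions, and `--max_iterations 3600` is simply borrowed from the English example above.

```shell
# Sketch: reuse one --traineddata path in the training and --stop_training steps.
# Paths marked "assumed" are hypothetical; adjust them to your own layout.

LSTMTRAINING="${LSTMTRAINING:-lstmtraining}"  # set LSTMTRAINING=echo for a dry run

# Starter traineddata built from the training text (used in BOTH steps).
NEW_TRAINEDDATA="$HOME/tesstutorial/araeval/ara/ara.traineddata"
# The best model being fine-tuned; only needed for code conversion while training.
OLD_TRAINEDDATA="$HOME/tesstutorial/tesseract/tessdata/best/ara.traineddata"
OUT_DIR="$HOME/tesstutorial/ara_from_full"

train_plus() {
  "$LSTMTRAINING" \
    --model_output "$OUT_DIR/PLUS" \
    --continue_from "$OUT_DIR/ara.lstm" \
    --traineddata "$NEW_TRAINEDDATA" \
    --old_traineddata "$OLD_TRAINEDDATA" \
    --train_listfile "$HOME/tesstutorial/araeval/ara.training_files.txt" \
    --max_iterations 3600
  # ara.lstm and ara.training_files.txt locations above are assumed.
}

finalize_plus() {
  "$LSTMTRAINING" \
    --stop_training \
    --continue_from "$OUT_DIR/PLUS_checkpoint" \
    --traineddata "$NEW_TRAINEDDATA" \
    --model_output "$OUT_DIR/ara.Andalus.PLUS.traineddata"
}
```

Calling `train_plus` and then `finalize_plus` passes the same `NEW_TRAINEDDATA` to both steps, which is the condition described above.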

Essam Zaky

Mar 29, 2020, 1:38:07 AM
to tesseract-ocr
Hi @shreeshrii
Attached is the bash script as described on the following page.

When I change line #51 from

--traineddata ~/tesstutorial/tesseract/tessdata/best/ara.traineddata \

to

--traineddata ~/tesstutorial/araeval/ara/ara.traineddata

it now works without error.
But I have another question:
the character set in the best traineddata has 85 entries, while the newly generated character set contains only 74.
How can I keep the unicharset size at 85, as in best?

araplus.sh

Shree Devi Kumar

Mar 29, 2020, 1:50:54 AM
to tesseract-ocr
The unicharset is based on the training text you use. Please make sure you have all required characters in the text.

Fine-tuning for impact works with the unicharset of the best traineddata file, but then you cannot add any characters to it.
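
To see which entries differ between two unicharsets, something like the following may help. It is a minimal sketch that assumes the plain-text unicharset layout (entry count on the first line, then one entry per line with the symbol as the first field). The function names and the example paths are made up for illustration.

```shell
# Print the number of entries declared on the first line of a unicharset file.
unicharset_size() {
  head -n 1 "$1"
}

# Print symbols present in the reference unicharset ($1) but absent from the new one ($2).
missing_symbols() {
  # First pass reads the new file and records its symbols;
  # second pass reads the reference file and prints the unseen ones.
  awk 'NR == FNR { if (FNR > 1) seen[$1] = 1; next }
       FNR > 1 && !($1 in seen) { print $1 }' "$2" "$1"
}

# Example usage (directories are assumptions; combine_tessdata -u unpacks the components):
#   mkdir -p /tmp/best /tmp/new
#   combine_tessdata -u ~/tesstutorial/tesseract/tessdata/best/ara.traineddata /tmp/best/ara.
#   combine_tessdata -u ~/tesstutorial/araeval/ara/ara.traineddata /tmp/new/ara.
#   unicharset_size /tmp/best/ara.lstm-unicharset
#   missing_symbols /tmp/best/ara.lstm-unicharset /tmp/new/ara.lstm-unicharset
```

Any symbol this prints would need to appear in the training text for it to end up in the new unicharset.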


Essam Zaky

Mar 29, 2020, 3:23:15 AM
to tesseract-ocr
Thanks @shreeshrii

When preparing the training text, what are the recommendations for this step?

Is there any tutorial that shows how to prepare the training text?

For example:
What is the recommended text size?
How many times should each character be repeated in the training set?
What about ligatures: how should they be handled, and how are they added to the unicharset?
....

Shree Devi Kumar

Mar 29, 2020, 5:45:01 AM
to tesseract-ocr

Essam Zaky

Mar 29, 2020, 8:00:33 AM
to tesseract-ocr
I read this page but still need more information about how to build a training data set.
Say I want to train the engine to recognize a field containing 15 digits.
Is it enough to provide a small text file containing the 10 digits from 0 to 9,
or should I prepare the training text to contain all 15-digit combinations, which would mean 10^15 strings, a huge amount of data?

Shree Devi Kumar

Mar 29, 2020, 8:30:30 AM
to tesseract-ocr
A small file would work for the legacy engine. For LSTM training you need a large file.
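
Since all 15-digit combinations cannot be enumerated, one option is to sample a large number of random 15-digit lines for the training text. The following is only a sketch of that idea; the function name, line count, seed, and output file name are arbitrary choices, not recommendations from this thread or from the Tesseract documentation.

```shell
# Print N lines of 15 random digits each.
# $1 = number of lines, $2 = optional random seed (defaults to 1).
make_digit_lines() {
  awk -v n="$1" -v seed="${2:-1}" 'BEGIN {
    srand(seed)                       # fixed seed makes the output reproducible
    for (i = 0; i < n; i++) {
      line = ""
      for (j = 0; j < 15; j++)
        line = line int(rand() * 10)  # append one digit in the range 0..9
      print line
    }
  }'
}

# Example (hypothetical file name):
#   make_digit_lines 50000 42 > digits.training_text
```

How many lines are actually needed depends on the fonts and image conditions, so the count here should be treated as a starting point to experiment with.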