Generate Arabic PLUS traineddata gives error

Essam Zaky

Mar 28, 2020, 1:26:13 PM

to tesseract-ocr

Dear @Shreeshrii

I followed your bash script to add the Andalus font to the Arabic language. Here is the script URL:

https://github.com/tesseract-ocr/tesseract/issues/2695#issuecomment-539412948

All steps work except the last one, which generates the traineddata. Here is the error:

osboxes@osboxes:~/tesstutorial/tesseract$ time lstmtraining \
> --stop_training \
> --continue_from ~/tesstutorial/ara_from_full/PLUS_checkpoint \
> --traineddata ~/tesstutorial/tesseract/tessdata/best/ara.traineddata \
> --model_output ~/tesstutorial/ara_from_full/ara.Andalus.PLUS.traineddata
Loaded file /home/osboxes/tesstutorial/ara_from_full/PLUS_checkpoint, unpacking...
Code range changed from 74 to 85!
Must supply the old traineddata for code conversion!
Failed to read continue from: /home/osboxes/tesstutorial/ara_from_full/PLUS_checkpoint

Best Regards

Essam

Shree Devi Kumar

Mar 28, 2020, 9:16:09 PM

to tesseract-ocr

Please check that you have used the correct path for the traineddata file.

Please share the lstmtraining command that you used before this for training.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/0c9123f5-8e80-447c-9bf1-2c6ec9831238%40googlegroups.com.

Shree Devi Kumar

Mar 28, 2020, 11:06:16 PM

to tesseract-ocr

See https://github.com/Shreeshrii/tess4training/blob/master/6-plusminus.sh

lstmtraining --model_output ../tesstutorial/trainplusminus/plusminus \
--continue_from ../tesstutorial/trainplusminus/eng.lstm \
--traineddata ../tesstutorial/trainplusminus/eng/eng.traineddata \
--old_traineddata tessdata/best/eng.traineddata \
--train_listfile ../tesstutorial/trainplusminus/eng.training_files.txt \
--max_iterations 3600

...

lstmtraining \
--stop_training \
--continue_from ../tesstutorial/trainplusminus/plusminus_checkpoint \
--traineddata ../tesstutorial/trainplusminus/eng/eng.traineddata \
--model_output ../tesstutorial/trainplusminus/eng_plusminus.traineddata

The --traineddata argument needs to be the same in both commands.
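A minimal sketch of one way to keep the two invocations consistent: store the path once in a variable and reuse it. The paths and flags are copied from the commands above; the names `run`, `train`, `finalize` and the `DRY_RUN` switch are hypothetical helpers introduced only for illustration.

```shell
# Paths copied from the commands above; adjust them to your own layout.
OUT_DIR=../tesstutorial/trainplusminus
TRAINEDDATA="$OUT_DIR/eng/eng.traineddata"     # used by BOTH steps
OLD_TRAINEDDATA=tessdata/best/eng.traineddata  # used by the training step only

# run (hypothetical helper): print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Training step: starts from the extracted LSTM model and writes checkpoints.
train() {
  run lstmtraining \
    --model_output "$OUT_DIR/plusminus" \
    --continue_from "$OUT_DIR/eng.lstm" \
    --traineddata "$TRAINEDDATA" \
    --old_traineddata "$OLD_TRAINEDDATA" \
    --train_listfile "$OUT_DIR/eng.training_files.txt" \
    --max_iterations 3600
}

# Final step: converts the checkpoint into a traineddata file,
# reusing the same --traineddata value as the training step.
finalize() {
  run lstmtraining \
    --stop_training \
    --continue_from "$OUT_DIR/plusminus_checkpoint" \
    --traineddata "$TRAINEDDATA" \
    --model_output "$OUT_DIR/eng_plusminus.traineddata"
}
```

Setting DRY_RUN=1 before calling train and finalize prints both commands, so the two --traineddata values can be compared before anything is run.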

Essam Zaky

Mar 29, 2020, 1:38:07 AM

to tesseract-ocr

Hi @shreeshrii

Attached is the bash script, as described on the following page:

https://github.com/tesseract-ocr/tesseract/issues/2695#issuecomment-539412948

When I change line #51 from

--traineddata ~/tesstutorial/tesseract/tessdata/best/ara.traineddata \

to

--traineddata ~/tesstutorial/araeval/ara/ara.traineddata

it now works without error.

But I have another question.

The character set in the best traineddata has 85 entries, while the newly generated character set contains only 74.

How can I keep the unicharset size at 85, as in best?
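One way to check these counts, sketched under assumptions: a unicharset file stores its entry count on the first line, and `combine_tessdata -u` unpacks the components of a traineddata file, including `lstm-unicharset` for LSTM models. The function name `unicharset_size` and the `/tmp/best/ara.` output prefix are hypothetical.

```shell
# unicharset_size FILE (hypothetical helper): print the entry count,
# which is stored on the first line of a unicharset file.
unicharset_size() {
  head -n 1 "$1"
}

# Possible usage, assuming the directory /tmp/best already exists:
#   combine_tessdata -u ~/tesstutorial/tesseract/tessdata/best/ara.traineddata /tmp/best/ara.
#   unicharset_size /tmp/best/ara.lstm-unicharset
```

Running the same two steps on the newly generated traineddata should show where the 74 and 85 in the error message come from.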

Attachment: araplus.sh

Shree Devi Kumar

Mar 29, 2020, 1:50:54 AM

to tesseract-ocr

The unicharset is based on the training text you use. Please make sure you have all required characters in the text.

Fine-tune for impact works with the unicharset of the best traineddata file, but then you can't add any characters to it.
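A rough sketch of how character coverage in a training text could be checked before training. It counts raw characters only, so the result will not necessarily match the unicharset that the Tesseract tools generate (which also depends on normalization). The function names and file arguments are hypothetical, and a UTF-8 locale is assumed so that grep treats each Arabic character as one unit.

```shell
# list_chars FILE (hypothetical helper): print each distinct
# non-whitespace character of FILE, one per line.
list_chars() {
  grep -o '[^[:space:]]' "$1" | LC_ALL=C sort -u
}

# missing_chars REQUIRED TEXT (hypothetical helper): print the characters
# that occur in the REQUIRED file but never appear in the training TEXT.
missing_chars() {
  tmp=$(mktemp)
  list_chars "$2" > "$tmp"
  # -F fixed strings, -x whole-line match, -v keep only lines NOT found in $tmp
  list_chars "$1" | grep -Fxv -f "$tmp" || true
  rm -f "$tmp"
}
```

For example, `list_chars training_text.txt | wc -l` gives the number of distinct characters, and `missing_chars required_chars.txt training_text.txt` lists the ones the text still lacks.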

Essam Zaky

Mar 29, 2020, 3:23:15 AM

to tesseract-ocr

Thanks @shreeshrii.

What are the recommendations for preparing the training text?

Is there any tutorial that shows how to prepare the training text?

For example:

What is the recommended text size?

How many instances of each character should appear in the training set?

What about ligatures: how should they be handled, and how are they added to the unicharset?

....

Shree Devi Kumar

Mar 29, 2020, 5:45:01 AM

to tesseract-ocr

https://github.com/tesseract-ocr/tessdoc/blob/master/TrainingTesseract-4.00.md#introduction

Essam Zaky

Mar 29, 2020, 8:00:33 AM

to tesseract-ocr

I read this page, but I still need more information about how to build the training data set.

Say I want to train the engine to recognize a field containing 15 digits.

Is it enough to provide a small text file containing the 10 digits from 0 to 9?

Or should I prepare the training text to contain all 15-digit combinations, which means 10^15 strings, a very large amount of data?

https://github.com/tesseract-ocr/tessdoc/blob/master/TrainingTesseract-4.00.md#introduction

Shree Devi Kumar

Mar 29, 2020, 8:30:30 AM

to tesseract-ocr

A small file would work for the legacy engine. For LSTM training you need a large file.
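One possible way to get a large digit-only training text without enumerating all 10^15 combinations is to sample random strings. The sketch below is only an illustration: the function name `gen_digit_lines`, the default seed, and the line count in the usage note are assumptions, and the thread does not say how many lines are enough.

```shell
# gen_digit_lines COUNT [LENGTH] [SEED] (hypothetical helper):
# print COUNT lines, each made of LENGTH random digits (default 15).
gen_digit_lines() {
  awk -v n="$1" -v len="${2:-15}" -v seed="${3:-1}" 'BEGIN {
    srand(seed)                       # fixed seed makes the output repeatable
    for (i = 0; i < n; i++) {
      line = ""
      for (j = 0; j < len; j++)
        line = line int(rand() * 10)  # append one digit in the range 0..9
      print line
    }
  }'
}
```

For example, `gen_digit_lines 100000 > digits.training_text` writes 100000 sample lines of 15 digits each; the count here is an arbitrary illustrative value.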