mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

Jennil Thiyam

May 27, 2019, 5:25:17 AM

to tesseract-ocr

I got an error while trying to perform fine-tuning. The command I used is below:

lstmtraining --model_output /model \
  --continue_from /ben_extract/ben.lstm \
  --traineddata /tesstutorial_output/ben/ben.traineddata \
  --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/ben.traineddata \
  --train_listfile /tesstutorial_output/ben.training_files.txt \
  --max_iterations 1500

I have read the discussion about the same error, but the solutions provided there were all about changing paths, and I am sure my paths are right. Please help me out.

Shree Devi Kumar

May 27, 2019, 6:53:37 AM

to tesser...@googlegroups.com

Has /ben_extract/ben.lstm been extracted from /usr/share/tesseract-ocr/4.00/tessdata/ben.traineddata?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/0958d266-6f2f-4d10-9104-ee8145a4f005%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jennil Thiyam

May 27, 2019, 6:55:33 AM

to tesser...@googlegroups.com

Yes, I extracted it with the combine_tessdata command.

Shree Devi Kumar

May 27, 2019, 8:13:43 AM

to tesser...@googlegroups.com

Is /usr/share/tesseract-ocr/4.00/tessdata/ben.traineddata from the tessdata_best repo? Only those models can be used for fine-tuning.

Jennil Thiyam

May 27, 2019, 8:54:39 AM

to tesser...@googlegroups.com

I installed Tesseract using the install command on Ubuntu 18, so I didn't install it from the git repository. If I install from the git repository, will this work?

Shree Devi Kumar

May 27, 2019, 12:51:02 PM

to tesser...@googlegroups.com

You can download ben.traineddata from tessdata_best to a different location and use that in the lstmtraining command.

Jennil Thiyam

May 28, 2019, 12:56:48 AM

to tesser...@googlegroups.com

Do you mean I should change only the path of the old traineddata (the --old_traineddata argument in the command below, which I underlined) to the path of the ben.traineddata that I am going to download from tessdata_best? Or do I need to redo the whole process with the downloaded ben.traineddata?

lstmtraining --model_output /model \
  --continue_from /ben_extract/ben.lstm \
  --traineddata /tesstutorial_output/ben/ben.traineddata \
  --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/ben.traineddata \
  --train_listfile /tesstutorial_output/ben.training_files.txt \
  --max_iterations 1500

Do you have any idea about the estimated time it will take for 1500 iterations?

Thank you

Shree Devi Kumar

May 28, 2019, 1:00:22 AM

to tesser...@googlegroups.com

The old traineddata and the lstm file need to be in sync. So you should extract the lstm file after downloading the traineddata and use those two files together. The rest of the files don't need to be regenerated.
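
As a rough sketch of the two steps described above: the functions below re-extract the LSTM component from a downloaded tessdata_best model and then point both --continue_from and --old_traineddata at that matching pair. The function names and the best_dir argument are hypothetical; the remaining paths are copied from the command earlier in this thread and may need adjusting.

```shell
#!/bin/sh
# Sketch only: best_dir is an assumed directory holding the <lang>.traineddata
# file downloaded from the tessdata_best repository.

# Extract <lang>.lstm from the traineddata sitting in the same directory,
# so that the two files are guaranteed to come from the same model.
extract_lstm() {
    best_dir=$1
    lang=$2
    combine_tessdata -e "$best_dir/$lang.traineddata" "$best_dir/$lang.lstm"
}

# Fine-tune with --continue_from and --old_traineddata taken from the same pair.
finetune() {
    best_dir=$1
    lang=$2
    lstmtraining --model_output /model \
        --continue_from "$best_dir/$lang.lstm" \
        --traineddata "/tesstutorial_output/$lang/$lang.traineddata" \
        --old_traineddata "$best_dir/$lang.traineddata" \
        --train_listfile "/tesstutorial_output/$lang.training_files.txt" \
        --max_iterations 1500
}
```

A possible invocation, assuming the model was saved under /tessdata_best, would be: extract_lstm /tessdata_best ben && finetune /tessdata_best ben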

Jennil Thiyam

May 28, 2019, 1:15:17 AM

to tesser...@googlegroups.com

Okay, thank you.

Jennil Thiyam

May 28, 2019, 8:29:47 AM

to tesser...@googlegroups.com

training/tesstrain.sh \
  --fonts_dir /c/Windows/Fonts \
  --tessdata_dir ./tessdata \
  --training_text ../langdata/ara/ara.training_text \
  --langdata_dir ../langdata \
  --lang ara  \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0"    \
  --fontlist "Arial" \
  --output_dir ~/tesstutorial/aratest
  
training/tesstrain.sh \
  --fonts_dir /c/Windows/Fonts \
  --tessdata_dir ./tessdata \
  --training_text ../langdata/ara/ara.training_text \
  --langdata_dir ../langdata \
  --lang ara  \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0"    \
  --fontlist "Arial" \
  "Arial Unicode MS" \
  "Calibri" \
  "Courier New" \
  --output_dir ~/tesstutorial/araeval

Can anyone tell me why we need to create this eval data? I mean, it is going to be the same as the training data.

Shree Devi Kumar

May 28, 2019, 8:52:40 AM

to tesser...@googlegroups.com

It is using a different set of fonts. So training is being done on one set of fonts and eval on others.

Alternatively, you can use a smaller text file for eval and use the same set of fonts.

It all depends on what you want to accomplish with training.
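
For the second option (a smaller eval text with the same fonts), one possible approach is to hold back the last few lines of the training text for eval. This is only a sketch: split_text and its arguments are hypothetical names, and taking the tail of the file is an arbitrary choice rather than anything tesstrain.sh requires.

```shell
#!/bin/sh
# split_text INPUT EVAL_LINES TRAIN_OUT EVAL_OUT
# Writes the last EVAL_LINES lines of INPUT to EVAL_OUT and the rest to TRAIN_OUT.
split_text() {
    input=$1
    eval_lines=$2
    train_out=$3
    eval_out=$4

    # wc -l may pad its output with spaces on some systems, so strip them.
    total=$(wc -l < "$input" | tr -d ' ')

    # Refuse to produce an empty training file.
    if [ "$eval_lines" -ge "$total" ]; then
        echo "eval size ($eval_lines) must be smaller than input ($total lines)" >&2
        return 1
    fi

    train_lines=$((total - eval_lines))
    head -n "$train_lines" "$input" > "$train_out"
    tail -n "$eval_lines" "$input" > "$eval_out"
}
```

The two output files could then be passed as --training_text to two separate tesstrain.sh runs that share the same --fontlist, each with its own --output_dir.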

Jennil Thiyam

May 28, 2019, 8:53:57 AM

to tesser...@googlegroups.com

Okay, now I understand. Thank you, Shree.

Jennil Thiyam

May 29, 2019, 2:40:02 AM

to tesser...@googlegroups.com

I got this output while trying to make the starter training data:

Rendered page 31 to file /tmp/ben-2019-05-29.K90/ben.SolaimanLipi.exp0.tif
Stripped 1 unrenderable words
Rendered page 31 to file /tmp/ben-2019-05-29.K90/ben.Nikosh.exp0.tif
Stripped 1 unrenderable words
Rendered page 37 to file /tmp/ben-2019-05-29.K90/ben.Mukti_Narrow.exp0.tif
Stripped 1 unrenderable words
Rendered page 38 to file /tmp/ben-2019-05-29.K90/ben.Lohit_Bengali.exp0.tif
Stripped 2 unrenderable words
Rendered page 32 to file /tmp/ben-2019-05-29.K90/ben.SolaimanLipi.exp0.tif
Stripped 6 unrenderable words
Rendered page 32 to file /tmp/ben-2019-05-29.K90/ben.Nikosh.exp0.tif
Stripped 1 unrenderable words
Rendered page 38 to file /tmp/ben-2019-05-29.K90/ben.Mukti_Narrow.exp0.tif
Stripped 1 unrenderable words
Rendered page 39 to file /tmp/ben-2019-05-29.K90/ben.Lohit_Bengali.exp0.tif
Rendered page 33 to file /tmp/ben-2019-05-29.K90/ben.SolaimanLipi.exp0.tif
Stripped 5 unrenderable words
Rendered page 33 to file /tmp/ben-2019-05-29.K90/ben.Nikosh.exp0.tif
Stripped 1 unrenderable words
Rendered page 39 to file /tmp/ben-2019-05-29.K90/ben.Mukti_Narrow.exp0.tif
Stripped 1 unrenderable words
Rendered page 40 to file /tmp/ben-2019-05-29.K90/ben.Lohit_Bengali.exp0.tif
Rendered page 34 to file /tmp/ben-2019-05-29.K90/ben.SolaimanLipi.exp0.tif
Rendered page 34 to file /tmp/ben-2019-05-29.K90/ben.Nikosh.exp0.tif
Rendered page 40 to file /tmp/ben-2019-05-29.K90/ben.Mukti_Narrow.exp0.tif
Stripped 1 unrenderable words

......

and then

.......

Invalid start of grapheme sequence:M=0x9c7

Normalization failed for string 'পাে'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'জাে'
Invalid start of grapheme sequence:M=0x9bf
Normalization failed for string 'গাি'
Invalid start of grapheme sequence:M=0x9bf
Normalization failed for string 'রীি'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'ভাে'
Invalid start of grapheme sequence:M=0x9bf
Normalization failed for string 'জাি'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'থাে'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'হাে'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'পুে'
Invalid start of grapheme sequence:M=0x9bf
Normalization failed for string 'পুি'
Invalid start of grapheme sequence:H=0x9cd
Normalization failed for string 'অ্যা'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'খাে'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'চুে'
Invalid start of grapheme sequence:M=0x9bf
Normalization failed for string 'ঢাি'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'তাে'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'উে'
Invalid start of grapheme sequence:M=0x9bf
Normalization failed for string 'উি'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'থাে'
Invalid start of grapheme sequence:M=0x9bf
Normalization failed for string 'তাি'

Invalid start of grapheme sequence:M=0x9bf

But finally I got:

=== Moving lstmf files for training data ===
Moving /tmp/ben-2019-05-29.K90/ben.Bangla_Medium.exp0.lstmf to /home/guest/tesstutorial/train_wa/Eval_wa
Moving /tmp/ben-2019-05-29.K90/ben.Lohit_Bengali.exp0.lstmf to /home/guest/tesstutorial/train_wa/Eval_wa
Moving /tmp/ben-2019-05-29.K90/ben.Mukti_Narrow.exp0.lstmf to /home/guest/tesstutorial/train_wa/Eval_wa
Moving /tmp/ben-2019-05-29.K90/ben.Nikosh.exp0.lstmf to /home/guest/tesstutorial/train_wa/Eval_wa
Moving /tmp/ben-2019-05-29.K90/ben.SolaimanLipi.exp0.lstmf to /home/guest/tesstutorial/train_wa/Eval_wa

Created starter traineddata for LSTM training of language 'ben'

Run 'lstmtraining' command to continue LSTM training for language 'ben'

There was no error. Will this training data be good? I am asking because it feels like many things are not happening the way they should, for example the messages "Normalization failed" and "unrenderable".

Shree Devi Kumar

May 29, 2019, 6:21:02 AM

to tesser...@googlegroups.com

Check that the training text you used is normalized correctly. Also check the Bengali normalization/validation rules:

https://github.com/tesseract-ocr/tesseract/issues/1038
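
To get a feel for how much text is affected before digging into the rules, the messages can simply be counted. The sketch below assumes the tesstrain.sh output was saved to a log file (for example by redirecting stdout and stderr); the function names are hypothetical, and the patterns match the message wording shown in the log above.

```shell
#!/bin/sh
# Number of strings rejected by the normalizer.
count_normalization_failures() {
    grep -c 'Normalization failed for string' "$1"
}

# Total number of words dropped as unrenderable, summed over all pages and fonts.
count_stripped_words() {
    awk '/^Stripped [0-9]+ unrenderable words/ { n += $2 } END { print n + 0 }' "$1"
}

# Distinct failing strings, most frequent first.
# LC_ALL=C forces byte-wise comparison so that different Bengali strings
# are never merged by locale-specific collation.
top_failures() {
    grep 'Normalization failed for string' "$1" \
        | LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -rn
}
```

A handful of failures out of a large training text is a different situation from thousands, so these counts may help decide whether the text needs cleaning first.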

Jennil Thiyam

May 29, 2019, 7:05:51 AM

to tesser...@googlegroups.com

One simple question that confuses me every time: it is about setting the TESSDATA_PREFIX environment variable.

Which path should I set?

/usr/local/share/tessdata (I could not find any .traineddata files here, but if this is the right path, can I just copy the .traineddata files into this "tessdata" folder?)

or

/home/vesper2/tesstutorial/tesseract/tessdata/best

I installed Tesseract from source, and now the version is 5.0.0.

I know this question might sound stupid, but I need to make sure, because I always get confused.
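
For what it's worth, in Tesseract 4 and later TESSDATA_PREFIX is generally expected to name the directory that directly contains the .traineddata files, so either candidate could work as long as the models are actually in it. A minimal sketch of that check follows; has_traineddata and use_tessdata_dir are hypothetical helper names, not part of Tesseract.

```shell
#!/bin/sh
# Return 0 if DIR directly contains at least one *.traineddata file.
has_traineddata() {
    for f in "$1"/*.traineddata; do
        # If the glob matched nothing, $f is the literal pattern and -f fails.
        [ -f "$f" ] && return 0
    done
    return 1
}

# Export TESSDATA_PREFIX only if the candidate directory looks usable.
use_tessdata_dir() {
    if has_traineddata "$1"; then
        TESSDATA_PREFIX=$1
        export TESSDATA_PREFIX
    else
        echo "no .traineddata files found in $1" >&2
        return 1
    fi
}
```

After something like use_tessdata_dir /home/vesper2/tesstutorial/tesseract/tessdata/best, running tesseract --list-langs should show which languages Tesseract can actually see.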

Jennil Thiyam

May 30, 2019, 1:13:31 AM

to tesser...@googlegroups.com

I added only one character, about 30 times, at the end of the original ben.training_text, which means I did not modify the original ben.training_text in any major way. Still, why am I getting "Normalization failed" for many words that are already in the original training_text?

Then I tried to create training data without any extra characters, using only the original training text, and I still got "Normalization failed" and "Stripped 1 unrenderable words". Why is this so?
