normalisation failed for string error

Prabhakar Tayenjam

Feb 1, 2019, 7:28:52 AM

to tesseract-ocr

What is causing this error, and what are the possible fixes?

Normalization failed for string 'া'
Word started with a combiner:0x982
Normalization failed for string 'ং'
Word started with a combiner:0x9c1
Normalization failed for string 'ু'
Word started with a combiner:0x9c0
Normalization failed for string 'ী'
Word started with a combiner:0x9be
Normalization failed for string 'া'
Word started with a combiner:0x9cb
Normalization failed for string 'ো'
Word started with a combiner:0x9cb
Normalization failed for string 'ো'
Word started with a combiner:0x9be
Normalization failed for string 'া'
Word started with a combiner:0x9be
Normalization failed for string 'া'
Word started with a combiner:0x9be
Normalization failed for string 'া'
Word started with a combiner:0x9c0
Normalization failed for string 'ী'
Word started with a combiner:0x9be
Normalization failed for string 'া'

Shree Devi Kumar

Feb 1, 2019, 8:16:51 AM

to tesser...@googlegroups.com

Looks like two maatraas (vowel signs) together, or a maatraa followed by a Vedic accent; that does not meet the Indic normalization rules.

What training text are you using?
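
One way to locate the offending words before training is to scan the training text for words whose first code point is a combining mark, which is what the "Word started with a combiner" message reports. The following is a minimal sketch, not Tesseract's own check: it assumes words are separated by whitespace and that "combiner" corresponds to the Unicode general categories Mn, Mc and Me. The function name and the sample string are made up for illustration.

```python
import unicodedata


def words_starting_with_combiner(text):
    """Yield (line_number, word) for each whitespace-separated word whose
    first code point is a combining mark (general category Mn, Mc or Me)."""
    for line_number, line in enumerate(text.splitlines(), start=1):
        for word in line.split():
            if unicodedata.category(word[0]).startswith("M"):
                yield line_number, word


# Hypothetical sample: "ka + aa" is fine, "aa + ka" and a lone anusvara are not.
sample = "\u0995\u09be \u09be\u0995\n\u0982"
hits = list(words_starting_with_combiner(sample))
for line_number, word in hits:
    print(line_number, f"U+{ord(word[0]):04X}", repr(word))
```

Running this over the real training_text file (read with encoding="utf-8") gives the line numbers to inspect by hand.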

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/0b8830ec-533a-477f-baff-34fe1f1d1826%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Prabhakar Tayenjam

Feb 1, 2019, 8:44:47 AM

to tesseract-ocr

This happens every time I use tesstrain.sh. I use a training text that combines the default one provided in langdata (https://github.com/tesseract-ocr/langdata) with some other text collected manually.

I tried using only the default training text provided in langdata and got the same result.

I am training for Bengali.

Shree Devi Kumar

Feb 1, 2019, 9:26:18 AM

to tesser...@googlegroups.com

Use the training_text from langdata_lstm, which has the larger training text used for LSTM training (for tessdata_best and tessdata_fast).

Shree Devi Kumar

Feb 1, 2019, 9:28:42 AM

to tesser...@googlegroups.com

Please run a substitution script to clean up your training text. For example, for Hindi I use the following sed script:

s/ / /g

s/्‌ं/ं/g

s/‌्‌ृ/‌ृ/g

s/ा्/ा/g

s/ि्/ि/g

s/ी्/ी/g

s/ु्/ु/g

s/े्/े/g

s/ै्/ै/g

s/ो्/ो/g

s/ौ्/ौ/g

s/ॊ्/ॊ/g

s/ॆ्/ॆ/g

s/ॉ्/ॉ/g

s/ृ्/ृ/g

s/°//g

s/²//g

s/³//g

s/¹//g

s//ः/g

s//॑/g

s//॒/g

s/॑ः/ः॑/g

s/॒ः/ः॒/g

s/᳚ः/ः᳚/g

s/॑ं/ं॑/g

s/॒ं /ं॒/g

s/᳚ं/ं᳚/g

s/॑ँ/ँ॑/g

s/॒ँ /ँ॒/g

s/᳚ँ/ँ᳚/g

s/्ः/ः/g

s/्ं/ं/g

s/्ँ/ँ/g

s/ ः/ः/g

s/ ं/ं/g

s/ ँ/ँ/g

s/ ᳚//g

s/ःः/ः/g

s/ंं/ं/g

s/ँँ/ँ/g

s/््/्/g

s/ ॑//g

s/ ॒//g

s/ ᳚//g

s/ऽं/ऽ/g

s/ऽँ/ऽ/g

s/ऽः/ऽ/g
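
The same idea can be carried over to Bengali as an ordered list of substitutions. The sketch below is only an illustration under stated assumptions: the three rules are hypothetical Bengali analogues of a few of the Hindi rules above (vowel sign followed by virama, repeated virama, space before candrabindu/anusvara/visarga), written with explicit code point escapes so that invisible characters cannot get lost. They have not been validated against Tesseract's normalizer, so check each one against your own text.

```python
import re

VIRAMA = "\u09cd"  # BENGALI SIGN VIRAMA (hasanta)

# Hypothetical rules modelled on the Hindi sed script; applied in order.
RULES = [
    # dependent vowel sign followed by virama -> keep only the vowel sign
    (re.compile(f"([\u09be-\u09cc]){VIRAMA}"), r"\1"),
    # two or more viramas in a row -> a single virama
    (re.compile(f"{VIRAMA}{{2,}}"), VIRAMA),
    # space(s) before candrabindu, anusvara or visarga -> attach to previous letter
    (re.compile(" +([\u0981-\u0983])"), r"\1"),
]


def clean(text):
    """Apply every substitution rule to text, in the order listed in RULES."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


# ka + aa + virama -> ka + aa
print([f"U+{ord(ch):04X}" for ch in clean("\u0995\u09be\u09cd")])
```

Keeping the rules in a list makes it easy to add or drop one and re-run the cleanup until the normalization messages stop.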

On Fri, Feb 1, 2019 at 7:55 PM Prabhakar Tayenjam <ptayen...@gmail.com> wrote:

I have looked at it again closely, and I think I have found something. Please take a look and clarify.

The strings that trigger this error are the ones that contain ' ৌ', 'া', 'ী', ' ো', etc.
Normalization failed for string 'ো'
Normalization failed for string 'ৌ'
Normalization failed for string 'ী'

These characters cannot combine with the adjacent characters in the training text.
These words are from langdata (https://github.com/tesseract-ocr/langdata). I am providing screenshots of the training text.
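
To see why a vowel sign does not attach to its neighbour, it can help to list the code points of the suspect word: a space or another mark sitting between the consonant and the vowel sign then becomes visible. This is a small diagnostic sketch; the helper name and the sample string (ka, space, vowel sign o) are made up for illustration.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# Hypothetical sample: a space separates the consonant from the vowel sign.
for entry in describe("\u0995 \u09cb"):
    print(entry)
# U+0995 BENGALI LETTER KA
# U+0020 SPACE
# U+09CB BENGALI VOWEL SIGN O
```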

Prabhakar Tayenjam

Feb 1, 2019, 11:28:01 AM

to tesseract-ocr

I have run tesstrain.sh using langdata_lstm and still get the normalisation failed error. I have not done the substitutions, though.

I would like to know how this error affects the accuracy of the newly trained model.
