normalisation failed for string error


Prabhakar Tayenjam

Feb 1, 2019, 7:28:52 AM
to tesseract-ocr
What is causing this error, and what are the possible fixes?

Normalization failed for string 'া'
Word started with a combiner:0x982
Normalization failed for string 'ং'
Word started with a combiner:0x9c1
Normalization failed for string 'ু'
Word started with a combiner:0x9c0
Normalization failed for string 'ী'
Word started with a combiner:0x9be
Normalization failed for string 'া'
Word started with a combiner:0x9cb
Normalization failed for string 'ো'
Word started with a combiner:0x9cb
Normalization failed for string 'ো'
Word started with a combiner:0x9be
Normalization failed for string 'া'
Word started with a combiner:0x9be
Normalization failed for string 'া'
Word started with a combiner:0x9be
Normalization failed for string 'া'
Word started with a combiner:0x9c0
Normalization failed for string 'ী'
Word started with a combiner:0x9be
Normalization failed for string 'া'
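
For readers following the log, the hex values on the "Word started with a combiner" lines are Unicode code points. A minimal sketch, assuming only Python's standard `unicodedata` module, that looks up what each one is (the helper name `describe` is made up for illustration and is not part of Tesseract):

```python
import unicodedata

# Code points copied from the "Word started with a combiner" lines above.
code_points = [0x982, 0x9C1, 0x9C0, 0x9BE, 0x9CB]

def describe(cp):
    """Return (character, Unicode name, general category) for a code point."""
    ch = chr(cp)
    return ch, unicodedata.name(ch), unicodedata.category(ch)

for cp in code_points:
    ch, name, category = describe(cp)
    # Categories beginning with "M" are combining marks (Mn, Mc, Me).
    print(f"U+{cp:04X} {name} (category {category})")
```

All five are Bengali dependent signs, which need a base character in front of them.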

Shree Devi Kumar

Feb 1, 2019, 8:16:51 AM
to tesser...@googlegroups.com
Looks like two maatraas together, or a maatraa followed by a Vedic accent. Either case does not meet the Indic normalization rules.

What training text are you using?
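
To locate such words in a training text, something like the following might help. It is a rough sketch, not Tesseract's own validator: it assumes that "combiner" can be approximated by Unicode general categories starting with "M", and the function name `words_starting_with_combiner` is hypothetical.

```python
import unicodedata

def words_starting_with_combiner(text):
    """Yield (line_number, word) for words whose first character is a combining mark."""
    for line_number, line in enumerate(text.splitlines(), start=1):
        # str.split() with no arguments never yields empty strings,
        # so word[0] is always safe here.
        for word in line.split():
            if unicodedata.category(word[0]).startswith("M"):
                yield line_number, word

# First word is KA + vowel sign AA (fine); second word starts with the vowel sign.
sample = "\u0995\u09be \u09be\u0995"
for line_number, word in words_starting_with_combiner(sample):
    print(line_number, repr(word))
```

Running it over the real training text (read with UTF-8 encoding) would list the line numbers to inspect by hand.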

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/0b8830ec-533a-477f-baff-34fe1f1d1826%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Prabhakar Tayenjam

Feb 1, 2019, 8:44:47 AM
to tesseract-ocr
This happens every time I use tesstrain.sh. My training text combines the default text provided in langdata (https://github.com/tesseract-ocr/langdata) with some other text collected manually.
I also tried using only the default training text provided in langdata and got the same result.
I am training for Bengali.

Shree Devi Kumar

Feb 1, 2019, 9:26:18 AM
to tesser...@googlegroups.com
Use the training_text from langdata_lstm, which has the larger training text used for LSTM training (for tessdata_best and tessdata_fast).


Shree Devi Kumar

Feb 1, 2019, 9:28:42 AM
to tesser...@googlegroups.com
Please run a substitution script to clean up your training text. For example, for Hindi I use the following sed script.

s/ / /g
s/्‌ं/ं/g
s/‌्‌ृ/‌ृ/g
s/ा्/ा/g
s/ि्/ि/g
s/ी्/ी/g
s/ु्/ु/g
s/े्/े/g
s/ै्/ै/g
s/ो्/ो/g
s/ौ्/ौ/g
s/ॊ्/ॊ/g
s/ॆ्/ॆ/g
s/ॉ्/ॉ/g
s/ृ्/ृ/g
s/°//g
s/²//g
s/³//g
s/¹//g
s//ः/g
s//॑/g
s//॒/g
s/॑ः/ः॑/g
s/॒ः/ः॒/g
s/᳚ः/ः᳚/g
s/॑ं/ं॑/g
s/॒ं /ं॒/g
s/᳚ं/ं᳚/g
s/॑ँ/ँ॑/g
s/॒ँ /ँ॒/g
s/᳚ँ/ँ᳚/g
s/्ः/ः/g
s/्ं/ं/g
s/्ँ/ँ/g
s/ ः/ः/g
s/ ं/ं/g
s/ ँ/ँ/g
s/ ᳚//g
s/ःः/ः/g
s/ंं/ं/g
s/ँँ/ँ/g
s/््/्/g
s/ ॑//g
s/ ॒//g
s/ ᳚//g
s/ऽं/ऽ/g
s/ऽँ/ऽ/g
s/ऽः/ऽ/g
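
If sed is not convenient, the same idea can be sketched in Python. This is only an illustration under assumptions: it carries over four of the Devanagari rules above (written as `\u` escapes so the combining marks stay visible), and the names `RULES` and `clean` are hypothetical. For Bengali the corresponding Bengali code points would have to be substituted, and which rules are appropriate there is not settled in this thread.

```python
# (pattern, replacement) pairs, applied in order like the sed script.
RULES = [
    ("\u093e\u094d", "\u093e"),  # vowel sign AA + virama -> vowel sign AA
    ("\u094d\u094d", "\u094d"),  # doubled virama -> single virama
    ("\u0902\u0902", "\u0902"),  # doubled anusvara -> single anusvara
    ("\u0903\u0903", "\u0903"),  # doubled visarga -> single visarga
]

def clean(text, rules=RULES):
    """Apply each literal substitution to the whole text, one rule after another."""
    for old, new in rules:
        text = text.replace(old, new)
    return text

# KA + vowel sign AA + stray virama becomes KA + vowel sign AA.
print(clean("\u0915\u093e\u094d") == "\u0915\u093e")
```

Like `s/.../.../g`, `str.replace` substitutes every non-overlapping occurrence from left to right, so a single pass per rule matches the behaviour of the sed rules it copies.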



On Fri, Feb 1, 2019 at 7:55 PM Prabhakar Tayenjam <ptayen...@gmail.com> wrote:
I have looked at it again closely, and I think I have found something. Please help me clarify it.

The strings giving this error are the ones that contain ' ৌ', 'া', 'ী', ' ো', etc.
Normalization failed for string 'ো'
Normalization failed for string 'ৌ'
Normalization failed for string 'ী'

These characters cannot combine with the adjacent characters in the training text.
These words are from langdata (https://github.com/tesseract-ocr/langdata). I am providing screenshots of the training text.


Screenshot from 2019-02-02 01-19-14.png


Screenshot from 2019-02-02 01-16-44.png


Prabhakar Tayenjam

Feb 1, 2019, 11:28:01 AM
to tesseract-ocr
I have run tesstrain using langdata_lstm and still get the "Normalization failed" error. I have not done the substitutions yet, though.
I would like to know how this error affects the accuracy of the newly trained model.