The extra character is not recognized after fine-tuning

Jennil Thiyam

May 31, 2019, 5:52:51 AM

to tesser...@googlegroups.com

I followed the procedure described in the Tesseract 4 training documentation (fine tuning to add the plus-minus sign to eng.traineddata) to train ben.traineddata. I added one character that is not in the Bengali alphabet to ben.training_text, more than 30 times. After creating the starter training data and then running lstmtraining, the model failed to recognize the new character, whereas in the plus-minus example the documentation says that the plus-minus sign was recognized.

Does anyone have any suggestions?

A sample of the training_text is given below:

.....

লক্ষ্যমাত্রা নির্দেশ ধ্বংস কে
দেখতে শুধু লাইব্রেরী আশা স্বাগত থাং
শতাব্দী অন্ধ্রপ্রদেশ (িপিপিপ)
সন্ধান করে অভ্যুত্থানের প্রসিদ্ধ
ময়ূরের শুরু ইন্টারেস্টিং দলের ও
পুিলেশর খ্রিস্টপূর্ব আশা প্রদর্শিত
কহীং উইকিপিডিয়াতে এ্যান্ড 19 ইঞ্চি
আছে ০ লিখতে অর্পানেট পরে এেক
ভূঁইয়ার আছে করুন, গ্লোব সেপ্টেম্বর
প্রশ্ন,

ৱু ৱূ ৱে ৱৈ ৱো ৱৌ ৱং
ৱ ৱা ৱি ৱী ৱু ৱূ ৱে ৱৈ ৱো ৱৌ ৱং
ৱ ৱা ৱি ৱী ৱু ৱূ ৱে ৱৈ ৱো ৱৌ ৱং
ৱ ৱা ৱি ৱী ৱু ৱূ ৱে ৱৈ ৱো ৱৌ ৱং
ৱ ৱা ৱি ৱী ৱু ৱূ ৱে ৱৈ ৱো ৱৌ ৱং

....

The underlined text shows the possible forms that this new character can take. Is there any rule for adding this new character to the training text?
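
For reference, the repeated lines in the sample above can be generated and counted with a few lines of Python. This is only a sketch: it assumes the new character is U+09F1 (ৱ) and that the forms of interest are the ten dependent signs visible in the sample; `forms_line` and `count_char` are hypothetical helper names, not part of Tesseract.

```python
# Sketch only: BASE and SIGNS are assumptions read off the sample lines above.
BASE = "\u09f1"  # the "wa" letter shown in the sample
# Dependent signs in sample order: aa, i, ii, u, uu, e, ai, o, au, anusvara
SIGNS = ["\u09be", "\u09bf", "\u09c0", "\u09c1", "\u09c2",
         "\u09c7", "\u09c8", "\u09cb", "\u09cc", "\u0982"]


def forms_line(base=BASE, signs=SIGNS):
    """Return one training-text line: the bare letter, then letter + each sign."""
    forms = [base] + [base + sign for sign in signs]
    return " ".join(forms)


def count_char(text, ch=BASE):
    """Count how many times ch occurs in text (e.g. the whole training text)."""
    return text.count(ch)


line = forms_line()
print(line)
print(count_char(line))
```

Running `count_char` over the full ben.training_text would confirm how many samples of the character the trainer actually sees.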

Shree Devi Kumar

May 31, 2019, 6:13:48 AM

to tesser...@googlegroups.com

What is the new character you want to add?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJxgooeysg5AfzppAXjKpREOvH2Jnz14wksMUjhsjotMJxE3bA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

May 31, 2019, 6:16:10 AM

to tesser...@googlegroups.com

Is your new character included in

https://github.com/tesseract-ocr/langdata_lstm/blob/master/ben/ben.unicharset

Shree Devi Kumar

May 31, 2019, 6:17:39 AM

to tesser...@googlegroups.com

or in

https://github.com/tesseract-ocr/langdata_lstm/blob/master/asm/asm.unicharset
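
A quick way to answer that question for a local copy of either file is sketched below. It assumes the usual text layout of a unicharset file (an entry count on the first line, then one entry per line beginning with the unichar followed by a space); the sample string is made up for illustration and is not taken from the real ben or asm files, and `has_unichar` is a hypothetical helper name.

```python
def unichars_from_unicharset(text):
    """Return the leading unichar of every entry line (skips the count line)."""
    lines = text.splitlines()
    return [ln.split(" ", 1)[0] for ln in lines[1:] if ln.strip()]


def has_unichar(text, ch):
    """True if ch appears as a whole entry in the unicharset text."""
    return ch in unichars_from_unicharset(text)


# Made-up miniature unicharset with simplified property columns.
sample = "3\nNULL 0 Common 0\n\u0995 1 Bengali 1\n\u09f1 1 Bengali 2\n"
print(has_unichar(sample, "\u09f1"))

# For a real file (the path is an assumption):
# with open("ben.unicharset", encoding="utf-8") as f:
#     print(has_unichar(f.read(), "\u09f1"))
```

Note that this only checks for the character as a standalone entry; combinations stored as longer entries would need a substring check instead.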

Jennil Thiyam

May 31, 2019, 7:25:43 AM

to tesser...@googlegroups.com

How come this character is in here? It is not used in Bengali, and it is also not recognized by the ben.traineddata model. The character is also in the unicharset that I got after running tesstrain.sh.

The character is pronounced "waa". I attached two pictures: the first, wa.png, is a screenshot of the unicharset from the link you gave, and the second, wa_11.png, is the unicharset that I got after running tesstrain.sh (after adding this new character to ben.training_text).

The character is on line 35 (in wa.png) and line 79 (in wa_11.png).

Please help me out

wa.png

wa_11.png

Shree Devi Kumar

May 31, 2019, 7:28:51 AM

to tesser...@googlegroups.com

Please try asm.traineddata, which is for Assamese, a language written in the Bengali script.

Shree Devi Kumar

May 31, 2019, 7:30:19 AM

to tesser...@googlegroups.com

script/Bengali.traineddata is another option

Jennil Thiyam

May 31, 2019, 8:12:53 AM

to tesser...@googlegroups.com

Assamese uses some extra characters that are not used in Bengali or in our language, so I want to modify ben.traineddata instead. I tried asm.traineddata: it recognizes the character that I wanted, but it also misrecognizes other characters as Assamese-specific characters (which are not used in Bengali or in our language). I also want to know why ben.traineddata fails to recognize the character I asked about even though that character is in the unicharset.

Jennil Thiyam

May 31, 2019, 1:10:58 PM

to tesser...@googlegroups.com

Shree Devi, any suggestions?

Shree Devi Kumar

May 31, 2019, 1:25:50 PM

to tesser...@googlegroups.com

Did you try script/Bengali.traineddata?

Adding a character in Indic languages, where it can form many ligatures, consonant conjuncts and different vowel forms, is like adding many letters, so the plus-minus instructions won't work.

You will need to do "replace a layer" type training instead.
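
As a rough sketch of what such a run could look like, the snippet below only assembles an `lstmtraining` command line; it does not run it. The flag names follow the Tesseract 4 training documentation, but the `--append_index 5` and `--net_spec '[Lfx256 O1c111]'` values are copied from that documentation's English example and may not suit the ben model, and every path is a placeholder. `replace_top_layer_cmd` is a hypothetical helper name.

```python
import shlex


def replace_top_layer_cmd(lstm_model, new_traineddata, listfile, output_prefix,
                          append_index=5, net_spec="[Lfx256 O1c111]",
                          max_iterations=3000):
    """Build (not run) an lstmtraining argv that retrains the top layers.

    lstm_model is the .lstm file extracted from the base traineddata
    (for example with `combine_tessdata -e`); new_traineddata is the
    starter traineddata that contains the extended unicharset.
    """
    return [
        "lstmtraining",
        "--continue_from", lstm_model,
        "--traineddata", new_traineddata,
        "--append_index", str(append_index),
        "--net_spec", net_spec,
        "--model_output", output_prefix,
        "--train_listfile", listfile,
        "--max_iterations", str(max_iterations),
    ]


# All paths below are placeholders, not real locations.
cmd = replace_top_layer_cmd("ben.lstm", "train/ben/ben.traineddata",
                            "train/ben.training_files.txt", "output/ben_wa")
print(" ".join(shlex.quote(part) for part in cmd))
```

Because the upper layers are retrained from scratch, this kind of run typically needs far more training text and iterations than the plus-minus fine-tuning example.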

Regarding normalization, you should look at the text to make sure that it is OK. I don't know the script, but my guess is that the vowel maatraas that go on both sides of a consonant may have been encoded as separate characters rather than as one.
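
One way to check the training text for that, sketched with Python's standard `unicodedata` module: Unicode NFC normalization composes the two-part Bengali vowel signs (for example U+09C7 followed by U+09BE becomes the single sign U+09CB). This is plain Unicode normalization, which may differ from whatever normalization the Tesseract training tools apply; `nfc_report` is a hypothetical helper name.

```python
import unicodedata


def nfc_report(text):
    """Return the NFC form of text and whether normalization changed it."""
    nfc = unicodedata.normalize("NFC", text)
    return nfc, nfc != text


# The letter followed by the e-sign and aa-sign typed as two separate signs.
decomposed = "\u09f1\u09c7\u09be"
composed, changed = nfc_report(decomposed)
print([hex(ord(c)) for c in composed], changed)
```

If `changed` is true for lines of the training text, the text contains split vowel signs and could be normalized before generating training images.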

Jennil Thiyam

May 31, 2019, 1:31:57 PM

to tesser...@googlegroups.com

What is this script/Bengali.traineddata?

Is it not the same as ben.traineddata?

Shree Devi Kumar

May 31, 2019, 1:47:02 PM

to tesser...@googlegroups.com

https://github.com/tesseract-ocr/tessdata_best/tree/master/script

Jennil Thiyam

May 31, 2019, 1:53:34 PM

to tesser...@googlegroups.com

So, your suggestion is to perform the fine-tuning process on this Bengali.traineddata?

Jennil Thiyam

May 31, 2019, 1:53:54 PM

to tesser...@googlegroups.com

Thank you so much for all your help.
