The extra character is not recognized after fine tuning training


Jennil Thiyam

May 31, 2019, 5:52:51 AM
to tesser...@googlegroups.com
I followed the procedure described in "Training Tesseract 4" for fine-tuning eng.traineddata to add the plus-minus sign, and applied it to ben.traineddata: I added one character that is not in the Bengali alphabet to ben.training_text, more than 30 times. After creating the starter training data and then running lstmtraining, the model still failed to recognize the new character, whereas in the plus-minus example the new sign is said to be recognized.
Does anyone have any suggestions?
A sample of the training text is given below:
.....
লক্ষ্যমাত্রা নির্দেশ ধ্বংস কে
দেখতে শুধু লাইব্রেরী আশা স্বাগত থাং
শতাব্দী অন্ধ্রপ্রদেশ (িপিপিপ)
সন্ধান করে অভ্যুত্থানের প্রসিদ্ধ
ময়ূরের শুরু ইন্টারেস্টিং দলের ও
পুিলেশর খ্রিস্টপূর্ব আশা প্রদর্শিত
কহীং উইকিপিডিয়াতে এ্যান্ড 19 ইঞ্চি
আছে ০ লিখতে অর্পানেট পরে এেক
ভূঁইয়ার আছে করুন, গ্লোব সেপ্টেম্বর
প্রশ্ন,
ৱু ৱূ ৱে ৱৈ ৱো ৱৌ ৱং
ৱ ৱা ৱি ৱী ৱু ৱূ ৱে ৱৈ ৱো ৱৌ ৱং
ৱ ৱা ৱি ৱী ৱু ৱূ ৱে ৱৈ ৱো ৱৌ ৱং
ৱ ৱা ৱি ৱী ৱু ৱূ ৱে ৱৈ ৱো ৱৌ ৱং
ৱ ৱা ৱি ৱী ৱু ৱূ ৱে ৱৈ ৱো ৱৌ ৱং
....
The underlined text shows the possible forms that this new character can take. Is there any rule for adding this new character to the training text?
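
To make the "more than 30 times" claim easy to verify, here is a minimal Python sketch that counts how often a character occurs in a training text and which code points follow it. The function names are hypothetical, and the sample line is copied from the training text above; nothing here is part of Tesseract itself.

```python
from collections import Counter


def count_char(text: str, char: str) -> int:
    """Return how many times the single code point `char` occurs in `text`."""
    return text.count(char)


def following_marks(text: str, char: str) -> Counter:
    """Count which code point directly follows each occurrence of `char`."""
    counts = Counter()
    for i, c in enumerate(text):
        if c == char and i + 1 < len(text):
            counts[text[i + 1]] += 1
    return counts


# One line copied from the training text quoted in this thread.
sample = "ৱ ৱা ৱি ৱী ৱু ৱূ ৱে ৱৈ ৱো ৱৌ ৱং"
new_char = sample[0]  # the newly added character, taken from the sample itself

print(count_char(sample, new_char))
print(following_marks(sample, new_char))
```

Running the same two functions on the whole ben.training_text file (read with encoding="utf-8") would show whether every vowel form of the new character appears often enough.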

Shree Devi Kumar

May 31, 2019, 6:13:48 AM
to tesser...@googlegroups.com
What is the new character you want to add?



--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJxgooeysg5AfzppAXjKpREOvH2Jnz14wksMUjhsjotMJxE3bA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

May 31, 2019, 6:16:10 AM
to tesser...@googlegroups.com

On Fri, May 31, 2019 at 3:22 PM Jennil Thiyam <thiyam...@gmail.com> wrote:

Shree Devi Kumar

May 31, 2019, 6:17:39 AM
to tesser...@googlegroups.com

Jennil Thiyam

May 31, 2019, 7:25:43 AM
to tesser...@googlegroups.com
How come this character is in here? It is not used in Bengali, and it is not recognized by the ben.traineddata model. The character is also in the unicharset that I got after running tesstrain.sh.
The character is pronounced "waa". I have attached two pictures: wa.png is a screenshot of the unicharset from the link you gave, and wa_11.png is the unicharset that I got after running tesstrain.sh (after adding this new character to ben.training_text).
The character is on line 35 (in wa.png) and line 79 (in wa_11.png).

Please help me out

wa.png
wa_11.png
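
Instead of comparing screenshots, the two unicharset files could be checked with a small script. The Python sketch below assumes the usual unicharset layout, where the first line holds the entry count and every later line starts with the unichar followed by a space; the function names are hypothetical and the demo lines use made-up property fields.

```python
def unicharset_entries(lines):
    """Return the unichar strings from the lines of a unicharset file.

    Assumes line 1 is the entry count and each later line begins with
    the unichar, followed by a space and its properties.
    """
    entries = []
    for line in lines[1:]:
        line = line.rstrip("\n")
        if line:
            entries.append(line.split(" ", 1)[0])
    return entries


def entries_containing(lines, char):
    """Return every unicharset entry that contains the code point `char`."""
    return [entry for entry in unicharset_entries(lines) if char in entry]


# In-memory demo; the property fields after each unichar are placeholders.
demo_lines = [
    "3\n",
    "NULL 0 Common 0\n",
    "\u09f1 1 0,255\n",
    "\u09f1\u09be 1 0,255\n",
]
print(entries_containing(demo_lines, "\u09f1"))
```

For a real file, the lines can be read with open(path, encoding="utf-8").readlines() and passed to entries_containing.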

Shree Devi Kumar

May 31, 2019, 7:28:51 AM
to tesser...@googlegroups.com
Please try asm.traineddata, which is for Assamese; Assamese is written in the Bengali script.

Shree Devi Kumar

May 31, 2019, 7:30:19 AM
to tesser...@googlegroups.com
script/Bengali.traineddata is another option 

Jennil Thiyam

May 31, 2019, 8:12:53 AM
to tesser...@googlegroups.com
Assamese uses some extra characters that are not used in Bengali or in our language, so I want to modify ben.traineddata instead. I tried asm.traineddata: it recognizes the character that I wanted, but it also misrecognizes other characters as Assamese characters (which are not used in Bengali or in our language). I would also like to know why ben.traineddata fails to recognize the character I asked about, even though that character is in the unicharset.

Jennil Thiyam

May 31, 2019, 1:10:58 PM
to tesser...@googlegroups.com
Shree Devi, any suggestions?

Shree Devi Kumar

May 31, 2019, 1:25:50 PM
to tesser...@googlegroups.com
Did you try script/Bengali.traineddata?

Adding a character to an Indic language, where it can form many ligatures, consonant conjuncts, and different vowel forms, is like adding many letters, so the plus-minus instructions won't work.

You will need to do replace-a-layer type training instead.
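
As a rough illustration of what such a run could look like, the Python sketch below only assembles an lstmtraining command line and prints it; it does not run anything. The flags are the ones used for layer replacement in the Tesseract 4 training tutorial. Every path is hypothetical, and the --append_index and --net_spec values are copied from the tutorial's English example, so they are assumptions that would have to be adjusted for the Bengali model and its unicharset size.

```python
import shlex


def replace_layer_command(continue_from, traineddata, train_listfile,
                          model_output, append_index, net_spec,
                          max_iterations):
    """Build (but do not run) an lstmtraining command that cuts the network
    after `append_index` and appends the layers described by `net_spec`."""
    return [
        "lstmtraining",
        "--continue_from", continue_from,
        "--traineddata", traineddata,
        "--append_index", str(append_index),
        "--net_spec", net_spec,
        "--model_output", model_output,
        "--train_listfile", train_listfile,
        "--max_iterations", str(max_iterations),
    ]


cmd = replace_layer_command(
    continue_from="train_out/ben.lstm",            # hypothetical: extracted from ben.traineddata
    traineddata="train_out/ben/ben.traineddata",   # hypothetical: starter traineddata from tesstrain.sh
    train_listfile="train_out/ben.training_files.txt",  # hypothetical path
    model_output="train_out/ben_layer",            # hypothetical path
    append_index=5,                                # value from the tutorial's eng example
    net_spec="[Lfx256 O1c111]",                    # value from the tutorial's eng example
    max_iterations=3000,
)
print(shlex.join(cmd))  # shlex.join needs Python 3.8 or newer
```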

Regarding normalization, you should look at the text to make sure that it is OK. I don't know the script, but my guess is that the vowel maatraas that go on both sides of a consonant may have been encoded as two separate characters rather than one.
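
One way to look for this kind of problem is to compare each line of the training text with its Unicode NFC form. In Unicode, the Bengali vowel sign O (U+09CB) has a canonical decomposition into vowel sign E (U+09C7) plus vowel sign AA (U+09BE), so a two-part vowel typed as two code points differs from its composed form. The Python sketch below is only a possible check with a hypothetical function name; which form Tesseract's training tools expect is not stated in this thread.

```python
import unicodedata


def lines_not_nfc(text):
    """Return (line_number, line) pairs whose NFC form differs from the line."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if unicodedata.normalize("NFC", line) != line
    ]


split_o = "\u0995\u09c7\u09be"   # KA + vowel sign E + vowel sign AA (two marks)
joined_o = "\u0995\u09cb"        # KA + vowel sign O (one mark)

# NFC composes the two-part sequence into the single vowel sign.
print(unicodedata.normalize("NFC", split_o) == joined_o)
print(lines_not_nfc(joined_o + "\n" + split_o))
```

Lines reported by lines_not_nfc are the ones worth inspecting by hand in the training text.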






Jennil Thiyam

May 31, 2019, 1:31:57 PM
to tesser...@googlegroups.com
What is this script/Bengali.traineddata?
Is it not the same as ben.traineddata?

Shree Devi Kumar

May 31, 2019, 1:47:02 PM
to tesser...@googlegroups.com

Jennil Thiyam

May 31, 2019, 1:53:34 PM
to tesser...@googlegroups.com
So, your suggestion is to perform the fine-tuning process on this Bengali.traineddata?

Jennil Thiyam

May 31, 2019, 1:53:54 PM
to tesser...@googlegroups.com
Thank you so much for all your help