How to create training text, as provided in langdata, for a new language if I only have a wordlist?


Romil Mehla

Apr 7, 2018, 5:20:38 AM
to tesseract-ocr
Is there a program to generate it? I see that ambiguous_words.cpp generates dictionary words and ambiguous words. Where is it used? Or can it be used to build a unicharambigs file to generate rules?

ShreeDevi Kumar

unread,
Apr 7, 2018, 6:16:10 AM4/7/18
to tesser...@googlegroups.com
A word list alone is not enough to build a training text.

For Tesseract 4.0.0, the training text needs to be representative of the text to be recognized.


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2ce880b4-b750-4be9-a1a0-01f832f679df%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Romil Mehla

Apr 7, 2018, 6:32:34 AM
to tesseract-ocr
Thanks for your reply. I have read about Tesseract 4.0, and Ray mentioned how many files he used to train it, but I don't want to use Tesseract 4.0. I want to know about Tesseract 3.05.00. From my understanding, taking the eng language as an example, the eng.training_text file is built from the eng.wordlist file in langdata. For a new language, how can I build a training text from my new language's wordlist? Any idea who created the eng.training_text file? Is there a rule or algorithm for doing so, or is it randomly generated from eng.wordlist while keeping at least 10 occurrences of each character in the training text?

Please clarify this, and let me know how to generate the training_text.

ShreeDevi Kumar

Apr 7, 2018, 8:52:10 AM
to tesser...@googlegroups.com

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Romil Mehla

Apr 9, 2018, 4:46:32 AM
to tesser...@googlegroups.com
Hi Shree, thanks for replying.

This is for Tesseract 3.05.00.

I had already checked that link. It says:
"Make sure there are a minimum number of samples of each character. 10 is good, but 5 is OK for rare characters.
There should be more samples of the more frequent characters - at least 20.
Don't make the mistake of grouping all the non-letters together. Make the text more realistic"
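
The guideline quoted above can be checked mechanically against any candidate training text. The following is a minimal sketch, not part of Tesseract: the function name `character_report`, the `top_n` parameter, and the choice to treat the `top_n` most common characters as "frequent" are all assumptions made for illustration.

```python
from collections import Counter

# Thresholds taken from the guideline quoted above. How to decide which
# characters count as "frequent" is an assumption (top_n most common).
MIN_RARE = 5
MIN_GOOD = 10
MIN_FREQUENT = 20


def character_report(text, top_n=10):
    """Return {char: (count, note)} for characters below the guideline."""
    counts = Counter(ch for ch in text if not ch.isspace())
    frequent = {ch for ch, _ in counts.most_common(top_n)}
    report = {}
    for ch, n in counts.items():
        if n < MIN_RARE:
            report[ch] = (n, "fewer than 5 samples")
        elif n < MIN_GOOD:
            report[ch] = (n, "fewer than 10 samples, OK only if rare")
        elif ch in frequent and n < MIN_FREQUENT:
            report[ch] = (n, "frequent character with fewer than 20 samples")
    return report
```

Running `character_report(open("eng.training_text", encoding="utf-8").read())` on an existing training text is one way to see how closely it follows the quoted advice.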

Does this hold for langdata's eng.training_text? If yes, that means it is being generated randomly. How can a randomly generated training text ensure accuracy?
They also mention that each character should have at least 10 samples. Why, and where in the code is this criterion used? I have checked the code but could not find it anywhere. Is it related to an algorithm? If so, which one: the adaptive classifier or the shape classifier? Or is it related to the bounding box coordinates?

Please clear up my doubts and, if required, please bring in Ray or someone from the dev team, since I also have doubts about the Tesseract code.
I could not post in the tesseract-dev forum because questions are supposed to be asked only on the tesseract user list.

So how can I get a Tesseract developer to answer my question? Please tell me how.

Thanks again for your timely reply and help.




ShreeDevi Kumar

Apr 9, 2018, 5:49:55 AM
to tesser...@googlegroups.com
For Tesseract 3.05:

Random text will work. It is suggested to use character combinations similar to those in the English training text.
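
One way to produce such random text from a wordlist is sketched below. This is only an illustration and not how eng.training_text was actually produced: the function `build_training_text`, its parameters, and the strategy of always drawing a word that contains the currently rarest character are assumptions. It keeps sampling until every character in the wordlist occurs at least `min_count` times.

```python
import random
from collections import Counter


def build_training_text(words, min_count=10, words_per_line=8, seed=0):
    """Sample words until each character in the wordlist occurs at least
    min_count times, then return the result as a list of text lines."""
    rng = random.Random(seed)
    words = [w for w in words if w]
    alphabet = set("".join(words))
    counts = Counter()
    chosen = []
    while any(counts[ch] < min_count for ch in alphabet):
        # Draw a word containing the currently rarest character, so rare
        # characters reach the minimum without inflating the text too much.
        rarest = min(sorted(alphabet), key=lambda ch: counts[ch])
        word = rng.choice([w for w in words if rarest in w])
        chosen.append(word)
        counts.update(word)
    rng.shuffle(chosen)  # avoid grouping similar words together
    return [" ".join(chosen[i:i + words_per_line])
            for i in range(0, len(chosen), words_per_line)]
```

For example, `"\n".join(build_training_text(open("xyz.wordlist", encoding="utf-8").read().split()))` would give a text to save as the training text, where `xyz.wordlist` is a placeholder file name. Punctuation and digits would still need to be mixed in by hand or added to the wordlist.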

It is unlikely that you will get answers to your questions from the developers. You can search past issues and questions in the forum and on GitHub.

3.05 training does not take long, so run a few experiments for your 'language' and test.
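
To compare such experiments, one common measure is the character error rate: the edit distance between a ground-truth transcription and the OCR output, divided by the length of the ground truth. The sketch below is an illustration with hypothetical function names and is not part of Tesseract.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance divided by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)
```

Measuring this on the same set of test pages after each training run gives a repeatable number for deciding whether a change to the training text helped.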


Romil Mehla

Apr 9, 2018, 6:25:18 AM
to tesseract-ocr
Thanks, Shree. But if Tesseract is open source, why can't the developers answer questions? If I train my model on random text, how can I arrive at a reliable accuracy figure for it? My model's accuracy will then also be random.

I want to know the reason for the conditions imposed on the training text and how much they will affect my accuracy. Is there any other way I can increase my model's accuracy on my own, once I know these answers, so that random training does not give me a random model?




