How to create training text like that provided in langdata for a new language if I have just a wordlist?

Romil Mehla

Apr 7, 2018, 5:20:38 AM

to tesseract-ocr

Is there any program to generate it? I see that ambiguous_words.cpp generates dictionary words and ambiguous words. Where is it used? Or can it be used to build the unicharambigs file to generate rules?

ShreeDevi Kumar

Apr 7, 2018, 6:16:10 AM

to tesser...@googlegroups.com

Just a word list is not enough for training text.

For Tesseract 4.0.0, the training text needs to be representative of the text to be recognized.


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2ce880b4-b750-4be9-a1a0-01f832f679df%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Romil Mehla

Apr 7, 2018, 6:32:34 AM

to tesseract-ocr

Thanks for your reply. I have read about Tesseract 4.0, and Ray mentioned how he used many files to train it, but I don't want to use Tesseract 4.0; I want to know about Tesseract 3.05.00. From my understanding, taking the eng language as an example, the eng.training_text file is built from the eng.wordlist file in langdata. For a new language, how can I build a training text from my new language's wordlist? Any idea who created the eng.training_text file? Is there a rule or algorithm for doing so, or is it randomly generated from eng.wordlist while maintaining a minimum of 10 occurrences of each character in the training text?

Please clarify this and let me know how to generate the training_text.

ShreeDevi Kumar

Apr 7, 2018, 8:52:10 AM

to tesser...@googlegroups.com

see https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.03%E2%80%933.05

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Romil Mehla

Apr 9, 2018, 4:46:32 AM

to tesser...@googlegroups.com

Hi Shree, thanks for replying.

For Tesseract 3.05.00:

I had already checked that link. It says:

"Make sure there are a minimum number of samples of each character. 10 is good, but 5 is OK for rare characters.

There should be more samples of the more frequent characters - at least 20.

Don't make the mistake of grouping all the non-letters together. Make the text more realistic"
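The per-character minimums quoted above can be checked mechanically against any candidate training text. The following is only a minimal sketch, not something taken from the Tesseract code base: the function name `undersampled` and its parameters are made up for illustration, and the threshold of 10 is simply the figure from the wiki quote.

```python
from collections import Counter


def undersampled(text, required_chars=None, minimum=10):
    """Return {char: count} for characters seen fewer than `minimum` times.

    `required_chars` (hypothetical parameter) lists characters that must be
    covered even if they never occur in `text`; whitespace is ignored.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    chars = set(counts) | set(required_chars or "")
    # Counter returns 0 for characters that never occur in the text.
    return {ch: counts[ch] for ch in chars if counts[ch] < minimum}


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog " * 4
    for ch, n in sorted(undersampled(sample, minimum=10).items()):
        print(f"{ch!r}: only {n} sample(s)")
```

Running it on a draft training text lists the characters that still need more samples; a lower `minimum` (for example 5) could be used for rare characters, as the quoted guideline suggests.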

Does this hold for langdata's eng.training_text? If yes, that means it is generated randomly. How can a randomly generated training text assure accuracy?

They also say each character should have a minimum of 10 samples. Why, and where in the code is this criterion used? I have checked the code but could not find it anywhere. Is it related to an algorithm? If so, which one: the adaptive classifier or the shape classifier? Or is it related to the bounding box coordinates?

Please clear up my doubts and, if required, please pull in Ray or someone from the dev team, as I have questions about the Tesseract code as well.

I could not post in the tesseract-dev forum because questions are supposed to be asked only on the tesseract user list.

So how can I get a Tesseract developer to answer my question? Please tell me the way.

Thanks again for your timely reply and help.


ShreeDevi Kumar

Apr 9, 2018, 5:49:55 AM

to tesser...@googlegroups.com

For Tesseract 3.05, random text will work; it is suggested to use character combinations similar to those in the English training text.

It is unlikely you will get answers to your questions from the developers. You can search past issues/questions in the forum and on GitHub.

3.05 training does not take long; run a few experiments for your 'language' and test.
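For such experiments, one possible way to turn a plain wordlist into a random training text is sketched below. This is an assumption-laden illustration, not the method used to produce eng.training_text: the function `build_training_text`, its parameters, and the greedy strategy (keep drawing random words that contain the least-covered character until every character reaches the minimum) are all hypothetical choices.

```python
import random
from collections import Counter


def build_training_text(words, minimum=10, words_per_line=10, seed=0):
    """Draw random words until every character in the wordlist
    occurs at least `minimum` times, then lay the words out in lines."""
    rng = random.Random(seed)  # fixed seed so an experiment can be repeated
    alphabet = sorted({ch for w in words for ch in w})
    # For each character, the words that can supply a sample of it.
    by_char = {ch: [w for w in words if ch in w] for ch in alphabet}

    counts = Counter()
    chosen = []
    while True:
        lacking = [ch for ch in alphabet if counts[ch] < minimum]
        if not lacking:
            break
        # Serve the character with the fewest samples so far.
        target = min(lacking, key=lambda ch: counts[ch])
        word = rng.choice(by_char[target])
        chosen.append(word)
        counts.update(word)

    rng.shuffle(chosen)  # avoid grouping similar words together
    lines = [
        " ".join(chosen[i:i + words_per_line])
        for i in range(0, len(chosen), words_per_line)
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    demo_words = ["cat", "dog", "zebra", "quiz"]  # stand-in for a real wordlist
    print(build_training_text(demo_words, minimum=10))
```

Punctuation and digits would have to be present in the wordlist (or mixed in separately) for them to be covered, and the output is only a starting point: the resulting text still has to go through the training steps in the wiki page linked earlier and be evaluated on real samples.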


Romil Mehla

Apr 9, 2018, 6:25:18 AM

to tesseract-ocr

Thanks Shree, but if Tesseract is open source, why can't the developers answer questions? If I train my model on random text, how can I arrive at a reliable accuracy figure for it? My model's accuracy will then also be random.

I want to know the reason for the conditions imposed on the training text and how much they affect accuracy. Is there any other way I can improve my model's accuracy myself, knowing these answers, so that random training text does not give me a random model?
