My confusion about "Fine Tuning for ± a few characters"

易鑫

Jan 30, 2019, 4:55:23 AM
to tesseract-ocr
Hello everyone,

I am confused about "Fine Tuning for ± a few characters". The wiki (https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters) says: "Modify langdata/eng/eng.training_text to include some samples of ±."

My question is: why should we do that, and what is the eng.training_text file used for?

Here are the files in the langdata/eng folder:


-rwxrwxrwx 1 yixin yixin      249 1月  23 16:29 desired_characters
-rwxrwxrwx 1 yixin yixin     2235 1月  23 16:29 eng.numbers
-rwxrwxrwx 1 yixin yixin     6082 1月  23 16:29 eng.punc
-rwxrwxrwx 1 yixin yixin     6801 1月  23 16:29 eng.training_text
-rwxrwxrwx 1 yixin yixin    80847 1月  23 16:29 eng.training_text.bigram_freqs
-rwxrwxrwx 1 yixin yixin     1063 1月  23 16:29 eng.training_text.unigram_freqs
-rwxrwxrwx 1 yixin yixin     1058 1月  23 16:29 eng.unicharambigs
-rwxrwxrwx 1 yixin yixin 15836450 1月  23 16:29 eng.word.bigrams
-rwxrwxrwx 1 yixin yixin  3852057 1月  23 16:29 eng.wordlist

What are these files used for?

I think desired_characters corresponds to the unicharset, and I can see there are 119 different characters in desired_characters. eng.numbers corresponds to the number DAWG, eng.punc to the punctuation pattern DAWG, and eng.wordlist to the word DAWG. Am I right?

And what are the other files used for? Thank you in advance.

Sorry for my poor English.
   

Shree Devi Kumar

Jan 30, 2019, 5:46:07 AM
to tesser...@googlegroups.com
> it says "Modify langdata/eng/eng.training_text to include some samples of ±."

That is part of a training tutorial, where the goal is to add a new character, ±, to eng.traineddata so that the fine-tuned traineddata can recognize it.

It is only an example. You have to modify it based on what you need.
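
As a rough illustration of what "include some samples of ±" means in practice, here is a minimal Python sketch. The helper names (add_samples, char_count) and the sample sentences are made up for illustration and are not part of Tesseract or the tutorial. It works on in-memory strings rather than the real langdata/eng/eng.training_text file, so adapt it to your own text and characters.

```python
from collections import Counter


def add_samples(training_text: str, samples: list[str]) -> str:
    """Return the training text with the sample lines appended, one per line."""
    body = training_text.rstrip("\n")
    return "\n".join([body, *samples]) + "\n"


def char_count(text: str, ch: str) -> int:
    """Count how often a single character occurs in the text."""
    return Counter(text)[ch]


# Stand-in for the contents of a training text file (hypothetical content).
original = "The quick brown fox jumps over the lazy dog.\n"

# Hypothetical sample lines that contain the new character.
samples = ["Tolerance is 5 ± 0.1 mm.", "alpha ± beta, 3 ± 2"]

updated = add_samples(original, samples)

# Compare how often the new character appears before and after the change.
print(char_count(original, "±"), char_count(updated, "±"))
```

The point of the check is only to confirm that the new character actually occurs in the text that will be rendered for training; how many samples are enough depends on your data.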

Please read the documentation.


etc.
