Finetune 4.0 location of new punc and numbers files?


nahibi

Jan 20, 2019, 1:45:25 AM
to tesseract-ocr
Hello,

I am trying to fine-tune Tesseract 4.0 as explained here:

https://github.com/Shreeshrii/tessdata_shreetest/commit/b69b7e6ba6c7b0bd15f1b5541ac8fa5746383ad4

"- custom training text, punc and numbers files are used by updating the files in langdata/eng folder"

I do not know what I have to do with the punc and numbers files.
Do I have to create new files in the same directory as the custom training text file?
Do I have to replace the original ones in "~/tesseract-ocr/langdata_lstm/eng"?
Or something else?

Best Regards
nahibi 

Shree Devi Kumar

Jan 20, 2019, 1:56:23 AM
to tesser...@googlegroups.com
It depends on what you are fine-tuning for.

I changed the punc and numbers files so that they use only punctuation characters that are in the unicharset. For example, for a digits traineddata covering 0-9, the decimal point, comma and minus sign, I removed all other punctuation marks and kept only . , and -

Similarly, I modified the numbers file to match the expected patterns.
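
The filtering described above could be sketched roughly as follows. This is only an illustration, not the script actually used: it assumes the punc file holds one pattern per line, and the function name filter_patterns and the sample patterns are hypothetical.

```python
def filter_patterns(lines, allowed):
    """Keep only patterns whose non-whitespace characters are all in `allowed`.

    Assumption: the file holds one pattern per line. Empty lines are dropped.
    """
    allowed = set(allowed)
    kept = []
    for line in lines:
        pattern = line.rstrip("\n")
        symbols = [ch for ch in pattern if not ch.isspace()]
        if symbols and all(ch in allowed for ch in symbols):
            kept.append(pattern)
    return kept


# Hypothetical sample patterns, not the real contents of eng.punc.
sample = [".", ",", "-", "!", '".', "-.", "(", ".,"]
print(filter_patterns(sample, ".,-"))
```

With the allowed set ". , -" this keeps only the patterns built from those three characters and drops the rest.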


nahibi

Jan 20, 2019, 2:28:20 AM
to tesseract-ocr
Thank you for your quick response. So if I understand you correctly, I have to change the original files in the "~/tesseract-ocr/langdata_lstm/eng" directory.
What about the eng.wordlist file in this directory? I think it is useless for numbers only. I only want to detect numbers from 0 to 1000. Should I create my own wordlist that includes these numbers?
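
Whether a custom wordlist helps for this case is not answered in this thread. If one wanted to try it, a minimal sketch for generating the entries might look like this, assuming the wordlist format is one entry per line; the function name build_number_wordlist is hypothetical.

```python
def build_number_wordlist(low=0, high=1000):
    """Return the integers low..high (inclusive), one per line.

    Assumption: a wordlist file is plain text with one entry per line.
    """
    return "\n".join(str(n) for n in range(low, high + 1)) + "\n"


wordlist = build_number_wordlist()
# Show the first few entries; writing the text to a file is left to the caller.
print(wordlist.splitlines()[:5])
```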

nahibi

Jan 20, 2019, 3:11:56 AM
to tesseract-ocr
I replaced the original punc and numbers files in "~/tesseract-ocr/langdata_lstm/eng" and deleted all other files.
But when I check the generated eng.unicharset file in my output folder "~/tesstut/testy1/output/eng", it still contains letters.
I think this is not normal and that I am doing something wrong.
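
A check like the one described above could be scripted roughly as follows. This is a sketch under assumptions: the first line of a unicharset is taken to be the entry count, each following line is taken to start with the character itself, and the special entries listed in SPECIAL_ENTRIES are assumed to be placeholders rather than real letters. The function name and the shortened sample content are hypothetical.

```python
# Assumption: these entries are placeholders in the unicharset, not real text.
SPECIAL_ENTRIES = {"NULL", "Joined", "|Broken|0|1"}


def letters_in_unicharset(text):
    """Return the unicharset entries that contain alphabetic characters."""
    found = []
    for line in text.splitlines()[1:]:  # assumed: first line is the entry count
        if not line.strip():
            continue
        unichar = line.split()[0]  # assumed: the character is the first field
        if unichar in SPECIAL_ENTRIES:
            continue
        if any(ch.isalpha() for ch in unichar):
            found.append(unichar)
    return found


# Shortened, made-up sample; real property columns are longer.
sample = "\n".join([
    "6",
    "NULL 0 Common 0",
    "Joined 7 Latin 1",
    "0 8 Common 2",
    ". 10 Common 3",
    "a 3 Latin 4",
    "- 10 Common 5",
])
print(letters_in_unicharset(sample))
```

To run it on a real file, read the file contents into a string first and pass that string to letters_in_unicharset.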