Training Tesseract 4.0 for Nepali Language


Nirajan Pant

May 9, 2017, 3:02:56 AM
to tesseract-ocr
The trained data provided here is not giving good results on images of Nepali text documents; some lines are not recognized correctly. Can anybody help me re-train Tesseract 4.0 for the Nepali language?

ShreeDevi Kumar

May 9, 2017, 3:09:31 AM
to tesser...@googlegroups.com
Please provide a sample of the poor results, including the lines that are not recognized correctly. Images and ground truth files will be helpful.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/7761b739-6f6e-4343-9039-501f7c60782c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ShreeDevi Kumar

May 9, 2017, 3:54:39 AM
to tesser...@googlegroups.com


Nirajan Pant

May 9, 2017, 11:35:34 AM
to tesseract-ocr
Here is a sample image:


And the result is:

त्यसपछि कसरी इजरायल प्रवेश गर्यो,, घर बनायो ? जागीर खायो? उफ~~ सबै बिर्सिइयो , आफ्नै जीवनकथा सिलसिला मिलाएर
सम्हानसकोक्षमत्तापत्तिफ्लिअबउसमा|स्वारणशक्तिक्षीणहुदेंगएकोछ्दुकौंपन्निदुत्सकोस्पष्टहेक्कारह्दैन।


 


मन्दिर जाने बाटो प्रार्थनाका एक दुइ ऋचा मन्त्रहरु बाहेक उसको सम्झनामा सबै कुरा अधुरा छन दिनभरिको अधिकांश समय यिनै
कुरामा सिमित गर्दै आएको यो बुढो मान्छे संग कति खुसिका क्षणहरु होलान, कति संघर्ष वा दुखका कहानीहरु होलान ? बारम्बार
सोध्ने यत्न गर्छु, मुस्काई मात्र रहन्छ


 


आज त्यो मुस्कान पनि उसले बिर्से जस्तो छ, घरिघरि एक्लै बर्बराएको सुन्छु " हे भगवान, कति एक्लो जीवन !"


एक कप तातो कफी पिई सकेपछि बल्ल् अने मुखबाट उठेको बाफ पर पर फ्याक्दै प्रश्न गर्छ -
'म्झिचकोबिषयमाकतिलेखिइन्यग्यौत?पुस्तककहिलेतय1रहुनात्तिम्रो?"
किबुच एक प्रकारको सामुदायिक विकासको अवधारणा हो, इजरायलमा यसको उदाहरणीय र अनुकरणीय प्रयोग भएको छ |


"अहँ आधा पनि सकेको छैन, यस्ता खाले पुस्तकको हाम्रो देशमा खासै महत्व या उपयोगिता होला जस्तो पनि लाग्दैन । त्यसैले यी


अहिले त कथा पो लेखन थालेको छु, फेसबुकतिर टाँस्दिन्छु , एक दुइ जनाले पढ्छन पनि।"


 


 


मेरो नजीक आएर अन्छ उ, त्यसो भए आज के लेख्यौँ त, सुनाउन त ?





ShreeDevi Kumar

May 9, 2017, 11:41:04 AM
to tesser...@googlegroups.com

Thanks. Please provide the 'ground truth', i.e. the original, accurate text for the image.

Have you tried to OCR the same image with these options?

--oem 1 --psm 6 -l hin

Sometimes the Hindi traineddata gives better results.



ShreeDevi Kumar

May 9, 2017, 12:53:25 PM
to tesser...@googlegroups.com
Attached is the output I get with

tesseract nep_text_11.png nep_text_11 --oem 1 --psm 6 -l hin



nep_text_11.txt
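
To compare the two language models on the same image, the command above can be wrapped in a small script. This is only a sketch: it assumes the `tesseract` binary is on the PATH and that both `nep` and `hin` traineddata are installed. The helper names and the `<image stem>_<lang>` output naming are made up for illustration.

```python
import os
import subprocess


def build_tesseract_command(image, output_base, lang, oem=1, psm=6):
    """Return the argument list for one Tesseract run.

    Mirrors the command from the thread:
    tesseract <image> <output_base> --oem 1 --psm 6 -l <lang>
    """
    return ["tesseract", image, output_base,
            "--oem", str(oem), "--psm", str(psm), "-l", lang]


def run_for_languages(image, langs=("nep", "hin")):
    """Run Tesseract once per language and return the text file paths.

    The output base <image stem>_<lang> is an illustrative convention,
    not something Tesseract requires.
    """
    stem = os.path.splitext(image)[0]
    outputs = {}
    for lang in langs:
        output_base = f"{stem}_{lang}"
        command = build_tesseract_command(image, output_base, lang)
        # check=True raises CalledProcessError if Tesseract exits with an error.
        subprocess.run(command, check=True)
        outputs[lang] = output_base + ".txt"
    return outputs
```

Calling `run_for_languages("nep_text_11.png")` should leave one text file per language next to the image, so the two results can be read side by side.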

Nirajan Pant

May 10, 2017, 7:33:47 AM
to tesseract-ocr
Yes, I got the same result as you with hin.traineddata, and it is better than the result with nep.traineddata. I think the langdata needs some revision. I have attached the ground truth text for the image.


gt_nep_text_google-group_qn.txt
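
With a ground truth file available, "better" can be put into numbers as a character error rate (edit distance divided by ground truth length). The following is a minimal sketch, not an official Tesseract evaluation tool. It counts Unicode code points rather than Devanagari grapheme clusters, and it collapses whitespace so that line-break differences are not counted as errors. Both choices are assumptions that you may want to change.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalize(text):
    """NFC-normalize and collapse all whitespace runs to single spaces."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def character_error_rate(ground_truth, ocr_output):
    """Edit distance divided by the length of the normalized ground truth."""
    truth = normalize(ground_truth)
    hypothesis = normalize(ocr_output)
    if not truth:
        raise ValueError("ground truth is empty")
    return levenshtein(truth, hypothesis) / len(truth)
```

To use it, read the ground truth file and each OCR output with `open(path, encoding="utf-8").read()` and pass the strings to `character_error_rate`. The lower value indicates the traineddata that did better on this image.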

ShreeDevi Kumar

May 10, 2017, 7:40:56 AM
to tesser...@googlegroups.com
Please open an issue in the langdata repo describing any specific errors that you see for Nepali. Also take a look at the wordlist and training_text.


Nirajan Pant

May 10, 2017, 11:21:49 AM
to tesseract-ocr
Thank you @shree. Can you help me with how to generate langdata for training Tesseract 4.0?

ShreeDevi Kumar

May 10, 2017, 12:38:32 PM
to tesser...@googlegroups.com


ShreeDevi Kumar

May 10, 2017, 12:50:05 PM
to tesser...@googlegroups.com
Make a collection of Unicode Devanagari fonts (look at fonts.google.com).

Make a large training text from Nepali text.

Review and improve the wordlist for Nepali in tesseract-ocr/langdata.

I will share my modified training scripts, which use small sections of the large training text for each font.

Please note that so far my experiments have not succeeded in improving the accuracy of the Hindi traineddata.
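
The idea of using a small section of the large training text for each font could look roughly like the sketch below. This is not the modified training script mentioned above, which is not shown in this thread. The function name, the wrap-around behaviour, and the section size are assumptions made for illustration.

```python
def split_text_across_fonts(lines, fonts, lines_per_font):
    """Give each font its own consecutive section of the training text.

    Font number i gets lines_per_font lines starting at i * lines_per_font.
    If the text is shorter than needed, the selection wraps around to the
    beginning (an assumption, so that every font still gets a full section).
    """
    if lines_per_font <= 0:
        raise ValueError("lines_per_font must be positive")
    if not lines:
        raise ValueError("training text is empty")
    sections = {}
    for index, font in enumerate(fonts):
        start = (index * lines_per_font) % len(lines)
        sections[font] = [
            lines[(start + offset) % len(lines)]
            for offset in range(lines_per_font)
        ]
    return sections
```

For example, with a font list such as `["Lohit Devanagari", "Noto Sans Devanagari"]` (example names only) and `lines_per_font=1000`, the first font would be paired with lines 0 to 999 and the second with lines 1000 to 1999. Each section can then be written to its own file and rendered with the matching font by whatever training script is in use.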
