Nepali Tesseract OCR data files for tesseract ocr

Rajesh Pandey

Apr 20, 2012, 1:57:36 AM

to tesser...@googlegroups.com, foss-...@googlegroups.com, Art Rhyno

Hi

Has anyone tried to create Nepali language data for Tesseract?

I think the Hindi/Sanskrit data files could also be used with Tesseract for this.
I don't know which is the right place to discuss this: the tesseract-ocr forum or fossnepal.

Any suggestions?

Art is a librarian at the University of Windsor and has been working on using open source OCR for newspaper collections. He was asked about Nepali by a friend and became curious, but he doesn't have a specific project for the language at this point. He has chosen Tesseract for this and wants to run it on newspaper pages in batch.

Earlier I was interested in creating a Nepali OCR, but these days I am more into creating a Nepali translator [a Hindi or English to Nepali text translator].
I read the tesseract-ocr threads daily, but I would still call myself a noob in this regard.

====
On Tue, Apr 17, 2012 at 7:26 AM, Art W Rhyno <> wrote:

Hi,

I was curious whether you were aware of any efforts to create Nepali language data for tesseract 3.01 and above. I see the 2.x test data but I can't find anything more recent.

art

---

Art Rhyno

Systems Librarian

University of Windsor

On Sat, Dec 3, 2011 at 7:10 PM, Bal Krishna Bal <balkris...@gmail.com> wrote:

Hi Anish,

On Sat, Dec 3, 2011 at 2:04 PM, ANISH SHRESTHA <connect...@gmail.com> wrote:

Great, that's good news!! I have been wondering how accurately it analyzes handwritten scanned documents.

100% accuracy is hardly possible to obtain, and of course there are many criteria that affect its accuracy. Is it possible to know the current status of the project? Are we ready to jump into digitization already? How optimistic can we be about the digitization of documents in the near future?

Honestly, glad to hear about the progress of the project! Cheers!!

There is still a long way to go for an accurate OCR application for scanned images of printed text, let alone handwritten scanned documents; the latter entail additional challenges, as handwritten text is hardly as uniform and clean as printed text. There are issues with the segmentation of words in Nepali. Tesseract, although a good classifier or recognition engine, does not recognize conjoined characters. Hence the challenge is to develop an accurate segmentation module. We have made some effort on this front in the past, and it would be really great if somebody could take this up to further the work. Details of the work can be found in the links that I shared earlier in this thread.
Regards,
Bal Krishna

On Sat, Dec 3, 2011 at 1:42 PM, Sushil <sush...@gmail.com> wrote:

Hi,

I am from OTRC.
Yes, we have been working on Nepali OCR.
The most difficult part for us was the segmentation of Nepali
characters so that they could be trained in the tesseract-ocr engine.
But recently we have got some good results in segmentation. So the
remaining work is training tesseract for Devanagari characters
and building a good user interface.

Maybe we can collaborate on further development.
You can reach me @ 9849038151 for more information.

On Dec 3, 11:38 am, ANISH SHRESTHA <connectingan...@gmail.com> wrote:
> Thank you, sir. I should correspond with OTRC and LTK for more details. I
> very much hope it will help with digitization of the govt data! I really
> appreciate your help, sir.
>
> Cheers!!!
>

> On Sat, Dec 3, 2011 at 10:25 AM, Rajesh Pandey <pandey.pan...@gmail.com> wrote:
>
> > Yes, I worked on nepaliocr at MPP after my thesis on it at Kathmandu
> > University. Currently OTRC and LTK are working on it. Tesseract for
> > Devanagari and sanskritocr are some other OCRs that I know of. The accuracy
> > of sanskritocr is fairly good; however, it produces its results in
> > German/Roman.
>
> > On Dec 1, 2011 4:26 PM, "ANISH SHRESTHA" <connectingan...@gmail.com> wrote:
>
> >> Thank you everyone! Will get back for more details!!
>
> >> Totally appreciate the help.
>
> >> On Thu, Dec 1, 2011 at 11:26 AM, Bal Krishna Bal <balkrishna7...@gmail.com> wrote:
>
> >>> Hi,
> >>> The link below lists some past efforts on the research and development
> >>> of the Nepali OCR.
>
> >>> http://nepalinux.org/index.php?option=com_content&task=view&id=46&Ite...
>
> >>> I think the Open Technology Resource Center (OTRC) guys were also
> >>> working on it.
> >>> http://www.otrc.gov.np/?q=projects/devanagari-ocr
>
> >>> Please feel free to contact the Language Technology Kendra (LTK,
> >>> http://ltk.org.np) if you require further information.
>
> >>> Regards,
> >>> Bal Krishna Bal
> >>> Chief Technical Officer
> >>> Language Technology Kendra
> >>> Lalitpur, PatanDhoka
> >>> Assistant Professor
> >>> Department of Computer Science and Engineering
> >>> Kathmandu University
> >>> Dhulikhel, Kavre
> >>> Nepal
>

> >>> On Thu, Dec 1, 2011 at 10:43 AM, Sagar Kshetri <connect...@gmail.com> wrote:
>
> >>>> I think it is under development at MPP or KU.
> >>>> I have also heard a rumor that the project has been closed.
> >>>> Better to contact MPP or KU.
>
> >>>> On Wed, Nov 30, 2011 at 4:08 PM, ANISH SHRESTHA <connectingan...@gmail.com> wrote:
>
> >>>>> I have been searching for a Nepali OCR and found that some research on it
> >>>>> was going on at NepaLinux a couple of years ago, but I could not track
> >>>>> down anything after that!!
>
> >>>>> I would be very grateful if anyone could help me with this!!
>
> >>>>> Thank you in advance.
>
> >>>>> Cheers!
>
> >>>>> --
> >>>>> Anish Shrestha
> >>>>> Mob:(+977)-9841472979

> >>>>> connectingan...@gmail.com

> >>>>>http://aniXification.com
> >>>>> Lalitpur, Nepal.
>
> >>>>> --
> >>>>> FOSS Nepal mailing list: foss-...@googlegroups.com
> >>>>>http://groups.google.com/group/foss-nepal
> >>>>> To unsubscribe, e-mail: foss-nepal+...@googlegroups.com
>
> >>>>> Mailing List Guidelines:
> >>>>>http://wiki.fossnepal.org/index.php?title=Mailing_List_Guidelines
> >>>>> Community website:http://www.fossnepal.org/
>
> >>>> --
> >>>> Regards
>
> >>>> ><((((º>`·.¸¸.·´¯`·.¸.·´¯`·...¸><((((º>¸.
>
> >>>> ·´¯`·.¸. , . .·´¯`·.. ><((((º>`·.¸¸.·´¯`·.¸.·´¯`·...¸><((((º>
> >>>> Mr. Sagar Kshetri (ASK?)
> >>>> Url:www.sagarkshetri.com.np
>

--
Rajesh Pandey

Falke

Apr 23, 2012, 5:38:47 AM

to tesseract-ocr

On Apr 20, 1:57 am, Rajesh Pandey <pandey.pan...@gmail.com> wrote:
> Hi
>
> Has anyone tried to create Nepali language data for tesseract ?
>
> I think Hindi/Sanskrit data files can also be used for tesseract.

I think it should work with the current "-l hin" option (tesseract's
Hindi language traineddata).

Have YOU tried it yourself?

I got some errors, but have not played with the resolution, etc., to
try to reduce the errors.

> I don't know which place is it to discuss about this : tesseract ocr forum
> or fossnepal.
>

I don't see tesseract explicitly among the applications on fossnepal
front page (if that's any indication)

> Any suggestions on this ?
>
> Art is a librarian at the University of Windsor and have been working on
> using open source OCR for newspaper collections. He was asked about Nepali
> by a friend and became curious but he doesn't have a specific project for
> the language at this point. He opts
> tesseract <http://code.google.com/p/tesseract-ocr/> for this and wants
> to use it for newspaper pages in batch.
>
> Earlier I was interested in creating a Nepali OCR but I am these days more

You were going to write the whole engine, from scratch? Wow.

> into creating Nepali Translator [Hindi or English to Nepali text
> translator <http://code.google.com/p/nepaliwikipediatranslator>]

> I read tesseract-ocr threads daily but still I prefer to be called a noob
> in this regards.

Have you tried tesseract with "-l hin" on Nepali images?

Let us know your accuracy (and perhaps some idea of the resolution and
quality of your scan, etc.)
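For anyone who wants to try the "-l hin" suggestion from a script, here is a minimal sketch in Python. It only wraps the ordinary command line (tesseract <image> <output base> -l hin); the function names and the file names are placeholders chosen for illustration, not anything from this thread's setup, and nothing is run unless tesseract is actually installed.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="hin"):
    """Return the argv list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="hin"):
    """Run tesseract if it is on PATH; return the output text file name, else None."""
    if shutil.which("tesseract") is None:
        return None  # tesseract is not installed, so nothing is executed
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"  # tesseract appends .txt to the output base


# Placeholder file names, for illustration only.
print(" ".join(build_tesseract_command("nepali_page.png", "nepali_out")))
```

Running the same page at several scan resolutions and comparing the resulting text files would be one way to report the accuracy figures asked for above.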

I believe accuracy has increased with the most recent (3.02) tesseract
version.

You need to compile 3.02 from svn. Read the INSTALL.SVN, and don't
forget "make install-langs" at the end.

The correspondence you quote, I believe, predates the recent
improvements and additions in 3.02. Also, the assertion in it that
tesseract does not recognize conjoined characters is wrong.

I believe it is AND WAS wrong, in general, even before 3.02.

Conceptually, that is: I don't think tesseract is **aware** of
conjuncts, per se, as an object or algorithm -- it simply stores the
conjunct's prototype image (like any other glyph image) in its data
set, where that image is mapped to its utf8 code representation
(however many bytes that utf8 representation might take (though there
*IS* a limit)).
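To make the point about a conjunct's utf8 representation concrete, here is a small Python illustration. It takes one common Devanagari conjunct (KA + VIRAMA + SSA, which renders as a single glyph) and counts its code points and UTF-8 bytes. This only shows the text side of the mapping; it says nothing about how tesseract stores the glyph image internally.

```python
# The conjunct "ksha" renders as one glyph but is three code points:
# U+0915 (KA) + U+094D (VIRAMA) + U+0937 (SSA).
conjunct = "\u0915\u094d\u0937"


def describe(text):
    """Return (number of code points, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


code_points, utf8_bytes = describe(conjunct)
print(code_points, utf8_bytes)  # 3 code points, 9 bytes (3 bytes per code point)
```

So a single trained glyph can map to a multi-byte, multi-code-point string, which is why the length limit mentioned above matters for long conjuncts.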

As I understand it -- the challenge of improving this (and other
scripts') recognition accuracy has a lot to do with the training
(although perhaps not exclusively).

Regarding handwritten documents: that seems daunting to me, but I may
not know enough of tesseract's internals to assess how much harder
handwritten images would be, than typeset ones. I know that tesseract
makes multiple recognition passes, as it builds certain assumptions
and confidences on its first pass, to be used in the second. Well, if
hand-written documents have a certain consistency which tesseract can
algorithmify (for pass 2+) then that's a plus. But it seems it would
take very stylistically consistent hand writing, for that to take
effect.

Rajesh Pandey

Apr 26, 2012, 2:18:33 PM

to tesser...@googlegroups.com

> Earlier I was interested in creating a Nepali OCR but I am these days more

You were going to write the whole engine, from scratch? Wow.

Yes indeed. We (as a team) were creating a complete OCR. We were researching and developing a full-fledged Nepali OCR.

Some of the work is still there at code.google.com/p/nepaliocr

I haven't tried to train again. I was asking whether anyone had ever tried for Nepali because there might be some people who have had luck. If I knew that people had had luck with training, it would be worth trying. It has been nearly 3 years since I attempted to train tesseract for Nepali.

Fossnepal is a Nepali open source community group.

--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

--
Rajesh Pandey

Falke

Apr 29, 2012, 2:40:05 PM

to tesseract-ocr

On Apr 26, 2:18 pm, Rajesh Pandey <pandey.pan...@gmail.com> wrote:
> > > Earlier I was interested in creating a Nepali OCR but I am these days
> > more
>
> > You were going to write the whole engine, from scratch? Wow.
>
> Yes indeed. We(as a team) were creating a complete OCR. We
> *were* researching and developing a full fledged Nepali OCR.
>
> Some of the work is still there at code.google.com/p/nepaliocr
>
> I haven't tried to train again. I was asking if anyone had ever tried for
> Nepali because there might be some people who had luck. If I'd know that
> people had luck training, it would be worth trying it. Its nearly 3 years I
> had attempted to train tesseract for Nepali.
>
> Fossnepal is a group of Nepali Open source community.
>

If you uploaded a sample scanned image to this forum, others
(including myself) could try it with tesseract. I'm not sure how much
difference there is between font(s) in (older?) Nepali documents and
Hindi documents... While the alphabet is the same (correct me if I'm
wrong), maybe the styles (font variations) are different enough to
call for separate training (?) But I don't think it should be SO
different as to negate the following deductive statement: "If
tesseract is trainable for Hindi, it should be trainable for Nepali".
Or, IOW: At best -- you can piggyback on the Hindi training; at
worst, you'll need to train specifically for Nepali (thereby
achieving accuracy comparable to that with Hindi).

Of course, not being an expert on this, i may have to eat my words ...

Rajesh Pandey

Apr 30, 2012, 9:43:14 PM

to tesser...@googlegroups.com

Hi Falke,
Here is a sample image

--
Rajesh Pandey

Rajesh Pandey

Apr 30, 2012, 9:45:59 PM

to tesser...@googlegroups.com

Hi Falke,
Here is a sample image. I have more images that I use for testing, but they are copyrighted, so I can't post them here in public; I can email them individually.

--
Rajesh Pandey

nepaliSampleImage.png

Falke

May 1, 2012, 7:17:01 AM

to tesseract-ocr

On Apr 30, 9:45 pm, Rajesh Pandey <pandey.pan...@gmail.com> wrote:
> Hi Falke,
> Here is a sample Image. I have more images that are used for testing but
> they are copyrighted so I can't send them here in public but I can email
> them individually.
>

OK, this looks like "standard" devanagari... not much different from
Hindi.

Hmm... my accuracy was pretty bad (see text below). But I believe the
resolution has a lot to do with it. This looks like either 300 or 150
dpi. I would try scanning at 600dpi.

Also, it just occurred to me: Even if the fonts are similar, you'd
have to create a separate Nepali dictionary to use that feature
optimally.

--------- my results ------------

आदृछे रुत्साश्या षाड़च्चे षाणाहर-अष्ट सक्शच्चा चप्तादृब्ब रु
जुम्लद्धआन्न ग्राणा
हो श्यच्चे च्चाश्लो बुब्लदृव्रच्चे ड्डणयंणा क्या ख्याग्लाइं झ फ्लाफ्ला
छा
ल्यश्चाढ नुप्लव्रच्चे क्लान्नाश्ले म्नहाँव्रच्चा ब्लॉ व्रच्चाग्लाई
च्चाझ्यादृव्रच्चे छा ख्वा
स्थाश्या ०१२- व्रब्लणाझं' क्षट्टक्लक्षट्टदृम्लन्न व्रहानं णादृछंश्ले
आज़ धसां छादृब्ल
व्रच्चाश्या श्लोह्म णाड़न्ना ह्माह्मरूक्रव्रश्ले छा णश्लो दृह्माव्रदृ
व्लाझप्लन्नाई
दृरूयुब्वच्चों शुदृच्चादृ व्रच्चानुच्चे ड्डेम्भह्मतुहगृ आख्या
व्रच्चाणा ग्रह; ५ म्नसादृव्रच्चा रूज
ठशाझसल्शई रुवप्लाप्रा सिध्यम्लनुच्चे ब्लणुव्रअ ब्रच्चाणा च्चिदृ ५
णाइम्लश्या हैनं हो बीते
न्नणा, यं' णादृछ क्या ख्याश्लो व्रच्चाह्म श्चाठा आणा दृहेछा

------ end my results --------

Again, this is 3.02

Falke

May 1, 2012, 10:21:38 AM

to tesseract-ocr

I subjected your png to some pre-processing (resize, blur, threshold,
etc.) and got slightly better results:

---------- my results -----------
दृप्राक्वछ संसारन्मा पाइज्ञे प्रश्मीइरनंआ टूपबम्भन्दा चन्नाग्ध र
बुद्धिन्मात्न प्राणी
हो । यसले अक्वफ्लो बुद्धिको उपयोग ब्वगरेर संसारन्नाई बं सत्नाएको छा
ह्नरासँचद्धरि इसको चन्नान्धीठो रांहॉका सवं प्रग़गीत्माई कट्सभाएको छा
एक
सइपरांन्मा सार:; प्रग़गीहाँ टज्ञइगत्मरज्ञइचक्वत्म चह्मर्ल आक्वछठो
छात्र पात्रों छाडी
चन्द्रछग़आ सठोत्त पाइत्मा इउप्तिसकेको छा रांप्तत्ये शयद्धत्प्त
न्माप्तिसत्माई
हृपृह्नयुको हपुरब्रबाट बत्ताठज्ञे अऋत्तलुल्या औंषणा बत्नाएप्त कि ।
संसारका सवं
न्माप्तिटूपत्माई ष्टकप्ताश सिपइरांश्वठज्ञे अज्वगुबन्म बत्नाएत्न कि ।
घऊहिरिएर हैच हो अले
न्नानंछ, शो अद-छे रास संदृपारको कति ड्डप्तलौठो प्राणी रहेछ ।
---------- end my results ------

But, essentially, it's much better to start with higher-resolution
scans.
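The pre-processing steps mentioned above (resize, blur, threshold) can be sketched as follows. The post does not say which tool or parameters were used, so this is only an illustration of the three operations on a grayscale image held as a list of rows of 0-255 values; the function names, the 3x3 box blur, the nearest-neighbour resize and the cutoff of 128 are all assumptions. A real run would normally use an image library or an image editor instead.

```python
def upscale(img, factor):
    """Enlarge a grayscale image (list of rows) by an integer factor, nearest neighbour."""
    return [[px for px in row for _ in range(factor)]
            for row in img for _ in range(factor)]


def box_blur(img):
    """Replace each pixel by the mean of its 3x3 neighbourhood (clipped at the borders)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) // len(vals))
        out.append(row)
    return out


def threshold(img, cutoff=128):
    """Binarise: pixels below the cutoff become black (0), the rest white (255)."""
    return [[0 if px < cutoff else 255 for px in row] for row in img]


def preprocess(img, factor=2, cutoff=128):
    """Resize, then blur, then threshold, in that order."""
    return threshold(box_blur(upscale(img, factor)), cutoff)
```

Blurring after the resize smooths the blocky edges that nearest-neighbour scaling produces, and the final threshold turns the result back into clean black-on-white text. None of this recovers detail that a low-resolution scan never captured.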

Rajesh Pandey

May 1, 2012, 12:51:18 PM

to tesser...@googlegroups.com

Hi Falke,

Thanks for trying this out.
The Hindi language data files for tesseract should work. While I was working on this in 2007-2008, Hindi language data files were not available. A Bengali guy called debayanin tried hard to use Hindi / Devanagari.
Today the Hindi language data files (tessdata) are available. I haven't tested them, but I am sure they should work.
The question has been answered: Nepali should be able to use the Hindi data files. It all depends on how accurate the results for Hindi are. If Hindi is recognized flawlessly, it should work similarly for Nepali. There is a slight difference in that Nepali does not use some characters that Hindi uses, although they are in the Devanagari chart. It is good for Nepali that it does not use those characters; had it been the reverse, we would have to train again to incorporate them.

So everything should be fine.
Thanks for testing with the Nepali sample image. The result is not good, but I think it can be improved by digging further with the correct Hindi tessdata and the new tesseract. Thanks, everyone, for reading this.

--
Rajesh Pandey

Falke

May 1, 2012, 2:29:43 PM

to tesseract-ocr

On May 1, 12:51 pm, Rajesh Pandey <pandey.pan...@gmail.com> wrote:
> The hindi language tesseract data files should work. While I was working in
> 2007-2008, Hindi language data files were not available. A bengali guy
> called debayanin tried hard to use hindi / devanagari.
> Today the hindi language data files (tessdata) are available. I haven't
> tested it. But I am sure it should work.
> The question has been answered. Nepali Language should be able to use the
> hindi data files. It all depends on how much accurate the results for Hindi
> are. If Hindi is detected flawlessly, it should work similarly with Nepali.

Except for the dictionary, as I mentioned above. The Nepali dictionary
is definitely different from the Hindi dictionary. The difference would
probably be reflected in the accuracy and/or speed. AFAIK, the
dictionary is instrumental in the algorithms. (Someone, correct me if
I'm wrong.)

The above, of course, raises the question: Can you just swap out
the dictionary component of traineddata? I am assuming one can. (So
as not to have to retrain from scratch)

> There is a slight difference in Nepali that some characters from Hindi are
> not used. However they are in the devanagari chart. Its good for Nepali
> that Nepali does not use those characters. If it had been the reverse, we
> should train again to incorporate those characters.

Just out of curiosity -- what bearing does this have on Sanskrit? Are
there certain Sanskrit glyphs that are missing from the current
tesseract Hindi set?

Thanks

Nick White

May 2, 2012, 8:36:12 AM

to tesser...@googlegroups.com

On Tue, May 01, 2012 at 11:29:43AM -0700, Falke wrote:
> The above, of course, would beg the question: Can you just swap out
> the dictionary component of traineddata? I am assuming one can. (So
> as not to have to retrain from scratch)

You should be able to, yes, using combine_tessdata to extract,
wordlist2dawg to create new dictionary files (see the training wiki
page), and combine_tessdata to recombine the training data with the new
dictionary.
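The extract / rebuild / recombine sequence described above can be written down as three command lines. The Python sketch below only builds those command lines, it does not run them. The working directory, the word list file name and the choice to rebuild only the word-dawg component are assumptions made for illustration; check the training wiki page for the component names that apply to your tesseract version.

```python
def dictionary_swap_commands(traineddata="tessdata/hin.traineddata",
                             work_prefix="work/hin.",
                             wordlist="nepali_words.txt"):
    """Return the command lines for: extract, build a new word DAWG, recombine.

    All paths are hypothetical placeholders.
    """
    return [
        # 1. Unpack every component of the traineddata into work/hin.* files.
        ["combine_tessdata", "-u", traineddata, work_prefix],
        # 2. Build a word DAWG from a plain word list (one word per line),
        #    using the unicharset that was just extracted.
        ["wordlist2dawg", wordlist, work_prefix + "word-dawg",
         work_prefix + "unicharset"],
        # 3. Recombine all work/hin.* components into work/hin.traineddata.
        ["combine_tessdata", work_prefix],
    ]


for command in dictionary_swap_commands():
    print(" ".join(command))
```

Working on a copy in a separate directory keeps the installed hin.traineddata untouched, so results with and without the new dictionary can be compared.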

An alternative would be to just specify in a config to use a custom
dictionary file, and not use those in the existing training file.
This is explained well in the "CONFIG FILES AND AUGMENTING WITH USER
DATA" section of
http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
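For the config-file alternative, a rough sketch of what such a config might contain is shown below, generated from Python so the pieces are easy to see. The three variable names (load_system_dawg, load_freq_dawg, user_words_suffix) are the ones I recall from that section of the man page; the function name, the suffix value and the file locations in the note are illustrative assumptions, so verify them against the page linked above.

```python
def user_dictionary_config(disable_builtin=True, suffix="user-words"):
    """Return the text of a tesseract config file that loads a user word list.

    Each line is a space-separated "variable value" pair.
    """
    lines = []
    if disable_builtin:
        # Do not load the dictionaries packed inside the traineddata.
        lines.append("load_system_dawg F")
        lines.append("load_freq_dawg F")
    # Load extra words from tessdata/<lang>.<suffix>
    lines.append("user_words_suffix " + suffix)
    return "\n".join(lines) + "\n"


print(user_dictionary_config(), end="")
```

Saved as, say, tessdata/configs/nepali (a made-up name), with the word list at tessdata/hin.user-words, the config name would be given as the last argument on the tesseract command line after "-l hin".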

Nick

Rajesh Pandey

May 2, 2012, 12:25:40 PM

to tesser...@googlegroups.com

On Tue, May 1, 2012 at 11:59 PM, Falke <haw...@flight.us> wrote:

On May 1, 12:51 pm, Rajesh Pandey <pandey.pan...@gmail.com> wrote:
> The hindi language tesseract data files should work. While I was working in
> 2007-2008, Hindi language data files were not available. A guy
> called debayanin tried hard to use hindi / devanagari.
> Today the hindi language data files (tessdata) are available. I haven't
> tested it. But I am sure it should work.
> The question has been answered. Nepali Language should be able to use the
> hindi data files. It all depends on how much accurate the results for Hindi
> are. If Hindi is detected flawlessly, it should work similarly with Nepali.

Except for the dictionary, as I mentioned above. Nepali dictionary is
definitely different from Hindi dictionary.

Yes, the dictionary is a bit different. However, a lot of words are similar; in particular, the words derived from Sanskrit are mostly common to Hindi and Nepali. Nouns are approximately 80% similar, adjectives maybe 50% similar, verbs are a bit different, and the suffixes and prefixes attached to verbs, nouns and adjectives are mostly different.
So there is a chance that even the dictionary files could be used to some extent. But that's just a guess without actually trying it.
E.g., this is a list of nouns common to both languages that I compiled for a different project:
adjectives, verbs, nouns
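The overlap figures above are rough estimates, but they are easy to check once two word lists are available. A minimal sketch in Python, assuming each list is simply a collection of words; the function name is made up for this example and the sample data in the usage note is a toy, not real vocabulary.

```python
def overlap_ratio(words_a, words_b):
    """Fraction of the distinct words in words_a that also occur in words_b."""
    a, b = set(words_a), set(words_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Toy placeholders standing in for a Nepali and a Hindi noun list.
nepali_nouns = ["w1", "w2", "w3", "w4", "w5"]
hindi_nouns = ["w1", "w2", "w3", "w4", "x9"]
print(overlap_ratio(nepali_nouns, hindi_nouns))
```

Running this per word class (nouns, adjectives, verbs) on the real lists would show how much of the Hindi dictionary is likely to help with Nepali text.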

I am still a newbie, so I don't know much about the dictionary files and unicharset and should not be writing about them. Earlier, when I trained, I didn't use a dictionary; I just used zero-sized files in their place to make tesseract work.

The difference would
probably be reflected in the accuracy and/or speed. AFAIK, the
dictionary is instrumental in the algorithms. (Someone, correct me if
I'm wrong.)

The above, of course, would beg the question: Can you just swap out
the dictionary component of traineddata? I am assuming one can. (So
as not to have to retrain from scratch)

> There is a slight difference in Nepali that some characters from Hindi are
> not used. However they are in the devanagari chart. Its good for Nepali
> that Nepali does not use those characters. If it had been the reverse, we
> should train again to incorporate those characters.

Just out of curiosity -- what bearing does this have on Sanskrit? Are
there certain Sanskrit glyphs that are missing from the current
tesseract Hindi set?

Well, someone who knows Sanskrit better would know more about this.

Thanks

--
Rajesh Pandey

Rajesh Pandey

Jun 18, 2012, 12:25:49 PM

to tesser...@googlegroups.com

Hoi Richard,
Don't get confused by the image. The image was provided to give a glimpse of what a sample Nepali text looks like.
These are machine-generated images, although I also had images scanned from books (copyrighted).

I usually create sample images by going to http://ne.wikipedia.org, the Nepali-language Wikipedia, and grabbing a screenshot. That's it: you have the full text as well as the image.

Well, currently we don't have high expectations of the OCR engines, so it's too early to talk about rough, old images from old books, which need noise removal.

If you need the UTF text, here it is:
"
मान्छे संसारमा पाइने प्राणीहरुमा सबभन्दा चलाख र बुद्धिमान प्राणी हो। यसले आफ्नो बुद्धिको उपयोग गरेर संसारलाई नै सजाएको छ। त्यसैगरि उसको चलाखीले यहाँका सबै प्राणीलाई कज्याएको छ। एक समयमा अरु प्राणीझैं जङ्गलजङ्गल चहार्ने मान्छेले आज धर्ती छाडी चन्द्रमामा समेत पाइला हालिसकेको छ। यसले एकएक मानिसलाई मृत्युको मुखबाट बचाउने अमृततुल्य औषधी बनाएन कि ! संसारका सबै मानिसलाई एकसाथ सिध्याउने अणुबम बनाएन कि ! गहिरिएर हेर्ने हो भने लाग्छ, यो मान्छे यस संसारको कति अनौठो प्राणी रहेछ।
"

Translation:
Man is the most intelligent and wise animal in the world. He has used his wisdom to decorate the world. Similarly, his wisdom has tamed all the animals. The man who used to wander in the jungles has now even stepped on the moon. He has even managed to save each person from death by making ambrosial medicines, while he has also created the atom bomb, which can destroy everyone in the world. It's interesting to see how incredible humans are.

I didn't understand what you wanted. I have sent the image and its text (I typed it, so excuse any typing mistakes), and there is an English translation of that text.

Hope this helps.
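With the typed text above as ground truth, the OCR results posted earlier in the thread can be scored instead of judged by eye. A minimal sketch in Python using character-level edit distance; the function names are made up for this example. Note that it counts Unicode code points, so one mis-read conjunct can count as several errors.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def char_error_rate(ocr_text, truth):
    """Edit distance divided by the length of the ground truth."""
    if not truth:
        return 0.0
    return edit_distance(ocr_text, truth) / len(truth)
```

Calling char_error_rate(ocr_output, typed_text) on each of the two result blocks would put a number on how much the pre-processing helped.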

Cheers,

On Sun, Jun 17, 2012 at 12:33 AM, Blavatsky3 <nine.eleven.is...@gmail.com> wrote:

Hi Rajesh,
A couple of questions...
1) When you use the sample png file to train (to create language data files), is there a complementary text file which must be present as a UTF-8 text file for the tiff file? Getting the image file is confusing to me.

2) If there are pairs of image and text files for training, how does one name them so that the program knows what to do?

Or have I got it all wrong? I need someone to explain this as though I were a 5-year-old.

Any help is appreciated

Richard

PS I am trying to use Tesseract to create my own Fraktur German language data files to enhance ocr accuracy.
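On the naming question: as far as I recall from the tesseract 3.x training wiki, each training image is paired not with a plain text file but with a "box file" of the same base name, where every line gives one character and its bounding box, and the base name follows the pattern lang.fontname.expN. The Python sketch below just builds such a pair of names; the language code and font name are placeholders, and the wiki remains the authority on the exact procedure.

```python
def training_file_names(lang, font, index=0):
    """Return (image name, box file name) following lang.fontname.expN."""
    base = "{}.{}.exp{}".format(lang, font, index)
    return base + ".tif", base + ".box"


# Placeholder language code and font name, for illustration only.
print(training_file_names("nep", "myfont"))
```

The box file is usually bootstrapped by running tesseract on the image with the "batch.nochop makebox" config arguments and then corrected by hand against the known text.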

--
Rajesh Pandey
