Hi,

I was curious whether you were aware of any efforts to create Nepali language data for tesseract 3.01 and above. I see the 2.x test data but I can't find anything more recent.

art
---
Art Rhyno
Systems Librarian
University of Windsor
Hi Anish,

On Sat, Dec 3, 2011 at 2:04 PM, ANISH SHRESTHA <connect...@gmail.com> wrote:
> Great, that's good news!! I have been wondering how accurately it analyzes handwritten scanned documents. 100% accuracy is hardly possible to obtain, and of course many criteria affect its accuracy. Is it possible to know the current status of the project? Are we ready to jump into digitization already? How optimistic can we be about the digitization of the documents in the near future?
> Honestly, glad to hear about the progress of the project! Cheers!!

There is still a long way to go for an accurate OCR application for scanned images of printed text, let alone handwritten scanned documents. The latter entail additional challenges, as handwritten text is hardly as uniform and clean as printed text.

There are issues with the segmentation of words in Nepali. Tesseract, although a good classifier and recognition engine, does not recognize conjoined characters. Hence the challenge is to develop an accurate segmentation module. We have made some effort on this front in the past, and it would be really great if somebody could take this up and further the work. Details of the work can be found in the links that I shared earlier in this thread.

Regards,
Bal Krishna

On Sat, Dec 3, 2011 at 1:42 PM, Sushil <sush...@gmail.com> wrote:
Hi,
I am from OTRC.
Yes, we have been working on Nepali OCR.
The most difficult part for us was the segmentation of Nepali
characters so that they could be trained in the tesseract-ocr engine.
But recently we have got some good results in segmentation. So the
remaining part is training tesseract for Devanagari characters
and building a good user interface.
Maybe we can collaborate on further development.
You can reach me @ 9849038151 for more information.
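
One reason segmentation is hard for Devanagari is that a conjunct is drawn as a single shape but is stored as several Unicode code points (consonant + virama + consonant). The sketch below only illustrates that text-side fact with Python's standard `unicodedata` module; it is not the OTRC or LTK segmentation method, and the helper names are made up for this illustration.

```python
import unicodedata

# DEVANAGARI SIGN VIRAMA (halant): joins two consonants into a conjunct.
VIRAMA = "\u094d"


def describe(text):
    """Return (code point, Unicode name) pairs for every code point in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]


def count_conjunct_joins(word):
    """Count viramas that sit directly between two letters (category Lo)."""
    count = 0
    for i in range(1, len(word) - 1):
        if (word[i] == VIRAMA
                and unicodedata.category(word[i - 1]) == "Lo"
                and unicodedata.category(word[i + 1]) == "Lo"):
            count += 1
    return count


if __name__ == "__main__":
    ksha = "\u0915\u094d\u0937"  # one visual glyph, three code points
    for code_point, name in describe(ksha):
        print(code_point, name)
    print("conjunct joins:", count_conjunct_joins(ksha))
```

A recognizer that works glyph by glyph therefore has to decide whether to treat such a conjunct as one unit or to cut it into parts, which is the decision a segmentation module makes on the image side.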
> On Sat, Dec 3, 2011 at 10:25 AM, Rajesh Pandey <pandey.pan...@gmail.com> wrote:
On Dec 3, 11:38 am, ANISH SHRESTHA <connectingan...@gmail.com> wrote:
> Thank you, sir. I should get in touch with OTRC and LTK for more details. I
> very much hope it helps the digitization of government data! I really
> appreciate your help, sir.
>
> Cheers!!!
>
>> > On Dec 1, 2011 4:26 PM, "ANISH SHRESTHA" <connectingan...@gmail.com>
>
> > Yes, I worked on Nepali OCR at MPP after my thesis on it at Kathmandu
> > University. Currently OTRC and LTK are working on it. Tesseract for
> > Devanagari and SanskritOCR are some other OCRs that I know of. The accuracy
> > of SanskritOCR is fairly good; however, it produces results in German/Roman.
>
> > wrote:
>
> >> Thank you everyone! Will get back for more details!!
>
> >> Totally appreciate the help.
>
> >> On Thu, Dec 1, 2011 at 11:26 AM, Bal Krishna Bal <
> >> balkrishna7...@gmail.com> wrote:
>
> >>> Hi,
> >>> The link below lists some past efforts on the research and development
> >>> of Nepali OCR.
> >>> http://nepalinux.org/index.php?option=com_content&task=view&id=46&Ite...
>
> >>> On Thu, Dec 1, 2011 at 10:43 AM, Sagar Kshetri <connect...@gmail.com> wrote:
> >>> I think the Open Technology Resource Center (OTRC) guys were also
> >>> working on it.
> >>> http://www.otrc.gov.np/?q=projects/devanagari-ocr
>
> >>> Please feel free to contact the Language Technology Kendra (LTK,
> >>> http://ltk.org.np) if you require further information.
>
> >>> Regards,
> >>> Bal Krishna Bal
> >>> Chief Technical Officer
> >>> Language Technology Kendra
> >>> Lalitpur, PatanDhoka
> >>> Assistant Professor
> >>> Department of Computer Science and Engineering
> >>> Kathmandu University
> >>> Dhulikhel, Kavre
> >>> Nepal
>
>
> >>>> I think it is under development at MPP or KU.
> >>>> I have also heard a rumor that the project has been closed.
> >>>> Better to contact MPP or KU.
>
> >>>> On Wed, Nov 30, 2011 at 4:08 PM, ANISH SHRESTHA <
> >>>> connectingan...@gmail.com> wrote:
>
> >>>>> I have been searching for Nepali OCR and found that some research was
> >>>>> going on at NepaLinux a couple of years ago. But I could not track down
> >>>>> anything after that!!
>
> >>>>> Would be very grateful if anyone could help me on this!!
>
> >>>>> Thank you in advance.
>
> >>>>> Cheers!
>
> >>>>> --
> >>>>> Anish Shrestha
> >>>>> Mob:(+977)-9841472979
> >>>>> http://aniXification.com
> >>>>> Lalitpur, Nepal.
>
> >>>>> --
> >>>>> FOSS Nepal mailing list: foss-...@googlegroups.com
> >>>>>http://groups.google.com/group/foss-nepal
> >>>>> To unsubscribe, e-mail: foss-nepal+...@googlegroups.com
>
> >>>>> Mailing List Guidelines:
> >>>>>http://wiki.fossnepal.org/index.php?title=Mailing_List_Guidelines
> >>>>> Community website:http://www.fossnepal.org/
>
> >>>> --
> >>>> Regards
>
> >>>> ><((((º>`·.¸¸.·´¯`·.¸.·´¯`·...¸><((((º>¸.
>
> >>>> ·´¯`·.¸. , . .·´¯`·.. ><((((º>`·.¸¸.·´¯`·.¸.·´¯`·...¸><((((º>
> >>>> Mr. Sagar Kshetri (ASK?)
> >>>> Url:www.sagarkshetri.com.np
>
On Apr 20, 1:57 am, Rajesh Pandey <pandey.pan...@gmail.com> wrote:
> Hi
>
> Has anyone tried to create Nepali language data for tesseract ?
>
> I think Hindi/Sanskrit data files can also be used for tesseract.

I think it should work with the current "-l hin" option (tesseract's
hindi language traineddata)
Have YOU tried it yourself?
I got some errors, but have not played with the resolution, etc., to
try to reduce the errors.

> I don't know which place is it to discuss about this : tesseract ocr forum
> or fossnepal.
>
> Any suggestions on this ?

I don't see tesseract explicitly among the applications on fossnepal
front page (if that's any indication)

> Art is a librarian at the University of Windsor and has been working on
> using open source OCR for newspaper collections. He was asked about Nepali
> by a friend and became curious but he doesn't have a specific project for
> the language at this point. He opts for
> tesseract <http://code.google.com/p/tesseract-ocr/> for this and wants
> to use it for newspaper pages in batch.
>
> Earlier I was interested in creating a Nepali OCR but I am these days more
> into creating Nepali Translator [Hindi or English to Nepali text

You were going to write the whole engine, from scratch? Wow.
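
For anyone who wants to repeat the "-l hin" experiment on a batch of pages, here is a minimal sketch that drives the tesseract command line from Python. It assumes tesseract is on the PATH and that hin.traineddata is installed; the function names and the file name in the usage note are placeholders, not part of tesseract.

```python
import subprocess


def build_ocr_command(image_path, output_base, lang="hin"):
    """Build the command line: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="hin"):
    """Run tesseract and return the path of the text file it writes."""
    # check_call raises CalledProcessError if tesseract exits with an error.
    subprocess.check_call(build_ocr_command(image_path, output_base, lang))
    return output_base + ".txt"
```

Calling `run_ocr("nepali_page.tif", "nepali_page")` would write the recognized text to nepali_page.txt, using the Hindi traineddata on a Nepali page.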
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
OK, this looks like "standard" devanagari... not much different from
Hindi.
Hmm... my accuracy was pretty bad (see text below). But I believe the
resolution has a lot to do with it. This looks like either 300 or 150
dpi. I would try scanning at 600dpi.
Also, it just occurred to me: Even if the fonts are similar, you'd
have to create a separate, Nepali dictionary, to use that feature
optimally.
--------- my results ------------
आदृछे रुत्साश्या षाड़च्चे षाणाहर-अष्ट सक्शच्चा चप्तादृब्ब रु
जुम्लद्धआन्न ग्राणा
हो श्यच्चे च्चाश्लो बुब्लदृव्रच्चे ड्डणयंणा क्या ख्याग्लाइं झ फ्लाफ्ला
छा
ल्यश्चाढ नुप्लव्रच्चे क्लान्नाश्ले म्नहाँव्रच्चा ब्लॉ व्रच्चाग्लाई
च्चाझ्यादृव्रच्चे छा ख्वा
स्थाश्या ०१२- व्रब्लणाझं' क्षट्टक्लक्षट्टदृम्लन्न व्रहानं णादृछंश्ले
आज़ धसां छादृब्ल
व्रच्चाश्या श्लोह्म णाड़न्ना ह्माह्मरूक्रव्रश्ले छा णश्लो दृह्माव्रदृ
व्लाझप्लन्नाई
दृरूयुब्वच्चों शुदृच्चादृ व्रच्चानुच्चे ड्डेम्भह्मतुहगृ आख्या
व्रच्चाणा ग्रह; ५ म्नसादृव्रच्चा रूज
ठशाझसल्शई रुवप्लाप्रा सिध्यम्लनुच्चे ब्लणुव्रअ ब्रच्चाणा च्चिदृ ५
णाइम्लश्या हैनं हो बीते
न्नणा, यं' णादृछ क्या ख्याश्लो व्रच्चाह्म श्चाठा आणा दृहेछा
------ end my results --------
Again, this is 3.02
---------- my results -----------
दृप्राक्वछ संसारन्मा पाइज्ञे प्रश्मीइरनंआ टूपबम्भन्दा चन्नाग्ध र
बुद्धिन्मात्न प्राणी
हो । यसले अक्वफ्लो बुद्धिको उपयोग ब्वगरेर संसारन्नाई बं सत्नाएको छा
ह्नरासँचद्धरि इसको चन्नान्धीठो रांहॉका सवं प्रग़गीत्माई कट्सभाएको छा
एक
सइपरांन्मा सार:; प्रग़गीहाँ टज्ञइगत्मरज्ञइचक्वत्म चह्मर्ल आक्वछठो
छात्र पात्रों छाडी
चन्द्रछग़आ सठोत्त पाइत्मा इउप्तिसकेको छा रांप्तत्ये शयद्धत्प्त
न्माप्तिसत्माई
हृपृह्नयुको हपुरब्रबाट बत्ताठज्ञे अऋत्तलुल्या औंषणा बत्नाएप्त कि ।
संसारका सवं
न्माप्तिटूपत्माई ष्टकप्ताश सिपइरांश्वठज्ञे अज्वगुबन्म बत्नाएत्न कि ।
घऊहिरिएर हैच हो अले
न्नानंछ, शो अद-छे रास संदृपारको कति ड्डप्तलौठो प्राणी रहेछ ।
---------- end my results ------
But, essentially, it's much better to start with higher-resolution
scans.
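
To put a number on "pretty bad" versus "better" when comparing runs like the two above, one option is a character error rate against a hand-typed reference of the same page. The sketch below is a plain Levenshtein distance over Unicode code points; no reference transcription exists in this thread, so any reference text is something you would have to supply, and the function names are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def char_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)
```

Because Devanagari vowel signs and the virama are separate code points, a single wrong conjunct can count as several errors under this measure, so it is best used to compare runs on the same page rather than as an absolute score.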
On May 1, 12:51 pm, Rajesh Pandey <pandey.pan...@gmail.com> wrote:
> The hindi language tesseract data files should work. While I was working in
> 2007-2008, Hindi language data files were not available. A guy
> called debayanin tried hard to use hindi / devanagari.
> Today the hindi language data files (tessdata) are available. I haven't
> tested it. But I am sure it should work.
> The question has been answered. The Nepali language should be able to use the
> Hindi data files. It all depends on how accurate the results for Hindi
> are. If Hindi is detected flawlessly, it should work similarly with Nepali.
Except for the dictionary, as I mentioned above. The Nepali dictionary is
definitely different from the Hindi dictionary. The difference would
probably be reflected in the accuracy and/or speed. AFAIK, the
dictionary is instrumental in the algorithms. (Someone, correct me if
I'm wrong.)
The above, of course, would beg the question: Can you just swap out
the dictionary component of traineddata? I am assuming one can. (So
as not to have to retrain from scratch)
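
A possible way to try the dictionary swap is sketched below. It assumes the `combine_tessdata` and `wordlist2dawg` tools from the Tesseract 3.x training tools are installed, and that the `-u` (unpack) and `-o` (overwrite component) options behave as in their documentation; check them against your installed version. The function name, the `work` directory and the word list file name are placeholders. The sketch only builds the commands; it does not run them.

```python
def dictionary_swap_commands(lang, wordlist_path, workdir="work"):
    """Return the command lines for replacing the word-dawg of a traineddata.

    Run them on a copy of <lang>.traineddata, since the last step
    modifies the file in place.
    """
    prefix = "%s/%s." % (workdir, lang)
    traineddata = "%s.traineddata" % lang
    return [
        # 1. Unpack all components into work/<lang>.* (unicharset, dawgs, ...).
        ["combine_tessdata", "-u", traineddata, prefix],
        # 2. Compile the new word list into a DAWG, using the existing
        #    unicharset so the character ids match the trained classifier.
        ["wordlist2dawg", wordlist_path, prefix + "word-dawg",
         prefix + "unicharset"],
        # 3. Overwrite only the word-dawg component inside the traineddata.
        ["combine_tessdata", "-o", traineddata, prefix + "word-dawg"],
    ]


if __name__ == "__main__":
    for cmd in dictionary_swap_commands("hin", "nepali_words.txt"):
        print(" ".join(cmd))
```

The reason for reusing the unpacked unicharset is that words containing characters missing from it cannot be encoded, so a Nepali word list should stick to characters the Hindi data already knows. Other dictionary components, such as a frequent-word list, would need the same treatment.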
> There is a slight difference in Nepali that some characters from Hindi are
> not used. However they are in the devanagari chart. It's good for Nepali
> that Nepali does not use those characters. If it had been the reverse, we
> would have to train again to incorporate those characters.
Just out of curiosity -- what bearing does this have on Sanskrit? Are
there certain Sanskrit glyphs that are missing from the current
tesseract Hindi set?
Thanks
--
Hi Rajesh,
A couple of questions...
1) When you use the sample png file to train (to create language data files), is there a complementary text file which must be present as a UTF-8 text file for the tiff file? Getting the image file is confusing to me.
2) If there are pairs of image and text files for training, how does one name them so that the program knows what to do?
Or have I got it all wrong? I need someone to explain this as though I were a 5-year-old.
Any help is appreciated
Richard
PS I am trying to use Tesseract to create my own Fraktur German language data files to enhance ocr accuracy.
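
For reference, the Tesseract 3.x training documentation pairs each training image with a "box" file rather than a free-form text file: a UTF-8 file with one line per character giving the glyph, its bounding box and the page number. Both files share the base name `lang.fontname.expN`. The sketch below shows that naming and the `makebox` step as I understand them from the 3.x training instructions; the language code `deu` and font name `fraktur` are placeholders, and the helper names are made up for illustration.

```python
def training_pair(lang, fontname, exp=0, ext="tif"):
    """Return (image name, box file name) for one training page."""
    base = "%s.%s.exp%d" % (lang, fontname, exp)
    return base + "." + ext, base + ".box"


def makebox_command(lang, fontname, exp=0):
    """Command that asks tesseract to draft a box file for the image."""
    image, _ = training_pair(lang, fontname, exp)
    base = image.rsplit(".", 1)[0]
    return ["tesseract", image, base, "batch.nochop", "makebox"]


def parse_box_line(line):
    """Split a box line 'glyph left bottom right top page' into a tuple."""
    parts = line.split()
    return (parts[0],) + tuple(int(p) for p in parts[1:6])
```

The drafted box file usually contains recognition errors, so it has to be corrected by hand (or with a box editor) before it is used for training; the coordinates are measured from the bottom-left corner of the image.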
--
Rajesh Pandey