Dear All,
It's great news that OCRopus alpha-2 has been released, and that developers are working on several issues to improve it. At this moment, however, I feel we need to focus on integrating other scripts with OCRopus, and the first thing we need is a complete guideline for doing so.

Souro and I have been working for the past few months on integrating the Bangla language. Mark Stillwell has already made an effort on this and has shown a way to train and recognize Bangla characters. We began with his work, and after a complete analysis of the source code we now understand the procedures. I realized that we need an efficient segmentation algorithm for OCRopus to recognize Bangla script, so I shifted my focus to segmentation while Souro continued his work on understanding the techniques used in OCRopus.

In the meantime we tested Tesseract within OCRopus on Bangla characters and recognized them successfully. However, we are more interested in ocr-bpnet at this moment and want to recognize Bangla characters using bpnet. Since yesterday we have been trying to train and test bpnet on Bangla characters, but because of several problems we keep failing. Souro has already emailed several times about the problems in training and testing on isolated Bangla characters. We are still exploring all possible ways to get it done.

If we had a guideline for integrating non-Latin scripts into OCRopus, it would be much easier for us to integrate our language and test the performance. I hope Thomas Breuel will consider this issue.
ka kA ki kI ku kU kR kRR ke kE kai ko kO kau kaM kaH
ಕ ಕಾ ಕಿ ಕೀ ಕು ಕೂ ಕೃ ಕೄ ಕೆ ಕೇ ಕೈ ಕೊ ಕೋ ಕೌ ಕಂ ಕಃ
क का कि की कु कू कृ कॄ के कॆ कै को कॊ कौ कं कः
ക കാ കി കീ കു കൂ കൃ കൄ കെ കേ കൈ കൊ കോ കൗ കം കഃ
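The rows above pair the consonant ka with each dependent vowel sign, plus anusvara and visarga. As a rough sketch, rows like these can be generated from Unicode code points, relying on the fact that the Devanagari, Kannada and Malayalam blocks place ka and these marks at the same offsets from the start of each block. The names `BLOCK_BASE`, `MARK_OFFSETS` and `ka_row` are made up for this illustration and are not part of OCRopus or Tesseract.

```python
# Start of each Unicode block (taken from the Unicode code charts).
BLOCK_BASE = {"devanagari": 0x0900, "kannada": 0x0C80, "malayalam": 0x0D00}

# The consonant ka sits at the same offset in all three blocks.
KA_OFFSET = 0x15

# Offsets of the marks appended to ka, in table order:
# (none), aa, i, ii, u, uu, vocalic r, vocalic rr,
# short e, long e, ai, short o, long o, au, anusvara, visarga
MARK_OFFSETS = [None, 0x3E, 0x3F, 0x40, 0x41, 0x42, 0x43, 0x44,
                0x46, 0x47, 0x48, 0x4A, 0x4B, 0x4C, 0x02, 0x03]


def ka_row(script):
    """Return the 16 ka-syllables for one script as a list of strings."""
    base = BLOCK_BASE[script]
    ka = chr(base + KA_OFFSET)
    return [ka if off is None else ka + chr(base + off) for off in MARK_OFFSETS]


if __name__ == "__main__":
    for script in BLOCK_BASE:
        print(script, " ".join(ka_row(script)))
```

A list like this could serve as a checklist of isolated syllables to render and feed to a classifier during training, although how the training data is actually prepared is outside this sketch.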
Right... I understand (roughly) how the scripts work. The question is
how you are adapting Tesseract to work with them.
It's not the character shapes or matras that make the Indic languages
difficult, it's the ligatures and diacritics. Some scripts have few
ligatures (e.g., Tamil, Brahmi, Gurmukhi(?)), and they should not be
much harder to recognize than French or German. Likewise, Kannada or
Devanagari written "typewriter style" (with virama instead of
ligatures) should not be that hard to recognize.
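To get a feel for how much of a given text depends on ligatures, one rough approach is to count places where a consonant is joined to the next consonant by a virama, since those are the spots where a font may substitute a conjunct form. The sketch below does this for Devanagari only. It assumes that a consonant + virama + consonant triple is a reasonable proxy for a ligature candidate (whether a ligature is actually drawn depends on the font), and the function names are invented for this example.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA


def is_consonant(ch):
    """True for the basic Devanagari consonants ka..ha (U+0915..U+0939)."""
    return "\u0915" <= ch <= "\u0939"


def count_conjunct_joins(text):
    """Count consonant + virama + consonant triples in the text.

    A virama followed by anything else (end of word, ZWNJ, ...) is not
    counted, which matches the "typewriter style" case where the virama
    is shown explicitly instead of forming a ligature.
    """
    count = 0
    for i in range(len(text) - 2):
        if (is_consonant(text[i]) and text[i + 1] == VIRAMA
                and is_consonant(text[i + 2])):
            count += 1
    return count


if __name__ == "__main__":
    word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"  # kshatriya
    print(count_conjunct_joins(word))
```

Running a count like this over a sample corpus would give a first estimate of how often a recognizer without ligature support would run into trouble.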
Thanks for the inspiration. I wish OCRopus would first work on the subset of trained Nepali/Devanagari characters without considering all other complexities.
That's pretty easy to do, once we document the training procedure. But is there any practical use for a Devanagari recognizer that doesn't deal with ligatures?