Guideline to integrate other scripts with OCROPUS


Hasnat

Jun 1, 2008, 6:49:54 AM
to ocr...@googlegroups.com

Dear All,

It's great news that OCROPUS alpha-2 has been released. Developers are working on several issues to improve it. However, at this moment I feel that we need to focus on the integration of other scripts with OCROPUS, and first of all we need a complete guideline for this.

We (souro and I) have been working on integrating the Bangla language for the past few months. Mark Stillwell has already put effort into this issue and has shown a way of training and recognizing Bangla characters. We began with his work, and after a complete analysis of the source code we understood the procedures. I realized that we need an efficient segmentation algorithm for OCROPUS to recognize Bangla script, so I moved my focus to segmentation while souro continued his work on understanding the techniques of OCROPUS.

In the meantime we tested Tesseract within OCROPUS to recognize Bangla characters, and that was successful. However, we are more interested in ocr-bpnet at this moment and want to recognize Bangla characters using bpnet. Since yesterday we have been trying to train and test bpnet on Bangla characters, but due to several problems we are failing again and again. Souro has already emailed several times about the problems in training and testing on isolated Bangla characters. We are still exploring all possible ways to get it done. I feel that if we had a guideline for integrating non-Latin scripts into OCROPUS, it would be much easier for us to integrate our language and test the performance. I hope Thomas Breuel will consider this issue.


Regards,
--
Hasnat
Center for Research on Bangla Language Processing (CRBLP)

Tom Breuel

Jun 1, 2008, 9:41:40 AM
to ocropus
Hi,

thanks for your response.

Getting the bpnet stuff documented and improving it are the highest
priority for the next couple of months; see http://sites.google.com/site/ocropus/roadmap

The fact that this isn't done is the main reason this release is
called "alpha2" instead of "beta".

I believe Bengali is similar to Devanagari, which means that building
a recognizer for it is quite a bit more involved than for Latin scripts.

I've started writing up some of the training issues in general, and
some of the "complex script" issues in particular on this page:
http://sites.google.com/site/ocropus/documentation/training-on-a-new-script

In short, my recommendation would be to try out different segmentation
methods and glyph sets, and then train and optimize the isolated
character MLP recognizer on those glyphs. That work will transfer to
the line recognizer once we have documented (and refactored) that code
better. In fact, we may be able to give you an adapter that lets you
use the isolated character recognizer for line recognition in the
future.

I suspect that the best glyph set for both Bengali and Devanagari may
be complete syllables (consonant clusters, vowels, diacritics), but
there are a lot of those, so there are probably tweaks and changes
needed to the code itself.
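
To get a feel for how large a syllable-level glyph set becomes, one
could count the distinct syllables in a text sample. The following is
only a rough sketch, not OCRopus code: the function name
split_syllables is made up for illustration, and the rule "a consonant
after a virama joins the previous cluster" is a simplifying assumption
that ignores ZWJ/ZWNJ and other special cases.

```python
import unicodedata
from collections import Counter

# Virama (halant) code points for Devanagari, Bengali and Kannada.
VIRAMAS = ("\u094d", "\u09cd", "\u0ccd")


def split_syllables(text):
    """Split Indic text into approximate orthographic syllables.

    Assumption: combining marks (vowel signs, anusvara, virama, ...)
    attach to the current cluster, and a letter that directly follows
    a virama continues the current consonant cluster.
    """
    clusters = []
    current = ""
    for ch in text:
        if ch.isspace():
            if current:
                clusters.append(current)
            current = ""
            continue
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        joins_previous = current.endswith(VIRAMAS)
        if current and (is_mark or joins_previous):
            current += ch
        else:
            if current:
                clusters.append(current)
            current = ch
    if current:
        clusters.append(current)
    return clusters


# "kshatriya" in Devanagari, written with escapes to keep the code ASCII-only.
word = "\u0915\u094d\u0937\u0924\u094d\u0930\u093f\u092f"
glyph_counts = Counter(split_syllables(word))
print(len(glyph_counts), "distinct syllable glyphs")
```

Running this over a large corpus instead of a single word would give
an estimate of how many classes a syllable-based recognizer has to
handle.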

Cheers,
Thomas.

Hasnat

Jun 1, 2008, 9:59:28 AM
to ocr...@googlegroups.com
Thanks for your response. The first task now is to go through the documentation and follow the instructions; we will certainly keep you updated. Since bpnet is not yet mature, I think we can concentrate on the Tesseract engine of OCROPUS (as we had success with it) and make it work for recognizing Bangla script. Once bpnet is mature enough, we can integrate Bangla script with it. If this idea is wrong, please let me know.

Tom Breuel

Jun 1, 2008, 10:31:04 AM
to ocropus


On Jun 1, 3:59 pm, Hasnat <mhas...@gmail.com> wrote:
> I think as bpnet is not matured at this moment then we can
> concentrate on tesseract engine (as we got success in that) of OCROPUS and
> make that workable to recognize Bangla script.

That is certainly useful, for several reasons: as another recognizer,
to align text and images for an initial model, for OCR engine
combination, and for other kinds of training.

Have you written up anywhere what you have done to train Tesseract on
Bengali? Basically, many of the decisions you needed to make there
(diacritics, ligatures, coding, etc.) are similar to what needs to be
done to make the bpnet recognizer work.

Why don't you add that information to our Wiki?

Cheers,
Thomas.

Hasnat

Jun 1, 2008, 10:47:43 AM
to ocr...@googlegroups.com
Actually, we (souro and I) only did preliminary testing with Tesseract on Bangla script, specifically on isolated Bangla characters, and then quickly moved to bpnet. I am currently finishing the character segmentation algorithm (which will also work for Devanagari) and thinking about how to integrate the segmentation with OCROPUS. At the beginning, when we looked at the work done for Bangla (by Mark, where training was done using pnglist, which is not in use at this moment), we thought the development of integrating bpnet was finished. We were more focused on bpnet at that time, and that's why we didn't do much with Tesseract. However, I think we should shift our focus to Tesseract and wait for bpnet, so we will now work on Tesseract as well. I will soon write up the procedures for training and testing Bangla character recognition on the wiki.

74yrs old

Jun 1, 2008, 11:31:28 AM
to ocr...@googlegroups.com
Ray has already improved the relevant source code for Kannada script in Tesseract-ocr 2.03 and is trying to improve it further to support other Indic languages as well. I am assisting Ray in the Kannada project, which is the basis for other world languages with complex scripts (i.e., combinations of consonants plus dependent vowels).
In fact, Kannada is the most complicated script among the Indic (Indian) languages.
As a sample, please see the attached bmp file and output text file, which are self-explanatory.
-sriranga(75yrsold)
4lang.bmp
4langoutput.txt

Thomas Breuel

Jun 1, 2008, 12:35:22 PM
to ocr...@googlegroups.com
Hi,

have you or Ray written up anywhere what you have been doing with Tesseract in order to support Kannada?

It looks like you trained on about 700 different ligature/vowel combinations.  Do you know what kind of text coverage you're getting with that in Kannada?
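
By "text coverage" I mean something like the following: the fraction
of glyph occurrences in running text that fall inside the trained
set. This is only an illustrative sketch; the function names and the
toy transliterated data are made up, and real numbers would have to
come from a Kannada corpus.

```python
from collections import Counter


def coverage_of_set(units, trained):
    """Fraction of unit occurrences in `units` that belong to `trained`."""
    if not units:
        return 0.0
    covered = sum(1 for unit in units if unit in trained)
    return covered / len(units)


def coverage_of_top_n(units, n):
    """Coverage obtained by training on the n most frequent distinct units."""
    counts = Counter(units)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


# Toy corpus of transliterated glyph units (hypothetical data).
corpus = ["ka", "ka", "ki", "ku", "ka", "ki", "kra"]
print(coverage_of_set(corpus, {"ka", "ki"}))
print(coverage_of_top_n(corpus, 2))
```

Plotting coverage_of_top_n against n for a real corpus would show how
quickly additional ligature/vowel combinations stop paying off.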

Cheers,
Thomas.

2008/6/1 74yrs old <withbl...@gmail.com>:
ka kA ki kI ku kU kIku kAki ke kE kaki ko kEಕಿ kaಕೂ kakA kaki
ಕ ಕಾ ಕಿ ಕೀ ಕು ಕೂ ಕೃ ಕೄ ಕೆ ಕೇ ಕೈ ಕೊ ಕೊಕಾ ಕೌ ಕಂ ಕಃ
कं का कि की कु कू कृ कॄ के कॆ कै को कॊ कौ कं कः
ക കാ കി കീ കു കൂ കൃ കകൠ ಕಾക കേ കെക ಕಾകാ ಕಾകാ ಕಾകಕಾ കം കഃ



74yrs old

Jun 1, 2008, 1:58:39 PM
to ocr...@googlegroups.com
Latin:(Phonetic  of Indic languages)

ka kA ki kI ku kU kIku kAki ke kE kaki ko kEಕಿ kaಕೂ kakA

Kannada script

ಕ ಕಾ ಕಿ ಕೀ ಕು ಕೂ ಕೃ ಕೄ ಕೆ ಕೇ ಕೈ ಕೊ ಕೊಕಾ ಕೌ ಕಂ ಕಃ

Devanagari script

कं का कि की कु कू कृ कॄ के कॆ कै को कॊ कौ कं कः

Malayalam script

ക കാ കി കീ കു കൂ കൃ കകൠ ಕಾക കേ കെക ಕಾകാ ಕಾകാ ಕಾകಕಾ കം കഃ
I know only Kannada. The other scripts were transliterated from Kannada with the help of the BarahaIME tool.

Please visit www.baraha.com for details on the phonetic key layout as well as transliteration for all Indic languages.
The BarahaIME tool is a free download; it works on the MS Windows platform only.

74yrs old

Jun 1, 2008, 2:00:18 PM
to ocr...@googlegroups.com
There are a few mistakes in the output, which have to be corrected manually.

Tom Breuel

Jun 1, 2008, 3:18:32 PM
to ocropus
Right... I understand (roughly) how the scripts work. The question is
how you are adapting Tesseract to work with them.

It's not the character shapes or matra that make the Indic languages
difficult, it's the ligatures and diacritics. Some scripts have few
ligatures (e.g., Tamil, Brahmi, Gurmukhi(?)), and they should not be
much harder to recognize than French or German. Likewise, Kannada or
Devanagari written "typewriter style" (with virama instead of
ligatures) should not be that hard to recognize.

But in most Indic scripts as used in day-to-day writing, there are
several hundred possible consonant clusters. If you take all
combinations of consonant clusters, vowels, and other diacritics, you
end up with thousands of different glyphs. I don't think Tesseract is
currently capable of handling thousands of glyphs.

So, the question is: how are you dealing with ligatures and
diacritics? Are you extending Tesseract to be trainable on thousands
of glyphs? Or are you just training on a subset? Or are you removing
the diacritics, then recognizing the consonant clusters and diacritics
separately, and then putting things back together again?
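
To illustrate the last option (this is a sketch under simplifying
assumptions, not a description of what Tesseract does): if each
syllable is split into a base consonant cluster and its trailing
marks, the recognizer needs roughly "clusters + marks" classes
instead of one class per combination. The names below are made up
for illustration, and a nukta on the last consonant ends up with the
marks here, which a real system would probably handle differently.

```python
import unicodedata


def split_base_and_marks(syllable):
    """Split a syllable into (base cluster, trailing combining marks).

    Everything up to and including the last letter is the base; the
    marks that follow it (vowel signs, anusvara, ...) are returned
    separately, so that base + marks always restores the input.
    """
    last_letter = -1
    for index, ch in enumerate(syllable):
        if unicodedata.category(ch) not in ("Mn", "Mc"):
            last_letter = index
    return syllable[: last_letter + 1], syllable[last_letter + 1 :]


def class_counts(syllables):
    """Return (joint classes, separate classes) for a list of syllables."""
    bases = set()
    marks = set()
    for syllable in syllables:
        base, mark = split_base_and_marks(syllable)
        bases.add(base)
        marks.add(mark)
    marks.discard("")  # "no mark" does not need a class of its own
    return len(set(syllables)), len(bases) + len(marks)


# ki, kii, ti, tii, tri, trii in Devanagari (ASCII escapes only).
KA, TA, RA, VIRAMA = "\u0915", "\u0924", "\u0930", "\u094d"
SHORT_I, LONG_I = "\u093f", "\u0940"
syllables = [
    KA + SHORT_I, KA + LONG_I,
    TA + SHORT_I, TA + LONG_I,
    TA + VIRAMA + RA + SHORT_I, TA + VIRAMA + RA + LONG_I,
]
print(class_counts(syllables))
```

The saving is small for six syllables, but it grows quickly once
hundreds of consonant clusters are combined with a dozen or more
vowel signs and diacritics.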

Your example doesn't contain any ligatures.

Cheers,
Thomas.

PS:

This wikipedia article contains a list of on-line transliteration
resources for Indic scripts:

http://en.wikipedia.org/wiki/Devanagari_transliteration

I find the following useful:

http://www.google.com/transliterate/indic
http://www.iit.edu/~laksvij/language/sanskrit.html
http://quillpad.in/hindi/


Rajesh Pandey

Jun 2, 2008, 3:59:25 PM
to ocr...@googlegroups.com
Hi there,
It's good that this thread has started so that more languages can be covered. I hope OCRopus will be able to recognize the Nepali language too. Since Nepali uses the Devanagari script, we have something in common.

I would like to mention that I have trained a subset of Nepali characters in tesseract. The result was astonishingly good.

Encouraged by the Tesseract results, I created a simple wrapper for Tesseract in Visual C++ .NET 2003 which removes the headline over the Devanagari characters (some people also call it matra, but we call it Dika) and then lets Tesseract recognize the image.
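
The headline removal idea can be sketched roughly as follows. This is
not the code of the Visual C++ wrapper, just a minimal Python
illustration: it assumes the word image is already binarized into
rows of 0/1 values (1 = ink), that the headline is the densest row in
the upper half of the image, and the ratio parameter is an arbitrary
value chosen for the example.

```python
def remove_headline(image, ratio=0.8):
    """Blank out headline rows of a binarized word image.

    `image` is a list of rows, each a list of 0/1 pixels (1 = ink).
    Rows in the upper half whose ink count is at least `ratio` times
    the densest upper-half row are treated as part of the headline.
    A new image is returned; the input is left unchanged.
    """
    if not image:
        return []
    row_ink = [sum(row) for row in image]
    upper_rows = max(1, len(image) // 2)
    peak = max(row_ink[:upper_rows])
    if peak == 0:
        return [list(row) for row in image]
    threshold = ratio * peak
    cleaned = []
    for y, row in enumerate(image):
        if y < upper_rows and row_ink[y] >= threshold:
            cleaned.append([0] * len(row))
        else:
            cleaned.append(list(row))
    return cleaned


# Tiny synthetic "word": row 1 is the headline joining two strokes.
word = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
for row in remove_headline(word):
    print("".join("#" if pixel else "." for pixel in row))
```

On real scans the headline is usually several pixels thick and
slightly skewed, so the threshold and the "upper half" assumption
would need tuning.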

(I'm still working on this wrapper, and also on saving an uncompressed TIFF file for Tesseract: 300 dpi, bit depth = 1, uncompressed, and so on :))


In the meantime, Mr. Nimesh from Nepal is trying his best to use OCRopus for the Nepali language.

On 6/2/08, Tom Breuel <tmb...@gmail.com> wrote:

> Right... I understand (roughly) how the scripts work.  The question is
> how you are adapting Tesseract to work with them.
>
> It's not the character shapes or matra that makes the Indic languages
> difficult, it's the ligatures and diacritics.  Some scripts have few
> ligatures (e.g., Tamil, Brahmi, Gumurkhi(?)), and they should be not
> much harder to recognize than French or German.  Likewise, Kannada or
> Devanagari written "typewriter style" (with virama instead of
> ligatures) should not be that hard to recognize.

 
Thanks for the inspiration. I wish OCRopus would first work on the subset of trained Nepali/Devanagari characters without considering all other complexities.


I hope the character segmentation algorithm that Hasnat has been developing will be useful for me too, for Nepali and the Devanagari script.



--
Regards
Rajesh Pandey

Thomas Breuel

Jun 3, 2008, 5:06:34 AM
to ocr...@googlegroups.com


>> It's not the character shapes or matra that makes the Indic languages
>> difficult, it's the ligatures and diacritics.  Some scripts have few
>> ligatures (e.g., Tamil, Brahmi, Gumurkhi(?)), and they should be not
>> much harder to recognize than French or German.  Likewise, Kannada or
>> Devanagari written "typewriter style" (with virama instead of
>> ligatures) should not be that hard to recognize.
>
> Thanks for the inspiration. I wish OCRopus would first work on the subset of trained Nepali/Devanagari characters without considering all other complexities.

That's pretty easy to do, once we document the training procedure.  But is there any practical use for a Devanagari recognizer that doesn't deal with ligatures?

Tamil might be a better target, since it's an Indic script, but they got rid of most ligatures.

Cheers,
Thomas.

Hasnat

Jun 3, 2008, 5:33:03 AM
to ocr...@googlegroups.com
> That's pretty easy to do, once we document the training procedure.  But is there any practical use for a Devanagari recognizer that doesn't deal with ligatures?
 
We definitely have to consider the ligatures. For Devanagari and Bangla the set of training units will be fairly large (around 300-350, I think). I have tried several approaches to segmenting the ligatures, and I think it's impractical because the ligatures do not share a common structure. So it is better to treat each of them as a separate unit/character.

Regards,

Rajesh Pandey

Jun 3, 2008, 11:27:18 AM
to ocr...@googlegroups.com
On 6/3/08, Thomas Breuel <tmb...@gmail.com> wrote:



>>> It's not the character shapes or matra that makes the Indic languages
>>> difficult, it's the ligatures and diacritics.  Some scripts have few
>>> ligatures (e.g., Tamil, Brahmi, Gumurkhi(?)), and they should be not
>>> much harder to recognize than French or German.  Likewise, Kannada or
>>> Devanagari written "typewriter style" (with virama instead of
>>> ligatures) should not be that hard to recognize.
>>
>> Thanks for the inspiration. I wish OCRopus would first work on the subset of trained Nepali/Devanagari characters without considering all other complexities.
>
> That's pretty easy to do, once we document the training procedure.  But is there any practical use for a Devanagari recognizer that doesn't deal with ligatures?

We are ready to do anything to get OCRopus to work with Nepali/Devanagari. We will be waiting for the documentation of the training procedure to be completed.

--
Rajesh Pandey
Madan Puraskar Pustakalaya
Patan Dhoka, Lalitpur