See new dictionary patch:
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/89b945e39f695cb8#
Ray.
On Aug 5, 9:07 am, "74yrs old" <withblessi...@gmail.com> wrote:
> With reference to
> "You can solve the issue by adding a script which will process the generated
> output text."
> Since I am neither a programmer nor a developer, I am unable to add a
> script as you suggested.
>
> On Tue, Aug 5, 2008 at 6:01 PM, Hasnat <mhas...@gmail.com> wrote:
>
> >> On Tue, Aug 5, 2008 at 1:21 PM, 74yrs old <withblessi...@gmail.com> wrote:
>
> >> Hi,
> >> Yes - Indic scripts have dependent vowels, and as such all possible
> >> combinations of consonants plus dependent vowels (in addition to the
> >> independent vowels) must be trained.
>
> >> You have observed how dependent vowels merge with consonants for example
> >> ಕ ಾ = ಕಾ
> >> क ा = का
> >> क ि = कि
> >> ि क = क
> >> How it works:
> >> { ಕ <-- backspaced ಾ becomes ಕಾ }
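[Editor's note: at the Unicode level, the backspace behavior described above is simply logical-order combination: a dependent vowel (matra) is a combining mark, so placing it immediately after the consonant code point is all that is needed. A minimal sketch using the Devanagari pair from the example, standard library only:]

```python
# Minimal sketch: a dependent vowel (matra) is a spacing combining
# mark (Unicode category "Mc"); appending it directly after the
# consonant code point yields the combined syllable.
import unicodedata

ka = "\u0915"   # क  DEVANAGARI LETTER KA
aa = "\u093E"   # ा  DEVANAGARI VOWEL SIGN AA (dependent vowel)

syllable = ka + aa          # logical order: consonant, then matra
print(syllable)             # renders as का

# There is no precomposed form: NFC normalization keeps two code points.
print(len(syllable))                          # 2
print(unicodedata.category(aa))               # Mc (spacing combining mark)
```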
> >> On Tue, Aug 5, 2008 at 9:30 AM, Hasnat <mhas...@gmail.com> wrote:
>
> >>> Sorry for the late reply. It is really nice to discuss with you the
> >>> real issues we observed during our script-recognition experiments with
> >>> tesseract.
>
> >>> I would like to start from your first comment, where you wrote about the
> >>> effect of freq-dawg and word-dawg on Indic scripts. Can you go further and
> >>> find a way to make these two files useful? In our own OCR implementation we
> >>> are already using an external spell checker, and it helps improve the OCR
> >>> output by correcting a few misspelled words. But the real fact is that it
> >>> does not affect the recognizer while generating the output, so you cannot
> >>> rely completely on the external spell checker to correct everything. For
> >>> this reason we have to make the two dictionary files useful for us.
>
> >>> From your second comment I understand that when you increase the number
> >>> of training samples per class, the accuracy does not increase. I haven't
> >>> provided more than one training sample for any character yet. But in my
> >>> case, my experience is that if you increase the number of training
> >>> classes, the accuracy decreases.
>
> >>> I totally agree with your third comment. From my experience I can say
> >>> that it is necessary for Bangla and Devanagari to train all possible
> >>> combinations. I wrote about this on our blog; if you are interested, you
> >>> can read it at the following link:
> >>>
http://crblpocr.blogspot.com/2008/08/why-tesseract-need-to-train-all....
> >>> And the problem of properly ordering and viewing the Unicode characters
> >>> in the output text has to be solved by an additional function.
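[Editor's note: one way such an additional function could look. This is a hypothetical post-processing sketch, not Tesseract code: the i-matra ि is written before the consonant visually, so an engine emitting glyphs in visual order may produce "ि क", while Unicode requires the consonant first. The function name and its scope (only the i-matra) are assumptions for illustration:]

```python
# Hypothetical reordering pass: swap a visually-ordered pre-base
# matra (here only the Devanagari i-matra) with the consonant that
# follows it, restoring Unicode logical order.
I_MATRA = "\u093F"   # ि  DEVANAGARI VOWEL SIGN I (pre-base matra)

def fix_matra_order(text: str) -> str:
    """Move each i-matra after the consonant that follows it."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i] == I_MATRA:
            # visual order ि+क  ->  logical order क+ि
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)

print(fix_matra_order("\u093F\u0915"))  # ि क  ->  कि (\u0915\u093F)
```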
>
> >>> By variations in training I mean that we plan to train characters with
> >>> different fonts, degraded character images, etc.
>
> >>> A step-by-step procedure for installing OCRopus on Ubuntu is written on
> >>> our blog at the following link:
> >>>
http://crblpocr.blogspot.com/2008/08/how-to-install-ocropus-for-newbi...
>
> >>> I will be very happy if you and others who are experimenting with Indic
> >>> script recognition share their experiences.
>
> >>> On Wed, Jul 30, 2008 at 11:02 PM, 74yrs old <withblessi...@gmail.com> wrote:
>
> >>>> Paragraph-wise comments are noted below.
>
> >>>> On Wed, Jul 30, 2008 at 4:47 PM, Hasnat <mhas...@gmail.com> wrote:
>
> >>>>> To prepare the training data for Bangla characters I considered the
> >>>>> following:
> >>>>> - Basic characters (vowel + consonant) / units, numerals and symbols
> >>>>> - consonant + vowel modifiers
> >>>>> - consonant + consonant modifiers
> >>>>> - combined consonants (compound character)
> >>>>> - compound character + vowel modifiers
> >>>>> - compound character + consonant modifiers
>
> >>>>> Nice to know that you have tested tesseract with different combinations
> >>>>> of the dictionary files. I also did the same task for Bangla as follows:
>
> >>>>> Approach - 1: freq-dawg and word-dawg generated from a simple
> >>>>> words_list.txt (containing only the basic characters as words, no real
> >>>>> words) and frequent_words_list.txt (same as words_list)
>
> >>>>> Approach - 2: freq-dawg and word-dawg generated from a large
> >>>>> words_list.txt (~180K words) and frequent_words_list.txt (~30K words)
>
> >>>>> I observed that there is no effect on the generated output with either
> >>>>> approach. So it's impossible to observe the effect of using these
> >>>>> dictionary files.
>
> >>>> *comments*: Yes. No effect on the generated output. In fact the
> >>>> freq-dawg and word-dawg data files are generated independently - no
> >>>> connection with the other data files. Even if you use eng.freq-dawg and
> >>>> eng.word-dawg, they will not have any effect on Indic scripts. An
> >>>> external spell checker has to be used for corrections!
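[Editor's note: the external spell-checking step mentioned above can be sketched generically. This is a hypothetical illustration, not the thread participants' actual tool: `difflib` stands in for a real spell checker, and the tiny Bangla word list is made up for the example:]

```python
# Hypothetical OCR post-correction sketch: replace each output word
# with its closest match from a word list, if one is close enough.
import difflib

word_list = ["\u0995\u09b2\u09ae",          # কলম (pen)
             "\u0995\u09be\u0997\u099c",    # কাগজ (paper)
             "\u09ac\u0987"]                # বই (book)

def correct(word, vocabulary, cutoff=0.6):
    """Return the closest vocabulary word, or the input if none is close."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# A one-character OCR error is pulled back to the dictionary form;
# a word with no near match is left untouched.
print(correct("\u0995\u09b2\u09a8", word_list))
print(correct("xyz", word_list))
```

Note the limitation the author raises still holds: this runs after recognition, so it cannot influence the recognizer's own choices the way freq-dawg/word-dawg would.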
>
> >>>>> Point-4 is a bit confusing for me. However, what I understand is that
> >>>>> in your testing you got a better result when you trained with only one
> >>>>> set of characters, and the result degraded when you trained with four
> >>>>> sets of character images of the same alphabet.
>
> >>>> *Comments*: There is no confusion. Out of the four sets, three sets of
> >>>> characters are copies of the "one set of characters".
>
> >>>>> In my case of testing Bangla script, my observations are:
>
> >>>>> - Tesseract deals with the bounding boxes of segmented units during
> >>>>> recognition. Say I trained the basic characters and the vowel modifiers
> >>>>> separately. When I try to recognize a character image where the vowel
> >>>>> modifier is disconnected from the basic character but overhangs (casts
> >>>>> a "shadow" over) the basic character, then it fails to recognize it.
>
> >>>> *comments:* No surprise. I have already tested with only consonants and
> >>>> dependent vowels (vowel modifiers) trained separately, but it failed. If
> >>>> the output shows the consonant and the dependent vowel (vowel modifier)
> >>>> separately, then you have to press the "back-space" key to connect/merge
> >>>> the vowel modifier with the preceding consonant; only then will the
> >>>> vowel modifier merge with the consonant automatically, i.e. consonant
> >>>> plus dependent vowel (vowel modifier). You can test it here itself, like
> >>>> (ಕ ಾ = ಕಾ, क ा = का). It is felt that suitable code or a function
> >>>> (e.g. consonant + vowel-modifier = combined character) to merge the
> >>>> vowel modifier with the consonant is required. If available, the
> >>>> necessity of training all possible combination characters can be
> >>>> eliminated. In other words, the training procedure would be very simple
> >>>> - similar to the English alphabet. Moreover, it would benefit other
> >>>> world languages which have similar dependent vowels (vowel modifiers).
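[Editor's note: the merge function requested above can be sketched in a few lines, because in Unicode the "merge" is just concatenation in logical order. This is an illustrative assumption about how such a post-processor might work, not Tesseract code; the function name is made up:]

```python
# Hypothetical merge step: given separately recognized pieces
# (consonant, then vowel modifier), attach each spacing combining
# mark (Unicode category "Mc") to the preceding piece.
import unicodedata

def merge_pieces(pieces):
    """Combine consonant + dependent-vowel pieces into syllables."""
    out = []
    for p in pieces:
        if out and p and unicodedata.category(p[0]) == "Mc":
            out[-1] = out[-1] + p    # modifier joins the previous consonant
        else:
            out.append(p)
    return out

# Kannada example from the thread: ಕ (U+0C95) + ಾ (U+0CBE) -> ಕಾ
print(merge_pieces(["\u0C95", "\u0CBE"]))
```

This covers post-base modifiers only; pre-base matras such as ि would additionally need the reordering the thread discusses.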
>
> >>>> As such, if you are able to develop source code or a function for
> >>>> merging consonants with vowel modifiers, please post it as an issue on
> >>>> the Tesseract source-code project so that, after detailed examination
> >>>> on its merits, the developers can include it in the relevant source
> >>>> code.
>
> >>>>> So, the decision that I take is: we have to train all combinations of
> >>>>> possible characters that might appear in real documents. I would like
> >>>>> to mention here that I didn't try any variation of the training data
> >>>>> yet; for example, I didn't try the following, which I plan to do soon:
>
> >>>> *comments*: Yes, you have to - until code for merging vowel modifiers
> >>>> with consonants is available, there is no other way but to train all
> >>>> combinations of possible characters.
>
> ...
>