See new dictionary patch:
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/89b945e39f695cb8#
Ray.
On Aug 5, 9:07 am, "74yrs old" <withblessi...@gmail.com> wrote:
> With reference to
> "You can solve the issue by adding a script which will process the generated
> output text."
> Since I am neither a programmer nor a developer, I am unable to add a
> script as you suggested.
>
> On Tue, Aug 5, 2008 at 6:01 PM, Hasnat <mhas...@gmail.com> wrote:
>
> >> On Tue, Aug 5, 2008 at 1:21 PM, 74yrs old <withblessi...@gmail.com> wrote:
>
> >> Hi,
> >> Yes - Indic scripts have dependent vowels, and as such all possible
> >> combinations of consonants plus dependent vowels (in addition to the
> >> independent vowels) must be trained.
>
> >> You have observed how dependent vowels merge with consonants for example
> >> ಕ ಾ = ಕಾ
> >> क ा = का
> >> क ि = कि
> >> ि क = क
> >> How it works:
> >> { ಕ <-- backspaced ಾ becomes ಕಾ }
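[Editor's note: at the Unicode level, the backspace behavior described above is simply logical-order combination: a dependent vowel (matra) is a combining mark, so placing it immediately after the consonant code point is all that is needed. A minimal sketch using the Devanagari pair from the example, standard library only:]

```python
# Minimal sketch: a dependent vowel (matra) is a spacing combining
# mark (Unicode category "Mc"); appending it directly after the
# consonant code point yields the combined syllable.
import unicodedata

ka = "\u0915"   # क  DEVANAGARI LETTER KA
aa = "\u093E"   # ा  DEVANAGARI VOWEL SIGN AA (dependent vowel)

syllable = ka + aa          # logical order: consonant, then matra
print(syllable)             # renders as का

# There is no precomposed form: NFC normalization keeps two code points.
print(len(syllable))                          # 2
print(unicodedata.category(aa))               # Mc (spacing combining mark)
```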
> >> On Tue, Aug 5, 2008 at 9:30 AM, Hasnat <mhas...@gmail.com> wrote:
>
> >>> Sorry for the late reply. It is really nice to discuss with you the
> >>> real issues we observed during our script-recognition experiments with
> >>> tesseract.
>
> >>> I would like to start from your first comment, where you wrote about the
> >>> effect of freq-dawg and word-dawg on Indic scripts. Can you go further and
> >>> find a way to make these two files useful? In our own OCR implementation we
> >>> are already using an external spell checker, and it helps improve the OCR
> >>> output by correcting a few misspelled words. But the real fact is that it
> >>> does not affect the recognizer while generating the output, so you cannot
> >>> rely completely on the external spell checker to correct everything. For
> >>> this reason we have to make the two dictionary files useful for us.
>
> >>> From your second comment I understand that when you increase the number
> >>> of training samples per class, the accuracy does not increase. I haven't
> >>> provided more than one training sample for any character yet. But in my
> >>> case, my experience is that if you increase the number of training
> >>> classes, the accuracy decreases.
>
> >>> I totally agree with your third comment. From my experience I can say
> >>> that it is necessary for Bangla and Devanagari to train all possible
> >>> combinations. I wrote about this on our blog; if you are interested, you
> >>> can read it at the following link:
> >>>
http://crblpocr.blogspot.com/2008/08/why-tesseract-need-to-train-all....
> >>> And the problem of properly ordering and viewing the Unicode characters
> >>> in the output text has to be solved by an additional function.
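[Editor's note: one way such an additional function could look. This is a hypothetical post-processing sketch, not Tesseract code: the i-matra ि is written before the consonant visually, so an engine emitting glyphs in visual order may produce "ि क", while Unicode requires the consonant first. The function name and its scope (only the i-matra) are assumptions for illustration:]

```python
# Hypothetical reordering pass: swap a visually-ordered pre-base
# matra (here only the Devanagari i-matra) with the consonant that
# follows it, restoring Unicode logical order.
I_MATRA = "\u093F"   # ि  DEVANAGARI VOWEL SIGN I (pre-base matra)

def fix_matra_order(text: str) -> str:
    """Move each i-matra after the consonant that follows it."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i] == I_MATRA:
            # visual order ि+क  ->  logical order क+ि
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)

print(fix_matra_order("\u093F\u0915"))  # ि क  ->  कि (\u0915\u093F)
```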
>
> >>> By variations in training I mean that we plan to train characters with
> >>> different fonts, degraded character images, etc.
>
> >>> A step-by-step procedure for installing OCRopus on Ubuntu is written on
> >>> our blog at the following link:
> >>>
http://crblpocr.blogspot.com/2008/08/how-to-install-ocropus-for-newbi...
>
> >>> I will be very happy if you and others who are experimenting with Indic
> >>> script recognition share their experiences.
>
> >>> On Wed, Jul 30, 2008 at 11:02 PM, 74yrs old <withblessi...@gmail.com> wrote:
>
> >>>> Paragraph-wise comments are noted below.
>
> >>>> On Wed, Jul 30, 2008 at 4:47 PM, Hasnat <mhas...@gmail.com> wrote:
>
> >>>>> To prepare the training data for Bangla characters I considered the
> >>>>> following:
> >>>>> - Basic characters (vowel + consonant) / units, numerals and symbols
> >>>>> - consonant + vowel modifiers
> >>>>> - consonant + consonant modifiers
> >>>>> - combined consonants (compound character)
> >>>>> - compound character + vowel modifiers
> >>>>> - compound character + consonant modifiers
>
> >>>>> Nice to know that you have tested tesseract with different combinations
> >>>>> of the dictionary files. I also did the same task for Bangla as follows:
>
> >>>>> Approach - 1: freq-dawg and word-dawg generated from a simple
> >>>>> words_list.txt (containing only the basic characters as words, no real
> >>>>> words) and frequent_words_list.txt (same as words_list)
>
> >>>>> Approach - 2: freq-dawg and word-dawg generated from a large
> >>>>> words_list.txt (~180K words) and frequent_words_list.txt (~30K words)
>
> >>>>> I observed that there is no effect on the generated output with either
> >>>>> approach. So it's impossible to observe the effect of using these
> >>>>> dictionary files.
>
> >>>> *comments*: Yes. No effect on the generated output. In fact the
> >>>> freq-dawg and word-dawg data files are generated independently - no
> >>>> connection with the other data files. Even if you use eng.freq-dawg and
> >>>> eng.word-dawg, they will not have any effect on Indic scripts. An
> >>>> external spell checker has to be used for corrections!
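[Editor's note: the external spell-checking step mentioned above can be sketched generically. This is a hypothetical illustration, not the thread participants' actual tool: `difflib` stands in for a real spell checker, and the tiny Bangla word list is made up for the example:]

```python
# Hypothetical OCR post-correction sketch: replace each output word
# with its closest match from a word list, if one is close enough.
import difflib

word_list = ["\u0995\u09b2\u09ae",          # কলম (pen)
             "\u0995\u09be\u0997\u099c",    # কাগজ (paper)
             "\u09ac\u0987"]                # বই (book)

def correct(word, vocabulary, cutoff=0.6):
    """Return the closest vocabulary word, or the input if none is close."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# A one-character OCR error is pulled back to the dictionary form;
# a word with no near match is left untouched.
print(correct("\u0995\u09b2\u09a8", word_list))
print(correct("xyz", word_list))
```

Note the limitation the author raises still holds: this runs after recognition, so it cannot influence the recognizer's own choices the way freq-dawg/word-dawg would.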
>
> >>>>> Point-4 is a bit confusing for me. However, what I understand is that
> >>>>> in your testing you got a better result when you trained with only one
> >>>>> set of characters, and the result degraded when you trained with four
> >>>>> sets of character images of the same alphabet.
>
> >>>> *Comments*: There is no confusion. Out of the four sets, three sets of
> >>>> characters are copies of the "one set of characters".
>
> >>>>> In my case of testing Bangla script, my observations are:
>
> >>>>> - Tesseract deals with the bounding boxes of segmented units during
> >>>>> recognition. Say I trained the basic characters and the vowel modifiers
> >>>>> separately. When I try to recognize a character image where the vowel
> >>>>> modifier is disconnected from the basic character but overhangs (casts
> >>>>> a "shadow" over) the basic character, then it fails to recognize it.
>
> >>>> *comments:* No surprise. I have already tested with only consonants and
> >>>> dependent vowels (vowel modifiers) trained separately, but it failed. If
> >>>> the output shows the consonant and the dependent vowel (vowel modifier)
> >>>> separately, then you have to press the "back-space" key to connect/merge
> >>>> the vowel modifier with the preceding consonant; only then will the
> >>>> vowel modifier merge with the consonant automatically, i.e. consonant
> >>>> plus dependent vowel (vowel modifier). You can test it here itself, like
> >>>> (ಕ ಾ = ಕಾ, क ा = का). It is felt that suitable code or a function
> >>>> (e.g. consonant + vowel-modifier = combined character) to merge the
> >>>> vowel modifier with the consonant is required. If available, the
> >>>> necessity of training all possible combination characters can be
> >>>> eliminated. In other words, the training procedure would be very simple
> >>>> - similar to the English alphabet. Moreover, it would benefit other
> >>>> world languages which have similar dependent vowels (vowel modifiers).
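[Editor's note: the merge function requested above can be sketched in a few lines, because in Unicode the "merge" is just concatenation in logical order. This is an illustrative assumption about how such a post-processor might work, not Tesseract code; the function name is made up:]

```python
# Hypothetical merge step: given separately recognized pieces
# (consonant, then vowel modifier), attach each spacing combining
# mark (Unicode category "Mc") to the preceding piece.
import unicodedata

def merge_pieces(pieces):
    """Combine consonant + dependent-vowel pieces into syllables."""
    out = []
    for p in pieces:
        if out and p and unicodedata.category(p[0]) == "Mc":
            out[-1] = out[-1] + p    # modifier joins the previous consonant
        else:
            out.append(p)
    return out

# Kannada example from the thread: ಕ (U+0C95) + ಾ (U+0CBE) -> ಕಾ
print(merge_pieces(["\u0C95", "\u0CBE"]))
```

This covers post-base modifiers only; pre-base matras such as ि would additionally need the reordering the thread discusses.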
>
> >>>> As such, if you are able to develop source code or a function for
> >>>> merging consonants with vowel modifiers, please post it as an issue on
> >>>> the Tesseract source-code project so that, after detailed examination
> >>>> on its merits, the developers can include it in the relevant source
> >>>> code.
>
> >>>>> So, the decision that I take is: we have to train all combinations of
> >>>>> possible characters that might appear in real documents. I would like
> >>>>> to mention here that I didn't try any variation of the training data
> >>>>> yet; for example, I didn't try the following, which I plan to do soon:
>
> >>>> *comments*: Yes, you have to - until code for merging vowel modifiers
> >>>> with consonants is available, there is no other way but to train all
> >>>> combinations of possible characters.
>
> ...
>