Dictionary patch

Ray

Aug 14, 2008, 7:19:19 PM
to tesseract-ocr
I have just uploaded a patch for wordlist2dawg to svn. It was failing
to build correct dawgs, on Windows in particular, which sometimes led
to crashes and always led to poor correction of words during
recognition.

If you generate your own dawgs and you run Windows, you definitely
need this fix.

If you generate your own dawgs, and you run any other OS, it may still
be helpful, as there are also changes for signed char problems that
may have affected your dawgs too.
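
The message does not say where in the code the signed char problems
were. As a general illustration only (this is not the dawg code), the
Python sketch below shows why the distinction matters for non-ASCII
text: every byte of a multi-byte UTF-8 character is 0x80 or above, so
it turns negative when read as a C signed char and can no longer be
used safely as a table index. The helper names and the sample letter
are made up for the example.

```python
import struct


def as_signed_char(byte_value: int) -> int:
    """Reinterpret one byte (0..255) the way a C `signed char` would."""
    return struct.unpack("b", bytes([byte_value]))[0]


def as_unsigned_char(byte_value: int) -> int:
    """Reinterpret one byte (0..255) the way a C `unsigned char` would."""
    return struct.unpack("B", bytes([byte_value]))[0]


# An arbitrary non-ASCII letter (e with acute accent); its UTF-8
# encoding is the two bytes 0xC3 0xA9, both above 0x7F.
utf8_bytes = "\u00e9".encode("utf-8")

signed_view = [as_signed_char(b) for b in utf8_bytes]
unsigned_view = [as_unsigned_char(b) for b in utf8_bytes]

print(signed_view)    # negative values: unusable as array indices
print(unsigned_view)  # values in 0..255: safe for a 256-entry table
```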

Even if you don't make your own dawgs, you may get an accuracy boost
from this patch if you are running a non-English language.

The files required can be pulled directly from svn via the web
interface from here:
http://code.google.com/p/tesseract-ocr/source/browse/#svn/trunk/dict

The required files are:
dawg.cpp
dawg.h
lookdawg.cpp
makedawg.cpp
trie.cpp
trie.h

Ray.

Ray

Aug 20, 2008, 2:26:48 PM
to tesseract-ocr
Now let's try that again! After more extensive testing, I have fixed
another problem. If a prefix of a word occurs later in the wordlist
than the longer word, e.g.
father
...
fat
then at the introduction of fat, all previous words beginning with fat
were lost! In other words, to work correctly, the wordlist had to be
sorted by length.
This problem is now fixed, and the wordlist may be in any order. The
updated files are now in svn:
dawg.cpp
dawg.h
lookdawg.cpp
lookdawg.h
makedawg.cpp
trie.cpp

They can be pulled from the web interface here:
http://code.google.com/p/tesseract-ocr/source/browse/#svn/trunk/dict

If you didn't get the first patch, you also need trie.h.

training/wordlist2dawg.cpp has also been changed to reduce the amount
of memory it uses.
Ray.
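
The prefix problem described above can be illustrated with a small
Python sketch. It is not the code from trie.cpp, and the thread does
not say what the actual cause was: add_buggy is a hypothetical
reconstruction of the symptom (adding "fat" throws away everything
stored below it), while add shows the order-independent behaviour. The
class and method names are made up for the example.

```python
class Trie:
    """Minimal dict-based trie. END marks the end of a complete word,
    so this sketch assumes words never contain the END character."""

    END = "$"

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            # Reuse an existing child so earlier, longer words survive.
            node = node.setdefault(ch, {})
        node[self.END] = True

    def add_buggy(self, word):
        """Hypothetical bug: the node for the last letter is replaced
        instead of reused, discarding any subtree already stored there."""
        node = self.root
        for ch in word[:-1]:
            node = node.setdefault(ch, {})
        node[word[-1]] = {self.END: True}

    def __contains__(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return self.END in node


words = ["father", "fatal", "fat"]  # the prefix "fat" comes last

good, bad = Trie(), Trie()
for w in words:
    good.add(w)
    bad.add_buggy(w)

print([w for w in words if w not in good])  # nothing is lost
print([w for w in words if w not in bad])   # the longer words are lost
```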

Hasnat

Aug 21, 2008, 4:34:17 AM
to tesser...@googlegroups.com
Dear Ray Smith,
I just downloaded the latest svn and installed it. I was eagerly waiting to see the difference between the output generated with the previous and the current dawg files on Bangla script. I am sorry to report that there is no difference in the generated output. The only change I noticed is that the build time of the dawg files has improved significantly.

For example, the generated word is এখ্যনে, whereas the correct word এখানে is present in the dictionary. I know these errors can be corrected with a separate dictionary lookup, but I would be happier if the dawg files could help correct them.
--
Hasnat
Center for Research on Bangla Language Processing (CRBLP)
http://mhasnat.googlepages.com/

Keith Beaumont

Aug 21, 2008, 4:55:49 AM
to tesser...@googlegroups.com
A temporary fix is to put ALL your words in the third file
(user_words, I think). It isn't converted to a dawg, and all words are
found. I had this problem a while ago!
KB

Hasnat

Aug 21, 2008, 5:17:47 AM
to tesser...@googlegroups.com
Well, I just followed your advice and put all the dictionary words (around 190K) into the user_words file. However, there is still no change in the generated output.

Keith Beaumont

Aug 21, 2008, 10:46:38 AM
to tesser...@googlegroups.com
Is the user_words file in the same dir as your other language files?

Ray Smith

Aug 21, 2008, 4:35:56 PM
to tesser...@googlegroups.com
Now that the dawg file actually contains the words that you put in it, it should work to strengthen the dictionary's effect by increasing garbage_word and non_word (see this thread: http://groups.google.com/group/tesseract-ocr/browse_thread/thread/5495c4e348a4b272/6a14c25cafb84a5f?lnk=gst&q=garbage_word#6a14c25cafb84a5f).
Ray.

Keith Beaumont

Aug 22, 2008, 7:21:38 AM
to tesser...@googlegroups.com
If dawgs are still failing, generate a pair of EMPTY dawgs & try
again. Hope I'm on the right lines here.
KB

Donatas G.

Aug 22, 2008, 9:01:37 PM
to tesser...@googlegroups.com
On Wednesday 20 August 2008 21:26:48 Ray wrote:
> Also training/wordlist2dawg.cpp is changed to reduce the amount of
> memory it uses.
> Ray.

And the time used to make a file decreases dramatically.

However, only half of the words from the source file end up in the
generated dawg, or at least that is what dawg2txt outputs. I am
attaching the test case files.

For reference, I am also including the old dawg file generated before
the patches were applied.

Donatas
--
Donatas G.:
http://dg.lapas.info

mmanovisi_1500.txt
mmanovisi_1500.txt.new.dawg
mmanovisi_1500.txt.dawg
mmanovisi_1500.txt.new.dawg.undawged

74yrsold

Aug 24, 2008, 8:21:53 AM
to tesseract-ocr
Using generated empty dawgs gives no improvement. I have already
tried this for Kannada and even for English.

On Aug 22, 4:21 pm, "Keith Beaumont" <beaumon...@gmail.com> wrote:
> If dawgs are still failing, generate a pair of EMPTY dawgs & try
> again. Hope I'm on the right lines here.
> KB

74yrs old

Aug 24, 2008, 8:44:28 AM
to tesseract-ocr
How can I view the words in the generated dawg and compare them against words_list.txt?

Ray Smith

Aug 24, 2008, 4:04:56 PM
to tesser...@googlegroups.com
After running wordlist2dawg with the newest version, you can run it again with the same command line, but with -t (followed by a space) inserted before the first argument. It will list the words found in the dictionary and, crucially, tell you how many words were missing. If you can get it to report a non-zero number, please let me know and provide the wordlist, but always test with the same wordlist that you used to create the dictionary. In my own tests with Kannada, I haven't managed to get a non-zero count since fixing the problems.
Ray.
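
A similar check can be approximated outside wordlist2dawg if a
plain-text dump of the dawg's words is available. The following is a
minimal Python sketch under that assumption, with one word per line in
both inputs. The function name and the sample data are made up for the
example, and it does not reproduce the output of the -t option.

```python
def missing_words(wordlist_lines, dawg_lines):
    """Return the wordlist entries that do not appear in the dawg dump.

    Both arguments are iterables of text lines, one word per line.
    Blank lines and repeated wordlist entries are ignored.
    """
    dumped = {line.strip() for line in dawg_lines if line.strip()}
    seen = set()
    missing = []
    for line in wordlist_lines:
        word = line.strip()
        if not word or word in seen:
            continue
        seen.add(word)
        if word not in dumped:
            missing.append(word)
    return missing


# Sample data only; real input would come from the wordlist file and
# from whatever tool was used to dump the dawg.
wordlist = ["father\n", "fat\n", "fatal\n", "\n", "fat\n"]
dump = ["fat\n", "fatal\n"]

result = missing_words(wordlist, dump)
print(len(result), result)
```

Open file objects can be passed directly in place of the sample lists,
since the function only iterates over lines.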

Donatas G.

Aug 25, 2008, 4:14:46 AM
to tesser...@googlegroups.com
On Sunday 24 August 2008 23:04:56 Ray Smith wrote:
> After running wordlist2dawg, with the newest version, you can run it again
> with the same command line, except insert -t (with a space) before the
> first argument, and it will list the words found in the dictionary, and,
> the crucial thing, tell you how many words were missing. Please let me know
> and provide the wordlist, if you can get it to give you a non-zero number
> here, but always using the same word list as you used to create the
> dictionary. In my own tests, with Kannada, I haven't managed to get a
> non-zero output since fixing the problems.
> Ray.

OK, so I do get the correct number of words after all. My previous test
used the dawg2txt program, and perhaps that program was at fault. Now I
always get 0 missing words when checking my dawg files.
