UTF8 & wordlist2dawg

ArabicOCR

Aug 27, 2007, 6:11:20 PM

to tesseract-ocr

Hi,

I am trying to build a freq-dawg file from a UTF-8 Arabic text file, but
it seems that wordlist2dawg is not ready for Unicode.
wordlist2dawg builds the freq-dawg without any error, but I think the
file is not correct because it does not affect the recognition of even
one character. I debugged the wordlist2dawg program and found that
it treats the file as an ANSI file, although I am still not sure.
Is there any flowchart or document that explains wordlist2dawg
and the related operations? I want to know how Tesseract uses the
freq-dawg file.
How can I solve this problem?

Best Regards,
Arsalan

Ray Smith

Aug 28, 2007, 9:12:46 PM

to tesser...@googlegroups.com

The dawgs treat their data as a string of characters, so there was very little change to them to support utf-8. There are problems with the training process for some utf-8 sequences - particularly ligatures that take more than 4 bytes to represent. I have a fix for these problems coming. If you want to find out what is going on, assume that your dawg files are ok, and put a breakpoint in def_letter_is_okay in dawg.cpp and find out whether it ever returns true, and if not why not.
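To illustrate why a structure that only sees "a string of characters" needs little change for UTF-8, here is a minimal sketch of a plain trie keyed by bytes. It is not Tesseract's DAWG code or file format; the names `build_byte_trie` and `trie_contains` and the sample words are hypothetical, chosen only for this example.

```python
# Hypothetical illustration only: a plain byte trie, NOT Tesseract's DAWG
# implementation or on-disk format.

def build_byte_trie(words):
    """Build a nested-dict trie keyed by individual UTF-8 byte values."""
    root = {}
    for word in words:
        node = root
        for byte in word.encode("utf-8"):
            node = node.setdefault(byte, {})
        node[None] = True  # end-of-word marker
    return root


def trie_contains(trie, word):
    """Return True only if the exact byte sequence of `word` was inserted."""
    node = trie
    for byte in word.encode("utf-8"):
        if byte not in node:
            return False
        node = node[byte]
    return None in node


# Sample Arabic words (assumed for the example): "kitab", "kutub", "qalam".
words = ["\u0643\u062a\u0627\u0628", "\u0643\u062a\u0628", "\u0642\u0644\u0645"]
trie = build_byte_trie(words)
```

Because the trie compares raw bytes, a lookup succeeds only when the word list and the text being checked use the same encoding. A word list read under a single-byte code page would produce different bytes and never match.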

Please let me know how you get on, as I am curious how well it works with Arabic. (I wouldn't expect it to be too successful, as the chopper probably isn't up to separating the script into characters.)
Ray.

ArabicOCR

Aug 31, 2007, 8:07:31 AM

to tesseract-ocr

Dear Ray,

Thank you for your reply. I'll try to debug dawg.cpp. But as far as I
know, there isn't any ligature in Arabic or Farsi that takes more than
4 bytes, so I think the problem is somewhere else.
I am evaluating Tesseract's ability to handle the Arabic and Farsi
scripts. I will let you know the results, and I would appreciate it if
you could give me the main guidelines on this topic.
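The 4-byte question can be checked directly by counting UTF-8 bytes. The sketch below does that for a few Arabic code point sequences. The byte counts follow from the UTF-8 encoding itself, but which sequences Tesseract's training actually treats as a single unit is not shown here and is only an assumption. The `samples` table and the `utf8_len` helper are hypothetical names for this example.

```python
# Byte counts are facts of the UTF-8 encoding. Which of these sequences
# Tesseract's training treats as one unit is an ASSUMPTION, not shown here.

samples = {
    "lam": "\u0644",
    "alef": "\u0627",
    "lam + alef (two code points)": "\u0644\u0627",
    "lam-alef presentation form U+FEFB": "\ufefb",
    "lam + alef + fathatan (three code points)": "\u0644\u0627\u064b",
}


def utf8_len(text):
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


for name, text in samples.items():
    print(f"{name}: {len(text)} code point(s), {utf8_len(text)} byte(s)")
```

Each basic Arabic letter takes 2 bytes, so a two-letter sequence fits in 4 bytes, while any unit of three or more code points (for example a letter pair carrying a diacritic) needs 6 bytes or more.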

Regards,
Arsalan

