I am trying to build a freq-dawg file from a UTF-8 Arabic text file, but
it seems that Wordlist2dawg is not ready for Unicode.
Wordlist2dawg produces the freq-dawg without any error, but I do not think
the file is correct, because it does not affect the recognition of even a
single character. I debugged the Wordlist2dawg program and found that
it appears to treat the input as an ANSI file, although I am still not sure.
Is there a flowchart or document that explains Wordlist2dawg
and the related operations? I want to know how Tesseract uses the freq-dawg
file.
How can I solve this problem?
Best Regards,
Arsalan
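
One way to test the suspicion that the word list is being read as ANSI rather than UTF-8 is to look at how the same bytes decode under each assumption. The following is a minimal Python sketch and not part of Tesseract; the sample word and the choice of cp1252 as the "ANSI" code page are assumptions made for illustration.

```python
# Hypothetical sample word; any Arabic word from the word list would do.
word = "\u0643\u062a\u0627\u0628"  # the four letters kaf, teh, alef, beh

# The bytes as they would sit in a UTF-8 word list file.
utf8_bytes = word.encode("utf-8")

# Decoded as UTF-8, the bytes give back the four original letters.
as_utf8 = utf8_bytes.decode("utf-8")

# Decoded as a single-byte code page (cp1252 assumed), every byte
# becomes its own character, so the word turns into eight symbols.
as_ansi = utf8_bytes.decode("cp1252", errors="replace")

print(len(utf8_bytes), "bytes in the file")
print(len(as_utf8), "characters when read as UTF-8")
print(len(as_ansi), "characters when read as cp1252")
```

If a tool reports twice as many characters per word as expected, that would be consistent with the single-byte reading, though it does not prove that Wordlist2dawg behaves this way.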
Thank you for the reply. I'll try to debug dawg.cpp. But as far as I
know, there is no ligature in Arabic or Farsi that takes more than
4 bytes, so I think the problem is somewhere else.
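
The byte counts can be checked directly. The sketch below is plain Python, not Tesseract code, and the sample strings are chosen only for illustration. It prints the UTF-8 length of a few Arabic and Farsi code points and of two multi-code-point sequences. Whether sequences like the last one actually occur in a given training set is an assumption that would need checking.

```python
import unicodedata


def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Illustrative samples (not taken from any Tesseract data file).
samples = [
    "\u0644",              # single letter
    "\u067e",              # Farsi letter
    "\ufefb",              # presentation-form ligature, one code point
    "\ufdf2",              # presentation-form ligature, one code point
    "\u0644\u0627",        # lam followed by alef, two code points
    "\u0644\u0651\u0670",  # lam plus two combining marks
]

for s in samples:
    names = " + ".join(unicodedata.name(ch) for ch in s)
    print(f"{utf8_len(s)} bytes: {names}")
```

Single code points in the Arabic blocks take 2 or 3 bytes, so they stay under the limit. A unit written as several code points, such as a letter with combining marks, can exceed 4 bytes.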
I am evaluating Tesseract's ability to handle the Arabic and Farsi
scripts. I will let you know the results, and I would appreciate it if
you could give me the main guidelines on this topic.
Regards,
Arsalan
On Aug 29, 4:12 am, "Ray Smith" <theraysm...@gmail.com> wrote:
> The dawgs treat their data as a string of characters, so there was very
> little change to them to support utf-8. There are problems with the training
> process for some utf-8 sequences - particularly ligatures that take more
> than 4 bytes to represent. I have a fix for these problems coming. If you
> want to find out what is going on, assume that your dawg files are ok, and
> put a breakpoint in def_letter_is_okay in dawg.cpp and find out whether it
> ever returns true, and if not why not.
>
> Please let me know how you get on, as I am curious how well it works with
> Arabic. (I wouldn't expect it to be too successful, as the chopper probably
> isn't up to separating the script into characters.)
> Ray.
>
> On 8/27/07, ArabicOCR <Ghasr...@googlemail.com> wrote:
>
> > Hi,
>
> > I am trying to build a freq-dawg file from a UTF-8 Arabic text file, but
> > it seems that Wordlist2dawg is not ready for Unicode.
> > Wordlist2dawg produces the freq-dawg without any error, but I do not think
> > the file is correct, because it does not affect the recognition of even a
> > single character. I debugged the Wordlist2dawg program and found that
> > it appears to treat the input as an ANSI file, although I am still not sure.
> > Is there a flowchart or document that explains Wordlist2dawg
> > and the related operations? I want to know how Tesseract uses the freq-dawg
> > file.
> > How can I solve this problem?
>
> > Best Regards,
> > Arsalan
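
Ray's point that the dawgs treat their data as a plain string of characters can be illustrated with a toy byte-level trie. This is only a sketch in Python and is not Tesseract's DAWG implementation; `ByteTrie` and the sample words are hypothetical. It shows that a byte-oriented word graph needs no Unicode awareness, as long as the lookup key uses the same encoding as the word list.

```python
class ByteTrie:
    """Toy trie keyed on raw bytes (hypothetical, for illustration only)."""

    def __init__(self):
        self.root = {}

    def add(self, word: bytes) -> None:
        node = self.root
        for b in word:
            node = node.setdefault(b, {})
        node[None] = True  # end-of-word marker

    def contains(self, word: bytes) -> bool:
        node = self.root
        for b in word:
            node = node.get(b)
            if node is None:
                return False
        return None in node


book = "\u0643\u062a\u0627\u0628"  # sample word, assumed for the example
pen = "\u0642\u0644\u0645"         # sample word, assumed for the example

trie = ByteTrie()
for w in (book, pen):
    trie.add(w.encode("utf-8"))

# Same encoding on both sides: the word is found.
print(trie.contains(book.encode("utf-8")))
# Same word encoded with a single-byte Arabic code page: not found.
print(trie.contains(book.encode("cp1256")))
# A proper prefix of a stored word is not itself a word.
print(trie.contains(book[:3].encode("utf-8")))
```

In this toy, an encoding mismatch between the word list and the lookup side makes every lookup fail silently. That resembles a dictionary having no effect on recognition, although the actual cause in Tesseract may be different.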