UTF-8 supports Unicode? That means Indic too?


Debayan Banerjee

Nov 19, 2009, 3:59:42 PM
to tesser...@googlegroups.com
I am extremely perplexed trying to figure out why the dictionary is
absolutely worthless for Indic scripts. I have recorded my
observations at
http://hacking-tesseract.blogspot.com/2009/11/utf-8-ok.html . Kindly
comment/advise. I need it badly :(

--
Regards,
Debayan Banerjee

Debayan Banerjee

Nov 20, 2009, 3:40:19 AM
to tesser...@googlegroups.com
2009/11/20 Debayan Banerjee <deba...@gmail.com>:
> I am extremely perplexed trying to figure out why the dictionary is
> absolutely worthless for Indic scripts.

"Starting with GNU glibc 2.2, the type wchar_t is officially intended
to be used only for 32-bit ISO 10646 values, independent of the
currently used locale."

Copied verbatim from http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 .

I do not see any wchar_t types anywhere near the dictionary code.
Then why do the authors of Tesseract say that it supports Unicode?

Also, from the 2nd paragraph of
http://www.win.tue.nl/~aeb/linux/ocr/tesseract.html :

"The source has a design mistake, in that there is no type unichar for
Unicode character. Instead, Unicode strings are carried around in
UTF-8, together with an array that gives the lengths of the substrings
that represent the individual Unicode characters. This causes code and
dictionary bloat, slows down the program, and causes worse OCR
performance. "
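
For readers unfamiliar with the representation described in that quote, here is a rough sketch of a UTF-8 string carried alongside an array of per-character byte lengths. It is not taken from the Tesseract source; the names utf8_step and utf8_lengths are made up for illustration, and the check is deliberately simplified (it looks only at lead bytes and does not validate continuation bytes).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Byte length of the UTF-8 sequence that starts with `lead`, or 0 if
// `lead` cannot start a sequence (continuation byte or invalid byte).
// Hypothetical helper, not a Tesseract function.
inline int utf8_step(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Builds the "array of lengths" that accompanies a UTF-8 string: one entry
// per Unicode character, giving the number of bytes it occupies.
// Returns an empty vector if a lead byte is invalid or the string is
// truncated. Continuation bytes are not validated (simplification).
inline std::vector<int> utf8_lengths(const std::string& s) {
  std::vector<int> lengths;
  std::size_t i = 0;
  while (i < s.size()) {
    const int step = utf8_step(static_cast<unsigned char>(s[i]));
    if (step == 0 || i + static_cast<std::size_t>(step) > s.size()) {
      return {};
    }
    lengths.push_back(step);
    i += static_cast<std::size_t>(step);
  }
  return lengths;
}
```

For example, the string "a" followed by the Bengali letter KA (U+0995, bytes E0 A6 95) is four bytes long but only two characters, so the length array has two entries. Any code that walks the string has to consult that array instead of indexing characters directly, which is the overhead the quote complains about.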


So my question to Ray and the team is whether this is something that
should be fixed. If yes, how and where?

--
Regards,
Debayan Banerjee

Debayan Banerjee

Nov 25, 2009, 1:49:43 PM
to tesser...@googlegroups.com
2009/11/20 Debayan Banerjee <deba...@gmail.com>:

> i do not  see any wchar_t types anywhere near the dictionary code.
> Then why do the authors of Tesseract say that it supports Unicode?
I found the answers to my own questions. I realised that UTF-8, while
perhaps not the best way to store Unicode text, is one of the ways to do
so. Hence, Tesseract is supposed to support Indic scripts in the
dictionary too. However, I experimented a lot and added numerous cprintf
statements to see where the DAWG files were going wrong. I have noted the
results of
the experiments at
http://hacking-tesseract.blogspot.com/2009/11/tesseract-dictionary-finally-works-for.html
. Kindly read and comment.
Wait for my next post, where I will analyse how to solve the
Indic-dictionary bug.



--
Regards,
Debayan Banerjee

Debayan Banerjee

Nov 25, 2009, 2:13:02 PM
to tesser...@googlegroups.com
2009/11/26 Debayan Banerjee <deba...@gmail.com>:

> Wait for my next post where I will analyse how to solve the
> Indic-dictionary  bug.

In fact, it was a single-line change. Here is the patch. The change is
in dict/permute.cpp

--- tesseract-2.04/dict/permute.cpp 2008-11-14 23:07:17.000000000 +0530
+++ tessmod/dict/permute.cpp 2009-11-26 00:34:50.660737699 +0530
@@ -1077,6 +1077,7 @@
     return (NULL);
   if (permute_only_top)
     return result_1;
+  any_alpha=1;
   if (any_alpha && array_count (char_choices) <= MAX_WERD_LENGTH) {
     result_2 = permute_words (char_choices, rating_limit);
     if (class_probability (result_1) < class_probability (result_2)


For non-English scripts the if condition was never satisfied, and
hence the DAWG files were never consulted. Explicitly setting
any_alpha=1 just above the condition solves this problem for the time
being. There is probably a more elegant solution though.
By the way, I do not see this particular if condition anywhere in the
corresponding file in the trunk. Perhaps the developers have already
fixed it there.
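
To make the effect of that guard easier to see, here is a tiny model of it in isolation. This is only a sketch: dictionary_is_consulted and kMaxWordLength are hypothetical names, the value 32 is an arbitrary stand-in (the real MAX_WERD_LENGTH is not shown in this thread), and nothing here is copied from Tesseract beyond the shape of the condition in the patch.

```cpp
#include <cstddef>

// Arbitrary stand-in for MAX_WERD_LENGTH; the real value is an unknown here.
constexpr std::size_t kMaxWordLength = 32;

// Simplified model of the condition in the patch: returns true when the
// word-level dictionary permuter (permute_words in the patch) would run.
// Hypothetical function, for illustration only.
inline bool dictionary_is_consulted(bool any_alpha, std::size_t num_choices) {
  return any_alpha && num_choices <= kMaxWordLength;
}
```

With any_alpha false, the function returns false for every word length, which matches what I observed for Indic text. Forcing any_alpha to 1 turns the guard into a pure length check, so the dictionary would also be consulted for words that really do contain no letters. That is presumably why computing the flag correctly for non-English scripts would be the cleaner fix.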





--
Regards,
Debayan Banerjee