2009/11/20 Debayan Banerjee <deba...@gmail.com>:
> I am extremely perplexed trying to figure out why the dictionary is
> absolutely worthless for Indic scripts.
"Starting with GNU glibc 2.2, the type wchar_t is officially intended
to be used only for 32-bit ISO 10646 values, independent of the
currently used locale."
Copied verbatim from
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 .
I do not see any wchar_t types anywhere near the dictionary code.
Then why do the authors of Tesseract say that it supports Unicode?
Also, from the 2nd paragraph of
http://www.win.tue.nl/~aeb/linux/ocr/tesseract.html :
"The source has a design mistake, in that there is no type unichar for
Unicode character. Instead, Unicode strings are carried around in
UTF-8, together with an array that gives the lengths of the substrings
that represent the individual Unicode characters. This causes code and
dictionary bloat, slows down the program, and causes worse OCR
performance."
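
To make the representation described in that quote concrete, here is a rough sketch of "UTF-8 bytes plus an array of per-character lengths", applied to a Bengali word. It is not Tesseract's code: the helper names utf8_char_lengths and split_by_lengths are hypothetical, and Python is used only for illustration.

```python
from typing import List


def utf8_char_lengths(data: bytes) -> List[int]:
    """Return the byte length of each UTF-8 encoded code point in data.

    Hypothetical helper for illustration; it inspects lead bytes only and
    does not validate continuation bytes.
    """
    lengths = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:            # 0xxxxxxx: 1-byte sequence (ASCII)
            n = 1
        elif lead >> 5 == 0b110:   # 110xxxxx: 2-byte sequence
            n = 2
        elif lead >> 4 == 0b1110:  # 1110xxxx: 3-byte sequence
            n = 3
        elif lead >> 3 == 0b11110: # 11110xxx: 4-byte sequence
            n = 4
        else:
            raise ValueError(f"invalid UTF-8 lead byte at offset {i}")
        lengths.append(n)
        i += n
    return lengths


def split_by_lengths(data: bytes, lengths: List[int]) -> List[str]:
    """Recover the individual characters from the bytes and the length array."""
    chars = []
    pos = 0
    for n in lengths:
        chars.append(data[pos:pos + n].decode("utf-8"))
        pos += n
    return chars


# A Bengali word of five code points, written with escapes to avoid ambiguity.
word = "\u09ac\u09be\u0982\u09b2\u09be"
encoded = word.encode("utf-8")          # the "UTF-8 string" that is carried around
lengths = utf8_char_lengths(encoded)    # the side array of substring lengths
code_points = [ord(ch) for ch in word]  # the alternative: one integer per character

print(len(encoded), lengths)
print([hex(cp) for cp in code_points])
```

Every Bengali code point takes three bytes in UTF-8, so each character needs three bytes plus a length entry, where a single 32-bit value per character would carry the same information.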
So my question to Ray and team is whether this is something that should
be fixed. If yes, how and where?
--
Regards,
Debayan Banerjee