I am trying to train Tesseract for the Dzongkha language
(http://en.wikipedia.org/wiki/Dzongkha_language). I am using
tesseract-2.03 on Debian Squeeze. I have created a grayscale training
image. When I try to create the box file using "tesseract
dzo1.tif dzo1 batch.nochop makebox", I get output like this:
Using substitute bounding box at (562,2547)->(1376,2629)
Using substitute bounding box at (988,2372)->(1830,2440)
Using substitute bounding box at (560,1924)->(1374,2006)
Using substitute bounding box at (222,1841)->(2135,1916)
Using substitute bounding box at (220,1216)->(2132,1291)
Using substitute bounding box at (291,675)->(1370,757)
Afterwards, when editing the box file with tesseractTrainer.py, I
notice that about 20 clearly separated blocks are grouped into one
box (please see the attached file 1.png).
Also, some of the image blocks (characters) do not fit exactly
inside their bounding boxes (attached file 2.png).
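
To locate the merged boxes without paging through the editor, a small
script along these lines could flag suspicious entries. This is only a
sketch: it assumes each box file line has five whitespace-separated
fields (glyph, left, bottom, right, top), and the function names and
the width threshold are hypothetical choices, not part of Tesseract.

```python
from statistics import median


def parse_box_line(line):
    """Split one box line into (glyph, left, bottom, right, top).

    Assumes five fields per line; splitting from the right keeps the
    glyph intact even if it consists of several code points.
    """
    glyph, left, bottom, right, top = line.rsplit(None, 4)
    return glyph, int(left), int(bottom), int(right), int(top)


def find_wide_boxes(lines, factor=3.0):
    """Return boxes whose width exceeds `factor` times the median width.

    `factor` is an arbitrary illustrative threshold; a box spanning a
    whole text line will be many times wider than a single glyph.
    """
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    if not boxes:
        return []
    widths = [right - left for _, left, _, right, _ in boxes]
    typical = median(widths)
    return [box for box, width in zip(boxes, widths) if width > factor * typical]
```

It could be run on the lines of the generated box file (opened as
UTF-8); each returned tuple is a box that probably covers several
characters and needs to be split by hand.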
It would be very helpful if I could get some solutions to this. If
this has already been discussed on the list, I would be grateful for
a link to the thread.
Regards
Tenzin