Hello colleagues,
I have the following problem: after training completes successfully, Tesseract inserts spaces that do not exist in the text into the middle of some words during OCR, e.g. it splits the word “HRISTOVICH” into “HRISTO” + [space] + “VICH”. In this particular example the word is printed in a very common font: Arial, 9pt, Italic (scanned at 300 DPI), and Tesseract was trained on exactly the same font with a sufficiently large amount of text in capital letters only.
Following Ray Smith’s recommendations, I tried changing some of the constants in the file textord/tospace.cpp, but without success. There are hundreds of constants, and it is not clear how they affect the spacing algorithms.
Does anybody know what I need to change in order to tell Tesseract that spaces should be wider than it thinks they are?
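To make the question concrete, here is a toy sketch in Python of the kind of decision I mean. It is not Tesseract's actual algorithm; the function names, the box coordinates and the threshold values are all made up for illustration. It assumes each character is a (left, right) pixel box and that a word break is declared whenever the horizontal gap between neighbouring boxes reaches a threshold.

```python
def make_boxes(gaps, char_width=20):
    """Build hypothetical (left, right) character boxes from a list of gaps in pixels."""
    boxes, x = [], 0
    for i in range(len(gaps) + 1):
        boxes.append((x, x + char_width))
        if i < len(gaps):
            x += char_width + gaps[i]
    return boxes


def split_words(boxes, space_threshold):
    """Group character boxes into words: a gap >= space_threshold starts a new word."""
    if not boxes:
        return []
    words = [[boxes[0]]]
    for prev, cur in zip(boxes, boxes[1:]):
        gap = cur[0] - prev[1]  # horizontal distance between neighbouring boxes
        if gap >= space_threshold:
            words.append([cur])
        else:
            words[-1].append(cur)
    return words


# Ten letters of "HRISTOVICH"; the made-up gap between "O" and "V" is
# slightly wider than the others, as could happen in an italic font.
gaps = [3, 3, 3, 3, 3, 7, 3, 3, 3]
boxes = make_boxes(gaps)

low = [len(w) for w in split_words(boxes, space_threshold=6)]
high = [len(w) for w in split_words(boxes, space_threshold=10)]
print(low, high)
```

With the lower threshold the word breaks into 6 + 4 letters, which is the behaviour I am seeing; with the higher threshold it stays whole. What I am looking for is the equivalent of raising that threshold in Tesseract.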
Another question: is there a way to teach Tesseract the usual width of the [space] for a particular language? I think Tesseract currently ignores the spacing between letters completely during the training process.
Svetlin Nakov