Tesseract OCR inserts nonexistent spaces in the middle of words: how to change the spacing tolerance?


Svetlin Nakov

Nov 12, 2009, 6:02:12 AM
to tesser...@googlegroups.com

Hello colleagues,

 

I have the following problem: after successful training, Tesseract inserts spaces that do not exist in the text into the middle of some words during OCR, e.g. it splits the word “HRISTOVICH” into “HRISTO” + [space] + “VICH”. In this particular example the word is printed in a very standard font: Arial, 9pt, italic (scanned at 300 DPI), and Tesseract was trained on exactly the same font with a sufficiently large amount of text in capital letters only.

 

Following Ray Smith’s recommendations (http://groups.google.com/group/tesseract-ocr/msg/69ee99fc6f8a395f), I tried changing some of the constants in the file textord/tospace.cpp, but without success. There are hundreds of constants, and it is not clear how they affect the spacing algorithms.

 

Does anybody know what I need to change in order to tell Tesseract that spaces should be wider than it thinks they are?

 

Another question: is there a way to teach Tesseract the usual width of the [space] for a particular language? I think Tesseract currently ignores the spacing between letters completely during the training process.

 

Svetlin Nakov

Nov 13, 2009, 6:07:30 PM
to tesseract-ocr
In fact Tesseract consistently fails on italic uppercase fonts. In
such fonts the characters have very little spacing (when the gaps are
measured vertically) and in many cases even overlap. I tried to fix
the source code, with no success. It is not a matter of adjusting a
few constants. It is a design issue that will need deep refactoring of
the spacing algorithms defined in textord/tospace.cpp. I found that
for italic uppercase fonts Tesseract internally calculates the usual
letter spacing as a small number (between -1 and 2 pixels), and when
it happens to find a spacing of 3 or 4 pixels it decides to split the
word at that position.

I think the solution should detect italic fonts and handle them
individually as a separate case.
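
To make this failure mode concrete, here is a toy model in Python. It is
not Tesseract's actual code from textord/tospace.cpp; the function name
`find_word_breaks`, the `ratio` parameter and the pixel values are all
made up for illustration. It only shows why a threshold derived from a
tiny "typical" letter gap is easy to trip by accident.

```python
from statistics import median

def find_word_breaks(gaps, ratio=2.0):
    """Toy rule (an assumption, not Tesseract's): a gap is a word space
    if it exceeds `ratio` times the typical letter gap on the line."""
    typical = median(gaps)
    # Clamp so that a zero or negative typical gap still gives a usable threshold.
    threshold = ratio * max(typical, 1.0)
    return [i for i, g in enumerate(gaps) if g > threshold]

# Hypothetical gaps in pixels between the 10 letters of "HRISTOVICH".
italic_gaps = [0, 1, -1, 2, 1, 4, 0, 1, 2]   # tight italic caps, some overlap
upright_gaps = [6, 7, 6, 8, 7, 9, 6, 7, 8]   # same word in an upright font

print(find_word_breaks(italic_gaps))    # index 5 is the gap between "O" and "V"
print(find_word_breaks(upright_gaps))
```

With the italic numbers the typical gap is 1 pixel, so a 4 pixel gap
already looks like a space. In the upright case the same absolute
variation is small relative to the typical gap and nothing is split.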

Svetlin Nakov


patrickq

Nov 13, 2009, 6:26:41 PM
to tesseract-ocr
I have had the same experience getting spaces in many spots where none
should exist. Since I have no idea how to navigate the many Tess
variables, my approach has been to test and remove such spaces myself
post-scan, based on the width & spacing of characters in the current
word. Indeed italic or sloped text poses a challenge, but this
situation can be detected by the fact that in such words the spacing
between letters is either small or negative. If your code can't handle
that challenge, it can at least avoid interfering with such text and
correct spaces in the normal upright characters case. If anyone is
interested in doing this, I am happy to share whatever I have learned
in the process.
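
A rough sketch of what such a post-scan cleanup might look like in
Python. This is not Patrick's code, which is not shown in the thread.
The data layout (one `(left, right)` pixel box per character), the names
`looks_sloped` and `merge_spurious_spaces`, and the thresholds
`max_gap` and `min_space_ratio` are assumptions chosen for the example.

```python
from statistics import median

def char_gaps(boxes):
    """Horizontal gaps between consecutive character boxes (left, right)."""
    return [b[0] - a[1] for a, b in zip(boxes, boxes[1:])]

def looks_sloped(boxes, max_gap=1):
    """Guess italic/sloped text: typical gap is small or negative."""
    gaps = char_gaps(boxes)
    return bool(gaps) and median(gaps) <= max_gap

def merge_spurious_spaces(words, min_space_ratio=0.4):
    """words: list of (text, boxes) on one line, left to right.
    Merge neighbours whose separating gap is narrow compared with the
    mean character width. Sloped text is left untouched."""
    merged = []
    for text, boxes in words:
        if merged:
            prev_text, prev_boxes = merged[-1]
            both = prev_boxes + list(boxes)
            gap = boxes[0][0] - prev_boxes[-1][1]
            mean_width = sum(r - l for l, r in both) / len(both)
            if not looks_sloped(both) and gap < min_space_ratio * mean_width:
                merged[-1] = (prev_text + text, both)
                continue
        merged.append((text, list(boxes)))
    return merged

# Hypothetical upright line: 20 px wide letters, 4 px letter gaps,
# a spurious 6 px "space" inside the first word and a real 14 px space.
line = [
    ("HRISTO", [(0, 20), (24, 44), (48, 68), (72, 92), (96, 116), (120, 140)]),
    ("VICH", [(146, 166), (170, 190), (194, 214), (218, 238)]),
    ("IVAN", [(252, 272), (276, 296), (300, 320), (324, 344)]),
]
print([text for text, _ in merge_spurious_spaces(line)])
```

The sloped check deliberately makes the function do nothing on italic
text, so it cannot make such text worse.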

Patrick

Ray Smith

Nov 13, 2009, 10:29:48 PM
to tesser...@googlegroups.com
Yes the spacing algorithm needs a total rewrite.
The problem is that trying to be general makes it more difficult to get the typical case right.

When text is justified in a narrow column, e.g. in a newspaper, the space between letters and between words can vary from line to line, so it is difficult to trust the block-level statistics. Add to that pair kerning, italics, and the fact that digits are spaced differently from letters in most fonts, and you have a deceptively difficult problem.
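
A small Python illustration of the justified-column point, using
invented numbers and a deliberately naive rule (a gap counts as a space
if it exceeds 1.5 times the typical gap). Neither the rule nor the name
`space_indices` comes from Tesseract; the sketch only contrasts
block-level with line-level statistics.

```python
from statistics import median

def space_indices(gaps, typical, ratio=1.5):
    """Indices of gaps classified as word spaces under the naive rule."""
    return [i for i, g in enumerate(gaps) if g > ratio * typical]

# Hypothetical gaps in pixels for two lines of the same justified block.
tight_line = [2, 2, 2, 6, 2, 2, 2, 6, 2, 2]      # real spaces are 6 px
loose_line = [5, 5, 5, 14, 5, 5, 5, 14, 5, 5]    # real spaces are 14 px

# One statistic for the whole block versus one per line.
block_typical = median(tight_line + loose_line)
block_result = [space_indices(g, block_typical) for g in (tight_line, loose_line)]
line_result = [space_indices(g, median(g)) for g in (tight_line, loose_line)]

print(block_result)   # the tight line loses both of its spaces
print(line_result)    # both lines keep their spaces
```

Per-line statistics rescue this particular example, but a single line
gives far fewer samples, and kerning, italics or digits can skew them
just as easily.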

Ray.
