Maximum box/window/font size

Elan

Jul 13, 2015, 1:23:10 AM

to tesser...@googlegroups.com

Hi all,

I am fairly new to tesseract. I have done some playing around with training new fonts, loading config files, etc., but I have an issue with the images I am trying to OCR.

In many cases, there is a dotted horizontal line about 5-10 pixels above the text. Tesseract mistakenly assumes this is part of the text and draws the box around both the character and the line above it.

An example below

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ Example of lines above text

1. text to read

2. text to read

3. text to read

4. text to read

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

It reads lines 2 and 3 almost perfectly; however, lines 1 and 4 are inconsistent and the results vary. Most of the time it's gibberish. This makes it hard to train tesseract properly, because the box files it produces include the lines.

I was wondering if there is a parameter or configuration where I could set the maximum font size or maximum box size to stop it from including the lines above the text?

I would do some morphological operations on the lines to get rid of them, but the lines are about the same thickness as the font and I worry that would degrade the text.

I know tesseract requires minimum size 10 font to get acceptable results, so I was wondering if there is a way to set the max font size.

The font size should be fairly even across the images (obviously camera distortion may result in an offset of a pixel or two, but it is roughly the same).

I am aware I could segment the image and pull out the regions in between the lines. I guess I am just checking whether there is a quick configuration or parameter I could pass to satisfy this requirement?
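For reference, the segmentation idea mentioned above can be sketched with a simple horizontal projection. This is only a rough illustration, not a Tesseract feature: it assumes the image has already been binarized into rows of 0/1 values (1 = ink), that the page is not noticeably skewed, and that at least one blank pixel row separates each dotted line from the text (the post describes a 5-10 pixel gap). The names `find_row_bands`, `text_bands`, and `min_text_height` are hypothetical.

```python
def find_row_bands(image):
    """Return (top, bottom) row ranges, bottom exclusive, for each run of rows containing ink."""
    bands = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new band of inked rows begins
        elif not has_ink and start is not None:
            bands.append((start, y))       # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(image)))  # band runs to the bottom edge
    return bands


def text_bands(image, min_text_height):
    """Keep only bands tall enough to be text; thin bands are assumed to be rule lines."""
    return [(top, bottom) for top, bottom in find_row_bands(image)
            if bottom - top >= min_text_height]


# Tiny synthetic example: a dotted rule, two text lines, another dotted rule.
dotted = [1, 0, 1, 0, 1, 0]
blank = [0, 0, 0, 0, 0, 0]
glyph = [0, 1, 1, 1, 0, 0]
sample = [dotted, blank, blank,
          glyph, glyph, glyph, glyph, blank,
          glyph, glyph, glyph, glyph, blank, blank,
          dotted]

for top, bottom in text_bands(sample, min_text_height=3):
    print("text rows", top, "to", bottom - 1)
```

Each returned range could then be cropped and passed to the OCR engine on its own. The height threshold is a guess that would need tuning for the real images.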

Can anyone help me?

Is pre-processing the only way to solve this?

Thanks,

Elan

Tom Morris

Jul 16, 2015, 12:38:01 PM

to tesser...@googlegroups.com

I'm confused by your description. Is this for training? If so, there are probably not very many images and you can just edit them by hand.

If this is not training, but regular recognition, you might want to investigate the page segmentation mode parameter.
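As a rough illustration of the page segmentation mode suggestion, the sketch below builds a `tesseract` command line with an explicit mode. It assumes the `tesseract` executable is on the PATH and that the input has already been cropped to a single text line. Mode 7 treats the image as a single text line and mode 6 assumes a single uniform block of text. Tesseract 3.x spells the flag `-psm`, while 4.x and later use `--psm`. The helper names and the file name `strip_1.png` are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, psm, legacy_flag=True):
    """Build the argument list for a tesseract CLI call with a page segmentation mode."""
    # Tesseract 3.x expects "-psm"; 4.x and later expect "--psm".
    flag = "-psm" if legacy_flag else "--psm"
    # "stdout" as the output base asks tesseract to print the text instead of writing a file.
    return ["tesseract", image_path, "stdout", flag, str(psm)]


def run_tesseract(image_path, psm, legacy_flag=True):
    """Run tesseract and return the recognized text (requires tesseract to be installed)."""
    result = subprocess.run(
        build_tesseract_command(image_path, psm, legacy_flag),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file name: one pre-cropped strip containing a single text line.
    print(" ".join(build_tesseract_command("strip_1.png", psm=7)))
```

Whether a segmentation mode alone keeps the dotted line out of the boxes is not established in this thread, so it is something to try rather than a confirmed fix.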

Tom

Elan

Jul 16, 2015, 7:46:27 PM

to tesser...@googlegroups.com

Hi Tom,

Thanks for your response.

I have trained tesseract already. The problem with the lines described above came up during training, and it is still there after training (I used the same images to train tesseract). I'm basically asking if I can restrict the size of the font that tesseract attempts to recognize so that it doesn't include the dotted lines above the text. Anyway, I will try pre-processing: extracting each section and running tesseract on each section rather than on the whole image.

Cheers

Elan

Jon Symons

Oct 17, 2017, 1:43:45 AM

to tesseract-ocr

Hi,

Did you manage to find an answer?
