Hi all,
I am fairly new to Tesseract. I have played around with training new fonts, loading config files and so on, but I have an issue with the images I am trying to OCR.
In many cases, there is a dotted horizontal line about 5-10 pixels above the text. Tesseract mistakenly assumes this is part of the text and puts the box around both the character and the line above it.
An example is below:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ Example of lines above text
1. text to read
2. text to read
3. text to read
4. text to read
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
It reads lines 2 and 3 almost perfectly; however, lines 1 and 4 are inconsistent and the results vary. Most of the time the output is gibberish. This also makes it hard to train Tesseract properly, because the box files that get produced include the dotted line.
I was wondering if there is a parameter or configuration where I could set the maximum font size or maximum box size, to prevent it from including the lines above the text?
I could apply some morphological operations to get rid of the lines, but they are about the same thickness as the strokes of the font, and I worry that this would degrade the text.
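As an alternative to morphology, here is a minimal sketch of a row-pattern filter that only blanks rows that look like a dotted line, so the text rows are left untouched. It assumes the image has already been binarized into a list of rows with 1 for ink and 0 for background, and that the dotted line is close to horizontal. The function names and the thresholds (min_runs, max_run, min_span) are hypothetical and would need tuning on real images; a text row made of many thin strokes spanning the full width could be misclassified.

```python
def ink_runs(row):
    """Return the lengths of consecutive ink (truthy) pixel runs in one row."""
    runs, n = [], 0
    for px in row:
        if px:
            n += 1
        elif n:
            runs.append(n)
            n = 0
    if n:
        runs.append(n)
    return runs


def looks_dotted(row, min_runs=8, max_run=6, min_span=0.8):
    """Heuristic: many short ink runs that together span most of the row width."""
    runs = ink_runs(row)
    if not runs or len(runs) < min_runs or max(runs) > max_run:
        return False
    cols = [x for x, px in enumerate(row) if px]
    span = cols[-1] - cols[0] + 1
    return span >= min_span * len(row)


def erase_dotted_rows(img, **thresholds):
    """Return a copy of img with every dotted-looking row set to background."""
    return [
        [0] * len(row) if looks_dotted(row, **thresholds) else list(row)
        for row in img
    ]
```

Because only whole rows that match the dotted pattern are cleared, and the line sits several pixels away from the text, the character strokes themselves are not eroded the way they might be with an opening or closing operation.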
I know Tesseract requires a minimum font size of about 10 to get acceptable results, so I was wondering if there is a way to set the maximum font size as well.
The font size should be fairly even across the images (obviously camera distortion may result in an offset of a pixel or two, but it is roughly the same).
I am aware I could segment the image and pull out the regions in between the lines. I guess I am just seeing if there is a quick configuration or parameter I could pass to satisfy this requirement?
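For reference, the segmentation idea could be sketched roughly as follows: group consecutive rows that contain ink into bands, drop bands that are too thin to be text (the dotted lines), and crop the rest. This again assumes a binarized image stored as a list of rows (1 for ink, 0 for background) and relies on the blank gap between the dotted line and the text. The names ink_bands and crop_text_bands and the min_height and margin values are hypothetical, not anything provided by Tesseract.

```python
def ink_bands(img):
    """Return (top, bottom) row ranges, bottom exclusive, for each run of rows with ink."""
    bands, start = [], None
    for y, row in enumerate(img):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(img)))
    return bands


def crop_text_bands(img, min_height=8, margin=2):
    """Return ((top, bottom), crop) for each band tall enough to be a text line.

    Thin bands, such as a dotted rule a few pixels high, are skipped.
    margin adds a little blank padding and should stay smaller than the
    gap between the rule and the text, or the crop may include the rule.
    """
    crops = []
    for top, bottom in ink_bands(img):
        if bottom - top < min_height:
            continue
        lo = max(0, top - margin)
        hi = min(len(img), bottom + margin)
        crops.append(((top, bottom), [list(r) for r in img[lo:hi]]))
    return crops
```

Each crop could then be passed to Tesseract on its own, for instance with page segmentation mode 7 (treat the image as a single text line), so the dotted lines never reach the recognizer.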
Can anyone help me?
Is pre-processing the only way to solve this?
Thanks,
Elan