Box file generator combines vertical lines across rows of text

Cameron McSweeney

Apr 24, 2018, 11:29:21 AM

to tesseract-ocr

I am working on character recognition at work so I can copy information from tables in giant TIFF files and write a program that can automatically use the information from those tables. The tables are computer-generated, but the information is unavailable to me in any format besides TIFF. The font is wonderfully consistent and relatively few characters are used, so this should be a fairly easy task.

I have had mild success training Tesseract 3.05, but whenever I make the box file for training, Tesseract combines vertical lines across rows into one tall, skinny box. The errant box character value is always a tilde (~) and the pixels are disqualified from being used in the correct letters. I have attached a picture that should better explain my problem.
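For reference, here is a minimal sketch of how such boxes could be spotted automatically in a generated box file. It assumes the Tesseract 3.x box-file layout of one glyph per line, `<glyph> <left> <bottom> <right> <top> <page>`. The function names, the sample coordinates, and the `ratio` threshold are hypothetical and chosen only for illustration; they are not part of Tesseract.

```python
from statistics import median


def parse_box_line(line):
    """Split one box-file line into (glyph, left, bottom, right, top)."""
    fields = line.split()
    glyph = fields[0]
    left, bottom, right, top = (int(value) for value in fields[1:5])
    return glyph, left, bottom, right, top


def find_tall_boxes(lines, ratio=2.0):
    """Return boxes whose height exceeds `ratio` times the median box height.

    `ratio` is an illustrative threshold, not a Tesseract setting.
    """
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    if not boxes:
        return []
    heights = [top - bottom for _, _, bottom, _, top in boxes]
    typical = median(heights)
    return [box for box, height in zip(boxes, heights) if height > ratio * typical]


# Made-up sample: three normal glyphs and one tall, skinny "~" box.
sample = [
    "A 10 100 30 130 0",
    "B 35 100 55 130 0",
    "C 60 100 80 130 0",
    "~ 90 20 94 130 0",
]
suspicious = find_tall_boxes(sample)
print(suspicious)
```

A check like this only flags the errant boxes for manual correction in a box editor; it does not stop Tesseract from producing them.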

Is there a way to prevent this? I created a completely new language (not .eng) for Tesseract with a box/tiff pair that did not include any of those bars, but when I recreate the box file with the new language the tall, incorrect boxes are still made.

Any help would be appreciated.

Thanks,

Cameron

Vertical Boxes.PNG

Cameron McSweeney

Apr 24, 2018, 2:33:01 PM

to tesseract-ocr

Tesseract seems to be much too willing to find vertical lines. For example, a D gets divided so that the straight left stroke is boxed separately from the curved right portion. The font is fixed, so that shouldn't happen.

ShreeDevi Kumar

Apr 24, 2018, 2:58:18 PM

to tesser...@googlegroups.com

Have you tried the latest version, tesseract 4.0.0beta?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/3bf929b0-1446-47ac-9a68-eaa376b63c71%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Cameron McSweeney

Apr 24, 2018, 4:30:24 PM

to tesseract-ocr

Yes, and the box files that 4.0 made had the same problem. The accuracy with 4.0 was much better, but it still needs some tweaking, so I figured I would be better off fixing the problem in 3.05.

ShreeDevi Kumar

Apr 25, 2018, 1:28:35 AM

to tesser...@googlegroups.com

Please provide a sample TIFF for testing; a single page will do.

Cameron McSweeney

Apr 25, 2018, 7:25:16 AM

to tesseract-ocr

After some experimenting, I found that setting the page segmentation mode (PSM) to 6 worked well. I have still attached the TIFF and the trained data file I am using.
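For anyone landing on this thread later, a box-file run with PSM 6 (which tells Tesseract to assume a single uniform block of text) might be assembled roughly as below. This is a sketch, not the exact command used above: the file names and the language name `tbl` are hypothetical, and the spelling of the flag is an assumption that depends on the build (`--psm` in newer releases, `-psm` in older ones), so it is left as a parameter.

```python
def makebox_command(image_path, output_base, lang, psm=6, psm_flag="--psm"):
    """Build the argument list for a box-file run with a given page segmentation mode.

    `psm_flag` is a parameter because older builds spell it "-psm";
    check `tesseract --help` for the installed version.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-l", lang,
        psm_flag, str(psm),
        "batch.nochop",  # config files go after the options
        "makebox",
    ]


if __name__ == "__main__":
    # Hypothetical file and language names, for illustration only.
    command = makebox_command("table.tif", "table", "tbl")
    print(" ".join(command))
    # To actually run it, pass the list to subprocess.run(command, check=True).
```

Building the command as a list rather than a single string avoids shell-quoting problems when file names contain spaces, such as the attachment name in this thread.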