I am working on character recognition at work so that I can extract data from tables in very large TIFF files and feed it automatically into a program. The tables are computer-generated, but the data is not available to me in any format other than TIFF. The font is wonderfully consistent and only a few distinct characters are used, so this should be a fairly easy task.
I have had some success training Tesseract 3.05, but whenever I make the box file for training, Tesseract combines vertical lines across rows into one tall, skinny box. The character assigned to the errant box is always a tilde (~), and its pixels are then excluded from the boxes of the letters they actually belong to. I have attached a picture that should better explain my problem.
Is there a way to prevent this? I created a completely new language (not eng) for Tesseract from a box/TIFF pair that did not include any of those bars, but when I regenerate the box file with the new language, the tall, incorrect boxes still appear.
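To make the symptom concrete, here is a minimal Python sketch that scans a box file for the tall, skinny tilde boxes described above. It assumes the usual Tesseract 3.x box file layout of one "<char> <left> <bottom> <right> <top> <page>" entry per line. The function names and the min_aspect threshold are hypothetical choices for illustration only. The sketch only locates the bad boxes; it does not stop Tesseract from creating them.

```python
def parse_box_line(line):
    """Parse one box-file line, assumed to be
    '<char> <left> <bottom> <right> <top> <page>'.
    Returns a tuple, or None if the line does not match that layout."""
    parts = line.rstrip("\n").rsplit(" ", 5)
    if len(parts) != 6:
        return None
    try:
        left, bottom, right, top, page = (int(p) for p in parts[1:])
    except ValueError:
        return None
    return parts[0], left, bottom, right, top, page


def find_tall_tilde_boxes(lines, min_aspect=8.0):
    """Return (line_number, box) pairs for '~' boxes whose height is at
    least min_aspect times their width. min_aspect is an assumed
    threshold and would need tuning for a real image."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        box = parse_box_line(line)
        if box is None:
            continue
        char, left, bottom, right, top, _page = box
        width = max(right - left, 1)  # avoid division by zero
        height = top - bottom
        if char == "~" and height / width >= min_aspect:
            flagged.append((number, box))
    return flagged
```

The lines passed in could come from reading the generated box file with open(path, encoding="utf-8"). The flagged line numbers show which entries would need to be corrected by hand or in a box editor.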
Any help would be appreciated.
Thanks,
Cameron