Tesseract not seeing all my decimals


T G


Feb 23, 2016, 3:16:47 PM

to tesseract-ocr

I'm scanning in a bunch of tax documents that can't be given to me electronically: some property descriptions and lots of numbers. For the most part, Tesseract is a godsend and saves me a lot of retyping. The biggest source of manual correction seems to be missing decimal points. In the source PDF they're clear as day, if admittedly on the small side. They survive the Ghostscript conversion to TIFF, and above a certain point the resolution of the conversion doesn't matter (low resolution makes everything worse, of course, but I stay away from that). When I tell Tesseract to write out the converted image (tessedit_write_images), I can still see the little guys there, but Tesseract misses ~5%-10% of them.

Is there a good option to play with to bump up Tesseract's sensitivity to the decimal point in, say, "76.50"? My own trial and error isn't turning up much. I thought I could use eng.user-patterns, but it doesn't like wildcards in the first four characters, and that would seem to be exactly what I'd want to do -- I feel like I'd want a Perl-type regular expression of (\d{1,3}\.\d{2}-?) (1-3 digits followed by a decimal point and precisely two digits, with an optional negative sign at the end), but that doesn't seem to be an option.
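
To make the pattern I have in mind concrete, here is a minimal Python sketch of it, written with the {1,3} range quantifier that Perl-style engines expect. It is not a Tesseract feature, just a check I could run on the text output afterwards. The names is_amount and suspicious_tokens are made up for illustration, and the "3-5 bare digits" rule is only my assumption about what an amount with a dropped decimal point would look like; it would also flag legitimate whole numbers such as years.

```python
import re

# Amount shape described above: 1-3 digits, a decimal point,
# exactly two digits, and an optional trailing minus sign.
AMOUNT = re.compile(r"\d{1,3}\.\d{2}-?")

# Assumed shape of an amount whose decimal point was dropped by OCR:
# the same 3-5 digits run together, optional trailing minus sign.
MISSING_POINT = re.compile(r"\d{3,5}-?")


def is_amount(token: str) -> bool:
    """Return True if the whole token looks like an amount such as 76.50 or 76.50-."""
    return AMOUNT.fullmatch(token) is not None


def suspicious_tokens(text: str) -> list[str]:
    """Return whitespace-separated tokens that may be amounts missing their decimal point."""
    return [tok for tok in text.split() if MISSING_POINT.fullmatch(tok)]


if __name__ == "__main__":
    sample = "Total 7650 paid 12.00 and 310-"
    print(suspicious_tokens(sample))
```

Something like this would only point me at the tokens to recheck by hand; it doesn't help Tesseract see the decimal point in the first place, which is what I'm really after.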
