Bank statement hOCR issues

Andrew Lentvorski

Dec 6, 2015, 1:17:26 PM

to tesseract-ocr

I'm trying to OCR a batch of bank statements, and I'm having difficulty with the hOCR output. I could use some overall advice as well as help with a few specific issues.

1) The insertion of tags like <strong> without a corresponding bbox attribute is really irritating when trying to extract text programmatically. Can I turn this off somehow without recompiling the universe? (Personally, I don't see why these aren't attributes anyway. What should really be returned is a *weight* or *slant* value, and those don't map cleanly onto HTML without CSS.)

2) My .tif files look a bit ... "fuzzy" after thresholding and deskewing. Any suggestions for filtering that would help Tesseract? (It really fumbles this one, completely missing the 9 and 5 before the decimal point.) ImageMagick tends to be my Swiss-army chainsaw for such operations, but if I need a different tool, I am open to it.

3) This font confuses Tesseract a bit (lowercase l (L) is particularly bad, for obvious reasons). Is there any way to help it out by indicating font characteristics?

Overall, though, things aren't bad. ABBYY is probably about 10% more accurate on word detection; it seems to work much harder to detect and preserve clusters of characters as a single word. Tesseract occasionally splits things like "3,475.56" into "3", "475" and "56" and loses either the comma or the period. This happens about twice per page, which is fairly irritating.
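To make the splitting problem concrete, here is a minimal Python sketch of one possible post-processing heuristic: glue neighbouring numeric fragments back together when their boxes sit on the same line with only a small horizontal gap. The function name `merge_numeric_fragments`, the `(text, bbox)` tuple layout, and the 12-pixel `max_gap` default are all assumptions made for illustration and would need tuning per scan resolution. It cannot recover a separator that Tesseract dropped; it only rejoins the pieces into one token.

```python
import re

# A fragment counts as "numeric" if it holds only digits, commas, and periods.
NUMERIC = re.compile(r"^[\d.,]+$")

def merge_numeric_fragments(words, max_gap=12):
    """Merge adjacent numeric fragments that are close together on one line.

    words: list of (text, (x0, y0, x1, y1)) tuples in reading order.
    max_gap: largest horizontal gap in pixels still treated as one token
             (hypothetical default; tune for your scans).
    """
    merged = []
    for text, box in words:
        if merged:
            prev_text, prev_box = merged[-1]
            gap = box[0] - prev_box[2]
            # Boxes share a line if their vertical extents overlap.
            same_line = box[1] < prev_box[3] and prev_box[1] < box[3]
            if (NUMERIC.match(prev_text) and NUMERIC.match(text)
                    and same_line and 0 <= gap <= max_gap):
                union = (min(prev_box[0], box[0]), min(prev_box[1], box[1]),
                         max(prev_box[2], box[2]), max(prev_box[3], box[3]))
                merged[-1] = (prev_text + text, union)
                continue
        merged.append((text, box))
    return merged
```

Because the gap threshold is small, amounts in separate columns of a statement stay separate; only fragments that nearly touch are joined.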

Layout detection is, as for any OCR, a disaster. It's remarkable how hard it is to code layout. I wonder if it wouldn't be better to just have a list of the words with their bbox and attributes rather than a bunch of _area, _line, etc. elements that are all just broken. Maybe if I were digitizing books this would appeal to me more, as presumably lines/paragraphs/pictures are easier to detect.
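The flat "words plus bbox" view described above can be approximated after the fact. The following is a minimal sketch using only Python's standard `html.parser`; it assumes that each word is a `span` with class `ocrx_word` and a `title` attribute beginning with `bbox x0 y0 x1 y1`, as in the hypothetical `SAMPLE` snippet. Inline tags such as `<strong>` or `<em>` inside a word are ignored, so their missing bbox no longer matters. The class name `WordCollector` and the helper `extract_words` are illustrative names, not part of Tesseract.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

class WordCollector(HTMLParser):
    """Collect (text, bbox) for each ocrx_word span, ignoring nested inline tags."""

    def __init__(self):
        super().__init__()
        self.words = []     # finished (text, (x0, y0, x1, y1)) tuples
        self._bbox = None   # bbox of the word currently open, if any
        self._text = []     # text fragments seen inside that word
        self._depth = 0     # span nesting depth inside the current word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is None:
            classes = (attrs.get("class") or "").split()
            if tag == "span" and "ocrx_word" in classes:
                m = BBOX_RE.search(attrs.get("title") or "")
                if m:
                    self._bbox = tuple(int(v) for v in m.groups())
                    self._text = []
                    self._depth = 1
        elif tag == "span":
            self._depth += 1   # a span nested inside the word

    def handle_endtag(self, tag):
        if self._bbox is not None and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._text).strip(), self._bbox))
                self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

def extract_words(hocr):
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()
    return parser.words

# Hypothetical hOCR fragment, written by hand for illustration.
SAMPLE = """<span class='ocr_line' title='bbox 10 10 300 40'>
<span class='ocrx_word' id='word_1_1' title='bbox 10 12 80 38; x_wconf 91'><strong>Balance</strong></span>
<span class='ocrx_word' id='word_1_2' title='bbox 90 12 200 38; x_wconf 88'>3,475.56</span>
</span>"""

for text, bbox in extract_words(SAMPLE):
    print(text, bbox)
```

The area, paragraph, and line containers are simply skipped, so downstream code can apply its own layout rules to the word boxes.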

Anyway, thanks for all the hard work. I couldn't even have tried to do this programmatically without Tesseract.

[Inline images 1-5 attached to the original post]
