Retaining table structure with OCR

Javed Shaikh

May 10, 2016, 7:35:06 AM

to tesseract-ocr

Hi,

I have to load non-searchable PDFs, mainly invoices. Most are scans of Excel-generated data in tabular format. I am able to read the data within these tables; however, in some cases the position or column of a particular value in the table is important to me, because it determines which attributes I need to set in my code.

Some of the scans are pretty complex (certain columns are blank, so I need to assume a 0 or blank value), but after OCR these minor yet significant details are lost. For confidentiality reasons I cannot share the complete images, but I can share parts of them. Any help with this effort is appreciated.

Thanks,

Javed

test3-snap.png

Tom Morris

May 10, 2016, 11:51:09 AM

to tesseract-ocr

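Tesseract's hOCR output includes the bounding-box coordinates of where on the page each piece of text was found. You could parse it with your favorite XML parser and use those coordinates as a starting point for working out which column a value belongs to.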
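
To make the hOCR suggestion concrete, here is a minimal Python sketch of how the coordinates might be used to assign words to columns. It rests on several assumptions that are not from the thread: the hOCR fragment is hand-written for illustration (real output would come from something like `tesseract invoice.png out hocr`), the column boundaries in `COLUMNS` are made-up pixel ranges you would have to measure for your own invoices, and each table row is assumed to arrive as a single `ocr_line`. On real scans Tesseract may split a row across several lines or blocks, in which case grouping words by their vertical position would be needed instead.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, hand-written hOCR fragment (not real Tesseract output).
# The second row has no word in the "qty" column.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="ocr_page" title="bbox 0 0 600 200">
<span class="ocr_line" title="bbox 10 10 590 40">
<span class="ocrx_word" title="bbox 10 10 80 40; x_wconf 95">Widget</span>
<span class="ocrx_word" title="bbox 210 10 240 40; x_wconf 93">2</span>
<span class="ocrx_word" title="bbox 410 10 470 40; x_wconf 96">19.98</span>
</span>
<span class="ocr_line" title="bbox 10 60 590 90">
<span class="ocrx_word" title="bbox 10 60 80 90; x_wconf 94">Gadget</span>
<span class="ocrx_word" title="bbox 410 60 470 90; x_wconf 91">5.00</span>
</span>
</div>
</body>
</html>"""

# hOCR stores coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Assumed column layout: (name, left edge, right edge) in pixels.
COLUMNS = [("item", 0, 200), ("qty", 200, 400), ("amount", 400, 600)]


def class_list(elem):
    """Return the element's class attribute as a list of class names."""
    return elem.get("class", "").split()


def parse_rows(hocr_text, columns=COLUMNS):
    """Map each ocr_line to a dict of column name -> text ("" if empty)."""
    root = ET.fromstring(hocr_text)
    rows = []
    for line in root.iter():
        if "ocr_line" not in class_list(line):
            continue
        # Start with every column blank so missing cells stay visible.
        row = {name: "" for name, _, _ in columns}
        for word in line.iter():
            if "ocrx_word" not in class_list(word):
                continue
            match = BBOX_RE.search(word.get("title", ""))
            if match is None:
                continue
            x0, _, x1, _ = map(int, match.groups())
            center = (x0 + x1) / 2
            text = "".join(word.itertext()).strip()
            # Put the word in the column containing its horizontal center.
            for name, left, right in columns:
                if left <= center < right:
                    row[name] = (row[name] + " " + text).strip()
                    break
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in parse_rows(SAMPLE_HOCR):
        print(row)
```

Because every row starts with all columns set to an empty string, a blank cell shows up as `""` rather than silently shifting the later values left, which is the detail the original question was worried about losing.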