Combining -psm 4 with OSD?

Jarl Arntzen

Oct 18, 2018, 3:29:02 PM

to tesseract-ocr

Hi, all. I am OCRing 10k invoices for AI training and, as it turns out, Tesseract's -psm 4 with txt output is ideal for this, because it returns each individual line item as one uninterrupted line of text across the page, including all columns.

Example:

Product     Description        Quantity       Unit Price     Total
1001        Boots              2              $ 100.00       $ 200.00

The only drawback is that -psm 4 does not use OSD (Orientation and Script Detection), so it only works on invoices that are already correctly oriented. To solve this I will first have to run -psm 0 to get an individual .osd file with the orientation of each file/page, and then run convert -rotate 90 on the .TIF files where the invoice orientation is not already correct.
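
A minimal sketch of that two-pass workaround in Python, for illustration only. It assumes the `tesseract` and ImageMagick `convert` binaries are on the PATH, that the .osd file contains a `Rotate: <degrees>` line, and that this value is the clockwise correction to apply (worth verifying on a few sample pages). The function names and the `psm_flag` parameter are hypothetical; newer Tesseract releases spell the flag `--psm` instead of `-psm`.

```python
import re
import subprocess
from typing import Optional


def parse_osd_rotation(osd_text: str) -> Optional[int]:
    """Return the value of the 'Rotate:' field in OSD output, or None if absent."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def ocr_with_orientation_fix(tif_path: str, out_base: str, psm_flag: str = "-psm") -> None:
    """Run OSD (psm 0), rotate the page if needed, then OCR it with psm 4."""
    # Pass 1: orientation detection only; this writes <out_base>.osd.
    subprocess.run(["tesseract", tif_path, out_base, psm_flag, "0"], check=True)
    with open(out_base + ".osd", encoding="utf-8") as fh:
        rotation = parse_osd_rotation(fh.read()) or 0

    # Rotate only when OSD reports a non-zero correction.
    # Assumption: 'Rotate' is the clockwise angle, matching convert -rotate.
    ocr_input = tif_path
    if rotation:
        ocr_input = out_base + "_rotated.tif"
        subprocess.run(["convert", tif_path, "-rotate", str(rotation), ocr_input], check=True)

    # Pass 2: full-width line extraction; this writes <out_base>.txt.
    subprocess.run(["tesseract", ocr_input, out_base, psm_flag, "4"], check=True)
```

Calling `ocr_with_orientation_fix("invoice_0001.tif", "invoice_0001")` would leave the text in `invoice_0001.txt`; the file names here are placeholders.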

My question is: Can I somehow create my own -psm 4, combining the full width text extraction with the Orientation (and Script Detection) from -psm 1?

Or is there any other way to somehow invoke OSD or ensure full page width text as with -psm 4?