Hi all. I am OCRing 10k invoices for AI training and, as it turns out, Tesseract's -psm 4 with txt output is ideal for this: it returns each individual line item as one uninterrupted line of text across the full page width, including all columns.
Example:
Product Description Quantity Unit Price Total
1001 Boots 2 $ 100.00 $ 200.00
The only drawback is that -psm 4 does not use OSD (Orientation and Script Detection), so it only works on invoices that are already correctly oriented. To solve this I will first have to run -psm 0 to get an individual .osd file with the orientation of each file/page, and then run convert -rotate (e.g. -rotate 90) on the .TIF files whose orientation is not already correct.
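To make the workaround concrete, here is a rough sketch of the two-pass approach in Python. It rests on a few assumptions: that the .osd output contains a line such as "Rotate: 90", that this value can be passed straight to ImageMagick's convert -rotate, and that the file names and the helper names parse_rotation and build_commands are purely illustrative. It only builds the command lines; it does not run them.

```python
import re


def parse_rotation(osd_text):
    """Return the rotation in degrees from OSD output, or 0 if none is found.

    Assumes the OSD text has a line of the form "Rotate: <degrees>".
    """
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else 0


def build_commands(tif_path, out_base, rotation, psm_flag="-psm"):
    """Build the command lines for one page: optional rotate, then OCR.

    psm_flag is "-psm" as used in this post; newer Tesseract
    versions (4.x and later) expect "--psm" instead.
    """
    commands = []
    ocr_input = tif_path
    if rotation % 360 != 0:
        # Write the corrected page to a new file (hypothetical naming scheme).
        ocr_input = out_base + "_rotated.tif"
        commands.append(["convert", tif_path, "-rotate", str(rotation), ocr_input])
    # Page segmentation mode 4 on the (now upright) image.
    commands.append(["tesseract", ocr_input, out_base, psm_flag, "4"])
    return commands


if __name__ == "__main__":
    # Sample OSD text standing in for the contents of a real .osd file.
    sample_osd = "Orientation in degrees: 270\nRotate: 90\nOrientation confidence: 5.49\n"
    rotation = parse_rotation(sample_osd)
    for cmd in build_commands("invoice_0001.tif", "invoice_0001", rotation):
        print(" ".join(cmd))
```

Each returned list could be handed to subprocess.run(cmd, check=True) to actually execute it. It works, but it means two Tesseract passes plus a rotate step for every page, which is what I would like to avoid.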
My question is: can I somehow create my own variant of -psm 4 that combines its full-width text extraction with the Orientation (and Script Detection) from -psm 1?
Or is there any other way to invoke OSD, or to ensure full-page-width lines of text as with -psm 4?
Thanks.