The immediate problem:
I'm using tesseract (3.04.1) from the command line to generate hOCR as follows:
tesseract files/test-cases0001.tif files/test1-psm3 -l eng -psm 3 hocr
But I'm seeing output that seems to be from OSD:
OSD: Weak margin (2.88) for 29022 blob text block, but using orientation anyway: 3
Is this the correct behavior? I was hoping to avoid orientation detection, since the docs say -psm 3 should have no OSD:
3 Fully automatic page segmentation, but no OSD.
As a quick test, I generated another hOCR file with psm set to 1 (to enable OSD, in theory) and ran a diff comparing it to the psm 3 file. The two appear identical.
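For reference, the comparison can be sketched roughly as below. This is only an illustration: compare_psm and the TESSERACT override variable are hypothetical names, and it assumes 3.04 writes its hOCR to <outputbase>.hocr.

```shell
# Hypothetical helper: OCR one image under two -psm values and diff the hOCR output.
# TESSERACT is an assumed override for the binary name; it defaults to "tesseract".
compare_psm() {
  img=$1 psm_a=$2 psm_b=$3
  tmp=$(mktemp -d) || return 2
  for psm in "$psm_a" "$psm_b"; do
    # Same invocation as in the question, varying only the -psm value.
    "${TESSERACT:-tesseract}" "$img" "$tmp/psm$psm" -l eng -psm "$psm" hocr \
      2>"$tmp/psm$psm.log" || { rm -rf "$tmp"; return 2; }
  done
  # Assumes the output lands in <outputbase>.hocr; diff exits 0 when the files match.
  diff "$tmp/psm$psm_a.hocr" "$tmp/psm$psm_b.hocr"
  status=$?
  rm -rf "$tmp"
  return "$status"
}
```

Called as "compare_psm files/test-cases0001.tif 1 3", it prints nothing and returns 0 when the two runs produce the same hOCR, which is what I observed.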
Is this a bug, or a misunderstanding on my part? I'd like to skip orientation detection if possible -- every page that reports an orientation of "3" (instead of "0") produces gibberish text.
The larger context (if it helps):
I have very low-noise 600 dpi grayscale TIFFs, all correctly oriented ("right-side-up," with no skew that I can detect). The content features complex layouts and is written almost entirely in English. I'm using tesseract from the command line to generate hOCR.
For some pages, I'm getting 95% gibberish (mostly 0's) back from tesseract. In the terminal output, these pages all mention a different "orientation" being assigned to them (the bad ones have "3," while the good ones have "0"). So I'd like to see if setting the orientation to 0 for all pages gets rid of the issue.
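To pick out the affected pages, the orientation value can be pulled from the console output. A minimal sketch, assuming the OSD messages always have the same shape as the "Weak margin" line quoted earlier (other OSD message formats would not match); osd_orientation is a hypothetical helper name:

```shell
# Hypothetical helper: read tesseract's console output on stdin and print the
# orientation digit from each line shaped like
#   OSD: Weak margin (...) for N blob text block, but using orientation anyway: 3
osd_orientation() {
  sed -n 's/^OSD: .*orientation anyway: \([0-3]\)[[:space:]]*$/\1/p'
}
```

Assuming the OSD messages are written to stderr, the command from the question can be piped through it with "tesseract files/test-cases0001.tif files/test1-psm3 -l eng -psm 3 hocr 2>&1 | osd_orientation". No output means no matching OSD line was seen for that page.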
I don't see any way to set the orientation directly, but I see in the docs that "-psm 3" should skip OSD and still do page segmentation, which I'm assuming will achieve the same result.