Segmentation without orientation-detection -- Shouldn't "-psm 3" skip OSD?

S

Dec 2, 2016, 4:45:24 AM

to tesseract-ocr

The immediate problem:

I'm using tesseract (3.04.1) from the command line to generate hOCR as follows:

tesseract files/test-cases0001.tif files/test1-psm3 -l eng -psm 3 hocr

But I'm seeing output that seems to be from OSD:

OSD: Weak margin (2.88) for 29022 blob text block, but using orientation anyway: 3

Is this the correct behavior? I was hoping to avoid orientation detection, since the docs say -psm 3 should have no OSD:

3 Fully automatic page segmentation, but no OSD.

As a quick test, I generated another hOCR file with -psm 1 (which, in theory, enables OSD) and diffed it against the -psm 3 file. The two appear identical.

Is this a bug, or a misunderstanding on my part? I'd like to skip orientation if possible -- all of the pages that mention getting the orientation of "3" (instead of "0") are producing gibberish text.

The larger context (if it helps):

I have very low-noise 600dpi grayscale tifs, all correctly oriented ("right-side-up," with no skew that I can detect). The content features complex layouts, and is written almost entirely in English. I'm using tesseract via command-line to generate hOCR.

For some pages, I'm getting 95% gibberish (mostly 0's) back from tesseract. In the terminal output, these pages all mention a different "orientation" being assigned to them (the bad ones have "3," while the good ones have "0"). So I'd like to see if setting the orientation to 0 for all pages gets rid of the issue.
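One way to list the affected pages is to filter the OSD lines out of the terminal output. The following is a rough sketch, not part of tesseract itself: the function names are made up for illustration, and the pattern assumes the exact message format quoted above, which other versions may word differently.

```shell
# Print the orientation value from each OSD diagnostic line read on stdin.
# Assumes lines ending in "orientation anyway: N", as quoted in this post.
osd_orientations() {
  sed -n 's/^OSD:.*orientation anyway: \([0-9][0-9]*\)[[:space:]]*$/\1/p'
}

# Exit 0 if the log on stdin reports at least one non-zero orientation.
has_nonzero_orientation() {
  osd_orientations | grep -qv '^0$'
}

# Example (assumes the diagnostics are written to stderr, hence 2>&1):
#   tesseract page.tif out -l eng -psm 3 hocr 2>&1 \
#     | has_nonzero_orientation && echo "page.tif"
```

Running the commented example in a loop over the tifs would print only the pages that OSD rotated, which should match the pages producing gibberish if the orientation is in fact the cause.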

I don't see any way to set the orientation directly, but I see in the docs that "-psm 3" should skip OSD and still do page segmentation, which I'm assuming will achieve the same result.

S

Dec 2, 2016, 6:08:09 PM

to tesseract-ocr

Strangely, I've had some luck by removing the "-psm 3" argument and using "-c tessedit_pageseg_mode=3" instead. There's no more output from OSD, and I'm getting mostly good hOCR from the problem pages.
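For reference, the workaround applied to the original command might look like the sketch below. The wrapper function, the output naming, and the TESSERACT override variable are hypothetical conveniences and not tesseract features; only the "-c tessedit_pageseg_mode=3" argument comes from the message above.

```shell
# Run OCR on one .tif page, selecting the segmentation mode through the
# tessedit_pageseg_mode config variable instead of the -psm flag.
# TESSERACT is a hypothetical override, e.g. TESSERACT=echo for a dry run.
ocr_page() {
  src=$1
  base=${src%.tif}-psm3    # output base name passed to tesseract
  "${TESSERACT:-tesseract}" "$src" "$base" -l eng -c tessedit_pageseg_mode=3 hocr
}

# Example:
#   ocr_page files/test-cases0001.tif
```

Setting TESSERACT=echo prints the command that would be run, which makes it easy to confirm the arguments before processing a whole batch.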

As a side note, the output of tesseract --print-parameters may be out of date. It claims "2=auto" and "3=col" and says the numbers are based on the "PageSegMode enum in publictypes.h," but in publictypes.h itself, 3 is auto and 4 is col. Maybe that needs to be updated?
