Best way to find the best psm?

H. Mijail Antón Quiles

Jul 1, 2023, 7:53:35 AM
to tesseract-ocr
tesseract 5.3.1 on macOS 13.4.1

I have a PDF containing a scanned, single-column page from a book. The text seems to be extracted OK, but with psm 4 and psm 6 it can't be selected linearly in macOS's Preview.app: while selecting, the selection jumps between words on different lines. Selection works well in Adobe Acrobat, though.

With psm 11, selection works well in every reader... as far as I have tried. But checking this is a manual and error-prone process.

So my questions are:
  • Should I just keep using psm 11, or is there a reason to prefer one of the others? Is there a deeper explanation of what each psm does?
  • Is there any way to quickly diagnose what the page segmentation did? For example, it would be nice to have a debug mode where the center of each letter is connected by a line to the next letter; that way any unexpected jump in the flow would be immediately visible.
  • I suspect something like that already exists, but I couldn't find anything. --loglevel prints nothing, no matter which level I select. The debug viewer's description sounds like it won't help in my case. I have tried setting various config variables (textord_debug_baselines sounded promising), but for most of them I didn't see any output. Am I missing something?
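
The kind of check described above could be roughly approximated at word level from tesseract's TSV output. This is only a sketch, not an existing tesseract debug feature: it assumes the page is available as an image (the name page.png is made up) and that something like `tesseract page.png out --psm 6 tsv` has produced out.tsv. The helper names word_centers and find_jumps are hypothetical, and whether the TSV row order matches the order of the PDF text layer is an assumption, not something confirmed here.

```python
import csv
import io


def word_centers(tsv_text):
    """Return (text, center_x, center_y, height) per word, in TSV output order.

    Assumes tesseract's TSV columns: level, page_num, block_num, par_num,
    line_num, word_num, left, top, width, height, conf, text.
    Rows with level 5 are words; other levels (page, block, ...) are skipped.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if row.get("level") != "5" or not text:
            continue
        left, top, width, height = (
            int(row[key]) for key in ("left", "top", "width", "height")
        )
        words.append((text, left + width / 2, top + height / 2, height))
    return words


def find_jumps(words):
    """Return pairs of consecutive words where the flow moves up the page.

    In single-column text the output order should only move right or down,
    so a move upward by more than half a word height is flagged as suspect.
    The half-height threshold is an arbitrary choice for this sketch.
    """
    jumps = []
    for (t1, _x1, y1, h1), (t2, _x2, y2, h2) in zip(words, words[1:]):
        if y2 < y1 - 0.5 * max(h1, h2):
            jumps.append((t1, t2))
    return jumps
```

Usage would be along the lines of `find_jumps(word_centers(open("out.tsv", encoding="utf-8").read()))`, run once per psm value; an empty list suggests the word order never moves back up the page, which is the property that seems to matter for linear selection.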



