tesseract 5.3.1 on macOS 13.4.1
I have a PDF containing a scanned page from a book, single column. The text seems to get extracted OK, but with psm 4 and 6 the text can't be selected linearly in macOS' Preview.app; instead, while selecting, the selection jumps between words across lines. Selection works well in Adobe Acrobat, though.
With psm 11, selection works well in every reader I have tried so far. But checking this is a manual and error-prone process.
So my questions are:
- should I just keep using psm 11, or is there a reason to prefer one over the others? Is there some deeper explanation of what each psm does?
- is there any way to quickly diagnose what the page segmentation did? For example, it would be nice to have a debug mode where the center of each letter is connected by a line to the next letter; that way any unexpected jump in the flow would be immediately visible.
- I suspect that something like that must already exist, but I couldn't find anything. --loglevel prints nothing, no matter which level I select. The debug viewer description sounds like it won't help in my case. I have tried setting various config variables (textord_debug_baselines sounded promising), but for most of them I didn't see any output. Am I missing something?
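
To make the second question concrete, here is a minimal Python sketch of the kind of check I have in mind. It works on words rather than letters, and it reads Tesseract's TSV output (for example from `tesseract page.png out --psm 6 tsv`, where `page.png` and `out` are placeholder names). It assumes that the row order in the TSV matches the order of the words in the PDF text layer, which I have not verified, and the half-word-height threshold is an arbitrary choice. The function names `word_centers` and `find_jumps` are made up for this illustration.

```python
import csv
import io


def word_centers(tsv_text):
    """Return (text, cx, cy, height) for every non-empty word row of a TSV dump."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # In Tesseract's TSV output, level 5 rows describe single words.
        if row.get("level") != "5" or not text:
            continue
        left, top = int(row["left"]), int(row["top"])
        width, height = int(row["width"]), int(row["height"])
        words.append((text, left + width / 2, top + height / 2, height))
    return words


def find_jumps(words):
    """Return pairs of consecutive words where the order moves back up the page.

    On a single-column page the reading order should never go upward, so an
    upward step of more than half a word height is reported as a jump.
    """
    jumps = []
    for (t1, _x1, y1, h1), (t2, _x2, y2, h2) in zip(words, words[1:]):
        if y2 < y1 - 0.5 * max(h1, h2):
            jumps.append((t1, t2))
    return jumps


# Tiny hand-written sample: two lines whose words are interleaved.
HEADER = "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext"
SAMPLE = "\n".join([
    HEADER,
    "5\t1\t1\t1\t1\t1\t10\t10\t40\t20\t95\tHello",
    "5\t1\t1\t1\t2\t1\t10\t50\t40\t20\t95\tbelow",
    "5\t1\t1\t1\t1\t2\t60\t10\t40\t20\t95\tworld",
    "5\t1\t1\t1\t2\t2\t60\t50\t40\t20\t95\tagain",
])
```

Calling `find_jumps(word_centers(SAMPLE))` reports the step from "below" back up to "world". For a real page, the contents of the generated `.tsv` file would be passed in place of `SAMPLE`; the same centers could also be plotted as a connected line to get the visual version described above.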