I am using Tesseract 4.0 to extract text from scanned PDF documents. I first use pdftoppm to split the document into pages, each saved as a PNG file, and then run the following command on each page to perform OCR:
tesseract page.png stdout -l eng --psm 4
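For context, the two steps described above can be sketched as a single shell function. This is only an illustration: the function name ocr_pdf, the temporary directory, and the 300 DPI resolution are assumptions made for the sketch, not necessarily the exact settings used here.

```shell
# ocr_pdf PDF [PSM]
# Rasterise every page of PDF to a PNG, then OCR the images in page order.
# PSM defaults to 4 (single column of text of variable sizes).
ocr_pdf() {
    pdf=$1
    psm=${2:-4}
    workdir=$(mktemp -d) || return 1

    # -png selects PNG output; -r 300 is an assumed resolution in DPI.
    # Output files are named <prefix>-<page>.png.
    pdftoppm -png -r 300 "$pdf" "$workdir/page" || { rm -rf "$workdir"; return 1; }

    # pdftoppm zero-pads page numbers, so the glob expands in page order.
    for img in "$workdir"/page-*.png; do
        [ -e "$img" ] || continue   # skip if no pages were produced
        tesseract "$img" stdout -l eng --psm "$psm"
    done

    rm -rf "$workdir"
}
```

Calling `ocr_pdf scan.pdf` (scan.pdf being a placeholder name) prints the recognised text of all pages to standard output, and an optional second argument makes it easy to compare page segmentation modes on the same document.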
The pages generally have section numbers down the left-hand side. Sometimes these numbers are extracted as one column of text and the body text is extracted as a second column. Since I have set --psm 4, I expect the entire page to be returned as a single column, and for some pages that is indeed what I get.
Why does Tesseract sometimes extract the text in columns even when I tell it not to, and what can I do about it?