Is there a way to get end-of-page (FF) encoded in PDF?


ArtmanDC

Jan 14, 2022, 1:21:46 PM
to tesseract-ocr
In my project I am scanning images on microfilm, then using Tesseract (v. 5.0.0) to create a PDF including the OCR'ed text layer.

The input images are text (monospaced typewriter), and I combine several (2-8 typically) images in a multipage tif.

I use the following command in Windows 10:

tesseract multipage.tif output --psm 4 pdf

This works as expected, producing a multi-page output.pdf. (I added the <--psm 4> after I discovered that when several consecutive lines had word spaces above each other, the program interpreted this as a gap between columns, leading to unwanted results.)

As a check in my workflow, I select everything in the PDF (CTRL-A) and copy/paste it into my editor (Notepad++). This pastes the OCR text from all pages in the document.

The result is reasonably good except that paragraph and page breaks are not indicated. Line breaks are.

If I replace the <pdf> with a <txt> in the command, the resulting text file has a blank line between paragraphs <LF LF> (Linux style, even though I'm using Windows) and a page break <FF>  at the end of each page.
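
For illustration, the structure of that text file can be recovered programmatically. This is a minimal Python sketch, assuming the file uses a form feed (FF) at the end of each page and a blank line between paragraphs, as described above. The function name split_pages and the sample string are hypothetical, and the sketch only reads the text output; it does not change the PDF text layer.

```python
def split_pages(text):
    """Split Tesseract-style text output into pages, each a list of paragraphs.

    Assumes a form feed ("\f") ends each page and a blank line
    separates paragraphs, as observed in the txt output.
    """
    pages = []
    for raw_page in text.split("\f"):
        # Normalize Windows line endings in case the file was converted.
        raw_page = raw_page.replace("\r\n", "\n")
        # A blank line (two consecutive line feeds) marks a paragraph break.
        paragraphs = [p.strip("\n") for p in raw_page.split("\n\n")]
        paragraphs = [p for p in paragraphs if p.strip()]
        if paragraphs:
            pages.append(paragraphs)
    return pages


# Hypothetical sample standing in for the contents of the text output.
sample = "First paragraph,\nline two.\n\nSecond paragraph.\n\fPage two text.\n\f"
for number, page in enumerate(split_pages(sample), start=1):
    print(f"page {number}: {len(page)} paragraph(s)")
```

For a real run, the sample string would be replaced by the contents of the text file produced by the txt command.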

I would like my PDF text layer to have the more user-friendly display that tesseract deploys in a text file.

Is this possible?  If so, how?

Thanks!


Anand babu

Sep 17, 2022, 2:08:28 PM
to tesseract-ocr
Hi Artman, I'm working on a similar project that converts PDF to image to text to an editable PDF for ML. Could you please share your GitHub code?