Choosing background when generating output using PDF config.


Jonas Winkler

Nov 11, 2020, 11:25:07 AM
to tesseract-ocr
Hello.

I have an input document, input.pdf. It comes straight from a scanner, so I do some preprocessing to improve accuracy (unpaper, black/white conversion, increased contrast), which yields preprocessed.png.

When using the command

tesseract preprocessed.png output pdf

I receive a document with the OCR'ed text embedded. Great! However, can I tell Tesseract to use the original document input.pdf (i.e., the one without preprocessing) as the background of the generated PDF, while still performing OCR on the preprocessed input?

Thanks,
Jonas

Quan Nguyen

Dec 20, 2020, 12:30:58 PM
to tesseract-ocr
I don't think Tesseract supports this directly. You may want to try generating a text-only searchable PDF file and superimposing it on the original PDF file.
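
For anyone finding this later, here is a rough sketch of that two-step approach. It is not something confirmed in this thread. It assumes Tesseract's textonly_pdf config variable for writing a PDF that contains only the invisible text layer, and it assumes qpdf (a separate tool, my choice here) with its --overlay option for stamping that layer onto the original. The function name build_commands and the intermediate file name textonly.pdf are made up for illustration. It also assumes a single page, and that preprocessed.png keeps the same page geometry and DPI as input.pdf; if unpaper crops or deskews the page, the text layer may not line up with the original scan.

```python
import subprocess


def build_commands(ocr_image, background_pdf, output_pdf, text_layer_base="textonly"):
    """Return the two command lines as argument lists; nothing is executed here.

    All file names are placeholders; text_layer_base is a hypothetical
    intermediate name chosen for this sketch.
    """
    # tesseract appends ".pdf" to the output base name
    text_layer_pdf = f"{text_layer_base}.pdf"

    # Step 1: OCR the preprocessed image, writing only the invisible text layer
    # (assumes a Tesseract build that has the textonly_pdf variable).
    ocr_cmd = [
        "tesseract", str(ocr_image), text_layer_base,
        "-c", "textonly_pdf=1", "pdf",
    ]

    # Step 2: overlay the text layer on top of the original, unprocessed scan
    # (assumes qpdf is installed and supports --overlay).
    overlay_cmd = [
        "qpdf", str(background_pdf),
        "--overlay", text_layer_pdf, "--",
        str(output_pdf),
    ]
    return [ocr_cmd, overlay_cmd]


def run(ocr_image, background_pdf, output_pdf):
    """Execute both steps in order, stopping at the first failure."""
    for cmd in build_commands(ocr_image, background_pdf, output_pdf):
        subprocess.run(cmd, check=True)
```

Calling run("preprocessed.png", "input.pdf", "output.pdf") would run both commands. Building the argument lists separately makes it easy to print and check them before anything is executed.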