Merge text-only PDF with image-only PDF


Georg Sauthoff

Oct 16, 2017, 1:09:13 AM
to tesseract-ocr

Hello,

For some documents it makes sense to create a text-only PDF with tesseract (cf. -c textonly_pdf=1) and merge it with an image-only PDF, as described in https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#integrate-original-image-file-and-detected-text-into-pdf and the linked GitHub issue comment.

Use case: let tesseract run its OCR on very high-quality images, but put post-processed images into the resulting PDF file. That way you get high-quality OCR results and a relatively small PDF file.

So the approach described in the FAQ/issue sounds nice, but how do I actually merge the two PDF files (on Linux)?

When googling for PDF merge tools, I only find ones that concatenate PDF files ...

For the merge described above, the two PDF files have to be laid 'on top' of each other, i.e. the number of pages in the resulting PDF doesn't change; each page 'just' gets the text layer added.

Best regards
Georg

ShreeDevi Kumar

Oct 16, 2017, 3:09:44 AM
to tesser...@googlegroups.com
If you read the linked issue, you will find examples of merging PDF files, e.g.:

  • pdftk input-without-text.pdf multibackground textonly.pdf output new-mixed-mode.pdf
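
A minimal sketch of the two steps as shell functions, assuming tesseract and pdftk are installed. The function names and the TESSERACT/PDFTK override variables are hypothetical and not part of either tool. The tesseract call uses the textonly_pdf=1 setting from the question, and the pdftk call is the one quoted above.

```shell
#!/bin/sh
# Sketch only: function names and the TESSERACT/PDFTK override
# variables are illustrative assumptions, not part of either tool.

# Step 1: OCR a high-quality image into a text-only PDF.
#   $1 = input image, $2 = output base name (tesseract appends ".pdf")
ocr_text_only() {
    "${TESSERACT:-tesseract}" "$1" "$2" -c textonly_pdf=1 pdf
}

# Step 2: lay the two PDFs on top of each other, page by page.
#   $1 = image-only PDF, $2 = text-only PDF, $3 = merged output PDF
overlay_text_layer() {
    "${PDFTK:-pdftk}" "$1" multibackground "$2" output "$3"
}
```

Usage might look like `ocr_text_only scan-hires.tif textonly` followed by `overlay_text_layer small-images.pdf textonly.pdf merged.pdf` (file names are placeholders). pdftk's multibackground pairs page N of the input with page N of the background PDF, so both files should have the same number of pages in the same order.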

On 16-Oct-2017 10:39 AM, "Georg Sauthoff" <georg.s...@gmail.com> wrote:

...

aug.i...@gmail.com

Oct 31, 2017, 1:00:18 PM
to tesseract-ocr

Sir, please help me with converting PDF to Excel for Indian regional languages.
My email ID:
aug.i...@gmail.com