pdf -> searchable PDF

Andreas Steibl

Jan 11, 2017, 12:34:57 AM

to tesseract-ocr

Hello

I have a scanned PDF and want to make a searchable PDF from it.

First I generate a black-and-white multipage TIFF, and with Tesseract I can then make a searchable PDF.

But is it somehow possible to integrate the original PDF images?

The generated TIFF does not have the same quality as the original (the scanned image may be in color).

James R Barlow

Jan 13, 2017, 11:45:29 AM

to tesseract-ocr

Tesseract cannot rasterize PDFs. It is fairly straightforward to write a PDF, as Tesseract does, but very complex to rasterize one.

Programs like OCRmyPDF (which I develop) use Ghostscript, Tesseract and other tools to handle PDF to searchable PDF conversion.

ShreeDevi Kumar

Jan 13, 2017, 11:58:24 AM

to tesser...@googlegroups.com

Please see https://github.com/tesseract-ocr/tesseract/issues/83 and the other PDF-related issues in the GitHub repo, which have similar discussions.

- excuse the brevity, sent from mobile

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2dccb3d2-f45e-4f47-9d04-302814d7f4ce%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

wiki...@gmail.com

Jan 15, 2017, 10:10:50 AM

to tesseract-ocr

Andreas,

Your issue is now tracked as a new issue: https://github.com/tesseract-ocr/tesseract/issues/660. Please follow the discussion there.

It looks as if the main developers are really interested in finding and implementing a solution (which I am also very interested in).

Zdenko Podobný

Jan 19, 2017, 3:47:30 PM

to tesser...@googlegroups.com

If the PDF was created by a scanner (so it contains only images), I use something like this:

podofoimgextract test.pdf .
ls pdfimage_* >filelist
tesseract filelist searchable pdf

podofoimgextract is part of the PoDoFo project: http://podofo.sourceforge.net/

Zdenko
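
For readers who want to script the three commands above, here is a minimal sketch under a few assumptions: the extracted files are named pdfimage_* (as in the ls call above), their numbering is zero-padded so that glob order matches page order, and the helper name build_filelist and the temporary working directory are illustrative only, not part of PoDoFo or Tesseract.

```shell
#!/bin/sh
# Sketch only: build_filelist and the temporary directory are hypothetical helpers.

# Write the extracted images found in DIR to LIST, one path per line.
# Glob expansion is sorted, which matches page order only if the
# numbering in the file names is zero-padded (an assumption).
build_filelist() {
    dir=$1
    list=$2
    for f in "$dir"/pdfimage_*; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done > "$list"
}

# Run the whole pipeline only when a PDF is given as the first argument.
if [ "$#" -ge 1 ]; then
    pdf=$1
    workdir=$(mktemp -d)
    podofoimgextract "$pdf" "$workdir"          # extract the embedded images
    build_filelist "$workdir" "$workdir/filelist"
    tesseract "$workdir/filelist" searchable pdf  # writes searchable.pdf
fi
```

Saved as a script and called with a PDF path as its only argument, this should leave searchable.pdf in the current directory. It is worth checking the file list by hand once, since page order depends entirely on how the extracted files are named.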


Jeff Breidenbach

Jan 20, 2017, 7:48:16 PM

to tesseract-ocr

There is a lengthy side discussion that is appropriate to move back here. I've been asked to elaborate on what I mean by image extraction.

https://github.com/tesseract-ocr/tesseract/issues/660

There are two ways to turn a PDF file into images. One is to render it, for example using a tool like pdftoppm. This is great if there are things like fonts involved.

But far better, for bag-of-images PDF files, such as those produced by certain scanning machines, is to crack open the bag and take out the images. This guarantees no rescaling, no loss of image information, and no (possibly space-inefficient) format conversions.

Tools for image extraction are not super common, but it sounds from the name like podofoimgextract does it. And for a fairly limited set of formats, so does pdfimages from poppler-utils. The best-case scenario is image extraction with no transcoding whatsoever. That's not always possible (especially when dealing with really fancy formats like JBIG2), but it should be fine for PDF files produced by a scanner. And also any PDF files produced by Tesseract.
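
To make the two approaches concrete, here is a small sketch that prints the poppler-utils command for each one rather than running it. The function name show_image_command, the 300 dpi resolution, and the file names in the usage lines are assumptions for illustration. The -all option asks pdfimages to write images in their native format where it supports that.

```shell
#!/bin/sh
# Sketch only: show_image_command is a hypothetical helper that prints,
# but does not run, the command for the chosen approach.
show_image_command() {
    mode=$1
    pdf=$2
    prefix=$3
    case $mode in
        extract)
            # Take the embedded images out of the PDF without rescaling.
            printf 'pdfimages -all %s %s\n' "$pdf" "$prefix"
            ;;
        render)
            # Rasterize each page; 300 dpi is an assumed resolution.
            printf 'pdftoppm -r 300 -png %s %s\n' "$pdf" "$prefix"
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1
            ;;
    esac
}

show_image_command extract scan.pdf img
show_image_command render scan.pdf page
```

Running pdfimages -list on the PDF first shows which images it contains and how they are encoded, which helps in deciding whether extraction without transcoding is realistic for a given file.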
