pdf -> searchable PDF


Andreas Steibl

Jan 11, 2017, 12:34:57 AM
to tesseract-ocr
Hello,

I have a scanned PDF and want to make a searchable PDF from it.
First I generate a black-and-white multipage TIFF, and from that Tesseract can produce a searchable PDF.

But is it somehow possible to integrate the original PDF images?
The generated TIFF does not have the same quality as the original (the scanned image may be in color).

James R Barlow

Jan 13, 2017, 11:45:29 AM
to tesseract-ocr
Tesseract cannot rasterize PDFs. It is fairly straightforward to write a PDF like Tesseract does, but very complex to rasterize one.

Programs like OCRmyPDF (which I develop) use Ghostscript, Tesseract and other tools to handle PDF to searchable PDF conversion.
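
For anyone who wants to try that route, here is a minimal sketch of the basic OCRmyPDF call, which takes an input PDF and an output PDF. The wrapper name make_searchable and the file names are only placeholders for illustration, and ocrmypdf is assumed to be installed and on PATH.

```shell
#!/bin/sh
# Thin wrapper around OCRmyPDF's basic invocation: input PDF, output PDF.
# make_searchable is a hypothetical helper name, not part of OCRmyPDF.
make_searchable() {
    ocrmypdf "$1" "$2"
}

# Usage (assumes ocrmypdf is installed; file names are placeholders):
# make_searchable scan.pdf scan-searchable.pdf
```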

ShreeDevi Kumar

Jan 13, 2017, 11:58:24 AM
to tesser...@googlegroups.com
Please see https://github.com/tesseract-ocr/tesseract/issues/83 and the other PDF-related issues in the GitHub repo, which contain similar discussions.

- excuse the brevity, sent from mobile

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2dccb3d2-f45e-4f47-9d04-302814d7f4ce%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

wiki...@gmail.com

Jan 15, 2017, 10:10:50 AM
to tesseract-ocr
Andreas,

we are now tracking your question as a new issue: https://github.com/tesseract-ocr/tesseract/issues/660 . Please follow the discussion there.

It looks as if the main developers are really interested in finding and implementing a solution (which I am also very interested in).

Zdenko Podobný

Jan 19, 2017, 3:47:30 PM
to tesser...@googlegroups.com
If the PDF was created by a scanner (so it contains only images), I use something like this:
podofoimgextract test.pdf .
ls pdfimage_* >filelist
tesseract filelist searchable pdf

podofoimgextract is part of the PoDoFo project: http://podofo.sourceforge.net/
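
The three commands above could be wrapped up roughly like this. It is only a sketch: the function names list_images and scan_to_searchable are made up for illustration, the pdfimage_* file naming is taken from the commands above rather than checked against every PoDoFo version, and both podofoimgextract and tesseract are assumed to be on PATH.

```shell
#!/bin/sh
# Print the extracted image files found in directory $1, one per line.
# Glob expansion is sorted, so zero-padded names come out in page order.
list_images() {
    for f in "$1"/pdfimage_*; do
        # Skip the unexpanded pattern when nothing matches.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Extract the embedded images from PDF $1 and OCR them into $2.pdf.
# $2 is tesseract's output base name, given without the .pdf extension.
scan_to_searchable() {
    workdir=$(mktemp -d) || return 1
    podofoimgextract "$1" "$workdir" || return 1
    list_images "$workdir" > "$workdir/filelist"
    tesseract "$workdir/filelist" "$2" pdf
}

# Usage (file names are placeholders):
# scan_to_searchable test.pdf searchable
```

Working in a temporary directory keeps the extracted images from mixing with other files that happen to match pdfimage_* in the current directory.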


Zdenko


Jeff Breidenbach

Jan 20, 2017, 7:48:16 PM
to tesseract-ocr
There is a lengthy side discussion that is appropriate to move
back here. I've been asked to elaborate what I mean by image 

There are two ways to turn a PDF file into images. One is to
render it, for example using a tool like pdftoppm. This is great
if there are things like fonts involved.
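
As a rough illustration of the rendering route, a pdftoppm call might look like the sketch below. The -r (resolution in DPI) and -png options belong to pdftoppm; the helper name render_pages and the 300 DPI default are assumptions made for this example, not something from the thread.

```shell
#!/bin/sh
# Render every page of PDF $1 to PNG files whose names start with prefix $2.
# $3 is the resolution in DPI; 300 is an assumed default, adjust as needed.
# render_pages is a hypothetical helper name.
render_pages() {
    pdftoppm -r "${3:-300}" -png "$1" "$2"
}

# Usage (assumes pdftoppm from poppler-utils is installed):
# render_pages input.pdf page        # 300 DPI
# render_pages input.pdf page 150    # 150 DPI
```

Note that rendering resamples whatever is on the page to the chosen resolution, so the output is not necessarily identical to the images stored in the PDF.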

But far better, for bag-of-images PDF files, such as produced
by certain scanning machines, is to crack open the bag and
take out the images. This guarantees no rescaling, no loss
of image information, and no (possibly space inefficient) format 
conversions.

Tools for image extraction are not super common, but it sounds
from the name like podofoimgextract does it. And for a fairly limited
set of formats, so does pdfimages from poppler-utils. The best case
scenario is image extraction with no transcoding whatsoever. That's
not always possible (especially when dealing with really fancy formats
like JBIG2) but it should be fine for PDF files produced by a scanner,
and also for any PDF files produced by Tesseract.
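
For the extraction route with pdfimages, something like the following sketch may help. The -list and -all options are part of pdfimages in reasonably recent poppler-utils; the helper names inspect_images and extract_native are made up for this example, and whether Tesseract can read every extracted format is not guaranteed.

```shell
#!/bin/sh
# List the images embedded in PDF $1 (page, type, encoding and so on)
# without writing any files. Useful for checking what is in the bag first.
inspect_images() {
    pdfimages -list "$1"
}

# Extract the images from PDF $1 using file name prefix $2. With -all,
# pdfimages keeps the embedded format where it supports doing so and
# falls back to a conversion otherwise.
extract_native() {
    pdfimages -all "$1" "$2"
}

# Usage (assumes pdfimages from poppler-utils is installed):
# inspect_images scan.pdf
# extract_native scan.pdf img
```

Checking the -list output first shows which encodings are involved, which helps to judge whether extraction without transcoding is realistic for a given file.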


