OCRing PDF Files (and converting them to TIFF)


NA

Mar 11, 2008, 4:21:09 PM
to tesseract-ocr
I really want to OCR PDF files, and I know I need to convert them to
TIFF first. Can someone tell me A) what the best utility for this is
and B) what the generic command line is to convert PDFs to TIFF images?

I tried ghostscript and imagemagick and both returned really
low-quality versions of the original image. I am not sure if I gave a
bad command line. I remember seeing something about providing the
resolution of the TIFF image to the converter, but I will be batch
converting and the resolution of the documents varies.

alexweb

Mar 11, 2008, 4:41:52 PM
to tesseract-ocr
Personally I like PDF Converter. It lets you define the resolution
(300 dpi is recommended for OCR) and "print" to a TIF file.

Alex

Hussein

Mar 11, 2008, 4:46:09 PM
to tesser...@googlegroups.com
Just make sure not to try to OCR structured info in the PDF :)  If it is structured text, you can extract it directly with tools.
 
Only OCR the images inside the PDF.  The tools allow you to extract images from the PDF and then you save them as you like.  Of course, it is easier if the whole PDF is an image to use a direct tool to convert it to TIF.
 
Hussein Al-Hussein
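
A minimal sketch of that check, assuming pdftotext (shipped with xpdf and poppler) is installed. The helper name has_text_layer is made up for illustration; it only tests whether any non-whitespace text can be extracted, which is a rough signal and not a guarantee that the whole document is structured text.

```shell
# has_text_layer PDF
# Hypothetical helper: succeeds if pdftotext finds any non-whitespace text.
has_text_layer() {
    # "-" sends the extracted text to stdout instead of a .txt file
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Usage (commented out so the definition can be sourced on its own):
# if has_text_layer MyDoc.pdf; then
#     echo "text can be extracted directly"
# else
#     echo "image-only PDF, needs OCR"
# fi
```

A PDF that mixes scanned pages with real text will pass this check, so the image pages in it would still need OCR.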






> Date: Tue, 11 Mar 2008 13:21:09 -0700
> Subject: OCRing PDF Files (and converting them to TIFF)
> From: NAp...@gmail.com
> To: tesser...@googlegroups.com

alexweb

Mar 11, 2008, 4:49:26 PM
to tesseract-ocr
I meant PDF Creator, not PDF Converter...

Alex

sorry, it's late here :)

NA

Mar 11, 2008, 4:53:16 PM
to tesseract-ocr
The problem is I don't want to define a resolution. Choosing 300 is
all well and good if your source PDF is actually 300 or less, but what
if it's greater? It doesn't seem like a good idea to lose image
quality before OCRing by lowering the resolution to 300. I believe
that would result in less accurate OCR results than what's possible.

Nathan

Jeffrey Ratcliffe

Mar 11, 2008, 4:57:23 PM
to tesser...@googlegroups.com
If you are using Linux, the commandline tool pdfimages will extract
the images from the PDF as PNM files, which you will then have to
convert to TIFF.

My gscan2pdf application will sort all of that out automatically, as
well as using Tesseract for the OCR, embedding the output behind the
image in the PDF.

alexweb

Mar 11, 2008, 5:02:54 PM
to tesseract-ocr
Normally "more resolution" doesn't mean "better OCR". If you scan a
document at both 300 and 1200 dpi, the OCR results from the 300 dpi
scan will often be better. But anyway, with PDF Creator you can
also define a higher resolution than 300 dpi. I only mentioned that
feature. It would be best to try it yourself and see if you like
it. It's freeware.

Scan...@gmail.com

Mar 11, 2008, 7:37:55 PM
to tesseract-ocr
You are exactly correct: more resolution does not equal more accuracy.
Also, most PDF content is a mixture of lower-res grayscale images and
font-based text, which is the reason to get it into one image file.

Nathan Apter

Mar 11, 2008, 8:33:41 PM
to tesser...@googlegroups.com
Does gscan2pdf work from the command line or a pipe using a TIFF image? Is it open source? I implemented tesseract into my web site, www.abillionbillion.com to allow people to ocr tiff images, maybe I could implement your gscan2pdf to ocr PDFs... I do not want to reinvent the wheel.

Nathan Apter

Mar 11, 2008, 8:36:47 PM
to tesser...@googlegroups.com
Ok so my question is then, would it be a good move to batch convert images of varying resolutions (above and below 300) to 300 DPI before OCRing? Wouldn't it be wiser to just leave them in their original resolution that they had in the PDF image? If so, is that possible with pdfcreator / ghostscript?

Hussein

Mar 12, 2008, 1:00:57 AM
to tesser...@googlegroups.com
Higher resolution means longer time for processing and more memory.  However, resolutions lower than 300 DPI are bad for any OCR.
 
Hussein Al-Hussein
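
To apply the 300 DPI rule of thumb to a batch of mixed documents, it helps to know what resolution an embedded image actually has. The arithmetic is pixels divided by inches, and a PDF point is 1/72 inch. The sketch below is only an illustration: effective_dpi and below_300_dpi are made-up helper names, and the pixel and point widths are assumed to come from whatever tool you use to inspect the PDF.

```shell
# effective_dpi PIXELS POINTS
# Hypothetical helper: resolution of an embedded image, given its width in
# pixels and the width it occupies on the page in PDF points (72 per inch).
# Shell arithmetic is integer-only, so the result is truncated.
effective_dpi() {
    echo $(( $1 * 72 / $2 ))
}

# below_300_dpi PIXELS POINTS
# Succeeds when the image falls under the 300 DPI rule of thumb.
below_300_dpi() {
    [ "$(effective_dpi "$1" "$2")" -lt 300 ]
}

# Usage (commented out so the definitions can be sourced on their own):
# effective_dpi 2550 612     # a 2550-pixel scan across an 8.5 inch page
```

For example, a scan 2550 pixels wide that fills a page 612 points (8.5 inches) wide works out to 300 DPI, while a 1275-pixel scan on the same page is only 150 DPI.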






> Date: Tue, 11 Mar 2008 16:37:55 -0700
> Subject: Re: OCRing PDF Files (and converting them to TIFF)
> From: Scan...@gmail.com
> To: tesser...@googlegroups.com

Jeffrey Ratcliffe

Mar 12, 2008, 2:04:37 AM
to tesser...@googlegroups.com
On 12/03/2008, Nathan Apter <nap...@gmail.com> wrote:
> Does gscan2pdf work from the command line or a pipe using a TIFF image? Is
> it open source? I implemented tesseract into my web site,
> www.abillionbillion.com to allow people to ocr tiff images, maybe I could
> implement your gscan2pdf to ocr PDFs... I do not want to reinvent the wheel.

gscan2pdf is a GUI and at the moment cannot be controlled from the
command line in the manner you want. It is open source.

The command line tools you are looking for are pdfimages to extract
the images, and then imagemagick to convert them to TIFF.

It is certainly better to retain the resolution of the images, rather
than down- or upsampling them, as information can only be lost that
way. The above technique extracts the images at the resolution with
which they were embedded.
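
The two-step technique described above can be sketched roughly as follows, assuming pdfimages (xpdf or poppler) and ImageMagick's convert are both on the PATH. extract_tiffs is a made-up helper name. Without extra options pdfimages writes PBM/PPM files named ROOT-000, ROOT-001 and so on, and convert picks the output format from the .tif extension.

```shell
# extract_tiffs PDF ROOT
# Hypothetical helper: pull the embedded images out of PDF pixel-for-pixel,
# then convert each one to TIFF without asking for any resampling.
extract_tiffs() {
    pdf=$1
    root=$2
    # pdfimages writes ROOT-000.ppm, ROOT-001.pbm, ... next to ROOT
    pdfimages "$pdf" "$root" || return 1
    for img in "$root"-*.p?m; do
        [ -e "$img" ] || continue          # the glob matched nothing
        convert "$img" "${img%.*}.tif" || return 1
    done
}

# Usage (commented out so the definition can be sourced on its own):
# extract_tiffs MyDoc.pdf MyDoc
```

Because no -density or -resize is passed, the TIFF keeps the pixel dimensions of the embedded image. This only helps for PDFs whose pages are images; text drawn with fonts is not extracted by pdfimages.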

malcook

Mar 12, 2008, 3:45:06 PM
to tesseract-ocr
You can specify the -density to ImageMagick's convert command...

and...

... if you're doing OCR on scans of printed text, you probably also
want to specify `-monochrome`.

I used the following two-liner in linux to convert a PDF into OCRed
text files, one per page

convert -density 300x300 -monochrome MyDoc.pdf MyDoc-%03d.tiff
perl -e '`tesseract $_ $_` foreach @ARGV' -- MyDoc-*.tiff

The zero-padded file names stay the same width for up to 999 pages,
after which the last argument to convert would need %04d for another
significant digit of pages (i.e. up to 9999).

I have no advice about which density is appropriate for your
application, or about retaining "maximal" density, as you seem
interested in doing. Let us know what you decide!

--Malcolm
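
For anyone who would rather not use the Perl one-liner, the second step can also be written as a plain shell loop. This is a sketch, not a drop-in replacement: ocr_pages is a made-up helper name, and unlike the one-liner it strips the extension from the output base, so the text lands in MyDoc-000.txt instead of MyDoc-000.tiff.txt. It also quotes file names, so names containing spaces are passed through intact.

```shell
# ocr_pages FILE...
# Hypothetical helper: run tesseract on each image, one text file per page.
ocr_pages() {
    for tiff in "$@"; do
        # tesseract appends .txt to the output base, so drop the extension
        tesseract "$tiff" "${tiff%.*}" || return 1
    done
}

# Usage (commented out so the definition can be sourced on its own):
# ocr_pages MyDoc-*.tiff
```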

Nathan Apter

Mar 13, 2008, 11:14:42 AM
to tesser...@googlegroups.com
I did what you suggested, pdfimages and then imagemagick. Using just the defaults for both programs I got a much higher quality image than I did with just using imagemagick's convert. Why is that?
 
After I extract the image this way and OCR it, is there a simple way for me to place the text into the original PDF file?
Are you planning on providing a gscan2pdf command line interface any time soon?
 
Thanks,
Nathan
Document Management for Everyone

Jeffrey Ratcliffe

Mar 13, 2008, 2:59:21 PM
to tesser...@googlegroups.com
On 13/03/2008, Nathan Apter <nap...@gmail.com> wrote:
> I did what you suggested, pdfimages and then imagemagick. Using just the
> defaults for both programs I got a much higher quality image than I did with
> just using imagemagick's convert. Why is that?

imagemagick uses ghostscript and resamples the result.
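
This also explains the low-quality output reported at the start of the thread: when no -density is given, ImageMagick asks for 72 dpi. If the page has to be rendered (for example because it mixes images and font-based text), the resolution can be set explicitly by calling Ghostscript directly. The sketch below assumes gs is installed; rasterize_pdf is a made-up helper name, and tiffg4 is just one of Ghostscript's TIFF devices, chosen here because it produces bilevel output.

```shell
# rasterize_pdf PDF DPI OUTROOT
# Hypothetical helper: render every page of PDF to a bilevel (CCITT G4)
# TIFF at the requested resolution, one file per page.
rasterize_pdf() {
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r"$2" \
       -sOutputFile="$3-%03d.tif" "$1"
}

# Usage (commented out so the definition can be sourced on its own):
# rasterize_pdf MyDoc.pdf 300 MyDoc
```

Rendering still resamples the embedded images to the chosen resolution, so for image-only PDFs the extraction route keeps more of the original data.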

> After I extract the image this way and OCR it, is there a simple way for me
> to place the text into the original PDF file?

Not really. If you can hack a bit of Perl, you could take the routine
from gscan2pdf - from that point of view it isn't hard, but I don't
know of another tool that does it.

> Are you planning on providing a gscan2pdf command line interface any time
> soon?

I have thought about it, but it isn't anywhere near the top of my todo
list at the moment.

Hussein

Mar 13, 2008, 3:14:26 PM
to tesser...@googlegroups.com
> On 13/03/2008, Nathan Apter <nap...@gmail.com> wrote:
> > After I extract the image this way and OCR it, is there a simple way for me
> > to place the text into the original PDF file?

Well, any PDF extraction tool lets you grab images, text, and other objects from the PDF file.  It is easier to insert them into a new PDF file that you create, replacing the image with its OCRed text.
 
Hussein Al-Hussein

NA

Mar 13, 2008, 4:10:11 PM
to tesseract-ocr
I don't want to replace the image with the resulting text. I just want
to place the text behind the image for copying / pasting / searching.

Scan...@gmail.com

Mar 14, 2008, 9:24:59 PM
to tesseract-ocr
The answer is there is not an easy way. Your choices are to just place
it as an annotation like gscan2pdf, or to use a PDF library like
Adobe's or PDFlib and place it in more typesetting-style PostScript
terms. You could also use fpdf.org and do it from PHP scripts for
basic PDF files.


Glen

Jeffrey Ratcliffe

Mar 15, 2008, 3:30:32 AM
to tesser...@googlegroups.com
On 15/03/2008, gl...@jetsoftdev.com <Scan...@gmail.com> wrote:
> The answer is there is not an easy way. You choices are to just place
> it as an annotation like gscan2pdf or you must use a PDF library like
> adobe or PDFlib and place in the in more type setting postcript
> terms. You could also use fpdf.org and do it from php scripts for
> basic PDF files.

As annotations are not indexed by Beagle, gscan2pdf also simply embeds
the text as plain text behind the image.

jryc...@gmail.com

Mar 25, 2008, 4:38:32 PM
to tesseract-ocr
On Mar 15, 8:30 am, "Jeffrey Ratcliffe" <jeffrey.ratcli...@gmail.com>
wrote:
I read the discussion but couldn't quite figure out the conclusion.
I'm looking for something very similar: I have lots of scanned PDF
files and I'd like to pass them through OCR, placing whatever gets
recognized as plain text behind the image. I'd like everything else
within the PDFs to remain untouched (as much as possible).

Batch processing is the key here -- I'm not looking for a GUI
application, I'd like to run it regularly from cron.

Any suggestions?

thanks,
--J.

Nathan Apter

Mar 25, 2008, 4:53:24 PM
to tesser...@googlegroups.com
How does gscan2pdf embed the text as plain text behind the image? Is this a simple thing to do?

Jeffrey Ratcliffe

Mar 26, 2008, 2:44:34 AM
to tesser...@googlegroups.com
On 25/03/2008, Nathan Apter <nap...@gmail.com> wrote:
> How does gscan2pdf embed the text as plain text behind the image? Is this a
> simple thing to do?

It just uses the PDF::API2 Perl module, so if you know Perl, yes, it is
very simple.
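
For batch use without Perl, one option that postdates this thread is Tesseract's own PDF output. The sketch below assumes a Tesseract release of 3.03 or newer, where a trailing "pdf" argument selects the PDF output config; searchable_pdf is a made-up helper name. Note that this does not modify the original PDF. It builds a new PDF from the page image with the recognised text as an invisible layer, so anything else in the original file is not carried over.

```shell
# searchable_pdf IMAGE
# Hypothetical helper: writes <name>.pdf containing the page image plus an
# invisible text layer that can be searched and copied.
searchable_pdf() {
    # the trailing "pdf" selects tesseract's PDF output config
    tesseract "$1" "${1%.*}" pdf
}

# Usage (commented out so the definition can be sourced on its own):
# for page in scan-*.tif; do searchable_pdf "$page"; done
```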
