Re: Building tesseract 3.02.02 with leptonica 1.69

Nick White

Apr 29, 2013, 6:27:07 AM
to tesser...@googlegroups.com
> ALSO, I thought tesseract built with leptonica could handle any of the formats
> leptonica can handle, and that include PDF.

Nope, it doesn't support straight PDF. Best is to rip the images
out of the PDF first. If you have imagemagick, something like this
will do that:

convert my-test.pdf out.png

Nick

Nick White

Apr 29, 2013, 8:39:53 AM
to tesser...@googlegroups.com
On Mon, Apr 29, 2013 at 04:10:43AM -0700, Steven McArdle wrote:
> What do you mean by "it doesn't support straight PDF" ?

I mean it only accepts image files. So you need to extract the
images from the PDF before getting Tesseract to process them.

Now I think of it, the 'pdfimages' tool is better for this than
imagemagick, as it will extract without converting or losing any
quality. But either would work fine (or Ghostscript, as you point
out).
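
A rough sketch of the pdfimages step, driven from Python so it can be
looped over many files. The function names, the "page" prefix and the
keep_jpeg parameter are made up for illustration; the only pdfimages
option used is -j, which writes embedded JPEG data out as .jpg files
instead of converting it to PPM:

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, out_prefix, keep_jpeg=True):
    """Build the argument list for pdfimages (hypothetical helper)."""
    cmd = ["pdfimages"]
    if keep_jpeg:
        # -j: save DCT (JPEG) images as .jpg rather than converting to PPM
        cmd.append("-j")
    cmd += [str(pdf_path), str(out_prefix)]
    return cmd


def extract_images(pdf_path, out_prefix):
    """Run pdfimages and return the extracted files in sorted order."""
    subprocess.run(pdfimages_command(pdf_path, out_prefix), check=True)
    prefix = Path(out_prefix)
    # pdfimages names its output <prefix>-NNN.<ext>, one file per image
    return sorted(prefix.parent.glob(prefix.name + "-*"))


if __name__ == "__main__":
    for image in extract_images("my-test.pdf", "page"):
        print(image)
```

Each extracted file can then be passed to Tesseract one at a time.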

Nick

TP

Apr 29, 2013, 8:58:26 AM
to tesseract-ocr

On Mon, Apr 29, 2013 at 4:10 AM, Steven McArdle <steven....@gmail.com> wrote:
> What do you mean by "it doesn't support straight PDF" ?


Leptonica only supports PDF for relatively simple *output*. See "I/O libraries Leptonica is dependent on" [1] and "Image I/O" [2]. If you don't believe that, see src\environ.h [3] for the I/O configuration section or src\pdfio.c [4] for the actual PDF *writing* support code. (In case you are wondering, it is a *LOT* easier to write a PDF than to read one.)

[1] http://tpgit.github.io/UnOfficialLeptDocs/leptonica/README.html#i-o-libraries-leptonica-is-dependent-on

[2] http://tpgit.github.io/UnOfficialLeptDocs/leptonica/README.html#image-i-o

[3] http://tpgit.github.io/Leptonica/environ_8h_source.html#l00082

[4] http://tpgit.github.io/Leptonica/pdfio_8c_source.html


Nick White

Apr 29, 2013, 9:32:14 AM
to tesser...@googlegroups.com
Oh cool, I haven't actually used multi-page TIFFs before, it's nice
that Tesseract handles them well, straight from ghostscript.

Yes, at the moment I suppose you'll just have to make a little
script or something to wrap the ghostscript and tesseract steps
appropriately.
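
Something along these lines might do as a starting point. It is only a
sketch: the function names, the 300 dpi default and the choice of the
tiffg4 (bilevel) Ghostscript device are my assumptions, not what Steve
is actually running, and tiffgray or tiff24nc may suit greyscale or
colour scans better:

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, tiff_path, dpi=300, device="tiffg4"):
    """Build the gs argument list that renders a PDF to one multi-page TIFF."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        f"-sDEVICE={device}",         # assumed device; pick one that fits the scans
        f"-r{dpi}",                   # rendering resolution in dpi
        f"-sOutputFile={tiff_path}",  # no %d in the name, so all pages go in one file
        str(pdf_path),
    ]


def tesseract_command(tiff_path, out_base, lang="eng"):
    """Build the tesseract argument list; output goes to <out_base>.txt."""
    return ["tesseract", str(tiff_path), str(out_base), "-l", lang]


def ocr_pdf(pdf_path, dpi=300):
    """Render the PDF with Ghostscript, OCR the TIFF, return the text path."""
    pdf_path = Path(pdf_path)
    tiff_path = pdf_path.with_suffix(".tif")
    out_base = pdf_path.with_suffix("")
    subprocess.run(ghostscript_command(pdf_path, tiff_path, dpi), check=True)
    subprocess.run(tesseract_command(tiff_path, out_base), check=True)
    return Path(str(out_base) + ".txt")


if __name__ == "__main__":
    print(ocr_pdf("my-test.pdf"))
```

Keeping the command construction separate from the subprocess calls
makes it easy to print the commands and check them before running
anything on a large batch.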

I have used pdfimages for a number of things, with scripts handling
the files one at a time. But I can see ghostscript would be a better
way of working for you (and quite possibly for me, next time I have
lots of stuff to process).

Nick

On Mon, Apr 29, 2013 at 05:51:49AM -0700, Steven McArdle wrote:
> Thanks Nick
>
> I already have it set up for ghostscript as it gives better results than
> imagemagick.
>
> As the PDFs are mostly multi-page and ghostscript can generate multi-page TIFFs
> from these, I can feed these directly into Tesseract.
>
> So I don't think pdfimages is an option as it spits out multiple files.
>
> Steve