[Clarification question] Are there initiatives to make Tesseract 3.03+'s new "pdf" OCR option *multi-page* capable?


Tom

Aug 5, 2014, 1:52:28 AM
to tesser...@googlegroups.com
I am making heavy use of the new "pdf" option for OCR-ing single PDF pages (or their image equivalents), and it works very well. Thanks for the new option in Tesseract svn trunk.

While inspecting the code, I think I found some pieces indicating "multi-page" handling.
  • My question 1: Does Tesseract already support OCR-ing multi-page PDFs?
  • My question 2: If the answer is no: are there initiatives to integrate this into Tesseract?
I would appreciate it if Tesseract's "pdf" option also worked for multi-page PDFs.


Remark:

This is how I process multi-page PDFs currently:

At the moment I have a script (using pdftk/PDFToolkit) that splits a PDF into single image files, which I then convert one by one via Tesseract's "pdf" option. Another script then collates the single-page outputs into the final mixed-mode output PDF file.
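
A minimal sketch of the split / OCR / collate workflow described above. It is not the poster's actual script: it assumes `pdftoppm` (from poppler-utils) for rendering pages to PNG, which is a swapped-in tool, plus `tesseract` and `pdftk` on the PATH. The function names and the `page` file prefix are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def render_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> List[str]:
    # Assumption: pdftoppm writes one PNG per page, named <prefix>-<n>.png
    # (the page number may be zero-padded, e.g. page-01.png).
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image: Path) -> List[str]:
    # "tesseract <image> <outputbase> pdf" writes <outputbase>.pdf
    return ["tesseract", str(image), str(image.with_suffix("")), "pdf"]


def merge_cmd(pages: List[Path], output: Path) -> List[str]:
    # pdftk concatenates the single-page PDFs in the given order.
    return ["pdftk", *[str(p) for p in pages], "cat", "output", str(output)]


def page_number(path: Path) -> int:
    # Sort numerically so that page-10 comes after page-2.
    return int(path.stem.rsplit("-", 1)[1])


def ocr_pdf(pdf: Path, output: Path, workdir: Path) -> None:
    # Split into page images, OCR each page, then collate the results.
    subprocess.run(render_cmd(pdf, workdir / "page"), check=True)
    images = sorted(workdir.glob("page-*.png"), key=page_number)
    for image in images:
        subprocess.run(ocr_cmd(image), check=True)
    page_pdfs = [image.with_suffix(".pdf") for image in images]
    subprocess.run(merge_cmd(page_pdfs, output), check=True)
```

A call such as `ocr_pdf(Path("scan.pdf"), Path("scan-ocr.pdf"), Path("work"))` would run the three steps, provided the `work` directory already exists.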


zdenko podobny

Aug 5, 2014, 3:25:35 AM
to tesser...@googlegroups.com
Hello,

If you are referring to some code ("inspecting the code I think found some pieces..."), please provide a reference/link to it.

Tesseract is able to OCR everything that Leptonica is able to open, or everything that you or a programmer can convert to a Leptonica PIX structure ;-)

I have not had a chance to test Leptonica 1.71, but 1.70 was not able to open PDF. So the answer to your question 1 is no: Leptonica/Tesseract do not support OCR-ing multi-page PDFs, nor single-page PDFs. But they do support multi-page TIFF.

Regarding your question 2: I am not aware of any such initiative. Tesseract OCRs images, and PDF is not an image format but a document format (a request to OCR PDF is the same as a request to OCR odt, doc, docx, html, etc.).



Zdenko



Tom

Aug 5, 2014, 3:40:35 AM
to tesser...@googlegroups.com


On Tuesday, August 5, 2014 at 09:25:35 UTC+2, zdenop wrote:
I have not had a chance to test Leptonica 1.71, but 1.70 was not able to open PDF. So the answer to your question 1 is no: Leptonica/Tesseract do not support OCR-ing multi-page PDFs, nor single-page PDFs. But they do support multi-page TIFF.

I have tried this twice, but the approach failed (as far as I remember, I got these messages: http://stackoverflow.com/questions/5083492/problem-with-tesseract-and-tiff-format ). I will try to investigate why (or what I did wrong) and, if the problem persists, post a regular bug report. At the moment I am unsure what really happened.


Regarding your question 2: I am not aware of any such initiative. Tesseract OCRs images, and PDF is not an image format but a document format (a request to OCR PDF is the same as a request to OCR odt, doc, docx, html, etc.).

Uh, yes, I completely overlooked this; you are right!

According to the documentation and what you said, Tesseract is able to OCR multi-page TIFF, and it can also create a (dual-layer) PDF file containing the input image(s) and the OCR-ed text. So the only missing piece is the conversion of a multi-page PDF to a multi-page TIFF; this would then enable Tesseract to accept multi-page PDFs as input. My current investigation showed that Leptonica cannot convert an input multi-page PDF to a multi-page TIFF.
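
The missing conversion step could be sketched as follows, assuming Ghostscript (`gs`) is installed; Ghostscript is not mentioned in this thread and is only one possible converter. The `tiff24nc` device writes all pages into a single multi-page TIFF when the output file name contains no `%d` placeholder. Whether a given Tesseract build then turns a multi-page TIFF into a multi-page PDF is not confirmed here, so treat the second step as an assumption to verify. Function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def pdf_to_tiff_cmd(pdf: Path, tiff: Path, dpi: int = 300) -> List[str]:
    # Ghostscript renders every PDF page into one multi-page TIFF.
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiff24nc", f"-r{dpi}",
        f"-sOutputFile={tiff}", str(pdf),
    ]


def tesseract_pdf_cmd(tiff: Path, output_base: Path) -> List[str]:
    # "tesseract <input> <outputbase> pdf" writes <outputbase>.pdf
    return ["tesseract", str(tiff), str(output_base), "pdf"]


def ocr_multipage_pdf(pdf: Path, output_base: Path) -> None:
    # Convert PDF to an intermediate TIFF next to it, then OCR the TIFF.
    tiff = pdf.with_suffix(".tif")
    subprocess.run(pdf_to_tiff_cmd(pdf, tiff), check=True)
    subprocess.run(tesseract_pdf_cmd(tiff, output_base), check=True)
```

Calling `ocr_multipage_pdf(Path("scan.pdf"), Path("scan-ocr"))` would leave an intermediate `scan.tif` next to the input and, if the Tesseract step succeeds, a `scan-ocr.pdf`.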

zdenko podobny

Aug 5, 2014, 5:31:53 AM
to tesser...@googlegroups.com
On Tue, Aug 5, 2014 at 9:40 AM, Tom <syr...@gmail.com> wrote:


I have tried this twice, but the approach failed (as far as I remember, I got these messages: http://stackoverflow.com/questions/5083492/problem-with-tesseract-and-tiff-format ). I will try to investigate why (or what I did wrong) and, if the problem persists, post a regular bug report. At the moment I am unsure what really happened.

1. That would be a Leptonica issue, not a Tesseract issue.
2. There are already solutions for this, so converting PDF to TIFF should not be a problem.

TP

Aug 5, 2014, 5:41:36 AM
to tesseract-ocr

On Tue, Aug 5, 2014 at 12:40 AM, Tom <syr...@gmail.com> wrote:
My current investigation showed that Leptonica cannot convert an input multi-page PDF to a multi-page TIFF.

Writing a PDF is orders of magnitude easier than being able to read an arbitrary PDF.