tesseract output is of first page only


ilevy

Aug 9, 2019, 5:41:15 AM
to tesseract-ocr
I'm trying tesseract for the first time with a png of a multipage document I saved out of a pdf (which itself was just an image).

When I run tesseract, I get an output of the first page, but that's all. I notice that there's a control-L (^L) at the end of the text file.

How do I get the entire file output to txt?

ElGato ElMago

Aug 9, 2019, 6:28:34 AM
to tesseract-ocr
Is it possible to have multiple pages in a png file in the first place?


Zdenko Podobny

Aug 9, 2019, 6:58:57 AM
to tesser...@googlegroups.com
Please provide exact details of what you did, including the command you ran.
Make sure you use the latest tesseract and leptonica.

Zdenko



Shree Devi Kumar

Aug 9, 2019, 2:42:59 PM
to tesseract-ocr
Try creating a multipage tiff from your pdf and running tesseract on that.
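
One possible way to do that, as a rough sketch rather than the only route: the thread does not say which converter to use, so Ghostscript (`gs`) here is an assumption, as are the 300 dpi grayscale settings and the helper names `pdf_to_multipage_tiff` and `ocr_pdf`. It also assumes your tesseract/leptonica build can read tiff files.

```shell
# Hypothetical helper: render every page of a PDF into one multipage TIFF.
# Assumes Ghostscript (gs) is installed; 300 dpi grayscale is an assumed
# setting, not something recommended in this thread.
pdf_to_multipage_tiff() {
  pdf=$1
  tiff=$2
  gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300 \
     -sOutputFile="$tiff" "$pdf"
}

# Hypothetical helper: convert the PDF, then OCR all pages into OUTBASE.txt.
ocr_pdf() {
  pdf=$1
  outbase=$2
  pdf_to_multipage_tiff "$pdf" "$outbase.tiff" &&
    tesseract "$outbase.tiff" "$outbase"
}
```

With these defined, something like `ocr_pdf input.pdf output` (both names are placeholders) should leave `output.tiff` and `output.txt` in the current directory, with tesseract reporting each page as it goes.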


ilevy

Aug 9, 2019, 6:05:42 PM
to tesseract-ocr
That's a good question. The png was exported from a pdf, so there may have been some notion of pages encoded into it, but that's a guess. What I can say is that the result is consistent. Running

tesseract Downloads/foundations-of-mathematics.tiff foundations-of-mathematics


always yields the first page in foundations-of-mathematics.txt.

ilevy

Aug 9, 2019, 6:08:18 PM
to tesseract-ocr
I exported a png from a pdf that seemed to be a scanned image of the original text. I installed the latest tesseract and leptonica via Homebrew. I then ran

tesseract Downloads/foundations-of-mathematics.tiff foundations-of-mathematics


and it consistently outputs the first page only.


ilevy

Aug 9, 2019, 6:12:22 PM
to tesseract-ocr
That worked, thank you very much Shree!

I could tell right away that it was working because it was writing to stdout:

Tesseract Open Source OCR Engine v4.1.0 with Leptonica
Page 1
Page 2
Page 3
Page 4
Page 5
Page 6
Page 7
Page 8
Page 9
Page 10
Page 11
Page 12
Page 13
Detected 14 diacritics
Page 14


and so on. Finally, I had the txt with all of the text, as expected.
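
For anyone checking their own output: the control-L mentioned at the start of the thread is a form feed, which tesseract writes after each page of text output by default. Counting those bytes gives a quick page count for the txt file. This is a minimal sketch that assumes the default page separator is in effect; `count_pages` is a hypothetical helper name.

```shell
# Hypothetical helper: count form feed (^L, 0x0C) bytes in a text file.
# With tesseract's default page separator this should equal the number
# of pages that were recognized.
count_pages() {
  # Keep only form feeds, count the remaining bytes, and strip the
  # padding that some wc implementations add around the number.
  tr -cd '\f' < "$1" | wc -c | tr -d ' '
}
```

If `count_pages foundations-of-mathematics.txt` prints 1 for a document you know is longer, tesseract only saw the first page of the input.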

It should be noted somewhere in the documentation that, at least in certain contexts, only the first page of a "multipage" png file (whatever "multipage" means for these files) gets processed.
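
If the pages exist as separate single-page images rather than one tiff, another option is the list-file input described in the tesseract man page: a text file with one image path per line, whose results are combined into a single output. A small sketch under that assumption; `ocr_image_list` and the file names in the usage note are hypothetical.

```shell
# Hypothetical helper: OCR several single-page images into one text file.
# Usage: ocr_image_list OUTBASE IMAGE...
ocr_image_list() {
  outbase=$1
  shift
  list="$outbase.list.txt"
  # Write one image path per line, in the order given on the command line.
  printf '%s\n' "$@" > "$list"
  # tesseract treats a text file input as a list of images to process.
  tesseract "$list" "$outbase"
}
```

For example, `ocr_image_list book page-01.png page-02.png page-03.png` would write `book.list.txt` and then ask tesseract for `book.txt`. Zero-padded page numbers keep the order right if you pass a glob such as `page-*.png` instead.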
