Tesseract: multipage TIFF to multipage PDF

András Jeszenkovits

May 15, 2019, 9:43:56 AM
to tesseract-ocr
Hello!

Can you help me with this problem? I'm testing the Tesseract OCR engine. The input is a scanned multipage TIFF file. I tried to create a PDF from it, but the result always has only one page.
I used this command line:
tesseract In\Test.tif Out\TestOutput -l rus+eng -c tessedit_page_number=-1  pdf
I found the option " -c tessedit_page_number=-1", which is supposed to produce a multipage PDF, but it doesn't work. I also tried txt output, but it contained only the text from the first page.
Can you help me with that?

Zdenko Podobny

May 15, 2019, 9:51:31 AM
to tesser...@googlegroups.com
Why are you using tessedit_page_number?
 
Zdenko


On Wed, May 15, 2019 at 15:43, András Jeszenkovits <jesz...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/9ffedebc-c7b1-4856-bf31-f438d8213d01%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

András Jeszenkovits

May 15, 2019, 9:57:25 AM
to tesseract-ocr
Here: tesseract In\SPTest.tif Out\Test --psm 3 -l rus+eng -c tessedit_page_number=-1 pdf



Zdenko Podobny

May 15, 2019, 9:59:35 AM
to tesser...@googlegroups.com
Please read my question once again.

Zdenko



Shree Devi Kumar

May 15, 2019, 11:29:36 AM
to tesser...@googlegroups.com
 tesseract In\SPTest.tif Out\Test --psm 3 -l rus+eng pdf  

This should be enough to create a multi page pdf from a multi page tiff.
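
For anyone scripting the same call, here is a minimal Python sketch. It only assembles the argument list shown above and assumes the tesseract executable is on PATH. The helper names build_tesseract_command and run_tesseract and their parameters are hypothetical, not part of Tesseract. Listing more than one output config (for example pdf and txt) asks Tesseract to write each format in the same run.

```python
import subprocess


def build_tesseract_command(image, output_base, languages=("rus", "eng"),
                            psm=3, outputs=("pdf",)):
    """Return the argument list for one tesseract CLI run (hypothetical helper)."""
    return [
        "tesseract", str(image), str(output_base),
        "--psm", str(psm),           # page segmentation mode
        "-l", "+".join(languages),   # languages joined as rus+eng
        *outputs,                    # output configs such as pdf, txt
    ]


def run_tesseract(image, output_base, **options):
    """Run tesseract; raises CalledProcessError if it exits non-zero."""
    command = build_tesseract_command(image, output_base, **options)
    return subprocess.run(command, check=True)
```

A call such as run_tesseract(r"In\SPTest.tif", r"Out\Test", outputs=("pdf", "txt")) would then produce both Out\Test.pdf and Out\Test.txt, which makes it easy to compare the two outputs.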



--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

András Jeszenkovits

May 16, 2019, 4:01:59 AM
to tesseract-ocr
I thought so too, but Tesseract creates a one-page PDF.

András Jeszenkovits

May 16, 2019, 4:03:09 AM
to tesseract-ocr
I found the following option in the output of tesseract --help:
tessedit_page_number    -1      -1 -> All pages , else specific page to process
I think that would be the default.



Shree Devi Kumar

May 16, 2019, 4:31:33 AM
to tesser...@googlegroups.com
What is your version of tesseract? Which O/S?

Have you tried it with just one language?


András Jeszenkovits

May 16, 2019, 5:06:22 AM
to tesseract-ocr
The OS is Windows 10, and I use Tesseract OCR engine v4.0.0.20190314. I tried English, Russian, and Hungarian. I tried both the 32-bit and 64-bit versions, and I tried a JPG file too, with the same result (a 1-page PDF).

Zdenko Podobny

May 16, 2019, 5:11:00 AM
to tesser...@googlegroups.com
So please provide your TIFF so that we can reproduce the problem.

Zdenko



Shree Devi Kumar

May 16, 2019, 6:51:55 AM
to tesser...@googlegroups.com
Are you sure your tif is multi-page? Have you checked in an image program and browsed multiple pages?

A jpg file will not have multiple pages as far as I know.
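
One way to check the page count without an image viewer is to walk the TIFF's chain of image file directories (IFDs); each top-level IFD normally corresponds to one page. The sketch below is a minimal standard-library Python version. It assumes a classic TIFF (not BigTIFF), and count_tiff_pages is a hypothetical helper name, not something Tesseract provides.

```python
import struct


def count_tiff_pages(data: bytes) -> int:
    """Count top-level IFDs (normally one per page) in classic TIFF bytes."""
    if data[:2] == b"II":        # little-endian byte order
        endian = "<"
    elif data[:2] == b"MM":      # big-endian byte order
        endian = ">"
    else:
        raise ValueError("not a TIFF file")

    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")

    pages = 0
    seen = set()                 # guards against a looping IFD chain
    while offset != 0 and offset not in seen:
        seen.add(offset)
        (entry_count,) = struct.unpack_from(endian + "H", data, offset)
        pages += 1
        # Each IFD entry is 12 bytes; the next-IFD offset follows the entries.
        (offset,) = struct.unpack_from(
            endian + "I", data, offset + 2 + 12 * entry_count
        )
    return pages
```

For a file on disk, something like count_tiff_pages(open(r"In\Test.tif", "rb").read()) should report more than 1 if the scan really contains several pages.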


András Jeszenkovits

May 16, 2019, 6:57:44 AM
to tesseract-ocr
Yes, the JPG was wrong information. But the TIFF is multipage. I scanned it twice with the same result, and I can see all the pages in Acrobat Reader, Chrome, etc.

András Jeszenkovits

May 16, 2019, 6:59:29 AM
to tesseract-ocr
I cannot send you the TIFF because it contains sensitive company data. But I scanned another document, and the result is still a 1-page PDF. I think something is wrong with my Tesseract version or installation.

Zdenko Podobny

May 16, 2019, 7:08:45 AM
to tesser...@googlegroups.com
Tesseract has NEVER had a problem with multipage TIFF files.
If you do not share the image file, you are on your own with this problem. Nobody can help you.

Zdenko



András Jeszenkovits

May 16, 2019, 8:17:27 AM
to tesseract-ocr
I downloaded the installer from here: https://github.com/UB-Mannheim/tesseract/wiki
I tried both versions.
I just scanned 2 pages with handwritten data, only to test the conversion, not the OCR.
You can download it from here:

Shree Devi Kumar

May 16, 2019, 8:43:48 AM
to tesser...@googlegroups.com
I just tested once again on my Ubuntu installation, and it works fine. See attached.

Question: does multipage TIFF to txt, hocr, alto, or tsv process all pages? In other words, is the problem limited to PDF output?

Try to OCR the TIFF I have attached to see whether that works for you.
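
To check the txt output from a script, the text can be split on the page separator. This is a rough sketch that assumes Tesseract's page_separator parameter is left at its form feed default; count_text_pages is a hypothetical helper name.

```python
def count_text_pages(text, separator="\f"):
    """Count pages in Tesseract txt output split on the page separator."""
    chunks = text.split(separator)
    # A separator written after the last page leaves an empty trailing chunk.
    if chunks and chunks[-1].strip() == "":
        chunks.pop()
    return len(chunks)
```

Reading the file with open(r"Out\Test.txt", encoding="utf-8").read() and passing the result to count_text_pages gives a quick comparison against the number of pages in the TIFF. A result of 1 for a multipage input would suggest the problem is not limited to PDF output.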



multi.txt
multi.pdf
multi.tif

András Jeszenkovits

May 16, 2019, 8:58:49 AM
to tesseract-ocr

I reinstalled with another Tesseract version (tesseract-ocr-setup-3.05.00dev) and it works wel...

Zdenko Podobny

May 24, 2019, 5:11:43 AM
to tesser...@googlegroups.com
Quite strange: I tested it on Windows and it does not work for me either (although other multipage TIFF files do). I found that one TIFF format was missing from Tesseract's format check. This is fixed in the Tesseract master code.

Zdenko



Stefan Weil

May 26, 2019, 1:59:28 PM
to tesseract-ocr
András, I just made a new installer based on the latest Tesseract code. Maybe you want to try that.

Regards,
Stefan