Scan a PDF file instead of a PNG

Teo

Mar 28, 2020, 2:48:20 PM

to tesseract-ocr

Is there an option to directly scan a PDF document containing multiple pages instead of a single PNG image?

Essam Zaky

Mar 28, 2020, 3:42:01 PM

to tesseract-ocr

What do you mean by "scan a pdf"?

If you mean recognizing a PDF file: you cannot do that directly, because PDF is not an input format supported by Leptonica.

See the following documentation:

https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc

The workaround is to find a tool that can convert the PDF pages to images, then write the paths of the extracted images into a single text file, one path per line.

For example, for test.pdf the list file test.txt would contain:

../image/path/1.png

../image/path/2.png

../image/path/3.png

Then call tesseract as follows:

tesseract test.txt path/to/output -l eng

The resulting output.txt will contain the recognition results for all the files listed in test.txt.
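The workaround above could be scripted roughly as follows. This is only a sketch under some assumptions: it uses pdftoppm from Poppler as the PDF-to-image tool (the thread does not name one), and the helper names write_list_file, tesseract_command and ocr_pdf are made up for illustration.

```python
import subprocess
from pathlib import Path


def write_list_file(image_paths, list_path):
    """Write one image path per line, the list-file layout shown above."""
    list_path = Path(list_path)
    list_path.write_text("".join(f"{p}\n" for p in image_paths), encoding="utf-8")
    return list_path


def tesseract_command(list_path, output_base, lang="eng"):
    """Build the same command line as in the post: tesseract LIST OUTPUT -l LANG."""
    return ["tesseract", str(list_path), str(output_base), "-l", lang]


def ocr_pdf(pdf_path, work_dir, output_base, lang="eng"):
    """Convert a PDF to PNG pages, list them, and run tesseract on the list."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: pdftoppm (Poppler) is installed and on PATH.
    # It writes page-1.png, page-2.png, ... into work_dir at 300 dpi.
    subprocess.run(
        ["pdftoppm", "-png", "-r", "300", str(pdf_path), str(work_dir / "page")],
        check=True,
    )

    # pdftoppm is expected to zero-pad page numbers to a common width,
    # so a plain sort should keep the pages in order.
    images = sorted(work_dir.glob("page-*.png"))
    list_file = write_list_file(images, work_dir / "pages.txt")

    # Tesseract appends ".txt" to the output base name.
    subprocess.run(tesseract_command(list_file, output_base, lang), check=True)
    return Path(f"{output_base}.txt")
```

Calling ocr_pdf("test.pdf", "pages", "path/to/output") should then leave the combined text in path/to/output.txt, provided both tools are installed.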

Best Regards

Essam

Zdenko Podobny

Mar 28, 2020, 3:44:51 PM

to tesser...@googlegroups.com

Tesseract does OCR on images, not on documents (pdf, docx, odt, etc.).

If you need multipage support, use the TIFF image format instead of PDF when scanning.

Zdenko

Teo

Mar 28, 2020, 5:53:55 PM

to tesseract-ocr

Yes, that is exactly what I meant. OK, thanks for your support.

Teo

Mar 28, 2020, 5:54:06 PM

to tesseract-ocr

Ok thanks
