Scan pdf file instead of png

Teo

Mar 28, 2020, 2:48:20 PM
to tesseract-ocr
Is there an option to directly scan a PDF document containing multiple pages instead of a single PNG image?

Essam Zaky

Mar 28, 2020, 3:42:01 PM
to tesseract-ocr
What do you mean by "scan a pdf"?
If you mean running recognition on a PDF file, that is not possible directly, because PDF is not a format that Leptonica supports.
See the following README:

 
The workaround is to find a tool that can extract the PDF pages as images, then write the paths of the extracted images into a single text file.
For example, for test.pdf, the file test.txt would contain:
     ../image/path/1.png
     ../image/path/2.png
     ../image/path/3.png

Then call tesseract as follows:
tesseract test.txt path/to/output -l eng 


path/to/output.txt will then contain the recognition results for all the files listed in test.txt.
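
The steps above could be scripted roughly as follows. This is only a sketch, not something from the tesseract documentation: it assumes the pdftoppm tool from Poppler is installed for the PDF-to-PNG step (any other extractor would do), and the names page_number, write_image_list, tesseract_command and ocr_pdf are made up for illustration. The tesseract call itself is the same command line as shown above.

```python
import subprocess
from pathlib import Path


def page_number(path):
    """Return the trailing page number from a name like page-03.png."""
    return int(Path(path).stem.rsplit("-", 1)[1])


def write_image_list(image_paths, list_path):
    """Write one image path per line, in the given order."""
    list_path = Path(list_path)
    list_path.write_text("".join(f"{p}\n" for p in image_paths), encoding="utf-8")
    return list_path


def tesseract_command(list_path, output_base, lang="eng"):
    """Build the command from the post: tesseract test.txt path/to/output -l eng"""
    return ["tesseract", str(list_path), str(output_base), "-l", lang]


def ocr_pdf(pdf_path, work_dir, output_base, lang="eng"):
    """Hypothetical helper: PDF -> PNG pages -> list file -> tesseract."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: pdftoppm is on PATH and names its output
    # page-1.png, page-2.png, ... (possibly zero-padded).
    subprocess.run(
        ["pdftoppm", "-png", "-r", "300", str(pdf_path), str(work_dir / "page")],
        check=True,
    )
    # Sort numerically so page 10 does not end up before page 2.
    pages = sorted(work_dir.glob("page-*.png"), key=page_number)
    list_path = write_image_list(pages, work_dir / "pages.txt")
    subprocess.run(tesseract_command(list_path, output_base, lang), check=True)
    # tesseract appends .txt to the output base name.
    return Path(f"{output_base}.txt")
```

Calling ocr_pdf("test.pdf", "pages", "path/to/output") would then be expected to leave the combined text in path/to/output.txt, provided the directory path/to already exists.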


Best Regards
Essam

Zdenko Podobny

Mar 28, 2020, 3:44:51 PM
to tesser...@googlegroups.com
Tesseract performs OCR on images, not on documents (pdf, docx, odt, etc.).
If you need multipage support, use the TIFF image format instead of PDF when scanning.

Zdenko


On Sat, Mar 28, 2020 at 20:42, Essam Zaky <essa...@gmail.com> wrote:

Teo

Mar 28, 2020, 5:53:55 PM
to tesseract-ocr
Yes, that is exactly what I meant. OK, thanks for your support.

Teo

Mar 28, 2020, 5:54:06 PM
to tesseract-ocr
Ok thanks