Is .pdf input supported?

lvjkahvlwertfg

Dec 5, 2016, 3:43:24 PM

to tesseract-ocr

Hello,

First-time user here. I tried to feed a .pdf as input to tesseract v3.05, but all I got was a 0-byte output file and this:

c:\Program Files (x86)\Tesseract-OCR>tesseract.exe ET_1920_01-06t.pdf ide
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Error during processing.
ObjectCache(6157AAC8)::~ObjectCache(): WARNING! LEAK! object 031E7ED0 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatapunc-dawg)
ObjectCache(6157AAC8)::~ObjectCache(): WARNING! LEAK! object 031E9058 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddataword-dawg)
ObjectCache(6157AAC8)::~ObjectCache(): WARNING! LEAK! object 031E9100 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatanumber-dawg)
ObjectCache(6157AAC8)::~ObjectCache(): WARNING! LEAK! object 031F0970 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatabigram-dawg)
ObjectCache(6157AAC8)::~ObjectCache(): WARNING! LEAK! object 031E90A8 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatafreq-dawg)

The OS is Windows 7 x64. I didn't want to struggle with building from source, so I got a pre-built binary.
Is there maybe a log file that would give some idea about the "Error during processing." message?

Thanks,
Peter

Tom Morris

Dec 6, 2016, 7:29:42 PM

to tesseract-ocr

Tesseract does not support PDF as an input file format. If it's PDF which contains scanned images, you can extract the images using a separate tool like pdfimages and run tesseract on the result. If it's a PDF containing text, then you don't need OCR at all and can use a PDF text extractor instead.
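
For the scanned-image case, the extract-then-OCR workflow could be scripted roughly as follows. This is only a sketch, not an official recipe: it assumes poppler's pdfimages and tesseract are both on the PATH, that the poppler build supports the -png option, and that pdfimages names its output prefix-NNN.png. The function names, the work directory, and the "page" prefix are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf, prefix):
    """Command line that dumps every embedded image of `pdf` as PNG files."""
    # pdfimages writes files named <prefix>-000.png, <prefix>-001.png, ...
    return ["pdfimages", "-png", str(pdf), str(prefix)]


def tesseract_command(image):
    """Command line that OCRs one image; tesseract appends .txt to the output base."""
    return ["tesseract", str(image), str(image.with_suffix(""))]


def ocr_scanned_pdf(pdf, work_dir):
    """Extract images from a scanned PDF and run tesseract on each one.

    Returns the expected text file paths, one per extracted image.
    """
    for tool in ("pdfimages", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} was not found on PATH")

    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    prefix = work_dir / "page"  # hypothetical prefix, any name works

    subprocess.run(pdfimages_command(pdf, prefix), check=True)

    text_files = []
    for image in sorted(work_dir.glob("page-*.png")):
        subprocess.run(tesseract_command(image), check=True)
        text_files.append(image.with_suffix(".txt"))
    return text_files
```

Calling ocr_scanned_pdf("ET_1920_01-06t.pdf", "ocr_work") would then leave one .txt file per extracted image in ocr_work. The images come out in the order pdfimages finds them, which does not have to be one image per page.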

Tom

James R Barlow

Jan 16, 2017, 10:57:51 AM

to tesseract-ocr, zr...@freemail.hu

Use a program like OCRmyPDF (which I develop) to handle PDF conversion using Tesseract. While you can extract images using "pdfimages" (from poppler), that procedure only works for the simplest PDFs. Some scanning software will store multiple images per page in a PDF to improve compression, for example.

https://github.com/jbarlow83/OCRmyPDF
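
A minimal sketch of driving OCRmyPDF from a script, assuming the ocrmypdf command-line tool is installed and on the PATH. The basic form is ocrmypdf input.pdf output.pdf, with -l selecting the Tesseract language. The helper names and the output file name below are hypothetical.

```python
import shutil
import subprocess


def ocrmypdf_command(input_pdf, output_pdf, language=None):
    """Build the ocrmypdf command line for one input/output pair."""
    cmd = ["ocrmypdf"]
    if language:
        # -l passes the language code through to Tesseract, e.g. "eng"
        cmd += ["-l", language]
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


def add_text_layer(input_pdf, output_pdf, language=None):
    """Run ocrmypdf and raise if it is missing or exits with an error."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    subprocess.run(ocrmypdf_command(input_pdf, output_pdf, language), check=True)
```

For example, add_text_layer("ET_1920_01-06t.pdf", "ET_1920_01-06t_ocr.pdf", language="eng") would write a new PDF and leave the original untouched.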
