Can tesseract be used to read a PDF and OCR it to text?

pjfarley3

Jan 12, 2020, 2:01:59 PM
to tesseract-ocr
I installed the 64-bit version of Tesseract from UB Mannheim on my Win10 system, but it will not read a PDF as the input "image".

Error messages:

Tesseract Open Source OCR Engine v5.0.0-alpha.20191030 with Leptonica
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.

I have tried using the Xpdf command-line tool pdftotext for this task, but even the latest V4.02 of pdftotext fails to process some apparently invalid character maps (both LATIN1 and UTF-8) in some of the PDFs I need converted to text.

The PDFs are generated by a third party, and I have no influence over them to get their PDF mistakes corrected.

I was hoping Tesseract might do a better job for my PDF-to-text needs.

TIA for any info or suggestions you can provide.

Peter

Shree Devi Kumar

Jan 12, 2020, 8:52:51 PM
to tesseract-ocr
Tesseract reads only image files, not PDF. You can convert the PDF to images (tif, png) and OCR those.

Or use wrappers around Tesseract that take a PDF and convert it to text. Look under add-ons in the wiki.
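
The convert-then-OCR route could be scripted roughly as follows. This is only a sketch, assuming Xpdf's `pdftopng` and the `tesseract` executable are both on the PATH; the function names, the 300 dpi default and the `eng` language are illustrative choices, not something taken from this thread.

```python
import subprocess
import tempfile
from pathlib import Path


def render_cmd(pdf_path, out_prefix, dpi=300):
    # Xpdf's pdftopng rasterizes each page to a PNG named after out_prefix.
    return ["pdftopng", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, out_base, lang="eng"):
    # tesseract appends ".txt" to out_base on its own.
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def pdf_to_text(pdf_path, dpi=300, lang="eng"):
    # Render every page into a scratch directory, OCR each image in
    # page order, and join the per-page text with form feeds.
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        subprocess.run(render_cmd(pdf_path, tmp / "page", dpi), check=True)
        pages = []
        for image in sorted(tmp.glob("page*.png")):
            out_base = image.with_suffix("")
            subprocess.run(ocr_cmd(image, out_base, lang), check=True)
            text_file = out_base.with_suffix(".txt")
            pages.append(text_file.read_text(encoding="utf-8"))
    return "\f".join(pages)
```

Calling `pdf_to_text("statement.pdf")` (a hypothetical file name) would return the OCR'ed pages as one string. Because each page is rasterized first, broken character maps inside the PDF should no longer matter, but the layout is only preserved as well as Tesseract can reconstruct it.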

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/3acec554-e508-4759-8a46-9ab7e1bb6e6f%40googlegroups.com.

pjfarley3

Jan 13, 2020, 1:49:31 AM
to tesseract-ocr


On Sunday, January 12, 2020 at 8:52:51 PM UTC-5, shree wrote:
Tesseract reads only image files, not PDF. You can convert the PDF to images (tif, png) and OCR those.

Or use wrappers around Tesseract that take a PDF and convert it to text. Look under add-ons in the wiki.


Thanks for that advice, I will check the wiki.

Peter

JB Data31

Jan 14, 2020, 5:42:57 AM
to tesser...@googlegroups.com
OCRmyPDF does the job.

It is Linux-native, but a Windows install is available:
https://ocrmypdf.readthedocs.io/en/latest/installation.html#installing-on-windows



--
@*JB*Δ <http://jbigdata.fr/jbigdata/index.html>

pjfarley3

Jan 17, 2020, 6:01:49 PM
to tesseract-ocr
Thanks for that link, but some research showed me that pdfgrep depends on the poppler libraries, which do not preserve text formatting in PDFs very well at all.

Xpdf's version (https://www.xpdfreader.com/download.html) of pdftotext does the best job I have found so far, but when the PDF has erroneous or corrupted character map tables (as many of the PDFs I get from banks and utility companies do) it can't resolve all of the PDF text.

I can use Adobe Reader to view all the text information in these PDFs even with such bad internal tables, but transcribing them by hand or by mouse highlight/copy/paste is very time consuming.

Also, ocrmypdf's documentation of the "sidecar" option indicates that actual text in PDFs is not output at all, only OCR'ed text. This defeats my need for reading and outputting ALL the text, hopefully with at least most of the textual formatting preserved.
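
One possible way around the sidecar limitation is OCRmyPDF's `--force-ocr` option, which rasterizes every page and OCRs all of it, so the sidecar should then contain text for the whole document. The sketch below assumes `ocrmypdf` is installed and on the PATH; the function names and file names are hypothetical, and the existing text layer is thrown away and re-recognized rather than extracted, so accuracy and formatting depend entirely on the OCR.

```python
import subprocess


def force_ocr_cmd(pdf_in, pdf_out, sidecar_txt, lang="eng"):
    # --force-ocr rasterizes all pages (existing text included) before OCR;
    # --sidecar writes the recognized text to a separate plain-text file.
    return [
        "ocrmypdf",
        "--force-ocr",
        "--sidecar", str(sidecar_txt),
        "-l", lang,
        str(pdf_in),
        str(pdf_out),
    ]


def ocr_all_text(pdf_in, pdf_out, sidecar_txt, lang="eng"):
    # Run OCRmyPDF and return the contents of the sidecar text file.
    subprocess.run(force_ocr_cmd(pdf_in, pdf_out, sidecar_txt, lang), check=True)
    with open(sidecar_txt, encoding="utf-8") as handle:
        return handle.read()
```

For example, `ocr_all_text("bank.pdf", "bank_ocr.pdf", "bank.txt")` would leave the OCR'ed text in `bank.txt` as well as returning it.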

Guess I will just have to keep looking around.

Peter


pjfarley3

Jan 17, 2020, 6:04:08 PM
to tesseract-ocr
At least as of today, the "add-ons" part of the wiki doesn't actually have a PDF-to-OCR'ed-text wrapper, as far as I can see.

Still searching for a solution, but thanks for trying to help.

Peter

Shree Devi Kumar

Jan 18, 2020, 3:09:04 AM
to tesseract-ocr


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com