Tesseract OCR on PDF without converting into images

Banti Kumar

Aug 11, 2022, 6:11:00 AM
to tesseract-ocr
Can I use Tesseract on a PDF without converting its pages into images?
I have some PDF pages that contain both digital text and images with text. I want to apply OCR only to the images, not to the digital text regions, so that I get better accuracy for searchable PDFs.

TIA

Zdenko Podobny

Aug 12, 2022, 3:57:07 AM
to tesser...@googlegroups.com
No.


Merlijn B.W. Wajer

Aug 12, 2022, 5:28:23 AM
to tesser...@googlegroups.com
Hi Banti,
I've been working on something similar to this, but it's not yet ready
to do exactly what you want. Basically, I have a tool to convert the
text layers of a PDF to hOCR, one of the output formats from Tesseract.
If you run that, and then also OCR the entire PDF with Tesseract, you
could try to "merge" the two hOCR files into one, preferring the
extracted text over the Tesseract text where they overlap, or choosing
between them based on word confidence.
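
A minimal sketch of that merge idea in Python, for illustration only. It is not part of any existing tool. It assumes the words from both hOCR files have already been parsed into (text, bounding box) records in the same pixel coordinate space. The names Word, overlap_fraction, merge_words and the min_overlap threshold are hypothetical.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """One word with its bounding box (hypothetical record type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_fraction(a: Word, b: Word) -> float:
    """Return the fraction of a's area that is covered by b."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if w <= 0 or h <= 0:
        return 0.0  # boxes do not intersect
    area = (a.x1 - a.x0) * (a.y1 - a.y0)
    return (w * h) / area if area > 0 else 0.0


def merge_words(pdf_words: List[Word], ocr_words: List[Word],
                min_overlap: float = 0.5) -> List[Word]:
    """Keep every extracted PDF word; add an OCR word only if no
    PDF word covers at least min_overlap of its box."""
    merged = list(pdf_words)
    for ocr in ocr_words:
        if all(overlap_fraction(ocr, p) < min_overlap for p in pdf_words):
            merged.append(ocr)
    return merged


pdf_words = [Word("Hello", 10, 10, 60, 30)]
ocr_words = [
    Word("Hel1o", 12, 11, 61, 29),      # overlaps the digital text, dropped
    Word("Scanned", 10, 100, 90, 120),  # only present in the image, kept
]
merged = merge_words(pdf_words, ocr_words)
print([w.text for w in merged])
```

A confidence-based variant could compare each OCR word's confidence against a cutoff instead of always preferring the extracted text; the overlap test would stay the same.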

Of course, you'll have to figure out a proper scale, since Tesseract
requires the PDF to be rendered to an image, and the image pixels need
to line up with the hOCR coordinates extracted from the PDF.
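
As a rough sketch of the scaling step in Python: a PDF's default unit is the point (1/72 inch), so a page rendered at 300 DPI is 300/72 times larger in pixels than in points. The helper below, scale_bbox, is a hypothetical name. It assumes both coordinate systems share a top-left origin, which may not hold for coordinates taken straight from a PDF, since PDF user space starts at the bottom-left.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def scale_bbox(bbox: BBox, src_size: Tuple[float, float],
               dst_size: Tuple[float, float]) -> Tuple[int, int, int, int]:
    """Map a box from a page of src_size (width, height) onto a
    rendering of dst_size (width, height), rounding to whole pixels."""
    x0, y0, x1, y1 = bbox
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    return (
        round(x0 * dst_w / src_w),
        round(y0 * dst_h / src_h),
        round(x1 * dst_w / src_w),
        round(y1 * dst_h / src_h),
    )


# A US Letter page is 612 x 792 points; at 300 DPI it renders to 2550 x 3300 pixels.
page_points = (612, 792)
page_pixels = (2550, 3300)
scaled = scale_bbox((72, 72, 144, 90), page_points, page_pixels)
print(scaled)
```

Scaling by the ratio of page sizes, not by a hard-coded DPI, keeps the boxes aligned even if the renderer rounds the image dimensions.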

You can find the tool here (but keep in mind I'm still actively working
on it / breaking things):
https://github.com/internetarchive/archive-hocr-tools/blob/master/bin/pdf-to-hocr

I don't (yet) have a tool to merge hOCR files.

Regards,
Merlijn