Tesseract OCR on PDF without converting into images

Banti Kumar

Aug 11, 2022, 6:11:00 AM
tesseract-ocr
Can I use Tesseract on a PDF without converting the pages into images?
I have some PDF pages that contain both digital text and images with text. I want to apply OCR only to the images, not to the digital text regions, so that I get better accuracy in the resulting searchable PDFs.


Zdenko Podobny

Aug 12, 2022, 3:57:07 AM
tesser...@googlegroups.com

Merlijn B.W. Wajer

Aug 12, 2022, 5:28:23 AM
tesser...@googlegroups.com
Hi Banti,
I've been working on something similar to this, but it's not ready for
doing exactly what you want. Basically, I have a tool to convert the
text layers of a PDF to hOCR, one of the output formats from Tesseract.
If you run that, and then also OCR the entire PDF with Tesseract, you
could try to "merge" the two hOCR files into one, preferring the
extracted text over the Tesseract text if they overlap - or do it based
on word confidence or so.
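
To make the "merge" idea concrete, here is a rough sketch in Python. It is not an existing tool: the `Word` class, `overlap_ratio`, `merge_words` and the 0.5 threshold are all hypothetical names and values chosen for illustration. It assumes the words from both hOCR files have already been parsed into bounding boxes in the same pixel coordinate system, and it always prefers the text extracted from the PDF when boxes overlap.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in image pixels, origin at top-left
    source: str  # "pdf" for extracted text, "ocr" for Tesseract output


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller box."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    smaller = min(area_a, area_b)
    return (inter_w * inter_h) / smaller if smaller > 0 else 0.0


def merge_words(pdf_words, ocr_words, threshold=0.5):
    """Keep every PDF word; keep an OCR word only if it does not
    substantially overlap any PDF word."""
    merged = list(pdf_words)
    for w in ocr_words:
        if all(overlap_ratio(w.bbox, p.bbox) < threshold for p in pdf_words):
            merged.append(w)
    # Crude ordering by top edge, then left edge; not real layout analysis.
    merged.sort(key=lambda w: (w.bbox[1], w.bbox[0]))
    return merged


pdf_words = [Word("Hello", (10, 10, 60, 30), "pdf")]
ocr_words = [
    Word("He11o", (11, 11, 59, 29), "ocr"),     # overlaps the PDF word
    Word("Figure", (10, 100, 80, 120), "ocr"),  # text inside an image
]
merged = merge_words(pdf_words, ocr_words)
```

In this example the misread "He11o" is dropped because it sits on top of the extracted "Hello", while "Figure" survives because nothing from the text layer covers it. A confidence-based variant would compare Tesseract's word confidence instead of always preferring the PDF text.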

Of course, you'll have to figure out a proper scale, since Tesseract
requires the PDF to be rendered to an image, and the image pixels need
to line up with the hOCR coordinates extracted from the PDF.
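
As a rough illustration of the scaling step: PDF user space defaults to 1/72 inch per unit, so a page rendered at a given DPI is scaled by dpi / 72. The sketch below is only an assumption about how the inputs look: `pdf_bbox_to_pixels` is a made-up helper, and it assumes the box is in PDF points with the origin at the bottom-left of the page, with no page rotation or crop box offset. If the extracted hOCR already uses a top-left origin, the vertical flip should be skipped.

```python
def pdf_bbox_to_pixels(bbox_pt, page_height_pt, dpi=300):
    """Convert a box from PDF points (bottom-left origin) to image
    pixels (top-left origin) for a page rendered at `dpi`."""
    x0, y0, x1, y1 = bbox_pt
    left = round(x0 * dpi / 72.0)
    right = round(x1 * dpi / 72.0)
    # Flip the vertical axis: the top of the box in the image
    # corresponds to the larger y value in PDF space.
    top = round((page_height_pt - y1) * dpi / 72.0)
    bottom = round((page_height_pt - y0) * dpi / 72.0)
    return (left, top, right, bottom)


# US Letter page (612 x 792 pt) rendered at 300 DPI.
pixel_box = pdf_bbox_to_pixels((72, 720, 144, 756), page_height_pt=792, dpi=300)
```

The important part is that the same DPI is used both for rendering the page image that goes into Tesseract and for this conversion, otherwise the two sets of boxes will not line up.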

You can find the tool here (but keep in mind I'm still actively working
on it / breaking things):

I don't (yet) have a tool to merge hOCR files.

