libtesseract skip OCR, just create invisible text layer

lbr

Jul 5, 2023, 1:13:03 AM

to tesseract-ocr

I'm trying to create a searchable PDF from a scanned one. I want to use Textract as the OCR engine instead of Tesseract. Is there a way to make libtesseract skip the OCR step, build only the invisible text layer (from the characters Textract extracted), and apply it to the input PDF?

I read that libtesseract is what ocrmypdf uses to create the invisible text layer for searchable PDFs.

Zdenko Podobny

Jul 8, 2023, 11:21:30 AM

to tesser...@googlegroups.com

No, this is not possible. Tesseract builds the PDF from the same image it ran OCR on, and it takes the text positions from its own OCR output.

Zdenko

Merlijn Wajer

Jul 12, 2023, 12:57:06 PM

to tesser...@googlegroups.com, lbr

Hi,

You can use archive-pdf-tools to do this: https://github.com/internetarchive/archive-pdf-tools

It includes a Python port of Tesseract's text layer generation and takes hOCR as input (you can convert other OCR formats to hOCR). Note that its output is not yet 100% identical to Tesseract's; I am still trying to find the difference or bug in my port.
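
To illustrate the "convert other OCR formats to hOCR" step, here is a minimal sketch in Python. It is not part of archive-pdf-tools; the function name `textract_words_to_hocr` is hypothetical. It assumes Textract-style blocks where each `WORD` block carries `Text`, `Confidence`, and a `Geometry.BoundingBox` with `Left`/`Top`/`Width`/`Height` given as fractions of the page, and that you know the pixel size of the page image. As a simplification, all words are placed in a single `ocr_line`; a real converter would probably group words by Textract's `LINE` blocks.

```python
from html import escape


def textract_words_to_hocr(blocks, page_width, page_height):
    """Build a minimal single-page hOCR document from Textract-style blocks.

    Hypothetical helper: `blocks` is assumed to be a list of dicts shaped
    like Textract's "Blocks" entries; only BlockType == "WORD" is used.
    """
    spans = []
    boxes = []
    for block in blocks:
        if block.get("BlockType") != "WORD":
            continue
        box = block["Geometry"]["BoundingBox"]
        # Textract coordinates are ratios of the page size; hOCR wants pixels.
        x0 = round(box["Left"] * page_width)
        y0 = round(box["Top"] * page_height)
        x1 = round((box["Left"] + box["Width"]) * page_width)
        y1 = round((box["Top"] + box["Height"]) * page_height)
        boxes.append((x0, y0, x1, y1))
        conf = round(block.get("Confidence", 0))
        spans.append(
            f"<span class='ocrx_word' title='bbox {x0} {y0} {x1} {y1}; "
            f"x_wconf {conf}'>{escape(block['Text'])}</span>"
        )

    # Simplification: one line covering the union of all word boxes.
    if boxes:
        lx0 = min(b[0] for b in boxes)
        ly0 = min(b[1] for b in boxes)
        lx1 = max(b[2] for b in boxes)
        ly1 = max(b[3] for b in boxes)
    else:
        lx0, ly0, lx1, ly1 = 0, 0, page_width, page_height

    words = "\n   ".join(spans)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        "<head><title></title></head>\n"
        "<body>\n"
        f" <div class='ocr_page' id='page_1' "
        f"title='bbox 0 0 {page_width} {page_height}'>\n"
        f"  <span class='ocr_line' title='bbox {lx0} {ly0} {lx1} {ly1}'>\n"
        f"   {words}\n"
        "  </span>\n"
        " </div>\n"
        "</body>\n"
        "</html>\n"
    )
```

The resulting string can be written to a `.hocr` file and passed to a tool that accepts hOCR. Whether a particular tool needs extra structure (for example `ocr_par` or `ocr_carea` elements) is not covered by this sketch and should be checked against that tool's documentation.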

I am the author, so feel free to reach out if you have any questions.

Regards,
Merlijn
