I'm trying to create a searchable PDF out of a scanned one. I want to use Textract as the OCR engine instead of Tesseract. Is there a way to make libtesseract skip the OCR step and just create the invisible text layer (with the characters extracted by Textract) and apply it to the input PDF? I read that libtesseract is what OCRmyPDF uses to create the invisible text layer for searchable PDFs.
Hi,
You can use archive-pdf-tools to do this: https://github.com/internetarchive/archive-pdf-tools
It has a Python port of Tesseract's text layer generation and takes hOCR as input (you can convert other OCR formats to hOCR). Note that its output is currently not 100% the same as Tesseract's - I am trying to find the difference/bug in my port.
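To illustrate the "convert other OCR formats to hOCR" step, here is a minimal sketch in Python that turns a Textract response into a single-page hOCR document. It assumes the response is the dict returned by Textract's text detection (a "Blocks" list with LINE and WORD blocks whose bounding boxes are ratios of the page size) and that you know the page image size in pixels. The function name textract_to_hocr and its parameters are made up for this example, and whether this minimal hOCR (no ocr_carea or ocr_par elements) is enough for archive-pdf-tools is an assumption you should check against a small test file.

```python
from html import escape


def textract_to_hocr(response, page_width, page_height):
    """Build a minimal single-page hOCR string from a Textract response dict.

    Hypothetical helper for illustration; page_width and page_height are the
    pixel dimensions of the scanned page image.
    """
    blocks = response["Blocks"]
    by_id = {block["Id"]: block for block in blocks}

    def bbox(block):
        # Textract bounding boxes are ratios of the page size (0..1);
        # hOCR expects absolute pixel coordinates "x0 y0 x1 y1".
        box = block["Geometry"]["BoundingBox"]
        x0 = round(box["Left"] * page_width)
        y0 = round(box["Top"] * page_height)
        x1 = round((box["Left"] + box["Width"]) * page_width)
        y1 = round((box["Top"] + box["Height"]) * page_height)
        return f"bbox {x0} {y0} {x1} {y1}"

    lines = []
    for line in blocks:
        if line["BlockType"] != "LINE":
            continue
        # A LINE block lists its WORD blocks through CHILD relationships.
        word_ids = [
            child_id
            for rel in line.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        words = []
        for word_id in word_ids:
            word = by_id[word_id]
            if word["BlockType"] != "WORD":
                continue
            conf = round(word.get("Confidence", 0))
            words.append(
                f"<span class='ocrx_word' title='{bbox(word)}; "
                f"x_wconf {conf}'>{escape(word['Text'])}</span>"
            )
        lines.append(
            f"<span class='ocr_line' title='{bbox(line)}'>"
            + " ".join(words)
            + "</span>"
        )

    page = (
        f"<div class='ocr_page' title='bbox 0 0 {page_width} {page_height}'>\n"
        + "\n".join(lines)
        + "\n</div>"
    )
    return (
        "<html><head><meta name='ocr-system' content='textract'/></head>"
        f"<body>\n{page}\n</body></html>"
    )
```

The resulting string can be written to a .hocr file and handed to whichever tool builds the text layer. Multi-page documents would need one ocr_page element per page, which this sketch does not handle.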
I am the author, so feel free to reach out if you have any questions.
Regards,
Merlijn