Take in image from memory, get PDF output


Michael Kadziela

Aug 3, 2022, 11:59:13 PM
to tesseract-ocr
Hey all, and thanks for assisting.

I'm currently working on a pipeline that takes in PDFs, converts them to images, feeds them to Tesseract, and outputs a combined PDF at the end with a readable text layer.

I'm up to the Tesseract part, but I'm stuck on the API and unsure how to continue. Essentially, I want to give Tesseract an image from memory, such as a Pix from Leptonica. This currently works for outputting a text string, but I can't find any method in the API that takes the image already set on the Tesseract instance and renders a PDF from it. They all seem to want a file path rather than the image set on the instance.
Is there an API for this somewhere, or a workaround?

Thanks! 

Zdenko Podobny

Aug 4, 2022, 7:16:51 AM
to tesser...@googlegroups.com
I did not test it, but have a look at ProcessPagesMultipageTiff [1] for inspiration: it uses TessBaseAPI::ProcessPage(Pix *pix,... renderer) [2], so you should be able to create a PDF from images in memory.

Please share your experience and code snippet with the community if you are successful ;-)



Zdenko


On Thu, Aug 4, 2022 at 5:59, Michael Kadziela <kadt...@gmail.com> wrote: