Take in image from memory, get PDF output

Michael Kadziela

Aug 3, 2022, 11:59:13 PM

to tesseract-ocr

Hey all, and thanks for assisting.

I'm currently working on a pipeline that takes in PDFs, converts them to images, feeds them to Tesseract, and outputs a combined PDF at the end with a readable text layer.

I'm up to the Tesseract part, but I'm stuck on the API and unsure how to continue. Essentially, I want to give Tesseract an image from memory, such as a Pix from Leptonica. This currently works for outputting a text string, but I can't find any method in the API that takes the image already set on the Tesseract instance and renders it to a PDF. The PDF-related methods all seem to want a file path rather than the image set on the instance.
Is there an API for this somewhere, or a workaround?

Thanks!

Zdenko Podobny

Aug 4, 2022, 7:16:51 AM

to tesser...@googlegroups.com

I did not test it, but have a look at ProcessPagesMultipageTiff [1] for inspiration. It uses TessBaseAPI::ProcessPage(Pix *pix, ..., renderer) [2], so you should be able to create a PDF from images in memory.

Please share your experience and code snippet with the community if you are successful ;-)

[1] https://github.com/tesseract-ocr/tesseract/blob/424b17f997363670d187f42c43408c472fe55053/src/api/baseapi.cpp#L1030

[2] https://github.com/tesseract-ocr/tesseract/blob/424b17f997363670d187f42c43408c472fe55053/src/api/baseapi.cpp#L1253

Zdenko

On Thu, Aug 4, 2022 at 5:59, Michael Kadziela <kadt...@gmail.com> wrote:
