How to get text from a multi-page TIFF file using the C API?

Bhaarat Sharma

Sep 18, 2017, 11:37:39 AM

to tesseract-ocr

I am using the Tesseract C API (capi) from Python via ctypes. Everything works well except for multi-page TIFFs: I get text only from the last page instead of from every page in the file.

This is what I'm doing:

path = "multipage.tiff"

self.tesseract.TessBaseAPIProcessPages.argtypes = [POINTER(TessBaseAPI), c_char_p, c_char_p, c_int, POINTER(TessResultRenderer)]

self.tesseract.TessBaseAPIProcessPages.restype = c_bool

success = self.tesseract.TessBaseAPIProcessPages(self.api, create_string_buffer(path), None, 0, None)

ocr_r = self.tesseract.TessBaseAPIGetUTF8Text(self.api)

result = string_at(ocr_r)  # contains text only from the last page

Has anyone come across this before, or does anyone know how to resolve it?

I [opened this as an issue][1] in the Tesseract repository, but apparently it is not a bug in the Tesseract command line or API, since the command line works fine and returns text for all pages.

Perhaps something other than `self.tesseract.TessBaseAPIGetUTF8Text(self.api)` should be called to get all the text?
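
One direction I may try, written here as a minimal Python 3 sketch rather than a confirmed fix: pass a text renderer to `TessBaseAPIProcessPages` instead of `None`, so that each page is written out as it is processed, then read the renderer's output file. This assumes a Tesseract version whose capi exposes `TessTextRendererCreate(outputbase)` and `TessDeleteResultRenderer`, and that the renderer writes to `outputbase + ".txt"`. The helper name `ocr_all_pages` and its parameters are my own; `lib` stands for the loaded ctypes library (`self.tesseract` above) and `api` for an already initialized handle (`self.api`).

```python
import ctypes
import os
import tempfile


def ocr_all_pages(lib, api, image_path):
    """Hypothetical helper: OCR every page of image_path through a text renderer."""
    # Opaque C handles are declared as plain void pointers.
    lib.TessTextRendererCreate.argtypes = [ctypes.c_char_p]
    lib.TessTextRendererCreate.restype = ctypes.c_void_p
    lib.TessBaseAPIProcessPages.argtypes = [
        ctypes.c_void_p, ctypes.c_char_p, ctypes.c_char_p,
        ctypes.c_int, ctypes.c_void_p,
    ]
    lib.TessBaseAPIProcessPages.restype = ctypes.c_int  # BOOL is an int in capi.h
    lib.TessDeleteResultRenderer.argtypes = [ctypes.c_void_p]
    lib.TessDeleteResultRenderer.restype = None

    with tempfile.TemporaryDirectory() as out_dir:
        # Assumption: the text renderer appends ".txt" to this base name.
        out_base = os.path.join(out_dir, "ocr")
        renderer = lib.TessTextRendererCreate(out_base.encode("utf-8"))
        try:
            ok = lib.TessBaseAPIProcessPages(
                api, image_path.encode("utf-8"), None, 0, renderer
            )
        finally:
            # Deleting the renderer closes its output file before we read it.
            lib.TessDeleteResultRenderer(renderer)
        if not ok:
            raise RuntimeError("TessBaseAPIProcessPages failed for " + image_path)
        with open(out_base + ".txt", encoding="utf-8") as fh:
            return fh.read()
```

With the setup above this would be called as `text = ocr_all_pages(self.tesseract, self.api, "multipage.tiff")`. The round trip through a temporary file is a deliberate trade-off in this sketch: with no renderer, `TessBaseAPIGetUTF8Text` appears to reflect only the most recently recognized page.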

[1]: https://github.com/tesseract-ocr/tesseract/issues/1138
