PDF output, to file vs memory buffer

Jeff Breidenbach

Nov 28, 2016, 12:57:59 PM

to tesseract-dev

Moving a conversation here since it may be of wider interest.

zdenko> Also there is a request to get text information (STRING) from the
zdenko> renderer. At the moment the renderer produces output to a file, but
zdenko> some users (e.g. those who use tesseract wrappers) would
zdenko> like to use this information (especially hocr, tsv and txt) for
zdenko> further processing.
zdenko> This request is related to the API breakage between 3.02 and 3.04.
zdenko> The problem is with the functions ProcessPage and ProcessPages, which put
zdenko> the result in "STRING* text_out" in 3.02 and, from 3.04, in
zdenko> "TessResultRenderer* renderer". I think it is important to fix this
zdenko> backward API compatibility ASAP.

For things like book scanning, it is very common to be working with many
high-resolution images that cannot all fit in memory at the same time.
Remember that PDF output contains a copy of all the images. Therefore
it is important that PDF output uses a streaming interface, rather than writing
everything to a memory buffer.

However, I certainly understand that people like memory buffers, especially
for small output formats like txt, hocr, etc. I hope the answer for that can be
fmemopen() rather than abandoning the streaming interface entirely.
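
To make the fmemopen() idea concrete, here is a minimal sketch in plain C. It is not Tesseract code: emit_text() is a hypothetical stand-in for anything that writes to a FILE*, and emit_text_to_buffer() is an assumed helper name. The point is only that a stream-based writer can be aimed at a caller-supplied memory buffer without changing the writer itself.

```c
#define _POSIX_C_SOURCE 200809L  /* fmemopen() is POSIX.1-2008 */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a renderer: it only knows how to write
   to a FILE* stream, wherever that stream happens to lead. */
static int emit_text(FILE *out, const char *text) {
    assert(out != NULL && text != NULL);
    size_t len = strlen(text);
    return fwrite(text, 1, len, out) == len ? 0 : -1;
}

/* Point the same stream writer at a caller-supplied buffer.
   Returns 0 on success, -1 if any stdio call reports an error. */
static int emit_text_to_buffer(char *buf, size_t cap, const char *text) {
    assert(buf != NULL);
    FILE *mem = fmemopen(buf, cap, "w");  /* stream backed by buf */
    if (mem == NULL) {
        return -1;
    }
    int rc = emit_text(mem, text);
    /* Closing flushes buffered output into buf; treat a failure here
       as an error too, since that is where a full buffer can surface. */
    if (fclose(mem) != 0) {
        rc = -1;
    }
    return rc;
}
```

A caller that wants small output such as txt or hocr in memory would declare something like char buf[4096] and call emit_text_to_buffer(buf, sizeof buf, ...), while the PDF path keeps writing to a real file or pipe. When the output size is not known up front, POSIX open_memstream() is a related option that grows the buffer for you.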

Jeff Breidenbach

Nov 28, 2016, 2:47:40 PM

to tesseract-dev

I really, really, don't want to break end-to-end streaming.

https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-to-do-streaming

Zdenko Podobný

Nov 28, 2016, 3:28:00 PM

to tesser...@googlegroups.com

Me neither.

But the API is broken (with 3.0x) and it should be fixed, IMO.

ProcessPage and ProcessPages were used by the Python wrapper for extracting text (hocr?) in 3.02, and now this option is gone. For me this is the same situation as we had with leptonica 1.74 and tesseract 3.04 [1].

[1] https://github.com/tesseract-ocr/tesseract/issues/233#issuecomment-263017757

Zdenko
