PDF output, to file vs memory buffer


Jeff Breidenbach

Nov 28, 2016, 12:57:59 PM
to tesseract-dev
Moving a conversation here since it may be of wider interest.

zdenko> Also there is a request to get text information (STRING) from 
zdenko> the renderer. At the moment the renderer produces output to a file, but 
zdenko> some users (e.g. those who use tesseract wrappers) would 
zdenko> like to use this information (especially hocr, tsv and txt) for 
zdenko> further processing.

zdenko> This request is related to the API breakage between 3.02 and 3.04. 
zdenko> The problem is with the functions ProcessPage and ProcessPages, which put 
zdenko> the result in "STRING* text_out" in 3.02 and, from 3.04 on, in 
zdenko> "TessResultRenderer* renderer". I think it is important to fix this 
zdenko> backward API compatibility ASAP.

For things like book scanning, it is very common to be working with many
high-resolution images that cannot all fit in memory at the same time.
Remember that PDF output contains a copy of all the images. Therefore
it is important that PDF output uses a streaming interface, rather than writing
everything to a memory buffer.

However, I certainly understand that people like memory buffers, especially
for small output formats like txt, hocr, etc. I hope the answer for that can be
fmemopen() rather than abandoning the streaming interface entirely.
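
To make the fmemopen() idea concrete, here is a minimal sketch in C. It assumes a POSIX.1-2008 system. render_text() and render_to_buffer() are hypothetical stand-ins, not Tesseract functions; the only point is that code which streams to a FILE* can also fill a caller-supplied memory buffer without changing the streaming code itself.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* needed for fmemopen() under strict ISO C modes */
#endif
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a renderer: it only knows how to stream to a FILE*. */
int render_text(FILE *out, const char *const *lines, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (fprintf(out, "%s\n", lines[i]) < 0) {
            return -1;
        }
    }
    return 0;
}

/* Hypothetical helper: wrap caller-provided memory in a FILE* and reuse the
 * same streaming code. Returns 0 on success, -1 on failure. */
int render_to_buffer(char *buf, size_t size,
                     const char *const *lines, size_t n) {
    FILE *mem = fmemopen(buf, size, "w");
    if (mem == NULL) {
        return -1;
    }
    int rc = render_text(mem, lines, n);
    /* Buffered data is flushed on close, so a buffer that is too small
     * may only show up as an error here. */
    if (fclose(mem) != 0) {
        rc = -1;
    }
    return rc;
}
```

The same render_text() call works unchanged with a FILE* from fopen(), which is what keeps large PDF output streaming. For small outputs whose size is not known in advance, POSIX open_memstream() is a similar option that grows the buffer for you.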



Jeff Breidenbach

Nov 28, 2016, 2:47:40 PM
to tesseract-dev
I really, really, don't want to break end-to-end streaming.

Zdenko Podobný

Nov 28, 2016, 3:28:00 PM
to tesser...@googlegroups.com
Me neither.
But the API is broken (relative to 3.0x) and it should be fixed IMO.
ProcessPage and ProcessPages were used by the Python wrapper for extracting text (hocr?) in 3.02, and now this option is gone. For me this is the same situation as we had with leptonica 1.74 and tesseract 3.04 [1].


Zdenko
