Issue 1498 in tesseract-ocr: When creating searchable pdf, file contents are not flushed and file handle is not released


tesser...@googlecode.com

Jul 27, 2015, 6:16:04 AM
to tesserac...@googlegroups.com
Status: New
Owner: ----

New issue 1498 by gpapado...@gmail.com: When creating searchable pdf, file
contents are not flushed and file handle is not released
https://code.google.com/p/tesseract-ocr/issues/detail?id=1498

What steps will reproduce the problem?
1. Use tesseract 3.04
2. In file api/tesseractmain.cpp add a sleep before the program exits. For
example:
....
fprintf(stdout, "DONE\n");
sleep(60);
fprintf(stdout, "EXITING\n");

PERF_COUNT_END
return 0; // Normal exit
}

3. Run tesseract to create a searchable pdf. On a different console,
monitor the result. For example:
> tail -f result.pdf


What is the expected output? What do you see instead?
After DONE is printed, only part of the searchable pdf has been written to
result.pdf. The expected result is that the whole pdf content, up to the
final EOF marker, is written to the file and the file is properly closed.
However, this only happens after EXITING is printed, when the program
finally exits.
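
The behaviour described above is what one would expect if the output file is finalized only in a destructor that runs at program exit. The sketch below is a minimal illustration of that idea, not Tesseract's code: SketchRenderer, ReadFile and RenderScoped are hypothetical names, and the "pdf" is just text followed by an %%EOF marker. It shows that limiting the object's scope makes the complete file visible before any later wait.

```cpp
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical stand-in for a PDF renderer: the trailer is written and the
// file is closed only in the destructor.
class SketchRenderer {
 public:
  explicit SketchRenderer(const std::string& path)
      : fout_(std::fopen(path.c_str(), "wb")) {}
  SketchRenderer(const SketchRenderer&) = delete;
  SketchRenderer& operator=(const SketchRenderer&) = delete;
  ~SketchRenderer() {
    if (fout_ != nullptr) {
      std::fputs("%%EOF\n", fout_);  // fputs writes the string literally
      std::fclose(fout_);            // flushes buffers, releases the handle
    }
  }
  void AddPage(const std::string& text) {
    if (fout_ != nullptr) std::fputs(text.c_str(), fout_);
  }

 private:
  std::FILE* fout_;
};

// Reads the file as another reader would see it at this moment.
std::string ReadFile(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}

std::string RenderScoped(const std::string& path) {
  {
    SketchRenderer renderer(path);
    renderer.AddPage("page 1\n");
  }  // destructor runs here, before any later sleep or long-running work
  return ReadFile(path);
}
```

If the renderer object instead lived until the end of main, a reader such as tail -f could see an incomplete file for as long as the process stayed alive.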

Please use labels and text to provide additional information.


--
You received this message because this project is configured to send all
issue notifications to this address.
You may adjust your notification preferences at:
https://code.google.com/hosting/settings

tesser...@googlecode.com

Jul 28, 2015, 6:47:31 PM
to tesserac...@googlegroups.com

Comment #1 on issue 1498 by breidenb...@gmail.com: When creating searchable
pdf, file contents are not flushed and file handle is not released
https://code.google.com/p/tesseract-ocr/issues/detail?id=1498

a) what platform is this, Linux?
b) is this streaming to stdout, e.g. tesseract input.tif - pdf > output.pdf
c) if yes, do you also get it with other formats, e.g. tesseract input.tif
- hocr > output.hocr

It's quite possible we can "fix" this by closing the stdout stream when
we finish writing. This will have the benefit of making it impossible
for someone to accidentally stream multiple output formats to stdout
and cause silent data corruption.

Not sure where the code is, let me spend a couple minutes checking.

PS. Just curious, how did you even notice this?

tesser...@googlecode.com

Jul 28, 2015, 6:48:43 PM
to tesserac...@googlegroups.com

Comment #2 on issue 1498 by breidenb...@gmail.com: When creating searchable
pdf, file contents are not flushed and file handle is not released
https://code.google.com/p/tesseract-ocr/issues/detail?id=1498

Yeah, it's right here.

https://github.com/tesseract-ocr/tesseract/blob/master/api/renderer.cpp#L33

The original idea of not closing stdout after we finish with it was
introduced by Zdenko back in Dec 23, 2012. I don't know why. Zdenko,
do you remember what you were thinking about?

https://github.com/tesseract-ocr/tesseract/commit/4812fac33e25f0b384d473b597e93508725ce058

tesser...@googlecode.com

Jul 29, 2015, 2:35:50 AM
to tesserac...@googlegroups.com

Comment #3 on issue 1498 by zde...@gmail.com: When creating searchable pdf,
file contents are not flushed and file handle is not released
https://code.google.com/p/tesseract-ocr/issues/detail?id=1498

IMO reporter does not use stdout ("On a different console, monitor the
result")...

Regarding closing stdout - AFAIK if we perform fclose(stdout) (especially
outside of main), the program will no longer be able to write to stdout
(e.g. warnings, some info) and will crash. So fclose(stdout) is not
considered a wise action.
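
One way to reconcile the two positions in this exchange is to flush stdout and close only real files. The following is a minimal sketch under that assumption; FinishOutput is a hypothetical helper, not a function from the Tesseract sources. It relies on the fact that using a stream after fclose is undefined behaviour in C and C++.

```cpp
#include <cstdio>

// Hypothetical helper: finish an output stream without ever closing stdout.
// Returns true on success.
bool FinishOutput(std::FILE* out) {
  if (out == nullptr) return false;
  if (out == stdout) {
    // Push buffered data out but keep stdout usable for later writes.
    return std::fflush(out) == 0;
  }
  // A regular file: fclose flushes the buffer and releases the handle.
  return std::fclose(out) == 0;
}
```

With this shape, a caller that streams to stdout still gets its data flushed as soon as rendering ends, and any later message written to stdout remains valid.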

tesser...@googlecode.com

Jul 29, 2015, 3:09:26 AM
to tesserac...@googlegroups.com

Comment #4 on issue 1498 by gpapadop73: When creating searchable pdf, file
contents are not flushed and file handle is not released
https://code.google.com/p/tesseract-ocr/issues/detail?id=1498

a) I tried this on Linux. Originally I found this on Windows with tess4j
java wrapper. But in order to confirm that the problem is not on the
wrapper, I tried it on Linux.
b) No, I am writing to a file. Here is my command:
tesseract tesseract-3.04.00/testing/eurotext.png result --tessdata-dir tesseract-ocr -c tessedit_create_pdf=true
It produces the file result.pdf.
c) I am only interested in pdf, so I have not tried other formats.

I have an application where the user can work on several images. We want to
provide the ability to create a searchable pdf from an image. The problem
becomes obvious when the user creates a pdf and then tries to open it:
opening fails. The produced pdf can be opened correctly only after the user
closes the application.

I now see that the problem is that the renderer's destructor is only called
when the main function is about to return.

tesser...@googlecode.com

Jul 29, 2015, 10:00:49 AM
to tesserac...@googlegroups.com

Comment #5 on issue 1498 by gpapadop73: When creating searchable pdf, file
contents are not flushed and file handle is not released
https://code.google.com/p/tesseract-ocr/issues/detail?id=1498

The problem happened because in the java code there was no call to
TessDeleteResultRenderer. But the tricky part was that adding this call did
not solve the problem. The reason was the
delete[] renderer;
instead of
delete renderer;
which you fixed in file api/capi.cpp

So after getting your source from label 3.04.01dev and fixing the java
wrapper, it works fine.

Thank you
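
The delete[] versus delete mismatch mentioned in comment #5 can be illustrated with a minimal sketch. All names here (SketchRenderer, SketchCreateRenderer, SketchDeleteRenderer) are hypothetical and only mimic the shape of a C-style wrapper around a C++ object; this is not the code from api/capi.cpp. An object allocated with new must be released with delete, and releasing it with delete[] is undefined behaviour, so the destructor that would flush and close the file is not guaranteed to run.

```cpp
// Hypothetical renderer type whose destructor records that cleanup happened.
struct SketchRenderer {
  explicit SketchRenderer(bool* closed) : closed_(closed) {}
  ~SketchRenderer() { *closed_ = true; }  // real code would flush and close
  bool* closed_;
};

// C-style wrapper functions (hypothetical names).
SketchRenderer* SketchCreateRenderer(bool* closed) {
  return new SketchRenderer(closed);  // single object, allocated with new
}

void SketchDeleteRenderer(SketchRenderer* renderer) {
  delete renderer;  // must match new; delete[] here is undefined behaviour
}
```

A wrapper in another language then has to call the delete function explicitly once rendering is done, since nothing else triggers the destructor before the process exits.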

tesser...@googlecode.com

Jul 29, 2015, 12:01:54 PM
to tesserac...@googlegroups.com
Updates:
Status: Fixed

Comment #6 on issue 1498 by zde...@gmail.com: When creating searchable pdf,
file contents are not flushed and file handle is not released
https://code.google.com/p/tesseract-ocr/issues/detail?id=1498

(No comment was entered for this change.)

tesser...@googlecode.com

Jul 29, 2015, 5:57:00 PM
to tesserac...@googlegroups.com

Comment #7 on issue 1498 by breidenb...@gmail.com: When creating searchable
pdf, file contents are not flushed and file handle is not released
https://code.google.com/p/tesseract-ocr/issues/detail?id=1498

Regarding comment #3, I sure hope warnings or info go to stderr, not stdout.

But since gpapadop73 is happy, then I am too.