Modification of "background" image allowed in PDF output?

Frank Siegert

Sep 19, 2014, 8:38:36 AM

to tesser...@googlegroups.com

Dear all,

I have been testing tesseract to embed OCR in scanned PDF documents, and it works phenomenally well in recognizing the text.

By chance, I noticed one slightly disturbing issue when comparing the original input image with the PDF file: a number of straight lines that are present in the input image have disappeared completely in the PDF (some of them are horizontal rules, others are lines in a logo). Since I want to use tesseract to produce completely unmodified documents with only the OCR text layer added, this would be a problem for me. I have uploaded a test image to http://cern.ch/fsiegert/tmp/tesseract-test.tif and here is the command I used on it:

$ tesseract -l deu tesseract-test.tif tesseract-test pdf
Tesseract Open Source OCR Engine v3.03 with Leptonica
OSD: Weak margin (6.96) for 162 blob text block, but using orientation anyway: 1
$ tesseract --version
tesseract 3.03
leptonica-1.71
libgif 5.1.0 : libjpeg 8d : libpng 1.6.12 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.1

This results in http://cern.ch/fsiegert/tmp/tesseract-test.pdf, which is missing the straight horizontal lines and the ones in the logo. Is this line removal done on purpose, and can it be disabled?

Cheers,

Frank

PS: I have removed much more text from the document for privacy reasons, but the same happens when the document is complete with text.

zdenko podobny

Sep 19, 2014, 8:54:52 AM

to tesser...@googlegroups.com

This is a known issue. Try the current code from the git repository; it should be fixed there.

Zdenko

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/9d3455ba-6c17-4c10-bc09-e5ee5b911ad0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Frank Siegert

Sep 19, 2014, 9:14:52 AM

to tesser...@googlegroups.com

Dear Zdenko,

Thanks for the quick reply!

Does that mean that in general, i.e. apart from this bug, I can assume by construction that the image will remain unmodified and only a text layer will be added?

Cheers,

Frank

zdenko podobny

Sep 19, 2014, 11:01:50 AM

to tesser...@googlegroups.com

Well yes and no ;-)

"Yes": there should be no change to the image. But "no": you need to expect that the PDF renderer may (re)compress the input image. See the comments on issue 1285 [1] for more details.

[1] https://code.google.com/p/tesseract-ocr/issues/detail?id=1285

Zdenko
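
For readers who want to check this on their own files, here is a minimal sketch in Python of how the original scan could be compared with the image embedded in the PDF. It is not part of tesseract. It assumes both images have already been decoded into rows of 8-bit grayscale values by some other tool, and the names `count_differing_pixels` and `rows_with_lost_ink` are hypothetical helpers invented for this illustration. The `tolerance` parameter is there because lossy recompression may change pixel values slightly without removing any content, whereas a vanished horizontal rule shows up as a whole row of dark pixels turning light.

```python
from typing import List, Sequence

# An image is modelled as rows of 8-bit grayscale values (0 = black, 255 = white).
# Decoding the TIFF and extracting the image from the PDF are assumed to have
# been done beforehand with a tool of your choice.
Image = Sequence[Sequence[int]]


def count_differing_pixels(original: Image, embedded: Image, tolerance: int = 0) -> int:
    """Count pixels whose grayscale values differ by more than `tolerance`."""
    same_height = len(original) == len(embedded)
    if not same_height or any(len(a) != len(b) for a, b in zip(original, embedded)):
        raise ValueError("images must have identical dimensions")
    return sum(
        1
        for row_a, row_b in zip(original, embedded)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > tolerance
    )


def rows_with_lost_ink(
    original: Image, embedded: Image, dark: int = 128, min_fraction: float = 0.5
) -> List[int]:
    """Return indices of rows where at least `min_fraction` of the pixels were
    dark in the original but are light in the embedded image."""
    lost = []
    for y, (row_a, row_b) in enumerate(zip(original, embedded)):
        vanished = sum(1 for a, b in zip(row_a, row_b) if a < dark <= b)
        if row_a and vanished / len(row_a) >= min_fraction:
            lost.append(y)
    return lost


if __name__ == "__main__":
    white, black = 255, 0
    # Toy 8x3 "scan" with one horizontal rule that is missing in the output.
    original = [[white] * 8, [black] * 8, [white] * 8]
    embedded = [[white] * 8, [white] * 8, [white] * 8]
    print(count_differing_pixels(original, embedded))  # 8
    print(rows_with_lost_ink(original, embedded))      # [1]
```

A count of zero with `tolerance=0` would indicate that the embedded image is pixel-identical to the input. A nonzero count together with an empty list from `rows_with_lost_ink` would be more consistent with recompression than with line removal.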
