Dear all,
I have been testing tesseract to embed an OCR text layer in scanned PDF documents, and it works phenomenally well at recognizing the text.
However, I noticed one slightly disturbing issue by chance when comparing the original input image with the PDF file: a number of straight lines that are present in the input image have disappeared completely in the PDF (some of them are horizontal rules, others are lines in a logo). Since I want to use tesseract to produce completely unmodified documents with only the OCR text layer added, this is a problem for me. I have uploaded a test image for this to
http://cern.ch/fsiegert/tmp/tesseract-test.tif and here is the command I used on it:
$ tesseract -l deu tesseract-test.tif tesseract-test pdf
Tesseract Open Source OCR Engine v3.03 with Leptonica
OSD: Weak margin (6.96) for 162 blob text block, but using orientation anyway: 1
$ tesseract --version
tesseract 3.03
leptonica-1.71
libgif 5.1.0 : libjpeg 8d : libpng 1.6.12 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.1
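
In case it helps with reproducing this, here is a rough sketch of how the comparison could be automated instead of done by eye. It assumes the input image and a rasterized page of the output PDF are already available as grayscale pixel rows of the same size (loading the TIFF and rasterizing the PDF are not shown). The function name and the thresholds are my own choices for illustration and have nothing to do with tesseract itself.

```python
def missing_dark_rows(original, rendered, threshold=128, min_fraction=0.5):
    """Return indices of rows that look like a horizontal rule in `original`
    but not in `rendered`.

    Both arguments are lists of rows of grayscale values
    (0 = black, 255 = white) with identical dimensions.
    """
    if len(original) != len(rendered):
        raise ValueError("images must have the same height")

    def dark_fraction(row):
        # Share of pixels in the row that are darker than the threshold.
        return sum(1 for p in row if p < threshold) / len(row) if row else 0.0

    missing = []
    for y, (row_a, row_b) in enumerate(zip(original, rendered)):
        if len(row_a) != len(row_b):
            raise ValueError("images must have the same width")
        # A row counts as a lost rule if it is mostly dark in the original
        # but no longer mostly dark in the rendered page.
        if dark_fraction(row_a) >= min_fraction and dark_fraction(row_b) < min_fraction:
            missing.append(y)
    return missing


# Tiny synthetic example: a 4-row "page" with a rule on row 1 that the
# rendered version has lost.
white = [255] * 6
rule = [0] * 6
original = [white, rule, white, white]
rendered = [white, white, white, white]
print(missing_dark_rows(original, rendered))
```

This only catches horizontal rules that span at least half of the page width; the lines in the logo would need a different check.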
Cheers,
Frank
PS: I have removed much of the text from the test image for privacy reasons, but the same happens with the complete document, including all of its text.