New user's questions

Dr Rainer Woitok

Feb 21, 2020, 2:07:36 PM
to tesser...@googlegroups.com
Greetings,

after playing a while with "tesseract" and after having read plenty of
manual pages and documentation on the web I still have some questions.
I want to create a PDF file with an OCR layer, but:

1. Some of my TIFF files created by "ScanTailor" have light text on dark
background, and the documentation says to manually invert such files
before feeding them to current "tesseract" versions. But of course I
want the PDF file to contain the original document with light text on
dark background.

2. According to the documentation the TIFF file for OCR-ing should have
at least 300 dpi. But for the background image within the final PDF
document I'd like to use a JP2 file with only 150 dpi and a high
compression rate.

So is it possible to pass "tesseract" a high quality image for OCR-ing
and a lesser quality image for building the PDF file with?

Sincerely,
Rainer

Dr Rainer Woitok

Mar 9, 2020, 10:43:37 AM
to tesser...@googlegroups.com
Greetings,

On Friday, 2020-02-21 18:18:21 +0100, I myself wrote:

> ...
Sadly though, I didn't receive any answers. Searching further, I
eventually found

https://github.com/tesseract-ocr/tesseract/issues/660

which contains the developers' discussion leading to the new configuration
variable "textonly_pdf" (you'll need "tesseract" 4.*.* to use it).
That page also contains examples which explain how to use this option
with either "qpdf" or "pdftk". However, in my own experience "qpdf"
only works if you do NOT resample the original TIFF files from 300 dpi
to 150 dpi but merely convert them to JP2 with lossy compression. If
you do resample and use "qpdf", your PDF viewer will not correctly find
the text associated with the area you highlight with the mouse. With
"pdftk" everything works as expected, because "pdftk" detects the
different widths and heights in pixels and rescales the overlaid file
accordingly.
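
A minimal sketch of the arithmetic that may be behind this, assuming
(not verified here) that the mismatch shows up when the background PDF
and the text-only PDF end up with different page sizes. A PDF page is
measured in points (1/72 inch), so its size follows from the pixel count
and the recorded resolution. The helper name "page_points" is made up
for illustration:

```shell
# page_points PIXELS DPI: print the page length in PDF points (1/72 inch),
# rounded to the nearest integer. Hypothetical helper for illustration.
page_points() {
    echo $(( ($1 * 72 + $2 / 2) / $2 ))
}

page_points 2480 300   # A4 width scanned at 300 dpi      -> 595
page_points 1240 150   # same page resampled to 150 dpi   -> 595
page_points 1240 72    # same pixels, density read as 72  -> 1240
```

As long as the resolution survives the conversion, halving the pixels
and the dpi together leaves the page size unchanged. If it gets lost or
is read differently along the way, the two PDFs no longer match.
Comparing the "Page size" lines which "pdfinfo" reports for both files
should show whether that is the case.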

The code below assumes the current directory to be the ScanTailor
project's "out/" directory containing one TIFF file for every page scanned:

# neg=-negate    # Uncomment in case of light text on dark background.

for f in *.tif
do  stm=${f%.tif}

    # Create smaller background image:
    convert "$f" -resample 150/150 -quality 40 jp2:- |
    img2pdf -o "$stm-b.pdf" -

    # Use black/white and optionally inverted image for OCR-ing
    # ($neg is deliberately unquoted so that it vanishes when unset):
    convert "$f" -threshold 70% $neg tif:- |
    tesseract - - -l deu --psm 1 -c textonly_pdf=1 pdf |
    pdftk "$stm-b.pdf" stamp - output "$stm-o.pdf"
done

# Concatenate the per-page results and remove the intermediate files:
pdftk *-o.pdf cat output output.pdf
rm -f *-[bo].pdf
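
Since "textonly_pdf" needs "tesseract" 4.*.*, a small preflight check
before running the loop might look like the sketch below. The function
name "tess_major" is made up for illustration, and the parsing assumes
that the first line of "tesseract --version" looks like
"tesseract 4.1.1", possibly with a leading "v" before the number:

```shell
# tess_major LINE: print the major version found in a line such as
# "tesseract 4.1.1" (hypothetical helper, not part of tesseract).
tess_major() {
    v=${1#tesseract }   # drop the leading program name
    v=${v#v}            # drop a leading "v", if there is one
    echo "${v%%.*}"     # keep everything before the first dot
}

# Usage (the check is skipped if tesseract is not installed):
if command -v tesseract >/dev/null 2>&1
then line=$(tesseract --version 2>&1 | head -n 1)
     [ "$(tess_major "$line")" -ge 4 ] ||
         echo "textonly_pdf needs tesseract 4 or later" >&2
fi
```

The "2>&1" is there because older versions print the version
information to standard error rather than standard output.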

One last word of warning though: If you're using "evince" as your PDF
viewer, you'll only see empty blue boxes when you highlight text using
the mouse. According to

https://gitlab.freedesktop.org/poppler/poppler/-/merge_requests/280

this has been tracked down to a "poppler" problem which still seems to
be open.

Sincerely,
Rainer