Greetings,
On Friday, 2020-02-21 18:18:21 +0100, I myself wrote:
> ...
Sadly though, I didn't receive any answers. Searching further, I
eventually found
https://github.com/tesseract-ocr/tesseract/issues/660
which contains the developers' discussion leading to the new
configuration variable "textonly_pdf" (you'll need "tesseract" 4.*.* to
use it).
That page also contains examples showing how to use this option with
either "qpdf" or "pdftk". In my own experience, however, "qpdf" only
works if you do NOT resample the original TIFF files from 300 dpi to
150 dpi and merely convert them to JP2 with lossy compression. If you do
resample and use "qpdf", your PDF viewer will not correctly find the
text belonging to the area you highlight with the mouse. With "pdftk"
everything works as expected, because "pdftk" detects the different
widths and heights in pixels and rescales the overlaid file accordingly.
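The arithmetic behind the size mismatch can be sketched as follows. PDF
page sizes are measured in points (1 pt = 1/72 inch), so the same pixel
count yields a different page size depending on the resolution assumed.
"px_to_pt" is a hypothetical helper invented for this illustration, and
the A4 pixel counts are example values only; which resolution each tool
actually assumes for the resampled image is not established here.

```shell
# Page size in PDF points (1 pt = 1/72 inch) for a pixel count at a given dpi.
# px_to_pt is a hypothetical helper, not part of any of the tools mentioned.
px_to_pt() {
    awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.1f\n", px * 72 / dpi }'
}

px_to_pt 2480 300   # A4 width scanned at 300 dpi         -> 595.2
px_to_pt 1240 150   # same page resampled to 150 dpi      -> 595.2
px_to_pt 1240 300   # resampled pixels read as 300 dpi    -> 297.6
```

If the background and the text layer end up with different page sizes,
the text has to be rescaled to line up with the image.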
The code below assumes the current directory to be the ScanTailor pro-
ject's "out/" directory containing one TIFF file for every page scanned:
# neg=-negate   # Uncomment in case of light text on dark background.
for f in *.tif
do
    stm=${f%.tif}
    # Create smaller background image:
    convert "$f" -resample 150/150 -quality 40 jp2:- |
        img2pdf -o "$stm-b.pdf" -
    # Use black/white and optionally inverted image for OCR-ing
    # ($neg is deliberately unquoted so that it vanishes when unset):
    convert "$f" -threshold 70% $neg tif:- |
        tesseract - - -l deu --psm 1 -c textonly_pdf=1 pdf |
        pdftk "$stm-b.pdf" stamp - output "$stm-o.pdf"
done
# Concatenate the per-page PDFs and remove the intermediate files:
pdftk *-o.pdf cat output output.pdf
rm -f *-[bo].pdf
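For comparison, the "qpdf" variant (which, as said, only worked without
resampling) might look roughly like the sketch below. It rests on a few
assumptions: a "qpdf" version that supports "--overlay" (8.4 or later),
and the text layer being written to a real file rather than piped,
since "qpdf" is given file names here. The function name
"overlay_with_qpdf" and the "-t" file suffix are made up for this
example.

```shell
# overlay_with_qpdf BACKGROUND TEXT OUTPUT
# Hypothetical helper: lays the text-only PDF over the background PDF.
# Assumes qpdf >= 8.4, which introduced --overlay.
overlay_with_qpdf() {
    qpdf "$1" --overlay "$2" -- "$3"
}

# Usage for one page, with "$f" and "$stm" set as in the loop above
# (tesseract appends ".pdf" to the output base name):
#   convert "$f" -threshold 70% $neg tif:- |
#       tesseract - "$stm-t" -l deu --psm 1 -c textonly_pdf=1 pdf
#   overlay_with_qpdf "$stm-b.pdf" "$stm-t.pdf" "$stm-o.pdf"
```

With this variant the intermediate "-t" files would have to be removed
as well, for example with "rm -f *-[bot].pdf".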
One last word of warning though: If you're using "evince" as your PDF
viewer, you'll only see empty blue boxes when you highlight text using
the mouse. According to
https://gitlab.freedesktop.org/poppler/poppler/-/merge_requests/280
this has been traced to a "poppler" problem which still seems to be
open.
Sincerely,
Rainer