There is a lengthy side discussion that is appropriate to move
back here. I've been asked to elaborate on what I mean by image
extraction.
There are two ways to turn a PDF file into images. One is to
render it, for example using a tool like pdftoppm. This is great
if there are things like fonts involved.
But far better, for bag-of-images PDF files, such as those produced
by certain scanning machines, is to crack open the bag and
take out the images. This guarantees no rescaling, no loss
of image information, and no (possibly space-inefficient) format
conversions.
Tools for image extraction are not very common, but judging from
its name, podofoimgextract does it. For a fairly limited set of
formats, so does pdfimages from poppler-utils. The best-case
scenario is image extraction with no transcoding whatsoever. That's
not always possible (especially when dealing with really fancy formats
like JBIG2), but it should be fine for PDF files produced by a scanner,
and also for any PDF files produced by Tesseract.
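
To make the difference between the two approaches concrete, here is a minimal Python sketch that builds the two command lines side by side. It assumes poppler-utils is installed and recent enough that pdfimages accepts `-all` (write embedded images in their native format where it can); the function names, the 300 dpi default, and the file names in the usage comment are placeholders of mine, not anything from the discussion above.

```python
import shutil
import subprocess


def render_command(pdf, out_prefix, dpi=300):
    """Build a pdftoppm command that rasterizes each page to PNG at `dpi`.

    Rendering resamples everything on the page, so embedded scans
    generally do not come out at their original resolution or encoding.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def extract_command(pdf, out_prefix):
    """Build a pdfimages command that pulls the embedded images out.

    Assumption: the installed pdfimages supports -all, which keeps
    formats such as JPEG, JPEG2000, JBIG2 and CCITT as they are and
    falls back to PNG or TIFF for the rest.
    """
    return ["pdfimages", "-all", str(pdf), str(out_prefix)]


def run(command):
    """Run one of the commands above, failing clearly if the tool is missing."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(command[0] + " not found on PATH")
    subprocess.run(command, check=True)


# Usage (hypothetical file names):
#   run(extract_command("scan.pdf", "img"))      # writes img-000.*, img-001.*, ...
#   run(render_command("scan.pdf", "page", 150)) # writes one PNG per page
```

Running `pdfimages -list scan.pdf` first is a reasonable way to see which encodings a file actually contains before deciding whether extraction without transcoding is realistic.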