My gscan2pdf application sorts all of that out automatically. It also
runs Tesseract for the OCR and embeds the output behind the image in
the PDF.
gscan2pdf is a GUI and at the moment cannot be controlled from the
command line in the manner you want. It is open source.
The command line tools you are looking for are pdfimages, to extract
the images, and then ImageMagick, to convert them to TIFF.
It is certainly better to retain the resolution of the images, rather
than down- or upsampling them, as information can only be lost that
way. The above technique extracts the images at the resolution with
which they were embedded.
If you run ImageMagick directly on the PDF instead, it renders the
pages through Ghostscript and resamples the result.
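As a rough sketch of the pdfimages-then-ImageMagick route described above, something like the following Python wrapper could work. The function names, the "page" output prefix and the .tif suffix are my own choices for illustration. It assumes pdfimages (from poppler-utils or xpdf) and ImageMagick's convert are on the PATH; on ImageMagick 7 the command may be "magick" instead.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    """Build the pdfimages call that dumps the embedded images as-is."""
    # With no format options, pdfimages writes <prefix>-NNN.ppm/.pbm files
    # at the resolution at which the images were embedded.
    return ["pdfimages", str(pdf_path), str(prefix)]


def convert_command(image_path):
    """Build the ImageMagick call that converts one extracted image to TIFF."""
    image_path = Path(image_path)
    # Converting the raster file (not the PDF) avoids the Ghostscript
    # rendering step, so no resampling takes place.
    return ["convert", str(image_path), str(image_path.with_suffix(".tif"))]


def extract_images_as_tiff(pdf_path, out_dir):
    """Run both steps and return the list of TIFF paths that were written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, out_dir / "page"), check=True)
    tiffs = []
    # "p?m" matches the .pbm, .pgm and .ppm files pdfimages may produce.
    for image in sorted(out_dir.glob("page-*.p?m")):
        command = convert_command(image)
        subprocess.run(command, check=True)
        tiffs.append(Path(command[-1]))
    return tiffs
```

Calling extract_images_as_tiff("scan.pdf", "out") should leave one TIFF per embedded image in out/, ready to be passed to the OCR engine.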
> After I extract the image this way and OCR it, is there a simple way for me
> to place the text into the original PDF file?
Not really. If you can hack a bit of Perl, you could take the routine
from gscan2pdf - from that point of view it isn't hard, but I don't
know of another tool that does it.
> Are you planning on providing a gscan2pdf command line interface any time
> soon?
I have thought about it, but it isn't anywhere near the top of my todo
list at the moment.
As annotations are not indexed by Beagle, gscan2pdf also simply embeds
the text as plain text behind the image.
It just uses the PDF::API2 Perl module, so if you know Perl, yes, it is
very simple.
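To illustrate the idea of plain text sitting behind the image, here is a minimal sketch of the page content stream involved. It is not the gscan2pdf routine and does not use PDF::API2. It is written in Python only to show the drawing order, and the resource names F1 and Im0, the font size and the text position are assumptions. It also assumes the text is plain ASCII.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def page_content_stream(text, width, height, font="F1", image="Im0", size=10):
    """Return page content that draws the text first, then the image on top."""
    lines = [
        "BT",                                # begin text object
        f"/{font} {size} Tf",                # select font resource and size
        f"0 {height - size} Td",             # move to the top-left of the page
        f"({escape_pdf_string(text)}) Tj",   # show the OCR text
        "ET",                                # end text object
        "q",                                 # save graphics state
        f"{width} 0 0 {height} 0 0 cm",      # scale the image to the page
        f"/{image} Do",                      # paint the image over the text
        "Q",                                 # restore graphics state
    ]
    return "\n".join(lines)
```

Because later operators paint over earlier ones, the text ends up hidden behind the scan but should still be there for search and indexing tools to extract. The font and image resources named here would still have to be defined in the page's resource dictionary by whatever PDF library writes the file.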