Now, if I print such scanned pages and rescan using some of the
conversion software I can get a rough text version (lots of errors but
basically useable). It strikes me that there must be a simpler way of
taking a PDF document that is a picture of a page and converting it to
text, but darned if I know how!
Any guidance out there?

OCR programs work with bitmap files, so you don't need to scan the
printed version. Just extract the images (e.g. with pdfimages) and convert
them to an acceptable format (e.g. TIFF) with ImageMagick.
$ pdfimages file.pdf image
(this produces image-000.ppm, image-001.ppm, image-002.ppm, etc, or
perhaps .jpg, depending on what pdfimages finds in there)
$ for f in image-*.ppm; do convert "$f" "${f%.ppm}.tif"; done
(or something similar, depending).
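
If you are not sure which extension pdfimages will pick, a slightly more
defensive version of that loop might look like the sketch below. The
function names tiff_name and convert_all are made up for illustration,
and the list of extensions is an assumption about what pdfimages writes
for your file; it also assumes ImageMagick's convert is on your PATH.

```shell
#!/bin/sh
# Hypothetical helper: swap whatever extension the file has for .tif.
tiff_name() {
    printf '%s\n' "${1%.*}.tif"
}

# Hypothetical helper: convert every extracted image in the current
# directory. The extensions listed are an assumption; adjust as needed.
convert_all() {
    for f in image-*.ppm image-*.pbm image-*.jpg; do
        [ -e "$f" ] || continue    # skip patterns that matched nothing
        convert "$f" "$(tiff_name "$f")"
    done
}
```

Run pdfimages first, then call convert_all in the same directory; the
resulting .tif files are what you feed to the OCR program.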
///Peter