> The OP did not mention OCR.
I am the OP (who used a remailer that gives a different identity to
each post). Sorry for the confusion.
> Why would you lose that? I do OCR from bitmaps. I'm not familiar
> with any other way of doing OCR though... maybe there is a way to
> do them from vector images?
The scanner at-hand can produce a "searchable PDF" -- that is, a PDF
container that includes the bitmaps along with the OCR'd text. Each
character maps to a precise position on the bitmap image and that
metadata is part of the PDF. So when pages are extracted or cropped,
the correct set of OCR'd text follows the bitmap. And I can always
get the text out separately if needed using /pdftotext/.
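For illustration, here is a minimal sketch of pulling the text layer
out of such a PDF from a script. It assumes Python 3 and that the
poppler /pdftotext/ binary is on the PATH; the function names and the
file name in the usage note are hypothetical, not something from my
setup.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext", "-layout"]      # -layout keeps the physical page layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    cmd += [pdf_path, "-"]              # "-" sends the text to stdout
    return cmd


def extract_text(pdf_path, first_page=None, last_page=None):
    """Return the embedded (OCR'd) text of a searchable PDF as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page, last_page),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Something like extract_text("scan.pdf", 1, 2) would then return the
text of the first two pages, provided the scanner really did embed a
text layer.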
I also have a FOSS tool that can do OCR on a bitmap, but then I end up
with an image file and a text file, which is messy. When the bitmap
is inserted in a LaTeX document, for example, the text is lost. It's
more manageable to have a PDF container that encapsulates all
components of the document.
LaTeX does not seem to have a way to insert the OCR'd text into the
PDF for searching and extraction, apart from using the /attachfile/
package, which (I think) forces a visible thumbtack icon into the
document.
> Email? Who mentioned email? I use a scanner attached to my
> workstation. I load a stack of documents and press the button. I
> get a bunch of PNGs appear in the nominated directory. Email is not
> involved.
The scanner at-hand e-mails the scans, apparently forced by the
admin's configuration.
> If you are using a remote scanner that sends them to you, install
> procmail and divert them into a separate folder.
Procmail is another problem. Procmail's big weakness is its inability to
recognize and manipulate MIME attachments. I once tried to write a
procmail recipe to use third-party MIME tools and it was a disaster.
Clearly it's nontrivial.
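One way around that might be to leave the MIME handling to a script
and have procmail do nothing more than pipe the whole message to it.
The following is only a sketch, assuming Python 3 and its standard
/email/ module, and assuming the scanner sends the scan as an
application/pdf attachment (I have not checked what it actually
sends). The function name and the fallback file name are made up for
the example.

```python
import email
import os
from email import policy


def save_pdf_attachments(raw_message, out_dir):
    """Save every application/pdf attachment of a raw message to out_dir.

    raw_message is the complete message as bytes; the return value is
    the list of file paths that were written.
    """
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    saved = []
    for index, part in enumerate(msg.iter_attachments()):
        if part.get_content_type() != "application/pdf":
            continue  # skip anything that is not a PDF
        # Strip any directory part so the sender cannot choose the path.
        name = os.path.basename(part.get_filename() or f"scan-{index}.pdf")
        path = os.path.join(out_dir, name)
        with open(path, "wb") as handle:
            # decode=True undoes the base64 transfer encoding
            handle.write(part.get_payload(decode=True))
        saved.append(path)
    return saved
```

A small wrapper that reads the message with sys.stdin.buffer.read()
and passes it to this function could then be the target of a procmail
pipe action, so the recipe itself never has to understand MIME.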
> > I'm also not sure I could continue using a /while/ loop to iterate
> > over pages and apply a 90 degree angle on every other page and -90
> > degrees on the others. So I only see disadvantages to that.
>
> ImageMagick is your friend. A couple of lines of shell script should
> be able to do that.
The original document is a legal document in the form of an A3
booklet. The beauty of the code I posted is that the PDF internally
contains the original scans as-is, with no butchering. Yet it uses
viewport to give a nice A4 upright presentation.
Using ImageMagick in the way you suggest butchers the original scans.
Would a court be anal about that? I'd rather not risk it. Using
viewport makes it possible to extract the original A3 images later if
needed, exactly as they were scanned.