guessing viewport coordinates drives me nuts

Anonymous

May 13, 2015, 6:16:22 AM

After scanning an A3 booklet on a full-duplex scanner, I'm left with
an unreadable PDF (every other page flipped 90 degrees left and the
others 90 degrees right, all out of order).

So I use pdfpages with viewport to produce a readable A4 document.
The solution works, but it's quite unnerving to guess at coordinates
and recompile the document 100 times until it looks right.

I tried using the /overpic/ package to place a grid over the A3 page.
Of course the A3 sample page runs off the A4. And even when the page
size is A3, the grid coordinates run off the page. So I tried putting
the A3 sample image on an A2 page to try to get actual coordinates (to
avoid the hassle of scaling factors). But then the A2 ends up putting
the A3 image in a *quarter* section of it (so the A3 is being scaled
down to A4 anyway). The coordinates are completely unusable this way.

This example below shows the correct coordinates that took me many
guesses to get right. Note that the overpic page in the beginning was
no help at all. I would like to know what the problem with that is to
avoid all the guesswork in the future.

====8<----------------------------------------
\documentclass[a4paper]{minimal}
\usepackage[paper=A4,pagesize]{typearea}
\usepackage[pdftex]{geometry}
\usepackage{pdfpages}
\usepackage{ifthen}
\usepackage{overpic}

\newcommand{\pdffile}{my_a3_booklet.pdf}
\newcounter{pg}

\begin{document}

% configure layout for an A2 paper size with grid
% (will be removed after coords determined)
%
\KOMAoptions{paper=A2,pagesize}
\recalctypearea
\newgeometry{margin=0cm}

\begin{overpic}[scale=1.0,unit=1mm,grid,tics=5]{%
extracted_a3_sample_page}\end{overpic}

% restore layout for an A4 paper size
\clearpage
\KOMAoptions{paper=A4,pagesize}
\recalctypearea

\setcounter{pg}{1}

\whiledo{\value{pg}<6}{%
\includepdf[pages=\thepg, angle=90, viewport=40mm 16mm 260mm 160mm, clip]{\pdffile}
\addtocounter{pg}{1}
\includepdf[pages=\thepg, angle=-90, viewport=40mm 52mm 260mm 195mm, clip]{\pdffile}
\addtocounter{pg}{1}
}%

\setcounter{pg}{6}

\whiledo{\value{pg}>0}{%
\includepdf[pages=\thepg, angle=-90, viewport=40mm 260mm 260mm 405mm, clip]{\pdffile}
\addtocounter{pg}{-1}
\includepdf[pages=\thepg, angle=90, viewport=40mm 230mm 260mm 375mm, clip]{\pdffile}
\addtocounter{pg}{-1}
}%

\end{document}
====8<----------------------------------------
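
One possible explanation, offered as a sketch rather than a confirmed
diagnosis: as far as I can tell from the overpic documentation, the
package defaults to its percent mode, in which the grid is scaled so
that the longer side of the image is 100 units and the unit= key has
no effect. Loading the package with the abs option should make
unit=1mm count. The paper size and margins below are assumptions,
chosen only to leave room for the grid labels around an unscaled A3
page.

====8<----------------------------------------
% Sketch only: assumes overpic's [abs] option and an A3-sized sample page.
\documentclass{article}
% custom sheet: A3 (297mm x 420mm) plus a 20mm border for the grid labels
\usepackage[paperwidth=337mm,paperheight=460mm,margin=20mm]{geometry}
\usepackage[abs]{overpic}
\pagestyle{empty}

\begin{document}
\noindent
% no scale= key, so one grid unit should be 1mm on the original page
\begin{overpic}[unit=1mm,grid,tics=10]{extracted_a3_sample_page}
\end{overpic}
\end{document}
====8<----------------------------------------

The grid starts at the lower-left corner of the unrotated page, which
is also the origin that viewport measures from, so values read off the
grid should be usable directly as viewport=llx lly urx ury.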

Peter Flynn

May 13, 2015, 4:45:34 PM

On 05/13/2015 11:16 AM, Anonymous wrote:
> After scanning an A3 booklet on a full-duplex scanner, I'm left with
> an unreadable PDF (every other page flipped 90 degrees left and the
> others 90 degrees right, all out of order).

I wouldn't use PDF. I'd scan the pages all to PNG or JPG. If your
scanner produces PDF only, then run pdfimages to extract the images to
PPM and convert them to JPG.

That still doesn't solve the problem of positioning and scaling, but at
least you don't have to fight the silly PDF encapsulation.

///Peter

Nomen Nescio

May 14, 2015, 4:48:26 PM

So I could put the stack of A3 pages on the scanner, and tell it to
send me a separate bitmap for each side that it scans. I would lose
the OCR feature, and apparently have the headache of many documents in
the form of an e-mail with many attachments (and mutt can only act on
one attachment at a time, so manual labor involved). I'm also not
sure I could continue using a /while/ loop to iterate over pages and
apply a 90 degree angle on every other page and -90 degrees on the
others. So I only see disadvantages to that.

The scanner can also do multi-page TIFF files, but can the pdfpages
package handle that?

Peter Flynn

May 18, 2015, 3:49:43 PM

On 05/14/2015 09:48 PM, Nomen Nescio wrote:
>> On 05/13/2015 11:16 AM, Anonymous wrote:
>>> After scanning an A3 booklet on a full-duplex scanner, I'm left with
>>> an unreadable PDF (every other page flipped 90 degrees left and the
>>> others 90 degrees right, all out of order).
>>
>> I wouldn't use PDF. I'd scan the pages all to PNG or JPG. If your
>> scanner produces PDF only, then run pdfimage to extract the images
>> to PPM and convert them to JPG.
>>
>> That still doesn't solve the problem of positioning and scaling, but
>> at least you don't have to fight the silly PDF encapsulation.
>
> So I could put the stack of A3 pages on the scanner, and tell it to
> send me a separate bitmap for each side that it scans. I would lose
> the OCR feature,

The OP did not mention OCR. Why would you lose that? I do OCR from
bitmaps. I'm not familiar with any other way of doing OCR though...
maybe there is a way to do them from vector images?

> and apparently have the headache of many documents in
> the form of an e-mail with many attachments (and mutt can only act on
> one attachment at a time, so manual labor involved).

Email? Who mentioned email? I use a scanner attached to my workstation.
I load a stack of documents and press the button, and a bunch of PNGs
appear in the nominated directory. Email is not involved.

If you are using a remote scanner that sends them to you, install
procmail and divert them into a separate folder.

> I'm also not
> sure I could continue using a /while/ loop to iterate over pages and
> apply a 90 degree angle on every other page and -90 degrees on the
> others. So I only see disadvantages to that.

ImageMagick is your friend. A couple of lines of shell script should be
able to do that.

> The scanner can also do multi-page TIFF files, but can the pdfpages
> package handle that?

No, as its name implies, it will only handle PDFs, AFAIK. But I don't
see what's wrong with including the bitmap page images directly.
pdflatex is surely capable of compressing them when it creates its own
PDF, without the need to use pdfpages?
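
A minimal sketch of what direct inclusion could look like, keeping the
same /while/ loop and alternating angles as the original code. The
file names scan-1.png, scan-2.png, ... and the helper \scanpage are
made up for illustration, and no cropping is done here (viewport and
clip are also accepted by \includegraphics if needed).

====8<----------------------------------------
% Sketch only: assumes one bitmap per scanned side, named scan-<n>.png
\documentclass[a4paper]{article}
\usepackage[margin=0cm]{geometry}
\usepackage{graphicx}
\usepackage{ifthen}
\pagestyle{empty}
\newcounter{pg}

% \scanpage{angle}{page number}: one rotated scan per A4 page.
% totalheight (not height) is used because a rotated box can have depth;
% \raisebox{\depth} lifts the rotated image back onto the baseline.
\newcommand{\scanpage}[2]{%
  \noindent
  \raisebox{\depth}{\includegraphics[angle=#1,width=\paperwidth,
    totalheight=\paperheight,keepaspectratio]{scan-#2}}%
  \clearpage}

\begin{document}
\setcounter{pg}{1}
\whiledo{\value{pg}<6}{%
  \scanpage{90}{\thepg}\addtocounter{pg}{1}%
  \scanpage{-90}{\thepg}\addtocounter{pg}{1}%
}
\end{document}
====8<----------------------------------------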

///Peter

Axel Berger

May 19, 2015, 11:10:10 AM

Peter Flynn wrote on Mon, 15-05-18 21:49:

>The OP did not mention OCR. Why would you lose that?

Well, if the scanner can do OCR it should save the image and the text
in the same single file. For that PDF really is a good choice. The
reason I hate "scan to PDF" is that you mostly get no choice about
the embedded image format: it usually ends up as a low-resolution,
highly compressed JPEG with loads of artefacts, and still far bigger
than a B/W PNG, the optimal choice for text and line drawings.

Nomen Nescio

May 27, 2015, 2:50:48 PM

> The OP did not mention OCR.

I am the OP (who used a remailer that gives a different identity to
each post). Sorry for the confusion.

> Why would you lose that? I do OCR from bitmaps. I'm not familiar
> with any other way of doing OCR though... maybe there is a way to
> do them from vector images?

The scanner at hand can produce a "searchable PDF" -- that is, a PDF
container that includes the bitmaps along with the OCR'd text. Each
character maps to a precise position on the bitmap image and that
metadata is part of the PDF. So when pages are extracted or cropped,
the correct set of OCR'd text follows the bitmap. And I can always
get the text out separately if needed using /pdftotext/.

I also have a FOSS tool that can do OCR on a bitmap, but then I end up
with an image file and a text file, which is messy. When the bitmap
is inserted in a LaTeX document, for example, the text is lost. It's
more manageable to have a PDF container that encapsulates all
components of the document.

LaTeX does not seem to have a way to insert the OCR'd text into the PDF
for searching and extraction, apart from using the /attachfile/ package,
which (I think) forces a visible thumbtack icon into the document.
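
If the visible icon is the only objection, the /embedfile/ package may
be worth a look: as far as I know it attaches a file at document
level, so nothing is drawn on the page. This is a sketch under that
assumption; booklet_ocr.txt is a made-up file name, and the text is
carried along as an attached file, not as a searchable layer over the
scans.

====8<----------------------------------------
% Sketch only: assumes the OCR'd text was saved as booklet_ocr.txt
\documentclass[a4paper]{article}
\usepackage{embedfile}

\begin{document}
% attach the text file to the PDF without placing an icon on any page
\embedfile[desc={OCR text of the scanned booklet},
           mimetype=text/plain]{booklet_ocr.txt}
Scanned pages go here.
\end{document}
====8<----------------------------------------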

> Email? Who mentioned email? I use a scanner attached to my
> workstation. I load a stack of documents and press the button. I
> get a bunch of PNGs appear in the nominated directory. Email is not
> involved.

The scanner at hand e-mails the scans, apparently forced by the admin's
configuration.

> If you are using a remote scanner that sends them to you, install
> procmail and divert them into a separate folder.

Procmail is another problem. Procmail's big weakness is its inability to
recognize and manipulate MIME attachments. I once tried to write a
procmail recipe to use third-party MIME tools and it was a disaster.
Clearly it's nontrivial.

> > I'm also not sure I could continue using a /while/ loop to iterate
> > over pages and apply a 90 degree angle on every other page and -90
> > degrees on the others. So I only see disadvantages to that.
>
> ImageMagick is your friend. A couple of lines of shell script should
> be able to do that.

The original document is a legal document in the form of an A3
booklet. The beauty of the code I posted is that the PDF internally
contains the original scans as-is, with no butchering. Yet it uses
viewport to give a nice upright A4 presentation.

Using ImageMagick in the way you suggest butchers the original scans.
Would a court be anal about that? I'd rather not risk it. Using
viewport makes it possible to extract the original A3 images later if
needed, exactly as they were scanned.