
pdf180 question


JohnF

Jun 29, 2015, 6:57:49 AM
I've been scanning some notebook pages as pdf's,
and (without some acrobatic contortions on my part)
the odd-numbered pages are scanned upside-down.
pdf180 (running on linux) does a fine job fixing that.
No actual problem, but a minor oddity that I'm wondering
if someone can explain.
When I use acroread to view the pages, the even-numbered
ones (not pdf180'ed) are displayed at 107%, but the pdf180'ed
pages are displayed 63.8%. If I use acroread itself to take
an original odd-numbered page (displayed at 107%) and just
"rotate clockwise" twice and then "save a copy", that copy
is displayed at 107%.
Of course, I can tell acroread to display the 63.8% pages
at 107%, but when scrolling through pages, you initially see
one large, one small, etc. Is there any way to get pdf180
to behave in whatever way necessary so that its output rotated
page is displayed at the same size as the original input page?
(Note: gv displays all pages same size, but acroread displays
these scans with >>much<< better quality (and I do mean >>much<<).)
--
John Forkosh ( mailto: j...@f.com where j=john and f=forkosh )

Axel Berger

Jun 30, 2015, 12:10:24 AM
JohnF wrote on Mon, 15-06-29 12:57:
>I've been scanning some notebook pages as pdf's,

No you haven't. You've scanned to some unspecified, most probably
ill-chosen and substandard, raster format and some unknown software has
wrapped a PDF mantle around it and made some more ill-chosen changes
while at it.
If you allow others to make all your choices for you, you lose the
right to complain.

JohnF

Jun 30, 2015, 2:40:30 AM
Axel Berger <Axel_...@b.maus.de> wrote:
> JohnF wrote on Mon, 15-06-29 12:57:
>>I've been scanning some notebook pages as pdf's,
>
> No you haven't. You've scanned to some unspecified, most probably
> ill-chosen and substandard, raster format and some unknown software has
> wrapped a PDF mantle around it and made some more ill-chosen changes
> while at it.
> If you allow others to make all your choices for you,
> you lose the right to complain.

But then I shouldn't let you choose for me whether or not
I have the right to complain!!! :)
Moreover, I'd guess the US Supreme Court would uphold that
right under the First Amendment. Maybe Forkosh v. Berger will
be cited for centuries to come.

More saliently, it's a Brother MFC-J6920DW doing the scanning,
driven by Xsane under linux. I'd have thought it's Xsane wrapping
your pdf mantle, but the Brother can scan pdf to a memory stick
without any pc at all. So I'm not sure who's doing what to whom.
But I'm pretty sure your overall remark (sans the complaining part)
is correct...

...I hadn't originally mentioned that I'd diff -a'ed several
original scanned files versus their pdf180'd rotated counterparts,
just to see if I could intuit the problem. The files are ~300K
each, but the diff's are only ~160 mostly ascii lines (but there
are a few unprintable binary "lines"). So it's clearly mostly
wrapper-type stuff that's being manipulated. Below is a typical
diff with the binary stuff elided. Can you suggest what might be
done to fix the files vis-a-vis my complaint? I can easily write
a short C program to make any such straightforward changes.

diff -a w11pg0087.pdf w11pg0087-rotated180.pdf |less
2,13c2,3
<
< 1 0 obj
< << /Type /Catalog
< /Outlines 2 0 R
< /Pages 3 0 R
< >>
< endobj
<
< 2 0 obj
< << /Type /Outlines
< /Count 0
< >>
---
> 5 0 obj
> << /S /GoTo /D [6 0 R /Fit ] >>
15,31c5,12
<
< 3 0 obj
< << /Type /Pages
< /Kids [
< 6 0 R
< ]
< /Count 1
< >>
< endobj
<
< 6 0 obj
< << /Type /Page
< /Parent 3 0 R
< /MediaBox [0 0 653 862]
< /Contents 7 0 R
< /Resources << /ProcSet 8 0 R >>
< >>
---
> 12 0 obj <<
> /Length 145
> /Filter /FlateDecode
> >>
> stream
**** binary stuff elided ****
33,35c14,34
<
< 7 0 obj
< << /Length 288837 >>
---
> 6 0 obj <<
> /Type /Page
> /Contents 12 0 R
> /Resources 11 0 R
> /MediaBox [0 0 652.9984 841.8898]
> /Parent 15 0 R
> >> endobj
> 10 0 obj <<
> /Type /XObject
> /Subtype /Form
> /FormType 1
> /PTEX.FileName (/tmp/pdfjam-mSpTMK/file1/source-1.pdf)
> /PTEX.PageNumber 1
> /PTEX.InfoDict 16 0 R
> /Matrix [1.00000000 0.00000000 0.00000000 1.00000000 0.00000000 0.00000000]
> /BBox [0.00000000 0.00000000 653.00000000 862.00000000]
> /Resources <<
> /ProcSet [ /PDF ]
> >>
> /Length 288837
> >>
1103,1113c1102,1108
<
< 8 0 obj
< [/PDF]
< endobj
<
< 9 0 obj
< << /Title (XSane scanned image)
< /Creator (XSane version 0.996 (sane 1.0) - by Oliver Rauch)
< /Producer (XSane 0.996)
< /CreationDate (D:20150630034026+00'00')
< >>
---
> 16 0 obj
> <<
> /Title (XSane scanned image)
> /Creator (XSane version 0.996 \(sane 1.0\) - by Oliver Rauch)
> /Producer (XSane 0.996)
> /CreationDate (D:20150630034026+00'00')
> >>
1115c1110,1146
<
---
> 13 0 obj <<
> /D [6 0 R /XYZ 133.7684 741.9944 null]
> >> endobj
> 14 0 obj <<
> /D [6 0 R /XYZ 133.7684 717.0878 null]
> >> endobj
> 11 0 obj <<
> /XObject << /Im4 10 0 R >>
> /ProcSet [ /PDF ]
> >> endobj
> 15 0 obj <<
> /Type /Pages
> /Count 1
> /Kids [6 0 R]
> >> endobj
> 17 0 obj <<
> /Names [(Doc-Start) 14 0 R (page.1) 13 0 R]
> /Limits [(Doc-Start) (page.1)]
> >> endobj
> 18 0 obj <<
> /Kids [17 0 R]
> >> endobj
> 19 0 obj <<
> /Dests 18 0 R
> >> endobj
> 20 0 obj <<
> /Type /Catalog
> /Pages 15 0 R
> /Names 19 0 R
> /PageMode /UseOutlines
> /OpenAction 5 0 R
> >> endobj
> 21 0 obj <<
> /Author()/Title()/Subject()/Creator(LaTeX with hyperref package)/Producer(pdfeTeX-1.21a)/Keywords()
> /CreationDate (D:20150630005542-04'00')
> /PTEX.Fullbanner (This is pdfeTeX, Version 3.141592-1.21a-2.2 (Web2C 7.5.4) kpathsea version 3.5.4)
> >> endobj
1117,1122c1148,1157
< 0 10
< 0000000000 65535 f
< 0000000010 00000 n
< 0000000094 00000 n
< 0000000153 00000 n
< 0000000000 00000 f
---
> 0 22
> 0000000001 65535 f
> 0000000002 00000 f
> 0000000003 00000 f
> 0000000004 00000 f
> 0000000007 00000 f
> 0000000009 00000 n
> 0000000281 00000 n
> 0000000008 00000 f
> 0000000009 00000 f
1124,1128c1159,1170
< 0000000256 00000 n
< 0000000412 00000 n
< 0000289312 00000 n
< 0000289339 00000 n
<
---
> 0000000398 00000 n
> 0000289894 00000 n
> 0000000057 00000 n
> 0000289772 00000 n
> 0000289833 00000 n
> 0000289961 00000 n
> 0000289595 00000 n
> 0000290019 00000 n
> 0000290116 00000 n
> 0000290153 00000 n
> 0000290189 00000 n
> 0000290296 00000 n
1130,1133c1172,1177
< << /Size 10
< /Root 1 0 R
< /Info 9 0 R
< >>
---
> <<
> /Size 22
> /Root 20 0 R
> /Info 21 0 R
> /ID [<1621A227D80F6660C2FA9D53E2766670> <1621A227D80F6660C2FA9D53E2766670>]
> >>
1135c1179
< 289538
---
> 290558
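
One detail worth noting in the diff: the original /MediaBox is
[0 0 653 862], but the rotated file's is [0 0 652.9984 841.8898].
Since 841.89 pt is the height of A4 paper, it looks as if pdf180
(via pdfjam) re-set the page on its default papersize, which would
explain acroread choosing a different zoom for the two files. A
minimal sketch for pulling page boxes out of such simply-structured
PDFs (the naive regex assumes uncompressed, plainly-written object
syntax, as in these XSane/pdfjam files; it would miss page
dictionaries stored inside compressed object streams):

```python
# Sketch: pull /MediaBox entries out of a PDF to compare page sizes.
# NOTE: only works on uncompressed, plainly-written PDFs like these;
# general PDFs may hide page dictionaries in compressed streams.
import re

def media_boxes(pdf_bytes):
    """Return every /MediaBox [x0 y0 x1 y1] found, as float 4-tuples."""
    pat = rb"/MediaBox\s*\[\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*\]"
    return [tuple(float(v) for v in m.groups())
            for m in re.finditer(pat, pdf_bytes)]

if __name__ == "__main__":
    # the two filenames from the diff above
    for name in ("w11pg0087.pdf", "w11pg0087-rotated180.pdf"):
        try:
            with open(name, "rb") as f:
                print(name, media_boxes(f.read()))
        except FileNotFoundError:
            pass
```

Running it over the pair above would print both sets of page boxes
side by side for comparison.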

Thanks,

Axel Berger

Jun 30, 2015, 8:10:23 AM
JohnF wrote on Tue, 15-06-30 08:40:
>Can you suggest what might be done to fix the files vis-a-vis my
>complaint?

As I dislike those ready-made single-keypress solutions I tend to do
it step by step myself (and then write my own single-click script for
the process).
So my first step when confronted with your file would be to use
pdfimages from the XPDF bundle to extract the hidden images. As you
don't want them changed (unless to and from lossless compressions that
are fully reversible) don't forget the "-j" option.

From there on you might cut off black borders (if present); enhance
greyscale so it runs from true black to white instead of from a darker
middle grey to a slightly lighter middle grey (I often endure
presentations where this has not been done; enhancing colour is
trickier and can lead to unexpected results); convert to a sensible
compressed format, often reducing the colour space as a first step
(the change to 256 colours is often invisible); and then bundle into
a PDF. For consistent trimming of margins, disciplined and precise
positioning while scanning is a must; if omitted, you have the choice
of pages all over the place or hand-trimming all pages individually
(don't even think about it, it's cruel and unusual).

Axel

JohnF

Jun 30, 2015, 10:05:09 PM
Axel Berger <Axel_...@b.maus.de> wrote:
> JohnF wrote on Tue, 15-06-30 08:40:
>>Can you suggest what might be done to fix the files vis-a-vis my
>>complaint?
>
> As I dislike those ready made single keypress solutions I tend to do a
> step by step myself (and then write my own single klick script for the
> process).

Yeah, that's what I did to use pdf180 (manually tweaked a procedure
and then automated it), but with a ~25-line C program for a script
that just uses system() to execute constructed commands. It just
mv's odd-numbered pages to an originals/ subdirectory, runs pdf180
against each scan file in originals/ with --outfile ../, and finally
mv/renames the output to remove the -rotated180 suffix. I'd also
tried running pdf180 with --papersize and some other switches,
which resulted in some diff's but didn't affect the behavior
I was complaining about.
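
A rough sketch of that little driver (Python here rather than the
C-plus-system() original; the pgNNNN filename pattern is a guess
based on the diff'ed files earlier, and pdf180 is assumed on PATH):

```python
# Sketch of the batch-rotate workflow described above, assuming scan
# files named like w11pg0087.pdf (4-digit page number, hypothetical
# pattern) and the pdfjam pdf180 script available on PATH.
import os
import re
import shutil
import subprocess

PAGE_RE = re.compile(r"pg(\d+)\.pdf$")

def is_odd_page(name):
    """True if the filename carries an odd page number."""
    m = PAGE_RE.search(name)
    return bool(m) and int(m.group(1)) % 2 == 1

def rotated_name(name):
    """pdf180 appends '-rotated180' before '.pdf'; undo that."""
    return name.replace("-rotated180.pdf", ".pdf")

def rotate_odd_pages(scan_dir):
    """Move odd pages aside, rotate them, rename output into place."""
    orig = os.path.join(scan_dir, "originals")
    os.makedirs(orig, exist_ok=True)
    for name in sorted(os.listdir(scan_dir)):
        if is_odd_page(name):
            shutil.move(os.path.join(scan_dir, name), orig)
            subprocess.run(
                ["pdf180", os.path.join(orig, name),
                 "--outfile", scan_dir],
                check=True)
            out = name.replace(".pdf", "-rotated180.pdf")
            os.rename(os.path.join(scan_dir, out),
                      os.path.join(scan_dir, rotated_name(out)))
```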

> So my first step when confronted with your file would be to use
> pdfimages from the XPDF bundle to extract the hidden images. As you
> don't want them changed (unless to and from lossless compressions that
> are fully reversible) don't forget the "-j" option.

That -j switch saves output as jpegs according to the man page,
and I believe that's not lossless. gif would be lossless, but not jpg.
Anyway, I'd originally experimented before deciding to "save as pdf"
(in quotes as per your original wrapper remarks). The printer has
a bunch of options, jpg being one, and I tried scanning a few pages
in each available format and viewing the output. I can't say why,
but pdf was obviously and noticeably the clearest and most readable.
So pdfimages would just take the already-saved pdf and convert
it back to one of the formats I could have saved "natively".
I'd actually tried that using Imagemagick convert during original
experimentation (just to see what would happen because I was clueless
how the scanner actually saves stuff), and it indeed looks even worse
than the native jpg. But I still have all those test files, and will
try pdfimages (again, just to see what happens).
Aside: regarding your original remarks about a native scanner raster
format, each scan leaves a separate file in /tmp/ with names like
574794 Jun 30 03:50 brscan_jpeg_PAGE1_0hFq9t
469321 Jun 30 04:09 brscan_jpeg_PAGE1_0vPFsF
466243 Jun 30 04:36 brscan_jpeg_PAGE1_1V6wiY
459293 Jun 30 04:14 brscan_jpeg_PAGE1_1axovp
They're larger than the pdf's, which are typically ~250-325K.
I can't make sense of the hash suffix, and despite that "jpeg" in the
name, that's not what they are. Looks more like what you said, some kind
of raster format. Here's the first few "lines" from one. No typical
header/signature/magic-number/etc at the beginning that I can see...
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
00 42 07 00 01 00 84 00 00 00 00 46 00 81 ff da ff B.........F..???
10 00 fe ee 00 00 04 f9 00 00 01 f8 00 00 20 f9 00 .??...?...?.. ?.
20 04 0c 00 1f f0 e0 fd 00 01 01 80 f9 00 03 70 00 ....???....?..p.
30 00 03 fc 00 00 0f f9 00 01 01 c0 f0 00 00 02 ef ..?...?...??...?
40 00 00 78 e3 00 00 01 fa ff 01 ef 80 fc 00 00 0f ..x?...??.?.?...
50 fe ff 42 07 00 01 00 84 00 00 00 00 45 00 81 ff ??B.........E..?
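
For what it's worth, a quick way to test whether such a temp file
really is a JPEG (or any other common format) is to compare its
leading bytes against the well-known magic numbers; the dump above
starts 42 07 00 01, which matches none of them. A minimal sketch:

```python
# Sketch: identify a file by magic number. The brscan temp files
# above begin 42 07 00 01, matching none of these signatures, which
# supports the guess that it's some proprietary raster format.
MAGICS = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF": "PDF",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"II*\x00": "TIFF (little-endian)",
    b"MM\x00*": "TIFF (big-endian)",
}

def identify(header_bytes):
    """Return a format name for the leading bytes, or None."""
    for magic, name in MAGICS.items():
        if header_bytes.startswith(magic):
            return name
    return None
```

Usage would be something like identify(open(path, "rb").read(8)).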

> From there on you might cut off black borders (if present), enhance
> greyscale to go from black to white instead of from a darker middle
> grey to a slightly lighter middle grey (I often endure presentations
> where this has not been done, enhancing colour is trickier and can
> lead to unexpected results), convert to a sensible compressed format,
> often reducing the colour space as a first step (the change to 256
> colours is often invisible) and then bundle into a PDF. For a
> consistent trimming of margins a disciplined and precise positioning
> while scanning is a must, if omitted you have the choice of pages all
> over the place or hand-trimming all pages individually (don't even
> think about it, it's cruel and unusual).
> Axel

Thanks for the additional suggestions, Axel. Actually, the scans are
in b&w. It's just monochrome ink on paper, so there's no information
in the color. And I'd tried various Xsane color and other settings, but b&w
at 300dpi was actually best (even "crisper"-looking than b&w at 600dpi
for some reason). As for borders, the notebook pages are 11.75x9.375"
requiring a tabloid-size scanner. And Xsane lets you adjust the scan
region, so I got a precise fit, leaving about 0.25" margins all around to
accommodate my "disciplined and precise positioning" (quoting you above).
I used to spend lots of time at library xerox machines copying journal
articles (now almost everything's on the net), so I've become quite good
at reproducibly precise positioning, if I do say so myself.

Axel Berger

Jul 1, 2015, 1:15:08 PM
JohnF wrote on Wed, 15-07-01 04:05:
>That -j switch saves output as jpegs according to the man page, and I
>believe that's not lossless. gif would be lossless, but not jpg.

That's a misunderstanding. By default pdfimages writes large
uncompressed files. This is fine for PNG, you can easily recompress
later, but not for jpg. The losses are already there in the PDF,
decompressing would not add more but later recompression would. So "-j"
extracts all Jpegs found (if any) as is. It does nothing for all
included non-Jpeg formats.

>pdf was obviously and noticeably the clearest and most readable.

Many machines I encounter use low resolution Jpegs with artefacts and
far too many colours. But on the Web I often encounter compound PDFs
obviously made by machines nobody I know can afford. They have the
letters in crisp high resolution b/w and the background with all its
smudges and images as Jpeg. Pdfimages can't extract anything sensible
out of there and you'd indeed be better off leaving well alone and just
manipulating the PDF.

Anyway pdfimages is nice for just taking a first look at what's in
there. If it's not full page images but hundreds of tiny page-width
stripes you're stumped too, or at least I am.

>Actually, the scans are in b&w.

If you can and did choose that, you're already 90 % of the way to
perfect. The black borders are only a problem with greyscale. They're
truly black so you can't make dark grey letters black while they're
there. Without the enhancement step just leave them.

Best of luck
Axel

JohnF

Jul 2, 2015, 3:20:15 AM
Axel Berger <Axel_...@b.maus.de> wrote:
> JohnF wrote on Wed, 15-07-01 04:05:
>>That -j switch saves output as jpegs according to the man page, and I
>>believe that's not lossless. gif would be lossless, but not jpg.
>
> That's a misunderstanding. By default pdfimages writes large
> uncompressed files. This is fine for PNG, you can easily recompress
> later, but not for jpg. The losses are already there in the PDF,
> decompressing would not add more but later recompression would.
> So "-j" extracts all Jpegs found (if any) as is. It does nothing
> for all included non-Jpeg formats.

Thanks for the clarification, Axel. I apparently browsed through
that man page too quickly.

>>pdf was obviously and noticeably the clearest and most readable.
>
> Many machines I encounter use low resolution Jpegs with artefacts and
> far too many colours. But on the Web I often encounter compound PDFs
> obviously made by machines nobody I know can afford. They have the
> letters in crisp high resolution b/w and the background with all its
> smudges and images as Jpeg. Pdfimages can't extract anything sensible
> out of there and you'd indeed be better off leaving well alone and just
> manipulating the PDF.

Yeah, I actually solved the problem with Xsane alone.
It has a "rotate 180" option that does exactly what I'd wanted
to do with pdf180. But using the Xsane option, I have to scan
all the even pages first, then all the odd (or vice versa).
And that means I have to remove the notebook and turn pages
twice as often (rather than just rotating the book on the
platen to position the facing page). Adds about 20mins/notebook,
but it now seems like the best procedure.

> Anyway pdfimages is nice for just taking a first look at what's in
> there. If it's not full page images but hundreds of tiny page-width
> stripes you're stumped too, or at least I am.
>
>>Actually, the scans are in b&w.
>
> If you can and did choose that, you're already 90 % of the way to
> perfect. The black borders are only a problem with greyscale. They're
> truly black so you can't make dark grey letters black while they're
> there. Without the enhancement step just leave them.

Yeah, the b&w pdf's (at 300dpi in my case) are definitely best,
which I found out simply by exhaustive trial-and-error.
Thanks for the explanation about why.

> Best of luck
> Axel

Thanks again for the help, Axel. A few sample pages are up
at http://www.forkosh.com/u715.html I've left odd-numbered
pages upside-down for now (till I re-scan them using Xsane's rotate)
to avoid that pesky size problem. But at least you've convinced me
that pdf is the right format choice for this purpose.
What remains problematic is the html <object> tag on that web page
(view source for details), which seems to display the pdf pages
on some browsers, but not always on others. I can't seem to
come up with a foolproof way of displaying them. But that problem's
even more OT for this ng than the original problem.

Axel Berger

Jul 2, 2015, 8:15:05 PM
JohnF wrote on Thu, 15-07-02 09:20:
>Yeah, the b&w pdf's (at 300dpi in my case) are definitely best, which I
>found out simply by exhaustive trial-and-error.

In principle later reduction of greyscale to b/w should yield the same
result. In practice I've found that the scanner itself does a much
better job in recognizing straight and not ragged borders. This is
especially true when one corner has a darkish background and the
opposite one dull and weak letters. B/w seems to optimize locally while
conversion uses a fixed threshold all over.
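
The local-vs-global distinction can be illustrated with a toy sketch
(pure Python on a tiny list-of-rows "image"; obviously not what any
scanner firmware actually runs): a dim feature survives a
neighbourhood-mean threshold but vanishes under one fixed global
threshold.

```python
# Toy illustration (not actual scanner firmware): a single global
# threshold vs. a per-neighbourhood local threshold on a tiny
# greyscale "image" (list of rows, 0=black .. 255=white).
def global_threshold(img, t=128):
    """One fixed threshold for every pixel."""
    return [[1 if px > t else 0 for px in row] for row in img]

def local_threshold(img, radius=1):
    """Threshold each pixel against the mean of its neighbourhood."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            nbhd = [img[j][i] for j in ys for i in xs]
            row.append(1 if img[y][x] > sum(nbhd) / len(nbhd) else 0)
        out.append(row)
    return out
```

A dim spot of value 70 on a background of 40 is lost entirely at a
global threshold of 128, but stands out against its local mean.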

>which seems to display the pdf pages on some browsers, but not always on
>others.

And hopefully never on mine. At least I've done my best to forbid it.
For PDF I have my viewer of choice right here and that and not a
half-baked browser compromise is what I want to use. I even force that
choice on my visitors through

<FilesMatch "\.(pdf|mp3)$">
Header add Content-Disposition "Attachment"
</FilesMatch>

in my .htaccess.

Axel


JohnF

Jul 2, 2015, 11:03:39 PM
Yeah, I understand and agree with your remark, and my html page
already has a "download page" button that downloads just the pdf for
the current page you're viewing, and also a "download entire notebook"
button (or savvy users could wget whatever they want). That's the
easiest I could think to make it for users to apply their viewer
of choice.
But most people may just want to casually browse this kind
of stuff, and maybe download it later if they find they're more
interested. So there ought to be a straightforward and robust
way of <tag>'ing pdf's so any browser "just works" to display
them, analogous to <img> for gif's, jpg's, etc (svg's still
seem a bit problematic). This is needed because some useful
content is best saved as pdf's, and won't display well if
converted to a more "browser-portable" form.
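
For reference, the sort of markup I'd been trying is the
<object>-with-fallback pattern, roughly like this (the filename and
dimensions are placeholders):

```html
<!-- Sketch: embed a PDF with a graceful fallback; the filename is
     hypothetical. Browsers without an inline PDF viewer render the
     nested content (the download link) instead. -->
<object data="w11pg0087.pdf" type="application/pdf"
        width="100%" height="800">
  <p>Your browser can't display the PDF inline.
     <a href="w11pg0087.pdf">Download it instead.</a></p>
</object>
```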

Axel Berger

Jul 3, 2015, 2:15:06 PM
JohnF wrote on Fri, 15-07-03 05:03:
>But most people may just want to casually browse this kind of stuff,

True. But "open in reader" is not that different from the often
used "open in new window", only you have all the controls and shortcuts
you expect with PDF.

JohnF

Jul 3, 2015, 7:36:59 PM
"open in reader" -- that'd be great!!! Is there a <tag> for that,
i.e., <tag action="open in reader" src="filename.pdf"> ???
That's what I was trying to accomplish, but failed to google
a browser-agnostic, portable, foolproof method (and failed to
construct one myself after trying various permutations/combinations/etc
of stuff I'd googled). That page I cited does this when it works --
it uses frames to display the pdf on the right-hand side, using
whatever acroread/plugin/etc your browser's configured to use for
pdf's, and on the left-hand side is a scrollable
page selection table of contents, and the download options.

Axel Berger

Jul 4, 2015, 8:15:08 AM
JohnF wrote on Sat, 15-07-04 01:36:
>That's what I was trying to accomplish, but failed to google a browser-
>agnostic, portable, foolproof method (and failed to construct one myself
>after trying various permutations/combinations/etc of stuff I'd googled).

For yourself the older Acrobats (NOT browsers) had a setting "Display
PDF in Browser", enabled by default, that you could disable. I believe
the newest ones don't have it anymore.
To get the same result for others use the .htaccess setting I gave
before. Most (all?) browsers I know, on encountering a file they can't
display themselves, show a dialog:

Open with XXX or Save?

As PDF is known to the OS, the "open with" will already point to
Acrobat. You can usually get rid of the dialog by choosing "always do
this from now on" or something like that but it is the default
behaviour. I'm not sure about Chrome and stuff, but iirc even the
newest Firefox still has it.

(As you will have noticed, you can usually ask me about backwards
compatibility and expect an answer, but more often than not, not about
the newest stuff. My PDFs display in Acrobat 3 and my web pages in
Netscape and IE 3 (often not identical but with graceful fallback, but
they work), all on Win 3.11, and I'm going to keep it that way.)

Axel

JohnF

Jul 4, 2015, 11:09:46 PM
Thanks, Axel, ...
<FilesMatch "\.(pdf|mp3)$">
Header add Content-Disposition "Attachment"
</FilesMatch>
works like a charm! (at least for everything I can test myself)
Of all the stuff I'd tried to get that page working right,
.htaccess never occurred to me, not in the slightest...
So many "little languages", so little time. :)