
Find TRUE bounding box of PS/PDF?


Ilya Zakharevich

Feb 22, 2007, 3:31:14 PM
I maintain a script which mangles a PS/PDF/DVI document for the "max
scale" 2up printing. To do this, I need to find bounding box of the
pages.

For many documents,

gs -q -dNOPAUSE -dBATCH -r600x600 -sDEVICE=bbox filename

would work. However, for many documents originated from scanned
sources, this produces too large bounding box (AFAIK, these documents
contain an embedded image with huge white margins).

Does anyone know a better way to treat this problem? I could convert to
a bitmap, but what is the quickest way to find bounding boxes of ink
in a bitmap?

Thanks,
Ilya

P.S. I see that pdfcrop uses similar technique to what I do, so it
would fail likewise...
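
One way to approach the bitmap question, sketched in Python under stated assumptions: the page has already been rendered to a grayscale raster (for example with one of Ghostscript's raster devices) and is available as a list of rows of 0..255 values. The name `ink_bbox` and the `threshold` parameter are made up for illustration. Any pixel darker than the threshold counts as ink, so lowering the threshold also tolerates slightly off-white backgrounds.

```python
def ink_bbox(rows, threshold=255):
    """Return (left, top, right, bottom) of pixels darker than threshold.

    rows: list of rows, each a sequence of gray values (0 = black, 255 = white).
    Coordinates are 0-based pixel indices, inclusive, with the origin at the
    top-left corner. Returns None if the page holds no ink at all.
    """
    left = top = right = bottom = None
    for y, row in enumerate(rows):
        # Columns of this row that hold ink.
        xs = [x for x, v in enumerate(row) if v < threshold]
        if not xs:
            continue
        if top is None:
            top = y
        bottom = y
        left = xs[0] if left is None else min(left, xs[0])
        right = xs[-1] if right is None else max(right, xs[-1])
    if top is None:
        return None
    return (left, top, right, bottom)


# Tiny 6x5 "page": white background, one near-white speck, one dark blob.
page = [
    [255, 255, 255, 255, 255, 255],
    [255, 238, 255, 255, 255, 255],
    [255, 255, 255,   0,   0, 255],
    [255, 255, 255,   0, 255, 255],
    [255, 255, 255, 255, 255, 255],
]
print(ink_bbox(page))                 # strict: the 0xEE speck counts as ink
print(ink_bbox(page, threshold=230))  # tolerant: only the dark blob counts
```

This is a single pass over the pixels, so the cost is dominated by rendering the page, not by the scan. Note that the result is in pixel coordinates with the origin at the top, so it still has to be flipped and scaled by the rendering resolution to become a PostScript-style bounding box.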

Aandi Inston

Feb 22, 2007, 3:43:55 PM
Ilya Zakharevich <nospam...@ilyaz.org> wrote:

>I maintain a script which mangles a PS/PDF/DVI document for the "max
>scale" 2up printing. To do this, I need to find bounding box of the
>pages.
>
>For many documents,
> gs -q -dNOPAUSE -dBATCH -r600x600 -sDEVICE=bbox filename
>
>would work. However, for many documents originated from scanned
>sources, this produces too large bounding box (AFAIK, these documents
>contain an embedded image with huge white margins).

White margins are the normal thing. I suspect the problem is that you
have off-white margins from scanning, so the true bounding box
includes all of the near-white paper.
----------------------------------------
Aandi Inston
Please support usenet! Post replies and follow-ups, don't e-mail them.

Ilya Zakharevich

Feb 22, 2007, 6:44:25 PM
[A complimentary Cc of this posting was sent to
Aandi Inston
<qu...@dial.pipex.con>], who wrote in article <45de0056....@read.news.uk.uu.net>:

> > gs -q -dNOPAUSE -dBATCH -r600x600 -sDEVICE=bbox filename
> >
> >would work. However, for many documents originated from scanned
> >sources, this produces too large bounding box (AFAIK, these documents
> >contain an embedded image with huge white margins).
>
> White margins are the normal thing. I suspect the problem is that you
> have off-white margins from scanning, so the true bounding box
> includes all of the near-white paper.

Well, they are not "my" scans, but those produced by the publisher.
However, "their" scans still may have background not normalized to
0xFFFFFF...

To detect this, I will need to ghostscript to ppm/etc, and observe the
output, right? Hmm, probably imagemagick should be able to convert
PDF --> TXT too.

Thanks,
Ilya

bugbear

Feb 23, 2007, 4:20:01 AM

http://netpbm.sourceforge.net/doc/pnmcrop.html
You might need to posterise the rendered image
to get the pnmcrop to work the way you want.

BugBear

Ilya Zakharevich

Feb 23, 2007, 6:43:15 AM
[A complimentary Cc of this posting was sent to
Aandi Inston
<qu...@dial.pipex.con>], who wrote in article <45de0056....@read.news.uk.uu.net>:
> > gs -q -dNOPAUSE -dBATCH -r600x600 -sDEVICE=bbox filename
> >
> >would work. However, for many documents originated from scanned
> >sources, this produces too large bounding box (AFAIK, these documents
> >contain an embedded image with huge white margins).
>
> White margins are the normal thing. I suspect the problem is that you
> have off-white margins from scanning, so the true bounding box
> includes all of the near-white paper.

BTW, is there a way to ask Ghostscript to make document 10% brighter,
so that 0xEEEEEE would overblow to 0xFFFFFF?

Thanks,
Ilya

Ilya Zakharevich

Feb 23, 2007, 10:00:43 AM
[A complimentary Cc of this posting was sent to
Aandi Inston
<qu...@dial.pipex.con>], who wrote in article <45de0056....@read.news.uk.uu.net>:
> > gs -q -dNOPAUSE -dBATCH -r600x600 -sDEVICE=bbox filename
> >
> >would work. However, for many documents originated from scanned
> >sources, this produces too large bounding box (AFAIK, these documents
> >contain an embedded image with huge white margins).
>
> White margins are the normal thing. I suspect the problem is that you
> have off-white margins from scanning, so the true bounding box
> includes all of the near-white paper.

Do not think it is applicable. bbox returns

%%BoundingBox: 0 0 448 681

(or some such) for all pages. Doing

convert "xx1.pdf[2]" ~/tmp/xx1.txt

reports size as 448x680, but the first non-white pixel is at 66,45
(0-based). Same with -depth 16.

So the margin IS white, but bbox can't find it...

gs is 8.54...

Puzzled,
Ilya


bugbear

Feb 23, 2007, 11:39:36 AM

As I understand it bbox records the PAINTED
area.

The border may (well) be painted white.

BugBear

Ilya Zakharevich

Feb 23, 2007, 1:00:59 PM
[A complimentary Cc of this posting was sent to
bugbear
<bugbear@trim_papermule.co.uk_trim>], who wrote in article <45df18c8$0$8747$ed26...@ptn-nntp-reader02.plus.net>:

> As I understand it bbox records the PAINTED area.

This was my initial conjecture too (given the results), but Aandi says
otherwise, and AFAIK, the documentation explicitly says that what is
painted white should not be included.

Maybe this relates to what is painted by "pure" PS commands, and not
by "embedded graphics"?

Bug or feature?

Thanks,
Ilya

Aandi Inston

Feb 23, 2007, 2:02:53 PM
bugbear <bugbear@trim_papermule.co.uk_trim> wrote:

>As I understand it bbox records the PAINTED
>area.
>
>The border may (well) be painted white.

That would be incorrect, in as much as that isn't what the bounding
box is. However, I'm quite happy to accept it if you say that is what
this particular piece of software does (it IS much easier to measure).

Ilya Zakharevich

Feb 23, 2007, 3:51:33 PM
[A complimentary Cc of this posting was sent to
bugbear
<bugbear@trim_papermule.co.uk_trim>], who wrote in article <45deb1c1$0$8719$ed26...@ptn-nntp-reader02.plus.net>:

> > To detect this, I will need to ghostscript to ppm/etc, and observe the
> > output, right? Hmm, probably imagemagick should be able to convert
> > PDF --> TXT too.

> http://netpbm.sourceforge.net/doc/pnmcrop.html
> You might need to posterise the rendered image
> to get the pnmcrop to work the way you want.

Thanks for the pointer. I remember that I saw something like this
somewhere, but forgot where.

However, it is not what I need. I do not need the actual crop; what I
need is bbox. BBoxes for first/odd/even pages are combined in my tool
to decide on the best cropping strategy...

Thanks anyway,
Ilya
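
A minimal sketch of combining per-page boxes into first/odd/even groups, in Python. How the actual tool combines them is not stated in this thread; the sketch assumes a plain union per group, that page 1 forms the "first" group on its own, and that boxes are (llx, lly, urx, ury) tuples. The names `union_bbox` and `group_bboxes` are hypothetical.

```python
def union_bbox(boxes):
    """Smallest (llx, lly, urx, ury) box that contains every box in boxes."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("need at least one bounding box")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def group_bboxes(page_boxes):
    """Union the boxes of page 1, of the other odd pages, and of the even pages.

    page_boxes is ordered by page; pages are numbered from 1.
    Groups with no pages are left out of the result.
    """
    groups = {"first": [], "odd": [], "even": []}
    for number, box in enumerate(page_boxes, start=1):
        if number == 1:
            groups["first"].append(box)
        elif number % 2:
            groups["odd"].append(box)
        else:
            groups["even"].append(box)
    return {name: union_bbox(boxes) for name, boxes in groups.items() if boxes}


pages = [(19, 19, 81, 81), (9, 9, 90, 90), (0, 0, 100, 100), (10, 20, 50, 60)]
print(group_bboxes(pages))
```

Keeping the per-page boxes as data, instead of cropped images, is what makes this kind of combination possible, which is why a tool that only crops is not enough here.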

François Robert

Feb 23, 2007, 5:07:45 PM
In article <erna4r$4ss$1...@agate.berkeley.edu>,
Ilya Zakharevich <nospam...@ilyaz.org> wrote:

A simple experiment shows that white raster pixels are not treated the
same way as paths painted white: I ran the following PS snippet through
GS 8.54 (OS X) with gs -sDEVICE=bbox. On each page, a 100x100 white
square is drawn with a different method and a smaller 60x60 black square
is drawn on top:

%!

(white square as filled path:) = flush
0 0 moveto 0 100 lineto 100 100 lineto 100 0 lineto
closepath 1 setgray fill
0 setgray 20 20 60 60 rectfill
showpage

(white square as rectfill:) = flush
1 setgray 0 0 100 100 rectfill
0 setgray 20 20 60 60 rectfill
showpage

(white square as an image:) = flush
gsave
100 100 scale
100 100 8 [100 0 0 -100 0 100] <FF>
image
grestore

0 setgray 20 20 60 60 rectfill
showpage

(white square as an imagemask:) = flush
1 setgray
gsave
100 100 scale
100 100 true [100 0 0 -100 0 100] <FF>
imagemask
grestore

0 setgray 20 20 60 60 rectfill
showpage

%%EOF

The results are :

AFPL Ghostscript 8.54 (2006-05-17)
Copyright (C) 2005 artofcode LLC, Benicia, CA. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
white square as filled path:
%%BoundingBox: 19 19 81 81
%%HiResBoundingBox: 19.997999 19.997999 80.009998 80.009998
>>showpage, press <return> to continue<<

white square as rectfill:
%%BoundingBox: 19 19 81 81
%%HiResBoundingBox: 19.997999 19.997999 80.009998 80.009998
>>showpage, press <return> to continue<<

white square as an image:
%%BoundingBox: 0 0 100 100
%%HiResBoundingBox: 0.000000 0.000000 99.999981 99.999981
>>showpage, press <return> to continue<<

white square as an imagemask:
%%BoundingBox: 0 0 100 100
%%HiResBoundingBox: 0.000000 0.000000 99.999981 99.999981
>>showpage, press <return> to continue<<

________________________________________________________
François Robert
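
The paired %%BoundingBox and %%HiResBoundingBox lines in this post look consistent with the integer box being the high-resolution box rounded outward (lower-left corner rounded down, upper-right corner rounded up). That reading is inferred only from the numbers shown here, not from the Ghostscript documentation; the Python check below, with the hypothetical helper `integer_bbox`, reproduces it.

```python
import math


def integer_bbox(hires):
    """Round a (llx, lly, urx, ury) box outward to whole points."""
    llx, lly, urx, ury = hires
    return (math.floor(llx), math.floor(lly), math.ceil(urx), math.ceil(ury))


# HiResBoundingBox values copied from the output above.
print(integer_bbox((19.997999, 19.997999, 80.009998, 80.009998)))
print(integer_bbox((0.000000, 0.000000, 99.999981, 99.999981)))
```

This would explain why the 20..80 black square is reported as 19 19 81 81: the high-resolution box is slightly larger than the square, and outward rounding adds up to one more point on each side.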

François Robert

Feb 23, 2007, 5:22:39 PM
Just added another case :

(white square as an image, clipped:) = flush
gsave
10 10 80 80 rectclip


100 100 scale
100 100 8 [100 0 0 -100 0 100] <FF>
image
grestore

0 setgray 20 20 60 60 rectfill
showpage

The result is:

white square as an image, clipped:
%%BoundingBox: 9 9 90 90
%%HiResBoundingBox: 9.990000 9.990000 89.999997 89.999997

bugbear

Feb 26, 2007, 5:25:16 AM
Aandi Inston wrote:
> bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
>
>> As I understand it bbox records the PAINTED
>> area.
>>
>> The border may (well) be painted white.
>
> That would be incorrect, in as much as that isn't what the bounding
> box is. However, I'm quite happy to accept it if you say that is what
> this particular piece of software does (it IS much easier to measure).

(It appears I'm wrong, Aandi - see posts by François Robert.)

I dread to think what happens if the background is not white,
e.g. an inverted transfer curve.

BugBear

Kevin Ashley

Feb 26, 2007, 11:03:10 AM

It's true that pnmcrop is going to do more work than you want,
but it *can* tell you the bounding box, or at least information
that is equivalent, if you use the -verbose option:

pnmcrop: Background color is blue
pnmcrop: cropping 17 rows off the top
pnmcrop: cropping 6 rows off the bottom
pnmcrop: cropping 3 cols off the left

Just throw pnmcrop's actual output at /dev/null and
parse the above and you have what you want. As others have
pointed out, given that your images are scanned, they are
unlikely to have pure white backgrounds. Try either turning
the postscript to pbmraw format (which forces every pixel
to black or white) or using some other tool such as
posterising, which someone else has suggested. You will need to
do some experimentation to find out what works best
for you.
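
The parsing step could look roughly like this in Python. It is a sketch under assumptions: the message wording is taken from the sample lines in this post (other pnmcrop versions may phrase them differently), a side with no message is treated as "nothing cropped", and the image width and height must be supplied separately, for example from the PNM header. The function name `bbox_from_pnmcrop` is made up.

```python
import re

# Matches lines such as "pnmcrop: cropping 17 rows off the top".
_CROP_LINE = re.compile(
    r"cropping (\d+) (?:rows|cols) off the (top|bottom|left|right)"
)


def bbox_from_pnmcrop(verbose_text, width, height):
    """Turn pnmcrop -verbose messages into a pixel box.

    Returns (left, top, right, bottom) with the origin at the top-left
    corner; right and bottom are exclusive.
    """
    crop = {"top": 0, "bottom": 0, "left": 0, "right": 0}
    for match in _CROP_LINE.finditer(verbose_text):
        crop[match.group(2)] = int(match.group(1))
    return (crop["left"], crop["top"],
            width - crop["right"], height - crop["bottom"])


sample = """\
pnmcrop: Background color is blue
pnmcrop: cropping 17 rows off the top
pnmcrop: cropping 6 rows off the bottom
pnmcrop: cropping 3 cols off the left
"""
# 100x200 is a hypothetical image size, chosen only for the demonstration.
print(bbox_from_pnmcrop(sample, width=100, height=200))
```

In the sample no columns are cropped from the right, so the box extends to the full image width.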

Ilya Zakharevich

Feb 26, 2007, 1:09:04 PM
[A complimentary Cc of this posting was sent to
Kevin Ashley
<K.As...@ulcc.ac.uk>], who wrote in article <erv0bv$oa4$1...@canard.ulcc.ac.uk>:

> parse the above and you have what you want. As others have
> pointed out, given that your images are scanned, they are
> unlikely to have pure white backgrounds.

They do. Did you ever see epublished-by-scanning magazines?

Thanks for pnmcrop tips...

Ilya

François Robert

Feb 26, 2007, 2:34:31 PM
In article <45e2b58c$0$8710$ed26...@ptn-nntp-reader02.plus.net>,
bugbear <bugbear@trim_papermule.co.uk_trim> wrote:

> I dread to think what happens if the background is not white,
> e.g. an inverted transfer curve.

FYI, I ran my tests again with your suggestion. I set up a simple linear
transfer function in the graphics state before the first test:

%!
{ 0.5 mul 0.25 add } settransfer

(white square as filled path:) = flush

etc...

The results :


AFPL Ghostscript 8.54 (2006-05-17)
Copyright (C) 2005 artofcode LLC, Benicia, CA. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
white square as filled path:

%%BoundingBox: 0 0 101 101
%%HiResBoundingBox: 0.000000 0.000000 100.007997 100.007997


>>showpage, press <return> to continue<<

white square as rectfill:
%%BoundingBox: 0 0 101 101
%%HiResBoundingBox: 0.000000 0.000000 100.007997 100.007997


>>showpage, press <return> to continue<<

white square as an image:
%%BoundingBox: 0 0 101 101
%%HiResBoundingBox: 0.000000 0.000000 100.007997 100.007997


>>showpage, press <return> to continue<<

white square as an image, clipped:
%%BoundingBox: 9 9 90 90
%%HiResBoundingBox: 9.990000 9.990000 89.999997 89.999997
>>showpage, press <return> to continue<<

white square as an imagemask:
%%BoundingBox: 0 0 101 101
%%HiResBoundingBox: 0.000000 0.000000 100.007997 100.007997

François Robert

Feb 26, 2007, 2:35:42 PM
In article <ermk0j$2n3c$1...@agate.berkeley.edu>,
Ilya Zakharevich <nospam...@ilyaz.org> wrote:
...

> BTW, is there a way to ask Ghostscript to make document
> 10% brighter, so that 0xEEEEEE would overblow to 0xFFFFFF?
Using transfer function(s) ?

________________________________________________________
François Robert
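
The arithmetic behind such a brightening transfer function, sketched in Python with a hypothetical helper `brighten`: gray levels run from 0.0 (black) to 1.0 (white), the level is multiplied by a gain and clipped at white. In PostScript the same mapping might be written as `{ 1.1 mul dup 1 gt { pop 1 } if } settransfer`, which is an untested guess, not something taken from this thread.

```python
def brighten(value, gain=1.1):
    """Scale a gray level in 0.0..1.0 (1.0 = white) and clip at white."""
    return min(1.0, value * gain)


near_white = 0xEE / 0xFF          # about 0.933
print(round(near_white, 4), "->", brighten(near_white))
print(0.5, "->", brighten(0.5))
```

A gain of 10% is enough for 0xEE because 0xFF / 0xEE is about 1.071. Whether the bbox device would then ignore the brightened pixels is a separate question: in the experiment earlier in the thread, an unclipped image contributed its full extent even when it was pure white.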

Ilya Zakharevich

Feb 28, 2007, 11:30:55 AM
[A complimentary Cc of this posting was sent to
François Robert
<moc....@trebor.siocnarf>], who wrote in article <moc.xeta-E87A6A...@powernews.iol.it>:

> FYI, I ran my tests again with your suggestion. I set up a simple linear
> transfer function in the graphics state before the first test:
>
> %!
> { 0.5 mul 0.25 add } settransfer
>
> (white square as filled path:) = flush
> etc...
>
> The results :

Could you please sum up the results in plain language, for those of us
not fluent in PS? Something like

with transfer function set, bbox acts on color AFTER applying the
transfer function (with white not contributing to bbox), with an
exception of non-cropped images contributing ALL their size into bbox?

Thanks,
Ilya

Ilya Zakharevich

Mar 9, 2007, 11:57:54 AM
[A complimentary Cc of this posting was NOT [per weedlist] sent to
Ilya Zakharevich
<nospam...@ilyaz.org>], who wrote in article <erkuii$1h5f$1...@agate.berkeley.edu>:

> I maintain a script which mangles a PS/PDF/DVI document for the "max
> scale" 2up printing. To do this, I need to find bounding box of the
> pages.
>
> For many documents,
>
> gs -q -dNOPAUSE -dBATCH -r600x600 -sDEVICE=bbox filename
>
> would work. However, for many documents originated from scanned
> sources, this produces too large bounding box (AFAIK, these documents
> contain an embedded image with huge white margins).
>
> Anyone knowing a better way to treat this problem?

(Conjecturally,) I found such a way. I suspect that

convert filename.pdf -fuzz 10% -trim info:

will print what I want (or maybe I could even substitute some

-print FORMAT

to get EXACTLY the info I want in a reliable format).

======================

Unfortunately, the computers around have only an older version of
ImageMagick which does not support info:, -identify, or -print. And I
cannot even convince it to translate to a particular format without
specifying an extension (e.g., to write to STDOUT)... Best I managed
to do is to use

convert filename.pdf -fuzz 10% -trim -depth 1 tmp.miff
identify tmp.miff
tmp.miff[0] MIFF 555x770 570x792+2+16 PseudoClass 2c 1.2mb 0.120u 0:01
tmp.miff[1] MIFF 547x780 570x792+16+6 PseudoClass 2c 1.2mb 0.090u 0:01
tmp.miff[2] MIFF 556x783 570x792+7+3 PseudoClass 2c 1.2mb 0.050u 0:01

Anyone being able to force it at least into a pipeline, as with

convert filename.pdf -fuzz 10% -trim -depth 1 -format MIFF - | identify -

(which does not work, since -format is not understood the way I want)?

Thanks,
Ilya

P.S. Versions around here are 6.0.7 and 6.1.1...
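
The identify lines above already carry the bounding box: the first geometry is the trimmed size, the second is the page size plus the offset of the trimmed area from the top-left corner. A Python sketch of the conversion follows; the helper name `bbox_from_identify` is made up, and the result is in PostScript points only under the assumption that the page was rendered at 72 dpi, so that one pixel equals one point.

```python
import re

# Trimmed size, then page size with signed x/y offsets, e.g.
# "555x770 570x792+2+16".
_GEOMETRY = re.compile(r"(\d+)x(\d+) (\d+)x(\d+)([+-]\d+)([+-]\d+)")


def bbox_from_identify(line):
    """Convert one identify output line to (llx, lly, urx, ury).

    The y axis is flipped so that the origin is at the bottom-left corner,
    as in a PostScript bounding box.
    """
    match = _GEOMETRY.search(line)
    if match is None:
        raise ValueError("no geometry found in: %r" % line)
    width, height, page_w, page_h, x_off, y_off = map(int, match.groups())
    llx = x_off
    urx = x_off + width
    ury = page_h - y_off
    lly = ury - height
    return (llx, lly, urx, ury)


line = "tmp.miff[0] MIFF 555x770 570x792+2+16 PseudoClass 2c 1.2mb 0.120u 0:01"
print(bbox_from_identify(line))
```

As for the pipeline, ImageMagick normally selects the output format from a prefix on the file name and not from an option, so `convert filename.pdf -fuzz 10% -trim -depth 1 miff:- | identify -` may be worth trying; it has not been checked against versions 6.0.7 or 6.1.1.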
