WWW::Mechanize and outputting what's returned

Justin C

Feb 3, 2012, 8:49:03 AM

I've just written my first WWW::Mechanize program. It does its job,
and I can export the data to PDF using PDF::FromHTML. What I don't get
with this, however, are the images on the page, so my PDF is ugly.

I've tried using $mech->find_all_images(), and downloading them, but
the images on the page are all relative links - and, it seems, the
relative path is being set depending on which style sheet is in force at
the time.

Can anyone suggest where I start reading so that I can learn how to
get the entire page, including images, and have the html in
$mech->content display links to the locally downloaded copies of the
images?

Or is there a better way to submit a form and get what is returned
into a PDF?

Thank you for any suggestions.

Justin.

--
Justin C, by the sea.

Ben Morrow

Feb 3, 2012, 11:30:34 AM

Quoth Justin C <justi...@purestblue.com>:

> I've just written my first WWW::Mechanize program. It does its job,
> and I can export the data to PDF using PDF::FromHTML. What I don't get
> with this, however, are the images on the page, so my PDF is ugly.

Also, that module makes no attempt to handle CSS, so for most ordinary
web pages it's probably useless.

> I've tried using $mech->find_all_images(), and downloading them, but
> the images on the page are all relative links - and, it seems, the
> relative path is being set depending on which style sheet is in force at
> the time.

I'm not sure what you mean here. ->find_all_images returns
WWW::Mechanize::Image objects, which have both ->url and ->base methods. Is
that not enough to download the image and put it in the right place in a
tree?
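
A minimal sketch of that idea, for illustration only. It assumes the image objects' ->url_abs method (which resolves ->url against ->base) is enough to locate each image; the helper names local_name and fetch_images and the 'images' directory are made up for this example, and the HTML is not rewritten to point at the local copies.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Path qw(make_path);
use LWP::UserAgent;
use WWW::Mechanize;

# Hypothetical helper: map the path part of an image URL to a local file name.
# Images that share a file name will overwrite each other in this sketch.
sub local_name {
    my ($url_path, $dir) = @_;
    my ($name) = $url_path =~ m{([^/]+)\z};
    $name = 'unnamed' unless defined $name;
    return "$dir/$name";
}

# Fetch a page and save every image it references into $dir.
sub fetch_images {
    my ($page_url, $dir) = @_;

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($page_url);

    # Use a separate agent for the downloads so $mech stays on the page;
    # sharing the cookie jar keeps any session state.
    my $ua = LWP::UserAgent->new( cookie_jar => $mech->cookie_jar );

    make_path($dir);
    my %seen;
    for my $img ( $mech->find_all_images ) {
        next unless defined $img->url;
        my $abs = $img->url_abs;    # absolute URI, resolved against ->base
        next unless ( $abs->scheme // '' ) =~ /\Ahttps?\z/;
        next if $seen{$abs}++;

        my $file = local_name( $abs->path, $dir );
        my $res  = $ua->mirror( "$abs", $file );
        warn "Could not fetch $abs: ", $res->status_line, "\n"
            unless $res->is_success || $res->code == 304;
    }
    return;
}

# Only runs when a page URL is given on the command line.
fetch_images( $ARGV[0], 'images' ) if @ARGV;
```

Run it as `perl fetch_images.pl http://www.example.com/page` (the script name and URL are placeholders).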

> Can anyone suggest where I start reading so that I can learn how to
> get the entire page, including images, and have the html in
> $mech->content display links to the locally downloaded copies of the
> images?
>
> Or is there a better way to submit a form and get what is returned
> into a PDF?

Rendering modern HTML is an extremely complicated business. I wouldn't
try to do it in pure Perl unless there's no other option. For rendering
to PDF I'd look at PDF::WebKit, which uses an external WebKit-based
binary to do the rendering; unfortunately it also requires Qt, which may
mean you can't use it.

If you are having trouble because you're feeding WebKit HTML from Mech
and it can't resolve the URLs, you probably want to use the base_href
parameter to Mech->content.
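
As a rough sketch of how the two pieces might fit together: the code below asks Mech for the page with a <base href> injected and hands that HTML to PDF::WebKit. The sub name page_to_pdf, the A4 page size, and the command-line handling are assumptions made for this example, and it needs a working wkhtmltopdf install.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use PDF::WebKit;
use WWW::Mechanize;

# Hypothetical helper: fetch $url and render what comes back to $pdf_file.
sub page_to_pdf {
    my ($url, $pdf_file) = @_;

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($url);
    # A form submission, e.g. $mech->submit_form(...), would go here.

    # base_href adds <base href="..."> to the returned HTML so that
    # relative image and stylesheet URLs can still be resolved.
    my $html = $mech->content( base_href => $mech->base );

    # A scalar reference tells PDF::WebKit the source is HTML, not a URL or path.
    my $kit = PDF::WebKit->new( \$html, page_size => 'A4' );
    $kit->to_file($pdf_file);
    return $pdf_file;
}

# Only runs when both a URL and an output file name are given.
page_to_pdf(@ARGV) if @ARGV == 2;
```

Passing $mech->base explicitly just spells out what is documented as the default for base_href.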

Ben

Justin C

Feb 6, 2012, 10:48:05 AM

On 2012-02-03, Ben Morrow <b...@morrow.me.uk> wrote:
>
> Quoth Justin C <justi...@purestblue.com>:
>> I've just written my first WWW::Mechanize program. It does its job,
>> and I can export the data to PDF using PDF::FromHTML. What I don't get
>> with this, however, are the images on the page, so my PDF is ugly.
>
> Also, that module makes no attempt to handle CSS, so for most ordinary
> web pages it's probably useless.
>
>> I've tried using $mech->find_all_images(), and downloading them, but
>> the images on the page are all relative links - and, it seems, the
>> relative path is being set depending on which style sheet is in force at
>> the time.
>
> I'm not sure what you mean here. ->find_all_images returns
> WWW::Mechanize::Image objects, which have both ->url and ->base methods. Is
> that not enough to download the image and put it in the right place in a
> tree?

I may look at that again in a while. For now I've given up on the
images...

>
>> Can anyone suggest where I start reading so that I can learn how to
>> get the entire page, including images, and have the html in
>> $mech->content display links to the locally downloaded copies of the
>> images?
>>
>> Or is there a better way to submit a form and get what is returned
>> into a PDF?
>
> Rendering modern HTML is an extremely complicated business. I wouldn't
> try to do it in pure Perl unless there's no other option. For rendering
> to PDF I'd look at PDF::WebKit, which uses an external WebKit-based
> binary to do the rendering; unfortunately it also requires Qt, which may
> mean you can't use it.

It wasn't fun installing, but my PDFs look much better (hence being
able to do without the images).

The major problem I had is that wkhtmltopdf (what PDF::WebKit drives
to get PDF output) requires a running X server, but I wanted to run
this on a headless box that has no X. So I installed xvfb (which still
dragged in a whole bunch of dependencies), and then needed a bash
script:

xvfb-run wkhtmltopdf "$@"

and then a hack to PDF::WebKit so that it doesn't look for
wkhtmltopdf, but uses my script instead. It's dirty, but it works.
wkhtmltopdf works fine if run from the command line in an
xterm/rxvt/whatever, but it will not run outside an X server. :-(
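
For what it's worth, the hack may not be needed: as far as I can tell from the PDF::WebKit documentation, the module has a configure hook for setting the path to the binary, so it could be pointed at the wrapper script instead. A sketch, where the wrapper path is purely hypothetical:

```perl
use strict;
use warnings;
use PDF::WebKit;

# Point PDF::WebKit at the xvfb wrapper instead of the real binary.
# '/usr/local/bin/wkhtmltopdf-xvfb' is a made-up location for the
# one-line script above; use wherever the script actually lives.
PDF::WebKit->configure(sub {
    $_->wkhtmltopdf('/usr/local/bin/wkhtmltopdf-xvfb');
});
```

This would go near the top of the program, before any PDF::WebKit object is created.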

> If you are having trouble because you're feeding WebKit HTML from Mech
> and it can't resolve the URLs, you probably want to use the base_href
> parameter to Mech->content.

I had that in my code, I was probably doing something wrong, but I
wasn't getting what I wanted. I'll give it another try when I've got
this whole thing worked out, not just this part.

Thank you for the pointers, PDF::WebKit (apart from the install
overhead) is much easier than PDF::FromHTML.

Ben Morrow

Feb 6, 2012, 5:05:21 PM

Quoth Justin C <justi...@purestblue.com>:

> On 2012-02-03, Ben Morrow <b...@morrow.me.uk> wrote:
> >
> > Rendering modern HTML is an extremely complicated business. I wouldn't
> > try to do it in pure Perl unless there's no other option. For rendering
> > to PDF I'd look at PDF::WebKit, which uses an external WebKit-based
> > binary to do the rendering; unfortunately it also requires Qt, which may
> > mean you can't use it.
>
> It wasn't fun installing, but my PDFs look much better (hence being
> able to do without the images).
>
> The major problem I had is that wkhtmltopdf (what PDF::WebKit drives
> to get PDF output) requires a running X server, but I wanted to run
> this on a headless box that has no X.

See <http://madalgo.au.dk/~jakobt/wkhtmltoxdoc/wkhtmltopdf-0.9.9-doc.html>,
particularly the section 'Reduced Functionality', and
<http://code.google.com/p/wkhtmltopdf/downloads/list>.

(Quite *why* it links Qt I can't imagine, but there we are...)

Ben