On 2012-02-03, Ben Morrow <b...@morrow.me.uk> wrote:
>
> Quoth Justin C <justi...@purestblue.com>:
>> I've just written my first WWW::Mechanize program; it does its job,
>> and I can export the data to PDF using PDF::FromHTML. What I don't get
>> with this, however, are the images on the page, so my PDF is ugly.
>
> Also, that module makes no attempt to handle CSS, so for most ordinary
> web pages it's probably useless.
>
>> I've tried using $mech->find_all_images(), and downloading them, but
>> the images on the page are all relative links - and, it seems, the
>> relative path is being set depending on which style sheet is in force at
>> the time.
>
> I'm not sure what you mean here. ->find_all_images returns
> WWW::Mechanize::Image objects, which have both ->url and ->base methods. Is
> that not enough to download the image and put it in the right place in a
> tree?

I may look at that again in a while. For now I've given up on the
images...
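
A minimal sketch of that suggestion, for reference. It relies only on the
->url and ->base methods mentioned above plus URI->new_abs; the sub names
resolve_image and fetch_images and the flat output directory are
assumptions made up for illustration, not anything from the real program.

```perl
use strict;
use warnings;
use URI;
use File::Path qw(make_path);

# Resolve an image's (possibly relative) URL against its base and pick
# a local file name for it. Returns an empty list for non-HTTP URLs.
sub resolve_image {
    my ($url, $base, $dir) = @_;
    my $abs = URI->new_abs($url, $base);
    return unless ($abs->scheme // '') =~ /^https?$/;    # skip data: etc.
    my $name = ($abs->path_segments)[-1];
    $name = 'image' unless defined $name && length $name;
    return ($abs->as_string, "$dir/$name");
}

# Hypothetical driver: $mech is a WWW::Mechanize object that has
# already fetched the page whose images we want.
sub fetch_images {
    my ($mech, $dir) = @_;
    make_path($dir);
    my @images = $mech->find_all_images;    # collect before navigating away
    for my $img (@images) {
        next unless defined $img->url;
        my ($abs, $file) = resolve_image($img->url, $img->base, $dir)
            or next;
        # Saves the response body straight to $file. Note that this
        # moves $mech off the original page.
        $mech->get($abs, ':content_file' => $file);
    }
    return;
}
```

Grab $mech->content before calling fetch_images, since each get replaces
the current page. Images that share a file name will overwrite each other
in this flat layout.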
>
>> Can anyone suggest where I start reading so that I can learn how to
>> get the entire page, including images, and have the html in
>> $mech->content display links to the locally downloaded copies of the
>> images?
>>
>> Or is there a better way to submit a form and get what is returned
>> into a PDF?
>
> Rendering modern HTML is an extremely complicated business. I wouldn't
> try to do it in pure Perl unless there's no other option. For rendering
> to PDF I'd look at PDF::WebKit, which uses an external WebKit-based
> binary to do the rendering; unfortunately it also requires Qt, which may
> mean you can't use it.

It wasn't fun installing, but my PDFs look much better (hence being
able to do without the images).

The major problem I had is that wkhtmltopdf (what PDF::WebKit drives
to get PDF output) requires a running X server, but I wanted to run
this on a headless box that has no X. So I installed xvfb (it still
dragged in a whole bunch of dependencies). I then needed a bash
script:

    xvfb-run wkhtmltopdf "$@"

and then a hack to PDF::WebKit so that it doesn't look for
wkhtmltopdf, but uses my script instead. It's dirty, but it works.

wkhtmltopdf works fine if run from the command line in an
xterm/rxvt/whatever, but it will not run outside an X server. :-(
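
A possibly less invasive route than patching the module is its configure
hook. This is only a sketch based on my reading of the PDF::WebKit
documentation, which describes a wkhtmltopdf setting for the path to the
binary. The wrapper path below is hypothetical, and I'd assume the
wrapper has to exist, be executable, and start with a #! line.

```perl
use strict;
use warnings;
use PDF::WebKit;

# Hypothetical location of the xvfb-run wrapper script shown above.
my $wrapper = '/usr/local/bin/wkhtmltopdf-xvfb';

# Tell PDF::WebKit which executable to run, instead of letting it
# search for wkhtmltopdf itself.
PDF::WebKit->configure(sub {
    $_->wkhtmltopdf($wrapper);
});
```

This would need to run once, before the first PDF::WebKit->new call.
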
> If you are having trouble because you're feeding WebKit HTML from Mech
> and it can't resolve the URLs, you probably want to use the base_href
> parameter to Mech->content.

I had that in my code, but I wasn't getting what I wanted; I was
probably doing something wrong. I'll give it another try when I've got
this whole thing worked out, not just this part.
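
For reference, a rough sketch of how base_href and PDF::WebKit might fit
together. The helper name page_to_pdf, the output file name and the page
size are assumptions for illustration; the calls are the ones I
understand the two modules to document, and none of this was run against
the site in question.

```perl
use strict;
use warnings;
use WWW::Mechanize;
use PDF::WebKit;

# Hypothetical helper: render the page currently held by $mech to a PDF.
sub page_to_pdf {
    my ($mech, $outfile) = @_;

    # base_href asks Mech to add <base href="..."> to the returned HTML,
    # so relative image and stylesheet URLs can still be resolved when
    # the HTML is handed over as a string.
    my $html = $mech->content(base_href => $mech->base);

    # A scalar reference tells PDF::WebKit this is HTML, not a URL or path.
    my $kit = PDF::WebKit->new(\$html, page_size => 'A4');
    return $kit->to_file($outfile);
}

# Usage, after submitting the form with $mech:
#   page_to_pdf($mech, 'result.pdf');
```
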

Thank you for the pointers. PDF::WebKit (apart from the install
overhead) is much easier than PDF::FromHTML.