
Convert a website to a single PDF file


Geico Caveman

Jul 4, 2009, 4:06:38 PM
I have a bunch of html files, downloaded using wget.

I want to convert these to a single PDF file (with all the links
converted, separate html files converted to pages in the order the
links are encountered in index.html, recursively).

Is this possible ?

Sam

Jul 4, 2009, 4:32:26 PM
Geico Caveman writes:

Open each page in Firefox. Print each page to a file. You'll get a
Postscript file. Use the ps2pdf script to convert it to a PDF.


Geico Caveman

Jul 4, 2009, 5:15:31 PM
On 2009-07-04 13:32:26 -0700, Sam <s...@email-scan.com> said:

> Open each page in Firefox. Print each page to a file. You'll get a
> Postscript file. Use the ps2pdf script to convert it to a PDF.

I was obviously looking for something more elegant and less work-intensive.

I have about 600 html files.

Unruh

Jul 4, 2009, 6:34:00 PM
Geico Caveman <spammers...@spam.invalid> writes:

So leave them as html. Why do you want to convert 600 files to pdf?

Steve Wampler

Jul 4, 2009, 6:43:06 PM
On Sat, 04 Jul 2009 13:06:38 -0700, Geico Caveman wrote:
> I want to convert these to a single PDF file (with all the links
> converted, separate html files converted to pages in the order the links
> are encountered in index.html, recursively).

At least part of the work can be done with htmldoc <www.htmldoc.org>.
I've used it to construct books from multiple web pages but have always
wanted to control the order, not pull recursively from an index.html
file. The book format is simple enough (build a fake one and look
at the resulting format) that you should be able to construct it from
a list of html files easily enough.

That reduces the problem to one of getting the (recursively generated)
list of html files starting at the index file. There's probably a tool
to do that, somewhere.
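
A minimal sketch of that missing step, assuming the mirror uses double-quoted, relative href attributes (which is what wget's link conversion typically produces) and file names without newlines. The function name crawl and the temp-file bookkeeping are illustrative only and are not part of any tool named in this thread; "../" path segments are not normalised, so a deeply nested mirror may list a page twice.

```shell
# Temp file recording pages already listed (illustrative bookkeeping).
seen_file=$(mktemp)

# crawl PAGE: print PAGE, then recurse into each local .html link in the
# order the links appear in PAGE (depth-first, each page listed once).
crawl() {
    local page dir links link
    page=$1
    # Skip missing files and pages that were already listed.
    [ -f "$page" ] || return 0
    grep -Fxq -- "$page" "$seen_file" && return 0
    printf '%s\n' "$page" >> "$seen_file"
    printf '%s\n' "$page"
    dir=$(dirname "$page")
    # Colon-free hrefs ending in .html, optionally followed by a #fragment;
    # excluding ":" keeps http:// and mailto: links out of the list.
    links=$(grep -o 'href="[^":]*\.html[^":]*"' "$page")
    while IFS= read -r link; do
        link=${link#href=\"}   # drop the leading href="
        link=${link%\"}        # drop the closing quote
        link=${link%%#*}       # drop any #fragment
        [ -n "$link" ] || continue
        if [ "$dir" = . ]; then crawl "$link"; else crawl "$dir/$link"; fi
    done <<EOF
$links
EOF
}
```

Run from the top of the mirror, `crawl index.html > order.txt` would give the ordered list. If the file names contain no spaces, something like `htmldoc --webpage -f site.pdf $(cat order.txt)` could then consume it; whether htmldoc keeps the cross-page links intact is not something this sketch checks.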

Geico Caveman

Jul 4, 2009, 10:04:32 PM


I need to send it as a PDF to my boss. He is not interested in
receiving a bunch of html files. The original website is down, so I
cannot send him a link. And PDF is far more compact and convenient for
someone who is not computer savvy.

DenverD

Jul 5, 2009, 12:21:24 AM
> I need to send it as a PDF to my boss. He is not interested in receiving
> a bunch of html files. The original website is down, so I cannot send
> him a link. And PDF is far more compact and convenient for someone who
> is not computer savvy.


i _guess_ it is possible to make PDFs with live links (but i do not
know of such a program available for Linux)...and, if there is, i don't
know exactly how much time YOU wanna invest checking that each link
continues to function!!!

anyway, why not burn the entire lot to a CD (DVD if you need that much
room) using the exact directory structure of the (former) site and all
files in the correct place....

then using a browser, the experience will be the same as when the site
was live....all links will work and if he wants a print or .pdf, he
can...etc..

i don't know if the boss uses something from Redmond or Jobs...so i
don't know how to make it idiot proof (that is, if you knew for sure,
you could build the Redmond magic file to tell his system to launch
the default browser pointing to /[top directory]/index.html) but your
"not computer savvy" boss should be able to follow YOUR "IS computer
savvy" instruction telling him where/how to begin the browse..

--
DenverD (Linux Counter 282315) via Thunderbird 3.0.1-1.1, KDE 3.5.7,
openSUSE Linux 10.3, 2.6.22.19-0.3-default #1 SMP i686 athlon

Thad Floryan

Jul 5, 2009, 2:58:28 AM
On 7/4/2009 2:15 PM, Geico Caveman wrote:
> On 2009-07-04 13:32:26 -0700, Sam <s...@email-scan.com> said:
>> [...]

>> Open each page in Firefox. Print each page to a file. You'll get a
>> Postscript file. Use the ps2pdf script to convert it to a PDF.
>
> I was obviously looking for something more elegant and less work-intensive.
>
> I have about 600 html files.

First Google "html2pdf", install the one you like.

Then a one-line bash script:

for i in *.html; do html2pdf "$i" "${i%.html}.pdf"; done

Then use pdfsam (PDF split and merge) to aggregate all the PDFs.
pdfsam is here: <http://www.pdfsam.org/>
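
For the merge step, a hedged alternative sketch using pdftk (the tool named elsewhere in this thread) rather than pdfsam. The wrapper name merge_pdfs and the PDFTK override variable are hypothetical conveniences; the only real interface assumed is pdftk's "cat output" operation, which concatenates its inputs in the order given.

```shell
# PDFTK may be overridden (for a dry run, say); it defaults to pdftk.
PDFTK=${PDFTK:-pdftk}

# merge_pdfs OUTPUT INPUT...
# Concatenate the INPUT PDFs, in exactly the order given, into OUTPUT.
merge_pdfs() {
    local out
    out=$1
    shift
    "$PDFTK" "$@" cat output "$out"
}
```

For example, `merge_pdfs site.pdf index.pdf intro.pdf chapter1.pdf` (hypothetical file names). Passing the names explicitly, instead of relying on *.pdf, keeps the pages in the intended reading order rather than alphabetical order.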

Unruh

Jul 5, 2009, 9:07:29 AM
Geico Caveman <spammers...@spam.invalid> writes:

Set up your own website with the contents, and send him a link. A pdf file with every
link of the original expanded out in place would be an unholy, god-awful mess, and he would not
appreciate your sending him that. Links are not well handled by a translation to pdf.
And those 600 pages will translate to about 3000 pdf pages, if your web site is at all typical,
in a pdf file which will take forever to mail and to open. He will NOT thank you.

Geico Caveman

Jul 5, 2009, 6:09:57 PM
On 2009-07-04 21:21:24 -0700, DenverD <"spam.trap\\REMOVE \"at\"
SOME\\texan.dk"> said:

>> I need to send it as a PDF to my boss. He is not interested in receiving
>> a bunch of html files. The original website is down, so I cannot send
>> him a link. And PDF is far more compact and convenient for someone who
>> is not computer savvy.
>
>
> i _guess_ it is possible to make PDFs with live links (but i do not
> know of such a program available for Linux)...and, if there is, i don't
> know exactly how much time YOU wanna invest checking that each link
> continues to function!!!
>
> anyway, why not burn the entire lot to a CD (DVD if you need that much
> room) using the exact directory structure of the (former) site and all
> files in the correct place....
>
> then using a browser, the experience will be the same as when the site
> was live....all links will work and if he wants a print or .pdf, he
> can...etc..
>
> i don't know if the boss uses something from Redmond or Jobs...so i
> don't know how to make it idiot proof (that is, if you knew for sure,
> you could build the Redmond magic file to tell his system to launch
> the default browser pointing to /[top directory]/index.html) but your
> "not computer savvy" boss should be able to follow YOUR "IS computer
> savvy" instruction telling him where/how to begin the browse..


Thanks for the suggestion. Giving him a CD is not an option. He is
about as savvy with this stuff as your friendly neighbourhood snail.

PDF it needs to be.

Geico Caveman

Jul 5, 2009, 6:13:02 PM

Thanks for the hint. Does html2pdf create links to the generated pdfs ?
Just seems to me that the program would need to know the pdf file names
for the targets.

pdfsam sounds like a subset of pdftk.

All links have been converted to local files with wget.

Thad Floryan

Jul 5, 2009, 7:12:55 PM
On 7/5/2009 3:13 PM, Geico Caveman wrote:
> [...]

> Thanks for the hint. Does html2pdf create links to the generated pdfs ?
> Just seems to me that the program would need to know the pdf file names
> for the targets.
>
> pdfsam sounds like a subset of pdftk.
>
> All links have been converted to local files with wget.

I don't know precisely what the many versions and varieties of html2pdf
actually do; you'll need to examine the Google results pages to find one
that meets your needs.

What I usually do while looking at a web page is either print the page to
a PostScript file and convert with ps2pdf or copy'n'paste the portions of
interest to me from the web page into OpenOffice Writer and export as a PDF
file -- it always preserves the links as they were at the time of the
copy'n'paste.

Since you've already converted all links in the HTML files to local files,
it would seem that a simple sed pass replacing ".html" with ".pdf" in the
links, run before converting the HTML files to PDF, would suffice: the links
would then point at the right files once everything has been converted.

In other words, if a local bletch.html has an embedded link to "foobar.html",
edit/replace that embedded link to now be "foobar.pdf".

Now convert the bletch.html to PDF.

The converted bletch.pdf file will now have that embedded link to "foobar.pdf".
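
A rough sketch of that link rewrite, assuming double-quoted href attributes and sed with -E (extended regex) support. The function name rewrite_links is made up for illustration. It only touches targets without a colon, so absolute http:// links are left alone, and it keeps any #fragment after the file name. Whether a given HTML-to-PDF converter then carries the rewritten link into the PDF is a separate question this does not answer.

```shell
# rewrite_links: filter reading HTML on stdin and writing it to stdout,
# with local href targets changed from name.html to name.pdf.
rewrite_links() {
    # \1 = href=" plus the path up to the extension; \2 = closing " or #.
    sed -E 's/(href="[^":]*)\.html(["#])/\1.pdf\2/g'
}
```

Usage would be along the lines of `rewrite_links < bletch.html > rewritten/bletch.html`, writing to a separate directory so the original mirror stays untouched.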

Geico Caveman

Jul 5, 2009, 8:35:25 PM

I understand.

Would pdfsam or pdftk honour that link and convert that into a page
reference when concatenating the pdf files ?

Thad Floryan

Jul 5, 2009, 9:36:21 PM

Excellent question. :-)

I didn't know about pdftk until you mentioned it earlier, so I checked:

<http://www.accesspdf.com/pdftk/>

It seems quite featureful (though last updated in 2006), but it's not clear if
it creates a Table of Contents (ToC) since searching that web page for both
"index" and "content" found no match.

pdfsam (<http://www.pdfsam.org/>) is a new program under active development.
The version I have is about a year old, and I just now noticed the author
keeps adding new features. Might be worth emailing her with your question
(and your original application); she may already have a solution. Can't
hurt to ask. :-)

Maxwell Lol

Jul 6, 2009, 7:11:05 AM
Geico Caveman <spammers...@spam.invalid> writes:

> Thanks for the suggestion. Giving him a CD is not an option. He is
> about as savvy with this stuff as your friendly neighbourhood snail.

Then there's the "set up a web server and copy the files" option.
Can he open a URL sent by e-mail?

DenverD

Jul 6, 2009, 8:06:13 AM
>> Thanks for the suggestion. Giving him a CD is not an option. He is
>> about as savvy with this stuff as your friendly neighbourhood snail.
>
> Then there's the "set up a web server and copy the files" option.
> Can he open a URL sent by e-mail?

i can't imagine a person so stupid that he can't insert a CD in the
caddy but he CAN look at monster pdf file..

i'm beginning to think "Geico Caveman" is as smart as his boss.
