merging converted pdf documents with existing pdf documents


hbagchi

Oct 25, 2008, 12:11:42 AM
to Pisa XHTML2PDF Support
Hi,

Can this framework be used to generate a single PDF file by merging
multiple PDF files, some generated from HTML by XHTML2PDF and others
created separately by a PDF writer?
I would like to develop a document download feature that delivers a
reasonably large set of files (more than 50, of type HTML and PDF) as
one single PDF file for offline viewing and printing, with the
hyperlinks pointing into that same document.

Regards,
hbagchi

Dirk Holtwick

Oct 25, 2008, 6:43:54 AM
to xhtm...@googlegroups.com
Hi,

At the moment that would be quite difficult to do with XHTML2PDF
(but I will put it on my TODO list). You should consider using an
additional tool such as "pdfsplit" from Dinu Gherman:

http://pypi.python.org/pypi?%3Aaction=search&term=pypdf&submit=search

XHTML2PDF can still be used to generate the PDF files from HTML sources.

Cheers
Dirk

hbagchi

Oct 26, 2008, 8:54:39 AM
to Pisa XHTML2PDF Support
Thanks, Dirk.

Is there any solution available that can download all hyperlinked
HTML files starting from a root HTML file and build a single PDF
document out of them, with the hyperlinks converted to PDF bookmarks?


Regards,
hbagchi

Dirk Holtwick

Oct 26, 2008, 9:38:56 AM
to xhtm...@googlegroups.com
I don't know. I think you have to Google for it...

Dirk

Grumpy Old Man

Oct 26, 2008, 7:44:36 PM
to Pisa XHTML2PDF Support
So based on hbagchi's post, am I correct in understanding that
xhtml2pdf will only convert a single (x)html document into a single
PDF page? In other words, if I have a "book" consisting of several
html "chapters" linked by a TOC (and among themselves), with a title
page at the root, it is impossible to convert them into a single PDF
document?

I was not able to do that with my own document as described above,
and the linked stylesheet was ignored as well... (All the documents,
HTML and CSS, have passed W3C validation.)

----
It's been one very long afternoon, first I couldn't get pisa to work
on my Linux installation at all, and had to boot into Windows which
still required me to download package after package requiring yet
another package to be even installed... after reverting to Python
2.5.2... figuring out how to get it to recognize the command itself,
figuring out why not even images would show up... so I'm new to this
(and I have started to learn some Python anyway) but does it have to
be so hard to get a simple command-line tool to work...

Dirk Holtwick

Oct 27, 2008, 9:18:53 AM
to xhtm...@googlegroups.com
> Is there any solution available which can download all hyperlinked
> HTML files starting from a root html and build a single pdf doc out of
> it with the hyperlinks converted to pdf bookmarks.

Try "wget" or "httrack", XHTML2PDF supports wildcards for batch
conversion e.g.:

$ xhtml2pdf *.html
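
A wildcard expands in alphabetical order, which may not be the reading order of a mirrored site. Below is a minimal sketch of deriving the order from the links themselves. It assumes the pages have already been fetched (for example with wget) and uses a hypothetical in-memory dict `pages` in place of files on disk; `LinkCollector` and `reading_order` are illustrative names, not part of XHTML2PDF.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def reading_order(pages, root):
    """Return the page names reachable from root, breadth-first.

    pages maps a file name to its HTML text (a hypothetical stand-in
    for a directory of mirrored files).
    """
    order, queue, seen = [], [root], {root}
    while queue:
        name = queue.pop(0)
        order.append(name)
        parser = LinkCollector()
        parser.feed(pages[name])
        for href in parser.links:
            # Drop any "#fragment" so ch1.html#top counts as ch1.html.
            target, _fragment = urldefrag(href)
            # Links to pages we do not have (e.g. external sites) are skipped.
            if target in pages and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


pages = {
    "index.html": '<a href="ch2.html">Two</a> <a href="ch1.html#top">One</a>',
    "ch1.html": '<a href="index.html">Home</a>',
    "ch2.html": '<a href="ch1.html">Prev</a> <a href="http://example.com/">Ext</a>',
    "orphan.html": "<p>never linked</p>",
}
print(reading_order(pages, "index.html"))
```

The resulting list could be passed to the converter file by file instead of relying on `*.html`; pages that are never linked (such as `orphan.html` here) are left out.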

Dirk

Dirk Holtwick

Oct 27, 2008, 9:22:59 AM
to xhtm...@googlegroups.com
Grumpy Old Man wrote:

> So based on hbagchi's post, am I correct in understanding that
> xhtml2pdf will only convert a single (x)html document into a single
> PDF page? In other words, if I have a "book" consisting of several
> html "chapters" linked by a TOC (and among themselves), with a title
> page at the root, it is impossible to convert them into a single PDF
> document?

Yes, that's true. Is there interest in a feature like this? If so,
I could write an option for the command line tool that concatenates
the generated PDFs after conversion. Something like:

$ xhtml2pdf --concat *.html

> It's been one very long afternoon, first I couldn't get pisa to work
> on my Linux installation at all, and had to boot into Windows which
> still required me to download package after package requiring yet
> another package to be even installed... after reverting to Python
> 2.5.2... figuring out how to get it to recognize the command itself,
> figuring out why not even images would show up... so I'm new to this
> (and I have started to learn some Python anyway) but does it have to
> be so hard to get a simple command-line tool to work...

The problem is that XHTML2PDF depends on several third-party packages.
But I think I could provide an "all batteries included" distribution in
the future if this is a barrier to installation.

Thanks for the hints and inspirations
Dirk

Grumpy Old Man

Oct 27, 2008, 2:46:16 PM
to Pisa XHTML2PDF Support

On Oct 27, 6:22 am, Dirk Holtwick <dirk.holtw...@gmail.com> wrote:
> Grumpy Old Man wrote:
>
> > So based on hbagchi's post, am I correct in understanding that
> > xhtml2pdf will only convert a single (x)html document into a single
> > PDF page? <SNIP>
>
> Yes, that's true. Is there interest in a feature like this? If it is so
> I could write an option for the command line tool that concatenates
> generated PDF after conversion. Something like:
>
> $ xhtml2pdf --concat *.html

That would be a helpful feature indeed (maybe some other PDF tool out
there does that already?)
The only tricky part I see is if the original pages are interlinked,
can those links be re-established once the single-page PDFs have been
generated?

More generally speaking, would it be better/easier to *pre-process the
separate html files* into one big file, then turn it into a PDF
(assuming pisa supports links within a PDF file, haven't had a chance
to try that) or is it indeed easier to *post-process the separate
PDFs* after they have been generated from the html files?

> > It's been one very long afternoon, <SNIP>
>
> The problem is that XHTML2PDF consists of several third party packages.
> But I think I could provide an "all batteries included" distribution for
> the future if this is a barrier for installation.
>
> Thanks for the hints and inspirations
> Dirk

Looking back at my post it comes across like I'm singling out pisa,
which was not intentional... sorry. Other software (and python
scripts) have that happen too, and I'm new to the particularities of
Python. I'm sure the extra packages will come in handy anyway since
they are part of what I intend to (try to) do in Python.

But if Python eggs are supposed to be(come) the equivalent of Java's
jars, the all-in-one approach would certainly be a major boon, at
least as an option for beginners and end-users.

In the meantime I have to look closer into why my CSS properties
didn't carry over (linked rather than embedded?) and I would like to
get pisa to work in Linux... but this is obviously not the thread for
that.

Dirk Holtwick

Oct 30, 2008, 5:23:16 AM
to xhtm...@googlegroups.com
> More generally speaking, would it be better/easier to *pre-process the
> separate html files* into one big file, then turn it into a PDF
> (assuming pisa supports links within a PDF file, haven't had a chance
> to try that) or is it indeed easier to *post-process the separate
> PDFs* after they have been generated from the html files?

It would be easier to build one big document if you want the links to
work. But this could also be done before passing the HTML to XHTML2PDF
since this is a very special case.
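
The pre-processing route can be sketched as follows. This is a minimal illustration, not XHTML2PDF code: `merge_chapters` and the `doc-...` id scheme are hypothetical, the chapter bodies are assumed to be already extracted from their files, fragment ids are assumed to be unique across chapters, and it is assumed that internal `#anchor` links survive the PDF conversion.

```python
import re

# Simplification: match href="..." or href='...' with a regular expression
# rather than a full HTML parser.
HREF = re.compile(r'href=(["\'])(.*?)\1')


def merge_chapters(chapters):
    """Join (file_name, body_html) pairs into one HTML document.

    Each chapter is wrapped in a <div> whose id is derived from its file
    name, and links between chapters are rewritten to point at those ids.
    """
    ids = {name: "doc-" + re.sub(r"\W+", "-", name) for name, _ in chapters}

    def rewrite(match):
        name, _, fragment = match.group(2).partition("#")
        if name not in ids:
            return match.group(0)  # external or unknown link: leave as-is
        # Keep an explicit fragment, otherwise jump to the chapter start.
        return 'href="#%s"' % (fragment or ids[name])

    parts = [
        '<div id="%s">%s</div>' % (ids[name], HREF.sub(rewrite, body))
        for name, body in chapters
    ]
    return "<html><body>\n%s\n</body></html>" % "\n".join(parts)


chapters = [
    ("index.html", '<h1>Book</h1><a href="ch1.html">Chapter 1</a>'),
    ("ch1.html", '<h2>One</h2><a href="index.html">Back</a> '
                 '<a href="http://example.com/">Ext</a>'),
]
merged = merge_chapters(chapters)
print(merged)
```

The merged string would then be converted as a single document, so the rewritten links only have to resolve inside one file.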

> But if Python eggs are supposed to be(come) the equivalent of Java's
> jars, the all-in-one approach would certainly be a major boon, at
> least as an option for beginners and end-users.

The discussion about how to install Python packages is still ongoing, but
eggs are becoming a "quasi" standard for people who like to use Python
products but do not develop in Python. It's like "gem" for Ruby and so on.

> In the meantime I have to look closer into why my CSS properties
> didn't carry over (linked rather than embedded?) and I would like to
> get pisa to work in Linux... but this is obviously not the thread for
> that.

I am quite sure that it should be possible to get it working on Linux.
Install "Reportlab" and "html5lib" as described in their documentation
and then install XHTML2PDF e.g. using the TAR.GZ file:

$ python setup.py install

Dirk

zvart

Oct 31, 2008, 12:21:14 PM
to Pisa XHTML2PDF Support
I didn't see pyPDF http://pybrary.net/pyPdf/ mentioned in this
thread. I find it pretty handy for merging PDFs with short scripts.
A lot of times a script makes more sense for me anyway since I have to
sort my merges by page counts, special paper runs, business data,
etc... It's fast too.

Dirk Holtwick

Oct 31, 2008, 12:42:01 PM
to xhtm...@googlegroups.com
Hi,

With the next release of XHTML2PDF there will be a joining method based
on pyPdf; an early version is already in SVN. On the command line,
something like this will be possible:

$ pisa --join *.html

For programmers it will be like:

pdf = pisaPDF()
pdf.addFromDocument(pisaDocument("Hello <b>World</b>"))
pdf.addFromURI("some_other_pdf.pdf")
pdfBinary = pdf.getvalue()

There are many more features to come soon. I think next week the new
version will be ready.

Dirk

zvart wrote:


> I didn't see pyPDF http://pybrary.net/pyPdf/ mentioned in this
> thread. I find it pretty handy for merging PDFs with short scripts.
> A lot of times a script makes more sense for me anyway since I have to
> sort my merges by page counts, special paper runs, business data,
> etc... It's fast too.
