On 24/04/12 19:57, Peter Davis wrote:
> I've been trying to find comparisons of TeX/LaTeX with Apache FOP as
> tools for composing and typesetting document pages. Google turns up
> very little that seems relevant, and most of that compares DocBook,
> rather than FOP itself.
A lot of people do seem to confuse the markup of the source document
with the mechanism (XSL/T/FO/LaTeX) used to transform the document into
printable or viewable form.
There are two main ways to do this from XML to PDF, both of which use an
intermediate format (LaTeX or FO):
1. XML doc --> XSLT program --> LaTeX doc --> LaTeX --> PDF
2. XML doc --> XSLT:FO program --> FO doc --> FOP* --> PDF
* Not just FOP: there are several commercial FO processors as well.
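For what it's worth, the two routes above might be driven from a script roughly like this. It is only a sketch under assumptions: xsltproc as the XSLT processor, pdflatex as the TeX engine, and the fop command-line wrapper for route 2. The file names and the helper names (latex_route, fo_route, run_route) are made up for illustration.

```python
import subprocess


def latex_route(xml_file, stylesheet, stem):
    """Route 1: XML doc -> XSLT program -> LaTeX doc -> LaTeX -> PDF."""
    tex_file = f"{stem}.tex"
    # nonstopmode + halt-on-error: fail instead of prompting for input
    latex = ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", tex_file]
    return [
        ["xsltproc", "-o", tex_file, stylesheet, xml_file],
        latex,  # first pass writes the auxiliary files
        latex,  # second pass reads them back in
    ]


def fo_route(xml_file, stylesheet, stem):
    """Route 2: XML doc -> XSLT program -> FO doc -> FOP -> PDF."""
    fo_file = f"{stem}.fo"
    return [
        ["xsltproc", "-o", fo_file, stylesheet, xml_file],
        ["fop", "-fo", fo_file, "-pdf", f"{stem}.pdf"],
    ]


def run_route(commands):
    """Run each step in order; check=True stops at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Something like run_route(latex_route("doc.xml", "to-latex.xsl", "doc")) would then run route 1, provided the tools are installed.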
> Does anyone have some relevant information? I'm trying to make a
> case for why we should (or should not) use TeX/LaTeX for automated
> PDF creation.
Unattended operation means guarding your workflow from bogus markup and
bogus characters, both in the XML source and in any generated
intermediate format.
XML is often created by people or systems using low-quality software
which allows invalid or non-well-formed documents to be output. A
non-well-formed document simply can't be processed (at all): it must be
sent back to the creator with a request to do it properly, or patched up
before it can be used.
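A guard for this could be as small as the following sketch, which uses Python's standard-library parser. The function name check_well_formed is my own invention, and note that it tests well-formedness only, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if text parses as XML, otherwise the parser's message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        return str(error)
    return None


# A mismatched tag is rejected before it can reach the rest of the workflow.
problem = check_well_formed("<doc><para>unclosed</doc>")
if problem is not None:
    print("send back to creator:", problem)
```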
LaTeX may gag on the 10 special characters (# $ % & ~ _ ^ \ { }), plus
< | > if they occur outside math mode. Any system creating LaTeX must
therefore guarantee that raw special characters (perfectly innocuous in
XML apart from < and &) will never find their way into the LaTeX source
unheralded, that is, unescaped. Unlike XML, LaTeX may have problems with
some Unicode characters and require manual fixes before the document
will compile.
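As an illustration of that guarantee, here is a minimal sketch of an escaping routine. The name escape_latex and the choice of replacement commands are my assumptions, not the only way to do it. It works in a single pass, so the braces in a replacement such as \textbackslash{} are not escaped a second time.

```python
# One-pass translation table: each special character maps to a safe form.
LATEX_ESCAPES = str.maketrans({
    "\\": r"\textbackslash{}",
    "{": r"\{",
    "}": r"\}",
    "#": r"\#",
    "$": r"\$",
    "%": r"\%",
    "&": r"\&",
    "_": r"\_",
    "^": r"\textasciicircum{}",
    "~": r"\textasciitilde{}",
    "<": r"\textless{}",
    ">": r"\textgreater{}",
    "|": r"\textbar{}",
})


def escape_latex(text):
    """Escape text content for use in LaTeX outside math mode."""
    return text.translate(LATEX_ESCAPES)


print(escape_latex("R&D: 100% {done}"))  # prints R\&D: 100\% \{done\}
```

This handles only the special characters; the Unicode problem still needs separate attention.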
> Features like tables of contents, indices, equations, etc. are very
> much useful.
ToC:
LaTeX is a strictly sequential processor, so it uses a two-pass process,
gathering the ToC data on the first pass and using it on the second.
XSL (both XSLT and XSL:FO) can "walk around" the document, picking
information from here and there, as well as processing it in order, so
it can create a ToC up front simply by looking through the document to
find and collate the section headings before it starts on the normal
body of processing.
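The "walk around" idea is not specific to XSL. Here is a rough Python sketch of the same approach: collect the headings in one walk of the tree before any body processing starts. The element names section and title are assumptions for illustration (they happen to resemble DocBook), and build_toc is a made-up helper.

```python
import xml.etree.ElementTree as ET


def build_toc(element, depth=1):
    """Return (depth, title) pairs for all nested sections, in document order."""
    entries = []
    for section in element.findall("section"):  # direct children only
        title = section.findtext("title", default="(untitled)")
        entries.append((depth, title.strip()))
        entries.extend(build_toc(section, depth + 1))
    return entries


SAMPLE = """<doc>
  <section><title>Intro</title>
    <section><title>Scope</title></section>
  </section>
  <section><title>Method</title></section>
</doc>"""

for depth, title in build_toc(ET.fromstring(SAMPLE)):
    print("  " * (depth - 1) + title)
```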
Indexes (which is what I think you mean by "indices") -- same applies:
LaTeX gathers it first, makeindex sorts and collates it, and the second
pass of LaTeX prints it.
XSL, as before, can "look back" once it gets to the end of the document,
and pick out all the index marks, sort and collate them, and then output
them.
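The same pick-out, sort and collate steps can be sketched in Python. Everything named here is an assumption for illustration: the indexterm element, the id attribute on section, and a flat (non-nested) section structure. A real index would point at pages, which only the formatter knows.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_index(root):
    """Collect index marks, collate by term, and sort case-insensitively."""
    locations = defaultdict(list)
    for section in root.findall("section"):
        for mark in section.iter("indexterm"):
            term = (mark.text or "").strip()
            section_id = section.get("id")
            # Collate: one entry per term, each location listed once
            if term and section_id not in locations[term]:
                locations[term].append(section_id)
    return sorted(locations.items(), key=lambda item: item[0].casefold())


SAMPLE = """<doc>
  <section id="s1"><para>On <indexterm>XSLT</indexterm> and
    <indexterm>fonts</indexterm>.</para></section>
  <section id="s2"><para>More <indexterm>XSLT</indexterm>.</para></section>
</doc>"""

for term, where in build_index(ET.fromstring(SAMPLE)):
    print(term, ", ".join(where))
```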
In both the above, LaTeX already has robust and well-established
mechanisms for ToCs (and LoFs and LoTs) and indexing and bibliographic
referencing and cross-referencing. In XSL, you pretty much need to write
your own routines from scratch, unless you are using a well-known
document type like DocBook or TEI, for which XSL code is readily available.
Equations: it depends on the markup; let us assume MathML:
I don't know anyone who uses semantic MathML or has ever tried to
convert it to anything (either FO or LaTeX), so I can't help there.
Presentation MathML is more tractable, but because a math expression can
be arbitrarily complex, any XSL code to handle it must be capable of
coping with arbitrary complexity. This is hard to write comprehensively,
so it's often done by limiting the code to handle just the math that is
used in the document or group of documents being processed. For
comparison, here (I think) is the semantic (Content) MathML for E=mc²:
<m:apply xmlns:m="http://www.w3.org/1998/Math/MathML">
  <m:eq/>
  <m:ci>E</m:ci>
  <m:apply>
    <m:times/>
    <m:ci>m</m:ci>
    <m:apply>
      <m:power/>
      <m:ci>c</m:ci>
      <m:cn base="10">2</m:cn>
    </m:apply>
  </m:apply>
</m:apply>
The twist is that Content (semantic) MathML uses prefix notation (the =
precedes the E and the mc², the × precedes the m and the c², and the
power precedes the c and the 2). David Carlisle has written excellent
XSL code to handle this.
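To show what unwinding the prefix notation involves, here is a deliberately limited Python sketch. It is not David Carlisle's code and handles only the elements used in the E=mc² example (apply, eq, times, power, ci, cn). It adds no brackets for operator precedence, which is exactly the "limit it to the math you actually have" compromise. The name to_latex is my own.

```python
import xml.etree.ElementTree as ET

M = "{http://www.w3.org/1998/Math/MathML}"  # namespace as ElementTree spells it


def to_latex(node):
    """Convert a small subset of prefix-notation MathML to infix LaTeX."""
    tag = node.tag.replace(M, "")
    if tag in ("ci", "cn"):  # identifiers and numbers are leaves
        return (node.text or "").strip()
    if tag != "apply":
        raise ValueError(f"unsupported element: {tag}")
    operator, *operands = list(node)  # the operator comes first
    name = operator.tag.replace(M, "")
    args = [to_latex(child) for child in operands]
    if name == "eq":
        return " = ".join(args)
    if name == "times":
        return " ".join(args)  # multiplication by juxtaposition
    if name == "power":
        base, exponent = args
        return f"{base}^{{{exponent}}}"
    raise ValueError(f"unsupported operator: {name}")


SAMPLE = """<m:apply xmlns:m="http://www.w3.org/1998/Math/MathML">
  <m:eq/><m:ci>E</m:ci>
  <m:apply><m:times/><m:ci>m</m:ci>
    <m:apply><m:power/><m:ci>c</m:ci><m:cn base="10">2</m:cn></m:apply>
  </m:apply>
</m:apply>"""

print(to_latex(ET.fromstring(SAMPLE)))  # prints E = m c^{2}
```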
> Also, the ability to put intra- and inter-document links
> in the PDF would be helpful.
Cross-references are done the same way in LaTeX and XML: LaTeX uses
\label{foo} and \ref{foo}; XML uses xml:id="foo" and xxx="foo" (where
xxx is an attribute of type IDREF or IDREFS). For the LaTeX route, you
write the XSLT to emit the relevant \label and \ref commands; for
hypertext links it also needs to generate the \href{...} and ensure the
use of the hyperref package. XSL:FO has its own facilities to create
internal and external link sources and targets.
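The mapping from xml:id and IDREF attributes to \label and \ref is mechanical. In practice it would live in the XSLT, but this Python sketch shows the idea. The element names section, title and xref and the attribute name linkend are assumptions borrowed from DocBook-style markup, and the titles are emitted without the escaping they would need in real use.

```python
import xml.etree.ElementTree as ET

# ElementTree's spelling of the xml:id attribute name
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def heading(section):
    """Emit \\section{...}, plus a \\label if the element carries xml:id."""
    title = section.findtext("title", default="")
    line = f"\\section{{{title}}}"
    label = section.get(XML_ID)
    if label:
        line += f"\\label{{{label}}}"
    return line


def xref(element):
    """Emit \\ref{...} for a cross-reference element (IDREF in linkend)."""
    return f"\\ref{{{element.get('linkend')}}}"


SAMPLE = """<section xml:id="foo"><title>Setup</title>
  <para>See section <xref linkend="foo"/>.</para></section>"""

root = ET.fromstring(SAMPLE)
print(heading(root))             # prints \section{Setup}\label{foo}
print(xref(root.find(".//xref")))  # prints \ref{foo}
```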
HOWEVER...
XML-XSLT-LaTeX-PDF
PRO: there is one inestimable advantage: the built-in features (eg
footnoting, sectioning, floating, environments, etc) and the 4000+
packages of LaTeX. Most of the transformations I have done have
benefitted from being able to solve virtually all the formatting
requirements simply by adding in the relevant packages; typeset quality
far exceeds that of most FO processors.
CON: will gag on unheralded special characters and unknown Unicode
characters; usually generates warnings and errors which need manual
attention; extra font installation needs expert attention; very few
commercial systems use the XSLT-LaTeX route (possibly because their
vendors are unaware of it).
XML-XSL:FO-FOP-PDF
PRO: generally does not break on unusual characters; can probably use
all system-installed fonts as-is. There are MANY commercial PDF
production systems based on the FO route (possibly because all the big
vendors do this for Windows first).
CON: FOP is incomplete: for a complete implementation you need to buy a
commercial FO processor; unless you can use the prewritten code for
well-known document types, or you are working with a toolset which
includes such extras, you have to reinvent the wheel each time for all
formatting; typographic quality is office-standard, probably not
publication-standard unless under typographic supervision.
My personal preference is for the XSLT-LaTeX route, but that's because I
already knew LaTeX before XSLT came along. There was -- still is -- the
original SGML-DSSSL-Jade-TeX path, the Omnimark SGML/XML processor, and
any number of homebrew solutions based on onsgmls and awk or Perl.
YMMV
///Peter