HTML to PDF

Casey W. Stark

Oct 29, 2010, 12:49:31 AM
to MathJax Users
[Previously posted at http://sourceforge.net/projects/mathjax/forums/forum/948700/topic/3819537]

Bala
2010-08-21 00:58:42 PDT

I'm trying to get a PDF from MathJax-rendered HTML using wkhtmltopdf,
but the output shows empty boxes where the math should be; everything
other than the math is fine.

wkhtmltopdf: http://code.google.com/p/wkhtmltopdf/

Is there any hope for PDF output?



Regards
Bala

----------

dpvc
2010-08-21 03:58:55 PDT

Yes, wkhtmltopdf can do it, but the problem is that it produces the
pdf version before MathJax has run and typeset the mathematics. You
need to tell it to wait for a bit in order to allow the javascript to
run and do its thing. Use the --javascript-delay option (in version
0.10; I think it is --redirect-delay in 0.9) to force wkhtmltopdf to
wait for MathJax to run. Unfortunately, you need to tell it how LONG
to wait, and that is not easy to know. It depends on how much math
there is on the page, how good your network connection is, and how
fast your computer is. I had to use --javascript-delay 8000 (that's 8
seconds) for it to render the MathJax TeX sample page, but your time
may be different. It helps to install the MathJax fonts on the
computer that is running wkhtmltopdf, as that will mean you don't have
to wait for web-font downloads as part of your processing time. You
can find them in the MathJax/fonts/HTML-CSS/TeX/otf folder.
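
As a rough sketch of the delay approach, the wkhtmltopdf call could be
wrapped in a small Node.js script. The helper names buildArgs and
renderPdf, the file names, and the 8000 ms default are illustrative
assumptions; the delay has to be tuned for each page, and version 0.9
may need --redirect-delay instead.

```javascript
// Sketch only: assumes a wkhtmltopdf binary on the PATH that accepts --javascript-delay.
const { execFile } = require("child_process");

// Build the argument list; delayMs is how long wkhtmltopdf waits for scripts such as MathJax.
function buildArgs(inputUrl, outputPdf, delayMs = 8000) {
  return ["--javascript-delay", String(delayMs), inputUrl, outputPdf];
}

// Run wkhtmltopdf and report success or failure through a Node-style callback.
function renderPdf(inputUrl, outputPdf, delayMs, done) {
  execFile("wkhtmltopdf", buildArgs(inputUrl, outputPdf, delayMs), (err) => done(err || null));
}

// Example with hypothetical file names:
// renderPdf("sample-tex.html", "sample-tex.pdf", 8000, (err) => {
//   console.log(err ? "conversion failed: " + err.message : "PDF written");
// });
```

If the math still comes out as empty boxes, the delay is probably too
short for that page and machine.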

Good luck with the project. If you get something working that others
could take advantage of, please report it back here.

Davide

----------

Bala
2010-08-21 04:20:41 PDT

Hi Davide,

Thanks for your reply, I will try with your tips and reply as soon as
possible.

Regards
Bala

----------

https://www.google.com/accounts
2010-08-21 08:11:30 PDT

It will be hard, in general, to expect other packages, such as
wkhtmltopdf, to add code to wait for MathJax. It doesn't appear to
scale up very well since it means that such packages would have to
know all other libraries like MathJax that they have to wait for.
However, I doubt there is a more general mechanism that would work.
MathJax is acting like a user-initiated script that is busy modifying
the page. Perhaps there's a way for a package like wkhtmltopdf to test
whether any script still has work to do and wait for some sort of
period of inactivity before doing its thing.

----------

dpvc
2010-08-21 08:36:11 PDT

I certainly don't expect packages like wkhtmltopdf to know about
MathJax, but it would be nice if they had a method of allowing the
javascript to signal them that all is ready and the page is stable
(some special javascript method that they add to the page, for
example). Then you could use MathJax's signal mechanism to call that
routine after MathJax has finished running, so that you get your pdf
page knowing it will be complete, but without having to use
excessively long delays.

Davide

----------

https://www.google.com/accounts
2010-08-21 09:17:18 PDT

Sounds like a good idea. Essentially, you are asking wkhtmltopdf to
add a general-purpose "wait, I'm busy" method. Of course, that would
represent a dependency problem going the other way. MathJax would have
to know of the existence of all such libraries.

It is not clear to me what a more general solution would look like.
Perhaps HTML5 has a more general document state that can be queried.
What would be really nice is for MathJax to do its formatting in such
a way that other scripts can check the document state and know that
some page-formatting operation was underway.

I know that one of the main features of the HTML5 effort is to specify
how the browser is to behave rather than just the meaning of the
document markup as in HTML4. Perhaps it would be worth looking into.

----------

dpvc
2010-08-21 10:44:28 PDT

that would represent a dependency problem going the other way. MathJax
would have to know of the existence of all such libraries.


No, not at all. Only the page you are trying to convert to pdf does,
not MathJax itself. If it is a page that you have authored, you can
arrange for that; if not, then if tools like wkhtmltopdf would allow
insertion of javascript into the page (along the lines of
GreaseMonkey), then you could ADD the required code to have it signal
them when needed. That is, you could force the insertion of
MathJax.Hub.Register.StartupHook("End", function () {wkhtmltopdf.pageReady()})

or something along those lines.

MathJax already has a signal/listener method of telling other code in
the page about what it (or has) happened in terms of mathematics on
the page. You can hook into those messages like I suggest above in
order to regulate the actions you need to take.
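
A minimal sketch of that hook, written so the converter-side callback is
passed in rather than hard-coded. The names signalWhenTypeset and
notifyConverter are hypothetical, and no PDF tool is assumed to provide
such a callback out of the box; only the StartupHook("End", ...)
registration is MathJax's own signal mechanism.

```javascript
// Sketch only: notifyConverter stands in for whatever hook the PDF tool exposes (hypothetical).
function signalWhenTypeset(mathJax, notifyConverter) {
  // MathJax posts the "End" startup signal once its initial typesetting pass is complete.
  mathJax.Hub.Register.StartupHook("End", function () {
    notifyConverter();
  });
}

// In a page this might be wired up as follows (the status string is an arbitrary choice):
// signalWhenTypeset(window.MathJax, function () { window.status = "mathjax-done"; });
```

Later wkhtmltopdf releases list a --window-status option that waits for
window.status to match a given string, which is one way the page-side
signal above could be consumed; check your version's help output before
relying on it.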

Davide

----------

dpvc
2010-08-21 10:46:33 PDT

(I meant "about what is happening (or has happened) in terms of
mathematics". I sure wish there were a way to edit these messages
after they have been posted.)

----------

https://www.google.com/accounts
2010-08-21 10:57:08 PDT

I see your point. However, even what you suggest is plumbing that many
authors won't know is needed. The author of a page can't be expected
to know about all potential interactions between JS libraries in a
page. Typically, a website owner/author will use each library for the
functionality it provides and not expect them to interact. Obviously
there are cases where libraries are expected to interact and cases
where the author could reasonably expect them to interact. I don't
think this is one of those cases though.

One of the cool things about JavaScript is that scripts have access to
virtually everything. That power is a real advantage in dealing with
one-off issues in a page, or even a whole website if the author has
control of all its content. However, that power is a disadvantage when
libraries with broad distribution require such power to fix
interactions that should not occur in a perfect world.

----------

caseystark
2010-08-21 13:19:49 PDT

I agree with Davide. Using the time elapsed as a wait mechanism is
really crude. It's useful because it's simple, but they should really
support using something more advanced like a ready signal.

Yeah, it's too bad that the page author would have to add this code to
any page they want to use wkhtmltopdf on, but I think it's a very
special case anyway. I'm pretty sure in most cases people will just
use print to PDF in their browser, and they can see when the math has
been typeset.

----------

victor_ivrii
2010-08-21 21:11:22 PDT

Adobe Acrobat can create pdf from webpages including those which use
javascript. However, it displays the TeX source rather than math. Currently
I cannot determine the signature of AA as a browser when it retrieves
pages.

----------

dpvc
2010-08-22 09:14:47 PDT

@caseystark and @anonymous (sorry, don't know who you are):

I wasn't thinking of this as something authors would be inserting in
their pages on the off chance that someone might use wkhtmltopdf with
it. I had in mind a page that uses a "download PDF" button (I have
seen a number of such sites), that would use wkhtmltopdf in a workflow
on the server to generate the PDF version of the page. They would
know that wkhtmltopdf was to be used, and would arrange that it would
work with their page.

Although it is nice when things "just work", I don't see how a page
author can expect to be completely ignorant of the implications of
using the various tools he or she has chosen to include in the page.
Using something like MathJax does mean that the page will not be in
its final form right away, and that is something that does have
implications, in this case that you have to tell wkhtmltopdf to wait a
bit before taking the image of the page. Another such item is that
MathJax processes the page as it is when the page finishes loading, so
if more mathematics is added dynamically to the page, you must call
MathJax again explicitly to process the new material. I don't see any
way around that given current technologies.
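
For the dynamic case, a small sketch of asking MathJax to process newly
added material through its queue. The helper name typesetNewMath and the
element id are hypothetical; the Hub.Queue(["Typeset", ...]) call is the
MathJax 1.x/2.x form, and newer MathJax versions use a different API.

```javascript
// Sketch only: mathJax is the global MathJax object; element is the node holding the new math.
function typesetNewMath(mathJax, element) {
  // Queue a Typeset call so it runs after any typesetting already in progress.
  mathJax.Hub.Queue(["Typeset", mathJax.Hub, element]);
}

// In a page, with a hypothetical element id:
// var box = document.getElementById("new-math");
// box.innerHTML = "\\(x^2 + y^2 = z^2\\)";
// typesetNewMath(window.MathJax, box);
```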

@victor_ivrii:

You may be having the same problem as Bala, in that the image is being
taken before MathJax has had the chance to run. You might look for an
option with Acrobat to see if it can delay taking the image for a
specific time, and use that if you can. It may also be that, as you
suggest, MathJax doesn't work with Acrobat's javascript and DOM
implementation. It certainly is not something I've tested. MathJax
will try to render the mathematics even if it does not know the
browser, but it may be crashing somewhere along the line.

Davide

----------

victor_ivrii
2010-08-22 09:29:11 PDT

It may be the same problem, and it may be different as supposedly
wkhtmltopdf uses WebKit and AA uses Adobe Web Capture and I have no
idea what it is. At some moment AA reports an error with some elements
of the page. Note that for TeX AA displays the source and for MathML
it displays nothing at all, even space is not attributed - as it
should do with unknown tags.

I tried against jsMath - the same as with MJ, albeit no error reported.

There is no timing option in AA for HTML (or any other format).

Victor

----------

dpvc
2010-08-22 09:37:36 PDT

Can you be more specific about the error message you are seeing?

It certainly could be that MathJax is failing with that Acrobat
javascript/DOM implementation. If there is nothing displayed for
MathML, it may be that the mml2jax preprocessor has run, which would
put the MathML into MathJax's internal form (where it would not be
displayed), but the HTML-CSS output jax failed to generate the
output. I don't have Acrobat to test with. Sorry!

Davide

----------

victor_ivrii
2010-08-22 10:03:32 PDT

Can you be more specific about the error message you are seeing?

- I looked more carefully - it is by no means MJ related. Sorry for
the confusion.

Victor

----------

paultopping
2010-08-22 10:24:05 PDT

[Sorry, I guess I, Paul Topping, was the anonymous one. I didn't know
logging in via Google would have that effect. I've now logged in
directly to SourceForge so it should be fixed.]

I understood that the code you were suggesting would only be inserted
by an author who recognized that they were using both MathJax and
wkhtmltopdf and that they needed it to make the latter work more
reliably.

I totally do think things should just work. When they don't, it is
seen as failure regardless of the cause. That is just life. I realize
MathJax is doing what it can given browser technology. However, it is
unfortunate that it has this vulnerability.

I do think users of MathJax and tools like wkhtmltopdf should expect
things to just work. Authors and website owners will read about pieces
of functionality such as MathJax and wkhtmltopdf, desire to add them
to their website, follow their installation instructions, and expect
each to provide the functionality as promised. This is done for each
package independently. In general, they won't worry about how they are
implemented or give any thought to how browsers work. If they want
math on their pages, they will hear that MathJax will do the job,
follow its installation instructions and just expect it to perform as
advertised. The concept that it processes math in the page after the
page loads and formats will be lost on most. After all, they are not
programmers but authors and content developers.

----------

victor_ivrii
2010-08-22 11:19:30 PDT

dpvc It may also be that, as you suggest, MathJax doesn't work with
Acrobat's javascript and DOM implementation.


Talking about javascript, one probably needs to distinguish between AA
itself and the built-in Adobe Web Capture, which may or may not be running
javascript. Also, AA has no option for how to handle external javascript.

----------

caseystark
2010-08-22 12:13:12 PDT

Just curious, why are people making web pages into PDFs? I've never
seen this done before...

----------

net-buoy
2010-08-22 16:45:32 PDT

Don't know about the OP, but paper is still the sine qua non and we
have yet to come to the generational divide where the majority of the
human race relies solely on electronic media. Until then, it is
important for quite a few to be able to obtain a paper copy of
electronic materials and the de facto way to manage that is
conversion into pdf as that best preserves the page.

----------

Bala
2010-09-03 03:29:29 PDT

Hi Davide,
I've just tested, and the pdf problem is solved with your suggestion.
Thank you very much. A very long discussion is going on in my topic.

Regards
Bala