How does dompdf parse documents?


BrianS

Dec 13, 2012, 11:35:38 AM
to dompd...@googlegroups.com
bfrohs asks:

Somewhat off-topic, but do you parse HTML according to the specification? Or is it something more general? (And, if it's something more general, would you be interested in having it parsed according to the specification?)

In v0.6.0 beta 3 we added an HTML5-based parser, the PHP port of html5lib (http://code.google.com/p/html5lib/). It's disabled by default while we test it further, but it can be enabled in the configuration. The library works fairly well, but some malformed documents cause the parser to throw fatal errors. Unfortunately, the upstream html5lib code has changed a lot since the last PHP release in 2009. I believe we can fix the bugs in the current PHP code, but bringing it up to date with the Python version may not be feasible.
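
For anyone who wants to try the parser, here is a minimal sketch of the configuration override. It assumes the setting is the constant DOMPDF_ENABLE_HTML5PARSER and that local overrides live in dompdf_config.custom.inc.php; both names are assumptions based on the 0.6.x configuration layout, so check the dompdf_config.inc.php that ships with your copy before relying on them.

```php
<?php
// dompdf_config.custom.inc.php (assumed location for local overrides)

// Assumed constant name: turns on the html5lib-based pre-parser,
// which is off by default in 0.6.0 beta 3.
define("DOMPDF_ENABLE_HTML5PARSER", true);
```

Since some malformed documents can trigger fatal errors in the parser, it is probably wise to try this against a sample of real input before enabling it anywhere important.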

Moving along with document parsing... When the HTML5 parser is enabled, it gets the first shot at the document. The next step, whether or not the parser is enabled, is to feed the document into a DOMDocument object. I'm not sure how DOMDocument::loadHTML parses documents. I haven't seen any discussion of the topic, so a reading of the code may be in order. After that, dompdf creates an internal object representation used for tracking element styling.
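
To make the DOMDocument step concrete, here is a small standalone sketch, not dompdf's actual code. PHP's DOM extension is built on libxml2, so loadHTML uses libxml2's lenient HTML parser, which repairs markup rather than rejecting it. The function name load_lenient and the sample markup are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper: load possibly malformed HTML into a DOMDocument,
// collecting parser complaints instead of emitting PHP warnings.
function load_lenient($html)
{
    // Buffer libxml errors internally; remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Print whatever the parser reported, then reset the error buffer.
    foreach (libxml_get_errors() as $error) {
        echo "libxml: ", trim($error->message), "\n";
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Sample input with unclosed <p> and <b> elements.
$doc = load_lenient('<html><body><p>Hello <b>world</body></html>');

// The repaired tree is what a later stage would walk to attach styling.
echo $doc->saveHTML();
```

Which messages are reported, if any, depends on the libxml2 version PHP was built against, so treat the error output as diagnostic only.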

Fabien would be able to address the process in more detail, but that's a basic overview.