Parsing error on erroneous DOCTYPE


Max Kenten

Feb 20, 2009, 1:31:33 AM
to phpQuery
On the following page I get an exception when calling newDocument:
http://www.chefkoch.de/magazin/artikel/943,0/AEG-Electrolux/Frischer-Saft-aus-dem-Dampfgarer.html

Warning: DOMDocument::loadXML() [domdocument.loadxml]: DOCTYPE improperly terminated in Entity, line: 1 in /home/chroot/wm/home/wm/inc/phpQuery/DOMDocumentWrapper.php on line 239

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Start tag expected, '<' not found in Entity, line: 1 in /home/chroot/wm/home/wm/inc/phpQuery/DOMDocumentWrapper.php on line 239

A quick fix was to uncomment the following lines (240-241) in
DOMDocumentWrapper.php, so that HTML which is not well-formed can
still be parsed:
if (! $return)
    $return = $this->document->loadHTML($markup);

I would like to know why this part of the code was commented out. Will
I run into any serious issues with this code being executed? So far
everything looks quite good, and I'm running this code against ~20,000
different pages (though not all of them had this error).

Thanks in advance,
Max

Tobiasz Cudnik

Feb 20, 2009, 6:32:57 AM
to phpQuery
The problem was a nested XML namespace. The fix is in revision 357.

> I would like to know, why this part of the code was commented?

Because this fallback has been moved to the main dispatch method,
loadMarkup(), and limited to XHTML documents. The problem was in the
isXML() method: this document shouldn't be considered XML.

You can always force the document type through the method name or the
content-type parameter, like this:

phpQuery::newDocumentHTML('...');
phpQuery::newDocument('...', 'text/xml');
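
For reference, the XML-then-HTML fallback discussed above can be sketched with plain DOMDocument calls. This is only an illustration and not phpQuery's actual code: the function name loadMarkupWithFallback and its return shape are made up for this example, and it does not reproduce the isXML() detection or the XHTML-only restriction.

```php
<?php
// Hypothetical helper (not part of phpQuery): try the strict XML parser
// first, then fall back to libxml's lenient HTML parser.
function loadMarkupWithFallback(string $markup): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $mode = 'xml';

    if (!$document->loadXML($markup)) {
        // Not well-formed XML: discard the errors and retry as HTML
        // on a fresh document.
        libxml_clear_errors();
        $document = new DOMDocument();
        $mode = $document->loadHTML($markup) ? 'html' : 'failed';
    }

    // Restore the caller's error handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$mode, $document];
}

// Usage: $mode reports which parser accepted the markup.
[$mode, $doc] = loadMarkupWithFallback('<html><body><p>Hello<br></body></html>');
```

Suppressing the warnings with libxml_use_internal_errors() keeps the failed XML attempt quiet, which may be preferable when crawling many pages of unknown quality.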
