(X)HTML parsing with PHP

gareth rushgrove

Feb 5, 2008, 3:30:35 PM

to hkit-d...@googlegroups.com

Hi

I know the topic of parsing markup in PHP came up, and I just found
something while hacking on something else that might be of interest to
some.

Parsing tag soup is a pain, so hkit uses various tricks (including
tidy) to try and make sure it's dealing with XML. Specifically in the
loadUrl method.

The PHP function DOMDocument->loadHTMLFile() appears to do this pretty
well (from a small amount of testing so far).

http://uk.php.net/manual/en/function.dom-domdocument-loadhtmlfile.php
http://uk.php.net/manual/en/function.dom-domdocument-loadhtml.php

All I'm using (for admittedly a pretty simple case) is:

$url = "http://google.com";
$xml = new DOMDocument();
$xml->loadHTMLFile($url);

If I get a chance I'll try and make the change to hkit and give it a
going over. If anyone knows a reason this won't work please let me
know.
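A minimal sketch of the same idea applied to a string of tag soup, with the result handed on to SimpleXML. The function name parseTagSoup and the sample markup are assumptions made up for illustration, not part of hkit. It relies on loadHTML(), libxml_use_internal_errors() and simplexml_import_dom(), all standard PHP functions, and assumes a PHP version with scalar type declarations.

```php
<?php
// Sketch: parse tag soup with DOMDocument, then hand the tree to SimpleXML.
// parseTagSoup and the sample markup below are illustrative assumptions.
function parseTagSoup(string $html): SimpleXMLElement
{
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);

    // Discard the collected errors and restore the previous error mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($loaded === false || $doc->documentElement === null) {
        throw new RuntimeException('Markup could not be parsed');
    }

    $xml = simplexml_import_dom($doc);
    if (!$xml instanceof SimpleXMLElement) {
        throw new RuntimeException('DOM tree could not be converted');
    }
    return $xml;
}

// Deliberately sloppy markup: unclosed <p>, missing </html>.
$soup = '<html><body><div class="vcard">'
      . '<span class="fn">Gareth Rushgrove</span><br><p>unclosed paragraph'
      . '</div></body>';

$xml = parseTagSoup($soup);
$names = $xml->xpath('//*[@class="fn"]');
```

Parsing from a string rather than a URL keeps the fetch step separate, so the existing fetching code would not need to change.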

Thanks

Gareth

--
Gareth Rushgrove
garethrushgrove.com
morethanseven.net
getjobsin.com
isitbirthday.com

Drew McLellan

Feb 6, 2008, 5:01:54 AM

to hkit-discuss

On Feb 5, 8:30 pm, "gareth rushgrove" <gareth.rushgr...@gmail.com>
wrote:

> All I'm using (for admittedly a pretty simple case) is:
>
> $url = "http://google.com";
> $xml = new DOMDocument();
> $xml->loadHTMLFile($url);
>
> If I get a chance I'll try and make the change to hkit and give it a
> going over. If anyone knows a reason this won't work please let me
> know.

Certainly sounds like it's worth a try, Gareth. It'd be interesting to
see how it compares to using Tidy, especially if it offers a speed
improvement for those having to use a Tidy proxy.

I guess things to watch out for would be whether the function is
commonly enabled on shared hosting, and whether it gives the same
level of control that using cURL does.
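One way both concerns could be handled, as a rough sketch: check at runtime that the DOM extension is loaded, keep the fetch in cURL, and pass the fetched string to loadHTML() instead of letting loadHTMLFile() open the URL. The helper names domParsingAvailable, fetchMarkup and parseMarkup, and the user agent string, are hypothetical and not taken from hkit.

```php
<?php
// Sketch only: the helper names here are hypothetical, not hkit methods.

// Report whether the extensions this approach needs are loaded on the host.
function domParsingAvailable(): bool
{
    return extension_loaded('dom') && extension_loaded('libxml');
}

// Fetch with cURL so timeouts, redirects and the user agent stay configurable.
function fetchMarkup(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'example-fetcher/0.1', // placeholder value
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Fetch failed: ' . curl_error($ch));
    }
    return $body;
}

// Parse an already-fetched string rather than having DOMDocument open the URL.
function parseMarkup(string $html): DOMDocument
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}
```

A call would then look like parseMarkup(fetchMarkup($url)), guarded by domParsingAvailable() so that hosts without the extension could fall back to the existing Tidy route.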

Would love to see it in action. We can always try benchmarking such
changes with the requests coming through tools.microformatic.com.

drew.

Alper Çugun

Feb 9, 2008, 6:28:40 AM

to hkit-d...@googlegroups.com

On Feb 5, 2008, at 21:30, gareth rushgrove wrote:

> Parsing tag soup is a pain, so hkit uses various tricks (including
> tidy) to try and make sure it's dealing with XML. Specifically in the
> loadUrl method.

This would be nice. I found a bug in tidy where the output it produces is not to SimpleXML's liking. Attached is a small patch which at least suppresses these warnings and also introduces a reset of $this->base.

In a test case on my site I found that $this->base could get polluted with a strange value, which could then influence subsequent calls.

loadDoc.diff
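
The patch itself is not reproduced in the thread. As a rough illustration of the two ideas it describes, here is a hypothetical sketch: ExampleLoader is a made-up stand-in rather than hkit's class, and resetting base to an empty string is an assumption about what the reset does.

```php
<?php
// Hypothetical stand-in for a loader that keeps per-document state.
// This is not hkit's code and not the contents of loadDoc.diff.
class ExampleLoader
{
    public $base = '';

    public function loadDoc(string $markup): array
    {
        // Assumed behaviour: clear state left over from a previous call
        // so that a stale value cannot influence this one.
        $this->base = '';

        // Collect libxml errors quietly instead of emitting PHP warnings.
        $previous = libxml_use_internal_errors(true);
        $xml = simplexml_load_string($markup);
        $errors = libxml_get_errors();
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        // $xml is false when SimpleXML rejects the markup; $errors says why.
        return [$xml, $errors];
    }
}
```

Returning the collected errors rather than discarding them keeps the warnings out of the output while still leaving them available for debugging.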
