Need help for regex to find unclosed tags in html

Triar

Jun 18, 2010, 10:15:35 PM
I thought this should work: '<([a-zA-Z][a-zA-Z0-9]*)[^>]*>(.*)[^(</\1>)]'
but I just get an error: Warning: preg_match_all()
[function.preg-match-all]: Unknown modifier ']' ....

What's wrong with my regex and how is it done?

Curtis Dyer

Jun 19, 2010, 12:51:49 AM
Triar <tr...@spam.la> wrote:

> I thought this should work:

Don't use regular expressions to parse HTML, use the DOM[1].
Also, you didn't use delimiters in your regular expression. See
below.

> '<([a-zA-Z][a-zA-Z0-9]*)[^>]*>

This won't even match all valid (X)HTML tag names, as far as I can
tell (e.g., namespaces).

> (.*)

The dot metacharacter doesn't match newlines by default, so

<p>
foo bar
</p>

wouldn't match, even if your pattern parsed. Read the PCRE manual
in PHP's documentation; it lists and explains the available
modifiers for handling situations such as this. Again, though,
you should use the DOM to parse HTML, not regular expressions.
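
To illustrate the point about modifiers, here is a minimal sketch. The sample string and variable names are made up for this example, and it only demonstrates the "s" modifier; it is not a recommendation to parse HTML this way.

```php
<?php
// Sample markup spanning several lines, as in the example above.
$html = "<p>\nfoo bar\n</p>";

// Without a modifier, "." does not match a newline, so the pattern fails.
$plain = preg_match('/<p>(.*)<\/p>/', $html);

// With the "s" (PCRE_DOTALL) modifier, "." matches newlines as well.
$dotall = preg_match('/<p>(.*)<\/p>/s', $html, $m);

var_dump($plain, $dotall);
```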

> [^(</\1>)]'

Backreferences don't work inside character classes, and the
parentheses there are ordinary characters rather than a group.
Besides, I can see no reason to use a negated character class
here in the first place.
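
A small sketch of that behaviour, using made-up test strings. As far as I know, PCRE reads \1 inside a character class as an octal character escape rather than a backreference, which is why the second match below fails.

```php
<?php
// Outside a character class, \1 refers back to the first capture group.
$backref = preg_match('/(a)\1/', 'aa');

// Inside a character class, \1 is no longer a backreference, so the
// second "a" is not matched by [\1].
$inClass = preg_match('/(a)[\1]/', 'aa');

// In [^(<\/>)] the characters "(", "<", "/", ">" and ")" are plain
// literals: the class rejects any single one of them and accepts the rest.
$literal = preg_match('/^[^(<\/>)]$/', '(');
$other   = preg_match('/^[^(<\/>)]$/', 'x');

var_dump($backref, $inClass, $literal, $other);
```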

> but I just get an error: Warning: preg_match_all()
> [function.preg-match-all]: Unknown modifier ']' ....

PHP's PCRE functions require the pattern to be enclosed in
delimiters[2], so your regex isn't being parsed correctly.
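
A rough sketch of the difference, assuming "#" as the delimiter (any character the manual allows would do) and a made-up sample string. It only covers the opening-tag part of the pattern and is not meant as a way to find unclosed tags.

```php
<?php
$html = '<p>Learn 2 <em>close tags</p>';

// No delimiters: PHP takes the leading "<" as the opening delimiter and the
// first ">" as the closing one, then tries to read "]" as a modifier.
// The @ only silences the warning for this demonstration.
$broken = @preg_match_all('<([a-zA-Z][a-zA-Z0-9]*)[^>]*>', $html, $m1);

// The same sub-pattern wrapped in "#" delimiters compiles and matches
// the opening tags.
$fixed = preg_match_all('#<([a-zA-Z][a-zA-Z0-9]*)[^>]*>#', $html, $m2);

var_dump($broken, $fixed, $m2[1]);
```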


____
[1] = <http://php.net/manual/en/domdocument.loadhtml.php>
[2] = <http://php.net/manual/en/regexp.reference.delimiters.php>

--
Curtis Dyer
<?$x='<?$x=%c%s%c;printf($x,39,$x,39);?>';printf($x,39,$x,39);?>

Denis McMahon

Jun 19, 2010, 1:28:59 AM

Use the Tidy extension

Rgds

Denis McMahon

"Álvaro G. Vicario"

Jun 21, 2010, 3:50:23 AM
El 19/06/2010 6:51, Curtis Dyer escribió/wrote:
> Triar<tr...@spam.la> wrote:
>
>> I thought this should work:
>
> Don't use regular expressions to parse HTML, use the DOM[1].

You can use the DOM functions to *fix* invalid HTML but... can you
actually use it to *report* what's wrong?

(Of course, the OP probably doesn't really want to do the latter.)

--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- Mi sitio sobre programación web: http://borrame.com
-- Mi web de humor satinado: http://www.demogracia.com
--

E.Lanenga

Jun 21, 2010, 5:11:31 AM
Denis McMahon wrote:

Thank you for your advice, but I can't use the Tidy extension.
I managed to solve the problem by using DOMDocument.

Triar.

E.Lanenga

Jun 21, 2010, 5:12:42 AM
Curtis Dyer wrote:

Thanks for your advice. I managed to solve my problem by using the
DOMDocument class.

Triar.

Curtis Dyer

Jun 21, 2010, 8:56:48 AM
"Álvaro G. Vicario" <alvaro.NO...@demogracia.com.invalid>
wrote:

> El 19/06/2010 6:51, Curtis Dyer escribió/wrote:
>> Triar<tr...@spam.la> wrote:
>>
>>> I thought this should work:
>>
>> Don't use regular expressions to parse HTML, use the DOM[1].
>
> You can use the DOM functions to *fix* invalid HTML but... can
> you actually use it to *report* what's wrong?

Yes, with the help of libxml.

<?php

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$html = '<p>Learn 2 <em>close tags</p>';
$dom = new DOMDocument();
$msg = "HTML parse error: Line %d: %s\n";

if ($dom->loadHTML($html)) {
    // loadHTML() succeeds on recoverable errors, so check the error log too.
    if ($err = libxml_get_last_error()) {
        printf($msg, $err->line, $err->message);
    } else {
        echo "No parse errors.\n";
    }
} else {
    echo "Unable to parse HTML.\n";
}

You can also get an array of LibXMLError objects with
libxml_get_errors(), so this example's a bit contrived.
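
For completeness, a sketch of the libxml_get_errors() variant. The helper name collectHtmlErrors() is made up for this example and is not part of PHP. Which errors get reported, and how they are worded, depends on the libxml2 version PHP is built against.

```php
<?php
// Hypothetical helper (not part of PHP): returns one message per libxml error.
function collectHtmlErrors(string $html): array
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $err) {
        // Each $err is a LibXMLError object with line, column and message.
        $messages[] = sprintf(
            'Line %d, column %d: %s',
            $err->line,
            $err->column,
            trim($err->message)
        );
    }

    // Empty the error buffer and restore the caller's previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

foreach (collectHtmlErrors('<p>Learn 2 <em>close tags</p>') as $line) {
    echo $line, "\n";
}
```

Clearing the buffer before and after parsing keeps errors from one document from leaking into the report for the next one.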

"Álvaro G. Vicario"

Jun 21, 2010, 10:14:50 AM
El 21/06/2010 14:56, Curtis Dyer escribió/wrote:
> "Álvaro G. Vicario"<alvaro.NO...@demogracia.com.invalid>
> wrote:

>
>> El 19/06/2010 6:51, Curtis Dyer escribió/wrote:
>>> Triar<tr...@spam.la> wrote:
>>>
>>>> I thought this should work:
>>>
>>> Don't use regular expressions to parse HTML, use the DOM[1].
>>
>> You can use the DOM functions to *fix* invalid HTML but... can
>> you actually use it to *report* what's wrong?
>
> Yes, with the help of libxml.
>
> <?php
>
> libxml_use_internal_errors(true);
>
> $html = '<p>Learn 2<em>close tags</p>';
> $dom = new DOMDocument();
> $msg = "HTML parse error: Line %d: %s\n";
>
> if ($dom->loadHTML($html)) {
> if ($err = libxml_get_last_error())
> printf($msg, $err->line, $err->message);
> else
> echo "No parse errors.\n";
> }
> else {
> echo "Unable to parse HTML.\n";
> }
>
> You can also get an array of LibXMLError objects with
> libxml_get_errors(), so this example's a bit contrived.

Nice! Thanks for the code snippet.


--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- Mi sitio sobre programación web: http://borrame.com

matt

Jun 21, 2010, 10:38:26 AM

for starters, you need a delimiter...
