Regular expression for HTML tag matching


Taras_96

May 12, 2009, 10:32:45 AM
to Regex
Hi everyone,

I've found the following regex for HTML tag matching (from
http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx)

</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>

I understand most of it except for the attribute value matching

(?:".*?"|'.*?'|[^'">\s]+)

What is the purpose of the '?:' at the start?

Also, what's the purpose of the \s*|\s* at the end? (I'm reading it as
'an attribute name/value pair followed by (whitespace or whitespace)'.)

Taras

Eugeny Sattler

May 13, 2009, 11:48:37 PM
to re...@googlegroups.com
Text matched by a part of a regular expression enclosed in round
brackets is saved as a capture group, which is referred to through a
"backreference". It can be used in a later part of the same regular
expression or afterwards in the scripting language. Usually $1
corresponds to the first pair of round brackets, $2 corresponds to the
second pair of round brackets, and so on, while $0 corresponds to the
whole match.
By contrast, if (?:   ) is used instead of (   ), nothing is saved
into a backreference; the brackets only group.
Not saving unnecessary captures saves a little memory and can improve
regex performance.
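
To make the difference concrete, here is a minimal sketch in Python (an assumption on my part, since the thread does not name a language; Python reads captures with .group(1) and .groups() rather than $1). The variable names are only illustrative. It applies the attribute-value part of the regex from the question twice, once with plain brackets and once with (?: ).

```python
import re

# Attribute-value part of the pattern from the question, written both ways.
VALUE_CAPTURING = r"""\s*=\s*(".*?"|'.*?'|[^'">\s]+)"""
VALUE_NON_CAPTURING = r"""\s*=\s*(?:".*?"|'.*?'|[^'">\s]+)"""

sample = 'href = "index.html"'

# (\w+) captures the attribute name in both cases.
capturing = re.search(r"(\w+)" + VALUE_CAPTURING, sample)
non_capturing = re.search(r"(\w+)" + VALUE_NON_CAPTURING, sample)

# Both patterns match exactly the same text ...
assert capturing.group(0) == non_capturing.group(0) == sample

# ... but only plain round brackets store what they matched.
print(capturing.groups())      # ('href', '"index.html"')
print(non_capturing.groups())  # ('href',)
```

So (?: ) changes what is remembered, not what is matched.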
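I did not go deep into your regex, but I know that an alternation
construct should be read from the opening round bracket to the matching
closing round bracket. So you should consider the whole
((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*) and not just \s*|\s*
(even though it contains a nested alternation). In other words, the
choice is between one or more attributes followed by optional
whitespace, or optional whitespace alone.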
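
The two branches can be checked by building the regex from the question out of named pieces. This is a rough sketch in Python (my choice of language, and the names VALUE, ATTRS, NO_ATTRS and TAG are made up for illustration), not a recommendation to parse HTML this way.

```python
import re

# The regex exactly as posted in the question.
ORIGINAL = r"""</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>"""

VALUE = r"""(?:".*?"|'.*?'|[^'">\s]+)"""
# Left branch: one or more attributes, then optional whitespace.
ATTRS = r"(\s+\w+(\s*=\s*" + VALUE + r")?)+\s*"
# Right branch: no attributes, only optional whitespace.
NO_ATTRS = r"\s*"

pattern = r"</?\w+(" + ATTRS + "|" + NO_ATTRS + r")/?>"
assert pattern == ORIGINAL  # same regex, just split at the top-level |

TAG = re.compile(pattern)

# Group 1 is the outer bracket, so it shows which text the alternation consumed.
for tag in ['<a href="x.html" class=nav>', "<input disabled>", "<br />", "</p>"]:
    match = TAG.fullmatch(tag)
    print(repr(tag), "->", repr(match.group(1)))
```

For the first two tags group 1 holds the attribute text (left branch); for `<br />` and `</p>` it holds only a space or the empty string (right branch).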