Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Regexp non-greedy quantifiers issue

13 views
Skip to first unread message

GGG

unread,
May 26, 2003, 5:08:00 AM5/26/03
to
yes, I RTFM.
no, i dont understand why the following two regexp's return the same
result. anyone does?

% set tcl_patchLevel
8.3.4

% regexp -inline ((?:<p>)*)(.+)<p> {<p><p><p>aaaa<p>bbb<p>}
<p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb

% regexp -inline ((?:<p>)*)(.+?)<p> {<p><p><p>aaaa<p>bbb<p>}
<p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb

Ulrich Schoebel

unread,
May 26, 2003, 6:49:30 AM5/26/03
to
Hi GGG,

add .* to your RE and you'll see the difference:


% regexp -inline ((?:<p>)*)(.+)<p>.* {<p><p><p>aaaa<p>bbb<p>}


<p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb

% regexp -inline ((?:<p>)*)(.+?)<p>.* {<p><p><p>aaaa<p>bbb<p>}


<p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa

Best regards

Ulrich


--
SIGOS Systemintegration GmbH
- TESTING IS OUR COMPETENCE -
Fon +49 911 95168-0
www.sigos.de

Egil Støren

unread,
May 26, 2003, 7:52:16 AM5/26/03
to
Ulrich Schoebel wrote:

>
> Hi GGG,
>
> add .* to your RE and you'll see the difference:
>
>
> % regexp -inline ((?:<p>)*)(.+)<p>.* {<p><p><p>aaaa<p>bbb<p>}
> <p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb
> % regexp -inline ((?:<p>)*)(.+?)<p>.* {<p><p><p>aaaa<p>bbb<p>}
> <p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa

Why then, does this work:

% regexp -inline (<p><p><p>)(.+?)<p> {<p><p><p>aaaa<p>bbb<p>}
<p><p><p>aaaa<p> <p><p><p> aaaa

???

Egil Støren

Glenn Jackman

unread,
May 26, 2003, 9:17:18 AM5/26/03
to
Egil Støren <egil....@smpeatm.no> wrote:

> Ulrich Schoebel wrote:
> > % regexp -inline ((?:<p>)*)(.+)<p>.* {<p><p><p>aaaa<p>bbb<p>}
> > <p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb
> > % regexp -inline ((?:<p>)*)(.+?)<p>.* {<p><p><p>aaaa<p>bbb<p>}
> > <p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa
>
> Why then, does this work:
>
> % regexp -inline (<p><p><p>)(.+?)<p> {<p><p><p>aaaa<p>bbb<p>}
> <p><p><p>aaaa<p> <p><p><p> aaaa

http://groups.google.com/groups?threadm=slrnbc25f9.ai0.xx087%40freenet10.carleton.ca

--
Glenn Jackman
NCF Sysadmin
gle...@ncf.ca

Ulrich Schoebel

unread,
May 26, 2003, 9:14:16 AM5/26/03
to
Good question.

EOW (End Of my Wisdom) ;-)
Maybe some regex guru on this group can elaborate??

Michael Schlenker

unread,
May 26, 2003, 9:35:23 AM5/26/03
to
Ulrich Schoebel wrote:
> Egil Støren wrote:
>
>> Ulrich Schoebel wrote:
>>
>>>
>>> Hi GGG,
>>>
>>> add .* to your RE and you'll see the difference:
>>>
>>>
>>> % regexp -inline ((?:<p>)*)(.+)<p>.* {<p><p><p>aaaa<p>bbb<p>}
>>> <p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb
>>> % regexp -inline ((?:<p>)*)(.+?)<p>.* {<p><p><p>aaaa<p>bbb<p>}
>>> <p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa
>>
>>
>>
>> Why then, does this work:
>>
>> % regexp -inline (<p><p><p>)(.+?)<p> {<p><p><p>aaaa<p>bbb<p>}
>> <p><p><p>aaaa<p> <p><p><p> aaaa
>>
Basic limitation of the regexp engine. Don't mix non-greedy with greedy
or the result will be surprising. If a greedy expression appears the
non-greedy qualifier does nothing...

(For more info about it google the group for "Henry Spencer" and
"greedy", should get you some posts.)

A link to it can also be found on this page...
http://mini.net/tcl/1345

Michael Schlenker

GGG

unread,
May 26, 2003, 9:33:54 AM5/26/03
to
Ulrich Schoebel <ulrich....@sigos.de> wrote in message news:<basrr6$30tbr$1...@ID-148741.news.dfncis.de>...
> GGG wrote:
> > ...

> > % regexp -inline ((?:<p>)*)(.+)<p> {<p><p><p>aaaa<p>bbb<p>}
> > <p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb
> >
> > % regexp -inline ((?:<p>)*)(.+?)<p> {<p><p><p>aaaa<p>bbb<p>}
> > <p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb
>...

> % regexp -inline ((?:<p>)*)(.+)<p>.* {<p><p><p>aaaa<p>bbb<p>}
> <p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb
> % regexp -inline ((?:<p>)*)(.+?)<p>.* {<p><p><p>aaaa<p>bbb<p>}
> <p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa
>...

Hey Ulrich

Thanks!

I'm still trying to undersand why it works like that... do you
understand this or did you just come up with the solution on a
trial-an-error basis (i know i did)

BTW, anyone knows how to change your display name in google|groups?
GGG is not very informative way to communicate with people.

rgds
Nir.

Ulrich Schoebel

unread,
May 26, 2003, 9:43:28 AM5/26/03
to
GGG wrote:
> Ulrich Schoebel <ulrich....@sigos.de> wrote in message news:<basrr6$30tbr$1...@ID-148741.news.dfncis.de>...
>
>>GGG wrote:
>>
>>>...
>>>% regexp -inline ((?:<p>)*)(.+)<p> {<p><p><p>aaaa<p>bbb<p>}
>>><p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb
>>>
>>>% regexp -inline ((?:<p>)*)(.+?)<p> {<p><p><p>aaaa<p>bbb<p>}
>>><p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb
>>
>>...
>>% regexp -inline ((?:<p>)*)(.+)<p>.* {<p><p><p>aaaa<p>bbb<p>}
>><p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb
>>% regexp -inline ((?:<p>)*)(.+?)<p>.* {<p><p><p>aaaa<p>bbb<p>}
>><p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa
>>...
>
>
> Hey Ulrich
>
> Thanks!
>
> I'm still trying to undersand why it works like that... do you
> understand this or did you just come up with the solution on a
> trial-an-error basis (i know i did)
I thought I did, but :-( it seems I still have to learn 8-)

>
> BTW, anyone knows how to change your display name in google|groups?
> GGG is not very informative way to communicate with people.
>
> rgds
> Nir.

Michael A. Cleverly

unread,
May 26, 2003, 10:38:55 AM5/26/03
to
On Mon, 26 May 2003, Ulrich Schoebel wrote:

> >> add .* to your RE and you'll see the difference:
> >>
> >> % regexp -inline ((?:<p>)*)(.+)<p>.* {<p><p><p>aaaa<p>bbb<p>}
> >> <p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb
> >> % regexp -inline ((?:<p>)*)(.+?)<p>.* {<p><p><p>aaaa<p>bbb<p>}
> >> <p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa
> >
> > Why then, does this work:
> >
> > % regexp -inline (<p><p><p>)(.+?)<p> {<p><p><p>aaaa<p>bbb<p>}
> > <p><p><p>aaaa<p> <p><p><p> aaaa
> >

> Good question.
>
> EOW (End Of my Wisdom) ;-)

regexp -inline (<p><p><p>)(.+?)<p> {<p><p><p>aaaa<p>bbb<p>}

only has non-greedy quantifiers, so there is no "wierd" interaction
between greedy & non-greedy quantifiers.

Here's a regexp that doesn't use any non-greedy quantifiers which also
allows a variable number of leading <p>'s:

% regexp -inline {((?:<p>)*)((?:(?!<p>).)+)<p>} {<p><p><p>aaaa<p>bbb<p>}


<p><p><p>aaaa<p> <p><p><p> aaaa

Michael

Nir Levy

unread,
May 27, 2003, 5:42:05 AM5/27/03
to
Michael Schlenker <sch...@uni-oldenburg.de> wrote in message news:<bat4vp$34fi0

>
> Basic limitation of the regexp engine. Don't mix non-greedy with greedy
> or the result will be surprising. If a greedy expression appears the
> non-greedy qualifier does nothing...
>
> (For more info about it google the group for "Henry Spencer" and
> "greedy", should get you some posts.)
>
> A link to it can also be found on this page...
> http://mini.net/tcl/1345
>


Thanks Michael & Michael ,Egil ,Glenn ,and Ulrich for your interest.

I have read almost everything in groups.google and the wiki about this
issue.

Your suggestion to use only greedy quantifiers is unfortunetly not
helpfull since my case is more complicated then i showed here (but i
did manage to work around it).

I do believe that this issue should be dealt with at the tcl core or
at least documented in the upcoming re_syntax man page. And, if what
you are saying is true ("If a greedy expression appears the
non-greedy qualifier does nothing") then maybe it should be considered
a bug in tcl's regexp (since it seems to be working fine with other
languages re engine)

<script language='JScript' >
//MS JScript
var x = '<p><p><p>aaaa<p>bbb<p>'
var greedy = /((?:<p>)*)(.+)<p>/i
var nongreedy = /((?:<p>)*)(.+?)<p>/i
alert(x.match(greedy))
alert(x.match(nongreedy))
</script>

Anyone from the Tcl core team wanna pick up the glove?

regds
/NL
(managed to change the display name from GGG at last!)

Bruce B Hartweg

unread,
May 27, 2003, 9:54:06 AM5/27/03
to

Nir Levy wrote:

> Michael Schlenker <sch...@uni-oldenburg.de> wrote in message news:<bat4vp$34fi0
> >
> > Basic limitation of the regexp engine. Don't mix non-greedy with greedy
> > or the result will be surprising. If a greedy expression appears the
> > non-greedy qualifier does nothing...
> >
> > (For more info about it google the group for "Henry Spencer" and
> > "greedy", should get you some posts.)
> >
> > A link to it can also be found on this page...
> > http://mini.net/tcl/1345
> >
>
> Thanks Michael & Michael ,Egil ,Glenn ,and Ulrich for your interest.
>
> I have read almost everything in groups.google and the wiki about this
> issue.
>
> Your suggestion to use only greedy quantifiers is unfortunetly not
> helpfull since my case is more complicated then i showed here (but i
> did manage to work around it).
>
> I do believe that this issue should be dealt with at the tcl core or
> at least documented in the upcoming re_syntax man page. And, if what
> you are saying is true ("If a greedy expression appears the
> non-greedy qualifier does nothing") then maybe it should be considered
> a bug in tcl's regexp (since it seems to be working fine with other
> languages re engine)
>

Different RE engines behave different.

The statement "If a greedy expression appears the non-greedy qualifier does nothing"

is untrue. It depends where the greedy and non-greedy exppresions occur.

Re-read the matching section of the re_syntax page,
http://www.tcl.tk/man/tcl8.4/TclCmd/re_syntax.htm#M72
it discusses how every RE has a preference (shortest or longest) it explains how
each
part of an RE has a pref, and how the entire RE takes it's pref from it's
components,
and it also mentions how you can force the entire RE to prefer shortest or longest.

Once the entire RE gets it's match, the the internal prefs are still used to divide
up the total match.

It is always tricky to mix preferences & get what you think you should
(unless you learn to think like the RE engine). You can usually get much
better results using all greedy expression & rely on proper anchoring and
lookaheads (positive or negative) to match only what you want.

Bruce


Nir Levy

unread,
May 28, 2003, 3:12:52 AM5/28/03
to
Bruce B Hartweg <Bruce_B...@raytheon.com> wrote in message news:<3ED36DFE...@raytheon.com>...

> Nir Levy wrote:
> >
> > I have read almost everything in groups.google and the wiki about this
> > issue.
> >
>
> Different RE engines behave different.
>
> The statement "If a greedy expression appears the non-greedy qualifier
> does nothing" is untrue. It depends where the greedy and non-greedy
> exppresions occur.
>
> Re-read the matching section of the re_syntax page,
> http://www.tcl.tk/man/tcl8.4/TclCmd/re_syntax.htm#M72
>
> [...trimmed..]

>
> It is always tricky to mix preferences & get what you think you should
> (unless you learn to think like the RE engine). You can usually get much
> better results using all greedy expression & rely on proper anchoring and
> lookaheads (positive or negative) to match only what you want.
>
> Bruce

Thanks Bruce,

Well, I do stand corrected here:
(from the man page) "A branch has the same preference as the first
quantified atom in it which has a preference. ... Note that the
quantifiers {1,1} and {1,1}? can be used to force longest and shortest
preference, respectively, on a subexpression or a whole RE."

I believe i might be starting to begin to maybe think i understand
what is going on here, but i am not sure :-)

I just used neg lookahead to accomplish what i wanted (as suggested by
many) and it worked like a charm (don't you just love lookaheads? and
to think that some people say they could do without it.. bah)

/NL

Donal K. Fellows

unread,
May 28, 2003, 6:32:25 AM5/28/03
to
GGG wrote:
> yes, I RTFM.
> no, i dont understand why the following two regexp's return the same
> result. anyone does?

Yes.

> % regexp -inline ((?:<p>)*)(.+)<p> {<p><p><p>aaaa<p>bbb<p>}

> % regexp -inline ((?:<p>)*)(.+?)<p> {<p><p><p>aaaa<p>bbb<p>}

Tcl's RE engine has two modes of operation: greedy and non-greedy. All REs are
one or the other. The matching mode is selected by whether the first quantifier
encountered is a greedy or non-greedy one. In your case, the first quantifier
is always the '*' attached to (?:<p>) and hence that flags everything as greedy.
The other greedy/non-greedy flags are ignored. While it would be nice to be
able to mix quantifiers, producing an automata-theoretic description of what is
going on in such a matching is very hard indeed. (What happens in Perl is best
described as a horrible hack.)

In this case, there is a way to fix the problem:

# Force non-greedy matching
regexp -inline ((?:<p>)*?)(.+)<p> {<p><p><p>aaaa<p>bbb<p>}
# => <p><p> {} <p>

# That stopped matching <p>s too early, so we need to combine non-greedy with
# some deep trickyness (negative lookahead assertion) to get the right thing
regexp -inline ((?:<p>)*?)(?!<p>)(.+)<p> {<p><p><p>aaaa<p>bbb<p>}
# => <p><p><p>aaaa<p> <p><p><p> aaaa

PLEASE NOTE! If you're serious about parsing HTML (or XML), don't use REs.
There's too many nasty hidden gotchas.

Donal.
--
Donal K. Fellows http://www.cs.man.ac.uk/~fellowsd/ donal....@man.ac.uk
-- If I teach this course thoroughly enough, none of you will attempt the exam
questions on this topic, and I shall consider this to be a complete success.
-- Arthur Norman

Nir Levy

unread,
May 29, 2003, 8:17:38 AM5/29/03
to
"Donal K. Fellows" <donal.k...@man.ac.uk> wrote in message news:<3ED49039...@man.ac.uk>...

>
> PLEASE NOTE! If you're serious about parsing HTML (or XML), don't use REs.
> There's too many nasty hidden gotchas.
>
> Donal.

I know. <p> was just an example. :-)

0 new messages