% set tcl_patchLevel
8.3.4
% regexp -inline ((?:<p>)*)(.+)<p> {<p><p><p>aaaa<p>bbb<p>}
<p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb
% regexp -inline ((?:<p>)*)(.+?)<p> {<p><p><p>aaaa<p>bbb<p>}
<p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb
Add .* to your RE and you'll see the difference:
% regexp -inline ((?:<p>)*)(.+)<p>.* {<p><p><p>aaaa<p>bbb<p>}
<p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa<p>bbb
% regexp -inline ((?:<p>)*)(.+?)<p>.* {<p><p><p>aaaa<p>bbb<p>}
<p><p><p>aaaa<p>bbb<p> <p><p><p> aaaa
Best regards
Ulrich
Why then, does this work:
% regexp -inline (<p><p><p>)(.+?)<p> {<p><p><p>aaaa<p>bbb<p>}
<p><p><p>aaaa<p> <p><p><p> aaaa
???
Egil Støren
http://groups.google.com/groups?threadm=slrnbc25f9.ai0.xx087%40freenet10.carleton.ca
--
Glenn Jackman
NCF Sysadmin
gle...@ncf.ca
EOW (End Of my Wisdom) ;-)
Maybe some regex guru on this group can elaborate??
(For more info about it google the group for "Henry Spencer" and
"greedy", should get you some posts.)
A link to it can also be found on this page...
http://mini.net/tcl/1345
Michael Schlenker
Hey Ulrich
Thanks!
I'm still trying to understand why it works like that... do you
understand this, or did you just come up with the solution on a
trial-and-error basis (I know I did)?
BTW, does anyone know how to change your display name in google|groups?
GGG is not a very informative way to communicate with people.
rgds
Nir.
> > Why then, does this work:
> >
> > % regexp -inline (<p><p><p>)(.+?)<p> {<p><p><p>aaaa<p>bbb<p>}
> > <p><p><p>aaaa<p> <p><p><p> aaaa
> >
> Good question.
>
> EOW (End Of my Wisdom) ;-)
regexp -inline (<p><p><p>)(.+?)<p> {<p><p><p>aaaa<p>bbb<p>}
only has non-greedy quantifiers, so there is no "weird" interaction
between greedy & non-greedy quantifiers.
Here's a regexp that doesn't use any non-greedy quantifiers which also
allows a variable number of leading <p>'s:
% regexp -inline {((?:<p>)*)((?:(?!<p>).)+)<p>} {<p><p><p>aaaa<p>bbb<p>}
<p><p><p>aaaa<p> <p><p><p> aaaa
Michael
Thanks Michael & Michael, Egil, Glenn, and Ulrich for your interest.
I have read almost everything in groups.google and the wiki about this
issue.
Your suggestion to use only greedy quantifiers is unfortunately not
helpful, since my case is more complicated than I showed here (but I
did manage to work around it).
I do believe that this issue should be dealt with in the Tcl core or
at least documented in the upcoming re_syntax man page. And, if what
you are saying is true ("If a greedy expression appears the
non-greedy qualifier does nothing") then maybe it should be considered
a bug in Tcl's regexp (since it seems to work fine with other
languages' RE engines).
<script language='JScript' >
//MS JScript
var x = '<p><p><p>aaaa<p>bbb<p>'
var greedy = /((?:<p>)*)(.+)<p>/i
var nongreedy = /((?:<p>)*)(.+?)<p>/i
alert(x.match(greedy))
alert(x.match(nongreedy))
</script>
Anyone from the Tcl core team wanna pick up the glove?
regds
/NL
(managed to change the display name from GGG at last!)
Nir Levy wrote:
> Michael Schlenker <sch...@uni-oldenburg.de> wrote in message news:<bat4vp$34fi0
> >
> > Basic limitation of the regexp engine. Don't mix non-greedy with greedy
> > or the result will be surprising. If a greedy expression appears the
> > non-greedy qualifier does nothing...
> >
> > (For more info about it google the group for "Henry Spencer" and
> > "greedy", should get you some posts.)
> >
> > A link to it can also be found on this page...
> > http://mini.net/tcl/1345
> >
Different RE engines behave differently.
The statement "If a greedy expression appears the non-greedy qualifier does nothing"
is untrue. It depends on where the greedy and non-greedy expressions occur.
Re-read the matching section of the re_syntax page,
http://www.tcl.tk/man/tcl8.4/TclCmd/re_syntax.htm#M72
It discusses how every RE has a preference (shortest or longest), explains how
each part of an RE has a preference and how the entire RE takes its preference
from its components, and also mentions how you can force the entire RE to
prefer shortest or longest.
Once the entire RE gets its match, the internal preferences are still used to
divide up the total match.
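A small sketch of that last point, reusing the two REs with a trailing .* from earlier in the thread (the variable names are only for illustration). Both REs begin with the greedy (?:<p>)*, so both prefer the longest total match and consume the whole string; only the way that match is divided between the subexpressions differs. The values in the comments are the ones reported above for Tcl 8.3.4.

```tcl
set s {<p><p><p>aaaa<p>bbb<p>}

# Same total match in both cases; the preference of the second
# subexpression only decides how much of it that subexpression gets.
set greedy    [regexp -inline {((?:<p>)*)(.+)<p>.*}  $s]
set nongreedy [regexp -inline {((?:<p>)*)(.+?)<p>.*} $s]

puts [lindex $greedy 2]     ;# aaaa<p>bbb
puts [lindex $nongreedy 2]  ;# aaaa
```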
It is always tricky to mix preferences and get what you think you should
(unless you learn to think like the RE engine). You can usually get much
better results by using all-greedy expressions and relying on proper anchoring
and lookaheads (positive or negative) to match only what you want.
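As a rough illustration of the all-greedy approach, here is the lookahead RE Michael posted earlier in the thread wrapped in a helper. The proc name firstChunk and its return format are made up for this sketch; only the RE itself comes from the thread.

```tcl
# Hypothetical helper: return the run of leading <p> tags and the first
# chunk of text that follows them, or an empty list if nothing matches.
proc firstChunk {html} {
    # All quantifiers are greedy; the negative lookahead (?!<p>) stops the
    # text subexpression from running across the next <p>.
    set re {((?:<p>)*)((?:(?!<p>).)+)<p>}
    if {[regexp -- $re $html -> tags text]} {
        return [list $tags $text]
    }
    return {}
}

puts [firstChunk {<p><p><p>aaaa<p>bbb<p>}]  ;# <p><p><p> aaaa
```

Because no preferences are mixed, the result does not depend on which quantifier happens to come first in the RE.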
Bruce
Thanks Bruce,
Well, I do stand corrected here:
(from the man page) "A branch has the same preference as the first
quantified atom in it which has a preference. ... Note that the
quantifiers {1,1} and {1,1}? can be used to force longest and shortest
preference, respectively, on a subexpression or a whole RE."
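One way to experiment with the {1,1} and {1,1}? trick from that passage is to wrap the same RE body in a non-capturing group and attach each quantifier in turn. This is only a sketch for trying it out: the variable names are invented here, and no particular output is claimed, so run it against your own Tcl version to see how the matches come out.

```tcl
set s {<p><p><p>aaaa<p>bbb<p>}
set body {((?:<p>)*)(.+?)<p>}

# Per the man page text quoted above, {1,1} should force a longest
# preference on the wrapped RE and {1,1}? a shortest preference.
set longest  "(?:$body){1,1}"
set shortest "(?:$body){1,1}?"

regexp -- $longest  $s mLong
regexp -- $shortest $s mShort

puts "longest-preferring total match:  $mLong"
puts "shortest-preferring total match: $mShort"
```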
I believe I might be starting to begin to maybe think I understand
what is going on here, but I am not sure :-)
I just used negative lookahead to accomplish what I wanted (as suggested by
many) and it worked like a charm (don't you just love lookaheads? and
to think that some people say they could do without them.. bah)
/NL
Yes.
> % regexp -inline ((?:<p>)*)(.+)<p> {<p><p><p>aaaa<p>bbb<p>}
> % regexp -inline ((?:<p>)*)(.+?)<p> {<p><p><p>aaaa<p>bbb<p>}
Tcl's RE engine has two modes of operation: greedy and non-greedy. All REs are
one or the other. The matching mode is selected by whether the first quantifier
encountered is a greedy or non-greedy one. In your case, the first quantifier
is always the '*' attached to (?:<p>) and hence that flags everything as greedy.
The other greedy/non-greedy flags are ignored. While it would be nice to be
able to mix quantifiers, producing an automata-theoretic description of what is
going on in such a matching is very hard indeed. (What happens in Perl is best
described as a horrible hack.)
In this case, there is a way to fix the problem:
# Force non-greedy matching
regexp -inline ((?:<p>)*?)(.+)<p> {<p><p><p>aaaa<p>bbb<p>}
# => <p><p> {} <p>
# That stopped matching <p>s too early, so we need to combine non-greedy with
# some deep trickiness (negative lookahead assertion) to get the right thing
regexp -inline ((?:<p>)*?)(?!<p>)(.+)<p> {<p><p><p>aaaa<p>bbb<p>}
# => <p><p><p>aaaa<p> <p><p><p> aaaa
PLEASE NOTE! If you're serious about parsing HTML (or XML), don't use REs.
There are too many nasty hidden gotchas.
Donal.
I know. <p> was just an example. :-)