Regular Expression Question - non greedy match

John Moreno

unread,

Dec 20, 1997, 3:00:00 AM12/20/97

to

In comp.lang.perl.misc Eli the Bearded <#@qz.to> wrote:

> John Moreno <phe...@interpath.com> wrote:
> > Honza Pazdziora <ade...@fi.muni.cz> wrote:
> > > phe...@interpath.com (John Moreno) writes:
> > > > $count = s/(#! rnews )(\d+\n)(.+?)/$1.(length $3).("\n").$3/gse;
> > > > Which gives 1 char for the (length $3), I would have thought that it
> > > > would go until it hit the next #! rnews.
> > > Why should it? You said .+? which means at least one any character and
> > > the least possible count. Which gives us nice 1.
> > I understand what it was doing, I'm just not sure why the ? wasn't more
> > greedy - I would have thought that when using the /g that a ? at the end
>
> The DWIM operator has not yet become part of the standard perl regular
> expression functionality. In the meantime, we have normal greedy repetion
> operators such as + and *, and we have non-greedy versions, such as +?
> and *?. The greedy ones always try to grab as much as possible, the non-
> greedy ones always try to grab as little as possible.

Well, I didn't expect it use the DWIM operator, I guess what I didn't
understand clearly was that the ? is evaluated in relationship to the
whole expression not just it's little part of it. Which means that if
it did what I thought it would do (use the ? with the next value being
considered the beginning of the regex) it would have problems with
infinite recursions.

Instead it has the more generally useful lookahead functions - they just
didn't occur to me until after I'd already completed the project.

> It is my opinion that non-greedy repetition is very subtle and tends to
> expose a large number of programmer bugs. I almost always recommend doing
> without it.

Well it isn't normally my first thought either - not that I have enough
experience for that to mean anything.

--
John Moreno

Eli the Bearded

unread,

Dec 21, 1997, 3:00:00 AM12/21/97

to

John Moreno <phe...@interpath.com> wrote:
> Honza Pazdziora <ade...@fi.muni.cz> wrote:
> > phe...@interpath.com (John Moreno) writes:
> > > $count = s/(#! rnews )(\d+\n)(.+?)/$1.(length $3).("\n").$3/gse;
> > > Which gives 1 char for the (length $3), I would have thought that it
> > > would go until it hit the next #! rnews.
> > Why should it? You said .+? which means at least one any character and
> > the least possible count. Which gives us nice 1.
> I understand what it was doing, I'm just not sure why the ? wasn't more
> greedy - I would have thought that when using the /g that a ? at the end

The DWIM operator has not yet become part of the standard perl regular
expression functionality. In the meantime, we have normal greedy repetion
operators such as + and *, and we have non-greedy versions, such as +?
and *?. The greedy ones always try to grab as much as possible, the non-
greedy ones always try to grab as little as possible.

It is my opinion that non-greedy repetition is very subtle and tends to

expose a large number of programmer bugs. I almost always recommend doing
without it.

I'd do something like this myself:

$r='#! rnews ';

$count = s/$r\d+\n((?:(?!$r)#|[^#]+)+)/$r.(length $1)."\n$1"/gse;

Or maybe even this (for added security against false positives from
#! rnews appearing somewhere besides the start of a line):

$r='#! rnews ';

$count = s/^$r\d+\n((?:(?!\n$r)\n|.+)+)/$r.(length $1)."\n$1"/gem;

Elijah
------
#!/usr/bin/perl -- -*- my ny.pm sig -*-
$_=$^ ;s;s;sss;;s^.^ju^&&s&P&,\n&&&(s(_..)(ers)||s|^|^^|)&&s(T)(q(st%eg))eg;
s<.(o).><$& new 1$$>i+s+\dst.+$a--||reverse(q(rep k))+ge;s*%.+u* so+*i;s=\++
="me"=mex&&s%ege%l$"hke%;$a||s/^\S+ /\/\//;s;\d+;yor;;s[KE]<ac$&>i;print $_;

Ilya Zakharevich

unread,

Dec 21, 1997, 3:00:00 AM12/21/97

to

In article <eli$97122...@qz.little-neck.ny.us>,

Eli the Bearded <#@qz.to> wrote:

> > > > $count = s/(#! rnews )(\d+\n)(.+?)/$1.(length $3).("\n").$3/gse;
> > > > Which gives 1 char for the (length $3), I would have thought that it
> > > > would go until it hit the next #! rnews.
> > > Why should it? You said .+? which means at least one any character and
> > > the least possible count. Which gives us nice 1.
> > I understand what it was doing, I'm just not sure why the ? wasn't more
> > greedy - I would have thought that when using the /g that a ? at the end

> It is my opinion that non-greedy repetition is very subtle and tends to

> expose a large number of programmer bugs. I almost always recommend doing
> without it.
>
> I'd do something like this myself:
>
> $r='#! rnews ';
>
> $count = s/$r\d+\n((?:(?!$r)#|[^#]+)+)/$r.(length $1)."\n$1"/gse;

How is it less error-prone than
/$r(\d+\n)(.+?)(?=$r|$)/
?

Ilya

Karen Lindsey

unread,

Dec 21, 1997, 3:00:00 AM12/21/97

to

Eli the Bearded <#@qz.to> wrote:

[snip]

> It is my opinion that non-greedy repetition is very subtle and tends to
> expose a large number of programmer bugs. I almost always recommend doing

> without it. ^^^^^^^^^^^^^^^
[snip]

> Or maybe even this (for added security against false positives from
> #! rnews appearing somewhere besides the start of a line):

^^^^^^^^
[snip]

This really made me laugh - it broke my news unbatching program... because
it was at the start of a line...

Ho ho ho.

- KSL. -
(Well, I did write it myself after all).

Eli the Bearded

unread,

Dec 22, 1997, 3:00:00 AM12/22/97

to

Ilya Zakharevich <il...@math.ohio-state.edu> wrote:
> Eli the Bearded <#@qz.to> wrote:
> > It is my opinion that non-greedy repetition is very subtle and tends to
> > expose a large number of programmer bugs. I almost always recommend doing
> > without it.

> > $r='#! rnews ';
> > $count = s/$r\d+\n((?:(?!$r)#|[^#]+)+)/$r.(length $1)."\n$1"/gse;
> How is it less error-prone than
> /$r(\d+\n)(.+?)(?=$r|$)/
> ?

Well, other than your just adding a new set of parens that throw off
the replacement count, I'd use \Z instead of $ to ensure the right
anchoring if /m were ever added to the flags.

I know your point was that "Here is a case were non-greedy regexps
are not harmful, and in fact simplify things considerably." I know.
There is very strong deliberate anchoring around the part that needed
to be captured, which makes non-greedy matching safe. My point was
to show that it could be solved without needing non-greedy matching.

Elijah
------
will probably always be getting in Ilya's hair over this issue

Ilya Zakharevich

unread,

Dec 22, 1997, 3:00:00 AM12/22/97

to

In article <eli$97122...@qz.little-neck.ny.us>,

Eli the Bearded <#@qz.to> wrote:
> Ilya Zakharevich <il...@math.ohio-state.edu> wrote:
> > Eli the Bearded <#@qz.to> wrote:
> > > $r='#! rnews ';
> > > $count = s/$r\d+\n((?:(?!$r)#|[^#]+)+)/$r.(length $1)."\n$1"/gse;
> > How is it less error-prone than
> > /$r(\d+\n)(.+?)(?=$r|$)/
> > ?
>
> Well, other than your just adding a new set of parens that throw off
> the replacement count,

There is no new set of parens.

> I'd use \Z instead of $ to ensure the right
> anchoring if /m were ever added to the flags.

\Z is not much better than $. I would use \z, but since 5.005 will
not have it ;-), one is forced to use these abomination \Z(?!\n).

> I know your point was that "Here is a case were non-greedy regexps
> are not harmful, and in fact simplify things considerably." I know.
> There is very strong deliberate anchoring around the part that needed
> to be captured, which makes non-greedy matching safe. My point was
> to show that it could be solved without needing non-greedy matching.

The problem is that few people will be able to read your re, so check
that "it could be solved without needing non-greedy matching". I
could not.

Re's are unreadable from the start. One should not deliberately add
to this, unless necessary.

Ilya

P.S. Interested people may look in ./t/op/pat.t of >= 5.004_55 to see
the beast which matches "matching parens" (let's look... it is near
matchit). The only reason I wrote it is to check new (?{...}) engine,
this is not what I would like to have in production code (though it is
_almost_ readable).

P.P.S. Well, to decrease quoted/contents ratio, I may just include it
here:

m'
(
$
(?{ $c = 1 }) # Initialize
(?:
(?(?{ $c == 0 }) # PREVIOUS iteration was OK, stop the loop
(?!
) # Fail: will unwind one iteration back
)
(?:
[^()]+ # Match a big chunk
(?=
[()]
) # Do not try to match subchunks
|
\(
(?{ ++$c })
|
$
(?{ --$c })
)
)+ # This may not match with different subblocks
)
(?(?{ $c != 0 })
(?!
) # Fail
) # Otherwise the chunk 1 may succeed with $c>0
'xg;