Bug in Perl regex engine?

2 views
Skip to first unread message

Edi Weitz

unread,
Sep 11, 2002, 3:17:12 PM9/11/02
to
Sorry if this is an old hat but I didn't find it in the FAQ.

It is my understanding that this code

edi@dyn164:~ > perl -e '$_="CD"; /(A|B)*?CD/; print "*$1*\n" if defined $1'
**
edi@dyn164:~ >

should behave differently. Isn't the group referenced by $1 one that
can only match the character A or the character B, so _if_ $1 is
defined it cannot be the empty string as in this case? In other words:
Perl should have printed '*A*', '*B', or nothing at all. Given the
fact that the whole string matched it thus should print nothing.

I've seen this behaviour with 5.6.1, 5.6.0, and with older versions,
so I guess this is not considered a bug. So, what's the part which I
don't understand?

Thanks,
Edi.

brian d foy

unread,
Sep 11, 2002, 3:27:57 PM9/11/02
to
In article <87u1kw8...@dyn164.dbdmedia.de>, Edi Weitz <e...@agharta.de> wrote:

> edi@dyn164:~ > perl -e '$_="CD"; /(A|B)*?CD/; print "*$1*\n" if defined $1'
> **

> I've seen this behaviour with 5.6.1, 5.6.0, and with older versions,


> so I guess this is not considered a bug. So, what's the part which I
> don't understand?

$1 is the empty string, which is the result of the match in the first
set of parens.

the first set of parens says "match A or B zero or more times, non-greedily".
it matched zero times, so $1 is the empty string (which is what it matched).

and, since you realized it's not a bug in the regex engine, why say
so in the subject line?

--
brian d foy <com...@panix.com> - Perl services for hire
The Perl Review - a new magazine devoted to Perl
<http://www.theperlreview.com>

Jeff 'japhy' Pinyan

unread,
Sep 11, 2002, 3:37:18 PM9/11/02
to
On Wed, 11 Sep 2002, brian d foy wrote:

>In article <87u1kw8...@dyn164.dbdmedia.de>, Edi Weitz <e...@agharta.de> wrote:
>
>> edi@dyn164:~ > perl -e '$_="CD"; /(A|B)*?CD/; print "*$1*\n" if defined $1'
>> **
>
>> I've seen this behaviour with 5.6.1, 5.6.0, and with older versions,
>> so I guess this is not considered a bug. So, what's the part which I
>> don't understand?
>
>$1 is the empty string, which is the result of the match in the first
>set of parens.

$1 can only be the empty string if the CONTENTS of the first parenthetical
group match the empty string.

"X" =~ /(Y)?X/

yields $1 being undef, even though the match succeeds. However, it seems
that the non-greediness fools the regex engine:

"X" =~ /(Y)??X/

yields $1 being "".

>the first set of parens says "match A or B zero or more times, non-greedily".
>it matched zero times, so $1 is the empty string (which is what it matched).

I think this is a bug.

--
Jeff "japhy" Pinyan RPI Acacia Brother #734 2002 Acacia Senior Dean
"And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)

Joe Schaefer

unread,
Sep 11, 2002, 3:50:20 PM9/11/02
to
Jeff 'japhy' Pinyan <pin...@rpi.edu> writes:

[...]

> $1 can only be the empty string if the CONTENTS of the first
> parenthetical group match the empty string.
>
> "X" =~ /(Y)?X/
>
> yields $1 being undef, even though the match succeeds. However, it
> seems that the non-greediness fools the regex engine:
>
> "X" =~ /(Y)??X/
>
> yields $1 being "".
>
> >the first set of parens says "match A or B zero or more times,
> >non-greedily". it matched zero times, so $1 is the empty string
> >(which is what it matched).
>
> I think this is a bug.

Me too- this looks related to a recent bug report on p5p:

http://archive.develooper.com/perl5-...@perl.org/msg85782.html

I wonder if that patch takes care of this problem also. Anybody
running bleadperl care to test it?

--
Joe Schaefer "Rule 2. Measure. Don't tune for speed until you've measured,
and even then don't unless one part of the code overwhelms the
rest."
-- Rob Pike

Edi Weitz

unread,
Sep 11, 2002, 4:46:14 PM9/11/02
to
brian d foy <com...@panix.com> writes:

> In article <87u1kw8...@dyn164.dbdmedia.de>, Edi Weitz <e...@agharta.de> wrote:
>
> > edi@dyn164:~ > perl -e '$_="CD"; /(A|B)*?CD/; print "*$1*\n" if defined $1'
> > **
>
> > I've seen this behaviour with 5.6.1, 5.6.0, and with older versions,
> > so I guess this is not considered a bug. So, what's the part which I
> > don't understand?
>
> $1 is the empty string, which is the result of the match in the first
> set of parens.
>
> the first set of parens says "match A or B zero or more times, non-greedily".
> it matched zero times, so $1 is the empty string (which is what it matched).

No, I think what you're talking about is

edi@bird:~ > perl -e '$_="CD"; /((?:A|B)*?)CD/; print "*$1*\n" if defined $1'
**
edi@bird:~ >

in which case it would of course be correct for $1 to be the empty
string. Note that these two cases are different - the regular
expression /(?:A|B)*?/ _can_ match the empty string, /A|B/ can't.

> and, since you realized it's not a bug in the regex engine,

I didn't "realize" that. I just said that I found this behaviour in
different Perl implementation and that "I guess this is not considered
a bug" meaning I might have overlooked a paragraph in the docs where
this is explained as an exception to the usual rules.

I personally still think this is a bug. Emacs does too:

1. Perl 5.6.1:

edi@bird:~ > perl -e '$_="CD"; print "yes\n" if /(A|B)*?CD/;'
yes
edi@bird:~ > perl -e '$_="CD"; print "yes\n" if /(A|B)*?CD\1/;'
yes
edi@bird:~ >

2. GNU Emacs 21.2:

(string-match "\\(A\\|B\\)*?CD" "CD")
0
(string-match "\\(A\\|B\\)*?CD\\1" "CD")
nil

> why say so in the subject line?

Maybe your news reader wasn't able to render the question mark I typed
in the subject line...

Cheers,
Edi.

Uri Guttman

unread,
Sep 11, 2002, 5:59:19 PM9/11/02
to
>>>>> "J'P" == Jeff 'japhy' Pinyan <pin...@rpi.edu> writes:

>>> edi@dyn164:~ > perl -e '$_="CD"; /(A|B)*?CD/; print "*$1*\n" if
>>> defined $1'

J'P> $1 can only be the empty string if the CONTENTS of the first
J'P> parenthetical group match the empty string.

but there is a well known issue with grabs and modifier counts. you only
see the value of the last grab and you can't trust the grab numbers to
mean anything.

J'P> "X" =~ /(Y)?X/

J'P> yields $1 being undef, even though the match succeeds. However,
J'P> it seems that the non-greediness fools the regex engine:

J'P> "X" =~ /(Y)??X/

J'P> yields $1 being "".

>> the first set of parens says "match A or B zero or more times,
>> non-greedily". it matched zero times, so $1 is the empty string
>> (which is what it matched).

J'P> I think this is a bug.

maybe but

perl -e '"X" =~ /((Y)*X)/ && print "*$2*\n"' ;

prints ** and $1 is correctly 'X'

it isn't the greedy thing but a * modifier count on a grab. the last
thing grabbed is what is put into the $2 var and it has to be '' since
it the (Y) must fail. so i am not sure it is a bug but as i said above,
* modifiers and $1 and friends don't play well together. there is a
semantic problem bigger than just a possible code bug. which iteration
and value gets put into the var? since the last one is grabbed by
default, it is the '' string which is the only way the whole regex
matches.

so to the OP, don't do that! :)

perl6 will have ways to grab all count iterations of a grab in an array
so you will be able to do that.

uri

--
Uri Guttman ------ u...@stemsystems.com -------- http://www.stemsystems.com
----- Stem and Perl Development, Systems Architecture, Design and Coding ----
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org

Joe Schaefer

unread,
Sep 11, 2002, 6:27:27 PM9/11/02
to
Joe Schaefer <joe+u...@sunstarsys.com> writes:

[...]

> Me too- this looks related to a recent bug report on p5p:
>
> http://archive.develooper.com/perl5-...@perl.org/msg85782.html
>
> I wonder if that patch takes care of this problem also. Anybody
> running bleadperl care to test it?

Looks like it does:

% perl5.8.0 -we '$_="CD"; /(A|B)*?CD/; print "*$1*\n" if defined $1'
**
% cd perl-current
% ./perl -we '$_="CD"; /(A|B)*?CD/; print "*$1*\n" if defined $1'
%

--
Joe Schaefer "The important thing is not to stop questioning. Curiosity has
its own reason for existing."
--Albert Einstein

Edi Weitz

unread,
Sep 11, 2002, 7:21:19 PM9/11/02
to
Uri Guttman <u...@stemsystems.com> writes:

> >>>>> "J'P" == Jeff 'japhy' Pinyan <pin...@rpi.edu> writes:
>
> >>> edi@dyn164:~ > perl -e '$_="CD"; /(A|B)*?CD/; print "*$1*\n" if
> >>> defined $1'
>
> J'P> $1 can only be the empty string if the CONTENTS of the first
> J'P> parenthetical group match the empty string.
>
> but there is a well known issue with grabs and modifier counts. you only
> see the value of the last grab and you can't trust the grab numbers to
> mean anything.

But that would be rather useless. In almost all other cases Perl
returns the values I'd expect.

The Camel Book (I have the 2nd edition here) is quite clear about the
subject:

"\1, \2, \3, and so on, are equivalent to whatever the corresponding
set of parentheses matched, counting opening parentheses from left
to right. (If the particular pair of parentheses had a quantifier
such as * after it, such that it matched a series of substrings,
then only the last match counts as the backreference.)"

I'd say this is a defined behaviour and far from "you can't trust it
to mean anything".

Also, how would you explain the following?

edi@bird:~ > perl -e '$_="CD"; /(A|B)*?CD/; print "*$1*\n" if defined $1'
**
edi@bird:~ > perl -e '$_="CD"; /(A|B)*CD/; print "*$1*\n" if defined $1'
edi@bird:~ >

Here the greedy and the non-greedy version clearly result in the same
match. I think $1 should be the same in both cases, too.

> [snip]


>
> J'P> I think this is a bug.
>
> maybe but
>
> perl -e '"X" =~ /((Y)*X)/ && print "*$2*\n"' ;
>
> prints ** and $1 is correctly 'X'

Yes, but I don't actually care whether $2 _prints_ as '', the question
is whether it is defined:

edi@bird:~ > perl -e '"X" =~ /((Y)*X)/ && print "yo\n" if defined $2'
edi@bird:~ >

This is correct IMHO, so it's not the same problem as above. $2 is
undefined and rightly so because it _cannot_ be the empty string - it
must be either 'Y' or nothing.

Judging from the fact that perl-current doesn't show the behaviour I
reported and that other regex implementations (see my Emacs example in
another posting in this thread) behave differently I think I will file
this as a bug in my book.

> so to the OP, don't do that! :)

Actually this wasn't a real-world example. I wrote my own regex engine
and tested it against the 600+ test cases that come with PCRE. I found
three cases where my engine and Perl differ. One of them is the regex
which started this thread, another one is similar. The third one I'll
report in another thread... :)

Cheers,
Edi.

Tina Mueller

unread,
Sep 12, 2002, 6:11:21 AM9/12/02
to
Uri Guttman <u...@stemsystems.com> wrote:

> J'P> I think this is a bug.

> maybe but

> perl -e '"X" =~ /((Y)*X)/ && print "*$2*\n"' ;

> prints ** and $1 is correctly 'X'

> it isn't the greedy thing but a * modifier count on a grab. the last
> thing grabbed is what is put into the $2 var and it has to be '' since
> it the (Y) must fail.

"you should always enable warnings"

=) SCNR

tina@tox:~> perl -we '"X" =~ /((Y)*X)/ && print "*$2*\n"'
Use of uninitialized value in concatenation (.) or string at -e line 1.
**

$2 is undefined.

regards, tina
--
http://www.tinita.de/ \ enter__| |__the___ _ _ ___
http://Movies.tinita.de/ \ / _` / _ \/ _ \ '_(_-< of
http://PerlQuotes.tinita.de/ \ \ _,_\ __/\ __/_| /__/ perception

Reply all
Reply to author
Forward
0 new messages