
Strange behavior of 'Alternative capture group numbering'


Raymundo

Jan 1, 2012, 11:59:44 AM
Hello,

First of all, I apologize for my English.

I'm reading "perlretut" (Perl Regular Expression Tutorial) of version
5.14 now:
http://perldoc.perl.org/perlretut.html

While I was reading "Alternative capture group numbering" section,
I wrote a simple test program to practice it myself.

I'm using Strawberry Perl 5.12.3 on Windows XP.

Here is my code:
-----
#!perl
use strict;
use warnings;

while (1) {
    my $input = <STDIN>;
    chomp $input;
    if ( $input =~ /(?|(a)(b)|(c))(d)/ ) {
        print "1[$1] 2[$2] 3[$3]\n";
    }
}
-----

Here is the result:
-----
abd
1[a] 2[b] 3[d]
cd
Use of uninitialized value $2 in concatenation (.) or string at d:\Temp\test.pl line 13, <STDIN> line 2.
1[c] 2[] 3[d]
----

Okay. This is what I expected and what the document said. 'd' is
assigned to $3 because the maximum number in the alternative numbering
group is 2.
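
The same check can be run without typing input by hand. This is only a minimal sketch: the helper names captures() and show() are made up for illustration and are not part of the original test program. It uses the match in list context to collect every group, and prints undefined groups as empty brackets so no warning is raised.

```perl
use strict;
use warnings;

# Return all capture groups of a successful match as an array ref,
# or undef when the pattern does not match at all.
sub captures {
    my ( $re, $str ) = @_;
    my @caps = $str =~ $re;
    return @caps ? \@caps : undef;
}

# Format the groups as "1[..] 2[..] 3[..]"; an undefined group
# (one that did not take part in the match) is shown as "[]".
sub show {
    my ($caps) = @_;
    return 'no match' unless $caps;
    return join ' ',
        map { sprintf '%d[%s]', $_ + 1, $caps->[$_] // '' } 0 .. $#$caps;
}

my $re = qr/(?|(a)(b)|(c))(d)/;
print "$_: ", show( captures( $re, $_ ) ), "\n" for qw(abd cd);
# abd: 1[a] 2[b] 3[d]
# cd: 1[c] 2[] 3[d]
```

Because the pattern is passed in as an argument, other orderings of the alternatives can be tried by changing only the qr// line.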

Then I modified the pattern, changing only the order of the two
alternatives in the alternative numbering group:
-----
if ( $input =~ /(?|(c)|(a)(b))(d)/ ) {
-----
This is the result:
-----
abd
Use of uninitialized value $3 in concatenation (.) or string at d:\Temp\test.pl line 13, <STDIN> line 1.
1[a] 2[d] 3[]
cd
Use of uninitialized value $3 in concatenation (.) or string at d:\Temp\test.pl line 13, <STDIN> line 2.
1[c] 2[d] 3[]
----

I have no idea why the result differs from the first one.
Why is 'd' in $2, not $3? Where did the 'b' of 'abd' go after matching?

Is this a bug? Or is there something that I misunderstand?

Any help would be appreciated.
Thank you.

s...@netherlands.com

Jan 1, 2012, 3:32:34 PM
It's probably not a bug if you had to program branch reset code,
because the whole thing is buggy and tends to crash at the drop of
a hat.

Using the regex debug mechanism, some observations can be made.
The last branch-reset alternative is labeled BRANCH (FAIL).
Apparently, the number of capture buffers in this branch is
NOT counted when calculating the largest number of buffers.
Therefore, the number given to the capture buffer after the
branch-reset follows the largest count among the branches BEFORE
the last branch.

Example:

(?|
(x) ()
|
(c)
|
(a) (b) (r)
)
(d)

Produces this code:

1: BRANCH (13)
2: OPEN1 (4)
4: EXACT <x> (6)
6: CLOSE1 (8)
8: OPEN2 (11)
10: NOTHING (11)
11: CLOSE2 (40)
13: BRANCH (20)
14: OPEN1 (16)
16: EXACT <c> (18)
18: CLOSE1 (40)
20: BRANCH (FAIL)
21: OPEN1 (23)
23: EXACT <a> (25)
25: CLOSE1 (27)
27: OPEN2 (29)
29: EXACT <b> (31)
31: CLOSE2 (33)
33: OPEN3 (35)
35: EXACT <r> (37)
37: CLOSE3 (40)
39: TAIL (40)
40: OPEN3 (42)
42: EXACT <d> (44)
44: CLOSE3 (46)
46: END (0)

You can see that (d) is capture buffer 3, but it should be 4.
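
A listing like the one above can be reproduced with the re pragma. This is a sketch that assumes the dump came from "use re 'debug'" (the post does not say exactly how it was produced); the helper group_count() is a made-up name. The compiled program and the match trace are written to STDERR, and the exact layout differs between perl versions.

```perl
use strict;
use warnings;
use re 'debug';    # dump the compiled regex program (and match trace) to STDERR

# Same pattern as in the example above, written with /x for readability.
my $re = qr/(?| (x) () | (c) | (a) (b) (r) ) (d)/x;

# Number of capture groups in the last successful match ($#+),
# or undef when the string does not match.
sub group_count {
    my ( $pattern, $str ) = @_;
    return $str =~ $pattern ? $#+ : undef;
}

my $count = group_count( $re, 'abrd' );
print "groups: ", ( defined $count ? $count : 'no match' ), "\n";
```

On a perl where the numbering is correct, the dump should show OPEN4 for (d) and the count should be 4; on an affected perl, (d) would be reported as group 3, as in the listing above.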

So the simple solution is that the largest number of capture buffers
should not be in the last branch.

There are a couple of ways around this.

1 - Pad a different branch with a NOTHING capture group.
(?|
(c) ()
| (a)(b)
)
(d)

or,

2 - Move the largest number of captures into another branch.
(?|
(a)(b)
| (c)
)
(d)
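
Both workarounds can be checked with a short script. This is a sketch only; the hash %workaround and the helper fmt() are illustrative names. Note one difference between the two: the padding group in workaround 1 captures an empty string, while in workaround 2 the unused group stays undefined.

```perl
use strict;
use warnings;

# Format every capture group of a match as "[..]", writing "undef"
# for groups that did not take part in the match.
sub fmt {
    my ( $re, $str ) = @_;
    my @caps = $str =~ $re;
    return 'no match' unless @caps;
    return join ' ', map { sprintf '[%s]', $_ // 'undef' } @caps;
}

my %workaround = (
    padded    => qr/(?| (c) () | (a)(b) ) (d)/x,    # workaround 1
    reordered => qr/(?| (a)(b) | (c) ) (d)/x,       # workaround 2
);

for my $name ( sort keys %workaround ) {
    print "$name $_: ", fmt( $workaround{$name}, $_ ), "\n" for qw(abd cd);
}
# padded abd: [a] [b] [d]
# padded cd: [c] [] [d]
# reordered abd: [a] [b] [d]
# reordered cd: [c] [undef] [d]
```

In both cases 'd' ends up in $3, because the branch with the most groups is no longer the only one that determines the count.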

This is just an observation that seems to hold true.
In my mind, branch-reset in Perl or any PCRE engine is just
one big bug, and should be avoided.

-sln

Ben Morrow

Jan 1, 2012, 5:16:43 PM

Quoth s...@netherlands.com:
> On Sun, 1 Jan 2012 08:59:44 -0800 (PST), Raymundo <gyp...@gmail.com> wrote:
>
> >Then I modified the pattern, changing only the order of the two
> >alternatives in the alternative numbering group:
> >-----
> > if ( $input =~ /(?|(c)|(a)(b))(d)/ ) {
> >-----
> >This is the result:
> >-----
> >abd
> >Use of uninitialized value $3 in concatenation (.) or string at d:\Temp\test.pl line 13, <STDIN> line 1.
> >1[a] 2[d] 3[]
> >cd
> >Use of uninitialized value $3 in concatenation (.) or string at d:\Temp\test.pl line 13, <STDIN> line 2.
> >1[c] 2[d] 3[]
> >----
> >
> >I have no idea why the result differs from the first one.
> >Why is 'd' in $2, not $3? Where did the 'b' of 'abd' go after matching?
> >
> >Is this a bug? Or is there something that I misunderstand?
>
> It's probably not a bug if you had to program branch reset code,
> because the whole thing is buggy and tends to crash at the drop of
> a hat.

It looks to me like a bug in perl, and it appears to have been fixed in
5.14.

If you have any other instances of (?|) causing problems (that persist
in 5.14), and certainly if you have any examples of crashes, you should
report them with perlbug.
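
To see whether a given perl is affected, something like the following could be run and its output included in a report. It is a minimal sketch; the helper d_group_number() is a made-up name. It reports which group number captured 'd' for the reordered pattern from the original question.

```perl
use strict;
use warnings;

# Return the number of the group that captured 'd' when the reordered
# pattern is matched against 'abd', or undef if none did.
sub d_group_number {
    my @caps = 'abd' =~ /(?|(c)|(a)(b))(d)/;
    for my $i ( 0 .. $#caps ) {
        return $i + 1 if defined $caps[$i] && $caps[$i] eq 'd';
    }
    return;
}

my $n = d_group_number();
printf "perl %vd: (d) is group %s\n", $^V, defined $n ? $n : 'unknown';
```

Group 3 is the documented result; group 2 would indicate the behaviour described in the original question.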

Ben

Raymundo

Jan 1, 2012, 6:57:21 PM
On Jan 2, 7:16 AM, Ben Morrow <b...@morrow.me.uk> wrote:
> Quoth s...@netherlands.com:
>
>
> It looks to me like a bug in perl, and it appears to have been fixed in
> 5.14.
>
> If you have any other instances of (?|) causing problems (that persist
> in 5.14), and certainly if you have any examples of crashes, you should
> report them with perlbug.
>
> Ben


Thank you, sln and Ben.

I also posted the same question on Twitter and received replies
saying that 5.14 shows the correct results. One of my followers sent
me this link:
http://perl5.git.perl.org/perl.git/commit/fd4be6f07df0e6a021290ef721c5d73550e0248c


Happy New Year~ :-)

G.Y.Park from South Korea