Efficiency of s///e?

Tim McDaniel

May 16, 2013, 11:36:48 PM

There's a sub in our code base which has something like

    my $prevent_infinite_loop = 0;
    while ($prevent_infinite_loop++ < 1000 && $text =~ /(complicated) (regular) (expression)/) {
        ... various calculations on $1, $2, $3, ... resulting in $replacement;
        $text =~ s/(complicated) (regular) (expression)/$replacement/;
        # that's exactly the same regular expression as above
    }

(though I think that, with the particular pattern, an infinite loop is
impossible.) To add new features and for maintainability,
I've developed

    $text =~ s{(simpler)}{
        my $found = $1;
        ... various calculations involving split on $found, unshift, ...
        $replacement;
    }eg;

I'm wondering about the efficiency of this approach, particularly
s{}{}e. For instance, is the right-hand side code compiled at run
time, or at compile time? Any other major concerns? We still use
Perl 5.8.8 for now, alas.

--
Tim McDaniel, tm...@panix.com

Charles DeRykus

May 17, 2013, 1:15:03 AM

It's syntax checked and compiled at compile time along with the rest of
your program.
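
One way to check this for yourself is sketched below. It assumes that a string eval compiles its whole argument before running any of it, so a syntax error in the replacement code should stop the snippet before its first statement executes. The variable names ($ran, $ok, $error) are made up for the illustration.

```perl
use strict;
use warnings;

my $ran = 0;

# String eval compiles the whole snippet before running any of it.
my $ok = eval q{
    $ran = 1;                    # would run first if compilation succeeded
    my $text = 'abc';
    $text =~ s{b}{ 1 + }e;       # deliberately broken replacement code
    1;
};
my $error = $@;

print "snippet ran: ", ($ran ? "yes" : "no"), "\n";
print "compile error reported: ", ($error ? "yes" : "no"), "\n";
```

If the replacement were only compiled when the substitution first ran, $ran would have been set before the error surfaced.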

The only gotcha IMO is the replacement morphing into a
long, hard-to-unravel mess that's hard-on-the-eyes and tough to debug.
Even commented, a big multi-line s/pattern/replacement/ becomes vertigo
inducing.

s{ ... }
{ $1 ... # blah
.... # more blah...
...
....
}gex;

At some point, a plain old "if" block seems better.

--
Charles DeRykus

Ben Morrow

May 17, 2013, 4:58:17 AM

Quoth Charles DeRykus <der...@gmail.com>:

> On 5/16/2013 8:36 PM, Tim McDaniel wrote:
> > There's a sub in our code base which has something like
> >
> > my $prevent_infinite_loop = 0;
> > while ($prevent_infinite_loop++ < 1000 && $text =~ /(complicated) (regular) (expression)/) {
> > ... various calculations on $1, $2, $3, ... resulting in $replacement;
> > $text =~ s/(complicated) (regular) (expression)/$replacement/;
> > # that's exactly the same regular expression as above
> > }
> >
> > (though I think that, with the particular pattern, an infinite loop is
> > impossible.) To add new features and for maintainability,
> > I've developed
> >
> > $text =~ s{(simpler)}{
> > my $found = $1;
> > ... various calculations involving split on $found, unshift, ...
> > $replacement;
> > }eg;

You can also use @+ and substr, as was discussed here a little while
ago:

    while (... and $text =~ /.../) {
        my ($start, $length) = ($-[0], $+[0] - $-[0]);

        # calculate $replacement

        substr $text, $start, $length, $replacement;
    }
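
A filled-in version of that loop might look like the sketch below. The sample text, the [[a|b]] pattern (borrowed loosely from the description later in the thread) and the rule "keep the first alternative" are assumptions made for the illustration, not the real code.

```perl
use strict;
use warnings;

my $text = 'pick [[red|blue]] and [[cat|dog]]';

my $guard = 0;    # same safety counter idea as in the original loop
while ($guard++ < 1000 && $text =~ /\[\[([\w ]*)\|([\w ]*)\]\]/) {
    # @- and @+ hold the start and end offsets of the last match.
    my ($start, $length) = ($-[0], $+[0] - $-[0]);

    # Stand-in for the real calculation: keep the first alternative.
    my $replacement = $1;

    # Four-argument substr replaces the matched span in place,
    # so the pattern is only run once per iteration.
    substr($text, $start, $length, $replacement);
}

print "$text\n";
```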

> > I'm wondering about the efficiency of this approach, particularly
> > s{}{}e. For instance, is the right-hand side code compiled at run
> > time, or at compile time? Any other major concerns? We still use
> > Perl 5.8.8 for now, alas.
> >
>
> It's syntax checked and compiled at compile time along with the rest of
> your program.
>
> The only gotcha IMO is the replacement morphing into a
> long, hard-to-unravel mess that's hard-on-the-eyes and tough to debug.
> Even commented, a big multi-line s/pattern/replacement/ becomes vertigo
> inducing.
>
> s{ ... }
> { $1 ... # blah
> .... # more blah...
> ...
> ....
> }gex;
>
> At some point, a plain old "if" block seems better.

A block like

s{...}{
# code
}gex;

isn't entirely different from an if: to some extent it's just a
brace-delimited block like any other. However, given that it does
actually have slightly strange parsing rules, I'd probably prefer to
move large amounts of code into a sub, either named or anonymous:

my $dorepl = sub { ... };
s/$pattern/$dorepl->()/ge;
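
As a concrete sketch of that shape: the pattern and the selection rule below are invented for the example, and the captures are passed in as arguments rather than read as $1 and $2 inside the sub, which is a choice made here to keep the sub self-contained.

```perl
use strict;
use warnings;

# Hypothetical pattern: two alternatives inside [[ ... | ... ]].
my $pattern = qr/\[\[([\w ]*)\|([\w ]*)\]\]/;

# All of the replacement logic lives in an ordinary sub.
my $dorepl = sub {
    my ($first, $second) = @_;
    return length $first ? $first : $second;
};

my $text = '[[red|blue]] [[|dog]]';
$text =~ s/$pattern/$dorepl->($1, $2)/ge;

print "$text\n";
```

The right-hand side is then a single expression, so the delimiter rules of s///e barely come into play.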

Ben

Tim McDaniel

May 17, 2013, 12:57:24 PM

In article <928h6a-...@anubis.morrow.me.uk>,

Ben Morrow <b...@morrow.me.uk> wrote:
>A block like
>
> s{...}{
> # code
> }gex;

You don't need the "x" to allow arbitrary formatting on the right-hand
side, only for the left, right?

>isn't entirely different from an if: to some extent it's just a
>brace-delimited block like any other. However, given that it does
>actually have slightly strange parsing rules,

Oh, please don't leave it like that, you code-tease! Is it because
the terminator, "}" here, can screw it up?
s{...}{
$blort .= "}";
}

would screw up by terminating at the apparently "inner" "}"?
(perlop, Gory details of parsing quoted constructs) Or anything else?

--
Tim McDaniel, tm...@panix.com

Ben Morrow

May 17, 2013, 3:35:30 PM

Quoth tm...@panix.com:

> In article <928h6a-...@anubis.morrow.me.uk>,
> Ben Morrow <b...@morrow.me.uk> wrote:
> >A block like
> >
> > s{...}{
> > # code
> > }gex;
>
> You don't need the "x" to allow arbitrary formatting on the right-hand
> side, only for the left, right?

I think with /e that's right, yes. That hadn't occurred to me... (I just
checked and you can include comments in the RHS with /e, which is what I
thought might not have worked.)

> >isn't entirely different from an if: to some extent it's just a
> >brace-delimited block like any other. However, given that it does
> >actually have slightly strange parsing rules,
>
> Oh, please don't leave it like that, you code-tease! Is it because
> the terminator, "}" here, can screw it up?
> s{...}{
> $blort .= "}";
> }
>
> would screw up by terminating at the apparently "inner" "}"?
> (perlop, Gory details of parsing quoted constructs) Or anything else?

The handling of the closing terminator is what I was referring to. The
first unbalanced, unescaped } will close the s{}{}e, and \} will be
converted to } before the read-this-as-Perl parser sees it. Fortunately
there are relatively few real situations where this matters, though it's
easy to create artificial situations with bizarre results, like

s{...}{ { "foo" \} # } }e

It's also worth noting that \\ is *not* converted to \, even though \\}
prevents the backslash from escaping the }. As I said, slightly strange
rules...
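
For the less bizarre cases, a small sketch of three ways a brace can appear in the replacement code without closing the s{}{}e early. The hash %label and the variable names are invented for the example.

```perl
use strict;
use warnings;

my %label = (k => 'v');

# 1. Balanced braces are fine, even when some of them sit inside strings.
my $balanced = 'x';
$balanced =~ s{x}{ '{' . $label{k} . '}' }e;

# 2. A lone closing brace can be written as \} so it is not taken
#    as the end of the replacement.
my $escaped = 'x';
$escaped =~ s{x}{ "a\}" }e;

# 3. Or avoid the character in the source altogether; chr(125) is "}"
#    in ASCII.
my $by_code = 'x';
$by_code =~ s{x}{ 'a' . chr(125) }e;

print "$balanced $escaped $by_code\n";
```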

Ben

Charles DeRykus

May 17, 2013, 4:09:32 PM

On 5/17/2013 9:57 AM, Tim McDaniel wrote:
> In article <928h6a-...@anubis.morrow.me.uk>,
> Ben Morrow <b...@morrow.me.uk> wrote:
>> A block like
>>
>> s{...}{
>> # code
>> }gex;
>
> You don't need the "x" to allow arbitrary formatting on the right-hand
> side, only for the left, right?

Yes. I meant to show comments on the left. See what I mean about vertigo...

>
>> isn't entirely different from an if: to some extent it's just a
>> brace-delimited block like any other. However, given that it does
>> actually have slightly strange parsing rules,
>
> Oh, please don't leave it like that, you code-tease! Is it because
> the terminator, "}" here, can screw it up?
> s{...}{
> $blort .= "}";
> }
>
> would screw up by terminating at the apparently "inner" "}"?
> (perlop, Gory details of parsing quoted constructs) Or anything else?

Yes, usually, you'd have to escape delimiters, "\}" or "\{"

Though, if you have a matching pair, you're ok:

{ $blort .= "{}"; }e

Yet, escaping one, but not both, there's a problem again:

{ $blort .= "\{}"; }e; # verboten!

--
Charles DeRykus

Eric Pozharski

May 18, 2013, 3:30:24 AM

Care to explain why you replaced $prevent_infinite_loop with an ellipsis?

#!/usr/bin/perl

use strict;
use warnings;

my( $aa, $ab ) = qw/ aaaa bbbb /;

my $prevent_infinite_loop;
while( ++$prevent_infinite_loop < 1000 && $aa =~ /aa/ ) {
    my( $start, $length ) = ( $-[0], $+[0] - $-[0] );
    my $replacement = 'a';
    substr $aa, $start, $length, $replacement;
    print "aa: $aa\n";
}

$ab =~ s{bb}{ print "ab (before): $ab\n"; 'b' }ge;
print "ab (now): $ab\n";

__END__

{2809:6} [0:0]% p.x foo.tI7BTR.pl
p.x:1: no such file or directory: ./Build
aa: aaa
aa: aa
aa: a
ab (before): bbbb
ab (before): bbbb
ab (now): bb

*CUT*

--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom

Ben Morrow

May 18, 2013, 9:26:49 AM

Quoth Eric Pozharski <why...@pozharski.name>:

> with <928h6a-...@anubis.morrow.me.uk> Ben Morrow wrote:
> >
> > Quoth Charles DeRykus <der...@gmail.com>:
> >> On 5/16/2013 8:36 PM, Tim McDaniel wrote:
> >> > There's a sub in our code base which has something like
> >> >
> >> > my $prevent_infinite_loop = 0;
> >> > while ($prevent_infinite_loop++ < 1000 && $text =~ /(complicated) (regular) (expression)/) {

[...]

> >
> > You can also use @+ and substr, as was discussed here a little while
> > ago:
> >
> > while (... and $text =~ /.../) {

[...]

>
> Care to explain why you replaced $prevent_infinite_loop with an ellipsis?

To avoid having to type '$prevent_infinite_loop++ < 1000'? No other
reason; the implication was intended to be 'this bit is the same as
before', just like the inside of the regex.

> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my( $aa, $ab ) = qw/ aaaa bbbb /;
>
> my $prevent_infinite_loop;
> while( ++$prevent_infinite_loop < 1000 && $aa =~ /aa/ ) {
> my( $start, $length ) = ( $-[0], $+[0] - $-[0] );
> my $replacement = 'a';
> substr $aa, $start, $length, $replacement;
> print "aa: $aa\n";
> }
>
> $ab =~ s{bb}{ print "ab (before): $ab\n"; 'b' }ge;
> print "ab (now): $ab\n";
>
> __END__
>
> {2809:6} [0:0]% p.x foo.tI7BTR.pl
> p.x:1: no such file or directory: ./Build
> aa: aaa
> aa: aa
> aa: a
> ab (before): bbbb
> ab (before): bbbb
> ab (now): bb

I'm not sure what this is supposed to be demonstrating...

Ben

szr

May 18, 2013, 6:51:11 PM

On 5/17/2013 12:35 PM, Ben Morrow wrote:
> The handling of the closing terminator is what I was referring to. The
> first unbalanced, unescaped } will close the s{}{}e, and \} will be
> converted to } before the read-this-as-Perl parser sees it. Fortunately
> there are relatively few real situations where this matters, though it's
> easy to create artificial situations with bizarre results, like
>
> s{...}{ { "foo" \} # } }e
>
> It's also worth noting that \\ is *not* converted to \, even though \\}
> prevents the backslash from escaping the }. As I said, slightly strange
> rules...

The same also seems to be true of eval( expr ), such as in:

eval qq{ { "foo" \} # } };

Remove the closing curly brace immediately after the # and you get a

Can't find string terminator "}" ...

error, not unlike what happens when the same is done in a substitution
like yours above.

Also, keep in mind, in the substitution, you can use other delimiters
besides { ... }, such as:

$s =~ s{ ... }< { "foo" } # } >e

In which case escaping that } that comes before the # actually causes an
error:

syntax error at line ..., near ""foo" \"

Although, the same is not completely true of qq<...>, as both

eval qq< { "foo" \} # } >;

and

eval qq< { "foo" } # } >;

yield no errors or warnings and return the string: foo

Not sure what exactly accounts for this difference though.

--
szr

Ben Morrow

May 18, 2013, 7:26:10 PM

Quoth szr <s...@sREzMOrVEoman.com>:

> On 5/17/2013 12:35 PM, Ben Morrow wrote:
> > The handling of the closing terminator is what I was referring to. The
> > first unbalanced, unescaped } will close the s{}{}e, and \} will be
> > converted to } before the read-this-as-Perl parser sees it. Fortunately
> > there are relatively few real situations where this matters, though it's
> > easy to create artificial situations with bizarre results, like
> >
> > s{...}{ { "foo" \} # } }e
> >
> > It's also worth noting that \\ is *not* converted to \, even though \\}
> > prevents the backslash from escaping the }. As I said, slightly strange
> > rules...
>
> The same also seems to be true of eval( expr ), such as in:
>
> eval qq{ { "foo" \} # } };
>
> Remove the closing curly brace immediately after the # and you get a
>
> Can't find string terminator "}" ...
>
> error, not unlike what happens when the same is done in a substitution
> like yours above.

Yep, and for much the same reason: the RHS of a s/// is basically a
qq//-quoted string. The most important difference is that, unlike eval,
s///e compiles its RHS at compile time, which I think accounts for the
differences you point out below.

> Also, keep in mind, in the substitution, you can use other delimiters
> besides { ... }, such as:
>
> $s =~ s{ ... }< { "foo" } # } >e

You can, but it's usually a bad idea with /e. < and > in particular
often appear alone, so the chance of accidentally closing the s///e is
much higher. Also, code blocks in Perl are usually brace-delimited, so
it's clearer if that applies to s///e as well.

> In which case escaping that } that comes before the # actually causes an
> error:
>
> syntax error at line ..., near ""foo" \"
>
> Although, the same is not completely true of qq<...>, as both
>
> eval qq< { "foo" \} # } >;
>
> and
>
> eval qq< { "foo" } # } >;
>
> yield no errors or warnings and return the string: foo
>
> Not sure what exactly accounts for this difference though.

toke.c :). The Perl string-parsing rules are *terrifying*; there's one
place in toke.c where the code takes a weighted average of probabilities
to guess whether {...} in a regexp is a hash-subscript or a {1,2}
construction. If it weren't for the fact that it actually seems to do
what you expect almost all the time it would make me thoroughly nervous.

In this case, I think the difference is that s///e sees the string at
compile time, as it was written in the source, and just applies an extra
rule about unescaping escaped closing delimiters. eval sees the string
at runtime, after it's been parsed by the qq// parser, by which point
the backslash has already disappeared.

There are other differences because of this; for instance

perl -E'$_ = "f"; s/f/\\$x/e; say; say eval qq/\\\$x/'

In fact, the RHS of s///e is more like q// than qq//, except that even
q// converts \\ -> \ which s///e doesn't.

Ben

Eric Pozharski

May 19, 2013, 6:38:36 AM

with <p5ck6a-...@anubis.morrow.me.uk> Ben Morrow wrote:

*SKIP*

>> #!/usr/bin/perl
>>
>> use strict;
>> use warnings;
>>
>> my( $aa, $ab ) = qw/ aaaa bbbb /;
>>
>> my $prevent_infinite_loop;
>> while( ++$prevent_infinite_loop < 1000 && $aa =~ /aa/ ) {
>> my( $start, $length ) = ( $-[0], $+[0] - $-[0] );
>> my $replacement = 'a';
>> substr $aa, $start, $length, $replacement;
>> print "aa: $aa\n";
>> }
>>
>> $ab =~ s{bb}{ print "ab (before): $ab\n"; 'b' }ge;
>> print "ab (now): $ab\n";
>>
>> __END__
>>
>> {2809:6} [0:0]% p.x foo.tI7BTR.pl
>> p.x:1: no such file or directory: ./Build
>> aa: aaa
>> aa: aa
>> aa: a
>> ab (before): bbbb
>> ab (before): bbbb
>> ab (now): bb
>
> I'm not sure what this is supposed to be demonstrating...

I'm trying to show that

while( m// ) { code(); s/// }

differs from

s//code()/eg

Rainer Weikusat

May 19, 2013, 1:56:42 PM

In particular, s///g scans the text from left to right and replaces
matches it found in the original input string. The loop will rescan
the text upon each iteration, possibly performing replacements on the
results of earlier replacements.
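
A minimal sketch of that difference, using the same 'aaaa' input and /aa/ pattern as the test script earlier in the thread (the variable names are made up here):

```perl
use strict;
use warnings;

my $input = 'aaaa';

# s///g: one left-to-right pass over the original string.
my $global = $input;
$global =~ s/aa/a/g;

# Loop: rescans from the start after every single replacement,
# so it also matches text produced by earlier replacements.
my $loop  = $input;
my $guard = 0;
while ($guard++ < 1000 && $loop =~ s/aa/a/) {
    # nothing else to do; the substitution is the loop condition
}

print "s///g: $global\n";   # aa
print "loop:  $loop\n";     # a
```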

Ben Morrow

May 19, 2013, 6:00:42 PM

Quoth Eric Pozharski <why...@pozharski.name>:

> with <p5ck6a-...@anubis.morrow.me.uk> Ben Morrow wrote:
>
> > I'm not sure what this is supposed to be demonstrating...
>
> I'm trying to show that
>
> while( m// ) { code(); s/// }
>
> differs from
>
> s//code()/eg

OK. Then I'm not sure *why* you're demonstrating that, since I wasn't
questioning it.

Ben

Eric Pozharski

May 20, 2013, 2:19:48 AM

with <alun6a-...@anubis.morrow.me.uk> Ben Morrow wrote:

*SKIP*

> OK. Then I'm not sure *why* you're demonstrating that, since I wasn't
> questioning it.

Good. Now I have a better understanding of your way of thinking. It
won't make problems anymore.

Tim McDaniel

May 20, 2013, 11:58:39 AM

In article <slrnkphatc...@orphan.zombinet>,

Eric Pozharski <why...@pozharski.name> wrote:
>I'm trying to show that
>
> while( m// ) { code(); s/// }
>
>differs from
>
> s//code()/eg

Oh, rescanning! Yes, in general that should indeed be considered, and
thank you for mentioning it -- s///g doesn't rescan and so avoids any
infinite loop problems.

(If you're curious about my original problem that prompted this:
rescanning doesn't produce any different results for my case -- the
right-hand side of the s/// cannot produce text that would match
again. The pattern it's looking for is very roughly
[[ (word character or space) * | (word character or space) * ]]
and the replacement is one of the alternatives, which therefore
cannot have [ or ].)
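
Roughly what that looks like as a single s{}{}eg, as a sketch only: the exact pattern, the sample text and the rule "take the first alternative" are stand-ins, since the real selection logic isn't shown here.

```perl
use strict;
use warnings;

my $text = 'Use [[red|blue]] paint on the [[front door|gate]].';

$text =~ s{\[\[([\w |]*)\]\]}{
    # Split the captured body on "|"; -1 keeps trailing empty fields.
    my @alternatives = split /\|/, $1, -1;

    # Stand-in for the real choice: take the first alternative.
    # It contains no [ or ], so it could never match again anyway.
    $alternatives[0];
}eg;

print "$text\n";
```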

--
Tim McDaniel, tm...@panix.com