Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

hotplug regexes, other misc regex questions

15 views
Skip to first unread message

Steve Fink

unread,
Sep 15, 2002, 7:27:29 PM9/15/02
to perl6-l...@perl.org
What should this do:

my $x = "the letter x";
print "yes" if $x =~ /the { $x .= "!" } .* !/;

Does this print "yes"?

print "yes" if "helo" =~ /hel { .pos-- } lo/;

Would it be correct for this to print 0? Would it be correct for this
to print 2?

my $n = 0;
"aargh" =~ /a* { $n++ } aargh/;
print $n;

It might want to print zero because if regexes are not required to pay
attention to modifications in the input string, then it can optimize
away the a*.

What possible outputs are legal for this:

"aaa" =~ /( a { print 1 } | a { print 2 })* { print "\n" } x/

Steve Fink

unread,
Sep 16, 2002, 1:41:45 AM9/16/02
to Luke Palmer, perl6-l...@perl.org
On Sun, Sep 15, 2002 at 11:10:12PM -0600, Luke Palmer wrote:

> On Sun, 15 Sep 2002, Steve Fink wrote:
>
> > What should this do:
> >
> > my $x = "the letter x";
> > print "yes" if $x =~ /the { $x .= "!" } .* !/;
>
> Print 'string has no method "!"' :)

Argh, I keep forgetting that. Yes, thanks, I meant $x _= "!".

> However, if you s/\.=/_=/, then I would want it to. But it might be
> considered undefined due to optimization issues. For instance, what if
> you deleted a character before the matching position? Does it keep the
> current place in memory, or the current pos in the string?

Right -- which is why I'm asking these questions. I am implementing
this stuff right now, with optimizations, and I want to know what
optimizations are legal. For that, I need some kind of official-ish
ruling.

> > Does this print "yes"?
> >
> > print "yes" if "helo" =~ /hel { .pos-- } lo/;
>

> I want it to. I don't see any reason it couldn't.

I'd probably vote yes, also. The reason why it wouldn't is if you
check to see if you have enough characters left for the minimum length
match. So if you match /hello/ with

have >= 5 characters left?
1st one is h?
2nd one is e?
.
.
.

then are you allowed to do the same thing if there is embedded code?
You can't if the input string or the match position can be modified
midstream. Neither is a major hardship, since you can just translate
to:

have >= 3 chars left?
match h
match e
match l
(run code)
have >= 2 chars left?
match l
match o

But if both are disallowed, then I'll code it to keep the faster one.

> > What possible outputs are legal for this:
> >
> > "aaa" =~ /( a { print 1 } | a { print 2 })* { print "\n" } x/
>

> Once again, implementation dependent. For instance, if that string had
> been studied, it wouldn't print anything at all. But on the standard (or
> naive) implementation, I would assume:
>
> 111
> 2
> 21
> 2
> 211
> 2
> 21
> 2

Huh? What implementation is that? I think my naive implementation
gives

111
112
11
121
122
12
1
211
212
21
221
222
22
2

though I can't run it to try because some support code hasn't been
checked in yet.

Luke Palmer

unread,
Sep 16, 2002, 1:10:12 AM9/16/02
to Steve Fink, perl6-l...@perl.org
On Sun, 15 Sep 2002, Steve Fink wrote:

> What should this do:
>
> my $x = "the letter x";
> print "yes" if $x =~ /the { $x .= "!" } .* !/;

Print 'string has no method "!"' :)

However, if you s/\.=/_=/, then I would want it to. But it might be

considered undefined due to optimization issues. For instance, what if
you deleted a character before the matching position? Does it keep the
current place in memory, or the current pos in the string?

> Does this print "yes"?


>
> print "yes" if "helo" =~ /hel { .pos-- } lo/;

I want it to. I don't see any reason it couldn't.

> Would it be correct for this to print 0? Would it be correct for this


> to print 2?
>
> my $n = 0;
> "aargh" =~ /a* { $n++ } aargh/;
> print $n;

I would say it's correct to print anything. $n is not hypotheticalized
(h14d :), so it's behaviour inside the match is implementation-dependent.

> What possible outputs are legal for this:
>
> "aaa" =~ /( a { print 1 } | a { print 2 })* { print "\n" } x/

Once again, implementation dependent. For instance, if that string had

been studied, it wouldn't print anything at all. But on the standard (or
naive) implementation, I would assume:

111
2
21
2
211
2
21
2


Luke

Markus Laire

unread,
Sep 16, 2002, 3:32:17 AM9/16/02
to Steve Fink, perl6-l...@perl.org, Luke Palmer

Your code seems to backtrack to the beginning at every failure. First
code only backtracks one char at time.

> > > "aaa" =~ /( a { print 1 } | a { print 2 })* { print "\n" } x/

> > 111
> > 2
> > 21
> > 2
> > 211
> > 2
> > 21
> > 2

match <1> 3 times (111)
backtrack 1 char, match <2> (112)
backtrack 2 chars, match <21> (121)
backtrack 1 char, match <2> (122)
backtrack 3 chars, match <211> (211)
backtrack 1 char, match <2> (212)
backtrack 2 chars, match <21> (221)
backtrack 1 char, match <2> (222)

Paretheses shows which complete matches are tried at each position.

--
Markus Laire 'malaire' <markus...@nic.fi>

Damian Conway

unread,
Sep 18, 2002, 11:01:35 AM9/18/02
to perl6-l...@perl.org
Steve Fink wrote:

> What should this do:
>
> my $x = "the letter x";
> print "yes" if $x =~ /the { $x .= "!" } .* !/;
>
> Does this print "yes"?

If it's allowed at all, I think the match should succeed.


> print "yes" if "helo" =~ /hel { .pos-- } lo/;

This definitely has to work. But remember the call to C<pos> is on
the "match object" (i.e. $0), not the string.


> Would it be correct for this to print 0? Would it be correct for this
> to print 2?
>
> my $n = 0;
> "aargh" =~ /a* { $n++ } aargh/;
> print $n;

Yes. ;-)


> What possible outputs are legal for this:
>
> "aaa" =~ /( a { print 1 } | a { print 2 })* { print "\n" } x/

Unless Larry specifies a required semantics, there are potentially very
many acceptable outputs from this, depending on implementation.

Therefore, your implementation must print the superposition of all possible
outputs. ;-)

Actually, I would expect that *any* pattern with closures in it should act as
though it were trying each branch and loop in the normal sequence (even if
it optimizes that sequence away (which probably means it can't do that
optimization in the first place (which means it should act as though it were
trying each branch and loop in the normal sequence %-))).

Of course, LMMV.

Damian

Josh Jore

unread,
Sep 18, 2002, 12:26:47 PM9/18/02
to perl6-l...@perl.org
On Wed, 18 Sep 2002, Damian Conway wrote:

> > Would it be correct for this to print 0? Would it be correct for this
> > to print 2?
> >
> > my $n = 0;
> > "aargh" =~ /a* { $n++ } aargh/;
> > print $n;
>
> Yes. ;-)

Wouldn't that print 2 if $n is lexical and 0 if it's localized? Or are
lexicals localized now?

> > What possible outputs are legal for this:
> >
> > "aaa" =~ /( a { print 1 } | a { print 2 })* { print "\n" } x/

I take it that what I've learned from _Mastering_Regular_Expressions_
doesn't quite apply here? From that interpretation I'd think it'd print
"111\n" since the second part of the alternation wouldn't be tried.

Joshua b. Jore
http://www.greentechnologist.org

Luke Palmer

unread,
Sep 18, 2002, 3:49:55 PM9/18/02
to Josh Jore, perl6-l...@perl.org
On Wed, 18 Sep 2002, Josh Jore wrote:

> On Wed, 18 Sep 2002, Damian Conway wrote:
>
> > > Would it be correct for this to print 0? Would it be correct for this
> > > to print 2?
> > >
> > > my $n = 0;
> > > "aargh" =~ /a* { $n++ } aargh/;
> > > print $n;
> >
> > Yes. ;-)
>
> Wouldn't that print 2 if $n is lexical and 0 if it's localized? Or are
> lexicals localized now?

Well, { $n++ } is within the lexical scope of $n, so it doesn't matter.
What matters is whether $n++ was hypotheticalized like so:

"aargh" =~ /a* { let $n++ } aargh/ # can it work that way?

Then it would either print 1 or 0, because if it backtracked, the ++ would
be undone. If the change is adopted that you can't optimize when there's
a closure in the middle of the optimization, it would print 1.

> > > What possible outputs are legal for this:
> > >
> > > "aaa" =~ /( a { print 1 } | a { print 2 })* { print "\n" } x/
>
> I take it that what I've learned from _Mastering_Regular_Expressions_
> doesn't quite apply here? From that interpretation I'd think it'd print
> "111\n" since the second part of the alternation wouldn't be tried.

The first time through, yes. But then it tries to match the "x", and says
"oops, that's not what I have" and backtracks. That tries the second of
the alternation, which doesn't work either. So it backtracks so the * is
only getting two of the first one, et cetera...

Or are you talking about something else from Mastering Regular
Expressions? Like some kind of optimization that happens?

Luke

Damian Conway

unread,
Sep 19, 2002, 9:19:04 AM9/19/02
to perl6-l...@perl.org
Josh Jore wrote:

>>>Would it be correct for this to print 0? Would it be correct for this
>>>to print 2?
>>>
>>> my $n = 0;
>>> "aargh" =~ /a* { $n++ } aargh/;
>>> print $n;
>>
>>Yes. ;-)
>
> Wouldn't that print 2 if $n is lexical

Err. It *is* lexical in this example.

> and 0 if it's localized?

No. Without the C<my> it would still print either 0 or 2, depending
on the implementation/optimization.


> Or are lexicals localized now?

They can be. But this example C<$n> isn't.
(Just because it's used in a nested closure doesn't mean it's
localized within the pattern).


>>>What possible outputs are legal for this:
>>>
>>> "aaa" =~ /( a { print 1 } | a { print 2 })* { print "\n" } x/
>>
>
> I take it that what I've learned from _Mastering_Regular_Expressions_
> doesn't quite apply here? From that interpretation I'd think it'd print
> "111\n" since the second part of the alternation wouldn't be tried.

No. It would fail to match the final C<x> in the pattern and start
backtracking.

Damian

Josh Jore

unread,
Sep 19, 2002, 2:07:20 PM9/19/02
to perl6-l...@perl.org

I missed the trailing 'x/' since my perl5 eyes read that as '/x'. My bad.

Joshua b. Jore -{ weird geeky madness }-> http://www.greentechnologist.org


Larry Wall

unread,
Sep 20, 2002, 4:42:56 AM9/20/02
to Steve Fink, perl6-l...@perl.org
On Sun, 15 Sep 2002, Steve Fink wrote:
: What should this do:

:
: my $x = "the letter x";
: print "yes" if $x =~ /the { $x .= "!" } .* !/;

Depends. I think it may be necessary for speed and safety reasons
to set COW on the string we're matching, so that you're always matching
against the original string. Perhaps we can make it work in the specific
case of concatenation, since we want to it to work for extensible
strings coming from a filehandle. But if a fast implementation needs to
keep pointers into a string rather than offsets from the beginning,
we're asking for core dumps if the string is modified out from under the
pointers, or we have to adjust all known pointers any time the string
may be modified.

: Does this print "yes"?


:
: print "yes" if "helo" =~ /hel { .pos-- } lo/;

Yes, that isn't modifying the string.

: Would it be correct for this to print 0? Would it be correct for this


: to print 2?
:
: my $n = 0;
: "aargh" =~ /a* { $n++ } aargh/;
: print $n;
:
: It might want to print zero because if regexes are not required to pay
: attention to modifications in the input string, then it can optimize
: away the a*.

Implementation dependent, but there will probably be a :modifier to
specifically turn off optimizations.

: What possible outputs are legal for this:


:
: "aaa" =~ /( a { print 1 } | a { print 2 })* { print "\n" } x/

Lots. :-)

Larry

Sean O'Rourke

unread,
Sep 20, 2002, 11:56:26 AM9/20/02
to Larry Wall, Steve Fink, perl6-l...@perl.org
On Fri, 20 Sep 2002, Larry Wall wrote:
> But if a fast implementation needs to keep pointers into a string
> rather than offsets from the beginning, we're asking for core dumps if
> the string is modified out from under the pointers, or we have to
> adjust all known pointers any time the string may be modified.

With the current Parrot GC, keeping pointers into the string while doing
unrelated allocation will get you a core dump, since the string body might
be copied. So unless the regex engine copies strings off into its own
private non-collected storage, we're stuck with offsets anyways.

/s

Larry Wall

unread,
Sep 20, 2002, 12:13:44 PM9/20/02
to Sean O'Rourke, Steve Fink, perl6-l...@perl.org
On Fri, 20 Sep 2002, Sean O'Rourke wrote:

That's fine, if it's a constraint. I had thought perhaps COW would
allow a locked-down copy to work with without forcing unnecessary
copying, but I suppose it doesn't hook into GC at that level. I'd also
hoped it would solve any $&-style inefficiencies. But hey, that's not
my job this time around... :-)

Larry

Steve Fink

unread,
Sep 21, 2002, 4:51:45 PM9/21/02
to Markus Laire, perl6-l...@perl.org, Luke Palmer
On Mon, Sep 16, 2002 at 10:32:17AM +0300, Markus Laire wrote:
> On 15 Sep 2002 at 22:41, Steve Fink wrote:
>
> Your code seems to backtrack to the beginning at every failure. First
> code only backtracks one char at time.
> > Huh? What implementation is that? I think my naive implementation
> > gives
> >
> > 111
> > 112
> > 11
> > 121
> > 122
> > 12
> > 1
> > 211
> > 212
> > 21
> > 221
> > 222
> > 22
> > 2
> >
>

Backtracking needs to be done on the state machine, not on the input
string. For any pattern that can be reduced to a DFA, they will have
equivalent output, but consider something like (unoptimized)

"aax" =~ /(a|aa)x/

You try to match 'a' and get it. You try to match 'x' and fail. You
are at "a.ax" in your input string. Backing up in the input string
doesn't help. You need to backtrack to the next possibility of the
previous match operation, in this case to the aa of the a|aa.

My code is not backtracking to the beginning. It is backtracking to
the previous possibility in the NFA. So /(a|a)*x/ matches (a|a) as
many times as possible, then moves on to match the x. On failure, it
undoes one match of (a|a), tries to match the (a|a) a different way,
and if it succeeds matches as many times as possible again before
trying the x. If (a|a) cannot match a different way, it tries to
continue on to the x without matching that last (a|a) at all. Finally,
if that fails, it undoes another match of the (a|a) and repeats.

Programmatically, the implementation of R* looks something like:

loop: match R or goto next
goto loop

back: if we have at least one R match left on the stack, goto R.back
otherwise, fail (goto last backtrack point)

next: ...(in this case, match x or goto back)

where R.back is the backtrack point for a successful (a|a) match (it
behaves like a continuation, so it is as if the "match R" call returns
another result.) R is assumed to push its backtrack information on the
stack if it succeeds, or to leave the stack untouched if it fails.
R.back is assumed to clean up any previous backtrack state of R and
change it to either a new backtrack state or clean it up, depending on
whether it can succeed in another way.

Steve Fink

unread,
Sep 21, 2002, 5:08:25 PM9/21/02
to Damian Conway, perl6-l...@perl.org
On Wed, Sep 18, 2002 at 05:01:35PM +0200, Damian Conway wrote:
> Steve Fink wrote:
>
> >What possible outputs are legal for this:
> >
> > "aaa" =~ /( a { print 1 } | a { print 2 })* { print "\n" } x/
>
> Unless Larry specifies a required semantics, there are potentially very
> many acceptable outputs from this, depending on implementation.
>
> Therefore, your implementation must print the superposition of all possible
> outputs. ;-)

It does. In fact, I wrote the superposition of all possible sequences
of outputs in my email. Don't tell me you actually _read_ my question
before answering it? You, of all people, should know the dangers of
observing these things! Just out of curiousity, which one did it
collapse to?

Superpositions don't seem to work that well as a building block for
our irregular expressions. Making a* map to ANY(,a,aa,aaa,...) loses
some rather important ordering information. Let's see... what if you
makes an OANY ("Ordered ANY") so that

R* -> OANY(aaaa..., aaa..., aa..., ...)
R*? -> OANY(,a,aa,aaa,aaaa,...)
R|S -> OANY(R,S)

Then if you "optimize" by converting as many OANY's to ANY's as you
can, would it then be correct to take maximal subtrees of ANY's and
implement them with DFAs?

Bleh. Why bother? The OANY -> ANY decision is no easier than doing it
directly on the parse tree, I think.

Oops, gotta go. The nice men in the white coats are back.

Andrew Pimlott

unread,
Sep 23, 2002, 9:18:44 AM9/23/02
to perl6-l...@perl.org
On Wed, Sep 18, 2002 at 05:01:35PM +0200, Damian Conway wrote:
> Steve Fink wrote:
> > print "yes" if "helo" =~ /hel { .pos-- } lo/;
>
> This definitely has to work. But remember the call to C<pos> is on
> the "match object" (i.e. $0), not the string.
>
> Actually, I would expect that *any* pattern with closures in it
> should act as though it were trying each branch and loop in the
> normal sequence (even if it optimizes that sequence away (which
> probably means it can't do that optimization in the first place
> (which means it should act as though it were trying each branch
> and loop in the normal sequence %-))).

I hope there will be a way to say "my code is side-effect free", to
permit optimizations.

Andrew

0 new messages