RFD: code inside RExen

Ilya Zakharevich

May 3, 2004, 4:29:48 AM

PRELIMINARIES

When one inserts some Perl code to become a part of a REx (e.g., the
current way is to use a (?{CODE}) construct), one should care about
the following questions:

a) when is this code executed?

b) what information about the match is available when this code is
executed?

c) what happens with temporaries introduced in CODE when the match
backtracks? Are they destroyed at this moment, or later? Same
for local()ized variables (this is essentially the same in
disguise)?

I know three meaningful answers to 'a': two implementation-dependent,
the third one not. The implementation-independent one is 'after the
match succeeds'. The first implementation-dependent answer is
"immediately when the REx matches `up to this point'"; the other one
is "immediately when the REx match FAILS somewhere after the current
point". Combining the last two flavors, one can obtain the DO/UNDO
semantics of backtracking.
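To make the "immediately" answer concrete, here is a minimal sketch using the existing (?{CODE}) construct. The variable name @log is only illustrative. The block runs every time the engine reaches it, including on paths that are later abandoned, and a plain (non-localized) side effect is not undone when the engine backtracks.

```perl
use strict;
use warnings;

my @log;

# a+ is greedy, so the engine first tries the longest run of "a",
# reaches the code block, fails on the following "ab", backtracks,
# and reaches the code block again with a shorter $1.
"aaab" =~ /^(a+)(?{ push @log, length $1 })ab/;

# @log now holds one entry per visit to the block, not one per match.
print "block ran ", scalar(@log), " times; last length seen: $log[-1]\n";
```

Because push is not localized, the entries recorded on the failed path stay in @log; this is DO without UNDO.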

In 'b', there are several possible subquestions: should the current
position in string be available? Should the offsets and contents of
$1 etc be available? Same for $^N? In the case of recursive RExen,
should the similar information about the state of the "enclosing"
match be available? When/if we have "named capturing groups", the same
question can be applied to them too.
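As an illustration of the kinds of information asked about in 'b', the following sketch records what the existing (?{CODE}) construct can see at two points of a match: the current position via pos(), a numbered group, and $^N. The sample string and the @seen array are assumptions made for the example.

```perl
use strict;
use warnings;

my @seen;

"key=value" =~ /
    (\w+)                                    # group 1
    (?{ push @seen, [ pos(), $1, $^N ] })
    =
    (\w+)                                    # group 2
    (?{ push @seen, [ pos(), $2, $^N ] })
    $
/x;

# Each entry is [ position, numbered group, most recently closed group ].
printf "pos=%d group=%s last-closed=%s\n", @$_ for @seen;
```

At each block, $^N agrees with the group that has just closed, which is why a default that exposes only $^N (plus the position) already covers many uses.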

'c' makes sense only in the implementation-dependent variants of the
answer to 'a'. With the current implementation, only the positive
answer to 'c' enables the UNDO semantics of backtracking.
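The UNDO half can be sketched with local() inside the existing (?{CODE}) construct, following the counting idiom from the perlre documentation. The names $count and $result are illustrative. Each localized increment is rolled back when the engine backtracks over the block that made it.

```perl
use strict;
use warnings;

our $count = 0;    # package variable, so that local() can apply to it
my  $result;

"aaaaaaaa" =~ /
    (?{ $count = 0 })
    (?:
        a
        (?{ local $count = $count + 1 })    # rolled back on backtracking
    )*
    aaaa
    (?{ $result = $count })                 # copied out on the matching path
/x;

print "iterations that survived backtracking: $result\n";
```

The starred group first consumes all eight characters, then gives four back so that the trailing aaaa can match; the localized increments for those four are undone, so $result reflects only the iterations on the successful path.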

HISTORICALLY PRESENT BEHAVIOUR

The answers for (?{CODE}) are: a: immediate; b: full info available;
c: temporaries are destroyed immediately.

These answers are due to the desire to make (?{CODE}) as universally
useful as possible; however, they result in many inconveniences in
actual use of this construct, as well as in a significant slowdown of
its execution.

PROPOSAL

When a code reference is interpolated into a REx, this embeds the code
into the REx similarly to the current (?{CODE}) block. The defaults are:

a) the code is executed after the match completes (however, the
available information about the match reflects the state at the
moment the code block was encountered in the REx).

b) only the information about $^N is made available during the execution.

c) N/A due to the answer to 'a'.

These are defaults only; note that the answers to 'a' and 'b' are NOT
those for (?{CODE}). These settings should be made fully customizable
by flags to the REx.
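The proposed defaults are not implemented anywhere, but their effect can be approximated with today's constructs. The sketch below is only an emulation under stated assumptions: @pending, @committed and $callback are hypothetical names, and the code reference is called by hand after the match rather than by the REx engine. State is captured when the block is encountered; the callback runs only after the match succeeds.

```perl
use strict;
use warnings;

our @pending;      # package variable: local() discards entries from failed paths
my  @committed;    # snapshot taken only on the path that finally matches
my  @out;

# Hypothetical callback standing in for an interpolated code reference.
my $callback = sub {
    my ($last_group, $position) = @_;
    push @out, "$last_group at $position";
};

my $matched = "aaab" =~ /
    ^ (a+)
    (?{ local @pending = (@pending, [ $callback, $^N, pos() ]) })
    ab
    (?{ @committed = @pending })
/x;

# Deferred execution: run the recorded callbacks only after success,
# passing the state that was current when each block was encountered.
if ($matched) {
    for my $entry (@committed) {
        my ($code, @state) = @$entry;
        $code->(@state);
    }
}

print "$_\n" for @out;
```

The callback sees only $^N and the position, and it never runs for the path that was tried and abandoned, which is the point of answer 'a' in the proposal.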

What do you think?
Ilya

----- End forwarded message -----

Brian McCauley

May 4, 2004, 8:24:56 AM

Ilya Zakharevich <nospam...@ilyaz.org> writes:

> PRELIMINARIES
>
> When one inserts some Perl code to become a part of a REx (e.g., the
> current way is to use a (?{CODE}) construct), one should care about
> the following questions:
>
> a) when is this code executed?
>
> b) what information about the match is available when this code is
> executed?
>
> c) what happens with temporaries introduced in CODE when the match
> backtracks? Are they destroyed at this moment, or later? Same
> for local()ized variables (this is essentially the same in
> disguise)?

I think you should add:

d) how does CODE signal to the regex engine that the zero-width
assertion (?{CODE}) should be considered to have failed?

> HISTORICALLY PRESENT BEHAVIOUR

The return value of CODE is currently just stored in $^R. The only
way to have a CODE assertion that can fail is to wrap (?{}) in (?())
or use (??{}).

This is messy.
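For reference, the two workarounds look roughly like this with the existing syntax. The helper names and the "decimal octet" check are assumptions chosen for the example; (?!) is used as a subpattern that can never match.

```perl
use strict;
use warnings;

# Variant 1: (?(?{ COND })yes-pattern), where the yes-pattern always fails.
sub octet_via_conditional {
    my ($s) = @_;
    return $s =~ /^(\d+)(?(?{ $^N > 255 })(?!))$/ ? 1 : 0;
}

# Variant 2: (??{ CODE }), where CODE returns a pattern that either
# always fails or matches the empty string.
sub octet_via_postponed {
    my ($s) = @_;
    return $s =~ /^(\d+)(??{ $^N > 255 ? '(?!)' : '(?:)' })$/ ? 1 : 0;
}

for my $s (qw(200 255 256 999)) {
    printf "%s: conditional=%d postponed=%d\n",
        $s, octet_via_conditional($s), octet_via_postponed($s);
}
```

In both variants the truth value of the Perl code has to be translated into a pattern by hand, which is the messiness referred to above.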

> PROPOSAL

How bad a backward-incompatible change would it be to say that the
(?{CODE}) assertion fails if CODE returns false? After all, even in
5.8.4 the documentation still says "may be changed or deleted without
notice."

If that is felt to be too much of a backward-incompatible change, perhaps
we could allow it as a 'use re' option. In the case of a pre-compiled
regex using qr/(?{CODE})/, the effective option should, of course, be
the one that is in scope where the qr// occurs, not the one where the
regex is finally executed.

--
\\ ( )
. _\\__[oo
.__/ \\ /\@
. l___\\
# ll l\\
###LL LL\\

Ilya Zakharevich

May 7, 2004, 3:39:20 PM

[A complimentary Cc of this posting was sent to
Brian McCauley
<nob...@mail.com>], who wrote in article <u9pt9kr...@wcl-l.bham.ac.uk>:

> > When one inserts some Perl code to become a part of a REx (e.g., the
> > current way is to use a (?{CODE}) construct), one should care about
> > the following questions:
> >
> > a) when is this code executed?
> >
> > b) what information about the match is available when this code is
> > executed?
> >
> > c) what happens with temporaries introduced in CODE when the match
> > backtracks? Are they destroyed at this moment, or later? Same
> > for local()ized variables (this is essentially the same in
> > disguise)?
>
> I think you should add:
>
> d) how does CODE signal to the regex engine that the zero-width
> assertion (?{CODE}) should be considered to have failed?

Thanks, when I posted I soon realized that something like this was
missing. From my TODO file:

Code references: three more variants:

should FALSE return value stop the match?

should the return value be used as a REx?

should the return value be used as a repetition count?
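The second and third variants can already be approximated with the existing (??{CODE}) construct, where the return value is used as a REx; a repetition count then has to be spelled out as a quantifier. The sketch below assumes a made-up "length:payload," field format, and parse_counted is a hypothetical helper.

```perl
use strict;
use warnings;

# The first group captures a decimal count; the postponed subexpression
# turns that count into a pattern matching exactly that many characters.
sub parse_counted {
    my ($s) = @_;
    return $s =~ /^(\d+):((??{ sprintf '.{%d}', $^N })),$/s ? $2 : undef;
}

for my $s ('3:abc,', '5:hello,', '3:ab,') {
    my $payload = parse_counted($s);
    printf "%-10s => %s\n", $s, defined $payload ? $payload : '(no match)';
}
```

A dedicated "return value is a repetition count" mode would avoid building and compiling a pattern string on every visit to the block.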

> > PROPOSAL
>
> How bad a backward-incompatible change would it be to say that the
> (?{CODE}) assertion fails if CODE returns false? After all, even in
> 5.8.4 the documentation still says "may be changed or deleted without
> notice."

Who cares about the docs; nobody reads them anyway... :-( ;-)

In short: since many constructs were not possible without (?{}), it
became so widely used that no changes are possible.
However, I intentionally inserted a very strict syntax check for (?{})
to allow for future expansion.

For example, allowing (?{}FLAGS) should be backward compatible. If
this thread leads to a reasonable discussion, one can use the same
flag mechanism on the new and on the old style of CODE-in-REx.

=======================================================

Another thing missed in the previous post: the info in 'b' should
include the current position in the string. I think that it makes
sense to make this info available by default too when a code reference
is executed...

Thanks again,
Ilya