comprehensive list of perl6 rule tokens

Jeff 'japhy' Pinyan

May 24, 2005, 8:25:03 PM
to perl6-l...@perl.org
I'm working on a Perl 5 module that will allow for the parsing of a Perl 6
rule into a tree structure -- specifically, I'm subclassing/extending
Regexp::Parser into Perl6::Rule::Parser. This module is designed ONLY to
PARSE the contents of a rule; it is not concerned with the implementation
of all the new things Perl 6 rules will offer, merely their syntax. Once
this module is done, I'll work on a slightly broader one which will
concern itself with the exterior of the rule (the m:xyz:abc('def')/.../
part, rather than the contents of the rule itself).

To do this effectively, I need an exhaustive list of all tokens that can
appear in a Perl 6 rule. By "token", I mean a single unit of purpose,
such as ^^ and <after ...> and **{3..6}. I have looked through the latest
revisions of Apo05 and Syn05 (from Dec 2004) and come up with the
following list:

http://japhy.perlmonk.org/perl6/rules.txt

The list is split up by leading character. I think it's complete, but I'm
probably wrong, which is why I need more eyes to look over it and tell me
what I've missed.

I just got an email back from Damian which will help me move in the right
direction, but I'd like this to be open to as many knowledgeable minds as
possible.

The part which needs a bit of clarification right now, in my opinion, is
character classes. From what I can gather, these are character classes:

<[a-z] +<digit>>
<+<alpha> -[aeiouAEIOU]>

but I want to be sure. I'm also curious about whitespace. Is "<[" one
token, or can I write "< [a-z] >" and have it be a character class?

Thanks for your help. Unless you're difficult.

--
Jeff "japhy" Pinyan % How can we ever be the sold short or
RPI Acacia Brother #734 % the cheated, we who for every service
http://japhy.perlmonk.org/ % have long ago been overpaid?
http://www.perlmonks.org/ % -- Meister Eckhart

Jonathan Scott Duff

May 24, 2005, 11:03:29 PM
to Jeff 'japhy' Pinyan, perl6-l...@perl.org
On Tue, May 24, 2005 at 08:25:03PM -0400, Jeff 'japhy' Pinyan wrote:
> http://japhy.perlmonk.org/perl6/rules.txt

That looks completish to me. (At least I didn't think, "hey! where's
such and such?")

One thing that I noticed and had to look up was

<-prop X>

though. Because ...

> The part which needs a bit of clarification right now, in my opinion, is
> character classes. From what I can gather, these are character classes:
>
> <[a-z] +<digit>>
> <+<alpha> -[aeiouAEIOU]>

I believe that Larry blessed Pm's idea to allow

<[a..z]+digit>
<+alpha-[aeiouAEIOU]>

which implies to me that assertions starting with one of "<[",
"<-" or "<+" should be treated as character classes. This doesn't
seem to play well with <-prop X>. Maybe it does though.

Also, I think that it's [a..z] now rather than [a-z] but I'm not
entirely sure. At least that's how PGE implements it.

> but I want to be sure. I'm also curious about whitespace. Is "<[" one
> token, or can I write "< [a-z] >" and have it be a character class?

I think you need to write "<["

-Scott
--
Jonathan Scott Duff
du...@pobox.com

Jeff 'japhy' Pinyan

May 24, 2005, 11:24:50 PM
to Jonathan Scott Duff, perl6-l...@perl.org
On May 24, Jonathan Scott Duff said:

> On Tue, May 24, 2005 at 08:25:03PM -0400, Jeff 'japhy' Pinyan wrote:
>> http://japhy.perlmonk.org/perl6/rules.txt
>
> That looks completish to me. (At least I didn't think, "hey! where's
> such and such?")

Oh, frabjous day!

> One thing that I noticed and had to look up was
>
> <-prop X>
>
> though. Because ...

I wish <!prop X> was allowed. I don't see why <!...> has to be confined
to zero-width assertions.

>> The part which needs a bit of clarification right now, in my opinion, is
>> character classes. From what I can gather, these are character classes:
>>
>> <[a-z] +<digit>>
>> <+<alpha> -[aeiouAEIOU]>
>
> I believe that Larry blessed Pm's idea to allow
>
> <[a..z]+digit>
> <+alpha-[aeiouAEIOU]>

Ok, that's news to me. (I have yet to peruse the archives.) That's nice,
not requiring you to <>-ize property names inside a character class
assertion. I'd think whitespace would be permitted in between parts of a
character class, but perhaps I'm wrong. Forbidding it would kinda go
against the whole "whitespace for readability" idea of Perl 6 rules, though.

> which implies to me that assertions starting with one of "<[",
> "<-" or "<+" should be treated as character classes. This doesn't
> seem to play well with <-prop X>. Maybe it does though.

Considering the Unicode properties are like char class macro-things (like
\w and \d), I don't see a problem, except for the fact that there's more
than one "word" (chunk of non-whitespace) associated with them. Maybe
Unicode properties retain their enclosing <>'s?

> Also, I think that it's [a..z] now rather than [a-z] but I'm not
> entirely sure. At least that's how PGE implements it.

Ok. I'll wait for a message from On High about that. It's a minor
detail.

>> but I want to be sure. I'm also curious about whitespace. Is "<[" one
>> token, or can I write "< [a-z] >" and have it be a character class?
>
> I think you need to write "<["

I expected as much.

Jonathan Scott Duff

May 25, 2005, 9:44:24 AM
to Jeff 'japhy' Pinyan, perl6-l...@perl.org
On Tue, May 24, 2005 at 11:24:50PM -0400, Jeff 'japhy' Pinyan wrote:
> I wish <!prop X> was allowed. I don't see why <!...> has to be confined
> to zero-width assertions.

I don't either actually. One thing that occurred to me while responding
to your original email was that <!foo> might have slightly wrong
huffmanization. Is zero-width the common case? If not, we could use
character doubling for emphasis: <!foo> consumes, while <!!foo> is
zero-width.

But that's just a random rambling on my part. I trust @Larry has put
a wee bit more thought into it than I. :-)

Mark A. Biggar

May 25, 2005, 10:41:19 AM
to du...@pobox.com, Jeff 'japhy' Pinyan, perl6-l...@perl.org

But what would a consuming <!...> mean? As it can have no internal
backtracking points (it only has them if it fails), it would match (and
consume) the whole rest of the string, then if there were any more to
the pattern, would immediately backtrack back out left of itself. Thus
it would be semantically identical to the zero-width version. So
zero-width is really the only possibility for <!...>.

Now <prop X> is a character class just like <+digit> and so
under the new character class syntax, would probably be written
<+prop X> or if the white space is a problem, then maybe <+prop:X>
(or <+prop(X)> as Larry gets the colon :-), but that is a pretty
adverbial case so ':' may be okay) with the complemented case being
<-prop:X>. Actually the 'prop' may not be necessary at all, as we know
we're in the character class sub-language because we saw the '<+', '<-'
or '<[', so we could just define the various Unicode character property
codes (i.e., Lu, Ll, Zs, etc.) as pre-defined character class names just
like 'digit' or 'letter'.

BTW, as a matter of terminology, <-digit> should probably be called the
complement of <+digit> instead of the negation so as not to confuse it
with the <!...> negative zero-width assertion case.

--
ma...@biggar.org
mark.a...@comcast.net

Jeff 'japhy' Pinyan

May 25, 2005, 10:50:33 AM
to Jonathan Scott Duff, perl6-l...@perl.org
On May 25, Jonathan Scott Duff said:

> On Tue, May 24, 2005 at 11:24:50PM -0400, Jeff 'japhy' Pinyan wrote:
>> I wish <!prop X> was allowed. I don't see why <!...> has to be confined
>> to zero-width assertions.
>
> I don't either actually. One thing that occurred to me while responding
> to your original email was that <!foo> might have slightly wrong
> huffmanization. Is zero-width the common case? If not, we could use
> character doubling for emphasis: <!foo> consumes, while <!!foo> is
> zero-width.

But that's not even the point. The ! in <!after ...> is not what makes
<!after ...> a zero-width assertion, it's the 'after' that does that. All
the ! does is negate the boolean sense of the assertion, which seems like
a useful thing to have.

Hrm, but I think I see the problem. How does one define "negation" for an
arbitrary assertion? Is <!foo> saying "if <foo> matches, fail"? Because
then <!prop X> doesn't mean the same as <-prop X>. We don't want
negation, we want complement.

I guess '!' is only well-defined for zero-width assertions. When you want
to say <!foo>, I guess <!before <foo>> or <!after <foo>> is the proper way
to go.

Jeff 'japhy' Pinyan

May 25, 2005, 10:55:59 AM
to Mark A. Biggar, du...@pobox.com, perl6-l...@perl.org
On May 25, Mark A. Biggar said:

> Jonathan Scott Duff wrote:
>> On Tue, May 24, 2005 at 11:24:50PM -0400, Jeff 'japhy' Pinyan wrote:
>>
>>> I wish <!prop X> was allowed. I don't see why <!...> has to be confined to
>>> zero-width assertions.
>>
>> I don't either actually. One thing that occurred to me while responding
>> to your original email was that <!foo> might have slightly wrong
>> huffmanization. Is zero-width the common case? If not, we could use
>> character doubling for emphasis: <!foo> consumes, while <!!foo> is
>> zero-width.
>

> Now <prop X> is a character class just like <+digit> and so
> under the new character class syntax, would probably be written
> <+prop X> or if the white space is a problem, then maybe <+prop:X>
> (or <+prop(X)> as Larry gets the colon :-), but that is a pretty
> adverbial case so ':' maybe okay) with the complemented case being
> <-prop:X>. Actually the 'prop' may be unnecessary at all, as we know
> we're in the character class sub-language because we saw the '<+', '<-'
> or '<[', so we could just define the various Unicode character property
> codes (I.e., Lu, Ll, Zs, etc) as pre-defined character class names just
> like 'digit' or 'letter'.

Yeah, that was going to be my next step, except that the unknowing person
might make a sub-rule of their own called, say, "Zs", and then which would
take precedence? Perhaps <prop:X> is a good way of writing it.

> BTW, as a matter of terminology, <-digit> should probably be called the
> complement of <+digit> instead of the negation so as not to confuse it with
> the <!...> negative zero-width assertion case.

Yeah, I just wrote that in my recent reply to Scott. I realized the
nomenclature would be a point of confusion.

Mark A. Biggar

May 25, 2005, 11:28:11 AM
to Jeff 'japhy' Pinyan, du...@pobox.com, perl6-l...@perl.org
Jeff 'japhy' Pinyan wrote:
> On May 25, Mark A. Biggar said:
>
>> Jonathan Scott Duff wrote:
>>
>>> On Tue, May 24, 2005 at 11:24:50PM -0400, Jeff 'japhy' Pinyan wrote:
>>>
>>>> I wish <!prop X> was allowed. I don't see why <!...> has to be
>>>> confined to zero-width assertions.
>>>
>>>
>>> I don't either actually. One thing that occurred to me while responding
>>> to your original email was that <!foo> might have slightly wrong
>>> huffmanization. Is zero-width the common case? If not, we could use
>>> character doubling for emphasis: <!foo> consumes, while <!!foo> is
>>> zero-width.
>>
>>
>> Now <prop X> is a character class just like <+digit> and so
>> under the new character class syntax, would probably be written
>> <+prop X> or if the white space is a problem, then maybe <+prop:X>
>> (or <+prop(X)> as Larry gets the colon :-), but that is a pretty
>> adverbial case so ':' maybe okay) with the complemented case being
>> <-prop:X>. Actually the 'prop' may be unnecessary at all, as we know
>> we're in the character class sub-language because we saw the '<+', '<-'
>> or '<[', so we could just define the various Unicode character property
>> codes (I.e., Lu, Ll, Zs, etc) as pre-defined character class names just
>> like 'digit' or 'letter'.
>
>
> Yeah, that was going to be my next step, except that the unknowing
> person might make a sub-rule of their own called, say, "Zs", and then
> which would take precedence? Perhaps <prop:X> is a good way of writing it.

Well we have the same problem with someone redefining 'digit'. But
character classes are their own sub-language and we may need to
distinguish between Rule::digit and CharClass::digit in the syntax. Of
course we could hack it and say that a rule that consists of nothing but
a single character class item is usable in other character classes by
its name, but that could lead to subtle bugs where someone modifies that
special rule to add stuff to it and breaks all usage of it as a
character class everywhere else. Now, a grammar is just a special kind
of class that contains special kinds of methods called rules; maybe we
need another special kind of method in a grammar that just defines a
named character class for later use? In any case, as usual with methods,
a user-defined character class should override a predefined one of the
same name.

--
ma...@biggar.org
mark.a...@comcast.net

Patrick R. Michaud

May 26, 2005, 12:45:12 PM
to Jeff 'japhy' Pinyan, Mark A. Biggar, du...@pobox.com, perl6-l...@perl.org
Rather than answer each message in this thread individually, I'll
try to aggregate them here. Disclaimer: These are just my
interpretations of how rules are defined; I'm not the one who
decides how they *should* be defined.

On Wed, May 25, 2005 at 10:55:59AM -0400, Jeff 'japhy' Pinyan wrote:
> On May 25, Mark A. Biggar said:
> >Jonathan Scott Duff wrote:
> >>On Tue, May 24, 2005 at 11:24:50PM -0400, Jeff 'japhy' Pinyan wrote:
> >>>I wish <!prop X> was allowed. I don't see why <!...> has to be confined
> >>>to zero-width assertions.

<!...> isn't confined to use with zero-width assertions, but <!...>
always acts as a zero-width assertion. In essence, since we're requiring
a negative match, nothing is consumed by that negative match.

In some senses <!subrule> is the same as <!before <subrule> >.
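
As a rough analogy only (this is Python's re module, not Perl 6 rules),
a negative zero-width assertion is spelled (?!...) there. The sketch
below just illustrates the point that a negated match consumes nothing:
the match position after the assertion is the same as before it. The
pattern "foo" and the sample strings are made up for the example.

```python
import re

# (?!foo) is Python's negative lookahead: it succeeds only at positions
# where "foo" does NOT match, and it never consumes any input.
negated = re.compile(r"(?!foo)")

# The assertion succeeds on "bar" and the match is empty (ends at 0).
m = negated.match("bar")
print(m is not None, m.end())

# The assertion fails where "foo" would match.
print(negated.match("foobar"))

# Because nothing was consumed, the rest of the pattern starts at the
# very position where the assertion was tested.
print(re.match(r"(?!foo)\w+", "bar").group(0))
```

The same behaviour is what makes <!subrule> and <!before <subrule> >
look alike from the outside.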

> >Now <prop X> is a character class just like <+digit> and so
> >under the new character class syntax, would probably be written
> ><+prop X> or if the white space is a problem, then maybe <+prop:X>
> >(or <+prop(X)> as Larry gets the colon :-), but that is a pretty
> >adverbial case so ':' maybe okay) with the complemented case being
> ><-prop:X>.

The whitespace itself isn't a problem, but it means that whatever
follows is parsed using rules syntax and not a string constant. Thus we
probably want <prop:Lu> or <prop("Lu")> and not <prop Lu>.

And to be a little pedantic in terminology, I call <prop:Lu> a capturing
subrule, not a character class match (although that subrule probably does
match and capture just a single character). The character class
match would be <+prop:Lu> or something like that. However, we do
get into a parsing issue with <+prop:Lu+prop:Ll>, which would probably
have to be written as <+prop('Lu')+prop('Ll')>, unless we treat the +
as "special". (AFAIK, the :-argument form of subrule calls isn't well
defined yet -- it's only briefly mentioned/proposed in A05.)

> >Actually the 'prop' may be unnecessary at all, as we know
> >we're in the character class sub-language because we saw the '<+', '<-'
> >or '<[', so we could just define the various Unicode character property
> >codes (I.e., Lu, Ll, Zs, etc) as pre-defined character class names just
> >like 'digit' or 'letter'.

I like this.

> Yeah, that was going to be my next step, except that the unknowing person
> might make a sub-rule of their own called, say, "Zs", and then which would
> take precedence? Perhaps <prop:X> is a good way of writing it.

Well, it works out the same as if someone creates their own "digit" or
"alpha" rule. One can always get to the built-in definition by explicit
scoping using <Grammar::digit> (or wherever the built-ins end up
being defined).

Pm

Patrick R. Michaud

May 26, 2005, 12:19:42 PM
to Jeff 'japhy' Pinyan, perl6-l...@perl.org
On Tue, May 24, 2005 at 08:25:03PM -0400, Jeff 'japhy' Pinyan wrote:
> I have looked through the latest
> revisions of Apo05 and Syn05 (from Dec 2004) and come up with the
> following list:
>
> http://japhy.perlmonk.org/perl6/rules.txt

I'll review the list below, but it's also worthwhile to read

http://www.nntp.perl.org/group/perl.perl6.language/21120

which is Larry's latest missive on character classes, and

http://www.nntp.perl.org/group/perl.perl6.language/20985

which describes the capturing semantics (but be sure to note
the lengthy threads that follow concerning changes in the
indexing from $1, $2, ... to $0, $1, ... ).

Here are my comments on the table at http://japhy.perlmonk.org/perl6/rules.txt,
downloaded 26-May 1526 UTC:

CHAR  EXAMPLE  IMPL  DESCRIPTION
===========================================
&     a&b      N     conjunction
      &var     N     subroutine

I'm not sure that "&var" means subroutine anymore. A05 does mention
it, but S05 does not, and I think it invites way too much confusion
with conjunctions. Consider "a&var($x|$y)" versus "a & var ( $x | $y )".
But if we are allowing &var (and I hope we do not), then the parens are
required.

x* Y previous atom 0 or more times
x**{n..m} N previous atom n..m times

Keeping in mind that the "n..m" can actually be any sort of closure
(although it's not implemented that way yet in PGE). The rules
engine will generally optimize parsing and handling of "n..m" when
it can (e.g., when "n" and "m" are both constants).

( (x) Y capture 'x'
) Y must match opening '('

It may be worth noting that parens not only capture, they also
introduce a new scope for any nested subpattern and subrule captures.

:ignorecase N case insensitivity :i
:global N match globally :g
:continue N start scanning after previous match :c
...etc

I'm not sure these are "tokens" in the sense of "single unit of purpose"
in your original message. I think these are all adverbs, and the "token"
is just the initial C<:> at the beginning of a group.

:keepall N all rules and invoked rules remember everything

That's now ":parsetree" according to Damian's proposed capture rules.

<commit> N backtracking fails completely
<cut> N remove what matched up to this point from the string
<after P> N we must be after the pattern P
<!after P> N we must NOT be after the pattern P
<before P> N we must be before the pattern P
<!before P> N we must NOT be before the pattern P

As with ':words', etc., I'm not sure that these qualify as "tokens"
when parsing the regex -- the tokens are actually "<" or "<!" and
indicate a call to a subrule of some sort, and these are just predefined
rules. The rules parser and engine may indeed tokenize them for
optimization purposes, but I don't think the language defines them
as fundamental "tokens", and someone is free to override the predefined
rules with their own. (Perhaps <cut> and <commit> cannot be overridden.)

<?ws> N match whitespace by :w rules
<?sp> N match a space character (chr 32 ONLY)

Here the token is "<?", indicating a non-capturing subrule.

<$rule> N indirect rule
<::$rulename> N indirect symbolic rule
<@rules> N like '@rules'
<%rules> N like '%rules'
<{ code }> N code produces a rule
<&foo()> N subroutine returns rule
<( code )> N code must return true or backtracking ensues

Here the leading tokens are actually "<$", "<::$", "<@", "<%", "<{", "<&",
and "<(", and I suspect we have "<?$", "<?::$", "<?@", and "<!$", "<!::$",
"<!@", etc. counterparts. Of course, one could claim that these are
really separated as in "<", "?", and "$" tokens, but PGE's parser currently
treats them as a unit to make it easier to jump directly into the correct
handler for what follows.
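
To make the "treat them as a unit" idea concrete, here is a minimal
sketch of longest-prefix dispatch. It is written in Python purely for
illustration and is not PGE's code; the handler names are invented, and
the combined forms ("<?$", "<!@", and so on) are left out to keep it
short. The character class prefixes "<[", "<+" and "<-" are included
because they are dispatched the same way.

```python
# Hypothetical table mapping a leading token to a handler name.
# The names on the right are invented for this sketch.
PREFIXES = {
    "<::$": "indirect_symbolic_rule",
    "<$": "indirect_rule",
    "<@": "rule_array",
    "<%": "rule_hash",
    "<{": "code_producing_rule",
    "<&": "subroutine_call",
    "<(": "code_assertion",
    "<[": "char_class",
    "<+": "char_class",
    "<-": "char_class",
    "<?": "noncapturing_subrule",
    "<!": "negated_subrule",
    "<": "capturing_subrule",
}

# Try longer prefixes first so that "<::$" wins over plain "<".
ORDERED = sorted(PREFIXES, key=len, reverse=True)


def classify(text, pos=0):
    """Return (prefix, handler_name) for the assertion at pos, or None."""
    for prefix in ORDERED:
        if text.startswith(prefix, pos):
            return prefix, PREFIXES[prefix]
    return None
```

With this table, classify("<[a..z]>") reports a character class, while
classify("< [a..z] >") falls through to the bare "<" entry, which is one
way to see why "<[" has to be written without intervening whitespace.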

<[a-z]> N character class
<+alpha> N character class
<-[a-z]> N complemented character class

The tokens for character class manipulation are currently "<[", "<+",
and "<-", although that's not officially documented in A05 or S05 yet.
Also, ranges are now <[a..z]> -- an unescaped hyphen appearing in an
enumerated character class generates a warning.

<+\w-[0-9]> N character class "arithmetic"

I'm not sure that it's been decided/documented that \w, \s, etc.
can appear in character class arithmetic (although it seems like it
should).

<prop:X> N Unicode property match
<-prop:X> N complemented Unicode property match

Here "prop" is just a subrule (or character class) similar to
<+alpha>, <+digit>, etc. Also, note that <prop:X> is a capturing
subrule, while <+prop:X> would be a character class match (and presumably
not capture).

<rule> N match rule (and capture to $rule)
<?rule> N match rule (don't capture)
<<rule>> N match rule (don't capture)

Do we still have the <<rule>> syntax, or was that abandoned in
favor of <?rule> ? (I know there are still some remnants of <<...>>
in S05 and A05, but I'm not sure they're intentional.)

> Thanks for your help. Unless you're difficult.

"You're welcome" unless $Pm ~~ /<?difficult>/;

Pm

Patrick R. Michaud

May 26, 2005, 1:06:45 PM
to Mark A. Biggar, Jeff 'japhy' Pinyan, du...@pobox.com, perl6-l...@perl.org
On Wed, May 25, 2005 at 08:28:11AM -0700, Mark A. Biggar wrote:
> Jeff 'japhy' Pinyan wrote:
> >Yeah, that was going to be my next step, except that the unknowing
> >person might make a sub-rule of their own called, say, "Zs", and then
> >which would take precedence? Perhaps <prop:X> is a good way of writing it.
>
> Well we have the same problem with someone redefining 'digit'. But
> character classes are their own sub-language and we may need to
> distinguish between Rule::digit and CharClass::digit in the syntax.

Larry makes the case that "character classes really are rules of
a sort" in http://www.nntp.perl.org/group/perl.perl6.language/21120.
Essentially, since characters are no longer fixed-width entities,
the distinction between "character class" and "rule" can get pretty
fuzzy, and so we're probably better off not trying to force one.

It may simply be that in general a character class expression like

<+alnum-[aeiou]+punct>

ends up decomposing into something like

[ <![aeiou]><?alnum> | <?punct> ]

which doesn't make any assumptions about character widths. Of course,
the rules engine would (eventually) be written to recognize the common
compositions and "character class" subrules and optimize for them.
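
A minimal sketch of that decomposition, again using Python's re module
as a stand-in rather than Perl 6 itself. The ASCII-only definitions of
"alnum" and "punct" are assumptions made for the example; the real
classes would be Unicode-aware.

```python
import re
import string

# ASCII-only stand-ins for the named classes (an assumption of this sketch).
ALNUM = "[A-Za-z0-9]"
PUNCT = "[" + re.escape(string.punctuation) + "]"

# [ <![aeiou]><?alnum> | <?punct> ] rendered with Python's lookahead syntax:
# "not a lowercase vowel, then an alnum" OR "a punctuation character".
decomposed = re.compile(r"(?:(?![aeiou])" + ALNUM + "|" + PUNCT + ")")


def in_class(ch):
    """True if the single character ch belongs to <+alnum-[aeiou]+punct>."""
    return decomposed.fullmatch(ch) is not None
```

Here in_class("b"), in_class("7") and in_class("!") are true, while
in_class("a") and in_class(" ") are false. Nothing in the alternation
depends on a character being one unit wide.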

Pm

Jeff 'japhy' Pinyan

May 26, 2005, 7:05:41 PM
to Patrick R. Michaud, perl6-l...@perl.org
On May 26, Patrick R. Michaud said:

> On Tue, May 24, 2005 at 08:25:03PM -0400, Jeff 'japhy' Pinyan wrote:
>> I have looked through the latest
>> revisions of Apo05 and Syn05 (from Dec 2004) and come up with the
>> following list:
>>
>> http://japhy.perlmonk.org/perl6/rules.txt
>
> I'll review the list below, but it's also worthwhile to read
>
> http://www.nntp.perl.org/group/perl.perl6.language/21120
>
> which is Larry's latest missive on character classes, and
>
> http://www.nntp.perl.org/group/perl.perl6.language/20985
>
> which describes the capturing semantics (but be sure to note
> the lengthy threads that follow concerning changes in the
> indexing from $1, $2, ... to $0, $1, ... ).

I'll check them out. Right now, I'm really only concerned with syntax
rather than implementation. Perl6::Rule::Parser will only parse the rule
into a tree structure.

> & a&b N conjunction
> &var N subroutine
>
> I'm not sure that "&var" means subroutine anymore. A05 does mention

Ok. If it goes away, I'm fine with that.

> x**{n..m} N previous atom n..m times
>
> Keeping in mind that the "n..m" can actually be any sort of closure

Yeah, I know.

> ( (x) Y capture 'x'
> ) Y must match opening '('
>
> It may be worth noting that parens not only capture, they also
> introduce a new scope for any nested subpattern and subrule captures.

Ok. I don't think that'll affect me right now.

> :ignorecase N case insensitivity :i
> :global N match globally :g
> :continue N start scanning after previous match :c
> ...etc
>
> I'm not sure these are "tokens" in the sense of "single unit of purpose"
> in your original message. I think these are all adverbs, and the "token"
> is just the initial C<:> at the beginning of a group.

I understand, but that set is particularly important to me, because as far
as I am concerned, the rule

/abc/

is the object Perl6::Rule::Parser::exact->new('abc'), whereas the rule

/:i abc/

is the object Perl6::Rule::Parser::exactf->new('abc') -- this is using
node terminology from Perl 5, where "exactf" means "exact with case
folding".
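
A minimal sketch of that distinction, in Python for illustration only.
The class names Exact and ExactF and the helper parse_literal are
hypothetical and merely mirror the node names above; they are not the
Perl6::Rule::Parser API. The sketch handles nothing but an optional
leading :i (or :ignorecase) followed by a literal.

```python
from dataclasses import dataclass


@dataclass
class Exact:
    """Literal text, matched exactly."""
    text: str


@dataclass
class ExactF:
    """Literal text, matched with case folding."""
    text: str


def parse_literal(body):
    """Parse a rule body made of an optional ':i' adverb and a literal.

    Whitespace between the pieces is dropped, on the assumption that it
    is not significant inside the rule.
    """
    parts = body.split()
    if parts and parts[0] in (":i", ":ignorecase"):
        return ExactF("".join(parts[1:]))
    return Exact("".join(parts))
```

So parse_literal("abc") gives Exact(text='abc') and
parse_literal(":i abc") gives ExactF(text='abc'): the adverb changes
which node is built, not just a flag on it.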

> :keepall N all rules and invoked rules remember everything
>
> That's now ":parsetree" according to Damian's proposed capture rules.

Ok. I haven't seen those yet.

> <commit> N backtracking fails completely
> <cut> N remove what matched up to this point from the string
> <after P> N we must be after the pattern P
> <!after P> N we must NOT be after the pattern P
> <before P> N we must be before the pattern P
> <!before P> N we must NOT be before the pattern P
>
> As with ':words', etc., I'm not sure that these qualify as "tokens"
> when parsing the regex -- the tokens are actually "<" or "<!" and

I understand. Luckily this new syntax will enable me to abstract things
in the parser.

my $obj = $S->object(assertion => $name, $neg);
# where $name is the part after the < or <!
# and $neg is a boolean denoting the presence of !

Since there's no longer different prefixes for every type of assertion, I
no longer need to make specific classes of objects.
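
As a rough illustration of that abstraction (Python, with hypothetical
names; this is not the module's code), a single generic node can carry
the name and the negation flag. The sketch deliberately ignores nested
angle brackets such as <!before <foo>>.

```python
import re
from dataclasses import dataclass


@dataclass
class Assertion:
    name: str      # the part after "<" or "<!"
    negated: bool  # True when "!" was present


# One pattern for every assertion: optional "!", then everything up to ">".
# Simplification: nested <...> inside the assertion is not supported.
ASSERTION = re.compile(r"<(!?)([^>]*)>")


def parse_assertion(text):
    """Turn '<name>' or '<!name>' into a generic Assertion node."""
    m = ASSERTION.fullmatch(text)
    if m is None:
        raise ValueError("not a simple assertion: %r" % text)
    return Assertion(name=m.group(2), negated=bool(m.group(1)))
```

For example, parse_assertion("<!after P>") yields a node named
"after P" with negated set, and parse_assertion("<commit>") yields one
named "commit" with negated unset, both from the same class.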

> <?ws> N match whitespace by :w rules
> <?sp> N match a space character (chr 32 ONLY)
>
> Here the token is "<?", indicating a non-capturing subrule.

Right.

> <$rule> N indirect rule
> <::$rulename> N indirect symbolic rule
> <@rules> N like '@rules'
> <%rules> N like '%rules'
> <{ code }> N code produces a rule
> <&foo()> N subroutine returns rule
> <( code )> N code must return true or backtracking ensues
>
> Here the leading tokens are actually "<$", "<::$", "<@", "<%", "<{", "<&",
> and "<(", and I suspect we have "<?$", "<?::$", "<?@", and "<!$", "<!::$",
> "<!@", etc. counterparts.

Per your second message, <!@rules> would mean <!before <@rules>>, right?

> Of course, one could claim that these are
> really separated as in "<", "?", and "$" tokens, but PGE's parser currently
> treats them as a unit to make it easier to jump directly into the correct
> handler for what follows.

Yes, so does mine. :)

> <[a-z]> N character class
> <+alpha> N character class
> <-[a-z]> N complemented character class
>
> The tokens for character class manipulation are currently "<[", "<+",
> and "<-", although that's not officially documented in A05 or S05 yet.
> Also, ranges are now <[a..z]> -- an unescaped hyphen appearing in an
> enumerated character class generates a warning.
>
> <+\w-[0-9]> N character class "arithmetic"
>
> I'm not sure that it's been decided/documented that \w, \s, etc.
> can appear in character class arithmetic (although it seems like it
> should).

The new character class idiom is going to confuse me for a while. I'll
have to read the above URL in which Larry sheds light.

> <prop:X> N Unicode property match
> <-prop:X> N complemented Unicode property match
>
> Here "prop" is just a subrule (or character class) similar to
> <+alpha>, <+digit>, etc. Also, note that <prop:X> is a capturing
> subrule, while <+prop:X> would be a character class match (and presumably
> not capture).

I think I'll wait to handle Unicode properties until a syntax has been
agreed upon... <prop:X>, <X>, <prop(X)>, etc.

> <rule> N match rule (and capture to $rule)
> <?rule> N match rule (don't capture)
> <<rule>> N match rule (don't capture)
>
> Do we still have the <<rule>> syntax, or was that abandoned in
> favor of <?rule> ? (I know there are still some remnants of <<...>>
> in S05 and A05, but I'm not sure they're intentional.)

I saw <<...>> in A/S 05, but if they're accidental, then I just won't deal
with it.

And, what's the deal with <RULE> capturing? Does that mean I have to
write <?digit> everywhere instead of <digit> unless I want a capture? Eh,
I guess \d exists for that reason...

>> Thanks for your help. Unless you're difficult.
>
> "You're welcome" unless $Pm ~~ /<?difficult>/;

Difficulty nonexistent.

Patrick R. Michaud

May 26, 2005, 8:36:52 PM
to Jeff 'japhy' Pinyan, perl6-l...@perl.org
On Thu, May 26, 2005 at 07:05:41PM -0400, Jeff 'japhy' Pinyan wrote:
> >Here the leading tokens are actually "<$", "<::$", "<@", "<%", "<{", "<&",
> >and "<(", and I suspect we have "<?$", "<?::$", "<?@", and "<!$", "<!::$",
> >"<!@", etc. counterparts.
>
> Per your second message, <!@rules> would mean <!before <@rules>>, right?

I think so -- at least, it seems that way to me. I feel as though
I may be missing some subtle difference between the two (and if anyone
can identify one I'd really appreciate it).

> And, what's the deal with <RULE> capturing? Does that mean I have to
> write <?digit> everywhere instead of <digit> unless I want a capture? Eh,
> I guess \d exists for that reason...

That, or you could write <+digit>.

Pm

Jeff 'japhy' Pinyan

May 28, 2005, 12:58:01 AM
to perl6-l...@perl.org
Regarding http://www.nntp.perl.org/group/perl.perl6.language/21120,
which discusses character class syntax in Perl 6, I have some comments to
make.

First, I've been very interested in seeing proper set notation for char
classes in Perl 5. I was pretty vocal about it during TPC in 2002, I
think, and have since added some features that are in Perl 5 now that
allow you to define your own Unicode properties with not only + and - and
! but & as well.

If we want to treat character classes as sets, then we should try to use
notation that reads properly. I don't see how '+' and '|' are any
different in this case: <+Foo +Bar> and <Foo | Bar> should produce the
same results always. I suppose the + is helpful in distinguishing a
character class assertion from any other, though. To *complement* a
character class, I think the character ~ is appropriate. Intersection
should be done with &. Subtraction can be provided with -, although it's
really just a shorthand: A - B is really A & ~B... but I suppose Huffman
coding tells us we should provide the - sign.

Here are some examples, then:

<+alpha -vowels> all alphabetic characters except vowels
<+alpha & ~vowels> same thing
<[a..z] -[aeiou]> all characters 'a' through 'z' minus vowels
<[a..z] & ~[aeiou]> same thing
<~(X & Y) | Z> all characters not in X-and-Y, or in Z

The last example shows <~ which is currently unclaimed as far as
assertions go. Since I'd be advocating the removal of a unary - in
character classes (to be replaced by ~), I think this would be ok. The
allowance for a unary + in character classes has already been justified.
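
The identities behind the examples above can be checked with ordinary
sets. This is a minimal sketch in Python, for illustration only; the
ASCII universe and the contents of "alpha", "vowels" and "digit" are
assumptions made for the example.

```python
# ASCII is used as the universe purely for illustration.
UNIVERSE = frozenset(chr(c) for c in range(128))
ALPHA = frozenset("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
VOWELS = frozenset("aeiouAEIOU")
DIGIT = frozenset("0123456789")


def complement(chars):
    """~X: every character of the universe that is not in X."""
    return UNIVERSE - chars


# <+alpha -vowels> and <+alpha & ~vowels> describe the same set.
consonants = ALPHA - VOWELS
assert consonants == ALPHA & complement(VOWELS)

# Union is what both '+' and '|' would denote.
alnum = ALPHA | DIGIT

# <~(X & Y) | Z> with X = alpha, Y = vowels, Z = digit.
example = complement(ALPHA & VOWELS) | DIGIT
```

With these definitions, "b" is in consonants and "a" is not, and
example contains every ASCII character except the vowels.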

For the people who are really going to use it, the notation won't be
foreign. And I'd expect most people who'd use it would actually abstract
a good portion of it away into their own property definitions, so that

<~(X & Y) | Z>

would actually just be

<+My_XYZ_Property>

which would be defined elsewhere.

What say you?

Jonathan Scott Duff

May 28, 2005, 11:22:42 AM
to Jeff 'japhy' Pinyan, perl6-l...@perl.org
On Sat, May 28, 2005 at 12:58:01AM -0400, Jeff 'japhy' Pinyan wrote:
>[ set notation for character classes ]
>
> What say you?

Off the top of my head I think using & and | within character classes
will cause confusion.

/ (<~(X & Y) | Z> | <Q & R>) & <M | N> /

So much for the "visual pill" of <xxx>

Also, character classes don't currently utilize parentheses for
anything. This is a good thing as you don't have to distinguish between
which parens are within assertions and which are without. Or do you
proposed that even the parens within assertions should capture to $0
and friends?

Mark A Biggar

May 28, 2005, 3:35:04 PM
to du...@pobox.com, Jeff 'japhy' Pinyan, Jonathan Scott Duff, perl6-l...@perl.org
I'm having a hard time coming up with examples where I need anything other than union and difference for character classes. Most of the predefined character classes are disjoint, so intersection is almost useless. So for now let's just stick with + and - and simple sets with no parens, unless we can come up with cases that really need anything more complicated.

--
Mark Biggar
ma...@biggar.org
mark.a...@comcast.net
mbi...@paypal.com

-------------- Original message --------------

> On Sat, May 28, 2005 at 12:58:01AM -0400, Jeff 'japhy' Pinyan wrote:
> >[ set notation for character classes ]
> >
> > What say you?
>
> Off the top of my head I think using & and | within character classes
> will cause confusion.
>
> / (<~(X & Y) | Z> | <Q & R>) & <M | N> /
>
> So much for the "visual pill" of <xxx>

Jeff 'japhy' Pinyan

May 29, 2005, 12:52:25 PM5/29/05
to Patrick R. Michaud, perl6-l...@perl.org
> On May 26, Patrick R. Michaud said:
>
>> <commit> N backtracking fails completely
>> <cut> N remove what matched up to this point from the
>> string
>> <after P> N we must be after the pattern P
>> <!after P> N we must NOT be after the pattern P
>> <before P> N we must be before the pattern P
>> <!before P> N we must NOT be before the pattern P
>>
>> As with ':words', etc., I'm not sure that these qualify as "tokens"
>> when parsing the regex -- the tokens are actually "<" or "<!" and

I'm curious if <commit> and <cut> "capture" anything. They don't start
with '?', so following the guidelines, it would appear they capture, but
that doesn't make sense. Should they be written as <?commit> and <?cut>,
or is the fact that they capture silently ignored because they're not
consuming anything?

Same thing with <null> and <prior>. And with <after P> and <before P>.
It should be assumed that <!after P> doesn't capture because it can only
capture if P matches, in which case <!after P> fails.

So, what's the deal?

Patrick R. Michaud

May 31, 2005, 12:43:54 PM5/31/05
to Jeff 'japhy' Pinyan, perl6-l...@perl.org
On Sun, May 29, 2005 at 12:52:25PM -0400, Jeff 'japhy' Pinyan wrote:
> I'm curious if <commit> and <cut> "capture" anything. They don't start
> with '?', so following the guidelines, it would appear they capture, but
> that doesn't make sense. Should they be written as <?commit> and <?cut>,
> or is the fact that they capture silently ignored because they're not
> consuming anything?
>
> Same thing with <null> and <prior>. And with <after P> and <before P>.
> It should be assumed that <!after P> doesn't capture because it can only
> capture if P matches, in which case <!after P> fails.
>
> So, what's the deal?

I'm not the language designer, but FWIW here is my interpretation.

First, we have to remember that "capture" now means more than just
grabbing characters from a string -- it also generates a successful
match and a corresponding match object. Thus, even though <after>,
<before>, <commit>, <cut>, and <null> are zero width assertions,
maybe they should still produce a corresponding match object
indicating a successful match. This might end up being useful in
alternations or other rule structures:

m/ [ abc <commit> def | ab ]/ ;
if $<commit> { say "we found 'abcdef'"; }

m/ [ abc | def <null> ]/;
if $<null> { say "we found 'def'"; }

I don't *know* that this would be useful, and certainly there are
other ways to achieve the same results, but keeping the same
capture semantics for zero-length assertions seems to work
out okay. Of course, to avoid the generation of the match objects
one can use <?commit>, <?cut>, <?null>, etc. I suspect that for the
majority of cases the choice of <commit> vs. <?commit> isn't going to
make a whole lot of difference, and for the places where it does make
a difference it's nice to preserve the interpretation being used by
other subrules.
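
For readers more used to Perl 5 style engines, the "zero-width capture as a branch marker" idea has a rough analogue in an empty named group. The sketch below is Python, not Perl 6, and rests on a loud assumption: the empty group named marker only stands in for a capturing zero-width assertion. Python's re module has no <commit>, so the backtracking control itself is not modelled.

```python
import re

# Assumption: an empty named group stands in for a capturing zero-width
# assertion. Only the "did this branch match?" bookkeeping is modelled here.
PATTERN = re.compile(r"(?:abc(?P<marker>)def|ab)")


def took_long_branch(text):
    """Return True if the 'abc ... def' alternative matched at the start."""
    m = PATTERN.match(text)
    # An empty group that participated yields "", one that did not yields None.
    return m is not None and m.group("marker") is not None


print(took_long_branch("abcdef"))
print(took_long_branch("ab"))
```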

Things could be a bit interesting from a performance/optimization
perspective; conceivably an optimizer could do a lot better for the
common case if we somehow declared that <null>, <commit>, <cut>, etc.
never capture. But I think the execution cost of capturing vs.
non-capturing in PGE is minimal relative to other considerations,
so we're a bit premature to try to optimize there. Overall I think
we'll be better off keeping things consistent for programmers at
the language level, and then build better/smarter optimizers into
the pattern matching engine to handle the common cases.

Pm

Larry Wall

May 31, 2005, 4:20:57 PM5/31/05
to perl6-l...@perl.org
On Thu, May 26, 2005 at 11:19:42AM -0500, Patrick R. Michaud wrote:
: Do we still have the <<rule>> syntax, or was that abandoned in
: favor of <?rule> ? (I know there are still some remnants of <<...>>
: in S05 and A05, but I'm not sure they're intentional.)

It's gone, though we're reserving it for something else we haven't
thought of yet. :-)

Larry

Patrick R. Michaud

May 31, 2005, 6:27:58 PM5/31/05
to perl6-l...@perl.org

Excellent. I'll start updating S05 and A05 to match.

While we're on the topic, can we also bless <+alnum-digit> as
the official syntax instead of <+<alnum>-<digit>> ? I can make
that change as well.

And I think I have another post brewing regarding the relationship
between character classes and subrules, but will save that for a bit
later.

Pm

Patrick R. Michaud

May 31, 2005, 7:17:10 PM5/31/05
to Jeff 'japhy' Pinyan, perl6-l...@perl.org
On Thu, May 26, 2005 at 11:19:42AM -0500, Patrick R. Michaud wrote:
> <$rule> N indirect rule
> <::$rulename> N indirect symbolic rule
> <@rules> N like '@rules'
> <%rules> N like '%rules'
> <{ code }> N code produces a rule
> <&foo()> N subroutine returns rule
> <( code )> N code must return true or backtracking ensues
>
> Here the leading tokens are actually "<$", "<::$", "<@", "<%", "<{", "<&",
> and "<(", and I suspect we have "<?$", "<?::$", "<?@", and "<!$", "<!::$",
> "<!@", etc. counterparts.

Oops. After re-reading A05 I would now assume we don't have
non-capturing counterparts for <$rule>, <@rules>, <%rules>, ... --
they're already non capturing. From A05:

[Update: Only rules of the form <ident> are captured by
default. You can use := to force capture of anything else,
or the :keepall adverb to capture everything else.]

Somewhere I thought I read that <$rule> captures to $/{'$rule'},
but I don't find it now, so if A05 holds here, then we don't
need "<?$", "<?@", "<?::$", etc. (Whew!) Somehow I much prefer
A05's formulation, in that only rules of the form <ident>
capture, and we use aliases or parentheses to capture anything
else.

Thus one can say that <+alpha>, <-alpha>, <[aeiou]>, <!alpha>, <?alpha>,
<$alpha>, <@alpha>, etc. are all non-capturing constructs.
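
For a parser that has to tag each assertion, that reading boils down to a very small test. The following Python sketch is hypothetical: captures_by_default is an invented helper, and it assumes "of the form <ident>" means a bare identifier between the angle brackets. Subrule calls with arguments, such as <after P>, are deliberately left out because the thread does not settle how they are classified.

```python
import re

# Assumption: "of the form <ident>" means the text inside the angle brackets
# is a bare identifier; anything with a leading sigil or prefix
# (? ! + - [ $ @ and so on) is treated as non-capturing.
IDENT_ASSERTION = re.compile(r"<[A-Za-z_]\w*>")


def captures_by_default(assertion):
    """Return True if the assertion is a plain <ident> subrule call."""
    return IDENT_ASSERTION.fullmatch(assertion) is not None


for token in ["<alpha>", "<?alpha>", "<!alpha>", "<+alpha>", "<-alpha>",
              "<[aeiou]>", "<$alpha>", "<@alpha>"]:
    print(token, captures_by_default(token))
```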

Pm

Jeff 'japhy' Pinyan

Jun 2, 2005, 12:52:36 AM6/2/05
to Patrick R. Michaud, perl6-l...@perl.org
Further woes, arguments, questions:

In regards to <@array>, A5 says "A leading @ matches like a bare array..."
but this is an over-generalization. A leading '@' merely indicates the
rule is found in an array. <@array[3]> would be the same as
<$fourth_element_of_array>, assuming those two values are identical.

Next, about <before RULE> and <after RULE>. What is the justification for
that syntax? There is no other example of a <-sequence with whitespace,
at least that I can see. It would appear "RULE" is an argument of sorts
to the 'before' and 'after' rules, but how do they access that argument?
How do I write a rule that takes an argument?

Patrick R. Michaud

Jun 2, 2005, 1:51:50 AM6/2/05
to Jeff 'japhy' Pinyan, perl6-l...@perl.org
On Thu, Jun 02, 2005 at 12:52:36AM -0400, Jeff 'japhy' Pinyan wrote:
> Further woes, arguments, questions:
>
> In regards to <@array>, A5 says "A leading @ matches like a bare array..."
> but this is an over-generalization. A leading '@' merely indicates the
> rule is found in an array. <@array[3]> would be the same as
> <$fourth_element_of_array>, assuming those two values are identical.

I'll leave this to the A05 authors to decide. :-) S05 doesn't
present any examples of subscripted rules or hashes, so perhaps
that particular syntax was reconsidered (i.e., I think one could
do <{ @array[3] }>).

> Next, about <before RULE> and <after RULE>. What is the justification for
> that syntax? There is no other example of a <-sequence with whitespace,
> at least that I can see. It would appear "RULE" is an argument of sorts
> to the 'before' and 'after' rules, but how do they access that argument?
> How do I write a rule that takes an argument?

According to A05, rules take arguments much the same way that
subs do. (In fact, it's *very* useful to think of rules as subs or
methods.) So, one can do:

rule myrule ($x) { \w+ $x }

and $x is scoped something like a subroutine parameter would be.

A05 also mentions several mechanisms for passing parameters to
a rule:

<myrule pattern> # same as calling myrule(/pattern/)
<myrule: text> # same as calling myrule(q<text>)
<myrule(expr)> # same as calling myrule(expr)

Of course, there are other "implicit" parameters that are given
to a rule -- the target string to be matched and an initial
starting position. But I think some of those details are still
being worked out.
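
The "rules are like subs" view can be sketched in another language as a function that returns a matcher. This is Python rather than Perl 6, and it assumes two things the thread does not state outright: that $x is a string matched literally, and that whitespace inside the rule body is insignificant (no :w modifier). The name arrow_rule is invented for the example.

```python
import re


# Rough analogue of:  rule myrule ($x) { \w+ $x }
# Assumptions: $x is a string matched literally, and whitespace in the rule
# body is insignificant (no :w modifier).
def myrule(x):
    """Build a matcher for one or more word characters followed by x."""
    return re.compile(r"\w+" + re.escape(x))


# The argument would arrive through one of the call forms listed above.
arrow_rule = myrule("=>")
print(bool(arrow_rule.fullmatch("key=>")))
print(bool(arrow_rule.fullmatch("key=")))
```

The implicit parameters (target string and starting position) show up here as the arguments to fullmatch, which is one reason the sub analogy is convenient.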

Pm

Patrick R. Michaud

Jun 2, 2005, 2:12:49 PM6/2/05
to TSa (Thomas Sandlaß), perl6-l...@perl.org
On Thu, Jun 02, 2005 at 09:14:33AM +0200, "TSa (Thomas Sandlaß)" wrote:
> Patrick R. Michaud wrote:
> >Of course, there are other "implicit" parameters that are given
> >to a rule -- the target string to be matched and an initial
> >starting position. But I think some of those details are still
> >being worked out.
>
> Wasn't it said that rules have the current match object/state
> as invocant? I would assume that everything else can be found
> through it? Actually the mnemonics that $/ is the match and
> methods on $?SELF are called with ./method fits. The only
> remaining thing is to define the method set of the Match class.

Alas, it doesn't seem to be quite that straightforward. Or maybe
it is, and I'm just not seeing it yet. So, I'll just "think out
loud" here for a bit...

If the current match object/state ($/) is the invocant of the rule,
then in order for rule inheritance to work properly $/ must be able
to be an instance of a Grammar. A05 explicitly recognizes this
possibility when it says "the state object may in fact be an
instance of a grammar class". If that's the case, we might not
need a separate C<Match> class, and we just place the methods needed
for inspecting "match objects" into the Grammar class.

But somehow my brain just has trouble accepting that applying a
rule to a target returns an "instance of a Grammar". The wording seems
all wrong, or perhaps I just need to adjust what I think of when I see the
word "Grammar".

Getting rule inheritance to work properly is a bit tricky. When
confronted with something like

"label: x = y + z" ~~ rx :w / (\w+) \: <Foo::expr> /

we have to create at least two match objects, and if we say that
the match objects are the rule invocants, then the grammar engine
has to be smart enough to recognize that the match object it creates
to use as the invocant of <Foo::expr> is an instance of grammar Foo.
Or, perhaps Foo::expr and all rules in Foo are really constructors of
some sort that build Foo objects--that seems more logical. But if we
say that Foo::expr is a sub or method that constructs Foo objects
(as opposed to having a pre-existing invocant), then Foo::expr needs
to have the target string and starting position available to it somehow,
as I mentioned in my previous message.

On another topic, what do we do with rules that aren't members
of a grammar? A05 says:

Within a closure, C<$_> represents the current state of the
current regex, and by extension, the current state of all
the regexes participating in the current match. (The type of
the state object is the current grammar class, which may be
an anonymous type if the current grammar has no name. If
the regex is not a member of a grammar, it's of type RULE.)

I suspect the first sentence is out of date -- that C<$_> above
is now really C<$/> (the match object). Since in the case of a bare
rule we don't have a "current grammar", what can we say about
the type of the state object beyond "it may be an anonymous type"?
I think the state object ought to have some sort of base type --
is it Grammar? Rule? If we say it's a "Rule", then we're
effectively saying that "applying a Rule to a target results
in a Rule object containing the state of the match", which just
sounds completely wrong to my ears/eyes (even though it may in
fact be correct).

Or perhaps all of this is to be resolved using roles, mixins, or
multiple superclasses. But to get back to my original statement
that "some of the details are still being worked out", I find
that A05 is somewhat speculative on many of the details of how
grammars, inheritance, state objects, and rules will interact,
and S05 is practically silent on the topic. So, there's definitely
some work to do.

And of course we have to figure out how to map all of this into
what Parrot has available, or update Parrot to provide what we need
to do this. :-)

I'll be appreciative of any illumination that others can provide
to the above, especially from @Larry.

Pm

Patrick R. Michaud

Jun 2, 2005, 5:17:33 PM6/2/05
to TSa (Thomas Sandlaß), perl6-l...@perl.org
On Thu, Jun 02, 2005 at 09:19:22PM +0200, "TSa (Thomas Sandlaß)" wrote:
> >I think the state object ought to have some sort of base type --
> >is it Grammar? Rule? If we say it's a "Rule", then we're
> >effectively saying that "applying a Rule to a target results
> >in a Rule object containing the state of the match", which just
> >sounds completely wrong to my ears/eyes (even though it may in
> >fact be correct).
>
> It sounds perfectly to my ears. You should think of ordinary
> subs as classes and calls to them as instances of that class.
> The environment created at runtime for such an invocation actually
> *is* a sub object. [...]

Okay, viewing it that way makes me quite a bit more comfortable with it.
I've also re-read parts of A05 (and S12 and S06) that seem to help
to clear things up a bit. Specifically, A05 says

In this case a rule is simply a method in a grammar class,
and a grammar class is any class derived implicitly or
explicitly from the universal [Rule] grammar class.

I take from this that

grammar Foo { ... }

creates a class Foo derived from Rule, and any rules declared in the
braces are (class) methods of Foo that create instances of Foo.
Combined with other statements, this also means that match objects are
instances of either Rule or the grammar method used to create them.

I also guess this means we really don't need a C<Grammar> type
as mentioned in S06, since that's being covered by Rule. Or
perhaps Grammar is an abstract subclass of Rule and the C<grammar>
statement derives new grammars from Grammar.
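
To make that reading concrete, here is how it might look as ordinary classes. This is a Python sketch of one interpretation, not anything A05 specifies: the helper _apply, the text method, and the choice to pass target and pos as explicit arguments are all assumptions made for illustration.

```python
import re


class Rule:
    """Stand-in for the universal base class; instances hold match state."""

    def __init__(self, target, start, end):
        self.target = target
        self.start = start
        self.end = end

    def text(self):
        return self.target[self.start:self.end]

    @classmethod
    def _apply(cls, pattern, target, pos):
        """Run a regex at pos; return an instance of cls, or None on failure."""
        m = re.compile(pattern).match(target, pos)
        return cls(target, pos, m.end()) if m else None

    @classmethod
    def digit(cls, target, pos=0):
        # A "built-in" rule living on the base class.
        return cls._apply(r"\d", target, pos)


class Foo(Rule):
    """Analogue of: grammar Foo { rule alpha { <[abcdef]> } }"""

    @classmethod
    def alpha(cls, target, pos=0):
        return cls._apply(r"[abcdef]", target, pos)


m = Foo.alpha("cab")
print(type(m).__name__, m.text())
print(type(Foo.digit("7")).__name__)
```

In this sketch a built-in rule called through Foo yields a Foo instance, while the same rule called on Rule yields a plain Rule, which is one way to read "instances of either Rule or the grammar".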

A05 also mentions

The built-in regex assertions like C<< <before \w> >> are
really just calls to methods in the [Rule] class.

which works well for me for the time being.

So, returning to invocants, and throwing in lexically scoped rules
to boot, consider the following. (This may more properly belong on
p6c, but since the thread started here I'll keep it here for a message
or two until it becomes obvious it belongs somewhere else.)

grammar Foo {
rule alpha { <[abcdef]> }
rule beta { <alpha>+ <digit>+ }
}

my rule alpha { <[uvwxyz]> }
my rule beta { <alpha>+ <digit>+ }

m :w / <beta> <Foo::beta> /

We have to get the code for each of the two "beta" rules to properly
dispatch to the appropriate grammar-scoped or lexically-scoped rule,
and get the appropriate invocant. The process appears to be:

1. If the subrule is scoped (e.g., <Foo::beta>), then dispatch
to Foo.beta(target,pos).
2. If the subrule is not scoped (e.g., <beta>), then look for
a lexically-scoped subrule of the same name -- i.e.,
beta(target, pos) and call that if it exists. Otherwise,
call $/.beta(target, pos).

In each case, the subrule returns a match object of the same type
as its "invocant".

Or something like that.
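
The two-step lookup above can be sketched as a small resolver. This is Python, not Perl 6 or PGE code, and everything named here (GRAMMARS, LEXICAL_RULES, resolve_subrule, current_grammar) is hypothetical. Rule bodies are kept as plain strings because only the lookup order is being modelled, not the matching.

```python
# Hypothetical registries: rule bodies are kept as plain strings here, since
# only the lookup order is being modelled, not matching.
GRAMMARS = {
    "Foo": {"alpha": "<[abcdef]>", "beta": "<alpha>+ <digit>+"},
}
LEXICAL_RULES = {"alpha": "<[uvwxyz]>", "beta": "<alpha>+ <digit>+"}


def resolve_subrule(name, lexical, grammars, current_grammar=None):
    """Return (origin, body) for a subrule name, following the two steps."""
    # Step 1: a scoped name such as Foo::beta goes straight to that grammar.
    if "::" in name:
        grammar_name, rule_name = name.rsplit("::", 1)
        return grammar_name, grammars[grammar_name][rule_name]
    # Step 2: an unscoped name prefers a lexically scoped rule ...
    if name in lexical:
        return "lexical", lexical[name]
    # ... and otherwise falls back to the invocant's grammar ($/.name).
    if current_grammar is not None and name in grammars[current_grammar]:
        return current_grammar, grammars[current_grammar][name]
    raise LookupError(f"no rule named {name!r}")


print(resolve_subrule("beta", LEXICAL_RULES, GRAMMARS))
print(resolve_subrule("Foo::beta", LEXICAL_RULES, GRAMMARS))
print(resolve_subrule("alpha", {}, GRAMMARS, current_grammar="Foo"))
```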

Pm
