grammar: difference between rule, token and regex

r...@giddyplanet.com

Jun 2, 2006, 4:48:05 PM

to perl6-i...@perl.org

Hi

I am toying around with Parrot and the compiler tools. The documentation
of Perl 6 grammars that I have been able to find only describes rule. But
the grammars in Parrot 0.4.4 for punie and APL use rule, token and regex
elements.

Can someone please clarify the difference between these three types, and
when you should use one or the other?

Kind Regards
Rene H. Møller

Jerry Gay

Jun 2, 2006, 4:56:55 PM

to Rene Hangstrup Møller, p6l, perl6-i...@perl.org

i'm forwarding this to p6l, as it's a language question and probably
best asked there. that said, the regex/token/rule change is a recent
one, and is documented in S05
(http://dev.perl.org/perl6/doc/design/syn/S05.html)

in particular, see the "Regexes really are regexes now" section, which
describes the differences. also, there are some recent threads on p6l
with regard to this topic, which you may find enlightening. you can
find these via google groups, or some other nntp archive.
~jerry

Patrick R. Michaud

Jun 2, 2006, 5:39:58 PM

to jerry gay, Rene Hangstrup Møller, p6l, perl6-i...@perl.org

On Fri, Jun 02, 2006 at 01:56:55PM -0700, jerry gay wrote:
> On 6/2/06, Rene Hangstrup Møller <r...@giddyplanet.com> wrote:
> >I am toying around with Parrot and the compiler tools. The documentation
> >of Perl 6 grammars that I have been able to find only describes rule. But
> >the grammars in Parrot 0.4.4 for punie and APL use rule, token and regex
> >elements.
> >
> >Can someone please clarify the difference between these three types, and
> >when you should use one or the other?
>
> i'm forwarding this to p6l, as it's a language question and probably
> best asked there. that said, the regex/token/rule change is a recent
> one, and is documented in S05
> (http://dev.perl.org/perl6/doc/design/syn/S05.html)

Jerry is correct that S05 is the place to look for information
on this. But to summarize an answer to your question:

- a C<regex> is a "normal" regular expression

- a C<token> is a regex with the :ratchet modifier set. The
:ratchet modifier disables backtracking by default, so that
a plain quantifier such as '*' or '+' will greedily match whatever
it can but won't backtrack if the remainder of the match fails.

- a C<rule> is a regex with both the :ratchet and :sigspace
modifiers set. The :sigspace modifier indicates that whitespace
in the rule should be replaced by an intertoken separator rule
such as <?ws> (a whitespace matching rule).

So,

    rule { a* c b+ }

is the same as

    token { <?ws> a* <?ws> c <?ws> b+ <?ws> }

is the same as

    regex { <?ws>: a*: <?ws>: c <?ws>: b+: <?ws> }
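The first of those rewrites (rule to token) is essentially a textual substitution, and it can be mimicked in a few lines of Python for anyone who wants to play with it. This is only a sketch under stated assumptions: the function name `sigspace` is made up for illustration, and it naively treats every run of whitespace in the body as significant, ignoring any special cases a real Perl 6 implementation handles.

```python
import re

def sigspace(body: str) -> str:
    """Replace each run of whitespace in a rule body with <?ws>.

    Hypothetical helper for illustration only; it does not parse the
    pattern, it just rewrites whitespace textually.
    """
    return re.sub(r"\s+", " <?ws> ", body).strip()

# The body of 'rule { a* c b+ }', including the spaces next to the braces.
rule_body = " a* c b+ "
token_body = sigspace(rule_body)
print("token { " + token_body + " }")
```

The printed line should have the same shape as the token form shown above.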

To answer your other question, about when to use each, here are
some rules of thumb (sorry for the pun):

- If the quantifiers in the rule need to do backtracking, use 'regex'

- If backtracking isn't needed, use 'token'

- If the components of the regex can have intertoken separators
between them, use 'rule' (and perhaps define a custom <ws> rule
that matches the language's idea of "intertoken separator").
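To get a feel for the first two rules of thumb without a Perl 6 implementation at hand, the effect of :ratchet can be approximated with Python's `re` module. This is an analogy, not PGE's behaviour: the pattern names are invented for the sketch, and the lookahead-plus-backreference idiom `(?=(a*))\1` is a stand-in that makes `a*` keep everything it matched, much as a ratcheted quantifier would.

```python
import re

# Ordinary backtracking, loosely analogous to 'regex { a* ab }'.
backtracking = re.compile(r"a*ab")

# Ratchet-like, loosely analogous to 'token { a* ab }': the lookahead
# captures the longest run of 'a' and the backreference consumes it,
# leaving the engine no shorter alternative to fall back on.
ratcheted = re.compile(r"(?=(a*))\1ab")

# A ratchet-like pattern that needs no backtracking to succeed,
# loosely analogous to 'token { a* b }'.
ratcheted_ok = re.compile(r"(?=(a*))\1b")

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None

print(matches(backtracking, "aaab"))  # a* gives back one 'a', so this matches
print(matches(ratcheted, "aaab"))     # a* keeps all three, so 'ab' cannot match
print(matches(ratcheted_ok, "aaab"))  # nothing needs to be given back
```

The second pattern is the case where 'regex' would be the right choice; the third is the case where 'token' is enough.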

Here's a quick contrived example to illustrate the difference:

    token identifier { <alpha> \w* }
    token integer    { \d+ }
    token value      { <identifier> | <integer> }
    token operator   { \+ | - | \* | / }
    rule expression  { <value> [ <operator> <value> ]* }
    rule assignment  { <identifier> \:= <expression> }

The "token" declarations all define regexes that do not match
any whitespace. Thus, "abc" is a valid identifier but " abc "
is not.

The "rule" declarations, however, allow for whitespace to occur
between each of the elements. Thus, each of the following
is a valid assignment in the above language, as the use of
"rule" tells us where whitespace is allowed in the match:

    b:=3+a*4
    b := 3 + a * 4
    b :=3 +a* 4
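For anyone who wants to check those three strings mechanically, here is a rough Python approximation of the grammar above. It is a sketch, not a translation: `WS` is a simplified stand-in for <?ws> (plain optional whitespace), `<alpha>` is assumed to mean an ASCII letter, the helper `rule` is invented for the example, and ratcheting is not modelled because none of these patterns depends on it.

```python
import re

WS = r"\s*"  # simplified stand-in for <?ws>; an assumption of this sketch

def rule(*parts):
    """Join parts with optional whitespace around and between them,
    roughly what :sigspace does to the whitespace in a rule body."""
    return WS + WS.join(parts) + WS

# "token" analogues: no whitespace is allowed anywhere inside these.
IDENTIFIER = r"[A-Za-z]\w*"   # assumes <alpha> is an ASCII letter
INTEGER = r"\d+"
VALUE = rf"(?:{IDENTIFIER}|{INTEGER})"
OPERATOR = r"[-+*/]"

# "rule" analogues: whitespace is allowed between the elements.
EXPRESSION = rule(VALUE, rf"(?:{rule(OPERATOR, VALUE)})*")
ASSIGNMENT = rule(IDENTIFIER, ":=", EXPRESSION)

def is_assignment(text):
    """Return True if the whole of text is an assignment."""
    return re.fullmatch(ASSIGNMENT, text) is not None

def is_identifier(text):
    """Return True if the whole of text is an identifier."""
    return re.fullmatch(IDENTIFIER, text) is not None

for line in ["b:=3+a*4", "b := 3 + a * 4", "b :=3 +a* 4"]:
    print(repr(line), is_assignment(line))

print(is_identifier("abc"), is_identifier(" abc "))
```

All three assignment strings should be accepted, while " abc " should be rejected as an identifier because the token analogue leaves no room for whitespace.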

I can come up with more examples if desired, but those are the basics
behind each.

Hope this helps,

Pm