I am toying around with Parrot and the compiler tools. The documentation
of Perl 6 grammars that I have been able to find only describes rule. But
the grammars in Parrot 0.4.4 for punie and APL use rule, token and regex
elements.
Can someone please clarify the difference between these three types, and
when you should use one or the other?
Kind Regards
Rene H. Møller
in particular, see the "Regexes really are regexes now" section, which
describes the differences. also, there are some recent threads on p6l
with regard to this topic, which you may find enlightening. you can
find these via google groups, or some other nntp archive.
~jerry
Jerry is correct that S05 is the place to look for information
on this. But to summarize an answer to your question:
- a C<regex> is a "normal" regular expression
- a C<token> is a regex with the :ratchet modifier set. The
:ratchet modifier disables backtracking by default, so that
a plain quantifier such as '*' or '+' will greedily match whatever
it can but won't backtrack if the remainder of the match fails.
- a C<rule> is a regex with both the :ratchet and :sigspace
modifiers set. The :sigspace modifier indicates that whitespace
in the rule should be replaced by an intertoken separator rule
such as <?ws> (a whitespace matching rule).
So,
    rule { a* c b+ }
is the same as
    token { <?ws> a* <?ws> c <?ws> b+ <?ws> }
is the same as
    regex { <?ws>: a*: <?ws>: c <?ws>: b+: <?ws> }
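For anyone who wants to see the effect of :ratchet without a Parrot build at hand, here is a rough analogy using Python's re module. It is only a sketch: Python has no :ratchet modifier, so the non-backtracking version is emulated with the lookahead-plus-backreference idiom, and the pattern and variable names are made up for illustration.

```python
import re

# Ordinary backtracking quantifier: roughly what a Perl 6 "regex" does.
backtracking = re.compile(r"a*ab")

# Emulated ratchet: the lookahead captures the longest run of a's and the
# backreference then consumes exactly that run, so nothing can be given back.
# (Assumption: this idiom stands in for "a*:" / token semantics.)
ratcheted = re.compile(r"(?=(a*))\1ab")

text = "aaab"
regex_matches = backtracking.fullmatch(text) is not None
token_matches = ratcheted.fullmatch(text) is not None

print("backtracking:", regex_matches)
print("ratcheted:   ", token_matches)
```

The backtracking pattern matches because a* hands one "a" back so that the trailing "ab" can succeed. The ratcheted pattern fails on the same input, since the a* has already taken all three a's and will not release any.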
To answer your other question, about when to use each, here are
some rules of thumb (sorry for the pun):
- If the quantifiers in the rule need to do backtracking, use 'regex'
- If backtracking isn't needed, use 'token'
- If the components of the regex can have intertoken separators
between them, use 'rule' (and perhaps define a custom <ws> rule
that matches the language's idea of "intertoken separator").
Here's a quick contrived example to illustrate the difference:
    token identifier { <alpha> \w* }
    token integer { \d+ }
    token value { <identifier> | <integer> }
    token operator { \+ | - | \* | / }
    rule expression { <value> [ <operator> <value> ]* }
    rule assignment { <identifier> \:= <expression> }
The "token" declarations all define regexes that do not match
any whitespace. Thus, "abc" is a valid identifier but " abc "
is not.
The rule declarations, however, allow for whitespace to occur
between each of the elements. Thus, each of the following
are valid assignments in the above language, as the use of
"rule" tells us where whitespace is allowed in the match:
    b:=3+a*4
    b := 3 + a * 4
    b :=3 +a* 4
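The same toy language can be approximated in Python to check which strings are accepted. This is a sketch under stated assumptions, not a translation of how PGE works: \s* stands in for <?ws> (the real rule is stricter about whitespace between adjacent word characters), [A-Za-z_] approximates <alpha>, and the constant and function names are invented for the example.

```python
import re

# Assumption: \s* is a simplified stand-in for <?ws>.
WS = r"\s*"

# "token" declarations: no whitespace is allowed inside these.
IDENTIFIER = r"[A-Za-z_]\w*"   # approximates <alpha> \w*
INTEGER = r"\d+"
VALUE = rf"(?:{IDENTIFIER}|{INTEGER})"
OPERATOR = r"[-+*/]"

# "rule" declarations: WS is spliced in wherever the rule contains whitespace.
EXPRESSION = rf"{WS}{VALUE}{WS}(?:{OPERATOR}{WS}{VALUE}{WS})*"
ASSIGNMENT = rf"{WS}{IDENTIFIER}{WS}:={WS}{EXPRESSION}"


def is_identifier(text):
    """True if the whole string is an identifier (token: no whitespace)."""
    return re.fullmatch(IDENTIFIER, text) is not None


def is_assignment(text):
    """True if the whole string is an assignment (rule: whitespace allowed)."""
    return re.fullmatch(ASSIGNMENT, text) is not None


for line in ["b:=3+a*4", "b := 3 + a * 4", "b :=3 +a* 4"]:
    print(repr(line), is_assignment(line))

print(repr("abc"), is_identifier("abc"))
print(repr(" abc "), is_identifier(" abc "))
```

All three assignment strings are accepted, while " abc " is rejected as an identifier because the token-style pattern has no place for whitespace.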
I can come up with more examples if desired, but that's the basics
behind each.
Hope this helps,
Pm