[svn:perl6-synopsis] r8883 - doc/trunk/design/syn


la...@cvs.perl.org

Apr 20, 2006, 5:07:51 AM

to perl6-l...@perl.org

Author: larry
Date: Thu Apr 20 02:07:51 2006
New Revision: 8883

Modified:
doc/trunk/design/syn/S05.pod

Log:
Various clarifications.
Documented that null first alternative is ignored.
Removed colon separator after last modifier, now just use space.
Deleted the :once modifier. (A state variable suffices.)
A match object in boolean context isn't always forced to be eager.
Added :ratchet and :panic modifiers to limit backtracking in the parser.
Clarified when rules are allowed vs enforced in variable usage.
Added <%a|%b|%c> form for simple longest-token scoping.
Clarified that hash matches skip over key before value is matched.
Documented behavior of $<KEY>.
Added *+ ++ ?+ and :+ to force greed on specific atom.
Added token and parse rule variants for grammar productions.
Added <<<...>>> syntax.

Modified: doc/trunk/design/syn/S05.pod
==============================================================================
--- doc/trunk/design/syn/S05.pod (original)
+++ doc/trunk/design/syn/S05.pod Thu Apr 20 02:07:51 2006
@@ -11,11 +11,11 @@

=head1 VERSION

- Maintainer: Patrick Michaud <pmic...@pobox.com>
+ Maintainer: Patrick Michaud <pmic...@pobox.com> (& TimToady)
Date: 24 Jun 2002
- Last Modified: 6 Apr 2006
+ Last Modified: 20 Apr 2006
Number: 5
- Version: 15
+ Version: 16

This document summarizes Apocalypse 5, which is about the new regex
syntax. We now try to call them "rules" because they haven't been
@@ -30,8 +30,8 @@
it doesn't look like it. The individual capture variables (such as C<$0>,
C<$1>, etc.) are just elements of C<$/>.

-By the way, the numbered capture variables now start at C<$0>, C<$1>,
-C<$2>, etc. See below.
+By the way, the numbered capture variables now start at C<$0> rather than
+C<$1>. See below.

=head1 Unchanged syntactic features

@@ -68,6 +68,8 @@
=item *

The extended syntax (C</x>) is no longer required...it's the default.
+(In fact, it's pretty much mandatory--the only way to get back to
+the old syntax is with the C<:Perl5>/C<:P5> modifier.)

=item *

@@ -78,7 +80,11 @@

There is no C</e> evaluation modifier on substitutions; instead use:

- s/pattern/{ code() }/
+ s/pattern/{ doit() }/
+
+Instead of C</ee> say:
+
+ s/pattern/{ eval doit() }/

=item *

@@ -87,8 +93,9 @@
m:g:i/\s* (\w*) \s* ,?/;

Every modifier must start with its own colon. The delimiter must be
-separated from the final modifier by a colon or whitespace if it would
-be taken as an argument to the preceding modifier.
+separated from the final modifier by whitespace if it would be taken
+as an argument to the preceding modifier (which is true for any
+bracketing character).

=item *

@@ -127,19 +134,13 @@

is roughly equivalent to

- m:p/.*? pattern/
-
-=item *
-
-The new C<:once> modifier replaces the Perl 5 C<?...?> syntax:
+ m:p/.*? <( pattern )> /

- m:once/ pattern / # only matches first time
+Also note that any rule called as a subrule is implicitly anchored to the
+current position anyway.

=item *

-[Note: We're still not sure if :w is ultimately going to work exactly
-as described below. But this is how it works for now.]
-
The new C<:w> (C<:words>) modifier causes whitespace sequences to be
replaced by C<\s*> or C<\s+> subpattern as defined by the C<< <?ws> >> rule.

@@ -164,6 +165,9 @@
C<< <?ws> >> can't decide what to do until it sees the data. It still does
the right thing. If not, define your own C<< <?ws> >> and C<:w> will use that.

+In general you don't need to use C<:w> within grammars because
+the parse rules automatically handle whitespace policy for you.
+
=item *

New modifiers specify Unicode level:
@@ -177,9 +181,9 @@

=item *

-The new C<:perl5> modifier allows Perl 5 regex syntax to be used instead:
+The new C<:Perl5> modifier allows Perl 5 regex syntax to be used instead:

- m:perl5/(?mi)^[a-z]{1,2}(?=\s)/
+ m:Perl5/(?mi)^[a-z]{1,2}(?=\s)/

(It does not go so far as to allow you to put your modifiers at
the end.)
@@ -194,16 +198,16 @@
If followed by an C<x>, it means repetition. Use C<:x(4)> for the
general form. So

- s:4x { (<?ident>) = (\N+) $$}{$0 => $1};
+ s:4x [ (<?ident>) = (\N+) $$] [$0 => $1];

is the same as:

- s:x(4) { (<?ident>) = (\N+) $$}{$0 => $1};
+ s:x(4) [ (<?ident>) = (\N+) $$] [$0 => $1];

which is almost the same as:

$_.pos = 0;
- s:c{ (<?ident>) = (\N+) $$}{$0 => $1} for 1..4;
+ s:c [ (<?ident>) = (\N+) $$] [$0 => $1] for 1..4;

except that the string is unchanged unless all four matches are found.
However, ranges are allowed, so you can say C<:x(1..4)> to change anywhere
@@ -250,10 +254,15 @@
$str = "abracadabra";

if $str ~~ m:exhaustive/ a (.*) a / {
- @substrings = $/.matches(); # br brac bracad bracadabr
- # c cad cadabr d dabr br
+ say "@()"; # br brac bracad bracadabr c cad cadabr d dabr br
}

+Note that the C<~~> above can return as soon as the first match is found,
+and the rest of the matches may be performed lazily by C<@()>.
+
+[Conjecture: the C<:exhaustive> modifier should have an optional argument
+specifying how many seconds to run before giving up, since it's trivially
+easy to ask for the heat death of the universe to happen first.]
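
The lazy behavior noted above can be sketched outside Perl 6. The Python generator below is only an illustration, not how any rule engine is implemented: the name `exhaustive_a_dot_a` is hypothetical, and the order of the results simply follows the comment in the example.

```python
def exhaustive_a_dot_a(text):
    """Lazily yield every capture of (.*) for the pattern a (.*) a."""
    starts = [i for i, ch in enumerate(text) if ch == "a"]
    for i in starts:
        for j in starts:
            if j > i:
                # The capture is whatever lies strictly between the two a's.
                yield text[i + 1:j]

matches = exhaustive_a_dot_a("abracadabra")
first = next(matches, None)   # enough to answer the boolean question
rest = list(matches)          # the remaining matches, produced only on demand
```

The boolean test needs only `first`; the other nine matches are computed when the list is actually requested.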

=item *

@@ -275,7 +284,24 @@

=item *

-The C<:i>, C<:w>, C<:perl5>, and Unicode-level modifiers can be
+The new C<:ratchet> modifier causes this rule to not backtrack by default.
+(Generally you do not use this modifier directly, since it's implied by
+C<token> and C<parse> declarations.) The effect of this modifier is
+to imply a C<:> after every construct that could backtrack, including
+bare C<*>, C<+>, and C<?> quantifiers, as well as alternations.
+
+=item *
+
+The new C<:panic> modifier causes this rule and all invoked subrules
+to try to backtrack on any rules that would otherwise default to
+not backtracking because they have C<:ratchet> set. Never panic
+unless you're desperate and want the pattern matcher to do a lot of
+unnecessary work. If you have an error in your grammar, it's almost
+certainly a bad idea to fix it by backtracking.
+
+=item *
+
+The C<:i>, C<:w>, C<:Perl5>, and Unicode-level modifiers can be
placed inside the rule (and are lexically scoped):

m/:w alignment = [:i left|right|cent[er|re]] /
@@ -297,7 +323,6 @@
To use parens or brackets for your delimiters you have to separate:

m:fuzzy (pattern);
- m:fuzzy:(pattern);

or you'll end up with:

@@ -346,7 +371,10 @@

=item *

-An unescaped C<#> now always introduces a comment.
+An unescaped C<#> now always introduces a comment. If followed
+by an opening bracket character (and if not in the first column),
+it introduces an embedded comment that terminates with the closing
+bracket. Otherwise the comment terminates at the newline.

=item *

@@ -438,7 +466,7 @@
so that the closure is never actually run in that case. But it's
a closure that must be run in the general case, so you can use
it to generate a range on the fly based on the earlier matching.
-(Of course, bear in mind the closure is run I<before> attempting to
+(Of course, bear in mind the closure must be run I<before> attempting to
match whatever it quantifies.)

=item *
@@ -473,7 +501,9 @@

/ \Q$var\E /

-(To get rule interpolation use an assertion - see below)
+However, if C<$var> contains a rule object, rather than attempting to
+convert it to a string, it is called as if you said C<< <$var> >>.
+See assertions below.

=item *

@@ -486,7 +516,8 @@
/ [ @cmds[0] | @cmds[1] | @cmds[2] | ... ] /

-As with a scalar variable, each element is matched as a literal.
+As with a scalar variable, each element is matched as a literal unless
+it happens to be a rule object, in which case it is matched as a subrule.

=item *

@@ -503,15 +534,23 @@

=item *

-If it is a string or rule object, it is executed as a subrule.
+If it is a string, it is matched literally, starting after where the
+key left off matching.

=item *

-If it has the value 1, nothing special happens beyond the match.
+If it is a rule object, it is executed as a subrule, with an initial
+position after the matched key.

=item *

-Any other value causes the match to fail.
+If it has the value 1, nothing special happens except that the key match
+succeeds.
+
+=item *
+
+Any other value causes the match to fail. In particular, shorter keys
+are not tried if a longer one matches and fails.

=back
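
The four value cases listed above can be modeled in a few lines of Python. This is a rough sketch under stated assumptions, not the rule engine's algorithm: `match_hash` and `digits` are hypothetical names, a "rule object" is stood in for by any callable taking `(text, pos)`, and positions are plain integer offsets.

```python
def match_hash(text, pos, table):
    """Return the position after a successful match, or None on failure."""
    # Only the longest key that matches at pos is considered.
    keys = [k for k in table if text.startswith(k, pos)]
    if not keys:
        return None
    key = max(keys, key=len)
    after_key = pos + len(key)
    value = table[key]
    if isinstance(value, str):
        # A string value is matched literally, starting after the key.
        return after_key + len(value) if text.startswith(value, after_key) else None
    if callable(value):
        # A rule object runs as a subrule, starting after the key.
        return value(text, after_key)
    if value == 1:
        # The value 1 means the key match alone succeeds.
        return after_key
    # Any other value fails; shorter keys are not retried.
    return None

def digits(text, pos):
    """Stand-in subrule: one or more digits."""
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    return end if end > pos else None

table = {"foo": 1, "food": 0, "x": "yz", "n": digits}
```

With this table, `match_hash("food", 0, table)` fails even though `foo` would have matched, because the longer key `food` was selected and its value is not one of the accepted forms.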

@@ -547,6 +586,11 @@
tree and looking for things in the opposite order going to the left.
It is illegal to do lookbehind on a pattern that cannot be reversed.

+Note: the effect of a forward-scanning lookbehind at the top level
+can be achieved with:
+
+ / .*? prestuff <( mainpat >) /
+
=item *

A leading C<?> causes the assertion not to capture what it matches (see
@@ -556,28 +600,66 @@
/ <?ident> <ws> / # only $/<ws> captured
/ <?ident> <?ws> / # nothing captured

+The non-capturing behavior may be overridden with a C<:keepall>.
+
=item *

A leading C<$> indicates an indirect rule. The variable must contain
-either a hard reference to a rule, or a string containing the rule.
+either a rule object, or a string to be compiled as the rule. The
+string is never matched literally.

=item *

A leading C<::> indicates a symbolic indirect rule:

- / <::($somename)>
+ / <::($somename)> /

-The variable must contain the name of a rule.
+The variable must contain the name of a rule. By the rules of single method
+dispatch this is first searched for in the current grammar and its ancestors.
+If this search fails an attempt is made to dispatch via MMD, in which case
+it can find rules defined as multis rather than methods.

=item *

A leading C<@> matches like a bare array except that each element
-is treated as a rule (string or hard ref) rather than as a literal.
+is treated as a rule (string or rule object) rather than as a literal.
+That is, a string is forced to be compiled as a rule rather than matched
+literally. (There is no difference for a rule object.)

=item *

-A leading C<%> matches like a bare hash except that each key
-is treated as a rule (string or hard ref) rather than as a literal.
+A leading C<%> matches like a bare hash except that each value is always
+treated as a rule, even if it is a string that must be compiled to a rule
+at match time.
+
+With both bare hash and hash in angles, the key is always skipped
+over before calling any rule in the value. That rule may, however,
+magically access the key anyway as if the rule had started before the
+key and matched with C<< <KEY> >> assertion. That is, C<< $<KEY> >>
+will contain the keyword or token that this rule was looked up under,
+and that value will be returned by the current match object even if
+you do nothing special with it within the match. (This also works
+for the name of a macro as seen from an C<is parsed> rule, since
+internally that turns into a hash lookup.)
+
+As with bare hash, the longest key matches according to the longest token
+rule, but in addition, you may combine multiple hashes under the same
+longest-token consideration like this:
+
+ <%statement|%prefix|%term>
+
+This means that, despite being in a later hash, C<< %term<food> >>
+will be selected in preference to C<< %prefix<foo> >> because it's
+the longer token. However, if there is a tie, the earlier hash wins,
+so C<< %statement<if> >> hides any C<< %prefix<if> >> or C<< %term<if> >>.
+
+In contrast, if you say
+
+ [ <%prefix> | <%term> ]
+
+a C<< %prefix<foo> >> would be selected in preference to a C<< %term<food> >>.
+(Which is not what you usually want if your language is to do longest-token
+consistently.)
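
The difference between the combined form and the bracketed alternation can be sketched in Python. The helper names `longest_token` and `first_table_match` are hypothetical, the tables are ordinary dicts, and the sketch assumes the bracketed alternation simply takes the first alternative that matches.

```python
def longest_token(text, pos, tables):
    """Model of <%a|%b|%c>: longest key across all tables, earlier table wins ties."""
    best = None
    for name, table in tables:
        for key in table:
            if text.startswith(key, pos):
                # Only a strictly longer key displaces the current best,
                # so on a tie the earlier table keeps its claim.
                if best is None or len(key) > len(best[1]):
                    best = (name, key)
    return best

def first_table_match(text, pos, tables):
    """Model of [ <%a> | <%b> ]: the first table with any matching key wins."""
    for name, table in tables:
        keys = [k for k in table if text.startswith(k, pos)]
        if keys:
            return (name, max(keys, key=len))
    return None

tables = [
    ("statement", {"if": 1}),
    ("prefix", {"foo": 1, "if": 1}),
    ("term", {"food": 1, "if": 1}),
]
```

Here `longest_token("food", 0, tables)` picks the `term` entry, while `first_table_match("food", 0, tables[1:])` stops at the `prefix` entry `foo`.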

=item *

@@ -592,7 +674,7 @@
rule closure binds the I<result object> for this match, ignores the
rest of the current rule, and reports success:

- / (\d) <{ return $0.sqrt }> NotReached /;
+ / (\d) <{ return $0.sqrt }> NotReached /;

This has the effect of capturing the square root of the numified string,
instead of the string. The C<NotReached> part is not reached.
@@ -654,14 +736,16 @@
/ <after foo> \d+ <before bar> /

except that the scan for "foo" can be done in the forward direction,
-while a lookbehind assertion would presumably scan for \d+ and then
-match "foo" backwards. The use of C<< <(...)> >> affects only the
+while a lookbehind assertion would presumably scan for C<\d+> and then
+match "C<foo>" backwards. The use of C<< <(...)> >> affects only the
meaning of the "result object" and the positions of the beginning and
ending of the match. That is, after the match above, C<$()> contains
only the digits matched, and C<.pos> is pointing to after the digits.
Other captures (named or numbered) are unaffected and may be accessed
through C<$/>.
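
For readers more familiar with other regex dialects, the narrowing effect described here resembles reading the span of a capture group. The Python sketch below assumes a pattern of the shape `foo <( \d+ )> bar`; the sample text is made up, and Python's `re` module stands in for the rule engine.

```python
import re

text = "xx foo123bar yy"

# The whole pattern scans forward over foo, the digits and bar;
# group 1 plays the role of the <( ... )> markers.
m = re.search(r"foo(\d+)bar", text)
result = m.group(1)        # the "result object": only the digits
start, end = m.span(1)     # where the narrowed match begins and ends

# The lookaround spelling that the text compares it to.
m2 = re.search(r"(?<=foo)\d+(?=bar)", text)
```

Both spellings report the same digits and the same offsets; the first one never has to match `foo` backwards.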

+It is a syntax error to use an unbalanced C<< <( >> or C<< )> >>.
+
=item *

A leading C<[> or C<+> indicates an enumerated character class. Ranges
@@ -717,6 +801,17 @@

/ <!before _ > / # We aren't before an _

+Note that C<< <!alpha> >> is different from C<< <-alpha> >> because the
+latter matches C</./> when it is not an alpha.
+
+=item *
+
+Conjecture: Multiple opening angles are matched by a corresponding
+number of closing angles, and otherwise function as single angles.
+This can be used to visually isolate unmatched angles inside:
+
+ <<<Ccode: a >> 1>>>
+
=back

=head1 Backslash reform
@@ -904,6 +999,49 @@
causes it to produce a C<Code> or C<Rule> reference, which the switch
statement then selects upon.

+=item *
+
+Just as C<rx> has variants, so does the C<rule> declarator.
+In particular, there are two special variants for use in grammars:
+C<token> and C<parse>.
+
+A token declaration:
+
+ token ident { [ <alpha> | _ ] \w+ }
+
+never backtracks by default. That is, it likes to commit to whatever
+it has scanned so far. The above is equivalent to
+
+ rule ident { [ <alpha>: | _ ]: \w+: }
+
+but rather easier to read. The bare C<*>, C<+> and C<?> quantifiers
+never backtrack in a C<token> unless some outer rule has specified a
+C<:panic> option that applies. If you want to prevent even that, use
+C<*:>, C<+:> or C<?:> to prevent any backtracking into the quantifier.
+If you want to explicitly backtrack, append either a C<?> or a C<+>
+to the quantifier. The C<?> forces minimal matching as usual,
+while the C<+> forces greedy matching. The C<token> declarator is
+really just short for
+
+ rule :ratchet { ... }
+
+The other is the C<parse> declarator, for declaring non-terminal
+productions in a grammar. It also does not backtrack unless a
+C<:panic> is in effect or you explicitly specify a backtracking
+quantifier. In addition, a C<parse> rule also assumes C<:words>.
+A C<parse> is really short for:
+
+ rule :ratchet :words { ... }
+
+=item *
+
+The Perl 5 C<?...?> syntax ("match once") was rarely used and can be
+now emulated more cleanly with a state variable:
+
+ (state $x) ||= / pattern /; # only matches first time
+
+To reset the pattern, simply set C<$x = 0>.
+
=back

=head1 Backtracking control
@@ -912,14 +1050,40 @@

=item *

+By default, backtracking is greedy in C<rx>, C<m>, C<s>, and the
+like. It's also greedy in ordinary rules. In C<parse> and C<token>
+declarations, backtracking must be explicit.
+
+=item *
+
+To force the preceding atom to do eager backtracking,
+append a C<:?> or C<?> to the atom. If the preceding token is
+a quantifier, the C<:> may be omitted, so C<*?> works just as
+in Perl 5.
+
+=item *
+
+To force the preceding atom to do greedy backtracking,
+append a C<:+> or C<+> to the atom. If the preceding token
+is a quantifier, the C<:> may be omitted. (Perl 5 has no
+corresponding construct because backtracking always defaults
+to greedy in Perl 5.)
+
+=item *
+
+To force the preceding atom to do no backtracking, use a single C<:>
+without a subsequent C<?> or C<+>.
Backtracking over a single colon causes the rule engine not to retry
the preceding atom:

- m:w/ $ <expr> [ , <expr> ]* : $ /
+ m:w/ $ <expr> [ , <expr> ]*: $ /

(i.e. there's no point trying fewer C<< <expr> >> matches, if there's
no closing parenthesis on the horizon)

+To force all the atoms in an expression not to backtrack by default,
+use C<:ratchet> or C<parse> or C<token>.
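
A toy matcher makes the effect of the single colon concrete. The Python sketch below is an illustration only: `match_star_then` is a hypothetical helper that handles just the shape `x* tail` (or `x*: tail` when `ratchet` is true), anchored at the start of the string.

```python
def match_star_then(text, star_char, tail, ratchet):
    """Match star_char* followed by the literal tail, anchored at position 0.

    Returns the end position of the match, or None if it fails.
    """
    greedy = 0
    while greedy < len(text) and text[greedy] == star_char:
        greedy += 1
    # With ratchet the star keeps its greedy count; otherwise it may
    # give characters back one at a time.
    counts = [greedy] if ratchet else range(greedy, -1, -1)
    for count in counts:
        if text.startswith(tail, count):
            return count + len(tail)
    return None
```

Against `"aaab"`, the backtracking form of `a* ab` succeeds after giving back one `a`, while the ratcheted form fails because the star has already consumed every `a`.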
+
=item *

Backtracking over a double colon causes the surrounding group of
@@ -931,8 +1095,12 @@
]
/

-(i.e. there's no point trying to match a different keyword if one
-was already found but failed).
+(i.e. there's no point trying to match a different keyword if one was
+already found but failed). Note that you can still back into such an
+alternation, so you may also need to put C<:> after it if you also
+want to disable that. If an explicit or implicit C<:ratchet> has
+disabled backtracking, you need to put C<:+> after the alternation
+to enable backing into another alternative if the first pick fails.

=item *

@@ -993,9 +1161,10 @@

=item *

-...so too you can have anonymous rules and I<named> rules:
+...so too you can have anonymous rules and I<named> rules (and tokens,
+and parses):

- rule ident { [<alpha>|_] \w* }
+ token ident { [<alpha>|_] \w* }

# and later...

@@ -1007,11 +1176,11 @@
such as:

rule serial_number { <[A..Z]> \d**{8} }
- rule type { alpha | beta | production | deprecated | legacy }
+ token type { alpha | beta | production | deprecated | legacy }

in other rules as named assertions:

- rule identification { [soft|hard]ware <type> <serial_number> }
+ parse identification { [soft|hard]ware <type> <serial_number> }

=back

@@ -1049,6 +1218,10 @@

This makes it easier to catch errors like this:

+ /a|b|c|/
+
+As a special case, however, the first null alternative in a match like
+
m:w/ [
| if :: <expr> <block>
| for :: <list> <block>
@@ -1056,6 +1229,19 @@
]
/

+is simply ignored. Only the first alternative is special that way.
+If you write:
+
+ m:w/ [
+ if :: <expr> <block> |
+ for :: <list> <block> |
+ loop :: <loop_controls>? <block> |
+ ]
+ /
+
+
+it's still an error.
+
=item *

However, it's okay for a non-null syntactic construct to have a degenerate
@@ -1099,6 +1285,10 @@
# or:
/pattern/; if $/ {...}

+With C<:global> or C<:overlap> or C<:exhaustive> the boolean is
+allowed to return true on the first match. The C<Match> object can
+produce the rest of the results lazily if evaluated in list context.
+
=item *

In string context it evaluates to the stringified value of its
@@ -1121,7 +1311,7 @@

=item *

-When used as a scalar, a Match object evaluates to its underlying
+When used as a scalar, a C<Match> object evaluates to its underlying
result object. Usually this is just the entire match string, but
you can override that by calling C<return> inside a rule:

@@ -1146,7 +1336,7 @@
Additionally, the C<Match> object delegates its C<coerce> calls
(such as C<+$match> and C<~$match>) to its underlying result object.
The only exception is that C<Match> handles boolean coercion itself,
-which returns whether the match had succeeded.
+which returns whether the match had succeeded at least once.

This means that these two work the same:

@@ -1155,7 +1345,7 @@

=item *

-When used as an array, a Match object pretends to be an array of all
+When used as an array, a C<Match> object pretends to be an array of all
its positional captures. Hence

($key, $val) = m:w/ (\S+) => (\S+)/;
@@ -1179,11 +1369,13 @@

Note that, as a scalar variable, C<$/> doesn't automatically flatten
in list context. Use C<@()> as a shorthand for C<@($/)> to flatten
-the positional captures under list context.
+the positional captures under list context. Note that a C<Match> object
+is allowed to evaluate its match lazily in list context. Use C<**@()>
+to force an eager match.

=item *

-When used as a hash, a Match object pretends to be a hash of all its named
+When used as a hash, a C<Match> object pretends to be a hash of all its named
captures. The keys do not include any sigils, so if you capture to
variable C<< @<foo> >> its real name is C<$/{'foo'}> or C<< $/<foo> >>.
However, you may still refer to it as C<< @<foo> >> anywhere C<$/>
@@ -1192,7 +1384,8 @@

Note that, as a scalar variable, C<$/> doesn't automatically flatten
in list context. Use C<%()> as a shorthand for C<%($/)> to flatten as a
-hash, or bind it to a variable of the appropriate type.
+hash, or bind it to a variable of the appropriate type. As with C<@()>,
+it's possible for C<%()> to produce its pairs lazily in list context.

=item *

@@ -1240,7 +1433,7 @@
incomplete C<Match> object (which can be modified via the internal C<$/>).
For example:

- $str ~~ / foo # Match 'foo'
+ $str ~~ / foo # Match 'foo'
{ $/ = 'bar' } # But pretend we matched 'bar'
/;
say $/; # says 'bar'
@@ -1556,7 +1749,9 @@

=item *

-Any call to a named C<< <rule> >> within a pattern is known as a I<subrule>.
+Any call to a named C<< <rule> >> within a pattern is known as a
+I<subrule>, whether that rule is actually defined as a C<rule> or
+C<token> or C<parse> or even an ordinary C<method> or C<multi>.

=item *

@@ -1599,9 +1794,9 @@
=item *

The hash entries of a C<Match> object can be referred to using any of the
-standard hash access notations (C<$/{'foo'}>, C<< $/<bar> >>, C<$/«baz»>,
+standard hash access notations (C<$/{'foo'}>, C<< $/<bar> >>, C<$/«baz»>,
etc.), or else via corresponding lexically scoped aliases (C<< $<foo> >>,
-C<$«bar»>, C<< $<baz> >>, etc.) So the previous example also implies:
+C<$«bar»>, C<< $<baz> >>, etc.) So the previous example also implies:

# $<ident> $0<ident>
# __^__ __^__
@@ -2334,10 +2529,10 @@
so too a grammar can collect a set of named rules together:

grammar Identity {
- rule name :w { Name = (\N+) }
- rule age :w { Age = (\d+) }
- rule addr :w { Addr = (\N+) }
- rule desc {
+ parse name { Name = (\N+) }
+ parse age { Age = (\d+) }
+ parse addr { Addr = (\N+) }
+ parse desc {
<name> \n
<age> \n
<addr> \n
@@ -2351,22 +2546,22 @@
Like classes, grammars can inherit:

grammar Letter {
- rule text { <greet> <body> <close> }
+ parse text { <greet> <body> <close> }

- rule greet :w { [Hi|Hey|Yo] $<to>:=(\S+?) , $$}
+ parse greet { [Hi|Hey|Yo] $<to>:=(\S+?) , $$}

- rule body { <line>+ }
+ parse body { <line>+? }

- rule close :w { Later dude, $<from>:=(.+) }
+ parse close { Later dude, $<from>:=(.+) }

# etc.
}

grammar FormalLetter is Letter {

- rule greet :w { Dear $<to>:=(\S+?) , $$}
+ parse greet { Dear $<to>:=(\S+?) , $$}

- rule close :w { Yours sincerely, $<from>:=(.+) }
+ parse close { Yours sincerely, $<from>:=(.+) }

}

@@ -2382,14 +2577,15 @@

grammar Perl { # Perl's own grammar

- rule prog { <statement>* }
+ parse prog { <statement>* }

- rule statement { <decl>
+ parse statement {
+ | <decl>
| <loop>
| <label> [<cond>|<sideff>|;]
}

- rule decl { <sub> | <class> | <use> }
+ parse decl { <sub> | <class> | <use> }

# etc. etc. etc.
}
@@ -2439,7 +2635,7 @@

$str.trans( %mapping.pairs.sort );

-Use the .= form to do a translation in place:
+Use the C<.=> form to do a translation in place:

$str.=trans( %mapping.pairs.sort );

Daniel Hulme

Apr 20, 2006, 6:30:10 AM

to perl6-l...@perl.org

> +but rather easier to read. The bare C<*>, C<+> and C<?> quantifiers
> +never backtrack in a C<token> unless some outer rule has specified a
> +C<:panic> option that applies. If you want to prevent even that, use
> +C<*:>, C<+:> or C<?:> to prevent any backtracking into the quantifier.
> +If you want to explicitly backtrack, append either a C<?> or a C<+>
> +to the quantifier. The C<?> forces minimal matching as usual,
> +while the C<+> forces greedy matching. The C<token> declarator is
> +really just short for
> +
> + rule :ratchet { ... }
> +
> +The other is the C<parse> declarator, for declaring non-terminal
> +productions in a grammar. It also does not backtrack unless a
> +C<:panic> is in effect or you explicitly specify a backtracking
> +quantifier. In addition, a C<parse> rule also assumes C<:words>.

I really don't like the second-to-last sentence above ("It also does not...").
It took me a few reads-through to parse it, and it sounds like it means, "Like
C<token>, it does not backtrack unless a C<:panic> is in effect. In addition, it
does not backtrack if you explicitly specify a backtracking quantifier."

Perhaps you could reword the end of that paragraph as:

>>>
Like C<token>, it only backtracks when a C<:panic> is in effect or when you
explicitly specify a backtracking quantifier. Unlike C<token>, it also assumes
C<:words>, making it equivalent to

rule :ratchet :words { ... }
<<<

--
You can't run away forever, but there's nothing wrong with getting a
good head start. You want to shut out the night, you want to shut down
the sun, you want to shut away the pieces of a broken heart.
`Rock and Roll Dreams Come True' (Steinman) http://surreal.istic.org/

Audrey Tang

Apr 20, 2006, 10:08:54 AM

to la...@cvs.develooper.com, perl6-l...@perl.org

la...@cvs.perl.org wrote:
> +=item *
> +
> +Just as C<rx> has variants, so does the C<rule> declarator.
> +In particular, there are two special variants for use in grammars:
> +C<token> and C<parse>.

After a brief discussion on #perl6 with pmichaud and Juerd, it seems
that a verb "parse" in the same space as "sub"/"method"/"rule" feels
quite confusing:

grammar Foo {
parse moose {...}; # calling &parse?
}
my $elk = parse {...}; # calling &parse?

We feel that the token:w form is short enough and better reflects the
similarity:

grammar Foo {
token moose :w {...}
}
my $elk = token:w {...};

If further huffmanization is highly desired, how about allowing adverbs
at the beginning of token/rule forms?

grammar Foo {
token:w moose {...};
rule:P5 foo {...};
}

That would make it stand out, without further consuming the reserved
word space.

Thanks,
Audrey

Patrick R. Michaud

Apr 20, 2006, 10:24:09 AM

to la...@cvs.perl.org, perl6-l...@perl.org

First, let me say I really like the changes to S05. Good work
once again.

Here are my questions and comments.

On Thu, Apr 20, 2006 at 02:07:51AM -0700, la...@cvs.perl.org wrote:
> -(To get rule interpolation use an assertion - see below)
> +However, if C<$var> contains a rule object, rather than attempting to
> +convert it to a string, it is called as if you said C<< <$var> >>.

Does this mean it's a capturing rule? Or is it called as
if one had said C<< <?var> >>? (I would prefer it default
to non-capturing.)

> +If it is a string, it is matched literally, starting after where the
> +key left off matching.

> ..

> +If it is a rule object, it is executed as a subrule, with an initial
> +position after the matched key.

> ..

> +If it has the value 1, nothing special happens except that the key match
> +succeeds.

> ..

> +Any other value causes the match to fail. In particular, shorter keys
> +are not tried if a longer one matches and fails.

Is there a way to say to continue with the next shortest key?

> +Note: the effect of a forward-scanning lookbehind at the top level
> +can be achieved with:
> +
> + / .*? prestuff <( mainpat >) /

That should probably be

/ .*? prestuff <( mainpat )> /

> +As with bare hash, the longest key matches according to the longest token
> +rule, but in addition, you may combine multiple hashes under the same
> +longest-token consideration like this:
> +
> + <%statement|%prefix|%term>

This will be interesting from an implementation perspective. :-)

> +It is a syntax error to use an unbalanced C<< <( >> or C<< )> >>.

On #perl6 I think it was discussed that C<< <( >> and C<< )> >>
could be unbalanced -- that the first simply set the "from"
position and the second set the "to/pos" position. I think I
would prefer this.

Assuming we require the balance, what do we do with things like...?

/ aaa <( bbb { return 0; } ccc )> ddd /

And are we excluding the possibility of:

/ aaa <( [ bbb )> ccc
| dd ee )> ff
]
/

(The last example might be the anti-use case that shows that
<( and )> ought to be properly nested and balanced.)

> +Conjecture: Multiple opening angles are matched by a corresponding
> +number of closing angles, and otherwise function as single angles.
> +This can be used to visually isolate unmatched angles inside:
> +
> + <<<Ccode: a >> 1>>>

Does this eliminate the possibility of ever using french angles
as a possible rule syntax character? (It's okay if it does,
I simply wanted to make the observation.)

> +Just as C<rx> has variants, so does the C<rule> declarator.
> +In particular, there are two special variants for use in grammars:
> +C<token> and C<parse>.

I agree with Audrey that C<parse> is probably too useful in other
contexts. C<token:w> works fine for me.

> +With C<:global> or C<:overlap> or C<:exhaustive> the boolean is
> +allowed to return true on the first match.

Nice, nice, nice! Makes things *much* simpler for PGE.

Patrick R. Michaud

Apr 20, 2006, 10:27:21 AM

to la...@cvs.perl.org, perl6-l...@perl.org

On Thu, Apr 20, 2006 at 09:24:09AM -0500, Patrick R. Michaud wrote:
> First, let me say I really like the changes to S05. Good work
> once again.
>
> Here are my questions and comments.
>
> On Thu, Apr 20, 2006 at 02:07:51AM -0700, la...@cvs.perl.org wrote:
> > -(To get rule interpolation use an assertion - see below)
> > +However, if C<$var> contains a rule object, rather than attempting to
> > +convert it to a string, it is called as if you said C<< <$var> >>.
>
> Does this mean it's a capturing rule? Or is it called as
> if one had said C<< <?var> >>? (I would prefer it default
> to non-capturing.)

Sorry, I meant C<< <?$var> >> here, except we don't really
have a <?$var> syntax, so my question is just if it's capturing
or non-capturing. (I still prefer non-capturing.)

Larry Wall

Apr 20, 2006, 12:19:48 PM

to perl6-l...@perl.org

On Thu, Apr 20, 2006 at 09:24:09AM -0500, Patrick R. Michaud wrote:

: First, let me say I really like the changes to S05. Good work
: once again.
:
: Here are my questions and comments.
:
: On Thu, Apr 20, 2006 at 02:07:51AM -0700, la...@cvs.perl.org wrote:
: > -(To get rule interpolation use an assertion - see below)
: > +However, if C<$var> contains a rule object, rather than attempting to
: > +convert it to a string, it is called as if you said C<< <$var> >>.
:
: Does this mean it's a capturing rule? Or is it called as
: if one had said C<< <?var> >>? (I would prefer it default
: to non-capturing.)

I'd say the intent is non-capturing. In fact, it seems like a mechanism
for stealth rule injection. It falls just a wee bit short of a security
hole, though, I think, since an interloper would have to be in the same
process to compile the rule. We probably shouldn't try to run a tainted
rule, on the theory that the interloper tricked some other code into
compiling the stealth rule.

: > +If it is a string, it is matched literally, starting after where the
: > +key left off matching.
: > ..
: > +If it is a rule object, it is executed as a subrule, with an initial
: > +position after the matched key.
: > ..
: > +If it has the value 1, nothing special happens except that the key match
: > +succeeds.
: > ..
: > +Any other value causes the match to fail. In particular, shorter keys
: > +are not tried if a longer one matches and fails.
:
: Is there a way to say to continue with the next shortest key?

Yeah, use <@rules> rather than <%tokens>. :)

Actually, how about we say that '' just succeeds, and a number says to
retry ignoring keys longer than the number?

: > +As with bare hash, the longest key matches according to the longest token
: > +rule, but in addition, you may combine multiple hashes under the same
: > +longest-token consideration like this:
: > +
: > + <%statement|%prefix|%term>
:
: This will be interesting from an implementation perspective. :-)

Has to be done somewhere anyway. I'd rather the rule syntax grok the
notion than to sluff it off to some kind of magical hash constructor.
This way the rule knows exactly which hashes it has to track and cache.
It's also plain to the reader of the rule which syntactic categories
are being lumped together at this state in the parse.
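
As a rough illustration of what "the same longest-token consideration"
means for <%statement|%prefix|%term>, the following Python sketch picks the
longest key across several tables at once. The function name, the
(name, table) pairing, and the tie-breaking by listing order are
assumptions made for the example, and no caching is attempted.

```python
def longest_token(text, pos, *categories):
    """Return (category_name, key) for the longest key matching at pos, or None.

    categories are (name, table) pairs; only the keys of each table are used.
    """
    best = None
    for name, table in categories:
        for key in table:
            # A strictly longer key wins; on a tie the earlier category is kept.
            if text.startswith(key, pos) and (best is None or len(key) > len(best[1])):
                best = (name, key)
    return best


# Hypothetical syntactic categories, named after the example in the thread.
statement = {"if": "", "for": ""}
prefix = {"-": "", "--": ""}
term = {"i": "", "iffy": ""}
grammar = (("statement", statement), ("prefix", prefix), ("term", term))
```

For instance, longest_token("iffy", 0, *grammar) selects the term "iffy"
over the statement "if", because all three tables compete in one pass.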

: > +It is a syntax error to use an unbalanced C<< <( >> or C<< )> >>.
:
: On #perl6 I think it was discussed that C<< <( >> and C<< )> >>
: could be unbalanced -- that the first simply set the "from"
: position and the second set the "to/pos" position. I think I
: would prefer this.
:
: Assuming we require the balance, what do we do with things like...?
:
: / aaa <( bbb { return 0; } ccc )> ddd /
:
: And are we excluding the possibility of:
:
: / aaa <( [ bbb )> ccc
: | dd ee )> ff
: ]
: /
:
: (The last example might be the anti-use case that shows that
: <( and )> ought to be properly nested and balanced.)

Lemme think about that some more. I was worrying about accidental )>,
and not thinking about alternation. Certainly your example could
be rewritten as

    / aaa [
          | <( bbb )> ccc
          | <( dd ee )> ff
          ]
    /

but there are obviously cases where it wouldn't work. On the other
hand, there's perhaps some mental efficiency by lumping in <(...)>
with all the other <...> constructs, none of which can be unbalanced.
I'm inclined to say that the conservative thing is to require balance.
We could relax it later, I suppose.

: > +Conjecture: Multiple opening angles are matched by a corresponding
: > +number of closing angles, and otherwise function as single angles.
: > +This can be used to visually isolate unmatched angles inside:
: > +
: > + <<<Ccode: a >> 1>>>
:
: Does this eliminate the possibility of ever using french angles
: as a possible rule syntax character? (It's okay if it does,
: I simply wanted to make the observation.)

Probably, unless we treat <<...>> as French angles specially, for which
there is something to be said. I was just trying to make <<<...>>> consistent
with our other q<<<...>>> mechanisms, which recently switched to [[[...]]]
policy like POD has always had.

: > +Just as C<rx> has variants, so does the C<rule> declarator.
: > +In particular, there are two special variants for use in grammars:
: > +C<token> and C<parse>.
:
: I agree with Audrey that C<parse> is probably too useful in other
: contexts. C<token:w> works fine for me.

Aesthetically, I hate :w, actually...and the whole point of naming "token"
is that it is *not* a normal parser rule, but a lexer rule.

But I agree that "parse" is probably the wrong word. Earlier versions
had "prod" (short for "production") or "words". Even earlier
versions made ordinary "rule" have these semantics, but then it was
too confusing to talk about rules in general. I was very happy when
I thought of splitting the concepts yesterday.

I will think about that some more today. Consider "parse" a placeholder
for the concept of a plain old ordinary BNF rule.

: > +With C<:global> or C<:overlap> or C<:exhaustive> the boolean is
: > +allowed to return true on the first match.
:
: Nice, nice, nice! Makes things *much* simpler for PGE.

I don't see much point in not having rules be as lazy as possible.
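
The laziness in question can be sketched in Python, assuming a :global
match behaves like a lazy iterator over successive matches. The helper
names global_match and as_boolean are hypothetical; re.finditer is used
only because it produces matches on demand.

```python
import re


def global_match(pattern, text):
    """Model of a :global match: a lazy iterator over successive matches."""
    return re.finditer(pattern, text)


def as_boolean(matches):
    """Boolean context: true as soon as one match is found, without finding the rest."""
    return next(matches, None) is not None


matches = global_match(r"\d+", "1 22 333")
found = as_boolean(matches)                  # only the first match is computed here
remaining = [m.group() for m in matches]     # later matches are found on demand
```

Unlike a real match object, this iterator discards the first match once
it has been tested, so the sketch shows only that the remaining matches
need not be computed to answer the boolean question.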

Larry

Patrick R. Michaud

Apr 20, 2006, 2:14:41 PM

to perl6-l...@perl.org

On Thu, Apr 20, 2006 at 09:19:48AM -0700, Larry Wall wrote:
> : > +Any other value causes the match to fail. In particular, shorter keys
> : > +are not tried if a longer one matches and fails.
> :
> : Is there a way to say to continue with the next shortest key?
>
> Yeah, use <@rules> rather than <%tokens>. :)
>
> Actually, how about we say that '' just succeeds, and a number says to
> retry ignoring keys longer than the number?

s/retry/continue trying/, perhaps?

Using '' (instead of 1) as the success value sounds Good, since
null string always matches following a key. Taking "ignoring keys
longer than the number" literally, would we also read this then
that returning 0 tries the (remaining) empty keys of each hash,
and returning -1 fails the matching of <%tokens>?

> [ discussion of unbalanced <( ... )> ]
> I'm inclined to say that the conservative thing is to require balance.
> We could relax it later, I suppose.

Works for me.

> : > +Just as C<rx> has variants, so does the C<rule> declarator.
> : > +In particular, there are two special variants for use in grammars:
> : > +C<token> and C<parse>.
> :
> : I agree with Audrey that C<parse> is probably too useful in other
> : contexts. C<token:w> works fine for me.
>
> Aesthetically, I hate :w, actually...and the whole point of naming "token"
> is that it is *not* a normal parser rule, but a lexer rule.
>
> But I agree that "parse" is probably the wrong word. Earlier versions
> had "prod" (short for "production") or "words".

Two other ideas (from a short walk)... how about something along
the lines of "phrase" or "sequence"?

Audrey Tang

Apr 20, 2006, 2:21:41 PM

to Patrick R. Michaud, perl6-l...@perl.org

Patrick R. Michaud wrote:
> Two other ideas (from a short walk)... how about something along
> the lines of "phrase" or "sequence"?

Parsec uses the word "lexeme" to mean exactly the same thing...

Audrey

Damian Conway

Apr 20, 2006, 4:32:07 PM

to Larry Wall, perl6-l...@perl.org

Larry wrote:

> : I agree with Audrey that C<parse> is probably too useful in other
> : contexts. C<token:w> works fine for me.
>
> Aesthetically, I hate :w, actually...and the whole point of naming "token"
> is that it is *not* a normal parser rule, but a lexer rule.
>
> But I agree that "parse" is probably the wrong word. Earlier versions
> had "prod" (short for "production")

Just to point out to those playing along at home that a "production" is one
branch of an alternation, so it was right to reject that as the keyword.

> or "words".

...which was not very informative. :-)

> Even earlier versions made ordinary "rule" have these semantics, but
> then it was too confusing to talk about rules in general. I was very
> happy when I thought of splitting the concepts yesterday.
>
> I will think about that some more today. Consider "parse" a placeholder
> for the concept of a plain old ordinary BNF rule.

I agree they should be split, but perhaps it's "rules in general" that
should be renamed, since plain old ordinary BNF has laid claim to "rule"
for several decades now? Perhaps we need to bow to historical (rather
than etymological) usage on "regex" too, yielding:

    Keyword    Implicit adverbs    Behaviour

    regex      (none)              Ignores whitespace, backtracks
    token      :ratchet            Ignores whitespace, no backtracking
    rule       :ratchet :words     Skips whitespace, no backtracking

Using C<rule> and C<token> as the typical grammar components would make
Perl 6 grammars *much* more accessible to those already familiar with
grammar-based parsing. And using C<regex> for "plain old backtracking regular
expressions" would make them much more accessible to those already familiar
with Perl 5 regexes.
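
For readers unfamiliar with the distinction, the difference between a
backtracking "regex" and a ratcheting "token" can be approximated in
Python. This is only an analogy: Python's re module has no :ratchet
modifier, so the sketch emulates it with the lookahead-plus-backreference
idiom, which relies on lookaheads not being re-entered on backtracking.
The pattern names are hypothetical.

```python
import re

# Backtracking: a* may give back one 'a' so that the trailing 'ab' can match.
BACKTRACKING = re.compile(r"a*ab")

# Ratchet-like: (?=(a*))\1 consumes every 'a' and never gives any back.
RATCHETED = re.compile(r"(?=(a*))\1ab")

# Ratchet-like pattern that needs no backtracking, so it still succeeds.
RATCHETED_OK = re.compile(r"(?=(a*))\1b")


def matches(pattern, text):
    """Return True if the pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None
```

On the input "aaab", BACKTRACKING and RATCHETED_OK both match, while
RATCHETED fails because the run of a's has already been consumed and
nothing is left for the literal 'a' that follows.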

Damian
