
[svn:perl6-synopsis] r13529 - doc/trunk/design/syn


la...@cvs.perl.org

Jan 17, 2007, 9:24:51 PM
to perl6-l...@perl.org
Author: larry
Date: Wed Jan 17 18:24:48 2007
New Revision: 13529

Modified:
doc/trunk/design/syn/S05.pod

Log:
We now analyze regex expressions as pattern/action pairs and grammars as
collections of those pairs. The initial-constant-strings approach
is now generalized to initial DFAable patterns as suggested by luqui++.
The idea of "token" is now defined a little more rigorously, and its
epiphenomenal nature explicated, especially with respect to C<token>.


Modified: doc/trunk/design/syn/S05.pod
==============================================================================
--- doc/trunk/design/syn/S05.pod (original)
+++ doc/trunk/design/syn/S05.pod Wed Jan 17 18:24:48 2007
@@ -16,7 +16,7 @@
Date: 24 Jun 2002
Last Modified: 17 Jan 2007
Number: 5
- Version: 44
+ Version: 45

This document summarizes Apocalypse 5, which is about the new regex
syntax. We now try to call them I<regex> rather than "regular
@@ -68,43 +68,7 @@
=back

While the syntax of C<|> does not change, the default semantics do
-change slightly. Instead of representing temporal alternation, C<|>
-now represents logical alternation with longest-token semantics.
-(You may now use C<||> to indicate the old temporal alternation. That is,
-C<|> and C<||> now work within regex syntax much the same as they
-do outside of regex syntax, where they represent junctional and
-short-circuit OR. This includes the fact that C<|> has tighter
-precedence than C<||>.) Every regex in Perl 6 is required to be able to
-return its list of initial constant strings (transitively including the
-initial constant strings of any initial subrule called by that regex).
-A logical alternation using C<|> then takes two or more of these lists
-and dispatches to the alternative that advertises the longest matching
-prefix, not necessarily to the alternative that comes first lexically.
-(However, in the case of a tie between alternatives, the earlier
-alternative does take precedence.) Backtracking into a constant prefix
-(or into a :: that would backtrack over a constant prefix) causes
-the next longest match to be attempted, even if that is specified
-in a different subrule.
-
-Initial constants must take into account case sensitivity (or any other
-canonicalization primitives) and do the right thing even when propagated
-up to rules that don't have the same canonicalization. That is, they
-must continue to represent the set of matches that the lower rule would
-match. If and when the optimizer turns such a list of prefixes into,
-say, a trie, the trie must continue to have the appropriate semantics
-for the originating rule.
-
-The C<||> form has the old short-circuit semantics, and will not
-attempt to match its right side unless all possibilities (including
-all C<|> possibilities) are exhausted on its left. The first C<||>
-in a regex makes constant strings on its left available to the
-outer longest-token matcher, but hides any subsequent tests from
-longest-token matching. Every C<||> establishes a new longest-token
-table. That is, if you use C<|> on the right side of C<||>, that
-right side establishes a new top level for longest-token processing
-for this subexpression and any called subrules. The right side's
-longest-token list is invisible to the left of the C<||> or outside
-the regex containing the C<||>.
+change slightly. See the section below on "Longest-token matching".

=head1 Modifiers

@@ -654,13 +618,12 @@
as a literal, or fails if no key matches. (A C<""> key will match
anywhere, provided no longer key matches.)

-In a context requiring a set of initial constant strings, the keys
-of the hash comprise that set of strings, and any subsequent matching
-performed by the hash values is not considered a part of those strings,
-even if that subsequent match begins by matching more constant string.
-The keys are considered to be canonicalized in the same way as any
-surrounding context, so for instance within a case-insensitive context
-the hash keys must match insensitively also.
+In a context requiring a set of initial token patterns, the initial
+token patterns are taken to be each key plus any initial token pattern
+matched by the corresponding value (if the value is a string or regex).
+The token patterns are considered to be canonicalized in the same way
+as any surrounding context, so for instance within a case-insensitive
+context the hash keys must match insensitively also.
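
As an illustration of how keys and value prefixes combine, here is a minimal
Python sketch (an assumed model, not the rules engine itself; values are
limited to constant strings, so each value's whole text serves as its own
initial token pattern, and the helper name is hypothetical):

```python
def initial_tokens(dispatch, token_prefix_of):
    """Build the set of initial token patterns for a hash used in a
    regex: each key contributes the key itself plus the initial token
    pattern of its value. Values that are neither strings nor regexes
    contribute just the key. Hypothetical helper names throughout."""
    out = set()
    for key, val in dispatch.items():
        # Only string (or regex) values extend the key's token pattern.
        prefix = token_prefix_of(val) if isinstance(val, str) else ""
        out.add(key + prefix)
    return out

# Constant-string values: the whole value is its own token prefix.
toks = initial_tokens({"print": "ln", "push": ""}, lambda v: v)
print(sorted(toks))  # → ['println', 'push']
```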

Subsequent matching depends on the hash value:

@@ -1534,6 +1497,107 @@

=back

+=head1 Longest-token matching
+
+Instead of representing temporal alternation, C<|> now represents
+logical alternation with longest-token semantics. (You may now use
+C<||> to indicate the old temporal alternation. That is, C<|> and
+C<||> now work within regex syntax much the same as they do outside
+of regex syntax, where they represent junctional and short-circuit OR.
+This includes the fact that C<|> has tighter precedence than C<||>.)
+
+Historically regex processing has proceeded in Perl via an NFA
+algorithm. This is quite powerful, but many parsers work more
+efficiently by processing rules in parallel rather than one after
+another, at least up to a point. If you look at something like a
+yacc grammar, you find a lot of pattern/action declarations where the
+patterns are considered in parallel, and eventually the grammar decides
+which action to fire off. While the default Perl view of parsing is
+essentially top-down (perhaps with a bottom-up "middle layer" to handle
+operator precedence), it is extremely useful for user understanding
+if at least the token processing proceeds deterministically. So for
+regex matching purposes we define token patterns as those patterns
+containing no whitespace that can be matched without side effects
+or self-reference.
+
+To that end, every regex in Perl 6 is required to be able to
+distinguish its "pure" patterns from its actions, and return its
+list of initial token patterns (transitively including the token
+patterns of any subrule called by the "pure" part of that regex, but
+not including any subrule more than once, since that would involve
+self-reference). A logical alternation using C<|> then takes two or
+more of these lists and dispatches to the alternative that matches
+the longest token prefix. This may or may not be the alternative
+that comes first lexically. (However, in the case of a tie between
+alternatives, the textually earlier alternative does take precedence.)
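
The dispatch rule above can be sketched procedurally. This is a simplified
Python model, not the actual engine (which would compile the token patterns
into a single DFA); the pairing of each branch's declarative token pattern
with its action is an assumption of the sketch:

```python
import re

def longest_token_dispatch(alternatives, text):
    """Pick the alternative whose declarative token pattern matches the
    longest prefix of `text`; ties go to the textually earlier branch.

    `alternatives` is a list of (token_pattern, action) pairs, where
    token_pattern is a plain regex for the "pure" prefix of the branch."""
    best, best_len = None, -1
    for pattern, action in alternatives:
        m = re.match(pattern, text)
        # Strictly longer wins; on a tie the earlier branch is kept.
        if m and len(m.group(0)) > best_len:
            best, best_len = action, len(m.group(0))
    return best

# "if" is a longer token prefix than "i", so the second branch wins
# even though the first comes first lexically.
alts = [(r"i", "ident-start"), (r"if", "if-keyword")]
print(longest_token_dispatch(alts, "if $x"))  # → if-keyword
```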
+
+This longest token prefix corresponds roughly to the notion of "token"
+in other parsing systems that use a lexer, but in the case of Perl
+this is largely an epiphenomenon derived automatically from the grammar
+definition. However, despite being automatically calculated, the set of
+tokens can be modified by the user; various
+constructs within a regex declaratively tell the grammar engine that
+it is finished with the pattern part and starting in on the side effects,
+so by inserting such constructs the user controls what is considered
+a token and what is not. The constructs deemed to terminate a token
+pattern and start the "action" part of the pattern include:
+
+=over
+
+=item *
+
+Any C<::> or C<:::> backtracking control (but not the C<:> possessive modifier).
+
+=item *
+
+Any C<{...}> action or assertion containing a closure.
+
+=item *
+
+Any part of the regex that might match whitespace, including whitespace
+implicitly matched via C<:sigspace>.
+
+=item *
+
+Any sequential control flow operator such as C<||> or C<&&>.
+
+=back
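
As a rough model, the constructs above act as cut points: everything before
the first of them is the token pattern, everything after is action. A
hypothetical Python scanner over raw regex source (real longest-token
analysis works on the parsed regex AST, and the function name is an
assumption, not S05 API):

```python
# Terminators from the list above; whitespace is handled separately
# below as a stand-in for anything that might match via :sigspace.
TERMINATORS = ("::", ":::", "{", "||", "&&")

def declarative_prefix(regex_source):
    """Return the leading slice of `regex_source` usable as a token
    pattern: cut at the first backtracking control, closure, sequential
    control flow operator, or whitespace."""
    cut = len(regex_source)
    for t in TERMINATORS:
        pos = regex_source.find(t)
        if pos != -1:
            cut = min(cut, pos)
    for i, ch in enumerate(regex_source[:cut]):
        if ch.isspace():
            cut = i
            break
    return regex_source[:cut]

print(declarative_prefix(r"\[<expr>::\]"))    # cut at '::' → \[<expr>
print(declarative_prefix("if <.ws> <block>")) # cut at space → if
```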
+
+Subpatterns (captures) specifically do not terminate the token pattern,
+but may require a reparse of the token via NFA to find the location
+of the subpatterns.
+
+Oddly enough, the C<token> keyword specifically does not determine
+the scope of a token, except insofar as a token pattern usually
+doesn't do any matching of whitespace. In contrast, the C<rule>
+keyword (which assumes C<:sigspace>) defines a pattern that tends
+to disqualify itself on the first whitespace. So most of the token
+patterns will end up coming from C<token> declarations. For instance,
+a token declaration such as
+
+ token list_composer { \[ <expr> \] }
+
+considers its "longest token" to be just the left square bracket, because
+the first thing the C<expr> rule will do is traverse optional whitespace.
+
+Initial tokens must take into account case sensitivity (or any other
+canonicalization primitives) and do the right thing even when propagated
+up to rules that don't have the same canonicalization. That is, they
+must continue to represent the set of matches that the lower rule would
+match.
+
+The C<||> form has the old short-circuit semantics, and will not
+attempt to match its right side unless all possibilities (including
+all C<|> possibilities) are exhausted on its left. The first C<||>
+in a regex makes the token patterns on its left available to the
+outer longest-token matcher, but hides any subsequent tests from
+longest-token matching. Every C<||> establishes a new longest-token
+matcher. That is, if you use C<|> on the right side of C<||>, that
+right side establishes a new top level DFA for longest-token processing
+for this subexpression and any called subrules. The right side's
+longest-token list is invisible to the left of the C<||> or outside
+the regex containing the C<||>.
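
The tiering of C<||> can be modeled as a list of independent longest-token
groups tried in order. A minimal Python sketch under that assumption (the
real engine builds a DFA per group; this uses plain regexes and returns the
matched text rather than dispatching to an action):

```python
import re

def match_tiered(tiers, text):
    """`tiers` is a list of longest-token groups (lists of regexes),
    separated in the source by C<||>. Try each tier in order; within a
    tier, the longest match wins. The outer matcher never sees any
    tier's patterns beyond the first."""
    for tier in tiers:
        matches = [m.group(0) for p in tier
                   if (m := re.match(p, text))]
        if matches:
            return max(matches, key=len)  # longest within the tier
    return None

# 'ab' | 'abc' || 'a': the first tier prefers 'abc'; 'a' is tried
# only after the whole first tier fails.
tiers = [[r"ab", r"abc"], [r"a"]]
print(match_tiered(tiers, "abcd"))  # → abc
print(match_tiered(tiers, "axx"))   # → a
```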
+
=head1 Return values from matches

=head2 Match objects
