A rule by any other name...

Allison Randal

May 9, 2006, 7:51:17 PM

to p6l

On Apr 20, 2006, at 1:32 PM, Damian Conway wrote:

> Keyword   Implicit adverbs   Behaviour
> regex     (none)             Ignores whitespace, backtracks
> token     :ratchet           Ignores whitespace, no backtracking
> rule      :ratchet :words    Skips whitespace, no backtracking
>
> [...and following threads...]

I'm comfortable with the semantic distinction between 'rule' as "thingy
inside a grammar" and 'regex' as "thingy outside a grammar". But, I
think we can find a better name than 'regex'. The problem is both the
'regex' vs. 'regexp' battle, and the fact that everyone knows 'regex(p)'
means "regular expression" no matter how many times we say it doesn't.
(I'm not fond of the idea of spending the next 20 years explaining that
over and over again.) Maybe 'match' is a better keyword.

Then again, from a practical perspective, it seems likely that we'll
want something like ":ratchet is set by default in all rules" turned on
in some grammars and off in other grammars. In which case, the real
distinction is that rules inside a grammar pull default attributes from
their grammar class, while rules outside a grammar have no default
attributes. Which brings us back to a single keyword 'rule' making sense
for both.

I'm not comfortable with the semantic distinction between 'rule' and
'token'. Whitespace skipping is not the defining difference between a
rule and a token in general use of the terms, so the names are misleading.

More importantly, whitespace skipping isn't a very significant option in
grammars in general, so creating two keywords that distinguish between
skipping and no skipping is linguistically infelicitous. It's like
creating two different words for "shirts with horizontal stripes" and
"shirts with vertical stripes". Sure, they're different, but the
difference isn't particularly significant, so it's better expressed by a
modifier on "shirt" than by a different word.

From a practical perspective, both the Perl 6 and Punie grammars have
ended up using 'token' in many places (for things that aren't tokens),
because :words isn't really the semantics you want for parsing computer
languages. (Though it is quite useful for parsing natural language and
other things.) What you want is comment skipping, which isn't the same
as :words.

I suggest making whitespace skipping a default setting on the grammar
class, so the grammars that need whitespace skipping most of the time
can turn it on by default for their rules. That means 'token' and 'rule'
collapse into just 'rule'.

I also suggest a new modifier for comment skipping (or skipping in
general) that's separate from :words, with semantics much closer to
Parse::RecDescent's 'skip'.

Allison

James Mastros

May 9, 2006, 8:38:37 PM

to Allison Randal, p6l

On Tue, May 09, 2006 at 04:51:17PM -0700, Allison Randal wrote:
> I'm comfortable with the semantic distinction between 'rule' as "thingy
> inside a grammar" and 'regex' as "thingy outside a grammar". But, I
> think we can find a better name than 'regex'.

[...]

> Maybe 'match' is a better keyword.

Can I suggest we keep "match" meaning the thing you get when you run a thingy
against a string, and make "matcher" be the thingy that gets run?

100% agree with you, Allison; thanks for putting words to "doesn't feel
right".

-=- James Mastros

Damian Conway

May 9, 2006, 9:25:26 PM

to p6l

Allison wrote:

> I'm comfortable with the semantic distinction between 'rule' as "thingy
> inside a grammar" and 'regex' as "thingy outside a grammar". But, I
> think we can find a better name than 'regex'. The problem is both the
> 'regex' vs. 'regexp' battle,

Is that really an issue? I've never met anyone who *voluntarily* added
the 'p'. ;-)

> and the fact that everyone knows 'regex(p)'
> means "regular expression" no matter how many times we say it doesn't.

Sure. But almost nobody knows what "regular" actually means, and of
those few only a tiny number of pedants actually *care* anymore. So
does it matter?

> (I'm not fond of the idea of spending the next 20 years explaining that
> over and over again.)

Then don't. I teach regexes all the time and I *never* explain what
"regular" means, or why it doesn't apply to Perl (or any other
commonly used) regexes any more.

> Maybe 'match' is a better keyword.

I don't think so. "Match" is a better word for what comes back from
a regex match (what we currently refer to as a Capture, which is
okay too).

> Then again, from a practical perspective, it seems likely that we'll
> want something like ":ratchet is set by default in all rules" turned on
> in some grammars and off in other grammars. In which case, the real
> distinction is that rules inside a grammar pull default attributes from
> their grammar class, while rules outside a grammar have no default
> attributes. Which brings us back to a single keyword 'rule' making sense
> for both.

That's pretty much the Perl 5 argument for using "sub" for both subroutines
and methods, which we've definitively rejected in Perl 6. If we use
"rule" for both kinds of regexes, we force the reader to constantly
check surrounding context in order to understand the behaviour of the
construct. :-(

> I'm not comfortable with the semantic distinction between 'rule' and
> 'token'. Whitespace skipping is not the defining difference between a
> rule and a token in general use of the terms, so the names are misleading.

True. "Token" is the wrong word for another reason: a token is a
segmented component of the input stream, *not* a rule for matching
segmented components of the input stream. The correct term for that is
"terminal". So a suitable keyword might well be "term".

However, terminals do differ from rules in that they do not attempt to
be smart about what they ignore.

> More importantly, whitespace skipping isn't a very significant option in
> grammars in general, so creating two keywords that distinguish between
> skipping and no skipping is linguistically infelicitous. It's like
> creating two different words for "shirts with horizontal stripes" and
> "shirts with vertical stripes". Sure, they're different, but the
> difference isn't particularly significant, so it's better expressed by a
> modifier on "shirt" than by a different word.

I'd *strongly* disagree with that. Whitespace skipping (for suitable
values of "whitespace") is a critical feature of parsers. I'd go so far
as to say that it's *the* killer feature of Parse::RecDescent.

> From a practical perspective, both the Perl 6 and Punie grammars have
> ended up using 'token' in many places (for things that aren't tokens),
> because :words isn't really the semantics you want for parsing computer
> languages. (Though it is quite useful for parsing natural language and
> other things.) What you want is comment skipping, which isn't the same
> as :words.

What you want is *whitespace* skipping (where comments are a special
form of whitespace). What you *really* want is whitespace skipping
where you get to define what constitutes whitespace in each context
where whitespace might be skipped.

But the defining characteristic of a "terminal" is that you try to match
it exactly, without being smart about what to ignore. That's why I like the
fundamental rule/token distinction as it is currently specified.

> I also suggest a new modifier for comment skipping (or skipping in
> general) that's separate from :words, with semantics much closer to
> Parse::RecDescent's 'skip'.

Note, however, that the recursive nature of Parse::RecDescent's <skip>
directive is a profound nuisance in practice, because you have to
remember to turn it off in every one of the terminals.

In light of all that, perhaps :words could become :skip, which defaults to
:skip(/<ws>/) but allows you to specify :skip(/whatever/).

As for the keywords and behaviour, I think the right set is:

                          Default         Default
    Keyword    Where      Backtracking    Skipping

    regex      anywhere   :!ratchet       :!skip
    rule       grammars   :ratchet        :skip
    term       grammars   :ratchet        :!skip

I do agree that a rule should inherit properties from its grammar, so
you can write:

    grammar Perl6 is skip(/[<ws>+ | \# <brackets> | \# \N]+/) {
        ...
    }

to allow your grammar to redefine in one place what its rules skip.

Damian
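
The backtracking column of the table above is the easiest one to see in isolation. What follows is a minimal sketch in Python rather than Perl 6, under loud assumptions: `match`, `PIECES`, and the (character set, minimum count) representation are hypothetical names made up for illustration, and the toy pattern is roughly `<[ab]>+ b+`. It only shows what committing to the longest run (`:ratchet`) changes compared with retrying shorter runs; it says nothing about how any Perl 6 implementation works.

```python
# Toy pattern, roughly <[ab]>+ b+ : each piece is (allowed characters, minimum count).
# PIECES and match() are hypothetical names invented for this sketch.
PIECES = [("ab", 1), ("b", 1)]


def match(pieces, text, pos=0, ratchet=False):
    """Return the end position of a match starting at pos, or None."""
    if not pieces:
        return pos
    (chars, lo), rest = pieces[0], pieces[1:]
    # Greedily count how many characters this piece could consume.
    n = 0
    while pos + n < len(text) and text[pos + n] in chars:
        n += 1
    if n < lo:
        return None
    # A ratcheting matcher commits to the longest run;
    # a backtracking matcher also retries every shorter run.
    candidates = [n] if ratchet else range(n, lo - 1, -1)
    for take in candidates:
        end = match(rest, text, pos + take, ratchet)
        if end is not None:
            return end
    return None


backtracking = match(PIECES, "aab")             # gives one character back
ratcheted = match(PIECES, "aab", ratchet=True)  # first piece keeps all of "aab"
```

On the input "aab" the backtracking call succeeds because the first piece hands its final "b" back to the second piece, while the ratcheted call fails because the first piece never releases what it consumed.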

Audrey Tang

May 9, 2006, 9:33:40 PM

to Allison Randal, p6l

Allison Randal wrote:
> More importantly, whitespace skipping isn't a very significant option in
> grammars in general, so creating two keywords that distinguish between
> skipping and no skipping is linguistically infelicitous. It's like
> creating two different words for "shirts with horizontal stripes" and
> "shirts with vertical stripes". Sure, they're different, but the
> difference isn't particularly significant, so it's better expressed by a
> modifier on "shirt" than by a different word.

This is not only "space" skipping; as we discussed, <ws> skips over
comments as well as spaces, because a language (such as Perl 6) can
define its own <ws> that serves as a valid separator. To wit:

    void main () {}
    void/* this also works */main () {}

Or, in Perl 6:

    say time;
    say#( this also works )time;

> From a practical perspective, both the Perl 6 and Punie grammars have
> ended up using 'token' in many places (for things that aren't tokens),
> because :words isn't really the semantics you want for parsing computer
> languages. (Though it is quite useful for parsing natural language and
> other things.) What you want is comment skipping, which isn't the same
> as :words.

Currently it's defined, and used, the same as :words.

I think the confusion arises from <ws> being read as "whitespace"
instead of as "word separator". Maybe an explicit <wordsep> can fix
that, or maybe rename it to something else, but the token/rule
distinction of :words is very useful, because it's more usual for
languages to behave like C and Perl 6, instead of:

    ex/* this calls exit */it();

which is rarer, and can be handled with separate "token" rules rather than <ws>.

Audrey
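
The "word separator" reading of <ws> in the message above can be sketched in a few lines. This is a Python illustration under stated assumptions, not the Perl 6 definition of <ws>: `WORDSEP`, `WORD`, and `words` are hypothetical names, and the separator is assumed to be whitespace or a C-style comment only.

```python
import re

# A "word separator": whitespace or a C-style comment.
# Hypothetical stand-in for a language-defined <ws>; not taken from a real grammar.
WORDSEP = re.compile(r"(?:\s|/\*.*?\*/)+", re.DOTALL)
WORD = re.compile(r"\w+")


def words(text):
    """Split text into words, treating comments as separators like whitespace."""
    pos, out = 0, []
    while pos < len(text):
        sep = WORDSEP.match(text, pos)
        if sep:
            pos = sep.end()
            continue
        word = WORD.match(text, pos)
        if word is None:
            break  # stop at the first character that is neither word nor separator
        out.append(word.group())
        pos = word.end()
    return out
```

With this definition `void/* this also works */main` yields the same two words as `void main`, and `ex/* this calls exit */it` yields `ex` and `it` rather than a single `exit`, which is the C-like behaviour described above.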

Damian Conway

May 10, 2006, 4:07:54 AM

to p6l

Allison wrote:

>> I've never met anyone who *voluntarily* added
>> the 'p'. ;-)
>
> You've spent too much time in the U.S. ;)

And Australia. I don't know where the silent 'p' comes from but it sure ain't
the New World.

> Picking names that mean what they say is important in Perl. It's why we have
> 'given'/'when' instead of 'switch'/'case'. We don't have to use the same old
> name for things just because everyone else is doing it (even if we started it).
>
> There's nothing about 'regex' that says "backtracking enabled".

Sure there is. About 20 years of computing history. Nowadays "regex" has
virtually nothing to do with "regular expressions"; it's now just the computing term
for "compact set of instructions for a pattern matching machine".

> But isn't it appealing to stop using an archaic word that has now become
> meaningless?

No. For a start, "regex" isn't archaic. In fact it's a comparative neologism,
having only recently broken away--both syntactically and semantically--from the
older "regular expression". More importantly, the *concept* hasn't become
meaningless at all; indeed it's grown significantly in meaning over the past
decade. And the word "regex" is now far more strongly associated with that
expanded concept than with the original idea of a "regular expression".

>> That's pretty much the Perl 5 argument for using "sub" for both subroutines
>> and methods, which we've definitively rejected in Perl 6.
>
> Subs and methods have a number of distinguishing characteristics. If the only
> distinction between them was one small characteristic change, I might argue
> against using different keywords there too. (I think the choice of using only
> 'sub' made sense for Perl 5 with its simplistic OO semantics, but Perl 6
> provides more intelligent defaults for methods so the separation makes sense
> here.)

I think you're wrong. I think "sub" has proved not to be the right choice in
Perl 5 either. As abstractions, methods and subs are very different. In usage,
they're very different. It's only in implementation that they're similar.
Using the same keyword for two constructs that are used--and which act--very
differently was a rare misstep on Larry's part.

And it's those same enormous abstract and pragmatic differences that we need
two keywords to distinguish when it comes to pattern matching. Think about the
trouble we're going to have translating Perl 5 subs to Perl 6 subs or methods,
precisely because of the lack of semantic marking. The designers of Perl 7
won't thank us if we repeat the mistake with regexes and rules.

> Rules inside and outside grammars are the same class. They have the same
> behaviour aside from :ratchet,

And skipping!

> and :ratchet can be set without the keyword change.

But then you've no way of knowing from *local* context which way it defaults
for a given instance.

> More than that, the current 'rule' and 'regex' can both be used inside
> and outside a grammar. If we were to take the 'sub'/'method' pattern, then
> 'rule' should never be allowed outside a grammar,

I entirely agree.

> and 'regex' should either not be allowed inside a 'grammar',
> or should express some distinctive feature
> inside the grammar (like "non-inherited" or "doesn't operate on the match
> object",

The main distinction is that rules are "ratcheted and skippy" whereas regexes
aren't. But yes, regexes ought not to be inherited either.

> but there are better words for those concepts than 'regex').

If you can come up with even one other word that means "backtrackable,
non-skippy, and uninherited", in the same way that "rule" implies "ratcheted,
whitespace-skipping, and heritable", then I'd be more than delighted to
consider it.

Personally, I thought "regex" already fit the bill admirably, since
backtracking, not skipping, and not inheriting is exactly what regexes do in
most current languages (including Perl 5).

>> If we use "rule" for both kinds of regexes, we force the reader to constantly
>> check surrounding context in order to understand the behaviour of the
>> construct. :-(
>
> Context is a Perlish concept. :)

*Local* context is. Having three fundamental behaviours change because of a
namespace declaration 1000 lines earlier doesn't seem very Perlish to me.

> Making different things different is an important design principle, but so is
> making similar things similar.

I disagree. What we've been doing in Perl 6 is making different things
different, and identical things identical (or, more precisely, consolidating
things that turn out to be identical if you look closely enough).

But regexes and rules aren't identical; merely similar. And making
similar things identical is a *bad* idea in language. IANL(inguist) but
it seems to me that most languages evolve towards making similar things as
different as possible, so that they're not accidentally confused.

> I do like 'term' better.

Me too. :-)

> That really isn't "whitespace" skipping, though.

Sure it is. "Whitespace" is just the industry term for "anything we politely
ignore". Comments are whitespace. Spaces, tabs, and newlines are whitespace.
Pod is whitespace. Larry's tuxedos are whitespace. Just because some kinds of
whitespace are neither white nor spacey doesn't mean they're not whitespace. ;-)

> Can you give me some additional characteristics for 'term' beyond just "turn
> off :skip"?

Yep. See below.

> Grammars also need to turn off skipping in rules that aren't terminals,

Very rarely, in my experience. And generally only for that part of a rule that
someone has been too lazy to factor out as a separate terminal.

> And in the current form you have to remember to use 'token' for all the
> terminals. Not really a significant difference in mental effort.

You see, I'd argue that it *is* a significant difference in mental effort.

Parser writers think in terms of rules and terminals, with the terminals doing
the precise matching against the input, and the rules abstracting and
orchestrating the terminals' collective matching and taking care of the
skipping behaviour. So writing:

    rule sentence { <noun> <verb> <noun> }

    term noun { s?he | they | we | I | you | Larry | Audrey | Guido }

    term verb { hugged | helped | hit }

reflects the two distinctive roles far better than:

    rule sentence { <noun> <verb> <noun> }

    rule noun :!skip { s?he | they | we | I | you | Larry | Audrey | Guido }

    rule verb :!skip { hugged | helped | hit }

If nothing else, it's far easier to distinguish the terminals from the
aggregations when the distinguishing keyword is hard on the left, rather than
buried somewhere in the middle of the declaration.

> Including :skip(/<someotherrule>/). Yes, agreed, it's a huge improvement. I'd
> be more comfortable if the default rule to use for skipping was named <skip>
> instead of <ws>. (On IRC <sep> was also proposed, but the connection between
> :skip and <skip> is more immediately obvious.)

Yes, I like <skip> too. I too keep mistakenly reading <ws> as "WhiteSpace".

>> As for the keywords and behaviour, I think the right set is:
>>
>>                          Default         Default
>>    Keyword    Where      Backtracking    Skipping
>>
>>    regex      anywhere   :!ratchet       :!skip
>>    rule       grammars   :ratchet        :skip
>>    term       grammars   :ratchet        :!skip
>
> And I think the right set is:
>
>    rule       anywhere   :!ratchet       :!skip
>    rule       grammars   :ratchet        :!skip

But that's *wrong*. Grammar rules absolutely need to skip by default, which
would make your table:

    rule       anywhere   :!ratchet       :!skip
    rule       grammars   :ratchet        :skip

whereupon we have two fundamental differences between grammar rules and
non-grammar rules. Which is why the external rules (which don't act like
grammatical rules at all, but like standard backtracking non-skipping regexes)
need a different keyword (like "regex").

And removing the "term" keyword (or "token" or whatever) removes the obvious
syntactic marking of a fundamentally important semantic distinction, as I
discussed above.

I'm still utterly convinced my original three-keyword list is the right one
(and that the three keywords in it are the right ones too). Collapsing these
three clearly distinguishable concepts into one keyword and then requiring that
keyword be adverbially modified about 2/3 of the time, seems like a false
economy to me: a loss in readability *and* a significant increase in the amount
of code required. :-(

Damian
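
The division of labour in the sentence/noun/verb example from the message above (terminals match exactly, the rule does the skipping between them) can be sketched as follows. This is a Python illustration under stated assumptions, not Perl 6: `rule`, `sentence`, `SKIP`, and `NO_SKIP` are hypothetical names, and the default skip pattern is simplified to optional whitespace, so unlike a real <ws> it does not insist on a separator between two words.

```python
import re

# Terminals: matched exactly, with no skipping inside them.
NOUN = re.compile(r"s?he|they|we|I|you|Larry|Audrey|Guido")
VERB = re.compile(r"hugged|helped|hit")

# Default skip pattern, a simplified stand-in for <ws> (assumption of this sketch).
SKIP = re.compile(r"\s*")
# Skipping switched off, in the spirit of :!skip.
NO_SKIP = re.compile(r"")


def rule(terminals, text, skip=SKIP):
    """Match the terminals in order, applying the skip pattern around each one.

    Returns the list of matched terminal strings, or None on failure.
    """
    pos = skip.match(text, 0).end()
    found = []
    for term in terminals:
        m = term.match(text, pos)
        if m is None:
            return None
        found.append(m.group())
        pos = skip.match(text, m.end()).end()
    # The whole input must be consumed for the rule to succeed.
    return found if pos == len(text) else None


def sentence(text, skip=SKIP):
    """Counterpart of: rule sentence { <noun> <verb> <noun> }"""
    return rule([NOUN, VERB, NOUN], text, skip)
```

Passing a different compiled pattern as `skip` plays the part of `:skip(/whatever/)`: the terminals stay untouched and only the aggregating rule changes what it ignores.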

Juerd

May 10, 2006, 4:20:40 AM

to perl6-l...@perl.org

Damian Conway skribis 2006-05-10 18:07 (+1000):

> > More than that, the current 'rule' and 'regex' can both be used inside
> > and outside a grammar. If we were to take the 'sub'/'method' pattern, then
> > 'rule' should never be allowed outside a grammar,
> I entirely agree.

I don't. While disallowing named methods and rules may be a wise idea
(I'm not sure it is), the anonymous forms are probably very useful to
have around.

    my $method = method { ... };
    $object.$method(...);

Juerd
--
http://convolution.nl/maak_juerd_blij.html
http://convolution.nl/make_juerd_happy.html
http://convolution.nl/gajigu_juerd_n.html

Allison Randal

May 10, 2006, 1:10:51 AM

to Damian Conway, p6l

On Wed, 10 May 2006, Damian Conway wrote:
> Allison wrote:
>
> I've never met anyone who *voluntarily* added
> the 'p'. ;-)

You've spent too much time in the U.S. ;)

> > and the fact that everyone knows 'regex(p)'
> > means "regular expression" no matter how many times we say it doesn't.
>
> Sure. But almost nobody knows what "regular" actually means, and of
> those few only a tiny number of pedants actually *care* anymore. So
> does it matter?

Picking names that mean what they say is important in Perl. It's why we have
'given'/'when' instead of 'switch'/'case'. We don't have to use the same old
name for things just because everyone else is doing it (even if we started it).

There's nothing about 'regex' that says "backtracking enabled".

> Then don't. I teach regexes all the time and I *never* explain what
> "regular" means, or why it doesn't apply to Perl (or any other
> commonly used) regexes any more.

But isn't it appealing to stop using an archaic word that has now become
meaningless?

> > Maybe 'match' is a better keyword.
>
> I don't think so. "Match" is a better word for what comes back from
> a regex match (what we currently refer to as a Capture, which is
> okay too).

I agree there. I still prefer 'rule'.

> That's pretty much the Perl 5 argument for using "sub" for both subroutines
> and methods, which we've definitively rejected in Perl 6.

Subs and methods have a number of distinguishing characteristics. If the only
distinction between them was one small characteristic change, I might argue
against using different keywords there too. (I think the choice of using only
'sub' made sense for Perl 5 with its simplistic OO semantics, but Perl 6
provides more intelligent defaults for methods so the separation makes sense
here.)

Rules inside and outside grammars are the same class. They have the same
behaviour aside from :ratchet, and :ratchet can be set without the keyword
change. More than that, the current 'rule' and 'regex' can both be used inside
and outside a grammar. If we were to take the 'sub'/'method' pattern, then
'rule' should never be allowed outside a grammar, and 'regex' should either not
be allowed inside a 'grammar', or should express some distinctive feature
inside the grammar (like "non-inherited" or "doesn't operate on the match
object", but there are better words for those concepts than 'regex').

> If we use "rule" for both kinds of regexes, we force the reader to constantly
> check surrounding context in order to understand the behaviour of the
> construct. :-(

Context is a Perlish concept. :)

It's worse to force the writer and reader to distinguish between two keywords
when they don't have a sharp difference in meaning, and when the names of the
two keywords don't provide any clues to what the difference is.

Making different things different is an important design principle, but so is
making similar things similar.

> True. "Token" is the wrong word for another reason: a token is a
> segmented component of the input stream, *not* a rule for matching
> segmented components of the input stream. The correct term for that is
> "terminal". So a suitable keyword might well be "term".

I do like 'term' better.

> Whitespace skipping (for suitable values of "whitespace") is a critical
> feature of parsers. I'd go so far as to say that it's *the* killer feature of
> Parse::RecDescent.
>
> What you want is *whitespace* skipping (where comments are a special form of
> whitespace). What you *really* want is whitespace skipping where you get
> to define what constitutes whitespace in each context where whitespace might
> be skipped.

That really isn't "whitespace" skipping, though. Calling it whitespace skipping
conflates two concepts that are only slightly related. I agree that skipping is
an important feature in parsers.

> But the defining characteristic of a "terminal" is that you try to match
> it exactly, without being smart about what to ignore. That's why I like the
> fundamental rule/token distinction as it is currently specified.

Can you give me some additional characteristics for 'term' beyond just "turn
off :skip"? Grammars also need to turn off skipping in rules that aren't
terminals, and the different keyword is entirely inappropriate in those cases.
Since you'd need to use ':!skip' (or whatever syntax) on other rules anyway, it
doesn't make sense to use 'term' anywhere unless it provides some additional
intelligent defaults for terminals.

> > I also suggest a new modifier for comment skipping (or skipping in
> > general) that's separate from :words, with semantics much closer to
> > Parse::RecDescent's 'skip'.
>
> Note, however, that the recursive nature of Parse::RecDescent's <skip>
> directive is a profound nuisance in practice, because you have to
> remember to turn it off in every one of the terminals.

And in the current form you have to remember to use 'token' for all the
terminals. Not really a significant difference in mental effort.

> In light of all that, perhaps :words could become :skip, which defaults to
> :skip(/<ws>/) but allows you to specify :skip(/whatever/).

Including :skip(/<someotherrule>/). Yes, agreed, it's a huge improvement. I'd
be more comfortable if the default rule to use for skipping was named <skip>
instead of <ws>. (On IRC <sep> was also proposed, but the connection between
:skip and <skip> is more immediately obvious.)

> As for the keywords and behaviour, I think the right set is:
>
>                          Default         Default
>    Keyword    Where      Backtracking    Skipping
>
>    regex      anywhere   :!ratchet       :!skip
>    rule       grammars   :ratchet        :skip
>    term       grammars   :ratchet        :!skip

And I think the right set is:

    rule       anywhere   :!ratchet       :!skip
    rule       grammars   :ratchet        :!skip

(Assuming that the universal base grammar class has :ratchet set, and anyone
can unset it with :!ratchet on their grammar or on individual rules. Also
assuming that we make it easy to turn on :skip for a grammar.)

> I do agree that a rule should inherit properties from its grammar, so
> you can write:
>
>     grammar Perl6 is skip(/[<ws>+ | \# <brackets> | \# \N]+/) {
>         ...
>     }
>
> to allow your grammar to redefine in one place what its rules skip.

To quote a friend: Yay! :)

Allison

Ruud H.G. van Tol

May 10, 2006, 7:33:10 AM

to p6l

Allison Randal schreef:
> Damian:
>> "Match" is a better word for what comes back from
>> a regex match (what we currently refer to as a Capture, which is
>> okay too).
>
> I agree there. I still prefer 'rule'.

Maybe matex (mat-ex) for "matching expression" and, within that,
capex/captex (cap-ex/capt-ex) for "capturing expression"?

--
Groet, Ruud

Ruud H.G. van Tol

May 10, 2006, 7:41:29 AM

to p6l

Damian Conway schreef:

>     grammar Perl6 is skip(/[<ws>+ | \# <brackets> | \# \N]+/) {
>         ...
>     }

I think that first "+" is superfluous.

Doubly so if <ws> already stands for the run of all consecutive
word-separators.

--
Groet, Ruud

Patrick R. Michaud

May 10, 2006, 11:06:03 AM

to Damian Conway, p6l

On Wed, May 10, 2006 at 06:07:54PM +1000, Damian Conway wrote:
>
> >Including :skip(/<someotherrule>/). Yes, agreed, it's a huge
> >improvement. I'd be more comfortable if the default rule to
> >use for skipping was named <skip> instead of <ws>.
> >(On IRC <sep> was also proposed, but the connection between
> >:skip and <skip> is more immediately obvious.)
>
> Yes, I like <skip> too. I too keep mistakenly reading <ws> as "WhiteSpace".

FWIW, I recently noticed in another language
definition the phrase "intertoken space" as being something
that can occur on either side of any token, but not within
a token. Perhaps some abbreviation or variation of that could
work in place of either "ws" or "skip".

(Somehow "skip" seems too verbish to me, when the other
subrules we tend to see in a rule tend to be nounish. Yes, I
know that "skip" can be a noun as well, it just feels wrong.)

> I'm still utterly convinced my original three-keyword list is the right one
> (and that the three keywords in it are the right ones too).

Having played with regex/token/rule in the perl6 grammar a bit
further, as well as looking at a couple of others, I'm finding
regex/token/rule to be fairly natural. It only becomes unnatural
if I'm trying hard to optimize things -- e.g., by using "token" instead
of "rule" to avoid unnecessary calls to <?ws>. (And it may well turn
out that trying to avoid these calls is a premature or incorrect
optimization anyway -- I won't know until I'm a little farther along
in the grammars I'm working with.)

Pm

Larry Wall

May 10, 2006, 12:17:21 PM

to p6l

On Wed, May 10, 2006 at 11:25:26AM +1000, Damian Conway wrote:
: True. "Token" is the wrong word for another reason: a token is a
: segmented component of the input stream, *not* a rule for matching
: segmented components of the input stream. The correct term for that is
: "terminal". So a suitable keyword might well be "term".

There are several problems with that. A small problem is that
"term" is the same length as "rule", and that makes it harder to
tell them apart visually. A larger problem is that, unfortunately,
"term" is one of the more heavily overloaded terms (pun intended)
in computing. Even in Perl 5 culture we use it *heavily* to mean
"non-infix". Calling infix:<*> a "term" really grates for that reason.

The overloading of "token" is much milder, and I'd rather take the
core metaphor of token and extend it to the supertoken, because
the intent is the same. The intent of a token is to present a
simple interface outward. The same is true for the supertoken.
Structurally a supertoken is rather like an object, insofar as it
has a simple outside and a complicated inside. That complicated
inside is expressed by the fact that the supertoken calls out to a
subrule. But the supertoken itself still wants to be treated simply
in its own context, just as any object can be treated as a scalar.
The interface to a postcircumfix requires token parsing on the
outside, despite allowing full expressions on the inside. But as
with the sub/multi/method distinction, the primary motivation is to
distinguish the outward interface, that is, how they are to be used.

So anyway, I think "token" is sufficiently close to what we want
it to mean that we can force it to mean that, and it's sufficiently
orphaned that few people are going to complain about impressing it
into forced labor. And, in fact, the larger cultural meaning of
token implies that it's something simple that represents something
complicated, as in "a token of our appreciation."

Larry

Damian Conway

May 10, 2006, 4:26:50 PM

to p6l

Larry wrote:

> So anyway, I think "token" is sufficiently close to what we want
> it to mean that we can force it to mean that, and it's sufficiently
> orphaned that few people are going to complain about impressing it
> into forced labor.

I'm perfectly fine with that. To quote myself out of context:

    But almost nobody knows what [the word] actually means, and of
    those few only a tiny number of pedants actually *care* anymore.
    So does it matter?

;-)

Damian

Uri Guttman

May 10, 2006, 5:42:52 PM

to Allison Randal, Damian Conway, p6l

>>>>> "AR" == Allison Randal <all...@wgz.org> writes:

AR> Including :skip(/<someotherrule>/). Yes, agreed, it's a huge
AR> improvement. I'd be more comfortable if the default rule to use
AR> for skipping was named <skip> instead of <ws>. (On IRC <sep> was
AR> also proposed, but the connection between :skip and <skip> is more
AR> immediately obvious.)

a small point but why not have both <ws> and <skip> be aliased to each
other? i like the <skip> connection but <ws> is (usually) about skipping
white space which is likely the most commonly skipped text. both names
have value so we should have both. and i think in most cases you won't
see many explicit <skip> or <ws> as they will be implied by the
whitespace in the rule/term/whatever that has skipping enabled.

uri

--
Uri Guttman ------ u...@stemsystems.com -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org

Allison Randal

May 10, 2006, 8:58:57 PM

to dam...@conway.org, p6l

To summarize a phone call today, the more intelligent defaults we add to
differently named rule keywords the more comfortable I am with having
different names. So, here's what we have so far (posted both as an FYI
and to confirm that we have the coherent solution I think we have):

rule:
- Has :ratchet and :skip turned on by default

- May only be used inside a grammar

- Takes default modifiers (a.k.a. traits) from the grammar in which it
is defined

- Is inherited by subclasses of a grammar

- The default modifiers can be turned off by :!ratchet and :!skip both
for individual rules and for an entire grammar (I'd like to see some
syntax for this)

regex:
- Has no modifiers turned on by default

- May be used inside and outside a grammar

- Inside a grammar, it is not inherited by subclasses of the grammar

- Inside a grammar, it does not take default modifiers from the grammar

- Individual regexen can turn on the :ratchet or :skip modifiers

token:
- Has :ratchet turned on by default

- Is inherited by subclasses of a grammar

- Does not take default modifiers from the grammar

- Individual token rules can turn off the :ratchet modifier with
:!ratchet, and can turn on :skip

- (I'd still like to see more for token, perhaps some optimizations that
are possible when you're certain you have a terminal, like "cannot call
subrules")

skip:
- We keep :words as shorthand for :skip(/<ws>/)

- And :skip is shorthand for :skip(/<skip>/)

- To change skipping behavior: a) override <skip> in your grammar, b)
set :skip(/.../) on an individual rule, or c) set 'is skip(/.../)' on a
grammar

- <ws> is optional whitespace, following skippy behavior (and it always
behaves the same no matter what the current :skip pattern is)

- <sp> is a single character of obligatory whitespace
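
A minimal Python sketch of the defaults listed above, to make the three keywords easier to compare. The names `KeywordDefaults`, `KEYWORDS` and `effective_modifiers` are hypothetical, and the precedence order (keyword defaults, then grammar defaults, then per-rule modifiers) is an assumption rather than something the summary spells out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeywordDefaults:
    ratchet: bool                # :ratchet on by default?
    skip: bool                   # :skip on by default?
    takes_grammar_defaults: bool  # picks up modifiers set on the grammar?
    inherited: bool              # visible in subclasses of the grammar?


# One entry per keyword, copied from the summary above.
KEYWORDS = {
    "rule": KeywordDefaults(ratchet=True, skip=True,
                            takes_grammar_defaults=True, inherited=True),
    "regex": KeywordDefaults(ratchet=False, skip=False,
                             takes_grammar_defaults=False, inherited=False),
    "token": KeywordDefaults(ratchet=True, skip=False,
                             takes_grammar_defaults=False, inherited=True),
}


def effective_modifiers(keyword, grammar_defaults=None, overrides=None):
    """Return the modifiers in force for one rule/regex/token.

    Assumed precedence: keyword defaults < grammar defaults < per-rule
    overrides such as :!ratchet or :skip.
    """
    kw = KEYWORDS[keyword]
    mods = {"ratchet": kw.ratchet, "skip": kw.skip}
    if kw.takes_grammar_defaults and grammar_defaults:
        mods.update(grammar_defaults)
    if overrides:
        mods.update(overrides)
    return mods
```

For example, `effective_modifiers("rule", grammar_defaults={"ratchet": False})` turns ratcheting off for a rule, while the same grammar default leaves a token untouched because tokens do not take defaults from the grammar.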

Allison
--
"E pur si muove!"
-- apocryphally attributed to Galileo Galilei

Patrick R. Michaud

May 10, 2006, 10:17:34 PM5/10/06

to Allison Randal, dam...@conway.org, p6l

On Wed, May 10, 2006 at 05:58:57PM -0700, Allison Randal wrote:
> To summarize a phone call today, the more intelligent defaults we add to
> differently named rule keywords the more comfortable I am with having
> different names. So, here's what we have so far (posted both as an FYI
> and to confirm that we have the coherent solution I think we have):

> [...]

> skip:
> - We keep :words as shorthand for :skip(/<ws>/)
> - And :skip is shorthand for :skip(/<skip>/)

> [...]

Please, describe these with <?ws> and <?skip> to make clear their
non-capturing semantic. :-)

But Allison's message helps me to crystallize what has been
bugging me about the term ":skip" (and to a lesser extent ":words")
in describing what they do. So, I'll offer my thoughts here
in case anyone wants to pick it up before we go a-changing S05
yet again. (If no-one picks it up, I'll just wait for S05 to
be updated to whatever is decided and implement that. :-)

Whitespace in regexes and rules is metasyntactic, in that it is
not matched literally. Effectively what the :w (or :words or
:skip) option does is to change the metasyntactic meaning of
any whitespace found in the regex. Or, another way of thinking
of it -- as S05 currently stands, 'regex' and 'token' cause
the pattern whitespace to be treated as <?null>, while 'rule'
causes the pattern whitespace to become <?ws>.

So what we're really doing with this option--whatever we
call it--is to specify what the whitespace _in the pattern_
should match. Somehow ":skip" and <?skip> don't carry that
meaning for me.

In some sense it seems to me that the correct adverb is
more along the lines of :ws, :white, or :whitespace, in that
it says what to do with the whitespace in the pattern. It
doesn't have to say anything about whether the pattern's
whitespace is actually matching \s* (although the default
rule for :ws/:white/:whitespace could certainly provide that
semantic).

I can fully see the argument that people will still
confuse :ws and <?ws> with "whitespace in the target",
when in reality they specify the meaning of whitespace
in the regex pattern, so :ws might not be the right choice
for the adverb. But I think that something more closely
meaning "whitespace in the pattern means /this/" would be a
better adverb than :skip.
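
The "whitespace in the pattern means /this/" reading can be sketched in Python. This is only an emulation under loud assumptions: `compile_pattern` is a hypothetical helper, pattern atoms are assumed to contain no whitespace of their own, and `\s*` is a simplified stand-in for <?ws>.

```python
import re


def compile_pattern(source: str, ws_meaning: str = "") -> re.Pattern:
    """Compile `source`, giving each run of pattern whitespace the meaning
    `ws_meaning` (a regex fragment).

    ws_meaning=""     -> whitespace is <?null>, as for 'regex' and 'token'
    ws_meaning=r"\s*" -> whitespace is (roughly) <?ws>, as for 'rule'
    """
    atoms = source.split()  # whitespace only separates atoms here
    return re.compile(ws_meaning.join(atoms))


as_token = compile_pattern("foo bar")          # same as /foobar/
as_rule = compile_pattern("foo bar", r"\s*")   # same as /foo\s*bar/
```

The same pattern text "foo bar" then matches "foobar" only when compiled as a token, and also matches "foo   bar" when compiled as a rule. Passing any other fragment as `ws_meaning` plays the part of :ws(/.../).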

If someone *really* wants to use "skip", there's always
:ws(/<?skip>/) (or whatever we choose) which means
"whitespace in the regex matches <?skip>".

> - <sp> is a single character of obligatory whitespace

This one has bugged me since the day I first saw it implemented
in PGE. We _already_ have \s, <blank>, and <space> to represent
the notion of "a whitespace character" -- do we really need a
separate <sp> form also? (An idle thought: perhaps "sp" is
better used as an :sp adverb and a corresponding <?sp> regex?)

Pm

Damian Conway

May 10, 2006, 10:24:15 PM5/10/06

to p6l

Allison admirably summarized:

> rule:
>
> regex:
>
> token:
>
> skip:
> - We keep :words as shorthand for :skip(/<ws>/)
>
> - And :skip is shorthand for :skip(/<skip>/)

...where <skip> defaults to <ws>, but is distinct from it (i.e. it can be
redefined independently).

> - To change skipping behavior: a) override <skip> in your grammar, b)
> set :skip(/.../) on an individual rule, or c) set 'is skip(/.../)' on a
> grammar
>
> - <ws> is optional whitespace,

Not quite. <ws> is semi-optional whitespace. More precisely, it's not optional
between two identifier characters:

    token ws { <after \w> \s+ <before \w>
             | <after \w> \s* <before \W>
             | <after \W> \s*
             }

> following skippy behavior (and it always behaves the same no matter
> what the current :skip pattern is)
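
A rough Python emulation of the three branches of that token, to show where whitespace is obligatory. `match_ws` is a hypothetical helper. It assumes that each lookaround needs an actual character to inspect (so nothing matches at offset 0, and the <before> branches fail at end of input), and it takes the longest whitespace run without backing off, as :ratchet implies.

```python
import re

SPACES = re.compile(r"\s*")
WORD_CHAR = re.compile(r"\w")


def match_ws(text: str, pos: int):
    """Return the end offset of a <ws> match starting at `pos`, or None."""
    if pos == 0:
        return None  # <after ...> has no preceding character to look at
    end = SPACES.match(text, pos).end()  # longest run of whitespace
    after_word = WORD_CHAR.match(text[pos - 1]) is not None
    if not after_word:
        return end  # branch 3: <after \W> \s*
    if end == len(text):
        return None  # branches 1 and 2 need a following character
    before_word = WORD_CHAR.match(text[end]) is not None
    if before_word:
        # branch 1: between two identifier characters, \s+ is required
        return end if end > pos else None
    return end  # branch 2: <after \w> \s* <before \W>
```

With this sketch, `match_ws("ab", 1)` fails because there is no whitespace between two identifier characters, while `match_ws("a+b", 1)` succeeds with an empty match.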

Damian

Mr Green

May 11, 2006, 2:34:12 AM5/11/06

to p6l

On 2006-May-10 at 1:38, James Mastros wrote:
>Can I suggest we keep match meaning thing you get when you run a thingy
>against a string, and make "matcher" be the thingy that gets run?

Speaking of the word "match", what I'd really like is to keep it meaning stuff
that matches. Unfortunately it also seems to get used to mean an "attempted
match", which, if it fails, is not a match at all. This leads to the phrase
"successful match", which sounds a bit bizarre and is redundant in ordinary
English. S05 uses "match" in both senses, and more than once I had to, er,
backtrack to figure out which meaning was intended.

Obviously, good words are needed for both meanings: "match" should always stand
for a "successful match" ('cause that's what the word actually means), and some
other term for the act of comparing two things to see whether or not they do
happen to match. (The word "compare" comes to mind.)

-David "grudge match" Green

Ruud H.G. van Tol

May 11, 2006, 6:23:37 AM5/11/06

to p6l

mr.g...@telus.net schreef:
> James Mastros:

Great, a match to light a language contest.
A match can be partial, a loose matching bolt can crash a(n
aero| )plane.
A match has context, like with clothes: a suiting match, a matching
suit.
:)

--
Groet, Ruud

Audrey Tang

May 11, 2006, 8:57:53 AM5/11/06

to Patrick R. Michaud, p6l

Patrick R. Michaud wrote:
>> - <sp> is a single character of obligatory whitespace

Hmm, it's literal ' ' (that is, \x20), not "whitespace" in general,
right? For "obligatory whitespace" we have \s.

> This one has bugged me since the day I first saw it implemented
> in PGE. We _already_ have \s, <blank>, and <space> to represent
> the notion of "a whitespace character" -- do we really need a
> separate <sp> form also? (An idle thought: perhaps "sp" is
> better used as an :sp adverb and a corresponding <?sp> regex?)

Well, without /<?sp>/ to stand for /\x20/, it'd have to be written as
/<' '>/, which is a bit suboptimal. Or as /\ /, which is even more
suboptimal...

Audrey

Jonathan Scott Duff

May 11, 2006, 10:59:26 AM5/11/06

to Allison Randal, dam...@conway.org, p6l

On Wed, May 10, 2006 at 05:58:57PM -0700, Allison Randal wrote:

> rule:
> - Has :ratchet and :skip turned on by default
>
> - May only be used inside a grammar

Should that be

- Must be declared as part of a grammar or role

???

The verb "used" doesn't make much sense to me there. I use a rule
when I'm applying it as a pattern to a string. The situation where
rules can be defined anywhere but must only be used in a grammar
doesn't make sense to me, so I assume that you meant that "rules must
belong to a grammar". (btw, I also assumed that "may only" really
meant "must")

And if we're keeping the correspondence between classes+methods and
grammars+rules, then surely grammars are composable entities just
like classes.

Seeking clarification,

-Scott
--
Jonathan Scott Duff
du...@pobox.com

Patrick R. Michaud

May 11, 2006, 11:09:29 AM5/11/06

to Audrey Tang, p6l

On Thu, May 11, 2006 at 08:57:53PM +0800, Audrey Tang wrote:
> Patrick R. Michaud wrote:
> >> - <sp> is a single character of obligatory whitespace
>
> Hmm, it's literal ' ' (that is, \x20), not "whitespace" in general,
> right? For "obligatory whitespace" we have \s.

Oops, you're correct, I forgot that <sp> is already \x20.

Allison's proposed definition of <sp> above seems to want to
change that to "obligatory whitespace". That's more of what
I was reacting against.

For summary, here's how I currently read S05's space/whitespace
rules (and what PGE implements, or is expected to implement):

    space character:  \x20 \o40 <' '> <?sp> <[ ]> <+[ ]> backslash+space
    whitespace:       \s <?space> <?blank>

> > We _already_ have \s, <blank>, and <space> to represent
> > the notion of "a whitespace character" -- do we really need a
> > separate <sp> form also? (An idle thought: perhaps "sp" is
> > better used as an :sp adverb and a corresponding <?sp> regex?)
>
> Well, without /<?sp>/ to stand for /\x20/, it'd have to be written as
> /<' '>/, which is a bit suboptimal. [...]

I agree, <sp> makes more sense as \x20, so I retract my idle thought.

Thanks,

Pm

Daniel Hulme

May 11, 2006, 11:21:54 AM5/11/06

to perl6-l...@perl.org

> >Including :skip(/<someotherrule>/). Yes, agreed, it's a huge
> >improvement. I'd be more comfortable if the default rule to use for
> >skipping was named <skip> instead of <ws>. (On IRC <sep> was also
> >proposed, but the connection between :skip and <skip> is more
> >immediately obvious.)

> Yes, I like <skip> too. I too keep mistakenly reading <ws> as
> "WhiteSpace".

For another datapoint, I like the idea of "<wb>" as word-boundary. After
all, when you're tokenizing input, you're interested in the boundaries
that separate tokens rather than the whitespace or what you do with it.
Although I like the connection between <skip> and :skip, <skip> to me
isn't very suggestive, and <ws> sounds too much like whitespace. <wb>,
to me at least, is reminiscent of \b, and of Vim's \< \> for word
boundaries.

I'm sure I'll get used to whatever the final name is, though; just
wanted to spread ideas. There are, to my mind, two ways of looking at
whitespace:

1) Whitespace in regexes is ignored other than to delineate tokens in
the regex. :skip() defines which characters in the input string are
skipped over by the matcher (regex engine, whatever you want to call
it).

2) Whitespace in regexes is significant. :skip() defines the meaning of a
block of whitespace in the regular expression.

AFAICS, both these states of mind come out to the same thing in the end
(someone correct me if I'm wrong), but the naming scheme makes much more
sense if you are thinking about it the first way.

--
"For God's sake, please give it up. Fear it no less than the sensual
passion, because it, too, may take up all your time and deprive you of
your health, peace of mind and happiness in life." Wolfgang Bolyai,
urging his son to give up his research on non-Euclidean geometry

Ruud H.G. van Tol

May 11, 2006, 11:42:01 AM5/11/06

to p6l

Audrey Tang wrote:
> Patrick R. Michaud wrote:

>> - <sp> is a single character of obligatory whitespace
>
> Hmm, it's literal ' ' (that is, \x20), not "whitespace" in
> general, right? For "obligatory whitespace" we have \s.

Are all or some of the following equivalent to <sp>?

U+00A0 No-Break Space
U+202F Narrow No-Break Space
U+FEFF Zero Width No-Break Space
U+2060 Word Joiner

Many more here, like the Nut and the Mutton:
http://en.wikipedia.org/wiki/Space_character
(with nice links)

--
Groet, Ruud

Audrey Tang

May 11, 2006, 12:27:03 PM5/11/06

to Ruud H.G. van Tol, p6l

Ruud H.G. van Tol wrote:
> Are all or some of the following equivalent to <sp>?
>
> U+00A0 No-Break Space
> U+202F Narrow No-Break Space
> U+FEFF Zero Width No-Break Space
> U+2060 Word Joiner

No. A05 makes it explicit <sp> is just \x20, and S05 also says that it
matches one "space char", which also means U+0020 SPACE, although more
vaguely.

I think S05 can use this clarification diff:

- / <sp> / # match a space char
+ / <sp> / # match the SPACE character (U+0020)
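
The distinction can be checked mechanically. The sketch below uses Python, so `\s` here is Python's notion of Unicode whitespace and only approximates what a Perl 6 implementation would use; `SP`, `WHITESPACE` and `classify` are hypothetical names.

```python
import re
import unicodedata

SP = re.compile(r"\x20")        # stand-in for <sp>: exactly U+0020 SPACE
WHITESPACE = re.compile(r"\s")  # stand-in for \s: any whitespace character

# U+0020 plus the four characters asked about earlier in the thread.
CANDIDATES = ["\u0020", "\u00A0", "\u202F", "\uFEFF", "\u2060"]


def classify(ch: str):
    """Return (Unicode name, matches <sp>?, matches \\s?) for one character."""
    return (
        unicodedata.name(ch),
        SP.fullmatch(ch) is not None,
        WHITESPACE.fullmatch(ch) is not None,
    )


for ch in CANDIDATES:
    name, is_sp, is_ws = classify(ch)
    print(f"U+{ord(ch):04X} {name:<26} sp={is_sp} whitespace={is_ws}")
```

Only U+0020 matches the <sp> stand-in. In Python the two no-break spaces still count as whitespace for `\s`, while U+FEFF and U+2060 match neither pattern.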

Thanks,
Audrey

Allison Randal

May 11, 2006, 2:49:02 PM5/11/06

to dam...@conway.org, p6l

Damian Conway wrote:
>
>> skip:
>> - We keep :words as shorthand for :skip(/<ws>/)
>>
>> - And :skip is shorthand for :skip(/<skip>/)
>
> ...where <skip> defaults to <ws>, but is distinct from it (i.e. it can
> be redefined independently).

It also has the benefit that developers redefining <skip> can call <ws>
as one of the alternates in their skip rule.

I'm tempted to make <skip> default to [\# \N*|<ws>], considering the
number of languages and non-languages that use that commenting form. It
provides a useful distinction between the default forms of :words and
:skip, and an intelligent default. But, there's potential for confusion
if someone is parsing say, a file of phone numbers each pre-pended with
"#". (Of course, it could be argued that if they really only want
whitespace skipped, they should use :words.)
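
A small Python sketch of that trade-off. `WORDS_SKIP` and `COMMENT_SKIP` are hypothetical stand-ins for the :words default and for the proposed [\# \N*|<ws>] default, with <ws> simplified to plain whitespace; `tokens` is a toy tokenizer, not anything from S05.

```python
import re

WORDS_SKIP = re.compile(r"\s*")                # skip whitespace only
COMMENT_SKIP = re.compile(r"(?:#[^\n]*|\s)*")  # skip '#' comments and whitespace
TOKEN = re.compile(r"\S+")                     # toy token: a run of non-whitespace


def tokens(text: str, skip: re.Pattern):
    """Alternate between skipping and reading one token until input ends."""
    pos, found = 0, []
    while True:
        pos = skip.match(text, pos).end()  # both skip patterns can match empty
        if pos >= len(text):
            return found
        m = TOKEN.match(text, pos)         # a non-whitespace character is next
        found.append(m.group())
        pos = m.end()


source = "x = 1 # set x\ny = 2"
phones = "#555-1234\n#555-9876\n"
```

On `source` the comment-aware skip drops "# set x" and yields six tokens. On `phones` it swallows every line and yields none, whereas the whitespace-only skip keeps both numbers, which is the confusion described above.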

>> - <ws> is optional whitespace,
>
> Not quite. <ws> is semi-optional whitespace. More precisely, it's not
> optional between two identifier characters:
>
>     token ws { <after \w> \s+ <before \w>
>              | <after \w> \s* <before \W>
>              | <after \W> \s*
>              }

Right, that's "skippy behavior".

> > following skippy behavior (and it always behaves the same no matter
> > what the current :skip pattern is)

Allison

Allison Randal

May 11, 2006, 3:02:21 PM5/11/06

to Patrick R. Michaud, Audrey Tang, p6l

Patrick R. Michaud wrote:
> On Thu, May 11, 2006 at 08:57:53PM +0800, Audrey Tang wrote:
>> Patrick R. Michaud wrote:
>>>> - <sp> is a single character of obligatory whitespace
>> Hmm, it's literal ' ' (that is, \x20), not "whitespace" in general,
>> right? For "obligatory whitespace" we have \s.
>
> Oops, you're correct, I forgot that <sp> is already \x20.
>
> Allison's proposed definition of <sp> above seems to want to
> change that to "obligatory whitespace". That's more of what
> I was reacting against.

Read that line above as "all current abbreviations for various forms of
obligatory whitespace remain the same". And I agree with Audrey that the
S05 text needs to be clarified.

Allison

Allison Randal

May 11, 2006, 3:19:21 PM5/11/06

to p6l

Jonathan Scott Duff wrote:
> On Wed, May 10, 2006 at 05:58:57PM -0700, Allison Randal wrote:
>> rule:
>> - Has :ratchet and :skip turned on by default
>>
>> - May only be used inside a grammar
>
> Should that be
>
> - Must be declared as part of a grammar or role
>
> ???

It is:

- The 'rule' keyword may only be used inside a grammar

> And if we're keeping the correspondence between classes+methods and
> grammars+rules, then surely grammars are composable entities just
> like classes.

The distinction between inheritance and composition isn't as significant
for grammars as it is for classes, since you can create a Match object
instance from a single rule in isolation.

Allison

Allison Randal

May 11, 2006, 3:48:10 PM5/11/06

to Patrick R. Michaud, p6l

Technically, true. But understanding that requires a deep understanding
of what's happening in the grammar. With 'skip' all the average user
needs to understand is "I'm telling the grammar to ignore these things".

As a side note, trying to talk about both whitespace as literal thing
that is matched and whitespace as a metasyntactic thing that is ignored
requires a great deal of circumlocution, which is often a good trigger
for language change to use a different word for one thing or the other.

Allison

Jonathan Scott Duff

May 11, 2006, 3:48:28 PM5/11/06

to Allison Randal, p6l

On Thu, May 11, 2006 at 12:19:21PM -0700, Allison Randal wrote:
> Jonathan Scott Duff wrote:
> >On Wed, May 10, 2006 at 05:58:57PM -0700, Allison Randal wrote:
> >>rule:
> >>- Has :ratchet and :skip turned on by default
> >>
> >>- May only be used inside a grammar
> >
> >Should that be
> >
> >- Must be declared as part of a grammar or role
> >
> >???
>
> It is:
>
> - The 'rule' keyword may only be used inside a grammar

So, just to be clear, does that mean that the following holds:

    # assume no surrounding grammar-context
    rule foo { ... }                         # compile-time error, no grammar
    my $ar = rule { ... }                    # compile-time error, no grammar

    grammar Foo;
    rule bar { ... }                         # legal, Foo::bar rule
    my $ar = rule { ... }                    # legal, Foo::ANON rule

    # assume no surrounding grammar-context
    rule Foo::bar { ... }                    # legal, Foo::bar rule
    my $ar = grammar Foo { rule { ... } }    # legal, Foo::ANON rule

And the way to get a grammarless rule is to use either rx or regex with
the appropriate modifiers.

Allison Randal

May 11, 2006, 4:55:37 PM5/11/06

to p6l

Oh, and since we're calling them "regexes", I suggest calling them
"regular expressions" too, since both "regex(p)" and "regular
expression" have taken on the popular meaning of "pattern matching". If
we're going to be anti-pedantic, let's be consistently anti-pedantic. :)

Allison

Larry Wall

May 11, 2006, 5:08:02 PM5/11/06

to p6l

On Thu, May 11, 2006 at 01:55:37PM -0700, Allison Randal wrote:
: Oh, and since we're calling them "regexes", I suggest calling them
: "regular expressions" too, since both "regex(p)" and "regular
: expression" have taken on the popular meaning of "pattern matching". If
: we're going to be anti-pedantic, let's be consistently anti-pedantic. :)

Consistency is the hobgoblin of small languages.

Larry