
S5: range quantifier woes


Jonathan Scott Duff

Sep 17, 2004, 10:57:14 AM
to perl6-l...@perl.org

The new range quantifier syntax has been bothering me. For reference,
here's the bit of S5 that talks about it:

> The repetition specifier is now **{...} for maximal matching, with a
> corresponding **{...}? for minimal matching. Space is allowed on
> either side of the asterisks. The curlies are taken to be a closure
> returning a number or a range.
>
> / value was (\d ** {1..6}?) with ([\w]**{$m..$n}) /
>
> It is illegal to return a list, so this easy mistake fails:
>
> / [foo]**{1,3} /

Now for the bothersome parts and some questions and some suggestions in
no particular order:

- for minimal matching the ? is too far away from the operator that it
applies to. It looks like it's doing something to the closure (and
maybe it is). Should that be [foo]**?{$m..$n} instead?
- Must the closure take the exact form of stuff in curlies? What
would these do?
$c = sub { 0..5 };
/[foo]**$c/; # error?
/[foo]**&somesub/; # error?
- Is the rationale behind making [foo]**{1,3} illegal strictly to
catch the semantic error of those migrating from Perl 5? Because it
certainly seems like it could be a useful thing otherwise.
- because the closure is executed first, you have to read ahead to the
end of the closure and then look back to see what you were
quantifying when trying to grok the code. This isn't such a big deal
if you just have a range, but it's a closure so all sorts of things
can be in there!
- Bringing a closure into the picture seems to put too much power in
such a simple construct. [foo]**{ destroy_the_world; 0... }
- I've always viewed the minimal matching ? as a kind of modifier on
the quantifiers. If that illusion is to remain true in Perl 6,
I'd want an optional colon: [foo]*:? Whitespace would disambiguate the
"modifier colon" from the "no backtrack" or "cut" operator (it would
parse as [ foo ] * :? and I also seem to recall that we already have a
whitespace disambiguation rule for ::). And if we apply this idea to the
range quantifier, that would give us something like these:

[foo]*:5 # match exactly 5 times
[foo]*:{0...} # verbose [foo]*
[foo]*:{1...} # verbose [foo]+
[foo]*:{1..5} # match from 1 to 5 times
[foo]*:{[1,3,5]} # match exactly 1, 3, or 5 times
[foo]*:{@foo} # treat each element of @foo as a
# number and only match that
# many times. (same as previous
# basically)
[foo]*:{&foo} # match based on the return value of &foo
[foo]*:{%foo} # ???

Those last few suddenly make me want junctioned ranges, though I
don't know what I'd use them for :)

- An alternate syntax was proposed on IRC yesterday. I'm not sure if I
remember the specifics right, but the gist of it is to use a ~
character to offset the ranges, so ...

[foo]~5 # match exactly 5 times
[foo]~{0...} # verbose [foo]*
[foo]~{1...} # verbose [foo]+
[foo]~{1..5} # match from 1 to 5 times
[foo]~{[1,3,5]} # match exactly 1, 3, or 5 times
[foo]~{@foo} # treat each element of @foo as a
# number and only match that
# many times. (same as previous
# basically)
[foo]~{&foo} # match based on the return value of &foo
[foo]~{%foo} # ???

And surely these can be made to work:

[foo]~[0...] # [foo]:[0...]
[foo]~[1,3,5] # [foo]:[1,3,5]
[foo]~@foo # [foo]:@foo

Yes, I realize that the "bag" variants (e.g., /[foo]*:{@foo}/) could be
nightmarish for optimization (e.g. you can't assume monotonically
increasing values). And would "minimal match" mean stop when you've
reached the first number in the list, or do you have to evaluate the
whole thing and literally find the minimum value? (Similar reasoning
and questions apply for the regular greedy version.) These may be really
good arguments for not including that particular variant, but I don't
know that :-)

----

On the whole, I liked the simplicity of the old <$m..$n> (or even
<$m,$n>) and would like something just like it only without the
ambiguity of <$m>. I'd even suggest <+$m> as a disambiguating mechanism
if we weren't using + and - for "character" classes.

-Scott
--
Jonathan Scott Duff
du...@pobox.com
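
The greedy versus minimal distinction described in the S5 excerpt above can be tried out in any engine with bounded quantifiers. The sketch below uses Python's re module as a stand-in, not Perl 6 syntax: {m,n} plays the role of **{$m..$n} and {m,n}? the role of **{$m..$n}?. The helper name bounded and the sample string are made up for illustration.

```python
import re

def bounded(unit, m, n, lazy=False):
    """Build a regex fragment repeating `unit` between m and n times.

    Hypothetical helper: it interpolates runtime bounds into the pattern,
    roughly what **{$m..$n} asks the Perl 6 engine to do.
    """
    return f"(?:{unit}){{{m},{n}}}" + ("?" if lazy else "")

text = "value was 123456 with abcdef"

# Greedy: the quantifier takes as many digits as the upper bound allows.
greedy = re.search(r"value was (" + bounded(r"\d", 1, 6) + ")", text)

# Minimal: the quantifier stops at the lower bound when the rest still matches.
minimal = re.search(r"value was (" + bounded(r"\d", 1, 6, lazy=True) + ")", text)

print(greedy.group(1))   # 123456
print(minimal.group(1))  # 1
```

Note that in this notation the comma is the range separator, which is exactly the Perl 5 habit that S5 turns into an error for **{1,3}.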

Juerd

Sep 17, 2004, 11:15:58 AM
to Jonathan Scott Duff, perl6-l...@perl.org
Jonathan Scott Duff wrote 2004-09-17 9:57 (-0500):

> [foo]~5 # match exactly 5 times
> [foo]~{0...} # verbose [foo]*
> [foo]~{1...} # verbose [foo]+
> [foo]~{1..5} # match from 1 to 5 times
> [foo]~{[1,3,5]} # match exactly 1, 3, or 5 times
> [foo]~{@foo} # treat each element of @foo as a
> # number and only match that
> # many times. (same as previous
> # basically)
> [foo]~{&foo} # match based on the return value of &foo
> [foo]~{%foo} # ???
> And surely these can be made to work:
> [foo]~[0...] # [foo]:[0...]
> [foo]~[1,3,5] # [foo]:[1,3,5]
> [foo]~@foo # [foo]:@foo

Easier:

Variable    Literal
\d~$foo     \d~5, \d~1..5    # Range object or integer.
\d~@foo     \d~[...]         # \d~any(...)
\d~&foo     \d~{...}         # Depending on what the closure returns.

Obviously, $foo can also be an arrayref or coderef.

I think whitespace around the ~ can be made valid without making
anything ambiguous. The expression at its right side should not have
whitespace in it. If it does have whitespace, [] or {} is needed to
disambiguate.


Juerd
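
The question of what "minimal" should mean for a list of counts, raised for the "bag" variants earlier in the thread, can be made concrete with a small experiment. This is a sketch in Python rather than Perl 6, and first_matching_count is a hypothetical helper: it simply tries each count in the order given and reports the first one that lets the match succeed. Nothing here reflects how a Perl 6 engine would actually do it.

```python
import re

def first_matching_count(unit, text, counts):
    """Return the first k in `counts` for which `unit` repeated exactly
    k times matches at the start of `text`, or None if none does."""
    for k in counts:
        if re.match(f"(?:{unit}){{{k}}}", text):
            return k
    return None

text = "foo" * 5
counts = [3, 5, 1]  # deliberately not monotonically increasing

in_list_order = first_matching_count("foo", text, counts)
smallest_first = first_matching_count("foo", text, sorted(counts))
largest_first = first_matching_count("foo", text, sorted(counts, reverse=True))

print(in_list_order, smallest_first, largest_first)  # 3 1 5
```

Trying the counts in list order and trying them in sorted order give different answers for the same input, which is the ambiguity in question; an engine would have to pick one rule and document it.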

Larry Wall

Sep 17, 2004, 1:05:14 PM
to perl6-l...@perl.org
On Fri, Sep 17, 2004 at 09:57:14AM -0500, Jonathan Scott Duff wrote:
: Now for the bothersome parts and some questions and some suggestions in
: no particular order:
:
: - for minimal matching the ? is too far away from the operator that it
: applies to. It looks like it's doing something to the closure (and
: maybe it is) Should that be [foo]**?{$m..$n} instead?

Yes, I felt that way too, and considered doing exactly what you
suggest, but decided that it doesn't make sense to make odd syntactic
exceptions for infrequently used constructs. We're trying to have
fewer random exceptions in Perl 6 than in Perl 5.

: - Must the closure take the exact form of stuff in curlies? What
: would these do?
: $c = sub { 0..5 };
: /[foo]**$c/; # error?
: /[foo]**&somesub/; # error?

Yes, those are not allowed. I considered doing that too, and rejected
it for similar reasons.

: - Is the rationale behind making [foo]**{1,3} illegal strictly to
: catch the semantic error of those migrating from perl 5? Because it
: certainly seems like it could be a useful thing otherwise.

Right. The idea is that someday we could allow random lists,
perhaps even immediately if people allow it by pragma, and if
the regex engine actually supports it, which is not a sure thing.
(On the other hand, it can actually be written now as an assertion
on the number of matches of a previous $1, so there's no big pressure
to make it work, and may never be enough pressure.)

: - because the closure is executed first, you have to read ahead to the
: end of the closure and then look back to see what you were
: quantifying when trying to grok the code. This isn't such a big deal
: if you just have a range, but it's a closure so all sorts of things
: can be in there!

Yes, that's a potential problem, just as you can do all sorts of
stuff in the condition of a C<while> statement modifier. Cultural
pressure will tend to work against that.

: - Bringing a closure into the picture seems to put too much power in
: such a simple construct. [foo]**{ destroy_the_world; 0... }

No more power than closures anywhere else in the regex. No more
power than plain old Perl outside the regex. I don't see why this
is any kind of an issue at all. The mere possibility of obfuscation
is not something Perl has ever been designed against. If anything,
the opposite is true. Expressive power can be used either for good
or ill, and Perl has generally opted for more potential goodness.

: - I've always viewed the minimal matching ? as a kind of modifier on
: either the quantifiers. If that illusion is to remain true in Perl6,
: I'd want an optional colon [foo]*:?

By that argument, the * is also a modifier and should have a colon. :-)

: Whitespace would disambiguate the
: "modifier colon" from the "no backtrack" or "cut" operator (it would
: parse as [ foo ] * :? I also seem to recall already have a whitespace
: disambiguation rule for ::). And if we apply this idea to the range
: quantifier, that would give us something like these:
:
: [foo]*:5 # match exactly 5 times
: [foo]*:{0...} # verbose [foo]*
: [foo]*:{1...} # verbose [foo]+
: [foo]*:{1..5} # match from 1 to 5 times
: [foo]*:{[1,3,5]} # match exactly 1, 3, or 5 times
: [foo]*:{@foo} # treat each element of @foo as a
: # number and only match that
: # many times. (same as previous
: # basically)
: [foo]*:{&foo} # match based on the return value of &foo
: [foo]*:{%foo} # ???
:
: Those last few suddenly make me want junctioned ranges, though I
: don't know what I'd use them for :)

I see no simplifications here from the point of view of either the
parser or the human. All I see are pitfalls. *: is rather ambiguous
with existing constructs. ** is completely illegal, just as *? and
+? were before we added the minimal modifier. Again, this is a seldom
used feature, and doesn't deserve special lookahead rules to determine
that the colon doesn't mean backtracking. Also, all the other :foo
modifiers modify the things after them, not the things before them.

: - An alternate syntax was proposed on IRC yesterday. I'm not sure if I
: remember the specifics right, but the gist of it is to use a ~
: character to offset the ranges, so ...

This feature is so completely not worth Yet Another Metacharacter.

: On the whole, I liked the simplicity of the old <$m..$n> (or even
: <$m,$n>) and would like something just like it only without the
: ambiguity of <$m>. I'd even suggest <+$m> as a disambiguating mechanism
: if we weren't using + and - for "character" classes.

**{$m..$n} and **{$m} are precisely one character longer than the <+$m..$n>
and <+$m> you are advocating. They have the mnemonic value of *
without the possibility of being confused with *. They have the right
Huffman coding with respect to the common quantifiers. They don't
visually pretend to be subrules when they're not, or entice people
to try to turn them into captures. They indicate visually that Perl
code is potentially being run by use of the braces. The Perl code
cannot be confused with the ** outside. The ** outside cannot be
confused with the Perl code inside. Best of all, there are no extra,
additional, optional, special rules to explain.

Larry
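
Larry's aside that a list of permitted counts "can actually be written now as an assertion on the number of matches" might look like the following in spirit. This is a Python sketch, not Perl 6: it matches a literal unit as many times as possible and then checks the repetition count against an allowed set, falling back to the largest permitted count that fits. The function name and the restriction to non-empty literal units are assumptions made to keep the example short.

```python
import re

def repeat_count_in(unit, text, allowed):
    """Greedy count of literal `unit` at the start of `text`, restricted
    to the counts in `allowed`. Returns the chosen count or None.

    Assumes `unit` is a non-empty literal string, not a regex.
    """
    run = re.match(f"(?:{re.escape(unit)})*", text)
    longest = len(run.group(0)) // len(unit)
    # Emulate backtracking: any permitted count up to `longest` is reachable.
    candidates = [k for k in allowed if k <= longest]
    return max(candidates, default=None)

print(repeat_count_in("foo", "foofoofoofoo", {1, 3, 5}))  # 3
print(repeat_count_in("foo", "foobar", {3, 5}))           # None
```

Because the count check lives outside the quantifier, no new quantifier syntax is needed, which seems to be the point of the remark.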

Larry Wall

Sep 17, 2004, 1:30:37 PM
to perl6-l...@perl.org
On Fri, Sep 17, 2004 at 05:15:58PM +0200, Juerd wrote:
: Jonathan Scott Duff wrote 2004-09-17 9:57 (-0500):

Sigh. It's easy to make random suggestions. It's hard to actually
design a language in which easy things are easy and hard things
are possible. Generalized quantification is one of those things
that should merely be possible. It doesn't rate dangling syntax.
It doesn't rate a bunch of optional syntaxes. It doesn't rate a new
bunch of whitespace dependencies. It especially doesn't rate a new
metacharacter! I'm not just picking on you--everyone on this list
should be trying to pick up some of these basic underlying design
principles as we go along, and reminding each other of the design
principles rather than just going off on non-productive tangents.
Otherwise I'll just end up wasting more of my breath in the future
justifying what seem to me to be fairly obvious decisions. Mind you,
it's kind of fun to shoot down bad ideas, but it's probably bad
for me spiritually.

Larry

Dan Hursh

Sep 18, 2004, 12:44:40 AM
to perl6-l...@perl.org
Jonathan Scott Duff wrote:
> - for minimal matching the ? is too far away from the operator that it
> applies to. It looks like it's doing something to the closure (and
> maybe it is) Should that be [foo]**?{$m..$n} instead?

> - Bringing a closure into the picture seems to put too much power in
> such a simple construct. [foo]**{ destroy_the_world; 0... }

This gave me 2 ideas. First:

# not a closure, no runtime side effect, a straight compile
/[foo]**(3..5)/

# The closure is explicit and before the code it affects.
# (Yay end-weight!)
/ {($min, $max) = binNsmall()} [foo]**($min..$max) /

Second, if it is a problem that '?' is too far away, how about this?

[foo]**{5..3} # greedy
[foo]**{3..5} # lazy

Kind of implies

[foo]* <==> [foo]**{*..0} # greedy
[foo]*? <==> [foo]**{0..*} # lazy

but those would probably really be

[foo]**{Inf..0} # greedy
[foo]**{0..Inf} # lazy

But the greedy case would require careful processing, forget it. Might
work if the closure idea were replaced with a semi-special non-closure
syntax. Of course we have the simpler syntax for those common cases
anyhow, forget it. No, really.

Oh, is there a way to trick this closure syntax into being the '0 or
more' equivalent? Suppose it would have to be returning an infinite
list. That seems familiar, but I forget why.

I guess I'm not seeing the point in having a special closure based
syntax. (Well, aside from Larry saying so. That ain't a bad reason.
I'm just sayin'...) It could be abused in ugly ways and the only power
it provides in the intended case is an arbitrary finite range. Seems
this could be made to work with plain closures and a simple range that
allows variables as well as literals. It would be just as possible and
not much less friendly.

Ok, so I guess I'd suggest

/ {($x,$y)=daRange()} <thing>**($x..$y) /

since parens kind of get you a list, which is really all you want to
allow anyhow. It kinda looks a little like a code assertion too, well
if you want to see it that way.

/ {($x,$y)=daRange()} <<thing>>* <($x..$y ~~ +@<<thing>>)> /

So maybe:

/ {($x,$y)=daRange()} <thing>**<($x..$y)> /

or

/ <thing>**<(daRange())> /

I wouldn't beat a drum for any of them though. Well, not more than I
have. Those last two would be easy to mis-read. And actually, now that
I think of it, setting the range in a closure has the requirement of
running at canonical times. That's probably not what we want, is it? I
guess we're back to the closure syntax again. And the curly braces of
the closure already have history attached to them. Forget it. No, I
mean it this time. Honest.

Dan

Kurt Hutchinson

Sep 18, 2004, 3:00:09 PM
to perl6-l...@perl.org
Please forgive me if these ideas have been discussed before. I don't
remember having read them elsewhere.

For specifying in-rule repetitions, why not use the rule modifier we
already have for specifying whole-rule repetitions; namely, C<:x>. Allow
:x inside rules like :i and :w, and we get something like this:
rx :w/ three m's\: [:3x m] /
rx :w/ three to five m's\: [:x(3..5) m] /
rx :w/ done at runtime\: [:x($m..$n) m] /

It seems straightforward to me, but I admit I don't know how difficult
that would be for parsing/compiling purposes. However, it seems too
different from :i and :w on some level, to be treated the same.

So instead of a match modifier, how about adding a special named
assertion akin to C<before> and C<after> called C<x>, that takes an
argument:
rx :w/ three m's\: <x(3) m> /
rx :w/ three to five m's\: <x(3..5) m> /
rx :w/ done at runtime\: <x($m..$n) m> /

Repetition is a kind of assertion, after all, and it seems like it
should get to play in the same angle-bracket sandbox as the other
assertions.

There's already precedent for using C<x> for repetition as a normal
operator, and now we have the rule modifier as well, so why not extend
it into the rule innards?

I confess that I've purposely left out a small detail, and I'm sure
you've all noticed: non-greedy matching. I couldn't come up with a way
that looked all that nice. I don't know if putting the C<?> after the
argument would even work:
rx :w/ non-greedy m's\: <x(3..5)? m> /
And putting it before the argument (as part of the assertion name?) just
looks weird:
rx :w/ non-greedy m's\: <x?(3..5) m> /

So maybe this proposal is a complete no-go because of non-greediness,
but I thought I'd throw it out there anyway because it seemed to me to
fit in with assertions rather well, and I like the generalized assertion
idea.

Kurt

Luke Palmer

Sep 18, 2004, 3:18:50 PM
to Dan Hursh, perl6-l...@perl.org
Dan Hursh writes:
> Second, if it is a problem that '?' is too far away, how about this?
>
> [foo]**{5..3} # greedy
> [foo]**{3..5} # lazy

Because 5..3 is the empty list. This wasn't a mistake in Perl 5, so
it's staying in Perl 6.

> Oh, is there a way to trick this closure syntax into being the '0 or
> more' equivalent? Suppose it would have to be returning an infinite
> list. That seems familiar, but I forget why.

Um, yes:

[foo]**{3...}

Or:

[foo]**{3..Inf}

These are just regular perl semantics.

Luke
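
Luke's two points carry over to other languages with ranges and bounded quantifiers. The Python lines below are only an analogy, assuming Python's range and re as stand-ins: a descending range is empty there too, an open-ended bound is written {3,}, and a reversed bound in a quantifier is rejected outright rather than read as "greedy".

```python
import re

# Perl's 5..3 is inclusive, so the closest Python spelling is range(5, 3 + 1).
descending = list(range(5, 3 + 1))
print(descending)  # []

# Open-ended repetition, the counterpart of **{3...} or **{3..Inf}.
three_or_more = re.compile(r"(?:foo){3,}")
print(bool(three_or_more.fullmatch("foo" * 7)))  # True
print(bool(three_or_more.fullmatch("foo" * 2)))  # False

# A reversed bound is a compile-time error in Python's re module.
try:
    re.compile(r"(?:foo){5,3}")
    reversed_bound_ok = True
except re.error:
    reversed_bound_ok = False
print(reversed_bound_ok)  # False
```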

Luke Palmer

Sep 18, 2004, 3:24:04 PM
to Kurt Hutchinson, perl6-l...@perl.org
Kurt Hutchinson writes:
> For specifying in-rule repetitions, why not use the rule modifier we
> already have for specifying whole-rule repetitions; namely, C<:x>. Allow
> :x inside rules like :i and :w, and we get something like this:
> rx :w/ three m's\: [:3x m] /
> rx :w/ three to five m's\: [:x(3..5) m] /
> rx :w/ done at runtime\: [:x($m..$n) m] /

That's an interesting idea. But I'd have to agree with your next
paragraph on the issue.

> It seems straightforward to me, but I admit I don't know how difficult
> that would be for parsing/compiling purposes. However, it seems too
> different from :i and :w on some level, to be treated the same.
>
> So instead of a match modifier, how about adding a special named
> assertion akin to C<before> and C<after> called C<x>, that takes an
> argument:
> rx :w/ three m's\: <x(3) m> /
> rx :w/ three to five m's\: <x(3..5) m> /
> rx :w/ done at runtime\: <x($m..$n) m> /
>
> Repetition is a kind of assertion, after all, and it seems like it
> should get to play in the same angle-bracket sandbox as the other
> assertions.

Yes, we considered this. First of all, the "m" there is illegal if
you've got parens. Once you get the closing paren, you need to get a >
or it's a parse error.

But there's this:

<x(3..5, 'm')>

The trouble is that it doesn't look in the slightest like a quantifier
anymore. **{} does, even if it is a big fat ugly one.

> I confess that I've purposely left out a small detail, and I'm sure
> you've all noticed: non-greedy matching. I couldn't come up with a way
> that looked all that nice.

That's okay, none of it looked nice. :-p

> I don't know if putting the C<?> after the argument would even work:
> rx :w/ non-greedy m's\: <x(3..5)? m> /
> And putting it before the argument (as part of the assertion name?) just
> looks weird:
> rx :w/ non-greedy m's\: <x?(3..5) m> /

Both of those are illegal syntactically, so no dice.

> So maybe this proposal is a complete no-go because of non-greediness,
> but I thought I'd throw it out there anyway because it seemed to me to
> fit in with assertions rather well, and I like the generalized assertion
> idea.

I am definitely fond of the generalized assertion idea. I just don't
think it belongs here. (It certainly doesn't half-belong like it did in
A5: / [foo]<3,5> /. That's not an assertion!)

Thanks for your suggestions, though.

Luke

Jonathan Scott Duff

Sep 18, 2004, 4:17:25 PM
to Kurt Hutchinson, perl6-l...@perl.org
On Sat, Sep 18, 2004 at 03:00:09PM -0400, Kurt Hutchinson wrote:
> Repetition is a kind of assertion, after all, and it seems like it
> should get to play in the same angle-bracket sandbox as the other
> assertions.

The more I thought about **{}, the less it looked like an
assertion to me. Assertions are more like "nouns" and all of *, +, ?,
and **{} are "verbs" that act upon these nouns. Using the angle
brackets for repetition suddenly makes rules look like "this sentence
no verb". :-)
