character classes in p6 rules

Patrick R. Michaud

May 11, 2005, 9:00:20 PM

to perl6-l...@perl.org

I now have a basic implementation for enumerated character classes in
the grammar engine (i.e., <[xyz]>, <-[xyz]>, <[x..z]>, and <-[x..z]>).

I didn't see it specified anywhere, but are the \d, \D, \s, \S, etc.
metacharacters still supposed to work inside an enumerated character
class, as they do in Perl 5? Or in p6 do we always use
<+<digit>+[xyz]>, <-<digit>>, <+<sp>>, <-<sp>>, etc.?

(Yes, I know that normally the absence of any spec to the contrary
indicates that we're still using p5 semantics, but this one is worth
verification for me.)

While I'm on the subject, let me just ramble a bit -- there are
times when <alpha>, <digit>, <upper>, etc. give me a bad feeling
-- they look a little too much like subrules to me, especially
when looking at <+<alpha>> and the like. I keep wondering about
things like <+<ident>> and <-<expr>>.

And something like C<< rx / <alpha>* / >> may generate a lot
of not-very-useful one-character captures into $/<alpha>, so
we'll typically want to get in the habit of writing

rx / <?alpha>* /
rx / <+<alpha>>* /

and then have the engine recognize when this occurs so it
can optimize to a much faster character class op rather than
a lot of calls to a separate subrule.

Plus, <+<alpha>> just looks plain ugly and unbalanced to me.
Somehow I'd like to get rid of those inner angles, so
that we always use <+alpha>, <+digit>, <-sp>, <-punct> to
indicate named character classes, and specify combinations
with constructions like <+alpha+punct-[aeiou]> and <+word-[_]>.
We'd still allow <[abc]> as a shortcut to <+[abc]>.
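
As a rough illustration of how such a combined class could be evaluated, here is a small Python sketch. It is not the grammar engine's implementation: the names `parse_class`, `expand`, `NAMED`, and `UNIVERSE` are invented for this example, the named classes are ASCII-only, and it assumes terms are simply applied left to right as ordered additions and subtractions, with a leading "-" starting from the full (ASCII) universe.

```python
import re
import string

# Hypothetical, ASCII-only table of named classes (an assumption for this sketch).
NAMED = {
    "alpha": set(string.ascii_letters),
    "digit": set(string.digits),
    "punct": set(string.punctuation),
    "word": set(string.ascii_letters + string.digits + "_"),
}
UNIVERSE = {chr(c) for c in range(128)}

# One term: a sign followed by either a class name or a bracketed list.
TERM = re.compile(r"([+-])(?:(\w+)|\[([^\]]*)\])")

def expand(body):
    """Expand an enumerated list such as 'x..z' or 'aeiou' into a set."""
    chars, i = set(), 0
    while i < len(body):
        if body[i + 1:i + 3] == ".." and i + 3 < len(body):
            lo, hi = ord(body[i]), ord(body[i + 3])
            chars.update(chr(c) for c in range(lo, hi + 1))
            i += 4
        else:
            chars.add(body[i])
            i += 1
    return chars

def parse_class(spec):
    """Evaluate e.g. '<+alpha+punct-[aeiou]>' left to right into a set."""
    inner = spec[1:-1]                 # drop the outer angle brackets
    if inner.startswith("["):          # <[abc]> is shorthand for <+[abc]>
        inner = "+" + inner
    # A class that begins with '-' subtracts from everything.
    result = set(UNIVERSE) if inner.startswith("-") else set()
    pos = 0
    while pos < len(inner):
        m = TERM.match(inner, pos)
        if m is None:
            raise ValueError(f"cannot parse class term at {inner[pos:]!r}")
        sign, name, body = m.groups()
        chars = NAMED[name] if name is not None else expand(body)
        result = result | chars if sign == "+" else result - chars
        pos = m.end()
    return result
```

With this sketch, `parse_class("<+alpha+punct-[aeiou]>")` contains "b" and "!" but not "a", and `parse_class("<+word-[_]>")` contains letters and digits but not the underscore. Because the result is a plain set, matching one character is a single membership test rather than a subrule call.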

To me this looks cleaner overall, makes it clear we're doing a
one-character non-capturing match, and may enable a few optimization
possibilities. (I'm sure that with enough effort we can get
equivalent optimizations out of the existing syntax, and we may
need them anyway in the long run, but this might simplify that a
fair bit.)

I haven't thought far ahead to the question of whether
character classes would continue to occupy the same namespace
as rules (as they do now) or if they become specialized kinds
of rules or what. I'll just leave it at this for now and
see what the rest of p6l thinks.

Larry Wall

May 13, 2005, 12:26:13 AM

to perl6-l...@perl.org

On Wed, May 11, 2005 at 08:00:20PM -0500, Patrick R. Michaud wrote:
: Somehow I'd like to get rid of those inner angles, so
: that we always use <+alpha>, <+digit>, <-sp>, <-punct> to
: indicate named character classes, and specify combinations
: with constructions like <+alpha+punct-[aeiou]> and <+word-[_]>.
: We'd still allow <[abc]> as a shortcut to <+[abc]>.

I like it.

: I haven't thought far ahead to the question of whether
: character classes would continue to occupy the same namespace
: as rules (as they do now) or if they become specialized kinds
: of rules or what. I'll just leave it at this for now and
: see what the rest of p6l thinks.

Hmm, well, positive matches can be defined to traverse whatever the
longest sequence matched is, even if it's actually multiple characters
by some reckoning or other. On the other hand, negative matches
can really only skip one character in the current view regardless of
how long the sequences in the class are, which function as a negative
lookahead for the subsequent character skip. In other words, <-alpha>
really means something like [<!alpha> .]
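
In Perl 5-style regex terms, the same idea can be written as a negative lookahead followed by a one-character skip. A minimal Python sketch, assuming an ASCII-only stand-in for <alpha> and using the standard `re` module (the helper name `spans` is invented for this example):

```python
import re

ALPHA = r"[A-Za-z]"   # ASCII stand-in for <alpha>, an assumption of this sketch

# A conventional negated class ...
negated = re.compile(r"[^A-Za-z]")
# ... and the [<!alpha> .] formulation: "not alpha here, then skip one character".
lookahead = re.compile(rf"(?!{ALPHA}).", re.DOTALL)

def spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in pattern.finditer(text)]
```

For single-code-point classes the two forms match the same spans, for example on the string "ab1 c\n_". The lookahead form is the one that still makes sense when the positive class can match sequences longer than one character.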

But then it's not entirely clear how character class set theory works.
Another thing we have to work out. Obviously + and - are ordered,
and we probably want & and | for actual set operations. But does
<-[a]> negate only a preceding 'a' or all characters that use 'a'
as the base character along with subsequent combining characters?
We're almost getting into a wildcarding situation there...
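
The ambiguity is easy to see at the code-point level. The following Python sketch uses the standard `re` and `unicodedata` modules, with `[^a]` standing in for <-[a]>. It shows only how a code-point-based matcher behaves and says nothing about how Perl 6 rules resolve the question.

```python
import re
import unicodedata

decomposed = "a\u0301"                               # 'a' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E1

not_a = re.compile(r"[^a]")   # code-point stand-in for <-[a]>

# The precomposed character is one code point that is not 'a', so it matches.
matches_composed = not_a.match(composed) is not None
# The decomposed form starts with a bare 'a', so the negated class fails at
# position 0 ...
matches_decomposed = not_a.match(decomposed) is not None
# ... and a search then finds the combining accent on its own.
stray = not_a.search(decomposed).group()
```

Here `matches_composed` is true and `matches_decomposed` is false, even though both strings render as the same character, and `stray` is the lone combining accent.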

In any event, the take-home message here is that characters cannot
be assumed to be constant width any more.

I think this argues that character classes really are rules of a sort.

Larry
