Perl 6 regex parser

Jeff 'Japhy' Pinyan

unread,

Jun 26, 2004, 11:14:14 PM6/26/04

to Perl6 Internals

I am currently completing work on an extensible regex-specific parsing
module, Regexp::Parser. It should appear on CPAN by early July (hopefully
under my *new* CPAN ID "JAPHY").

Once it is completed, I will be starting work on writing a subclass that
matches Perl 6 regexes, Regexp::Perl6 (or Perl6::Regexp, or
Perl6::Regexp::Parser). I think this might be of some use to the Perl 6
dev crew, but I'm not sure how.

--
Jeff "japhy" Pinyan ja...@pobox.com http://www.pobox.com/~japhy/
RPI Acacia brother #734 http://www.perlmonks.org/ http://www.cpan.org/
CPAN ID: PINYAN [Need a programmer? If you like my work, let me know.]
<stu> what does y/// stand for? <tenderpuss> why, yansliterate of course.

Luke Palmer

unread,

Jun 27, 2004, 5:30:24 PM6/27/04

to ja...@pobox.com, Perl6 Internals

Jeff 'japhy' Pinyan writes:
> I am currently completing work on an extensible regex-specific parsing
> module, Regexp::Parser. It should appear on CPAN by early July
> (hopefully under my *new* CPAN ID "JAPHY").
>
> Once it is completed, I will be starting work on writing a subclass
> that matches Perl 6 regexes, Regexp::Perl6 (or Perl6::Regexp, or
> Perl6::Regexp::Parser).

Or Regexp::Parser::Perl6 :-)

> I think this might be of some use to the Perl 6 dev crew, but I'm not
> sure how.

It surely would.

The grammar for Perl 6 is going to be specified with Perl 6 patterns.
That presents us a little bootstrapping problem. So the original goal
of Damian's Perl6::Rules was to transform this grammar back into Perl 5
patterns so they can parse the simplified Perl 6 code for Perl 6 and
compile a bootstrap.

But Perl6::Rules failed. Due to serious bugs in the Perl 5 regex engine
having to do with executing code within regexes, Perl6::Rules couldn't
do its job properly. So, for the bootstrap, we need something else, and
you're giving us an in.

My personal, nondivine plan would be to use your module to create a
driver-based parser. That could then be used for the bootstrap instead.

A driver-based parser has a couple of advantages over regexes and even
Parse::RecDescent. First, the parsing algorithm can be easily
customized, so we can play with hybrid models and see how the time
complexity works out. Also, you can suspend the parsing in the middle
of execution, go somewhere else, and continue, which the Perl 6 parser
might just want to do (something like simulated coroutines).

In any case, such a parser would be really, really useful.

Luke

Steve Fink

unread,

Jun 27, 2004, 5:35:23 PM6/27/04

to ja...@pobox.com, Perl6 Internals

On Jun-26, Jeff 'japhy' Pinyan wrote:
> I am currently completing work on an extensible regex-specific parsing
> module, Regexp::Parser. It should appear on CPAN by early July (hopefully
> under my *new* CPAN ID "JAPHY").
>
> Once it is completed, I will be starting work on writing a subclass that
> matches Perl 6 regexes, Regexp::Perl6 (or Perl6::Regexp, or
> Perl6::Regexp::Parser). I think this might be of some use to the Perl 6
> dev crew, but I'm not sure how.

Sounds interesting, but I'm a bit confused about what it is. Clearly,
it parses regexes, but what is the output? A parse tree? Tables and/or
code that implement a matching engine for that regex? PIR? A training
regimen that can be used to condition a monkey to push a "yes" or "no"
button whenever you give it a banana with an input string inscribed on
it?

If it just parses the regex, then I would be interested in it for both
languages/regex and languages/perl6. If it does more, then you're on
your own, because it's been difficult enough to graft the current
regex engine onto the languages/perl6 code; I have no problems with
someone doing the same for a different engine, but I'm not going to be
the one!

Also, I find that regex stack overflows can sometimes trigger the
monkey to begin wildly throwing feces, and I've no desire to
experience that again.

Jeff 'Japhy' Pinyan

unread,

Jun 27, 2004, 8:17:17 PM6/27/04

to Luke Palmer, Perl6 Internals

On Jun 27, Luke Palmer said:

>Jeff 'japhy' Pinyan writes:
>> I am currently completing work on an extensible regex-specific parsing
>> module, Regexp::Parser. It should appear on CPAN by early July
>> (hopefully under my *new* CPAN ID "JAPHY").
>>
>> Once it is completed, I will be starting work on writing a subclass
>> that matches Perl 6 regexes, Regexp::Perl6 (or Perl6::Regexp, or
>> Perl6::Regexp::Parser).
>
>Or Regexp::Parser::Perl6 :-)

I wasn't sure where in the hierarchy of modules it should go.

>The grammar for Perl 6 is going to be specified with Perl 6 patterns.
>That presents us a little bootstrapping problem. So the original goal
>of Damian's Perl6::Rules was to transform this grammar back into Perl 5
>patterns so they can parse the simplified Perl 6 code for Perl 6 and
>compile a bootstrap.
>

>My personal, nondivine plan would be to use your module to create a
>driver-based parser. That could then be used for the bootstrap instead.

If you mean what I think you mean by driver-based, than my module is a
perfect fit. To subclass it, you do this:

package Regexp::NoCode;
use base 'Regexp::Parser';

sub init {
my $self = shift;

$self->SUPER::init();

$self->del_handler('(?{');
$self->del_handler('(??{');
$self->del_handler('(?p{');
}

1;

Now you have a parser that refuses to acknowledge (?{ ... }) and (??{ ...
}) assertions (resulting in an error).

Another example would be to support the '&' metacharacter, which is the
"AND" equivalent of '|':

package Regexp::AndBranch;
use base 'Regexp::Parser';

sub init {
my $self = shift;

$self->SUPER::init();

$self->add_handler('&' => sub {
my ($S) = @_;
return $S->object('and');
});
}

Then you create Regexp::AndBranch::and like so:

package Regexp::AndBranch::and;
@ISA = Regexp::Parser::or; # it behaves like an 'or' branch...

sub new {
my ($class, $rx, $lhs, $rhs) = @_;
my $self = bless {
rx => $rx,
flags => $rx->{flags}[-1],
class => 'branch',
type => 'and',
data => [$lhs, $rhs],
raw => '&',
}, $class;
return $self;
}

It'll inherit the other methods it needs from the 'or' class. Then, when
you want to convert it to an existing construct (specifically, /x&y&z/
would become /(?=x)(?=y)z/, like in vim).

>A driver-based parser has a couple of advantages over regexes and even
>Parse::RecDescent. First, the parsing algorithm can be easily
>customized, so we can play with hybrid models and see how the time
>complexity works out. Also, you can suspend the parsing in the middle
>of execution, go somewhere else, and continue, which the Perl 6 parser
>might just want to do (something like simulated coroutines).

Yeah, you can parse node-by-node:

my $p = Regexp::Parser->new($regex);
while (my $n = $p->next) {
# ...
}

When I finish writing it to work with the current set, I'll post it to
CPAN and alert the group.

Jeff 'Japhy' Pinyan

unread,

Jun 27, 2004, 10:48:24 PM6/27/04

to Steve Fink, Perl6 Internals

On Jun 27, Steve Fink said:

>On Jun-26, Jeff 'japhy' Pinyan wrote:
>> I am currently completing work on an extensible regex-specific parsing
>> module, Regexp::Parser. It should appear on CPAN by early July (hopefully
>> under my *new* CPAN ID "JAPHY").
>>
>> Once it is completed, I will be starting work on writing a subclass that
>> matches Perl 6 regexes, Regexp::Perl6 (or Perl6::Regexp, or
>> Perl6::Regexp::Parser). I think this might be of some use to the Perl 6
>> dev crew, but I'm not sure how.
>
>Sounds interesting, but I'm a bit confused about what it is. Clearly,
>it parses regexes, but what is the output? A parse tree? Tables and/or
>code that implement a matching engine for that regex? PIR? A training
>regimen that can be used to condition a monkey to push a "yes" or "no"
>button whenever you give it a banana with an input string inscribed on
>it?

It creates a tree structure, not identical but similar to the array of
nodes Perl uses internally.

/a+?(bc|d|e)+/

is represented as

[
MINMOD(
PLUS(
EXACT('a')
)
),
PLUS(
ALT(
EXACT('bc'),
ALT(
EXACT('d'),
EXACT('e'),
)
)
)
]

Two signficant differences are that (?ismx-ismx) and (?ismx-ismx:...)
assertions are found explicitly in the parse tree. To Perl, in-place flag
changing doesn't need to stick around, because it has already had its
effect. Same deal with (?:...), really; it only serves to separate part
of the regex for a specific operation, and doesn't have a regex opcode.
For my module, though, it's helpful to keep these things around.

That might change before I release the module, though. I'm not sure yet.
For something like /a(?:b|c)d/, the tree can be

[
EXACT('a'),
ALT(EXACT('b'), EXACT('c')),
EXACT('d'),
]

See, no need for an explicit NOCAPTURE node or something. And if it was
/a(?i:b|c)d/, then I would do

[
EXACT('a'),
ALT(EXACTF('b'), EXACTF('c')),
EXACT('d'),
]

The trick is, when it comes time to create a physical regex from the tree,
I'd need to create (?flag) things on the fly. Which probably isn't too
hard.

Either way, there *will* be (?flag) and (?flag:...) objects created so
that the user knows of their existence during node-by-node parsing. They
just might not get included in the tree.

Jeff 'Japhy' Pinyan

unread,

Jun 29, 2004, 9:16:50 PM6/29/04

to Perl6 Internals

I've just completed the first version of the module. PAUSe seems to be
down, so right now its only accessible from

http://japhy.perlmonk.org/modules/

Once I fine-tune it, I'll get to work on Regexp::Perl6 (or whatever).

Steve Fink

unread,

Jun 30, 2004, 12:51:07 PM6/30/04

to ja...@pobox.com, Perl6 Internals

On Jun-27, Jeff 'japhy' Pinyan wrote:
> On Jun 27, Steve Fink said:
>
> >On Jun-26, Jeff 'japhy' Pinyan wrote:
> >> I am currently completing work on an extensible regex-specific parsing
> >> module, Regexp::Parser. It should appear on CPAN by early July (hopefully
> >> under my *new* CPAN ID "JAPHY").
> >>
> >> Once it is completed, I will be starting work on writing a subclass that
> >> matches Perl 6 regexes, Regexp::Perl6 (or Perl6::Regexp, or
> >> Perl6::Regexp::Parser). I think this might be of some use to the Perl 6
> >> dev crew, but I'm not sure how.
> >
> >Sounds interesting, but I'm a bit confused about what it is. Clearly,
> >it parses regexes, but what is the output? A parse tree? Tables and/or
> >code that implement a matching engine for that regex? PIR? A training
> >regimen that can be used to condition a monkey to push a "yes" or "no"
> >button whenever you give it a banana with an input string inscribed on
> >it?
>
> It creates a tree structure, not identical but similar to the array of
> nodes Perl uses internally.

Ah, good. Then I am interested. When I manage to find some time for
hacking again, I'll graft it onto languages/regex as a replacement for
the ridiculous parser I have there now.

languages/regex is meant to be a language-independent regex engine,
and has a silly stub parser to get basic stuff into it for testing.
languages/perl6 uses the engine too, but provides its own parser. But
nobody's done anything with that parser since Sean O'Rourke stopped
working on it (admittedly, he implemented a surprisingly large portion
of the syntax), and it'd be great to be working with something that's
maintained. (But mostly I like the idea of using a
language-independent front-end with a language-independent backend.)

I should just look at the code, but I'm wondering what you do with
language-specific constructs. Embedded code, for example. How do you
find the end of it? And will you be supporting things like Perl6's

/ $x := (a*b) /

where '$x' is a language-dependent variable name syntax?

Jeff 'Japhy' Pinyan

unread,

Jun 30, 2004, 3:06:14 PM6/30/04

to Steve Fink, Perl6 Internals

On Jun 30, Steve Fink said:

>On Jun-27, Jeff 'japhy' Pinyan wrote:
>
>> It creates a tree structure, not identical but similar to the array of
>> nodes Perl uses internally.
>
>Ah, good. Then I am interested. When I manage to find some time for
>hacking again, I'll graft it onto languages/regex as a replacement for
>the ridiculous parser I have there now.

>I should just look at the code, but I'm wondering what you do with

>language-specific constructs. Embedded code, for example. How do you
>find the end of it?

Use The Source, Luke. I do exactly what Perl's regex compiler does:

/* right after matching "(?{" */
while (count && (c = *RExC_parse)) {
if (c == '\\' && RExC_parse[1]) RExC_parse++;
else if (c == '{') count++;
else if (c == '}') count--;
RExC_parse++;
}
if (*RExC_parse != ')') {
vFAIL("Sequence (?{...}) not terminated or not {}-balanced");
}

Granted, I do it with a recursive regex, not character-by-character, but
that's how I do it. I don't parse the expression, I just make sure it's
properly balanced.

>And will you be supporting things like Perl6's
>
> / $x := (a*b) /
>
>where '$x' is a language-dependent variable name syntax?

If I have a means for matching variable names, then that's not a problem.
I assume my first version of the Perl 6 regex parser will only match
simple variables, like $foo and @bar and %blat. Baby steps.

Jeff 'Japhy' Pinyan

unread,

Jul 1, 2004, 8:03:16 PM7/1/04

to Perl6 Internals

It'll appear shortly at your local mirror. You can get it from my web
site as well.

I'll be writing an extension module tomorrow, and starting next week, I'll
get started on Regexp::Perl6.

Which leads me to a question about Perl 6 regexes. I'm writing an article
on (?{ ... }) and (??{ ... }) for TPJ, and my conclusion translates some
of the regexes I use to Perl 6. I have a question about the backtrack
control assertions : :: and :::.

Do any of them cause the regex to fail entirely? That is, not fail and
try again from a different position in the string, but fail utterly? My
understanding is they don't, which is why there's <commit>, but I just
wanted to be sure of this.

Thanks for your time.

Luke Palmer

unread,

Jul 1, 2004, 9:21:13 PM7/1/04

to ja...@pobox.com, Perl6 Internals

Jeff 'japhy' Pinyan writes:
> It'll appear shortly at your local mirror. You can get it from my web
> site as well.
>
> I'll be writing an extension module tomorrow, and starting next week, I'll
> get started on Regexp::Perl6.

Cool.

> Which leads me to a question about Perl 6 regexes. I'm writing an article
> on (?{ ... }) and (??{ ... }) for TPJ, and my conclusion translates some
> of the regexes I use to Perl 6. I have a question about the backtrack
> control assertions : :: and :::.
>
> Do any of them cause the regex to fail entirely? That is, not fail and
> try again from a different position in the string, but fail utterly? My
> understanding is they don't, which is why there's <commit>, but I just
> wanted to be sure of this.

Yeah, that's right. It's hard to do a commit in a Perl 5 regex, and
perhaps impossible generically.

Luke

Brent 'Dax' Royal-Gordon

unread,

Jul 2, 2004, 12:04:41 AM7/2/04

to ja...@pobox.com, Perl6 Internals

Jeff 'japhy' Pinyan wrote:
> ...I have a question about the backtrack

> control assertions : :: and :::.
>
> Do any of them cause the regex to fail entirely? That is, not fail and
> try again from a different position in the string, but fail utterly? My
> understanding is they don't, which is why there's <commit>, but I just
> wanted to be sure of this.

Nope. Theoretically (IIRC):

: fail this atom
:: fail this group
::: fail this rule
:::: fail this regex

But Larry decided that :::: was going just a bit too far, so he named it
<commit> instead.

--
Brent "Dax" Royal-Gordon <br...@brentdax.com>
Perl and Parrot hacker

Oceania has always been at war with Eastasia.

Jeff 'Japhy' Pinyan

unread,

Jul 4, 2004, 11:57:38 AM7/4/04

to Perl6 Internals

I want to make sure I haven't misunderstood anything. *What* purpose will
my module that will be able to parse Perl 6 regexes into a tree be? You
must be aware that I have no power Damian does not possess, and I cannot
translate *all* Perl 6 regexes to Perl 5 regexes. All I can promise is a
tree structure and limited (albeit correct) translation to Perl 5.

Steve Fink

unread,

Jul 4, 2004, 2:21:01 PM7/4/04

to ja...@pobox.com, Perl6 Internals

On Jul-04, Jeff 'japhy' Pinyan wrote:
> I want to make sure I haven't misunderstood anything. *What* purpose will
> my module that will be able to parse Perl 6 regexes into a tree be? You
> must be aware that I have no power Damian does not possess, and I cannot
> translate *all* Perl 6 regexes to Perl 5 regexes. All I can promise is a
> tree structure and limited (albeit correct) translation to Perl 5.

In general, it could perhaps be used as a piece of the implementation
of Perl6-style regexes for any Parrot-hosted language.

Personally, I could see using it with the current prototype perl6
compiler to take over the parsing whenever a regex is seen. The
resulting tree structure would then be translated into a
languages/regex-style tree, and from there converted into PIR
instructions. The translation step could perhaps be skipped if your
parser uses some extensible factory-like pattern so that I could
produce my preferred regex tree nodes directly -- or if I converted my
regex compiler to use your tree nodes as their native representation.

In order to get the parsing correct, however, I would need the ability
to call back into my native perl6 parser when you encounter perl6 code
during your parse -- and perhaps call you again within that code.

I don't know if this is in the scope of what you were planning for
your parser; now I'm wondering if you were intending to write
something akin to Perl6::Rules in that it translates Perl6 rules into
perl5-edible chunks, and all this business of reentrancy and external
parsing callouts is not at all what you're interested in dealing with.

If so, then I would still find a use for it in providing a better
Perl6-style regex parser for languages/regex. It would be used mostly
for testing, but eventually I hope to get around to plugging
languages/regex into Parrot as a directly-callable compiler. This has
the same reentrancy etc. issues for the host language, but then
they're that language's author's problem, not mine. :-)

I'll go download Regexp::Parser now, just so I'm not speculating quite
so much.

Jeff 'Japhy' Pinyan

unread,

Jul 4, 2004, 5:14:04 PM7/4/04

to Steve Fink, Perl6 Internals

On Jul 4, Steve Fink said:

>Personally, I could see using it with the current prototype perl6
>compiler to take over the parsing whenever a regex is seen. The
>resulting tree structure would then be translated into a
>languages/regex-style tree, and from there converted into PIR
>instructions. The translation step could perhaps be skipped if your
>parser uses some extensible factory-like pattern so that I could
>produce my preferred regex tree nodes directly -- or if I converted my
>regex compiler to use your tree nodes as their native representation.

You certainly can. Each object has an insert() method which determines
HOW it gets placed in the tree.

>In order to get the parsing correct, however, I would need the ability
>to call back into my native perl6 parser when you encounter perl6 code
>during your parse -- and perhaps call you again within that code.

I don't see why not; you can go through the parsing node by node with the
next() method. And Perl 6 regexes don't seem to need an initial once-over
like Perl 5 regexes do.

>I don't know if this is in the scope of what you were planning for your
>parser; now I'm wondering if you were intending to write something akin
>to Perl6::Rules in that it translates Perl6 rules into perl5-edible
>chunks, and all this business of reentrancy and external parsing callouts
>is not at all what you're interested in dealing with.

I think the module can handle it.

>I'll go download Regexp::Parser now, just so I'm not speculating quite
>so much.

R::P v0.03 will be available by the middle of the week; some bugs have
surfaced and some helpful changes are being made. But for the time being,
play with the current version and see what you can do.