I've just committed a new, rewritten version of PGE to the Parrot
repository. It's still somewhat preliminary and many rule features
are still missing. On the other hand, it's now written entirely
in PIR (no more C compiling!) and provides a stronger base platform
for developing the rest of the features. More details are below,
and in compilers/pge/README. Comments, suggestions, and test cases
The long story
(Or, "hey, what took you so long?") Well, I was working along in
PGE late last year, and discovered that P6 rules capturing semantics
weren't very clear. So, a few messages to p6l and a few weeks later
the @Larry team proposes detailed semantics for capturing. (I'll
present these in future messages.) Then Real Life intruded and
kept me away from development for a few weeks.
When I finally got a chance to look at PGE in detail again, I started
adding things in the C parser and discovered that I really wanted dynamic
hashes and arrays to work with. Since we're going to have to handle
Unicode someday anyway, I decided to take a step backwards (rebuild
the basic engine from scratch) so that we can start taking larger leaps
forward. This divorces PGE from compiler issues, and which brings us
to where we are today.
Appearing in this version of PGE:
- basic expressions, quantifiers, alternation
- subpatterns (groups), including nested capturing subpatterns
- \d, \s, \n, \w character classes
Not yet implemented, but coming soon (rough priority order):
- updated test harness/test suite
- cut operations don't always work properly
- character classes
- interpolated variables
- conjunctive matches
- capture aliases
- many, many potential optimizations
The basic design remains the same, there's a rule parser, an expression
tree, and a code generator. I briefly played with executing components
of the expression tree to match strings but decided that it was actually
a bit messier than doing code generation (and likely a bit slower as we
had to test expression characteristics at pattern-match time instead of
compile time). So, the difference now is that we have PIR code generating
PIR instead of a C function generating PIR.
The code generator tries to produce only the code needed for each
component of the match, but I'm sure there will be a lot of optimizations
we can come up with. Fortunately we can optimize in the expression
tree before generating the code.
I also continued using the bsr/ret scheme for calling components
because it avoids much of the register passing overhead and recreating
state information from one expression to the next. But it's easy
enough for us to switch to use Parrot subroutine semantics if we
ever decide we want that.
The test suite is currently broken, I'll be fixing that next.
As always we need more tests; the current tests are in t/p6rules
and are fairly easy to write.
Test, patches, comments, questions, etc. welcomed. Questions/comments
about PGE installation and parrot issues probably belong on
perl6-internals, modification and questions about PGE execution
and internals probably go on perl6-compiler. Unless I hear otherwise
I will probably announce minor changes only to perl6-compiler.
Following some discussion on #perl6, it seems that porting the Parser
part of PGE into Perl 6, using either rx// or rx:P5//, may be a good idea.
Actually, porting the entirety of PGE to Perl 6 may be a good exercise
for Perl 6, too. :)
Very good question -- I can probably come up with a few. Here's
a short list:
1. Benchmarking -- it would be really good to have some sort of
rudimentary stats to know the approximate speed difference between
PGE (which currently has few optimizations) and Perl 5's regular
expression engine. It doesn't have to be anything extensive --
just a couple of "order-of-magnitude" estimations so we can get some
idea of just how hard we'll eventually have to work to get the
performance we're used to, and what sorts of optimizations or support
we'll need to be doing that.
2. Parsing of Perl 5 regular expressions -- PGE is designed so that
it can use other matching syntaxes, such as (say) the Perl 5 regular
expression syntax. The previous version of PGE had a p5 regex
parser in C that I threw together; I'll throw one together for this
version but it would be good to have someone else shepherd that.
Parsing the regexes isn't all that difficult, and one can easily
look at the P6Rule parser for guidance (or ask, too). Once the
regex is parsed then PGE can largely handle the rest.
3. "Token" hashes. The p6 rules syntax (Synopsis 05) says that
a hash appearing in a rule uses the entry having the longest
matching key of the hash. For example, if I have a hash with
entries ( 'a' => 1, 'at' => 3, 'adam' => 4 ), then the longest
matching key for "atomic" would be "at".
This can be very useful for tokenizing input streams if you
want to build, say, a compiler. :-) In fact, I ended up using
hashes in exactly this way for the P6Rule parser. Unfortunately,
Parrot doesn't have a "longest matching key" op, so one can
either do a brute-force search of the keys (ick), or we can build
something a little more optimal. For the time being I created
a simple TokenHash class in PIR that acts somewhat like a Hash
but also has a method for "the longest key matching a
string (starting at position X)" that doesn't require searching
all of the keys. But it's not the greatest implementation, and
this may be one of those things where efficiency needs will
drive us to a lower-level implementation.
I'll probably come up with more tasks, but that's a start.