Compiler writing tools

Luke Palmer

Feb 2, 2004, 4:09:33 AM

to Language List

I've been writing a lot of compiler recently, and figuring as how Perl
6 is aiming to replace yacc, I think I'll share some of my positive and
negative experiences. Perhaps Perl 6 can adjust itself to help me out
a bit. :-)

=over

=item * RegCounter

I have a class called RegCounter which is of immense use, but could be
possibly more elegant. It's a tied hash that, upon access, generates a
new name and stores it in a table for later retrieval under the same
name.

It has a method called C<next> that returns a new RegCounter that shares
the same counter, and puts whatever was in that one's "ret" slot into
whatever argument was given to C<next>, by default "next".

The first <[^a-z]> characters in the name are passed along to the
generated register name, defaulting to a target-specific string (for
instance, I use $P for Parrot programs).

So I can do, for instance:

    method if_statement::code($rc) { # $rc is the regcounter
        self.item[0].code($rc.next('condition'))
          ~ "unless $rc{condition}, $rc{Lfalse}\n"
          ~ self.item[1].code($rc.next)
          ~ "$rc{Lfalse}:\n"
    }

=item * Concatenations

The code example you just saw gets much, much uglier if there is added
complexity. One of my compilers returns lists of lines, the other
concatenates strings, and they're both pretty hard to read -- especially
when there are heredocs all over the place (which happens frequently).

I think $() will help somewhat, as will interpolating method calls, but
for a compiler, I'd really like PHP-like parse switching. That is, I
could do something like (I'll use $< and $> for <? and ?>):

    method logical_or_expression::code($rc) {
        <<EOC;
    null $rc{ret}
    $< for @($self.item[0]) -> $item { $>
    $item.code($rc.next)
    if $rc{next}, $rc{Ldone}
    $< } $>
    $rc{Ldone}:
    EOC
    }

For this case, I think it would also be a good idea to have a string
implementation somewhere that stores things as "ropes", a list of
strings, so that immense copying isn't necessary.

=item * Comments

We've already gone over this, but it'd be good to have the ability for
parsers to (somehow) "feed" into one another, so that you can do
comments without putting a <comment> in between every grammar rule (or
mangling things to do that somehow), or search and replace, which has
the disadvantage of being unable to disable comments during parts of the
parse. $Parse::RecDescent::skip works well, but I don't think it's
general enough.

=item * Line Counting

It is I<essential> that the regex engine is capable (perhaps off by
default) of keeping track of your line number.

=back
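
A minimal sketch of the RegCounter idea from the first item, written in Python rather than Perl. It rests on one reading of the description: that C<next> hands the child a "ret" register which is the same register the parent sees under the given slot name. The default prefix "$P", the method names, and the generated numbering are illustrative assumptions, not the original class.

```python
import itertools
import re


class RegCounter:
    """Maps symbolic names to freshly generated register or label names."""

    def __init__(self, counter=None, default_prefix="$P"):
        # The counter is shared between related RegCounters so that
        # generated names never collide.
        self._counter = counter if counter is not None else itertools.count()
        self._default_prefix = default_prefix
        self._names = {}

    def __getitem__(self, key):
        # First access generates a name; later accesses return the same one.
        if key not in self._names:
            # Leading non-lowercase characters of the key become the prefix.
            prefix = re.match(r"[^a-z]*", key).group() or self._default_prefix
            self._names[key] = f"{prefix}{next(self._counter)}"
        return self._names[key]

    def next(self, slot="next"):
        # Assumption: the child's "ret" register is the register this
        # counter knows under `slot`, so results land where the parent looks.
        child = RegCounter(self._counter, self._default_prefix)
        child._names["ret"] = self[slot]
        return child


rc = RegCounter()
cond = rc.next("condition")
code = f"unless {rc['condition']}, {rc['Lfalse']}\n{rc['Lfalse']}:\n"
```

Here C<cond["ret"]> and C<rc["condition"]> name the same register, which is what lets the condition's generated code and the C<unless> line agree without passing names around by hand.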

Luke

Andy Wardley

Feb 2, 2004, 5:19:25 AM

to Luke Palmer, Language List

Luke Palmer wrote:
> I think $() will help somewhat, as will interpolating method calls, but
> for a compiler, I'd really like PHP-like parse switching. That is, I
> could do something like (I'll use $< and $> for <? and ?>):

Check out the new scanner module for Template Toolkit v3. It does
exactly that. It allows you to specify as many different tag styles as
you like and uses a composite regex to locate them in a source document.
It extracts the intervening text, and then calls back to your code to do
whatever you like with them. It takes care of the surrounding text and
handles things like counting line numbers so that you don't have to worry
about it.
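
The general technique described here (several tag styles folded into one composite regex, callbacks for text and tags, line numbers tracked along the way) might look roughly like the Python sketch below. This is not the Template Toolkit scanner or its API; the function name `scan`, the callback signatures, and the sample tag styles are all assumptions made for illustration.

```python
import re


def scan(source, tag_styles, on_text, on_tag):
    """Split `source` into literal text and tagged directives.

    `tag_styles` is a list of (open, close) delimiter pairs.
    """
    # One alternation that matches any of the configured tag styles.
    composite = re.compile(
        "|".join(
            f"{re.escape(start)}(.*?){re.escape(end)}"
            for start, end in tag_styles
        ),
        re.DOTALL,
    )
    pos, line = 0, 1
    for m in composite.finditer(source):
        text = source[pos:m.start()]
        if text:
            on_text(text, line)
        line += text.count("\n")
        # Exactly one alternative matched, so exactly one group is not None.
        body = next(g for g in m.groups() if g is not None)
        on_tag(body.strip(), line)
        line += m.group().count("\n")
        pos = m.end()
    if pos < len(source):
        on_text(source[pos:], line)


events = []
scan(
    "null $ret\n<? for item ?>\nbody\n[% end %]\n",
    [("<?", "?>"), ("[%", "%]")],
    lambda text, line: events.append(("text", line, text)),
    lambda body, line: events.append(("tag", line, body)),
)
```

Each callback receives the line on which its chunk starts, so the calling code never has to count newlines itself.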

The code is still in development so you'll need to get it from CVS. See:

http://tt3.template-toolkit.org/code.html

Everything is raw and undocumented, but examples/scanner.pl shows an
example of what you want to do. Be warned that I'm working on this
right now, so things are changing often. Having said that, the scanner
is pretty much stable, although the handler object that it interacts
with isn't.

A

Larry Wall

Feb 2, 2004, 11:33:43 PM

to Language List

On Mon, Feb 02, 2004 at 02:09:33AM -0700, Luke Palmer wrote:
: I've been writing a lot of compiler recently, and figuring as how Perl
: 6 is aiming to replace yacc, I think I'll share some of my positive and
: negative experiences. Perhaps Perl 6 can adjust itself to help me out
: a bit. :-)

Perl 6 is designed to be adjusted, but it would be quite an AI feat
for it to adjust itself. :-)

: =over
:
: =item * RegCounter
:
: I have a class called RegCounter which is of immense use, but could be
: possibly more elegant. It's a tied hash that, upon access, generates a
: new name and stores it in a table for later retrieval under the same
: name.
:
: It has a method called C<next> that returns a new RegCounter that shares
: the same counter, and puts whatever was in that one's "ret" slot into
: whatever argument was given to C<next>, by default "next".
:
: The first <[^a-z]> characters in the name are passed along to the
: generated register name, defaulting to a target-specific string (for
: instance, I use $P for Parrot programs).
:
: So I can do, for instance:
:
:     method if_statement::code($rc) { # $rc is the regcounter
:         self.item[0].code($rc.next('condition'))
:           ~ "unless $rc{condition}, $rc{Lfalse}\n"
:           ~ self.item[1].code($rc.next)
:           ~ "$rc{Lfalse}:\n"
:     }

What do you want Perl 6 to do for you here?

: =item * Concatenations
:
: The code example you just saw gets much, much uglier if there is added
: complexity. One of my compilers returns lists of lines, the other
: concatenates strings, and they're both pretty hard to read -- especially
: when there are heredocs all over the place (which happens frequently).
:
: I think $() will help somewhat, as will interpolating method calls, but
: for a compiler, I'd really like PHP-like parse switching. That is, I
: could do something like (I'll use $< and $> for <? and ?>):
:
:     method logical_or_expression::code($rc) {
:         <<EOC;
:     null $rc{ret}
:     $< for @($self.item[0]) -> $item { $>
:     $item.code($rc.next)
:     if $rc{next}, $rc{Ldone}
:     $< } $>
:     $rc{Ldone}:
:     EOC
:     }

This seems to me to fall into the category of useful language warpings,
but not necessarily for mandatory public consumption. String literals
are parsed by the main parser in Perl 6, unlike in Perl 5. So a
grammatical munging should be doable. "All is fair if you predeclare" and
all that...

By the way, the first production language I ever wrote was an
inside-out language where control commands were embedded in text that
was to be output by default. So I'm not knocking your proposal.

: For this case, I think it would also be a good idea to have a string
: implementation somewhere that stores things as "ropes", a list of
: strings, so that immense copying isn't necessary.

Well, I suggested something like this early in the design of Parrot,
but it doesn't seem to have flown in the general case. On the other
hand, the string abstraction ought to be big enough to hide alternate
implementations behind it. The whole "is from" notion is built on that
idea.
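
As a rough illustration of the "ropes" idea, here is a small Python sketch in which concatenation builds a tree node instead of copying text, and the pieces are joined only once, when the final string is needed. The class name `Rope` and its methods are hypothetical; this says nothing about how Parrot or Perl 6 strings are actually implemented.

```python
class Rope:
    """A string stored as a tree of pieces; concatenation copies nothing."""

    def __init__(self, left, right=None):
        self.left = left      # a str, or a Rope
        self.right = right    # a str, a Rope, or None
        self.length = len(left) + (len(right) if right is not None else 0)

    def __len__(self):
        return self.length

    def __add__(self, other):
        # Build a new node pointing at both halves instead of copying them.
        return Rope(self, other)

    def pieces(self):
        # Iterative left-to-right walk, so deep ropes don't hit the
        # recursion limit.
        stack = [self]
        while stack:
            node = stack.pop()
            if isinstance(node, Rope):
                if node.right is not None:
                    stack.append(node.right)
                stack.append(node.left)
            elif node:
                yield node

    def __str__(self):
        # The only place the text is copied: once, when it is finally needed.
        return "".join(self.pieces())


code = Rope("null $P0\n")
for label in ("L1", "L2"):
    code = code + f"if $P1, {label}\n"
code = code + "L3:\n"
```

A code generator that appends many fragments this way pays for one join at the end rather than for a fresh copy at every concatenation.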

: =item * Comments
:
: We've already gone over this, but it'd be good to have the ability for
: parsers to (somehow) "feed" into one another, so that you can do
: comments without putting a <comment> in between every grammar rule (or
: mangling things to do that somehow), or search and replace, which has
: the disadvantage of being unable to disable comments during parts of the
: parse. $Parse::RecDescent::skip works well, but I don't think it's
: general enough.

Agreed. I do think you want the comments in the grammar, if for no
other reason than it provides a hook to do something with the comment
if you retarget the grammar from normal compilation to, say, code
translation. I don't think it's out of the realm of possibility for
Perl 6 to support strings with embedded objects as funny characters.
In the limit, a string could be composed of nothing but a stream
of objects. (As a hack, one can embed illegal Unicode characters
(above U+10FFFF) that map an integer to an array of objects, but
maybe we can do better from a GC perspective.)

: =item * Line Counting
:
: It is I<essential> that the regex engine is capable (perhaps off by
: default) of keeping track of your line number.

By all means! A compiler must absolutely never emit an inaccurate line
number if it can help it. Few things are as irritating as "...bailing
out near line 100." If we don't provide an explicit lexical analysis
pass that handles this, then the regex engine must somehow. Though I
haven't really thought much about the *how* part of the somehow.
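
One possible answer to the "how", sketched in Python under the assumption that the engine can report character offsets for a match: index the line starts once, then translate any offset into a line and column with a binary search. The `LineIndex` class and the sample source are made up for illustration.

```python
import bisect
import re


class LineIndex:
    """Translates character offsets into 1-based (line, column) pairs."""

    def __init__(self, text):
        # Offset at which each line begins.
        self.starts = [0] + [m.end() for m in re.finditer(r"\n", text)]

    def locate(self, offset):
        # Number of line starts at or before `offset` is the line number.
        line = bisect.bisect_right(self.starts, offset)
        return line, offset - self.starts[line - 1] + 1


source = "x = 1\ny = 2\nz = @\n"
index = LineIndex(source)
bad = re.search(r"[^\w\s=]", source)
line, column = index.locate(bad.start())
message = f"unexpected {bad.group()!r} at line {line}, column {column}"
```

The index costs one pass over the input, and each lookup is logarithmic in the number of lines, so it can stay switched off until an error or a diagnostic actually needs a position.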

Larry

Robert Eaglestone

Feb 3, 2004, 9:55:28 AM

to Language List

>: =item * Comments
>:
>: We've already gone over this, but it'd be good to have the ability for
>: parsers to (somehow) "feed" into one another, [...]
>
>... I don't think it's out of the realm of possibility for
>Perl 6 to support strings with embedded objects as funny characters.
>In the limit, a string could be composed of nothing but a stream
>of objects...
>
>[...]
>
>Larry

Paul Graham thinks it likely that the Programming Languages of
the Future (PLF) will model strings as linked lists of 'characters'.
But I like the "stream of objects" concept. It sounds cool!

Luke Palmer

Feb 3, 2004, 6:26:00 PM

to Language List

Larry Wall writes:
> On Mon, Feb 02, 2004 at 02:09:33AM -0700, Luke Palmer wrote:
> :     method if_statement::code($rc) { # $rc is the regcounter
> :         self.item[0].code($rc.next('condition'))
> :           ~ "unless $rc{condition}, $rc{Lfalse}\n"
> :           ~ self.item[1].code($rc.next)
> :           ~ "$rc{Lfalse}:\n"
> :     }
>
> What do you want Perl 6 to do for you here?

Beats me. I was just throwing it out there. Maybe it would spark an
idea somewhere.

> : We've already gone over this, but it'd be good to have the ability for
> : parsers to (somehow) "feed" into one another, so that you can do
> : comments without putting a <comment> in between every grammar rule (or
> : mangling things to do that somehow), or search and replace, which has
> : the disadvantage of being unable to disable comments during parts of the
> : parse. $Parse::RecDescent::skip works well, but I don't think it's
> : general enough.
>
> Agreed. I do think you want the comments in the grammar, if for no
> other reason than it provides a hook to do something with the comment
> if you retarget the grammar from normal compilation to, say, code
> translation. I don't think it's out of the realm of possibility for
> Perl 6 to support strings with embedded objects as funny characters.
> In the limit, a string could be composed of nothing but a stream of
> objects. (As a hack, one can embed illegal Unicode characters (above
> U+10FFFF) that map an integer to an array of objects, but maybe we can
> do better from a GC perspective.)

For implementation we'd surely be better off using some kind of list of
linked objects and strings. The pattern engine that I'm about to
propose to p6i wouldn't have a problem with that, efficiency wise.

But, after all, this is perl6-I<language>, so no more internals talk :-).

And exactly (or fuzzily) how might this be done, syntax wise?

"foo $bar baz"

Stringifies $bar and concatenates it into the string. C<~> does the
same thing. Perhaps Object.aschar or something.

But then there's how you extract such things. C<substr $s, $n, 1> might
return an object, but C<substr $s, $n, 2> would always return a string.
That's almost, but not quite, completely unexpected.

Maybe it'd be better to generalize into the realm of tokens. A token
could consist of a string, which is matched against with the normal
regex stuff. Or it could consist of an object, which would match the
<SomeClass> rule and fail on most else (extensibly, though).
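
To make the token idea concrete, here is a small Python sketch, offered only as one way it could work: the input is a list whose elements are either strings or arbitrary objects, and a pattern element is either a regex (which matches string tokens) or a class (which matches instances, playing the part of a <SomeClass> rule). The names `Comment` and `match_tokens` are invented for the example.

```python
import re


class Comment:
    """An example of an object embedded in the token stream."""

    def __init__(self, text):
        self.text = text


def match_tokens(pattern, tokens):
    """Match `tokens` against `pattern`, element by element.

    A pattern element is either a regex string (matches a str token in
    full) or a class (matches any token that is an instance of it).
    """
    if len(pattern) != len(tokens):
        return False
    for want, token in zip(pattern, tokens):
        if isinstance(want, type):
            if not isinstance(token, want):
                return False
        elif not (isinstance(token, str) and re.fullmatch(want, token)):
            return False
    return True


tokens = ["x = 1;", Comment("set x"), "y = 2;"]
statement = r"\w+ = \d+;"
with_comment = match_tokens([statement, Comment, statement], tokens)
text_only = match_tokens([r".*", r".*", r".*"], tokens)
```

In this sketch `with_comment` is true and `text_only` is false: an ordinary text pattern, even `.*`, does not match the embedded object.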

Objects underlying a string containing their stringified representations
still sounds pretty good, though. Especially if we use e.g. U+110000 as
the "object character", so objects that don't want to be matched like
ordinary text can treat themselves as embedded objects.

Luke