The Perl 6 Summary -- preprocessors

Dave Whipp

Jul 21, 2003, 12:56:25 PM

to perl6-l...@perl.org

"Piers Cawley" <p6summ...@bofh.org.uk> wrote:
> Parsers with Pre-processors
> I didn't quite understand what Dave Whipp was driving at when he talked
> about overloading the "<ws>" pattern as a way of doing preprocessing of
> Perl 6 patterns. I didn't understand Luke Palmer's answer either. Help.

Let me see if I can clarify a bit.

The new Grammar engine in P6 is designed to parse whole files. In fact, I
hope to be able to say:

my $fh = open "<hello.c";
$fh =~ /<Grammars::Languages::C>/;

(Or something similar).

This style of parsing differs from the traditional model in that there is no
preprocessor (and no lexer, either). In C, it is possible to add a
'#include "foo.h"' statement almost anywhere. It would get very tedious to
allow it in every production rule of the grammar (and also unmaintainable,
unreadable, ..., unusable). So it has to be possible to say "when matching
this C-file against this grammar: allow '#include ...' almost anywhere."

It just so happens that "almost anywhere" has a precise definition:
"anywhere where whitespace is legal". It also just so happens that Perl6
patterns have a :w modifier to enable a grammar writer to avoid placing a
<ws> assertion in every production rule. So I have 3 axioms:

* I want to avoid adding preprocessor-rule productions "almost everywhere"
* "almost everywhere" means "anywhere whitespace is legal"
* the :w modifier automatically allows whitespace (almost) anywhere
  whitespace is legal

I believe that an obvious inference is that :w processing should include the
preprocessor parsing. Then the issue becomes one of mechanism: how do we
tell :w what our (complex) definition of whitespace is; and how do we
implement the preprocessor commands to modify the input-stream to the regex
engine. Even though I'm not sure that my original answers are correct
(perhaps some form of C<temp> would be a better approach), I'll refer you
back to my original post (and Luke's reply) for those details. I also need
to think about how C-macros would be implemented.
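
The idea of folding the preprocessor into the whitespace rule can be sketched outside Perl 6. What follows is a minimal Python illustration, not the mechanism proposed in the thread: the names `Input`, `skip_ws`, `tokens` and the in-memory `FILES` table are all assumptions made for the example. The tokenizer never mentions `#include`; it only calls the whitespace skipper, and the skipper splices included text into the input wherever whitespace is legal.

```python
import re

# Hypothetical in-memory "file system" so the sketch is self-contained.
FILES = {"foo.h": "int foo ;"}

# An #include line: optional blanks, '#', 'include', a quoted name, end of line.
INCLUDE = re.compile(r'[ \t]*#[ \t]*include[ \t]+"([^"]+)"[ \t]*(?:\n|$)')
BLANK = re.compile(r"\s+")
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


class Input:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def at_line_start(self):
        # True when only blanks separate pos from the previous newline.
        start = self.text.rfind("\n", 0, self.pos) + 1
        return self.text[start:self.pos].strip() == ""

    def skip_ws(self):
        # The extended <ws>: ordinary whitespace, or an #include directive
        # whose file contents are spliced into the input at this point.
        while True:
            m = INCLUDE.match(self.text, self.pos) if self.at_line_start() else None
            if m:
                body = FILES[m.group(1)] + "\n"
                self.text = self.text[:self.pos] + body + self.text[m.end():]
                continue
            m = BLANK.match(self.text, self.pos)
            if not m:
                return
            self.pos = m.end()


def tokens(text):
    # The "grammar" never mentions #include; it only calls skip_ws().
    src, out = Input(text), []
    while True:
        src.skip_ws()
        if src.pos >= len(src.text):
            return out
        m = TOKEN.match(src.text, src.pos)
        out.append(m.group())
        src.pos = m.end()
```

Calling `tokens('int x ;\n#include "foo.h"\nint y ;')` yields the tokens of the included file in place of the directive. The sketch does not guard against a file that includes itself, and it ignores C macros entirely.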

> http://xrl.us/mt2

Dave.

Austin Hastings

Jul 21, 2003, 3:19:11 PM

to Dave Whipp, perl6-l...@perl.org

--- Dave Whipp <da...@whipp.name> wrote:
> "Piers Cawley" <p6summ...@bofh.org.uk> wrote
> > Parsers with Pre-processors
> > I didn't quite understand what Dave Whipp was driving at when
> > he talked about overloading the "<ws>" pattern as a way of doing
> > preprocessing of Perl 6 patterns. I didn't understand Luke
> > Palmer's answer either. Help.
>
> Let me see if I can clarify a bit.

...

> I believe that an obvious inference is that :w processing should
> include the preprocessor parsing. Then the issue becomes one of
> mechanism: how do we tell :w what our (complex) definition of
> whitespace is; and how do we implement the preprocessor commands
> to modify the input-stream to the regex engine. Even though I'm not
> sure that my original answers are correct (perhaps some form of
> C<temp> would be a better approach), I'll refer you back to my
> original post (and Luke's reply) for those details. I also need
> to think about how C-macros would be implemented.
>

Actually, IMO this goes back to the conversation we had some time ago
about being able to run grammars/patterns against arbitrary objects.

What you really want is to be able to "chain" grammars:

> my $fh = open "<hello.c";
> $fh =~ /<Grammars::Languages::C>/;

grammar Grammars::Languages::C {
    method init {
        SUPER::init;

        $.source = (new Grammars::Languages::C::Preprocessor).open($source);
    }
    ...
}

grammar Grammars::Languages::C::Preprocessor {
    rule CompilationUnit {
        ( <Directive> | <UnprocessedStuff> )*
    }

    rule Directive {
        <Hash> ( Include
               | Line
               | Conditional
               | Define
               ) <Continuation>*
    }

    rule Hash { /^\s*#\s*/ }
    rule Include {...}
    rule Line {...}
    rule Conditional {...}
    rule Define {...}
    rule Continuation {...}
    rule UnprocessedStuff {...}
}

Except that it would probably be even better to do this arbitrarily.

> my $fh = open "<hello.c";
> $fh =~ /<Grammars::Languages::C>/;

$fh =~ /<Grammars::Languages::C(input_method =
    Grammars::Languages::C::Preprocessor)>/;

(Of course, in reality the C grammar would automatically use the
preprocessor as its input method without having to be told. But it
should be able to do so as two separate grammars.)

Likewise:

my $fh = open "<perl.1.gz";

$fh =~ /<Grammars::Languages::Runoff::Nroff(input_method
    = Grammars::Languages::Runoff::tbl(input_method
    = Grammars::Languages::Runoff::eqn(input_method
    = IO::Gunzip)))>/;
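
The nesting of input methods above can be approximated in Python with objects that each pull lines from the stage they wrap. This is only a sketch under stated assumptions: `Stage`, `Gunzip`, `DropComments` and `Upper` are hypothetical stand-ins for the eqn/tbl/nroff grammars, and none of them parses real roff input.

```python
import gzip
import io


class Stage:
    """A hypothetical 'grammar' that reads lines from an upstream input method."""

    def __init__(self, input_method):
        self.input_method = input_method

    def __iter__(self):
        # Pull one line at a time from upstream and transform it.
        for line in self.input_method:
            yield from self.process(line)

    def process(self, line):
        yield line


class Gunzip:
    """Innermost input method: decompresses bytes and yields text lines."""

    def __init__(self, raw_bytes):
        self.raw = raw_bytes

    def __iter__(self):
        with gzip.open(io.BytesIO(self.raw), "rt") as fh:
            yield from fh


class DropComments(Stage):
    # Stand-in for a preprocessor such as eqn or tbl: drops roff comment lines.
    def process(self, line):
        if not line.startswith('.\\"'):
            yield line


class Upper(Stage):
    # Stand-in for the outermost grammar.
    def process(self, line):
        yield line.upper()
```

With `data = gzip.compress(b'.\\" comment\nhello\n')`, the expression `list(Upper(DropComments(Gunzip(data))))` reads the same way as the nested `input_method` arguments: the outermost stage drives, and each inner stage is consulted only when a line is needed.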

=Austin

David Storrs

Jul 21, 2003, 4:09:57 PM

to perl6-l...@perl.org

On Mon, Jul 21, 2003 at 12:19:11PM -0700, Austin Hastings wrote:

> Likewise:
>
> my $fh = open "<perl.1.gz";
>
> $fh =~ /<Grammars::Languages::Runoff::Nroff(input_method
>     = Grammars::Languages::Runoff::tbl(input_method
>     = Grammars::Languages::Runoff::eqn(input_method
>     = IO::Gunzip)))>/;

Very cool.

Assuming this ran successfully, what would the match object contain?
Or, more specifically, how would you get the pieces out? Using $1, $2
etc would get cumbersome...is there an easier way?

--Dks

Dave Whipp

Jul 21, 2003, 5:36:56 PM

to perl6-l...@perl.org

"Austin Hastings" <austin_...@yahoo.com> wrote:
> What you really want is to be able to "chain" grammars:
>
> > my $fh = open "<hello.c";
> > $fh =~ /<Grammars::Languages::C>/;
>
> grammar Grammars::Languages::C {
>     method init {
>         SUPER::init;
>
>         $.source = (new Grammars::Languages::C::Preprocessor).open($source);
>     }
>     ...
> }

I find myself wondering if this is covered by the P6 equiv of TieHandle.
I.e. is it just an input stream filter?

Dave.

Austin Hastings

Jul 21, 2003, 6:14:46 PM

to David Storrs, perl6-l...@perl.org

I thought about that briefly when composing my original reply.
Obviously, you want the act of invoking the grammar to "do something".
That is, when you say:

$fh =~ /<Grammar>/;

you want it to DTRT.

So what's TRT? IMO, for complex grammars it's going to build a parse
tree, complete with associated structures (symbol tables, etc). Thus,
the "Match" object will be the head of a really, REALLY big data
structure.

Technically $0 should probably refer to the entire file (as a String)
and to the parse tree (as Object). But an "extra" method for this
GrammarParseTree object would have been added -- C<getPreprocessed> or
some such -- which returns the entire file after cc -E.

So having Grammar::TopLevel call getc() (or whatever) which then
(because C<input_method = Grammar::BottomLevel>) accesses its own
internal string from Grammar::BottomLevel which presumably has been
remapped to the "right" format -- that's what a preprocessor does, no?

Obviously, the engine should be smart enough to read far enough ahead
to generate correct output, but no smarter.

=Austin

Austin Hastings

Jul 21, 2003, 6:18:25 PM

to Dave Whipp, perl6-l...@perl.org

--- Dave Whipp <da...@whipp.name> wrote:

> "Austin Hastings" <austin_...@yahoo.com> wrote:
> > $.source = (new Grammars::Language::C::Preprocessor).open($source);
>

> I find myself wondering if this is covered by the P6 equiv of
> TieHandle.
> I.e. is it just an input stream filter?
>

Doubtful.

Do you want to do this at the grammar level, or the file level?

If you want it at the file level, you C<tie> a translator (ooh! my
first p6 idiom!) to the file handle.

If you want it at the grammar level (i.e., regardless of source,
preprocess it for me) you want to have a composition mechanism.

I can see coding tr(1) as a translator (duh!) but I think your C parser
is always going to want the preprocessor attached. Finally! TRIMTOWTDI!

(RI=really is)

=Austin

Dave Whipp

Jul 21, 2003, 6:45:13 PM

to perl6-l...@perl.org

"Austin Hastings" <austin_...@yahoo.com> wrote:

> > I.e. is it just an input stream filter?
> Doubtful.
>
> Do you want to do this at the grammar level, or the file level?
>
> If you want it at the file level, you C<tie> a translator (ooh! my
> first p6 idiom!) to the file handle.
>
> If you want it at the grammar level (i.e., regardless of source,
> preprocess it for me) you want to have a composition mechanism.

I could play with semantics, and suggest that the regex engine always
matches a "stream" of characters. But I won't :-).

Instead, let's try the A6 pipeline syntax inside an assertion:

$fh =~ / < <Grammar.Lang.C.Preprocessor> ==> <Grammar.Lang.C> > /;

This might provide a starting point for defining the associated L0
structure. And I can think of some interesting generalizations for the LHS.

Dave.

Luke Palmer

Jul 21, 2003, 7:14:19 PM

to Austin_...@yahoo.com, da...@whipp.name, perl6-l...@perl.org

> grammar Grammars::Languages::C::Preprocessor {
>     rule CompilationUnit {
>         ( <Directive> | <UnprocessedStuff> )*
>     }
>
>     rule Directive {
>         <Hash> ( Include
>                | Line
>                | Conditional
>                | Define
>                ) <Continuation>*
>     }
>
>     rule Hash { /^\s*#\s*/ }
>     rule Include {...}
>     rule Line {...}
>     rule Conditional {...}
>     rule Define {...}
>     rule Continuation {...}
>     rule UnprocessedStuff {...}
> }

We're not quite in the world of ACME::DWIM, so you can't just replace
the important stuff with ... . :-)

You're not outputting a parse tree, you're just outputting more text
to be parsed with another, text-based, grammar. It seems to me like
it's a big s//ubstitution, of sorts.

Or maybe not, maybe we make an output stream which will be fed to
Grammars::Languages::C. Here's an implementation of a #include
processor, using as little made-up syntax as possible.

use Pattern::Common;

grammar Preprocessor {
    rule include {
        :w(/\h*/)
        \# include "<Pattern::Common::filename>" $$
        { $0 := (<<< open "< $filename") ~~ /<main>/ }
    }

    rule main {
        $0 := ( [<include> | .]* )
    }
}

Now, to chain them, you do the (seemingly more natural):

$fh ~~ /<Preprocessor>/ ~~ /<C>/

I've been trying to figure out how one would do Preprocessor lazily,
such that it would not process any more text until C said it needed
it. Perhaps something with ArrayString (the hypothetical class from
A5)? Coroutines?
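
For comparison, here is how the lazy behaviour looks with Python generators, which are a restricted form of coroutine. This is a sketch, not a proposal for the Perl 6 design: `preprocess`, the `files` mapping and the `log` list are assumptions made for the example, and the include syntax is matched with plain string tests rather than a grammar.

```python
import itertools


def preprocess(lines, files, log):
    """Lazily expand '#include "name"' lines, one input line per request."""
    for line in lines:
        log.append(line)  # records what has actually been read so far
        stripped = line.strip()
        if stripped.startswith('#include "') and stripped.endswith('"'):
            name = stripped[len('#include "'):-1]
            # Recurse into the included file; still lazy, still one line at a time.
            yield from preprocess(iter(files[name]), files, log)
        else:
            yield line
```

A consumer that takes only three lines, for example `itertools.islice(preprocess(source, files, log), 3)`, causes the generator to read just enough input to produce those three lines; anything after that point is left untouched in `source` until the consumer asks for more.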

Luke

Benjamin Goldberg

Jul 23, 2003, 8:25:55 PM

to perl6-l...@perl.org

Luke Palmer wrote:
>
> > grammar Grammars::Languages::C::Preprocessor {
> >     rule CompilationUnit {
> >         ( <Directive> | <UnprocessedStuff> )*
> >     }
> >
> >     rule Directive {
> >         <Hash> ( Include
> >                | Line
> >                | Conditional
> >                | Define
> >                ) <Continuation>*
> >     }
> >
> >     rule Hash { /^\s*#\s*/ }
> >     rule Include {...}
> >     rule Line {...}
> >     rule Conditional {...}
> >     rule Define {...}
> >     rule Continuation {...}
> >     rule UnprocessedStuff {...}
> > }
>
> We're not quite in the world of ACME::DWIM, so you can't just replace
> the important stuff with ... . :-)
>
> You're not outputting a parse tree, you're just outputting more text
> to be parsed with another, text-based, grammar. It seems to me like
> it's a big s//ubstitution, of sorts.

Hmm... well, think about yacc for a moment. You could either have the
handlers for rules assign to $$, based on $1, etc., and thus build a
huge structure, OR, you could have the handlers for rules print stuff to
stdout (which might possibly be a pipe to another process).

ISTM that we also want our grammars to be able to do things like that...

We want to be able to produce a tree, *and*, we want to be able to act
as a filter. And probably also some combination thereof -- some rules
produce trees, and other rules within the same grammar (which contain
the tree-producing ones) somehow process and output those trees.
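
The yacc analogy can be made concrete with a small Python sketch in which one parser is driven by interchangeable action objects. Everything here is hypothetical illustration: `parse_sum`, `TreeActions` and `FilterActions` are invented names, and the "grammar" only handles sums of integers.

```python
import re


def parse_sum(text, actions):
    """Parse 'NUM (+ NUM)*' and call the action hooks, yacc-style."""
    toks = re.findall(r"\d+|\+", text)
    if not toks or toks[0] == "+":
        raise SyntaxError("expected a number")
    result = actions.number(toks[0])
    i = 1
    while i < len(toks):
        if toks[i] != "+" or i + 1 >= len(toks) or toks[i + 1] == "+":
            raise SyntaxError("expected '+ NUM'")
        result = actions.add(result, actions.number(toks[i + 1]))
        i += 2
    return result


class TreeActions:
    # Like assigning to $$ in yacc: each rule returns a node.
    def number(self, tok):
        return ("num", int(tok))

    def add(self, left, right):
        return ("add", left, right)


class FilterActions:
    # Like printing from a yacc action: each rule emits output, returns nothing.
    def __init__(self):
        self.out = []

    def number(self, tok):
        self.out.append(f"push {tok}")

    def add(self, left, right):
        self.out.append("add")
```

The same `parse_sum` builds a nested tuple when given `TreeActions()` and appends stack-machine instructions to `out` when given `FilterActions()`. A mixed mode, where some hooks return nodes and others emit text, needs no change to the parser.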

> Or maybe not, maybe we make an output stream which will be fed to
> Grammars::Languages::C. Here's an implementation of a #include
> processor, using as little made-up syntax as possible.
>
> use Pattern::Common;
>
> grammar Preprocessor {
>     rule include {
>         :w(/\h*/)
>         \# include "<Pattern::Common::filename>" $$
>         { $0 := (<<< open "< $filename") ~~ /<main>/ }
>     }
>
>     rule main {
>         $0 := ( [<include> | .]* )
>     }
> }

Alas, this doesn't work right with a fairly common idiom:

#ifdef HAVE_FOO_H
#include <foo.h>
#endif

Nor the even more common:

#ifndef SYS_SOMEFILE_H
#define SYS_SOMEFILE_H
lots of stuff, possibly including some #includes, and some typedefs.
#endif /* SYS_SOMEFILE_H */
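
To show why these idioms need state that a pure include-splicer lacks, here is a minimal Python sketch that tracks defined names and a stack of open conditionals. It is an illustration under stated assumptions, not a C preprocessor: `preprocess` and the `files` mapping are invented for the example, only `#ifdef`, `#ifndef`, `#endif`, `#define` and `#include` are recognised, the `# include` spelling with a space is not handled, and macros are never expanded.

```python
def preprocess(name, files, defined=None):
    """Expand includes in files[name], honouring #ifdef/#ifndef/#endif."""
    defined = set() if defined is None else defined
    out = []
    stack = []  # one flag per open conditional: is that branch active?
    for line in files[name].splitlines():
        words = line.split()
        active = all(stack)  # inactive if any enclosing branch is inactive
        if words[:1] == ["#ifdef"]:
            stack.append(words[1] in defined)
        elif words[:1] == ["#ifndef"]:
            stack.append(words[1] not in defined)
        elif words[:1] == ["#endif"]:
            stack.pop()
        elif not active:
            continue  # skip everything else inside a dead branch
        elif words[:1] == ["#define"]:
            defined.add(words[1])
        elif words[:1] == ["#include"]:
            # The shared 'defined' set is what makes include guards work.
            out.extend(preprocess(words[1].strip('"<>'), files, defined))
        else:
            out.append(line)
    return out
```

With a guarded header included twice, the second inclusion contributes nothing, and an `#include` inside an inactive `#ifdef` is never opened, so a missing `foo.h` is not an error.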

--
$a=24;split//,240513;s/\B/ => /for@@=qw(ac ab bc ba cb ca
);{push(@b,$a),($a-=6)^=1 for 2..$a/6x--$|;print "$@[$a%6
]\n";((6<=($a-=6))?$a+=$_[$a%6]-$a%6:($a=pop @b))&&redo;}

Austin Hastings

Jul 23, 2003, 9:22:22 PM

to Luke Palmer, da...@whipp.name, perl6-l...@perl.org

> We're not quite in the world of ACME::DWIM, so you can't just replace
> the important stuff with ... . :-)

Maybe, but the C preprocessor isn't important, here, for itself. Otherwise I
could cheat:

grammar Grammar::Language::C::Preprocessor {
    rule CompilationUnit {
        FIRST { static $cheat = open "|/bin/cc -E|"; }
        <.*> { $cheat.print $1; $cheat.inputRecordSep(""); return <$cheat>; }
    }
}

> You're not outputting a parse tree, you're just outputting more text
> to be parsed with another, text-based, grammar. It seems to me like
> it's a big s//ubstitution, of sorts.

It would, unless you were trying to do something other than preprocess C
files with your C preprocessor grammar. (e.g., build a C interpreter).

I think the right thing here is for "filter" type grammars to make sure that
the match objects they return can instantiate as a collection of the "right
things".

So if I want to treat the parse tree as a parse tree, great. But if I want
to treat it as an input stream, or an array of char, or whatever it takes to
compose grammars, that should be okay, too.

This goes back to your suggestion about mixing in interfaces. I still think
that's worth while, and here's another good example of why: if
Grammar::Language::C::Preprocessor implements all the methods necessary to
qualify (even nominally) as a stream or string, then so be it: it's a stream
or string. Now we can compose grammars infinitely, as long as we understand
(at the developer level) what the hell we're doing.
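
In a duck-typed language the "qualifies as a stream or string" idea looks roughly like the Python sketch below. The class `PreprocessorMatch` and the function `count_words` are hypothetical; the point is only that one result object can be read as a tree, as a string, or as a file-like stream, so a later stage does not care which grammar produced it.

```python
import io


class PreprocessorMatch:
    """Hypothetical match object: a parse tree that also acts as text."""

    def __init__(self, pieces):
        self.tree = pieces  # list of (rule_name, matched_text) pairs
        self._stream = None

    def __str__(self):
        # The string view is the preprocessed text: directives removed.
        return "".join(text for rule, text in self.tree if rule != "directive")

    def read(self, size=-1):
        # The stream view: enough of the file protocol for simple consumers.
        if self._stream is None:
            self._stream = io.StringIO(str(self))
        return self._stream.read(size)


def count_words(source):
    # A downstream stage that only requires something with .read().
    return len(source.read().split())
```

`count_words` accepts an open file, an `io.StringIO`, or a `PreprocessorMatch` without knowing the difference, which is the composition property described above.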

There should be no question of the utility of this. The C preprocessor has
shown up in an awful lot of strange places. Imake and XRDB spring
immediately to mind, of course.

Someone else pointed out the probable utility of the pipeline operators for
this kind of thing. They're right, too. That idiom reads too well to pass
up.

$fh =~ <Grammar::Language::C <== Grammar::Language::C::Preprocessor>;

In fact, it might legitimately work the other way around, too:

$fh =~ <Grammar::Language::C::Preprocessor ==> Grammar::Language::C>;

After all, "invoking" the C grammar on another grammar is a pretty obvious
idiom for composition.
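
The two directions of the pipeline idiom can be mimicked in Python with operator overloading. This is a sketch, not Perl 6 semantics: `Stage` is an invented wrapper, and `>>` and `<<` merely stand in for `==>` and `<==`.

```python
class Stage:
    """Hypothetical pipeline stage wrapping a text -> value function."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, text):
        return self.fn(text)

    def __rshift__(self, other):
        # a >> b: feed a's output into b (like a ==> b).
        return Stage(lambda text: other(self(text)))

    def __lshift__(self, other):
        # a << b: the same pipeline written consumer-first (like a <== b).
        return other >> self


# Stand-ins for the preprocessor and the C grammar.
strip_directives = Stage(
    lambda t: "\n".join(l for l in t.splitlines() if not l.startswith("#"))
)
count_tokens = Stage(lambda t: len(t.split()))
```

Both `(strip_directives >> count_tokens)(src)` and `(count_tokens << strip_directives)(src)` build the same composed stage, so the choice between them is purely a matter of which reads better.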

This is just a question of idiomry, I think. (Or is it idiocy?)

=Austin