a smarter form of whitespace

Allison Randal

Jul 4, 2006, 3:57:16 PM

to perl6-c...@perl.org

I'm writing a parser for a language that treats a double newline as a
statement terminator. It works if I make every rule a 'regex' (to turn
off smart whitespace). But I want spaces and tabs to act as smart
whitespace, and newlines to act as literal whitespace. I've
overloaded <ws> to match only spaces and tabs, but the grammar still
consumes newlines where it shouldn't consume newlines. For a simple
repeatable example, take the following grammar:

------

token start { ^<emptyline>*$ }

regex emptyline { ^^ $$ \n }

token ws { [<sp> | \t]* }

------

If I match this against a string of 7 newlines, it returns 7 <emptyline>
matches, and each match is a single newline. This is the behavior I want
for newlines.

I would like to add smart whitespace matching for spaces and tabs. But,
if I change <emptyline> to a 'rule' and match it against the same string
of 7 newlines, it returns a single <emptyline> match and the matched
string is 7 newlines. I've tried several variations on the <ws> rule,
but it seems to boil down to: no matter what the <ws> rule matches, if
:sigspace is on, it treats newlines as ignorable whitespace.

Is this a bug or a feature?

Thanks,
Allison
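
The two results described above (seven one-newline matches versus one seven-newline match) can be reproduced outside PGE with a small Python sketch. It is only a model: it assumes that a 'rule' inserts an implicit <ws> call at every gap in the pattern, that the default <ws> behaves roughly like `\s*`, and that <ws> may backtrack. The names `DEFAULT_WS`, `NARROW_WS`, `emptyline_pattern` and `count_emptylines` are illustrative and are not part of PGE.

```python
import re

# Assumption: the built-in <ws> accepts any whitespace, newlines included.
DEFAULT_WS = r"\s*"
# Assumption: the overridden <ws> from the grammar accepts spaces and tabs only.
NARROW_WS = r"[ \t]*"


def emptyline_pattern(ws):
    # Models `rule emptyline { ^^ $$ \n }` with an implicit ws call at each gap.
    return re.compile(ws + r"^" + ws + r"$" + ws + r"\n" + ws, re.MULTILINE)


def count_emptylines(text, ws):
    # Models `<emptyline>*`: apply the pattern repeatedly from the current position.
    pattern = emptyline_pattern(ws)
    matches, pos = [], 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m is None or m.end() == pos:
            break
        matches.append(m.group(0))
        pos = m.end()
    return matches


seven = "\n" * 7
print(len(count_emptylines(seven, DEFAULT_WS)))  # 1 (a single match of 7 newlines)
print(len(count_emptylines(seven, NARROW_WS)))   # 7 (each match is one newline)
```

If the model is fair, a single seven-newline match suggests that the <ws> being called is one that accepts newlines, rather than the overridden one.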

Allison Randal

Jul 5, 2006, 5:46:38 PM

to Nathan Gray, perl6-c...@perl.org

Nathan Gray wrote:
>
> Overloading <ws> and other builtins was fixed in parrot and pugs
> approaching midnight (hackathon time) on 2006-06-29. If your parrot
> and pugs are both more recent than that, I'm not sure where the bug
> is.

I have the latest checkout of Parrot (I'm not using Pugs).

It may not be a bug. The design question is: should <ws> match a newline
even when it's been overloaded to match only spaces and tabs? (I'm
thinking "No", but could be wrong.)

Allison

Patrick R. Michaud

Jul 5, 2006, 10:05:16 PM

to Allison Randal, perl6-c...@perl.org

On Tue, Jul 04, 2006 at 12:57:16PM -0700, Allison Randal wrote:
> ------
>
> token start { ^<emptyline>*$ }
>
> regex emptyline { ^^ $$ \n }
>
> token ws { [<sp> | \t]* }
>
> ------

The above grammar doesn't have a "grammar" statement; as a result
the regexes are being installed into the '' namespace.

> If I match this against a string of 7 newlines, it returns 7 <emptyline>
> matches, and each match is a single newline. This is the behavior I want
> for newlines.

I tried it with a grammar statement and it seems to work:

----

$ cat ar.pg
grammar XYZ;

token start { ^<emptyline>*$ }

rule emptyline { ^^ $$ \n }

token ws { [<sp> | \t]* }

$ ./parrot compilers/pge/pgc.pir ar.pg >ar.pir
$ cat xyz.pir
.sub main :main
    load_bytecode 'PGE.pbc'
    load_bytecode 'ar.pir'
    load_bytecode 'dumper.pbc'
    load_bytecode 'PGE/Dumper.pbc'

    $P0 = find_global 'XYZ', 'start'
    $P1 = $P0("\n\n\n\n\n\n\n", 'grammar' => 'XYZ')

    '_dumper'($P1)
.end
$ ./parrot xyz.pir
"VAR1" => PMC 'XYZ' => "\n\n\n\n\n\n\n" @ 0 {
    <emptyline> => ResizablePMCArray (size:7) [
        PMC 'XYZ' => "\n" @ 0,
        PMC 'XYZ' => "\n" @ 1,
        PMC 'XYZ' => "\n" @ 2,
        PMC 'XYZ' => "\n" @ 3,
        PMC 'XYZ' => "\n" @ 4,
        PMC 'XYZ' => "\n" @ 5,
        PMC 'XYZ' => "\n" @ 6
    ]
}
$

-----

Pm

Nathan Gray

Jul 5, 2006, 1:04:06 PM

to Allison Randal, perl6-c...@perl.org

On Tue, Jul 04, 2006 at 12:57:16PM -0700, Allison Randal wrote:

> I'm writing a parser for a language that treats a double newline as a
> statement terminator. It works if I make every rule a 'regex' (to turn
> off smart whitespace). But I want spaces and tabs to act as smart
> whitespace, and newlines to act as literal whitespace. I've
> overloaded <ws> to match only spaces and tabs, but the grammar still
> consumes newlines where it shouldn't consume newlines. For a simple
> repeatable example, take the following grammar:

Overloading <ws> and other builtins was fixed in parrot and pugs
approaching midnight (hackathon time) on 2006-06-29. If your parrot
and pugs are both more recent than that, I'm not sure where the bug
is.

-kolibrie

Allison Randal

Jul 6, 2006, 3:29:12 AM

to Patrick R. Michaud, perl6-c...@perl.org

Patrick R. Michaud wrote:
> On Tue, Jul 04, 2006 at 12:57:16PM -0700, Allison Randal wrote:
>> ------
>>
>> token start { ^<emptyline>*$ }
>>
>> regex emptyline { ^^ $$ \n }
>>
>> token ws { [<sp> | \t]* }
>>
>> ------
>
> The above grammar doesn't have a "grammar" statement; as a result
> the regexes are being installed into the '' namespace.

The original did have a 'grammar' statement, I just didn't paste it into
the email.

> $ cat xyz.pir
> .sub main :main
> load_bytecode 'PGE.pbc'
> load_bytecode 'ar.pir'
> load_bytecode 'dumper.pbc'
> load_bytecode 'PGE/Dumper.pbc'
>
> $P0 = find_global 'XYZ', 'start'
> $P1 = $P0("\n\n\n\n\n\n\n", 'grammar' => 'XYZ')

What the original didn't have is the 'grammar' named argument when
calling the start rule. When I replace the previous line with:

$P1 = $P0("\n\n\n\n\n\n\n")

then your sample code exhibits the same problem. I assume this means
that the reason overriding <ws> wasn't working is because it was calling
the default version of <ws> in the root namespace. But, if it was
defaulting to the root namespace, why was it able to find any of the
rules? Shouldn't it have complained that it couldn't find <emptyline>?

Thanks,
Allison

Patrick R. Michaud

Jul 6, 2006, 10:54:12 AM

to Allison Randal, perl6-c...@perl.org

On Thu, Jul 06, 2006 at 12:29:12AM -0700, Allison Randal wrote:
> >$ cat xyz.pir
> >.sub main :main
> > load_bytecode 'PGE.pbc'
> > load_bytecode 'ar.pir'
> > load_bytecode 'dumper.pbc'
> > load_bytecode 'PGE/Dumper.pbc'
> >
> > $P0 = find_global 'XYZ', 'start'
> > $P1 = $P0("\n\n\n\n\n\n\n", 'grammar' => 'XYZ')
>
> What the original didn't have is the 'grammar' named argument when
> calling the start rule. When I replace the previous line with:
>
> $P1 = $P0("\n\n\n\n\n\n\n")
>
> then your sample code exhibits the same problem. I assume this means
> that the reason overriding <ws> wasn't working is because it was calling
> the default version of <ws> in the root namespace. But, if it was
> defaulting to the root namespace, why was it able to find any of the
> rules? Shouldn't it have complained that it couldn't find <emptyline>?

At the moment (and this may be incorrect), PGE looks for named rules
via inheritance, and if not found that way it looks in the available
symbol tables using the find_name opcode.

So, the match was able to find the rules because they are in the
current namespace, but when it came time to find the rule for <?ws>
there was a "ws" method available (the default) and so that one
was used.

Again, this may not be the correct behavior; I've been using S12 as
the guide here, in that a method call first considers methods from
the class hierarchy and fails over to subroutine dispatch.

Pm
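
The lookup order described in this message (class hierarchy first, then the available symbol tables) can be sketched in Python to show why <emptyline> was found while the overridden <ws> was not. This is a rough model and not PGE's implementation: `BaseMatch`, `SYMBOLS` and `find_rule` are hypothetical names, and the dictionary stands in for whatever the find_name opcode would search.

```python
class BaseMatch:
    """Hypothetical stand-in for the default match class, which has a built-in ws."""

    @staticmethod
    def ws():
        return "default ws"


class XYZ(BaseMatch):
    """Hypothetical stand-in for the user's grammar, which overrides ws."""

    @staticmethod
    def ws():
        return "XYZ ws (spaces and tabs only)"

    @staticmethod
    def emptyline():
        return "XYZ emptyline"


# Assumption: the grammar's rules are also visible through a symbol table.
SYMBOLS = {"ws": XYZ.ws, "emptyline": XYZ.emptyline}


def find_rule(match_class, name, symbols=SYMBOLS):
    # Step 1: method lookup through the class hierarchy.
    rule = getattr(match_class, name, None)
    if rule is not None:
        return rule
    # Step 2: fall back to the symbol table (subroutine-style dispatch).
    if name in symbols:
        return symbols[name]
    raise LookupError("no rule named " + name)


# Without a 'grammar' argument the match starts in the default class:
print(find_rule(BaseMatch, "emptyline")())  # XYZ emptyline (found via fallback)
print(find_rule(BaseMatch, "ws")())         # default ws (inherited method wins)
# With 'grammar' => 'XYZ' the override is found first:
print(find_rule(XYZ, "ws")())               # XYZ ws (spaces and tabs only)
```

Under this model the fallback never gets a chance to find the overridden ws, because step 1 already succeeds with the default method.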