Question to the list: BNF interface notation

Jeffrey Kegler

unread,

Nov 5, 2012, 1:26:58 PM11/5/12

to Marpa Parser Mailing LIst

I intend in the BNF interface to introduce parentheses, which will indicate that the semantics do not see the enclosed symbols. This can be seen a generalization of the current keep/discard separators issue in sequences. I need this generalization is needed for some other features I am working on.

What I intend is that, for example, in the rule

sum ::= term (op_add) term action => do_addition

the semantics would see the two 'term' values, but would not see the op_add. You'd write the semantics like this:

do_addition {

   my (undef, $a, $b) = @_;

   return $a + $b;

}

On the other hand if the rule were written as follows

sum ::= term op_add term action => do_addition

the traditional semantics would be written

do_addition {

   my (undef, $a, undef, $b) = @_;

   return $a + $b;

}

Reasons for the discard-if-parenthesized convention: Parenthesis traditionally suggest hidden or optional. Also, I may allow parentheses for grouping, in which case the semantics of the parenthesized part would become quite complex. If an alternative to discarding a complex parenthesized group was allowed, it would need special additional notation anyway.

Possible reasons to dislike it: In Perl 5 regexes, parentheses by default mean capture and, for regexes, capture is roughly analogous to semantic visibility. I regret breaking from the Perl convention, but parentheses in Perl regexes *are* ugly and semantically overloaded, and I think the notation suggested above will be more graceful.

Feedback from this list has changed my mind in the past, and I'd like to hear from you.

-- jeffrey

rns

unread,

Nov 5, 2012, 1:31:58 PM11/5/12

to marpa-...@googlegroups.com

Sounds good. One question: how about adding a question mark

sum ::= term (op_add)? term action => do_addition

to specify that the symbols enclosed in parent are optional? This would sit well with your intent to introduce parens for grouping, too.

Peter Stuifzand

unread,

Nov 5, 2012, 3:50:19 PM11/5/12

to marpa-...@googlegroups.com

Using parentheses for hiding feels wrong to me. Some thoughts:

1. My thought sometime ago was to use the parentheses for grouping (or subrules) to be able to include more in a rule than would be possible in the original 'rules' argument. As an example you could think of something like "term (op_add term)*". The "*" can't be specified without adding more rules.
http://peterstuifzand.nl/2012/04/04/rewriting-marpa-rules.html

2. Parentheses in Perl 5 are used to capture the surrounded bit. This will do the opposite.

3. In mathematics parentheses are used for grouping. I guess this is a familiar syntax for most people.

4. What will this do when the user encloses more than one term in parentheses? Will this hide all terms? For example:

hash_reference ::= rvalue (ARROW LBR) expression (RBR)

Which would parse an expression like this:

$tree->{left}

This would hide the uppercase tokens for the action. It's hard for me to not see this as a group.

--
Peter Stuifzand | peterstuifzand.nl | @pstuifzand

Ron Savage

unread,

Nov 5, 2012, 4:28:48 PM11/5/12

to marpa-...@googlegroups.com

Hi Jeffrey

On 06/11/12 05:26, Jeffrey Kegler wrote:
> I intend in the BNF interface to introduce parentheses, which will
> indicate that the semantics do not see the enclosed symbols. This can be
> seen a generalization of the current keep/discard separators issue in
> sequences. I need this generalization is needed for some other features
> I am working on.
>
> What I intend is that, for example, in the rule
>
> |sum ::= term (op_add) term action => do_addition|

Or?

|sum ::= term [op_add] term action => do_addition|

Yes [] are overloaded, but often indicate optional items...

--
Ron Savage
http://savage.net.au/
Ph: 0421 920 622

rns

unread,

Nov 5, 2012, 4:38:54 PM11/5/12

to marpa-...@googlegroups.com

A couple of notes to 2. and 3. below; not to argue, just notes.

Parens can be thought of as being used for hiding in zero-width Look-Around Assertions, e.g.

(?=pattern) and (?<=pattern) and friends in Perl.

When used in math for grouping, parens can denote a part of the formula that needs to be factored out to take a new symbol.

Jeffrey Kegler

unread,

Nov 5, 2012, 6:54:27 PM11/5/12

to marpa-...@googlegroups.com

@Peter: Yes, whatever the bracketing it would hide all the terms inside it.

x ::= {a b c} d {e}

would hide everything but the "d".

In Perl 6, Larry uses curly brackets for non-capturing groups. This means that character classes (which I am going to need) become

<[
... ]>

. That seems to me a bit heavy weight, since I expect in future versions of Marpa users will use character classes a lot.

What about curlies for non-capturing groups, as in the above?

-- jeffrey

Jeffrey Kegler

unread,

Nov 5, 2012, 6:56:43 PM11/5/12

to marpa-...@googlegroups.com

@Ron: As in my email to Peter, the problem with square brackets is that
I'll want them for character classes, including as an alternative way of
indicating single characters in difficult cases.

-- jeffrey

Jeffrey Kegler

unread,

Nov 5, 2012, 7:00:32 PM11/5/12

to marpa-...@googlegroups.com

@rns: The notation you cite is an example of why I characterized Perl 5 regex parens as ugly, overloaded and confusing. Perl 5 was a victim of its own success and the need for compatibility, and had very little choice except to extend its syntax via its paren operators. Hopefully, we'll avoid that here.

Yes, eventually I expect to add quantifiers like '?', '*' and '+' for parens.

-- jeffrey

Ron Savage

unread,

Nov 5, 2012, 7:34:27 PM11/5/12

to marpa-...@googlegroups.com

Hi Jeffrey

On 06/11/12 10:56, Jeffrey Kegler wrote:
> @Ron: As in my email to Peter, the problem with square brackets is that
> I'll want them for character classes, including as an alternative way of
> indicating single characters in difficult cases.

Yep - Makes sense.

Ron Savage

unread,

Nov 5, 2012, 7:38:09 PM11/5/12

to marpa-...@googlegroups.com

Hi Jeffrey

On 06/11/12 10:54, Jeffrey Kegler wrote:
> @Peter: Yes, whatever the bracketing it would hide all the terms inside it.
>
> |x ::= {a b c} d {e}|
>
> would hide everything but the "d".
>
> In Perl 6, Larry uses curly brackets for non-capturing groups. This
> means that character classes (which I am going to need) become |<[ ...
> ]>|. That seems to me a bit heavy weight, since I expect in future
> versions of Marpa users will use character classes a lot.
>
> What about curlies for non-capturing groups, as in the above?

Nope. I've come round to (...)? with the ? to indicate 0 or 1.

In other words, a visual indicator of intent.

And now I can choose to capture that 'thing' if I wish.

The reason I wouldn't opt for a Perl 6 syntax by default is that I don't
see many people moving to Perl 6 for a long, long time.

> -- jeffrey
>
> Peter Stuifzand wrote:
>> Using parentheses for hiding feels wrong to me. Some thoughts:
>>
>> 1. My thought sometime ago was to use the parentheses for grouping (or
>> subrules) to be able to include more in a rule than would be possible
>> in the original 'rules' argument. As an example you could think of
>> something like "term (op_add term)*". The "*" can't be specified
>> without adding more rules.
>> http://peterstuifzand.nl/2012/04/04/rewriting-marpa-rules.html
>>
>> 2. Parentheses in Perl 5 are used to capture the surrounded bit. This
>> will do the opposite.
>>
>> 3. In mathematics parentheses are used for grouping. I guess this is a
>> familiar syntax for most people.
>>
>> 4. What will this do when the user encloses more than one term in
>> parentheses? Will this hide all terms? For example:
>>
>> hash_reference ::= rvalue (ARROW LBR) expression (RBR)
>>
>> Which would parse an expression like this:
>>
>> $tree->{left}
>>
>> This would hide the uppercase tokens for the action. It's hard for me
>> to not see this as a group.
>>
>> --

>> Peter Stuifzand | peterstuifzand.nl <http://peterstuifzand.nl> |
>> @pstuifzand <http://twitter.com/pstuifzand>

>>
>>
>> On Mon, Nov 5, 2012 at 7:31 PM, rns <sor...@gmail.com
>> <mailto:sor...@gmail.com>> wrote:
>>
>> Sounds good. One question: how about adding a question mark
>>

>> sum ::= term (op_add)*?* term action => do_addition

Jeffrey Kegler

unread,

Nov 5, 2012, 8:56:58 PM11/5/12

to marpa-...@googlegroups.com

Note that there are 3 questions here. Grouping, quantification (that
is, ?, *, + and friends), and semantic visibility. "Semantic
visibility" is what I'm concerned with at the moment -- it means hiding
the symbol from the semantics. (This is important because it needs to
be done for sane handling of sequence separators, whitespace, etc.) I
am leaving grouping and quantification for later, although of course
anything I decide now might paint me into a corner when I need to add
syntax for them.

The syntax must be brackets and we've got these four choices: (), [],
<>, and {}. In a BNF-based language angle brackets are taken for
indicating symbols. I'll want character classes and standard regular
expression notation for these is []. That leaves () and {}.

{} is available, but if I use it for this purpose, it can't later be
used for semantics. () has the problem that it means "capture" in Perl
regular expressions. but that could be seen as a feature -- regular
expression captures and Marpa semantics are very different beasts, and
reminding people of that might be a good thing.

-- jeffrey

Ron Savage wrote:

rns

unread,

Nov 5, 2012, 9:55:16 PM11/5/12

to marpa-...@googlegroups.com

Can't help thinking of \..\, as in \a, \b, \c for escaping a, b, c, \a b c\ for hiding a b c.

rns

unread,

Nov 5, 2012, 10:05:36 PM11/5/12

to marpa-...@googlegroups.com

On Tuesday, November 6, 2012 2:00:35 AM UTC+2, Jeffrey Kegler wrote:

@rns: The notation you cite is an example of why I characterized Perl 5 regex parens as ugly, overloaded and confusing. Perl 5 was a victim of its own success and the need for compatibility, and had very little choice except to extend its syntax via its paren operators. Hopefully, we'll avoid that here.

Fair enough.

Yes, eventually I expect to add quantifiers like '?', '*' and '+' for parens.

That's great to hear.

Ruslan Zakirov

unread,

Nov 6, 2012, 2:04:35 AM11/6/12

to marpa-...@googlegroups.com

On Tue, Nov 6, 2012 at 12:50 AM, Peter Stuifzand
<peter.s...@gmail.com> wrote:
> Using parentheses for hiding feels wrong to me. Some thoughts:

Agree. There are plenty of variants:

{xx}, <xxx>, -xxx- (cross out font in some markup languages), xxx/n
(no capture modifier).

--
Best regards, Ruslan.

Ruslan Zakirov

unread,

Nov 6, 2012, 2:13:27 AM11/6/12

to marpa-...@googlegroups.com

On Tue, Nov 6, 2012 at 5:56 AM, Jeffrey Kegler
<jeffre...@jeffreykegler.com> wrote:
> The syntax must be brackets and we've got these four choices: (), [], <>,
> and {}. In a BNF-based language angle brackets are taken for indicating
> symbols. I'll want character classes and standard regular expression
> notation for these is []. That leaves () and {}.

You can use quoting with -,_,`,',", for example -xxx-
You can use !#^-~ as prefix or suffix, for example -xxx, xxx!

Using prefix or suffix instead of brackets allows you to apply these
modifiers to groups along with quantification.

--
Best regards, Ruslan.

Peter Stuifzand

unread,

Nov 6, 2012, 4:24:38 AM11/6/12

to marpa-...@googlegroups.com

How often would non-capturing be used? It doesn't seem that useful to me.

And if you are going to add + and * later to these types of expressions you'll also have to add a capturing variant. That way you'll go down the same complexity hole that Perl went, where you have many overloaded variants of the parentheses.

--
Peter Stuifzand | peterstuifzand.nl | @pstuifzand

Jeffrey Kegler

unread,

Nov 6, 2012, 5:16:53 AM11/6/12

to marpa-...@googlegroups.com

Re usefulness: I'm going somewhere I think is worthwhile, and it is a necessary step on the way. For now, you'll have to trust me.

You can use quoting with -,_,`,',", for example -xxx-

_,'," are needed for other purposes. minus signs and back ticks would be unreadable for bracketing, but minus signs might work as prefixes, as in -(x).

The reason I connect grouping and semantic visibility is that, once I have complex grouping, its semantics become hard to define. Consider

x ::= a (b|c|(d e)*) f

What value is passed to the semantics for (b|c|(d e)*) ? Best for it to have no value by default, so that in something like.

x ::= a -(b|c|(d e)*) f

the minus sign is really redundant.

One variation might be that minus indicates "hide" and plus indicates "visible", and that parentheses group and hide by default, so that in

x ::- a b (c) +(d) -e -f

a and b are visible by default, c is hidden by default, d is explicitly visible, and e and f are explicitly hidden.

-- jeffrey

Peter Stuifzand wrote:

Ruslan Shvedov

unread,

Nov 6, 2012, 5:23:04 AM11/6/12

to marpa-...@googlegroups.com

On Tue, Nov 6, 2012 at 12:16 PM, Jeffrey Kegler <jeffre...@jeffreykegler.com> wrote:

Re usefulness: I'm going somewhere I think is worthwhile, and it is a necessary step on the way. For now, you'll have to trust me.

You can use quoting with -,_,`,',", for example -xxx-
_,'," are needed for other purposes. minus signs and back ticks would be unreadable for bracketing, but minus signs might work as prefixes, as in -(x).

The reason I connect grouping and semantic visibility is that, once I have complex grouping, its semantics become hard to define. Consider

x ::= a (b|c|(d e)*) f

What value is passed to the semantics for (b|c|(d e)*) ?

x ::= a (b|c|(d e)*) f

will probably be rewritten as

x ::= a SR1 f
SR1 ::= b|c|(d e)*

so the semantics of (b|c|(d e)*) should be that of SR1?

The value of the rule, which

Jeffrey Kegler

unread,

Nov 6, 2012, 5:29:02 AM11/6/12

to marpa-...@googlegroups.com

@Ruslan: Yes. I just sent one thought.

Of the choices for prefix,

# is taken, for comments.

! and ~ are traditionally toggles, and I'd like direct indicators, for those who've forgotten the default hidden/visible category. That way if you can't remember what I decided about parentheses, hidden or visible, you just do -(x) and you know x is hidden. Or +(x) and you know x is visible.

^ -- caret is nice, but there is no opposite, as in the case with + and -.

Prefix + and - also have the advantage that, in this context, they aren't overloaded -- none of the other traditional meanings really conflict.

-- jeffrey

Jeffrey Kegler

unread,

Nov 6, 2012, 5:36:51 AM11/6/12

to marpa-...@googlegroups.com

x ::= a (b|c|(d e)*) f

will probably be rewritten as

x ::= a SR1 f

SR1 ::= b|c|(d e)*

Yes, exactly. If your parenthesized expression has semantics, you'd really be pretty much forced to break it out into a separate rule in order to write those semantics reasonably. So, therefore, I want parentheses to mean "hidden from the semantics", and rule-broken-out-separately to mean "visible to the semantics". When used for grouping, parentheses pretty much have to imply "hidden" anyway, so they might as well actually indicate "hidden".

-- jeffrey

Peter Stuifzand

unread,

Nov 6, 2012, 5:39:57 AM11/6/12

to marpa-...@googlegroups.com

Yes, the semantics become hard when rewriting rules, especially with positional arguments. Another solution to that problem would be named arguments. That way you can name the parts of the rhs that you would like to capture. This solution has other problems.

--
Peter Stuifzand | peterstuifzand.nl | @pstuifzand

Jeffrey Kegler

unread,

Nov 6, 2012, 5:47:35 AM11/6/12

to marpa-...@googlegroups.com

Right. I really like the idea of named arguments, but I can't figure out how to implement them efficiently in the Perl context. I'd practically have to call one method in order to unpack the arguments of another, which is kind of ridiculous. And this is one place where efficiency becomes quite visible -- most of the time in Marpa::R2 applications is typically *not* spent parsing, which is done in highly efficient Perl. Most of the time, in some cases almost all, is spent in the semantics.

-- jeffrey

Peter Stuifzand wrote:

Ruslan Shvedov

unread,

Nov 6, 2012, 6:18:30 AM11/6/12

to marpa-...@googlegroups.com

On Tue, Nov 6, 2012 at 12:29 PM, Jeffrey Kegler <jeffre...@jeffreykegler.com> wrote:

@Ruslan: Yes. I just sent one thought.

Of the choices for prefix,

# is taken, for comments.

! and ~ are traditionally toggles, and I'd like direct indicators, for those who've forgotten the default hidden/visible category. That way if you can't remember what I decided about parentheses, hidden or visible, you just do -(x) and you know x is hidden. Or +(x) and you know x is visible.

^ -- caret is nice, but there is no opposite, as in the case with + and -.

Prefix + and - also have the advantage that, in this context, they aren't overloaded -- none of the other traditional meanings really conflict.

FWIW, - is used by W3C to mean "doesn't match' [1] but that's probably close.


[1] A - B

matches any string that matches A but does not match B.

— http://www.w3.org/TR/REC-xml/#sec-notation

Ruslan Shvedov

unread,

Nov 6, 2012, 6:27:14 AM11/6/12

to marpa-...@googlegroups.com

On Tue, Nov 6, 2012 at 12:36 PM, Jeffrey Kegler <jeffre...@jeffreykegler.com> wrote:

x ::= a (b|c|(d e)*) f

will probably be rewritten as

x ::= a SR1 f

SR1 ::= b|c|(d e)*

Yes, exactly. If your parenthesized expression has semantics, you'd really be pretty much forced to break it out into a separate rule in order to write those semantics reasonably. So, therefore, I want parentheses to mean "hidden from the semantics", and rule-broken-out-separately to mean "visible to the semantics". When used for grouping, parentheses pretty much have to imply "hidden" anyway, so they might as well actually indicate "hidden".

That'd look akin to factoring the parenthesized expression out (of the semantics) and substituting it with 'semantic null'.

Another idea would be to use parens for grouping and then (a b c)? for zero or one occurrence of a b c and use another question mark to mean hidden from semantics, just like non-greedy modifier in Perl, unless you have other uses for it, e.g.

(a b c)?? = 0 or 1 * (a b c), hidden from semantics

(a b c)*? = 0..Inf * (a b c), hidden from semantics

(a b c)+? = 1..Inf * (a b c), hidden from semantics

* meaning non-distributive "times of occurrence"

That would allow use parens for grouping, quantifiers for, well, quantifying, and second ? for semantic hiding.

Ruslan Shvedov

unread,

Nov 6, 2012, 6:45:16 AM11/6/12

to marpa-...@googlegroups.com

On Tue, Nov 6, 2012 at 12:47 PM, Jeffrey Kegler <jeffre...@jeffreykegler.com> wrote:

Right. I really like the idea of named arguments, but I can't figure out how to implement them efficiently in the Perl context. I'd practically have to call one method in order to unpack the arguments of another, which is kind of ridiculous. And this is one place where efficiency becomes quite visible -- most of the time in Marpa::R2 applications is typically *not* spent parsing, which is done in highly efficient Perl. Most of the time, in some cases almost all, is spent in the semantics.

On a related note, I'm now working on what I call EBNF (BNF + quantified grouping) with embedded (%{ %})actions, where I decided to allow

'(' identifier ':' rhs ')' quantifier? action?

syntax, where identifier serves as the subrule name. I even have thoughts to grow this into a sort of language for programming with EBNF, just as we do with regular expressions, with EBNF used instead of REs for more expressiveness.

$output =~ parse(

$rules => q{

S ::= s1 s2 | s3 s4

%{ action(S) %}

s1 ::=

s2

%{

# action(s1 -> s2)
... application code here ...

%}

|

(s3 s5)+

%{

# action(s1 -> s3 s5)
... application code here ...

%}

},

$input

);

RegExp::Grammar was a nice attempt at that, but it lacked practical general BNF parsing.

Spaghetti code as it is at the moment, but that't my first cut at it. It now parses arithmetic expressions grammar, I hope to add actions (that will me set up as Marpa::R2::Recognizer's closures arg) today or tomorrow and then go further to re-implementing bc. This sort of works already for BNF with quantifiers, embedded actins and lexer rules—here is the reversing diff example from Parse::RecDescent tutorial.

Jeffrey Kegler

unread,

Nov 6, 2012, 10:51:41 AM11/6/12

to marpa-...@googlegroups.com

If you're tackling issues like interfacing with a semantics, you might consider writing a BNF-to-C Marpa compiler. That is, take the BNF and, instead of running it in Perl, compile it into C that uses the libmarpa library directly. It'd run several times as fast as the Perl. Here is what a pure C/libmarpa program looks like -- it's something I created for memory leak checking, and has only stub semantics.

Another possibility, as a first step, is a BNF to Marpa::R2::Thin interface compiler. Much of the benefit of the BNF-to-C, but easier to do. I use a very simple prototype of one to create the code that compiles the Stuifzand interface -- the code is in Marpa::R2's lib/Marpa/R2/meta directory.

Of course, the easiest way to approach may be to get it working in the main Marpa::R2 interface first. That's the way I did it.

-- jeffrey

rns

unread,

Nov 6, 2012, 4:39:06 PM11/6/12

to marpa-...@googlegroups.com

On Tuesday, November 6, 2012 5:51:45 PM UTC+2, Jeffrey Kegler wrote:

If you're tackling issues like interfacing with a semantics, you might consider writing a BNF-to-C Marpa compiler. That is, take the BNF and, instead of running it in Perl, compile it into C that uses the libmarpa library directly. It'd run several times as fast as the Perl. Here is what a pure C/libmarpa program looks like -- it's something I created for memory leak checking, and has only stub semantics.

Another possibility, as a first step, is a BNF to Marpa::R2::Thin interface compiler. Much of the benefit of the BNF-to-C, but easier to do. I use a very simple prototype of one to create the code that compiles the Stuifzand interface -- the code is in Marpa::R2's lib/Marpa/R2/meta directory.

Of course, the easiest way to approach may be to get it working in the main Marpa::R2 interface first. That's the way I did it.

Good advice, all three, it's nice to have spare performance. Re: Marpa::R2 first—good point, I'll follow suit.

Jeffrey Kegler

unread,

Nov 10, 2012, 11:42:13 AM11/10/12

to Marpa Parser Mailing LIst

I decided to go with my original idea here: to have parentheses both group and hide symbols from the semantics. This behavior makes sense within Marpa and in fact in my trials with it, seems elegant. It is unfortunate that it is the opposite of Perl's parentheses-mean-capture behavior, but even this has its good aspect -- semantic evaluation within Marpa is NOT the same as capture. It is good to use familiar conventions, but it is also good to remind users that they are not using an regular expression engine, and cannot expect the same behaviors.

By the way, one project someone might consider is a Marpa interface which *IS* a drop-in replacement for a subset of regex semantics.

Many thanks for the feedback

-- jeffrey

Ron Savage

unread,

Nov 10, 2012, 5:23:31 PM11/10/12

to marpa-...@googlegroups.com

Hi Jeffrey

On 11/11/12 03:42, Jeffrey Kegler wrote:
> By the way, one project someone might consider is a Marpa interface
> which *IS* a drop-in replacement for a subset of regex semantics.

So, firstly Marpa takes over Perl, then it takes over the world!

Jeffrey Kegler

unread,

Nov 10, 2012, 6:52:33 PM11/10/12

to marpa-...@googlegroups.com

Just to be clear, when it comes to regular expressions as strictly
defined (concatenation, alternation, the Kleene star and combinations of
these three), Perl regexes are significantly faster than Marpa and will
remain so. But regexes get used for a lot of things which go beyond
regular expressions and, once regexes gets outside their native turf,
Marpa very quickly turns the tables.

-- jeffrey

Ron Savage wrote:

Reply all

Reply to author

Forward