Trying to build a Reia grammar with Neotoma

30 views
Skip to first unread message

Tony

unread,
Nov 29, 2009, 4:05:44 PM11/29/09
to Neotoma
I created a branch of Reia where I'm trying to add a Neotoma-based parser:

http://github.com/tarcieri/reia/tree/neotoma

I created an absurdly minimal grammar based on the "arithmetic" example:

http://github.com/tarcieri/reia/blob/neotoma/src/compiler/reia_parse.peg

I hope to eventually convert over this Yecc grammar (and more).  This grammar is based on the capabilities of the new compiler:

http://github.com/tarcieri/reia/blob/neotoma/src/compiler/reia_parse.yrl

--

All that said, I now have the PEG grammar spitting out the proper nodes for integers:

1> reia_parse:parse("42").
{integer,1,42}

So the next step I'm trying is to build the appropriate parse tree for the "+" operator.  Right now this produces:

2> reia_parse:parse("2+2").
[{integer,1,2},"+",{integer,1,2}]

The leex-based parser produces: {binary_op,1,'+',{integer,1,2},{integer,1,2}}

What's the appropriate transform here?

--

Another example:

3> reia_parse:parse("2+4+6").
[{integer,1,2},"+",[{integer,1,4},"+",{integer,1,6}]]

leex-based parser produces: {binary_op,1,'+',{integer,1,2},{binary_op,1,'+',{integer,1,4},{integer,1,6}}}

--

Here's one that has me scratching my head:

4> reia_parse:parse("2+").
{{integer,1,2},"+",{{line,1},{column,2}}}

Why isn't this an error? (and what's up with Idx being in the node there?)

The yecc grammar regards this as a syntax error.

--
Tony Arcieri
Medioh/Nagravision

Sean Cribbs

unread,
Nov 29, 2009, 4:54:11 PM11/29/09
to neoto...@googlegroups.com
I'd love to help with the conversion when I have a moment, but for now I
suppose I could answer your questions.
> All that said, I now have the PEG grammar spitting out the proper
> nodes for integers:
>
> 1> reia_parse:parse("42").
> {integer,1,42}
>
> So the next step I'm trying is to build the appropriate parse tree for
> the "+" operator. Right now this produces:
>
> 2> reia_parse:parse("2+2").
> [{integer,1,2},"+",{integer,1,2}]
>
In the transformation function, you'll need to rearrange those terms (in
binary_op, maybe?). It's likely in yecc, you used a matchspec to do it.
Something like this should work:

{binary_op, extract_line(Idx), list_to_atom(lists:nth(2,Node)),
lists:nth(1,Node), lists:nth(3,Node)}

extract_line would look like so, of course (a good addition to the
standard library):

extract_line({{line,Line},_}) -> Line.
> Here's one that has me scratching my head:
>
> 4> reia_parse:parse("2+").
> {{integer,1,2},"+",{{line,1},{column,2}}}
>
> Why isn't this an error? (and what's up with Idx being in the node there?)
>
> The yecc grammar regards this as a syntax error.
>
Your grammar must be lacking an assertion of some sort that allows that
to be an incomplete parse. In general, PEG parsers don't consume the
whole input if a portion of is valid but the tail is not. Is "2+" a
valid expression? If not, you may need some disambiguation (i.e.
ordered choice or factoring), or an EOI assertion (right now "!.", but I
may introduce an EOI symbol) to register that as a syntax error.

Cheers,

Sean

Tony

unread,
Nov 29, 2009, 5:51:49 PM11/29/09
to neoto...@googlegroups.com
On Sun, Nov 29, 2009 at 2:54 PM, Sean Cribbs <seanc...@gmail.com> wrote:
I'd love to help with the conversion when I have a moment, but for now I suppose I could answer your questions.

Awesome... in the meantime I'll see how far I can get on my own.
 
In the transformation function, you'll need to rearrange those terms (in binary_op, maybe?). It's likely in yecc, you used a matchspec to do it.  Something like this should work:

{binary_op, extract_line(Idx), list_to_atom(lists:nth(2,Node)), lists:nth(1,Node), lists:nth(3,Node)}

extract_line would look like so, of course (a good addition to the standard library):

extract_line({{line,Line},_}) -> Line.

Both of these seem like good things to define as macros (that's how Erlang's own yecc grammar implements ?line).  Something like:

{binary_op, ?line, list_to_atom(?elem(2)), ?elem(1), ?elem(3)}

Seems like it'd be no problem, so long as you can stick some arbitrary Erlang code into the top of the output file somehow.  Or you could even define your own standard library of macros to do this stuff.
 
Your grammar must be lacking an assertion of some sort that allows that to be an incomplete parse.  In general, PEG parsers don't consume the whole input if a portion of is valid but the tail is not.  Is "2+" a valid expression?

I certainly don't intend for it to be.
 
 If not, you may need some disambiguation (i.e. ordered choice or factoring), or an EOI assertion (right now "!.", but I may introduce an EOI symbol) to register that as a syntax error.

Okay, let me see if I can figure out how to do this...

--
Tony Arcieri
Medioh/Nagravision

Tony Arcieri

unread,
Nov 30, 2009, 12:18:58 PM11/30/09
to neoto...@googlegroups.com
On Sun, Nov 29, 2009 at 3:51 PM, Tony <bas...@gmail.com> wrote:
{binary_op, extract_line(Idx), list_to_atom(lists:nth(2,Node)), lists:nth(1,Node), lists:nth(3,Node)}

Regarding this, it seems like other fully transformed pushdowns (not sure if PEGs call them that) are "bubbling up" through the same transform code.  I ended up having to implement the transform like this:

http://github.com/tarcieri/reia/blob/neotoma/src/compiler/reia_parse.peg
`
if
  is_list(Node) ->
    [Left, Op, Right] = Node,
    {binary_op, line(Idx), list_to_atom(Op), Left, Right};
  true -> Node
end
`;
I hope there's a better way to write transforms than this.

 If not, you may need some disambiguation (i.e. ordered choice or factoring), or an EOI assertion (right now "!.", but I may introduce an EOI symbol) to register that as a syntax error.

Okay, let me see if I can figure out how to do this...

I'm still confused on this... I found some examples of ! . but couldn't get them to work in my grammar.  And I'm still not clear on why:

add_expr <- integer add_op add_expr / integer
would match "2+" if "PEG parsers don't consume the whole input if a portion of is valid but the tail is not"

--
Tony Arcieri
Medioh/Nagravision

Sean Cribbs

unread,
Nov 30, 2009, 2:58:56 PM11/30/09
to neoto...@googlegroups.com
Tony Arcieri wrote:
> `
> if
> is_list(Node) ->
> [Left, Op, Right] = Node,
> {binary_op, line(Idx), list_to_atom(Op), Left, Right};
> true -> Node
> end
> `;
> I hope there's a better way to write transforms than this.
Why not use case and take advantage of pattern matching? Also, since
you're wanting + and - to be atoms, you could push them down into that rule.

`
case Node of
[Left, Op, Right] -> {binary_op, line(Idx), Op, Left, Right};
_ -> Node
end
`;

add_op <- "+" / "-" `list_to_atom(Node)`;
>
> add_expr <- integer add_op add_expr / integer
> would match "2+" if "PEG parsers don't consume the whole input if a
> portion of is valid but the tail is not"
>
That rule looks fine. However, just for disambiguation, you might try
putting parens around the first part of the choice. If that makes it
work, I'll dig through and see if I can find the bug in neotoma.

Sean

Sean Cribbs

unread,
Nov 30, 2009, 3:02:03 PM11/30/09
to neoto...@googlegroups.com
FWIW, I tried your most recent grammar and got what seems to be correct
results:

4> reia_parse:parse("2").
{integer,1,2}
5> reia_parse:parse("2+").
{{integer,1,2},"+",{{line,1},{column,2}}}
6> reia_parse:parse("2+2").
{binary_op,1,'+',{integer,1,2},{integer,1,2}}

Sean

Tony Arcieri

unread,
Nov 30, 2009, 4:32:07 PM11/30/09
to neoto...@googlegroups.com
On Mon, Nov 30, 2009 at 1:02 PM, Sean Cribbs <seanc...@gmail.com> wrote:
FWIW, I tried your most recent grammar and got what seems to be correct results:
5> reia_parse:parse("2+").
{{integer,1,2},"+",{{line,1},{column,2}}}

This is the one that has me confused... what's with Idx in the 3rd slot in the list?

--
Tony Arcieri
Medioh/Nagravision

Sean Cribbs

unread,
Nov 30, 2009, 4:52:41 PM11/30/09
to neoto...@googlegroups.com
That's an incomplete parse, of course. Internally, neotoma represents intermediate results as {Result, RemainingInput, NewIndex}, which is what it returned to you.  So, what it parsed was "2", not "2+".  It seems to me that you would need some kind of lookahead somewhere, but you might still come out with an incomplete parse.  Another option is to introduce a semantic check in a rule at a higher level and throwing your own error.

Sean

Tony Arcieri

unread,
Nov 30, 2009, 4:55:51 PM11/30/09
to neoto...@googlegroups.com
On Mon, Nov 30, 2009 at 12:58 PM, Sean Cribbs <seanc...@gmail.com> wrote:
Why not use case and take advantage of pattern matching?

Yeah case is definitely the way to go there.
 
Also, since you're wanting + and - to be atoms, you could push them down into that rule.

Cool.
 
That rule looks fine.  However, just for disambiguation, you might try putting parens around the first part of the choice.

Like add_expr <- (integer add_op add_expr) / integer ?
 
--
Tony Arcieri
Medioh/Nagravision

Tony Arcieri

unread,
Nov 30, 2009, 4:57:31 PM11/30/09
to neoto...@googlegroups.com
On Mon, Nov 30, 2009 at 2:52 PM, Sean Cribbs <seanc...@gmail.com> wrote:
That's an incomplete parse, of course. Internally, neotoma represents intermediate results as {Result, RemainingInput, NewIndex}, which is what it returned to you.  So, what it parsed was "2", not "2+".  It seems to me that you would need some kind of lookahead somewhere, but you might still come out with an incomplete parse.  Another option is to introduce a semantic check in a rule at a higher level and throwing your own error.

Okay, so I can disambiguate that because it's coming out as a tuple as opposed to a list?

Is there any way I can force incomplete parses to be an error short of throwing my own exception?

--
Tony Arcieri
Medioh/Nagravision

Sean Cribbs

unread,
Nov 30, 2009, 5:22:56 PM11/30/09
to neoto...@googlegroups.com
Actually, I just thought of something.  In some PEGs I've seen that have repetition or right-recursion, it's often reduced down to a head+tail case.  So instead, you would have something like this:

add_expr <- head:integer tail:(add_op integer)* !.
`
lists:foldl(fun([Op, Int], Acc) -> {binary_op, line(Idx), Op, Acc, Int} end,
            proplists:get_value(head, Node),
            proplists:get_value(tail, Node))
`;

Then the add_op is mandatory if there at all (because the tail is followed by the EOF assertion).  Here's a quick test of that:

10> reia_parse:parse("1+2+").     
{fail,{expected,{no_match,43},{{line,1},{column,4}}}}

Sean

Tony Arcieri

unread,
Nov 30, 2009, 5:25:21 PM11/30/09
to neoto...@googlegroups.com
On Mon, Nov 30, 2009 at 3:22 PM, Sean Cribbs <seanc...@gmail.com> wrote:
Actually, I just thought of something.  In some PEGs I've seen that have repetition or right-recursion

As an aside, should I be using a right-recursive grammar with Neotoma?  I do that with leex because it's more efficient to construct lists in Erlang right recursively.
 
--
Tony Arcieri
Medioh/Nagravision

Sean Cribbs

unread,
Nov 30, 2009, 5:55:49 PM11/30/09
to neoto...@googlegroups.com
I know that left-recursion is pretty much impossible, but in cases where you're using it for repetition, it's better to use * and +.  Right-recursion is ok.

Sean

Tony Arcieri

unread,
Dec 1, 2009, 1:29:13 AM12/1/09
to neoto...@googlegroups.com
Okay, I think this is my bad on something all along:

If I get a tuple out, that basically just means the parser wants more tokens, right?

If so, that's quite easy for me to wire up, and I may even have a Reia shell with a Neotoma parser in a bit here...
--
Tony Arcieri
Medioh/Nagravision

Tony Arcieri

unread,
Dec 1, 2009, 2:05:15 AM12/1/09
to neoto...@googlegroups.com
On Mon, Nov 30, 2009 at 11:29 PM, Tony Arcieri <to...@medioh.com> wrote:
If so, that's quite easy for me to wire up, and I may even have a Reia shell with a Neotoma parser in a bit here...

All right, well you can now play with the Neotoma-based parser using Reia's interactive shell.

My toplevel grammar seems a bit hacked though:

expressions <- head:expr tail:(separator expr?)*
`[{head, Head}, {tail, Tail}] = Node, [Head|[Expr || [_Separator,[_|_] = Expr] <- Tail]]`;
Trying to allow an arbitrary number of separators between expressions, and for the last part of the input to be an arbitrary length sequence of separator tokens.

Already confirmed this gibs if you have an integer followed by an arbitrary number of separators and another integer e.g. "2;;;;;;;;;4"

--
Tony Arcieri
Medioh/Nagravision

Sean Cribbs

unread,
Dec 1, 2009, 8:12:37 AM12/1/09
to neoto...@googlegroups.com

> Trying to allow an arbitrary number of separators between expressions,
> and for the last part of the input to be an arbitrary length sequence
> of separator tokens.
>
> Already confirmed this gibs if you have an integer followed by an
> arbitrary number of separators and another integer e.g. "2;;;;;;;;;4"
If you have an arbitrary number of separators in between, either put
separator+ in the tail of your rule, or ("\n" / ";")+ down in separator
(depending on where you reuse the separator NT).

Sean

Tony Arcieri

unread,
Dec 2, 2009, 12:03:55 AM12/2/09
to neoto...@googlegroups.com
Well, if you're talking about a pure kleene star representation of what I'm trying to do, it looks like:

separator* expr (separator+ expr)* separator*;

That is to say:

- Valid input must consit of at least one expression
- The input may begin with an arbitrary number of separator tokens (including none of them)
- Input can consist of multiple expressions, but they must be separated by at least one separator token
- The end of the input can contain an arbitrary number of separator tokens (including none)

imagine:

;;;;;;;42;;;;;;;;;42;;;;;42

or

42;42;42

or

42;;;;;;;;;;;;;;;;;;;;;;;

(where any of those ";" could also be "\n"s)

I've tried that specific representation, and also:

expressions <- head:separator* expr tail:(separator+ expr)* separator*;

Neither of them work:

neotoma error: nonterminal 'add_op' has no reduction. (found at {{line,7},{column,21}})

Line 7 looks like:


add_expr <- integer add_op add_expr / integer

How would you go about specifying this given those rules?
--
Tony Arcieri
Medioh/Nagravision

Tony Arcieri

unread,
Dec 2, 2009, 12:04:59 AM12/2/09
to neoto...@googlegroups.com
On Tue, Dec 1, 2009 at 10:03 PM, Tony Arcieri <to...@medioh.com> wrote:
Well, if you're talking about a pure kleene star representation of what I'm trying to do, it looks like:

By the way, despite my problems, it's awesome to be able to use kleene stars in grammars.  It's very frustrating with a yacc-like tool to try to decompose kleene stars into various pushdowns manually.

--
Tony Arcieri
Medioh/Nagravision

Sean Cribbs

unread,
Dec 2, 2009, 8:18:48 AM12/2/09
to neoto...@googlegroups.com

>
> expressions <- head:separator* expr tail:(separator+ expr)* separator*;
>
Minor point, but you'll want the "head" label on your first expr, not
the separator.
> Neither of them work:
>
> neotoma error: nonterminal 'add_op' has no reduction. (found at
> {{line,7},{column,21}})
>
> Line 7 looks like:
>
> add_expr <- integer add_op add_expr / integer
>
Actually that's an error saying it can't find the rule for 'add_op'. Do
you have a syntax error in your grammar?

> How would you go about specifying this given those rules?
What you have looks fine to me. Also, be aware that * and + in PEG are
extremely greedy and do not give back like they would in a regular
expression, unless you manually add some complicated lookahead.

Sean

Tony Arcieri

unread,
Dec 2, 2009, 4:09:15 PM12/2/09
to neoto...@googlegroups.com
On Wed, Dec 2, 2009 at 6:18 AM, Sean Cribbs <seanc...@gmail.com> wrote:
Actually that's an error saying it can't find the rule for 'add_op'.  Do you have a syntax error in your grammar?

Okay, I'm trying this now:

expressions <- separator* expr? (separator+ expr)* separator*
`
[_, Expr, Exprs, _] = Node,
lists:flatten([Expr]) ++ [Ex || [_Sep, Ex] <- Exprs]
`;

Slight change from what I had before: you can have 0 or more expressions.  This way empty files are still valid.

This... seems to be working.

--
Tony Arcieri
Medioh/Nagravision
Reply all
Reply to author
Forward
0 new messages