Parser trying to extend with a SyntaxNode subclass weirdness.


James Harton

Feb 28, 2012, 8:46:03 PM
to treet...@googlegroups.com
Hi.

I'm throwing together a toy language at https://github.com/jamesotron/CrimsonScript - so far it has only a simple description of nil and integer literals:

    rule literal
       integer_literal / nil_literal <Literal>
    end

    rule integer_literal
      hex_integer_literal / octal_integer_literal / binary_integer_literal / decimal_integer_literal / zero_integer_literal <IntegerLiteral>
    end

    rule nil_literal
      "nil" <NilLiteral>
    end

    rule zero_integer_literal
      '-'? '0'
    end

    rule decimal_integer_literal
      '-'? [1-9] [0-9]*
    end

    rule binary_integer_literal
      '-'? '0b' [0-1]+
    end

    rule octal_integer_literal
      '-'? '0o' [0-7]+
    end

    rule hex_integer_literal
      '-'? '0x' [0-9a-fA-F]+
    end


However, when I try to parse `0` I get:


    1.9.3-p0 :001 > Crimson::Parser.parse('0')
    TypeError: wrong argument type Class (expected Module)
    from (eval):72:in `extend'
    from (eval):72:in `_nt_integer_literal'
    from (eval):24:in `_nt_literal'
    from /Users/jnh/.rvm/gems/ruby-1.9.3-p0@crimsonscript/gems/treetop-1.4.10/lib/treetop/runtime/compiled_parser.rb:18:in `parse'
    from /Users/jnh/Dev/Toys/CrimsonScript/lib/crimson/parser.rb:11:in `parse'
    from (irb):1
    from /Users/jnh/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in `<main>'


I'm not sure what I'm doing wrong, but when I compile the grammar to Ruby I can see that it's calling #extend on the result of _nt_zero_integer_literal. I'm not sure why that's happening. Is this a bug?


-- 
James Harton
@jamesotron
+64226803869

markus

Feb 28, 2012, 10:02:21 PM
to treet...@googlegroups.com

>
> 1.9.3-p0 :001 > Crimson::Parser.parse('0')
> TypeError: wrong argument type Class (expected Module)
> from (eval):72:in `extend'
> from (eval):72:in `_nt_integer_literal'
> from (eval):24:in `_nt_literal'
>
> from /Users/jnh/.rvm/gems/ruby-1.9.3-p0@crimsonscript/gems/treetop-1.4.10/lib/treetop/runtime/compiled_parser.rb:18:in `parse'
> from /Users/jnh/Dev/Toys/CrimsonScript/lib/crimson/parser.rb:11:in
> `parse'
> from (irb):1
> from /Users/jnh/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in `<main>'
>
>
>
>
> I'm not sure what I'm doing wrong, but when I compile the grammar to
> ruby I can see that it's calling #extend on the result of
> _nt_zero_integer_literal. I'm not sure why that's happening. Is
> this a bug?

Simple answer: IntegerLiteral, NilLiteral, etc. need to be modules, and
you've most likely defined them as classes.

More complex answers are available if needed. :)
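
The class/module distinction can be reproduced in plain Ruby without Treetop. The following is only a sketch: `FakeNode` and `IntegerLiteralClass` are made-up stand-ins (the real node class is Treetop::Runtime::SyntaxNode), and the `value` helper is an assumption added for illustration.

```ruby
# Stand-in for a parse node; NOT the real Treetop::Runtime::SyntaxNode.
class FakeNode
  attr_reader :text_value

  def initialize(text_value)
    @text_value = text_value
  end
end

# Declared as a class: Object#extend refuses it.
class IntegerLiteralClass < FakeNode; end

# Declared as a module: it can be mixed into an already-built node.
module IntegerLiteral
  # Hypothetical helper, not part of the original grammar.
  def value
    Integer(text_value)
  end
end

node = FakeNode.new('0')

error_message = nil
begin
  node.extend(IntegerLiteralClass)   # same kind of call the generated parser makes
rescue TypeError => e
  error_message = e.message          # the TypeError seen in the backtrace above
end

node.extend(IntegerLiteral)          # works, because IntegerLiteral is a module
node_value = node.value
```

Because the generated parser extends a node that already exists, whatever is named in `<...>` on an expression that resolves to another rule has to be something `extend` accepts, that is, a module.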

-- MarkusQ


Douglas Camata

Feb 28, 2012, 11:37:19 PM
to treet...@googlegroups.com
One thing I've learned in my limited experience with Treetop (maybe I'm doing it the wrong way, but it worked for me): when you define a rule that's just an aggregation of other rules, like some kind of inheritance, you must use parentheses to define the class for it. For example, let's fix your integer_literal rule:

    rule integer_literal
      (hex_integer_literal / octal_integer_literal / binary_integer_literal / decimal_integer_literal / zero_integer_literal) <IntegerLiteral>
    end

Try it and post some feedback. As a Treetop apprentice, I'd like to share experiences.

Clifford Heath

Feb 29, 2012, 12:47:45 AM
to treet...@googlegroups.com
Douglas is quite correct. The class specification for a rule binds tighter than the alternation "/" operator:

rule foo
  a <A_In_Foo> / b <B_In_Foo>
end

rule a
  something <A>
end

rule b
  something_else <B>
end

In the case where "foo" matches "something_else", the SyntaxNode for "something_else"
will have two mixed-in modules (both "B" and "B_In_Foo"). Use parentheses around a list
of alternates if you want to mix in the same module regardless of which possibility is matched.
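
The effect on a single node can be sketched in plain Ruby. This is an illustration only: the `Object.new` below merely stands in for the SyntaxNode that matched "something_else", and the two `extend` calls imitate what the rules above would do.

```ruby
module A; end          # mixed in by rule a
module A_In_Foo; end   # mixed in by the first alternative of rule foo
module B; end          # mixed in by rule b
module B_In_Foo; end   # mixed in by the second alternative of rule foo

# Stand-in for the node produced when foo matches "something_else".
node = Object.new
node.extend(B)          # rule b decorates the node first
node.extend(B_In_Foo)   # then the alternative inside foo decorates the same node

has_both_b_modules = node.is_a?(B) && node.is_a?(B_In_Foo)
has_any_a_module   = node.is_a?(A) || node.is_a?(A_In_Foo)
```

Here `has_both_b_modules` is true and `has_any_a_module` is false: the module attached to one alternative never reaches nodes matched by another, which is why a module meant for every alternative needs the parenthesised form.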

Clifford Heath.

> --
> You received this message because you are subscribed to the Google Groups "Treetop Development" group.
> To view this discussion on the web visit https://groups.google.com/d/msg/treetop-dev/-/i618WdSTO9sJ.
> To post to this group, send email to treet...@googlegroups.com.
> To unsubscribe from this group, send email to treetop-dev...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/treetop-dev?hl=en.

markus

Feb 29, 2012, 1:03:20 AM
to treet...@googlegroups.com
> when you define a rule that's just an aggregation of other two, like
> some kind of inheritance, you must use a parenthesis to define the
> class for it.

Good catch; this would likely be the next snag he'd hit, once the
class/module problem is cleared up (the "0" case doesn't hit it because
it's the last alternative, but anything else would).

-- MarkusQ


James Harton

Feb 29, 2012, 3:30:37 AM
to treet...@googlegroups.com
Thanks folks for the feedback and explanation. I've switched over to modules, so at least it's stopped complaining about that.
I've also wrapped the rules for integer_literal and literal in parens.
I'm still a little confused about the fact that the parser still refuses to recognise any of my patterns:

1.9.3p0 :001 > Crimson::Parser.parse('12345')
Exception: Parse error: Expected one of -, 0x, 0o, 0b at line 1, column 1 (byte 1) after .
from /Users/jnh/Dev/Toys/CrimsonScript/lib/crimson/parser.rb:13:in `parse'
from (irb):1
from /Users/jnh/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in `<main>'
1.9.3p0 :002 > Crimson::Parser.parse('nil')
Exception: Parse error: Expected one of -, 0x, 0o, 0b, 0 at line 1, column 1 (byte 1) after .
from /Users/jnh/Dev/Toys/CrimsonScript/lib/crimson/parser.rb:13:in `parse'
from (irb):2
from /Users/jnh/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in `<main>'
-- 
James Harton
@jamesotron
+64226803869


Mark Wilden

Feb 29, 2012, 8:55:15 AM
to treet...@googlegroups.com
On Wed, Feb 29, 2012 at 12:30 AM, James Harton <james...@gmail.com> wrote:

> I'm still a little confused about the fact that the parser still refuses to
> recognise any of my patterns:

This is just general advice: Take a step back. Can you get some
example code to work? If so, you might want to try morphing the
example into the code you want, step by step. The other approach is to
start with the barest possible grammar, get that to work, and
gradually add on to it.

What is unproductive, in my experience, is to write a bunch of code,
throw it at the system, and debug the error messages you get. Try to
start from a happy place, and change that to where you want to get to.

This isn't quite as off-topic as it might seem, as I've found that
parsing (with Citrus rather than Treetop) really requires a go-slow
approach.

///ark
Web Applications Developer
California Academy of Sciences

markus

Feb 29, 2012, 10:37:32 AM
to treet...@googlegroups.com
On Wed, 2012-02-29 at 21:30 +1300, James Harton wrote:
> Thanks folks for the feedback and explanation. I've switched over to
> modules so at least it's stopped complaining about that.
> I've also wrapped the rules for integer_literal and literal in
> parens.
> I'm still a little confused about the fact that the parser still
> refuses to recognise any of my patterns:
>
>
> 1.9.3p0 :001 > Crimson::Parser.parse('12345')
> Exception: Parse error: Expected one of -, 0x, 0o, 0b at line 1,
> column 1 (byte 1) after .
> from /Users/jnh/Dev/Toys/CrimsonScript/lib/crimson/parser.rb:13:in
> `parse'
> from (irb):1
> from /Users/jnh/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in `<main>'
> 1.9.3p0 :002 > Crimson::Parser.parse('nil')
> Exception: Parse error: Expected one of -, 0x, 0o, 0b, 0 at line 1,
> column 1 (byte 1) after .
> from /Users/jnh/Dev/Toys/CrimsonScript/lib/crimson/parser.rb:13:in
> `parse'
> from (irb):2
> from /Users/jnh/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in `<main>'

When I make those modifications to a grammar consisting of the rules you
originally posted, it appears to work:

irb(main):009:0> SimpleParser.new.parse('12345')
=> SyntaxNode+Literal+IntegerLiteral+DecimalIntegerLiteral0 offset=0, "12345":
  SyntaxNode offset=0, ""
  SyntaxNode offset=0, "1"
  SyntaxNode offset=1, "2345":
    SyntaxNode offset=1, "2"
    SyntaxNode offset=2, "3"
    SyntaxNode offset=3, "4"
    SyntaxNode offset=4, "5"

Did you perhaps change anything else since then?

-- M

Bernhard

Jun 27, 2012, 4:02:26 PM
to treet...@googlegroups.com
One motivation to use subclasses is the approach in http://thingsaaronmade.com/blog/a-quick-intro-to-writing-a-parser-using-treetop.html.

It allows you to clean the parse tree so that it keeps only the semantically relevant information and drops the syntactic sugar.

Clifford Heath

Jun 27, 2012, 7:00:01 PM
to treet...@googlegroups.com
On 28/06/2012, at 6:02 AM, Bernhard wrote:
> One motivation to use subclasses is …
> to clean the parse tree such that it only has the semantically relevant information and no longer keep the syntactical sugar.

This is a really bad idea in general. PEG parsers produce
lots of garbage - all the hidden memoized attempts at
almost every character. If you keep the parse tree, whether
it's semantically-structured or not, you keep the garbage.

Sometimes a parse tree is like your semantic model, as in
the s-expressions example, but usually you will write a better
program if you design a semantic model first, then construct
that *after* parsing your input.

Clifford Heath.

Bernhard

Jun 28, 2012, 4:29:27 AM
to treet...@googlegroups.com


On Thursday, June 28, 2012 1:00:01 AM UTC+2, Clifford Heath wrote:
On 28/06/2012, at 6:02 AM, Bernhard wrote:
> One motivation to use subclasses is …
> to clean the parse tree such that it only has the semantically relevant information and no longer keep the syntactical sugar.

This is a really bad idea in general. PEG parsers produce
lots of garbage - all the hidden memoized attempts at
almost every character. If you keep the parse tree, whether
it's semantically-structured or not, you keep the garbage.

I fully agree with this. This is the reason why I want to clean the parse tree as it was proposed by Aaron in http://thingsaaronmade.com/blog/a-quick-intro-to-writing-a-parser-using-treetop.html.
 

Sometimes a parse tree is like your semantic model, as in
the s-expressions example, but usually you will write a better
program if you design a semantic model first, then construct
that *after* parsing your input.

Yes, this is the usual approach. I am still struggling to find an appropriate generic strategy for transforming the parse tree into my semantic model. I would like to construct my semantic model and populate it by retrieving the relevant information from the parse tree (sometimes called target-driven transformation). But this is very difficult.

But querying the parse tree is also not recommended, so I don't really know how to proceed.

One approach I used in another system was to annotate the grammar to distinguish which parts
  • should be represented in the result tree as DOM nodes (XML elements); in this case, the rule name is the name of the XML element
  • should be ignored
  • should be XML attributes; in this case, the rule name is the name of the XML attribute.

My task is to query tons of LaTeX files in order to generate particular extracts in XML.

Any suggestions are highly welcome.

Bernhard

Clifford Heath

Jun 28, 2012, 4:49:06 AM
to treet...@googlegroups.com
On 28/06/2012, at 6:29 PM, Bernhard wrote:
> This is the reason why I want to clean the parse tree as it was proposed by Aaron in http://thingsaaronmade.com/blog/a-quick-intro-to-writing-a-parser-using-treetop.html.

It's pretty ugly digging around inside bits of the Treetop structures
assuming you know how to clean it. In fact, I think it's unlikely that
the example code frees very much garbage anyway.

> I am still struggling with finding an appropriate generic strategy to transform the parse tree to my semantic model.

The normal preferred way is to ask the parse tree to descend itself
and build whatever answer you require. Whether you use custom
classes or just methods depends on how hard that is.
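
A minimal sketch of that idea, written without Treetop so it runs on its own. Everything here is an assumption for illustration: `StubNode` stands in for a real syntax node, `to_model` is a made-up method name, and `LiteralList` is a hypothetical rule that is not in any grammar posted in this thread.

```ruby
# Stand-in for a parse node (assumption: only text_value and elements matter here).
class StubNode
  attr_reader :text_value, :elements

  def initialize(text_value, elements = nil)
    @text_value = text_value
    @elements = elements
  end
end

module IntegerLiteral
  def to_model
    Integer(text_value)   # Kernel#Integer understands 0x, 0o and 0b prefixes
  end
end

module NilLiteral
  def to_model
    nil
  end
end

# Hypothetical container rule: asks each child to build its own part.
module LiteralList
  def to_model
    elements.map { |e| e.to_model }
  end
end

tree = StubNode.new('0x1f nil -12', [
  StubNode.new('0x1f').extend(IntegerLiteral),
  StubNode.new('nil').extend(NilLiteral),
  StubNode.new('-12').extend(IntegerLiteral)
]).extend(LiteralList)

model = tree.to_model   # plain Ruby values; the parse tree can now be dropped
```

The caller only ever asks the root for `to_model`; each node decides what its own contribution is, so nothing outside the tree has to know how the tree is shaped.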

> I would like to construct my semantic model and populate it by retrieving the relevant information from the parse tree (sometimes called the target driven transformation). But this is very difficult.

Right, what I said. Why is it so difficult in your case?

> My task is to query tons of LaTeX - Files in order to generate particular extracts in XML.

I think the only engine that can properly parse TeX is TeX - since
you can define new grammar rules that modify the TeX parser.
But if you assume that isn't being done… well, it's still pretty hard.
LaTeX is a real challenge. I admire you for attempting it.

Clifford Heath.

Bernhard

Jul 27, 2012, 4:59:55 PM
to treet...@googlegroups.com


On Thursday, June 28, 2012 10:49:06 AM UTC+2, Clifford Heath wrote:
On 28/06/2012, at 6:29 PM, Bernhard wrote:
> This is the reason why I want to clean the parse tree as it was proposed by Aaron in http://thingsaaronmade.com/blog/a-quick-intro-to-writing-a-parser-using-treetop.html.

It's pretty ugly digging around inside bits of the Treetop structures
assuming you know how to clean it. In fact, I think it's unlikely that
the example code frees very much garbage anyway.

> I am still struggling with finding an appropriate generic strategy to transform the parse tree to my semantic model.

The normal preferred way is to ask the parse tree to descend itself
and build whatever answer you require. Whether you use custom
classes or just methods depends on how hard that is.

> I would like to construct my semantic model and populate it by retrieving the relevant information from the parse tree (sometimes called the target driven transformation). But this is very difficult.

Right, what I said. Why is it so difficult in your case?


The difficulty in my case is that Treetop adds nodes where I did not expect them
and does not provide nodes where I do expect them.

I switched to a simpler task: grepping some patterns out of a Markdown file.

My main objective is to find a generic approach to solving this kind of issue.

With this grammar

grammar TraceInMarkdown

   rule top
      document
   end

   rule document
      ( (noMarkupText / trace / markupAbort)* )
   end

   rule noMarkupText
      [^\[]+
   end

   rule markupAbort
      "["
   end

   rule trace
      traceId s? traceHead s? traceBody traceUpLink
   end

   rule traceId
      "[" label "]"
   end

   rule label
      [a-zA-Z]+ "_" [a-zA-Z]+ "_" [0-9]+
   end

   rule traceHead
      '**' (!'*' . / '\*')+ '**'
   end

   rule traceBody
      "{" (nestedBody / [^{}])+ "}"
   end

   rule nestedBody
      "{" (nestedBody / [^{}])+ "}"
   end

   rule traceUpLink
      "(" (","? s? label)* ")"
   end

   rule s
      [ \t]+
   end
end

applied to 
text="text 1 [foo_bar_1234] ** lorem ipsum header ** {lorem {ipsum} [nothing] }(foo_bar_1234, bar_foo_1234) text2 [ text3] "

should deliver an XML file such as:

<top>
 <document>
     <noMarkupText>text 1</noMarkupText>
     <trace>
        <traceId>foo_bar_1234</traceId>
        <traceHead>** lorem ipsum header **</traceHead>
        <traceBody>{lorem <nestedBody>ipsum</nestedBody> [nothing] }</traceBody>

etc.

But, for example, the parse tree does not deliver a node for document.

SyntaxNode offset=0, "...234) text2 [ text3] ":
 SyntaxNode offset=0, "text 1 ":
   SyntaxNode offset=0, "t"
   SyntaxNode offset=1, "e"
   SyntaxNode offset=2, "x"
   SyntaxNode offset=3, "t"
   SyntaxNode offset=4, " "
   SyntaxNode offset=5, "1"
   SyntaxNode offset=6, " "
 SyntaxNode+Trace0 offset=7, "..._1234, bar_foo_1234)" (traceUpLink,traceId,traceHead,traceBody):
   SyntaxNode+TraceId0 offset=7, "[foo_bar_1234]" (label):
     SyntaxNode offset=7, "["
     SyntaxNode+Label0 offset=8, "foo_bar_1234":
       SyntaxNode offset=8, "foo":
         SyntaxNode offset=8, "f"
         SyntaxNode offset=9, "o"
         SyntaxNode offset=10, "o"
       SyntaxNode offset=11, "_"
       SyntaxNode offset=12, "bar":
         SyntaxNode offset=12, "b"
         SyntaxNode offset=13, "a"
         SyntaxNode offset=14, "r"
       SyntaxNode offset=15, "_"
       SyntaxNode offset=16, "1234":
         SyntaxNode offset=16, "1"
         SyntaxNode offset=17, "2"
         SyntaxNode offset=18, "3"
         SyntaxNode offset=19, "4"
     SyntaxNode offset=20, "]"
   SyntaxNode offset=21, " ":
     SyntaxNode offset=21, " "
   SyntaxNode+TraceHead1 offset=22, "...orem ipsum header **":

As you see, the root of the syntax tree does not provide "document".
At the same time, there is no "trace" ...

In order to keep the business logic out of the grammar,
I tried to extend SyntaxNode with helper methods that should
provide what I want: methods such as "child", which should deliver
the child nodes as specified in the production rule.

But even when I am at a given node, I can hardly figure out
the corresponding rule.

I also tried an approach in which each rule had a method
delivering the related information. But there were
additional nodes which did not correspond to a rule
and therefore raised NoMethodError exceptions.

I could somehow hack it to make it work, but it is pure trial and error
without "knowing what I am doing".


 
> My task is to query tons of LaTeX - Files in order to generate particular extracts in XML.

I think the only engine that can properly parse TeX is TeX - since
you can define new grammar rules that modify the TeX parser.
But if you assume that isn't being done… well, it's still pretty hard.
LaTeX is a real challenge. I admire you for attempting it.


Yes, difficult enough. But I only want to grep some stuff out of LaTeX files.

I therefore do not need a full-blown LaTeX parser.


My problem as of now is that I am struggling with Treetop _and_ with LaTeX,

so I moved to the simpler task with Markdown.

 
Bernhard

markus

Jul 27, 2012, 6:14:13 PM
to treet...@googlegroups.com
B --

> >> I would like to construct my semantic model and populate it by
> retrieving the relevant information from the parse tree (sometimes
> called the target driven transformation). But this is very difficult.
> >
> > Right, what I said. Why is it so difficult in your case?
>
> The difficulty in my case is the fact that treetop adds nodes where I
> did not expect them and does not provide nodes where I expect them.

I'm not sure how helpful this will be, but I'll offer it in the hopes
that it at least provides some insight.

It feels as if you are trying to solve the wrong problem, and in the
process making things harder on yourself than need be. Specifically, it
sounds as if you are trying to have treetop produce the results you want
(and thus you care about where it does/does not produce nodes) rather
than having it produce the AST and then having the AST produce the
results you want.

This is a subtle distinction, but it has powerful consequences, because
it breaks the coupling between the structure of the grammar and the
structure of the results.

I'll try to walk you to the place where I think you ought to be starting
by going through a series of not-right-but-at-least-less-wrong stages.
First, with your grammar in g.tt I write a test rig like so:


require "treetop"

# The input must stay on one line; the mail client had wrapped the string literal.
text="text 1 [foo_bar_1234] ** lorem ipsum header ** {lorem {ipsum} [nothing] }(foo_bar_1234, bar_foo_1234) text2 [ text3] "

p Treetop.load('g.tt').new.parse(text)



This produces the AST, as you showed in your e-mail. But now rather
than having the AST as the output, suppose I define a property "as_xml"
on each node, with a (clearly incorrect) default implementation, and
print that as my result, like so:


require "treetop"

text="text 1 [foo_bar_1234] ** lorem ipsum header ** {lorem {ipsum} [nothing] }(foo_bar_1234, bar_foo_1234) text2 [ text3] "

class Treetop::Runtime::SyntaxNode
  # Default: rebuild the source text by walking the children.
  def as_xml
    if elements
      elements.map { |e| e.as_xml }.join
    else
      text_value
    end
  end
end

p Treetop.load('g.tt').new.parse(text).as_xml


Now the output is just the original input string, but it is being
produced by parsing the input, producing an AST, which we then walk,
reproducing the source.

We can recover some of the structure by adding the option of tagging the
xml like so (this isn't intended as an example of great code, just a
conceptual exercise to lead you to the easier way of thinking of
things):

class Treetop::Runtime::SyntaxNode
  def as_xml
    if elements
      elements.map { |e| e.as_xml }.join
    else
      text_value
    end
  end
  def wrap(tag,body)
    "<#{tag}>#{body}</#{tag}>"
  end
end

...and then mark up the grammar to use it:

   rule top
     document { def as_xml; wrap('top',super); end }
   end

   rule document
      (noMarkupText / trace / markupAbort)* { def as_xml; wrap('document',super); end }
   end

   rule noMarkupText
      [^\[]+ { def as_xml; wrap('noMarkupText',super); end }
   end

...which would give us:

"<top><document><noMarkupText>text 1 </noMarkupText>[foo_bar_1234] ** lorem ipsum header ** {lorem {ipsum} [nothing] }(foo_bar_1234, bar_foo_1234)<noMarkupText> text2 </noMarkupText>[<noMarkupText> text3] </noMarkupText></document></top>"

...which is starting to look at least somewhat like your goal:

> <top>
> <document>
> <noMarkupText>text 1</noMarkupText>
> <trace>
> <traceId>foo_bar_1234</traceId>
> <traceHead>** lorem ipsum header **</traceHead>
> <traceBody>{lorem <nestedBody>ipsum</nestedBody> [nothing] }</traceBody>
>
> etc.

You could get the rest of the way (up to indentation) by decorating the
rest of the grammar, and could even do indentation by fussing with the
contents.

But there's nothing saying these methods have to produce strings; you
could just as well produce a tree of objects that has the structure you
want, and have them emit themselves as xml on demand, or have them
update a dictionary, etc. Or, going the other way (as in the obligatory
calculator example) you could have the methods produce a single number.

The key is to decouple the AST and the result by adding methods to the
syntax tree, so that the structure of one isn't bound to the structure
of the other.
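
The pattern can be tried without Treetop installed. The sketch below is only an approximation under stated assumptions: `StubNode` is a hand-built stand-in for Treetop::Runtime::SyntaxNode, and the modules play the role of the inline `{ def as_xml ... }` blocks in the grammar.

```ruby
# Stand-in for Treetop::Runtime::SyntaxNode with the two methods added above.
class StubNode
  attr_reader :text_value, :elements

  def initialize(text_value, elements = nil)
    @text_value = text_value
    @elements = elements
  end

  # Default: reproduce the source by walking the children.
  def as_xml
    if elements
      elements.map { |e| e.as_xml }.join
    else
      text_value
    end
  end

  def wrap(tag, body)
    "<#{tag}>#{body}</#{tag}>"
  end
end

# These modules mimic what the inline grammar blocks would generate.
module NoMarkupText
  def as_xml
    wrap('noMarkupText', super)
  end
end

module Document
  def as_xml
    wrap('document', super)
  end
end

text_node  = StubNode.new('text 1 ').extend(NoMarkupText)
abort_node = StubNode.new('[')   # undecorated node: falls back to the default
document   = StubNode.new('text 1 [', [text_node, abort_node]).extend(Document)

xml = document.as_xml
# => "<document><noMarkupText>text 1 </noMarkupText>[</document>"
```

Undecorated nodes simply pass their text through, so only the rules that matter for the result need a block.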

I hope that helps.

-- MarkusQ


Bernhard

Jul 27, 2012, 7:02:44 PM
to treet...@googlegroups.com
Markus,

thanks a lot for your help. 

On Saturday, July 28, 2012 12:14:13 AM UTC+2, Markus wrote:
B --

I'm not sure how helpful this will be, but I'll offer it in the hopes
that it at least provides some insight.


It _is_ helpful. Thanks a lot.
 
It feels as if you are trying to solve the wrong problem, and in the
process making things harder on yourself than need be.  Specifically, it
sounds as if you are trying to have treetop produce the results you want
(and thus you care about where it does/does not produce nodes) rather
than having it produce the AST and then having the AST produce the
results you want.

This impression might arise, but it is not my intention to have Treetop
produce the intended results. My intention is indeed to get a clean AST as
easily as possible.


This is a subtle distinction, but it has powerful consequences, because
it breaks the coupling between the structure of the grammar and the
structure of the results.

Yes, fully agree with this.
 

I'll try to walk you to the place where I think you ought to be starting
by going through a series of not-right-but-at-least-less-wrong stages.
First, with your grammar in g.tt I write a test rig like so:

thanks. 

require "treetop"

text="text 1 [foo_bar_1234] ** lorem ipsum header ** {lorem {ipsum} [nothing] }(foo_bar_1234, bar_foo_1234) text2 [ text3] "

p Treetop.load('g.tt').new.parse(text)

yes this is the same as my test rig.
 

This produces the AST, as you showed in your e-mail.  But now rather
than having the AST as the output, suppose I define a property "as_xml"
on each node, with a (clearly incorrect) default implementation, and
print that as my result, like so:


require "treetop"

text="text 1 [foo_bar_1234] ** lorem ipsum header ** {lorem {ipsum} [nothing] }(foo_bar_1234, bar_foo_1234) text2 [ text3] "

class Treetop::Runtime::SyntaxNode
  def as_xml
    if elements
      elements.map { |e| e.as_xml }.join
    else
      text_value
    end
  end
end

p Treetop.load('g.tt').new.parse(text).as_xml


Now the output is just the original input string, but it is being
produced by parsing the input, producing an AST, which we then walk,
reproducing the source.

ok so far. I get the same and I do understand the approach.
 

We can recover some of the structure by adding the option of tagging the
xml like so (this isn't intended as an example of great code, just a
conceptual exercise to lead you to the easier way of thinking of
things):

class Treetop::Runtime::SyntaxNode
  def as_xml
    if elements
      elements.map { |e| e.as_xml }.join
    else
      text_value
    end
  end
  def wrap(tag,body)
    "<#{tag}>#{body}</#{tag}>"
  end
end

...and then mark up the grammar to use it:  

   rule top
     document { def as_xml; wrap('top',super); end }
   end
 
   rule document
      (noMarkupText / trace / markupAbort)* { def as_xml;
wrap('document',super); end }
   end


Here we are at the problem: I annotated this single rule as you 
describe. And I receive

(rdb:1) p r2.as_xml
"text 1 [foo_bar_1234] ** lorem ipsum header ** {lorem {ipsum} [nothing] }(foo_bar_1234, bar_foo_1234) text2 [ text3] "

No tag for document. The reason is that the parse tree does not have a node here. And this is
what drives me to desperation: why not? This is one of the cases where TT does not behave
intuitively.
 
   rule noMarkupText
      [^\[]+ { def as_xml; wrap('noMarkupText',super); end }
   end

...which would give us:

"<top><document><noMarkupText>text 1 </noMarkupText>[foo_bar_1234] ** lorem ipsum header ** {lorem {ipsum} [nothing] }(foo_bar_1234, bar_foo_1234)<noMarkupText> text2 </noMarkupText>[<noMarkupText> text3] </noMarkupText></document></top>"

I tried it and the result is:

(rdb:1) p r2.as_xml
"<noMarkupText>text 1 </noMarkupText>[foo_bar_1234] ** lorem ipsum header ** {lorem {ipsum} [nothing] }(foo_bar_1234, bar_foo_1234)<noMarkupText> text2 </noMarkupText>[<noMarkupText> text3] </noMarkupText>"

If the issue with the document rule were not there, I would not be in trouble, since then there would be a clear strategy. But now, whatever I define in the rule "document", it will not
be used.
 

...which is starting to look at least somewhat like your goal:

yes
 

You could get the rest of the way (up to indentation) by decorating the
rest of the grammar, and could even do indentation by fussing with the
contents.


Yes ... if all nodes were really delivered.
 
But there's nothing saying these methods have to produce strings; you
could just as well produce a tree of objects that has the structure you
want, and have them emit themselves as xml on demand, or have them
update a dictionary, etc.  Or, going the other way (as in the obligatory
calculator example) you could have the methods produce a single number.


Ultimately I would like to push the result to Nokogiri in order to transform
and eventually serialize it to the intended output format.
 
The key is to decouple the AST and the result by adding methods to the
syntax tree, so that the structure of one isn't bound to the structure
of the other.


Yes ... but hopefully you see my problem with TT along the way. BTW, I have dealt with model
transformations for a long time. This is the reason why I started with TT even for these
tasks.
 
I hope that helps.


oh yes, thanks a lot.

-- B.

Clifford Heath

Jul 27, 2012, 7:11:12 PM
to treet...@googlegroups.com
On 28/07/2012, at 9:02 AM, Bernhard wrote:
> Here we are at the problem: I annotated this single rule as you
> describe. And I receive
>
> (rdb:1) p r2.as_xml
> "text 1 [foo_bar_1234] ** lorem ipsum header ** {lorem {ipsum} [nothing] }(foo_bar_1234, bar_foo_1234) text2 [ text3] "
>
> no tag for documentation. The reason is, that the parse tree does not have a node here. And this is
> what drives me to desperation: Why not?

You do get a node for 'document', but not for 'top', since that only contains one sub-rule.

When a rule contains only a single node, no new node is created.
Any code block associated is put into a module which is mixed in
to the existing node from the sub-rule.
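
In plain Ruby terms, that amounts to one object being extended twice. The following sketch imitates it with a made-up `StubNode`; the order of the two `extend` calls (sub-rule first, enclosing rule second) is an assumption about what the generated parser does, chosen because it matches the "outer/inner" behaviour described in this thread.

```ruby
# Stand-in for the single node shared by both rules.
class StubNode
  attr_reader :text_value

  def initialize(text_value)
    @text_value = text_value
  end

  def as_xml
    text_value
  end
end

# Each module plays the role of the code block attached to one rule.
module Document
  def as_xml
    "<document>#{super}</document>"
  end
end

module Top
  def as_xml
    "<top>#{super}</top>"
  end
end

node = StubNode.new('text')
node.extend(Document)   # sub-rule decorates the node first (inner)
node.extend(Top)        # enclosing rule decorates the same node later (outer)

merged_xml = node.as_xml
# => "<top><document>text</document></top>"
```

The most recently extended module is searched first, so `super` in `Top#as_xml` reaches `Document#as_xml`, which in turn reaches the node's own method. One node can therefore still yield two nested tags.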

If you defined "top" like this, you'd get a new node which is a sequence of two nodes:

rule top
  document ''
end

This has the null rule '' which matches without consuming any characters.
It means that 'top' is defined as a sequence, so must create a new node.

Clifford Heath.

Bernhard

Jul 27, 2012, 7:18:49 PM
to treet...@googlegroups.com
Does not work for me.

changed

    rule top
        document '' { def as_xml; wrap('top',super); end }
    end

    rule document
       ( (noMarkupText / trace / markupAbort)* )
        { def as_xml;
            wrap('document',super); 
          end
        }
    end


You see, still no "document".

the result is

"<top><noMarkupText>text 1 </noMarkupText>[foo_bar_1234] ** lorem ipsum header ** {lorem {ipsum} [nothing] }(foo_bar_1234, bar_foo_1234)<noMarkupText> text2 </noMarkupText>[<noMarkupText> text3] </noMarkupText></top>"

markus

Jul 27, 2012, 7:30:18 PM
to treet...@googlegroups.com
B --

> ...and then mark up the grammar to use it:
>
> rule top
> document { def as_xml; wrap('top',super); end }
> end
>
> rule document
> (noMarkupText / trace / markupAbort)* { def as_xml;
> wrap('document',super); end }
> end
>
>
>
> Here we are at the problem: I annotated this single rule as you
> describe. And I receive
>
>
> (rdb:1) p r2.as_xml
> "text 1 [foo_bar_1234] ** lorem ipsum header ** {lorem {ipsum}
> [nothing] }(foo_bar_1234, bar_foo_1234) text2 [ text3] "
>
>
> no tag for documentation. The reason is, that the parse tree does not
> have a node here. And this is
> what drives me to desperation: Why not? This is one of the cases where
> TT does not behave
> intuitively.

Going back to the test rig that just dumps the AST, I see:

SyntaxNode+Top0+Document0 offset=0, "...234) text2 [ text3] ":
  SyntaxNode+NoMarkupText0 offset=0, "text 1 ":
    SyntaxNode offset=0, "t"
    SyntaxNode offset=1, "e"
    SyntaxNode offset=2, "x"

...which, as I would expect, has the document and top nodes merged;
which is to say, there is one SyntaxNode and it's getting both modules
added to it, with top being "outer" and document being "inner".

If you want, you can force there to be two nodes by adding a null string
to top (so that it isn't just a synonym for document)

   rule top
     document '' { def as_xml; wrap('top',super); end }
   end

...which should give you:

SyntaxNode+Top1+Top0 offset=0, "...234) text2 [ text3] " (document):
  SyntaxNode+Document0 offset=0, "...234) text2 [ text3] ":
    SyntaxNode+NoMarkupText0 offset=0, "text 1 ":
      SyntaxNode offset=0, "t"
      SyntaxNode offset=1, "e"
      SyntaxNode offset=2, "x"

...but before you do that, did you try marking up both top and document
as I showed? When I do that (without adding a '' to force top to get
its own node) I see the generated xml as I pasted it. In ruby, super
will go up the module chain, so for instance:

module A
  def foo
    :foo
  end
end

module B
  def foo
    [super,super]
  end
end

class C
  include A
  include B
  def foo
    super.to_s.reverse
  end
end

p C.new.foo

...produces:

"]oof: ,oof:["

...:foo from A, doubled by B, and converted to a string & reversed by C.

In case it matters, I'm using ruby 1.9.2 & Treetop v1.4.10.

-- M


Bernhard

Jul 28, 2012, 9:39:46 AM
to treet...@googlegroups.com


On Saturday, July 28, 2012 1:30:18 AM UTC+2, Markus wrote:


Going back to the test rig that just dumps the AST, I see:

SyntaxNode+Top0+Document0 offset=0, "...234) text2 [ text3] ":
  SyntaxNode+NoMarkupText0 offset=0, "text 1 ":
    SyntaxNode offset=0, "t"
    SyntaxNode offset=1, "e"
    SyntaxNode offset=2, "x"


Amazing. I do not get this. I am using ruby 1.8.7. I tried it with 1.9.2 as well, but there was no difference. For whatever reason I cannot make ruby-debug work on my Mountain Lion Mac, so I returned to 1.8.7.

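I created a gist for the two cases so that you can see the results:

https://gist.github.com/3192266/d1d0244b56f3770005b9026279a7e9d93f47d949 - the one with the null strings

https://gist.github.com/3192266/2b2f2bbe36df9dbb3b211f987deddc3a4453a562 - the one without the null strings

You can also switch between the revisions on the GitHub page and observe the differences.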

you see that the parsetree of the one without the null strings does not reflect the rule "documentation"

SyntaxNode+Top0 offset=0, "...234) text2 [ text3] ":
  SyntaxNode+NoMarkupText0 offset=0, "text 1 ":
    SyntaxNode offset=0, "t"
    SyntaxNode offset=1, "e"
    SyntaxNode offset=2, "x"
    SyntaxNode offset=3, "t"
    SyntaxNode offset=4, " "
    SyntaxNode offset=5, "1"
    SyntaxNode offset=6, " "
  SyntaxNode+Trace0 offset=7, "..._1234, bar_foo_1234)" (traceHead,traceBody,traceUpLink,traceId):
    SyntaxNode+TraceId0 offset=7, "[foo_bar_1234]" (label):



 
...which, as I would expect, has the document and top nodes merged;
which is to say, there is one SyntaxNode and it's getting both modules
added to it, with top being "outer" and document being "inner".


Your advice to use "super" in a generic as_xml method paves a way forward for me. Thanks.
 
If you want, you can force there to be two nodes by adding a null string
to top (so that it isn't just a synonym for document)

   rule top
     document '' { def as_xml; wrap('top',super); end }
   end

...which should give you:

SyntaxNode+Top1+Top0 offset=0, "...234) text2 [ text3] " (document):
  SyntaxNode+Document0 offset=0, "...234) text2 [ text3] ":
    SyntaxNode+NoMarkupText0 offset=0, "text 1 ":
      SyntaxNode offset=0, "t"
      SyntaxNode offset=1, "e"
      SyntaxNode offset=2, "x"


Did not work either :-(
I tried it with both. There was no difference.

What I do not like (but certainly would accept as long as the stuff works) is the
stereotypical block which I have to add on every node, which does not really
add further information. If SyntaxNode provided a method to retrieve the
name of the rule, it could all be done in the generic method as_xml.

I tried it as follows. It seems to work well, even if it is somewhat heuristic.

    # indicates whether a meaningful name for the node in the AST
    # is available
    def has_rule_name?
        not (extension_modules.nil? or extension_modules.empty?)
    end

    # returns a meaningful name for the node in the AST
    def rule_name
        if has_rule_name? then
            extension_modules.first.name.split("::").last.gsub(/[0-9]/,"")
        else
            "###"
        end
    end
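
To show where the heuristic holds and where it does not, here is a small sketch in plain Ruby. FakeNode and the Demo modules are hypothetical stand-ins, and the extension_modules method below is only my assumption about what the real one returns (the modules attached to this one object, most recently attached first).

```ruby
# Hypothetical stand-ins for the modules generated from grammar rules.
module Demo
  module Top0; end
  module Document0; end
end

# Stand-in node class; not Treetop's SyntaxNode.
class FakeNode
  # Assumed behaviour: modules extended onto this object only.
  def extension_modules
    singleton_class.included_modules - Object.included_modules
  end

  def has_rule_name?
    !extension_modules.empty?
  end

  def rule_name
    return "###" unless has_rule_name?
    extension_modules.first.name.split("::").last.gsub(/[0-9]/, "")
  end
end

plain = FakeNode.new # no modules attached

merged = FakeNode.new # one node carrying two rules, as in the merged case
merged.extend(Demo::Document0)
merged.extend(Demo::Top0)

puts plain.rule_name  # prints ###
puts merged.rule_name # prints Top
```

For the merged node only the outermost module is reported, so "Document" never appears. The gsub also strips every digit, so a module named Utf8Text0 (a made-up name) would come back as UtfText.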

 Bernhard

markus

Jul 28, 2012, 2:03:22 PM7/28/12
to treet...@googlegroups.com
B --

> SyntaxNode+Top0+Document0 offset=0, "...234) text2 [ text3] ":
>   SyntaxNode+NoMarkupText0 offset=0, "text 1 ":
>     SyntaxNode offset=0, "t"
>     SyntaxNode offset=1, "e"
>     SyntaxNode offset=2, "x"
>
>
>
> Amazing. I do not get this. I am using ruby 1.8.7. I tried it with
> 1.9.2 as well, but there was no difference. For whatever reason I cannot
> make ruby-debug work on my Mountain Lion Mac, so I returned to 1.8.7.

I think you're right that this is the core difference. What version of
treetop are you using? I'm not aware of any version that should do what
you are seeing, but I don't track the details too closely.

> I created a gist for the two cases so that you can see the results:
>
> https://gist.github.com/3192266/d1d0244b56f3770005b9026279a7e9d93f47d949 - the one with the null strings
>
> https://gist.github.com/3192266/2b2f2bbe36df9dbb3b211f987deddc3a4453a562 - the one without the null strings

Hmm. I think you switched the descriptions.

So the one without the null string doesn't generate separate nodes for
top and document (which it shouldn't) but it also doesn't accrete their
modules onto the shared node (which it should).

But with the null string, it looks as if everything is working as
expected. From your second gist:

SyntaxNode+Top1+Top0 offset=0, "...234) text2 [ text3] " (document):
  SyntaxNode+Document1+Document0 offset=0, "...234) text2 [ text3] ":
    SyntaxNode offset=0, "...234) text2 [ text3] ":
      SyntaxNode+NoMarkupText0 offset=0, "text 1 ":
        SyntaxNode offset=0, "t"
        SyntaxNode offset=1, "e"

This looks like what we want, with an extra bonus node between document and
noMarkupText because you also put a '' inside the definition of
document.

The as_xml stuff should work fine on this tree.

> you see that the parsetree of the one without the null strings does
> not reflect the rule "documentation"

Yeah. My only idea is that it's a treetop version difference, but
that's just an unfounded guess.


>
> SyntaxNode+Top1+Top0 offset=0, "...234) text2 [ text3] " (document):
>   SyntaxNode+Document0 offset=0, "...234) text2 [ text3] ":
>     SyntaxNode+NoMarkupText0 offset=0, "text 1 ":
>       SyntaxNode offset=0, "t"
>       SyntaxNode offset=1, "e"
>       SyntaxNode offset=2, "x"
>
>
>
> Did not work either :-(

Actually, from the gist

https://gist.github.com/3192266/2b2f2bbe36df9dbb3b211f987deddc3a4453a562

it appears that it did.


> What I do not like (but certainly would accept as long as the stuff
> works) is the stereotypical block which I have to add on every node,
> which does not really add further information.

I was giving you the simplest to implement/understand version, but there
are several things you can do to declutter the grammar once you get it
working.

For example, if you change your helpers to look like so:

    class Treetop::Runtime::SyntaxNode
      def as_xml
        if elements
          elements.map { |e| e.as_xml }.join
        else
          text_value
        end
      end
    end

    def wrap_with(tag)
      define_method(:as_xml) { "<#{tag}>#{super()}</#{tag}>" }
    end

...then the grammar rule annotations can be correspondingly reduced:

   rule top
     document '' { wrap_with 'top' }
   end

   rule document
     (noMarkupText / trace / markupAbort)*
     { wrap_with 'document' }
   end

   rule noMarkupText
     [^\[]+ { wrap_with 'noMarkupText' }
   end

which is far less obtrusive. As it stands, it is still redundant, in
that the tag names are always the same as the rule names, but that is
really an artifact of how we got here. The rule names should really
reflect the grammatical constructs they define, not the tags that are
used to mark up the output.
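
For anyone who wants to try the wrap_with idea without a grammar, here is a minimal sketch in plain Ruby. LeafNode and the two modules are stand-ins I made up, and putting wrap_with on Module is my assumption about the simplest place that a bare call inside a module body (which is how the thread uses the rule blocks) can find it.

```ruby
# Assumption: defined on Module so a bare `wrap_with 'x'` inside a
# module body resolves to it.
class Module
  def wrap_with(tag)
    # Inside define_method the parentheses on super() are required;
    # a bare `super` raises an error there.
    define_method(:as_xml) { "<#{tag}>#{super()}</#{tag}>" }
  end
end

# Stand-in for a leaf of the parse tree; not Treetop's SyntaxNode.
class LeafNode
  def initialize(text)
    @text = text
  end

  def as_xml
    @text
  end
end

# Stand-ins for the modules generated from the annotated rules.
module NoMarkupText0
  wrap_with 'noMarkupText'
end

module Document0
  wrap_with 'document'
end

node = LeafNode.new("text 1 ")
node.extend(NoMarkupText0)
node.extend(Document0)

puts node.as_xml
# prints <document><noMarkupText>text 1 </noMarkupText></document>
```

Each call to wrap_with adds one as_xml layer to its module, and super() hands off to whatever was attached before it.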

Consider a somewhat contrived example:

   rule literal
     digit+ '.' digit+ { wrap_with 'float' } /
     digit+            { wrap_with 'integer' } /
     '"' [^"]* '"'     { wrap_with 'string' }
   end

The rule name (literal) describes their grammatical role, while the
tagging deals with their semantics (we'd probably want to call it
something other than wrap_with in this case). There's no reason the two
should be bound together, and good reasons for keeping them separate--as
in this example, a single rule might be capable of representing things
with differing semantics and different rules might likewise be capable
of producing things with the same semantics.

> If SyntaxNode provided a method to retrieve the
> name of the rule, it could all be done in the generic method as_xml.

Yeah, but (IMHO) you really don't want to go there. :)

-- M


Bernhard

Jul 28, 2012, 2:44:40 PM7/28/12
to treet...@googlegroups.com
Hi


On Saturday, July 28, 2012 8:03:22 PM UTC+2, Markus wrote:

> Amazing. I do not get this. I am using ruby 1.8.7. I tried it with
> 1.9.2 as well, but there was no difference. For whatever reason I cannot
> make ruby-debug work on my Mountain Lion Mac, so I returned to 1.8.7.

I think you're right that this is the core difference.  What version of
treetop are you using?  I'm not aware of any version that should do what
you are seeing, but I don't track the details too closely.

nor do I, but as it does not seem to be the root cause, I stick with 1.8.7
 

> I created a gist for the two cases so that you can see the results:
>
> https://gist.github.com/3192266/d1d0244b56f3770005b9026279a7e9d93f47d949  - the one with the null strings
>
> https://gist.github.com/3192266/2b2f2bbe36df9dbb3b211f987deddc3a4453a562   - the one without the null strings

Hmm.  I think you switched the descriptions.

grr. sorry.
 

So the one without the null string doesn't generate separate nodes for
top and document (which it shouldn't) but it also doesn't accrete  their
modules onto the shared node (which it should).

But with the null string, it looks as if everything is working as
expected.  From your second gist:

SyntaxNode+Top1+Top0 offset=0, "...234) text2 [ text3] " (document):
  SyntaxNode+Document1+Document0 offset=0, "...234) text2 [ text3] ":
    SyntaxNode offset=0, "...234) text2 [ text3] ":
      SyntaxNode+NoMarkupText0 offset=0, "text 1 ":
        SyntaxNode offset=0, "t"
        SyntaxNode offset=1, "e"
 
This looks like what we want, with an extra bonus node between document and
noMarkupText because you also put a '' inside the definition of
document.

The as_xml stuff should work fine on this tree.

Yes it does, and I guess I will continue down this path.

> you see that the parsetree of the one without the null strings does
> not reflect the rule "documentation"

Yeah.  My only idea is that it's a treetop version difference, but
that's just an unfounded guess.

I am using treetop 1.4.10; this is what "gem install treetop" gave me.

Yes, you are right, even if in most cases this is the same.
But triggered by your idea, I could use "wrap_with" not as
a wrap instruction but as a semantic annotation.

I could then even use this to query the parse tree, if I want
to implement something which picks particular nodes.

A so-called target-driven transformation builds the target
model not by traversing the source tree but by querying
the source model.
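
Roughly what I have in mind, as a sketch in plain Ruby only. All names here (tag_as, semantic_tag, find_tagged, TaggedNode, FloatLit, IntLit) are made up for illustration and are not part of Treetop.

```ruby
# Hypothetical annotation helper: records a semantic tag on the node
# instead of wrapping its output.
class Module
  def tag_as(tag)
    define_method(:semantic_tag) { tag }
  end
end

# Stand-in tree node; not Treetop's SyntaxNode.
class TaggedNode
  attr_reader :text, :children

  def initialize(text, children = [])
    @text = text
    @children = children
  end

  # Untagged nodes answer nil.
  def semantic_tag
    nil
  end

  # Collect this node and all descendants carrying the given tag.
  def find_tagged(tag)
    hits = semantic_tag == tag ? [self] : []
    hits + children.flat_map { |child| child.find_tagged(tag) }
  end
end

# Stand-ins for the modules generated from annotated alternatives.
module FloatLit
  tag_as 'float'
end

module IntLit
  tag_as 'integer'
end

a = TaggedNode.new("3.14").extend(FloatLit)
b = TaggedNode.new("42").extend(IntLit)
c = TaggedNode.new("2.5").extend(FloatLit)
root = TaggedNode.new("3.14 42 2.5", [a, b, c])

p root.find_tagged('float').map(&:text)
# prints ["3.14", "2.5"]
```

The target side would then ask the source tree for "all floats" instead of walking it node by node.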

I guess you really helped me out.
 

Consider a somewhat contrived example:

  rule literal
    digit+ '.' digit+ { wrap_with 'float' } /
    digit+            { wrap_with 'integer' } /
    '"' [^"]* '"'     { wrap_with 'string' }
  end

The rule name (literal) describes their grammatical role, while the
tagging deals with their semantics (we'd probably want to call it
something other than wrap_with in this case).  There's no reason the two
should be bound together, and good reasons for keeping them separate--as
in this example, a single rule might be capable of representing things
with differing semantics and different rules might likewise be capable
of producing things with the same semantics.

> If SyntaxNode provided a method to retrieve the
> name of the rule, it could all be done in the generic method as_xml.

Yeah, but (IMHO) you really don't want to go there.  :)

the literal example is strong. So I agree.

Thanks Markus, you helped me a lot.

--B

markus

Jul 28, 2012, 3:32:41 PM7/28/12
to treet...@googlegroups.com
B --

> But triggered by your idea, I could use "wrap_with" not as
> a wrap instruction but as a semantic annotation.
>
> I could then even use this to query the parse tree, if I want
> to implement something which picks particular nodes.
>
> A so-called target-driven transformation builds the target
> model not by traversing the source tree but by querying
> the source model.

Yeah, I like that line of thinking.

> Thanks Markus, you helped me a lot.

Glad I was able to, & best wishes going forward. If you get stuck
again, don't hesitate to ping the list. We aren't always able to help,
but we try to always try. :)

-- M



Clifford Heath

Jul 28, 2012, 7:34:34 PM7/28/12
to treet...@googlegroups.com
On 28/07/2012, at 11:39 PM, Bernhard wrote:
> For whatever reason I cannot make ruby-debug work on my Mountain Lion Mac, so I returned to 1.8.7

Install the 'debugger' gem - it's a modified version of ruby-debug19 that works.

It's a mess, but this worked for me. Pry does too, and is supposed to be pretty
good, but I haven't made the switch.

Clifford Heath.