Simple answer: IntegerLiteral, NilLiteral, etc. need to be modules, and
you've most likely defined them as classes.
More complex answers are available if needed. :)
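To make the short answer a bit more concrete, here is a minimal sketch in plain Ruby (no Treetop involved). It assumes the node already exists and is being decorated with `extend`, which is where a class cannot stand in for a module. `IntegerLiteralModule`, `IntegerLiteralClass`, `FakeNode` and `try_extend` are made-up names for illustration only.

```ruby
# Stand-ins for node extensions; both names are hypothetical.
module IntegerLiteralModule
  def value
    text.to_i
  end
end

class IntegerLiteralClass
  def value
    text.to_i
  end
end

# A tiny stand-in for an already-built syntax node.
FakeNode = Struct.new(:text)

# Try to decorate an existing object and report what happened.
def try_extend(node, extension)
  node.extend(extension)
  :ok
rescue TypeError
  # Object#extend only accepts modules, so passing a class lands here.
  :type_error
end

puts try_extend(FakeNode.new("42"), IntegerLiteralModule)
puts try_extend(FakeNode.new("42"), IntegerLiteralClass)
```

The first call succeeds and the second is rejected by Ruby itself, which is the kind of failure you would expect when a class name is used where a mix-in is needed.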
-- MarkusQ
rule foo
  a <A_In_Foo> / b <B_In_Foo>
end
rule a
  something <A>
end
rule b
  something_else <B>
end
In the case where "foo" matches "something_else", the SyntaxNode for "something_else"
will have two mixed-in modules (both "B" and "B_In_Foo"). Use parentheses around a list
of alternates if you want to mix in the same module regardless of which possibility is matched.
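The stacking can be pictured in plain Ruby, on the assumption that each `<...>` annotation applied to an existing node amounts to an `extend`. The module names mirror the grammar above; `StubNode` and `match_foo_via_b` are hypothetical stand-ins, not Treetop API.

```ruby
module B; end
module B_In_Foo; end

# Hypothetical stand-in for the node produced by "something_else".
StubNode = Struct.new(:text_value)

def match_foo_via_b(text)
  node = StubNode.new(text)
  node.extend(B)         # mixed in when rule b matches
  node.extend(B_In_Foo)  # mixed in again when foo accepts b's result
  node
end

node = match_foo_via_b("something_else")
puts node.is_a?(B)
puts node.is_a?(B_In_Foo)
```

Both checks come out true for the same object, so any method defined in either module is reachable from that one node.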
Clifford Heath.
> --
> You received this message because you are subscribed to the Google Groups "Treetop Development" group.
> To view this discussion on the web visit https://groups.google.com/d/msg/treetop-dev/-/i618WdSTO9sJ.
> To post to this group, send email to treet...@googlegroups.com.
> To unsubscribe from this group, send email to treetop-dev...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/treetop-dev?hl=en.
Good catch; this would likely be the next snag he'd hit, once the
class/module problem is cleared up (the "0" case doesn't hit it because
it's the last alternative, but anything else would).
-- MarkusQ
> I'm still a little confused about the fact that the parser still refuses to
> recognise any of my patterns:
This is just general advice: Take a step back. Can you get some
example code to work? If so, you might want to try morphing the
example into the code you want, step by step. The other approach is to
start with the barest possible grammar, get that to work, and
gradually add on to it.
What is unproductive, in my experience, is to write a bunch of code,
throw it at the system, and debug the error messages you get. Try to
start from a happy place, and change that to where you want to get to.
This isn't quite as off-topic as it might seem, as I've found that
parsing (with Citrus rather than Treetop) really requires a go-slow
approach.
///ark
Web Applications Developer
California Academy of Sciences
When I make those modifications to a grammar consisting of the rules you
originally posted, it appears to work:
irb(main):009:0> SimpleParser.new.parse('12345')
=> SyntaxNode+Literal+IntegerLiteral+DecimalIntegerLiteral0 offset=0,
"12345":
SyntaxNode offset=0, ""
SyntaxNode offset=0, "1"
SyntaxNode offset=1, "2345":
SyntaxNode offset=1, "2"
SyntaxNode offset=2, "3"
SyntaxNode offset=3, "4"
SyntaxNode offset=4, "5"
Did you perhaps change anything else since then?
-- M
On 28/06/2012, at 6:02 AM, Bernhard wrote:
> One motivation to use subclasses is …
> to clean the parse tree such that it only has the semantically relevant information and no longer keep the syntactical sugar.
This is a really bad idea in general. PEG parsers produce
lots of garbage - all the hidden memoized attempts at
almost every character. If you keep the parse tree, whether
it's semantically-structured or not, you keep the garbage.
Sometimes a parse tree is like your semantic model, as in
the s-expressions example, but usually you will write a better
program if you design a semantic model first, then construct
that *after* parsing your input.
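As a rough illustration of building the model after the parse, here is a sketch that uses a hand-made stand-in for a parse tree. `TreeNode`, `Heading`, `Paragraph` and `to_model` are hypothetical names chosen for the example, not anything Treetop provides.

```ruby
# Hypothetical stand-in for a parse tree node: a label, the matched text, children.
TreeNode = Struct.new(:label, :text, :children)

# The semantic model: only what the program actually cares about.
Heading = Struct.new(:title)
Paragraph = Struct.new(:body)

# Walk the tree once and keep only the semantically relevant pieces.
def to_model(node)
  case node.label
  when :heading   then [Heading.new(node.text.strip)]
  when :paragraph then [Paragraph.new(node.text.strip)]
  else (node.children || []).flat_map { |child| to_model(child) }
  end
end

tree = TreeNode.new(:document, nil, [
  TreeNode.new(:heading, " Intro ", []),
  TreeNode.new(:whitespace, "\n", []),
  TreeNode.new(:paragraph, "Some text. ", [])
])

model = to_model(tree)
tree = nil  # nothing in the model points back into the tree, so it can be collected
p model
```

The model holds plain strings and structs only, so dropping the reference to the tree lets the parser's leftovers go with it.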
On 28/06/2012, at 6:29 PM, Bernhard wrote:
> This is the reason why I want to clean the parse tree as it was proposed by Aaron in http://thingsaaronmade.com/blog/a-quick-intro-to-writing-a-parser-using-treetop.html.
It's pretty ugly digging around inside bits of the Treetop structures
assuming you know how to clean it. In fact, I think it's unlikely that
the example code frees very much garbage anyway.
> I am still struggling with finding an appropriate generic strategy to transform the parse tree to my semantic model.
The normal preferred way is to ask the parse tree to descend itself
and build whatever answer you require. Whether you use custom
classes or just methods depends on how hard that is.
> I would like to construct my semantic model and populate it by retrieving the relevant information from the parse tree (sometimes called the target driven transformation). But this is very difficult.
Right, what I said. Why is it so difficult in your case?
> My task is to query tons of LaTeX - Files in order to generate particular extracts in XML.
I think the only engine that can properly parse TeX is TeX - since
you can define new grammar rules that modify the TeX parser.
But if you assume that isn't being done… well, it's still pretty hard.
LaTeX is a real challenge. I admire you for attempting it.
Yes, difficult enough. But I want to grep out some stuff from LaTeX files.
I therefore do not need a full-blown LaTeX parser.
My problem as of now is that I am struggling with Treetop _and_ with LaTeX,
so I went to the simpler task with Markdown.
B --
I'm not sure how helpful this will be, but I'll offer it in the hopes
that it at least provides some insight.
It feels as if you are trying to solve the wrong problem, and in the
process making things harder on yourself than need be. Specifically, it
sounds as if you are trying to have treetop produce the results you want
(and thus you care about where it does/does not produce nodes) rather
than having it produce the AST and then having the AST produce the
results you want.
This is a subtle distinction, but it has powerful consequences, because
it breaks the coupling between the structure of the grammar and the
structure of the results.
I'll try to walk you to the place where I think you ought to be starting
by going through a series of not-right-but-at-least-less-wrong stages.
First, with your grammar in g.tt I write a test rig like so:
require "treetop"
text="text 1 [foo_bar_1234] ** lorem ipsum header ** {lorem {ipsum}
[nothing] }(foo_bar_1234, bar_foo_1234) text2 [ text3] "
p Treetop.load('g.tt').new.parse(text)
This produces the AST, as you showed in your e-mail. But now rather
than having the AST as the output, suppose I define a property "as_xml"
on each node, with a (clearly incorrect) default implementation, and
print that as my result, like so:
require "treetop"
text="text 1 [foo_bar_1234] ** lorem ipsum header ** {lorem {ipsum}
[nothing] }(foo_bar_1234, bar_foo_1234) text2 [ text3] "
class Treetop::Runtime::SyntaxNode
  def as_xml
    if elements
      elements.map { |e| e.as_xml }.join
    else
      text_value
    end
  end
end
p Treetop.load('g.tt').new.parse(text).as_xml
Now the output is just the original input string, but it is being
produced by parsing the input, producing an AST, which we then walk,
reproducing the source.
We can recover some of the structure by adding the option of tagging the
xml like so (this isn't intended as an example of great code, just a
conceptual exercise to lead you to the easier way of thinking of
things):
class Treetop::Runtime::SyntaxNode
  def as_xml
    if elements
      elements.map { |e| e.as_xml }.join
    else
      text_value
    end
  end
  def wrap(tag,body)
    "<#{tag}>#{body}</#{tag}>"
  end
end
...and then mark up the grammar to use it:
rule top
  document { def as_xml; wrap('top',super); end }
end
rule document
  (noMarkupText / trace / markupAbort)* { def as_xml; wrap('document',super); end }
end
rule noMarkupText
  [^\[]+ { def as_xml; wrap('noMarkupText',super); end }
end
...which would give us:
"<top><document><noMarkupText>text 1 </noMarkupText>[foo_bar_1234] **
lorem ipsum header ** {lorem {ipsum} [nothing] }(foo_bar_1234,
bar_foo_1234)<noMarkupText> text2 </noMarkupText>[<noMarkupText> text3]
</noMarkupText></document></top>"
...which is starting to look at least somewhat like your goal:
You could get the rest of the way (up to indentation) by decorating the
rest of the grammar, and could even do indentation by fussing with the
contents.
But there's nothing saying these methods have to produce strings; you
could just as well produce a tree of objects that has the structure you
want, and have them emit themselves as xml on demand, or have them
update a dictionary, etc. Or, going the other way (as in the obligatory
calculator example) you could have the methods produce a single number.
The key is to decouple the AST and the result by adding methods to the
syntax tree, so that the structure of one isn't bound to the structure
of the other.
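For instance, the same pattern with methods that return numbers instead of strings might look like the sketch below. It uses hand-built stand-in nodes rather than a real Treetop parse; `Leaf`, `Branch`, `NumberNode`, `SumNode` and `value` are all hypothetical names.

```ruby
# Hypothetical stand-ins for terminal and non-terminal syntax nodes.
Leaf = Struct.new(:text_value)
Branch = Struct.new(:elements)

# Semantics live in mixed-in modules, not in the tree's shape.
module NumberNode
  def value
    text_value.to_i
  end
end

module SumNode
  def value
    elements.map(&:value).reduce(0, :+)
  end
end

def number(text)
  Leaf.new(text).extend(NumberNode)
end

def sum(*nodes)
  Branch.new(nodes).extend(SumNode)
end

tree = sum(number("1"), sum(number("2"), number("3")))
puts tree.value  # 6
```

The tree is nested two levels deep, yet the result is a single integer: the shape of the answer is whatever the methods decide to return.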
I hope that helps.
Going back to the test rig that just dumps the AST, I see:
SyntaxNode+Top0+Document0 offset=0, "...234) text2 [ text3] ":
SyntaxNode+NoMarkupText0 offset=0, "text 1 ":
SyntaxNode offset=0, "t"
SyntaxNode offset=1, "e"
SyntaxNode offset=2, "x"
SyntaxNode+Top0 offset=0, "...234) text2 [ text3] ":
SyntaxNode+NoMarkupText0 offset=0, "text 1 ":
SyntaxNode offset=0, "t"
SyntaxNode offset=1, "e"
SyntaxNode offset=2, "x"
SyntaxNode offset=3, "t"
SyntaxNode offset=4, " "
SyntaxNode offset=5, "1"
SyntaxNode offset=6, " "
SyntaxNode+Trace0 offset=7, "..._1234, bar_foo_1234)" (traceHead,traceBody,traceUpLink,traceId):
SyntaxNode+TraceId0 offset=7, "[foo_bar_1234]" (label):
...which, as I would expect, has the document and top nodes merged;
which is to say, there is one SyntaxNode and it's getting both modules
added to it, with top being "outer" and document being "inner".
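The outer/inner ordering follows from how Ruby resolves `super` among modules extended onto a single object. Here is a minimal sketch, assuming that Document0 is extended first and Top0 second; `MiniNode` and `merged_node` are hypothetical stand-ins for Treetop::Runtime::SyntaxNode and the parser's work.

```ruby
# Hypothetical stand-in for a syntax node with the default as_xml and wrap.
class MiniNode
  def initialize(text)
    @text = text
  end

  def as_xml
    @text
  end

  def wrap(tag, body)
    "<#{tag}>#{body}</#{tag}>"
  end
end

module Document0
  def as_xml; wrap('document', super); end
end

module Top0
  def as_xml; wrap('top', super); end
end

def merged_node(text)
  node = MiniNode.new(text)
  node.extend(Document0)  # inner rule
  node.extend(Top0)       # outer rule; extended last, so it is found first
  node
end

puts merged_node("text 1 ").as_xml  # <top><document>text 1 </document></top>
```

One object answers for both rules, and each `super` call steps one module further in, which is why the tags nest even though there is only a single node.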
If you want, you can force there to be two nodes by adding a null string
to top (so that it isn't just a synonym for document)
rule top
  document '' { def as_xml; wrap('top',super); end }
end
...which should give you:
SyntaxNode+Top1+Top0 offset=0, "...234) text2 [ text3] " (document):
SyntaxNode+Document0 offset=0, "...234) text2 [ text3] ":
SyntaxNode+NoMarkupText0 offset=0, "text 1 ":
SyntaxNode offset=0, "t"
SyntaxNode offset=1, "e"
SyntaxNode offset=2, "x"
# Indicates whether a meaningful name for the node in the AST
# is available.
def has_rule_name?
  not (extension_modules.nil? or extension_modules.empty?)
end
# Returns a meaningful name for the node in the AST.
def rule_name
  if has_rule_name?
    extension_modules.first.name.split("::").last.gsub(/[0-9]/,"")
  else
    "###"
  end
end
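To see what the string handling in rule_name does, here it is applied to plain module names. This is only a sketch: `strip_rule_name` and the `Markup::` grammar prefix are hypothetical, chosen to resemble the generated module names in the dumps.

```ruby
# Same split/gsub chain as in rule_name, applied to a plain string.
def strip_rule_name(module_name)
  module_name.split("::").last.gsub(/[0-9]/, "")
end

puts strip_rule_name("Markup::NoMarkupText0")   # NoMarkupText
puts strip_rule_name("Markup::Top1")            # Top
puts strip_rule_name("Markup::Heading2Level0")  # HeadingLevel
```

Note that the gsub removes every digit, not just the trailing index, so a rule whose own name contains a digit would come back altered.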
> amazing. I do not get this. I am using ruby 1.8.7. I tried it with
> 1.9.2 as well but no difference. For whatever reason I cannot make
> ruby-debug work on my Mountain Lion Mac, so I returned to 1.8.7
I think you're right that this is the core difference. What version of
treetop are you using? I'm not aware of any version that should do what
you are seeing, but I don't track the details too closely.
> I created a gist for the two cases such that you can see the results
>
>
> https://gist.github.com/3192266/d1d0244b56f3770005b9026279a7e9d93f47d949 - the one with the null strings
>
> https://gist.github.com/3192266/2b2f2bbe36df9dbb3b211f987deddc3a4453a562 - the one without the null strings
Hmm. I think you switched the descriptions.
So the one without the null string doesn't generate separate nodes for
top and document (which it shouldn't) but it also doesn't accrete their
modules onto the shared node (which it should).
But with the null string, it looks as if everything is working as
expected. From your second gist:
SyntaxNode+Top1+Top0 offset=0, "...234) text2 [ text3] " (document):
SyntaxNode+Document1+Document0 offset=0, "...234) text2 [ text3] ":
SyntaxNode offset=0, "...234) text2 [ text3] ":
SyntaxNode+NoMarkupText0 offset=0, "text 1 ":
SyntaxNode offset=0, "t"
SyntaxNode offset=1, "e"
This looks like what we want, with an extra bonus node between document and
noMarkupText because you also put a '' inside the definition of
document.
The as_xml stuff should work fine on this tree.
> you see that the parsetree of the one without the null strings does
> not reflect the rule "documentation"
Yeah. My only idea is that it's a treetop version difference, but
that's just an unfounded guess.
Consider a somewhat contrived example:
rule literal
  digit+ '.' digit+ { wrap_with 'float' } /
  digit+ { wrap_with 'integer' } /
  '"' [^"]* '"' { wrap_with 'string' }
end
The rule name (literal) describes their grammatical role, while the
tagging deals with their semantics (we'd probably want to call it
something other than wrap_with in this case). There's no reason the two
should be bound together, and good reasons for keeping them separate--as
in this example, a single rule might be capable of representing things
with differing semantics and different rules might likewise be capable
of producing things with the same semantics.
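One way to picture the separation in plain Ruby: a hypothetical helper, here called `tagging`, builds a module whose as_xml wraps whatever the node underneath returns. `PlainNode` and `literal` are likewise invented for the sketch, and the float/integer test stands in for the choice the grammar would make.

```ruby
# Hypothetical stand-in for a syntax node with a default as_xml.
class PlainNode
  attr_reader :text_value

  def initialize(text)
    @text_value = text
  end

  def as_xml
    text_value
  end
end

# Hypothetical helper: returns a module whose as_xml wraps the inner result in a tag.
def tagging(tag)
  Module.new do
    # super needs explicit parentheses inside define_method.
    define_method(:as_xml) { "<#{tag}>#{super()}</#{tag}>" }
  end
end

FLOAT_TAG = tagging('float')
INTEGER_TAG = tagging('integer')

# One "rule", two possible semantic tags, chosen per match.
def literal(text)
  PlainNode.new(text).extend(text.include?('.') ? FLOAT_TAG : INTEGER_TAG)
end

puts literal("3.14").as_xml  # <float>3.14</float>
puts literal("42").as_xml    # <integer>42</integer>
```

Neither tag mentions the rule name, and the same tagging module could just as well be attached by a completely different rule.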
> If SyntaxTree would provide a method to retrieve the
> name of the rule, It could all be done in the generic method as_xml.
Yeah, but (IMHO) you really don't want to go there. :)