Ability to parse a stream?

23 views
Skip to first unread message

Justin Tulloss

unread,
Feb 17, 2010, 6:59:07 PM2/17/10
to Treetop Development
Hello,

I'm playing with PDoc, which is a treetop based JavaScript
documentation tool, to generate docs for a large JavaScript code base.
I'd like to be able to parse the input in chunks, say, one file at a
time.

Currently, PDoc will concatenate all the files together and try to
parse them in one shot. This quickly fills up all usable memory and
brings whatever machine it's running on to a halt.

It appears that it is forced to do this because Treetop's
CompiledParser class doesn't have a way to progressively feed in
input. Is this true, or is there a way I could accomplish this?

Thanks,
Justin

Clifford Heath

unread,
Feb 17, 2010, 8:37:15 PM2/17/10
to treet...@googlegroups.com
On 18/02/2010, at 10:59 AM, Justin Tulloss wrote:
> It appears that it is forced to do this because Treetop's
> CompiledParser class doesn't have a way to progressively feed in
> input. Is this true, or is there a way I could accomplish this?

Treetop doesn't expect much of the input object. I believe it's
just these methods: length, size, [], index, line_of, column_of.

The point is, when Treetop asks for input bytes that aren't yet
available from a stream, you could block the thread until those
bytes are available. You'll have to specify a large length and set
Treetop to not consume_all_input.

But I suspect that's not what you wanted. Instead, set the
consume_all_input option to false on the parser, and parse the
first section of the input. Deal with that, and parse again until
you're done. Like this:

p=MyParser.new(input)
p.root = "top_rule_to_use_if_not_default"
p.consume_all_input = false
begin
result = p.parse(input, :index => p.index || 0)
# Do something with result
end until p.index == input.length

Clifford Heath.

Justin Tulloss

unread,
Feb 17, 2010, 11:15:34 PM2/17/10
to treet...@googlegroups.com
Hi Clifford,

Thanks for your response. I'm pretty new to this so I'm not really certain what I'm doing. Forgive my ignorance :).

On Feb 17, 2010, at 5:37 PM, Clifford Heath wrote:

But I suspect that's not what you wanted. Instead, set  the
consume_all_input option to false on the parser, and parse the
first section of the input. Deal with that, and parse again until
you're done. Like this:

p=MyParser.new(input)
p.root = "top_rule_to_use_if_not_default"
p.consume_all_input = false
begin
 result = p.parse(input, :index => p.index || 0)
 # Do something with result
end until p.index == input.length


My problem is that I want "result" to be the cumulative result of all the different parsing phases. There's a syntax tree generated, and I want it to be built iteratively. Does that make sense? In the example you have above, the "Do something with result" step would be to merge it with all the previous results.

Is there any way to do something like that? Am I approaching this wrong?

Thanks,
Justin

Clifford Heath

unread,
Feb 18, 2010, 12:39:52 AM2/18/10
to treet...@googlegroups.com
On 18/02/2010, at 3:15 PM, Justin Tulloss wrote:
> My problem is that I want "result" to be the cumulative result of
> all the different parsing phases. There's a syntax tree generated,
> and I want it to be built iteratively. Does that make sense? In the
> example you have above, the "Do something with result" step would be
> to merge it with all the previous results.
>
> Is there any way to do something like that? Am I approaching this
> wrong?

Yes, I think you're approaching it wrongly. The Treetop
syntax tree is much too detailed generally and uses too
much memory.

What we do instead is to add methods to the rules,
which get added to the appropriate nodes, to build
the data structure we want - *after* Treetop has worked
its magic - and then the Treetop tree can be gc'd.

Make these methods recursive, so you can just call e.g.
"value" on the top node, and it builds the whole structure
you need. Don't try to write code to walk the syntax tree
from the outside, as some newbies have tried.

Clifford Heath.

Justin Tulloss

unread,
Feb 18, 2010, 2:18:31 PM2/18/10
to treet...@googlegroups.com
On Feb 17, 2010, at 9:39 PM, Clifford Heath wrote:

> On 18/02/2010, at 3:15 PM, Justin Tulloss wrote:
>> My problem is that I want "result" to be the cumulative result of
>> all the different parsing phases. There's a syntax tree generated,
>> and I want it to be built iteratively. Does that make sense? In the
>> example you have above, the "Do something with result" step would
>> be to merge it with all the previous results.
>>
>> Is there any way to do something like that? Am I approaching this
>> wrong?
>

> What we do instead is to add methods to the rules,
> which get added to the appropriate nodes, to build
> the data structure we want - *after* Treetop has worked
> its magic - and then the Treetop tree can be gc'd.

I'm not sure I understand. It sounds like I still need to generate the
entire Treetop tree in one pass, which is the problem in the first
place since I can't fit the syntax tree in available memory (which
seems ridiculous. how does 4.3 M of source turn into 3G of tree?). Is
there any way to do this tree iteration in a way that allows me to
throw things out as I go along?

Thanks,
Justin

Clifford Heath

unread,
Feb 18, 2010, 5:00:27 PM2/18/10
to treet...@googlegroups.com
On 19/02/2010, at 6:18 AM, Justin Tulloss wrote:
> I'm not sure I understand. It sounds like I still need to generate
> the entire Treetop tree in one pass,

No. I meant that you should parse many times, in as small chunks as
possible, and build your desired data structure from the results of
each pass, discarding those results after you've used them. Build your
structure by adding node methods that descend the tree, not by walking
the tree from the outside.

Don't try to use the Treetop syntax tree as more than just a lexical
parse result. It is not the right shape to locate you application's
semantics.

Clifford Heath.

Justin Tulloss

unread,
Feb 18, 2010, 5:48:02 PM2/18/10
to treet...@googlegroups.com
Ah, that makes sense. That requires a major restructuring of PDoc, but
I suppose that can be done.

Thanks for your help!

Justin

> --
> You received this message because you are subscribed to the Google
> Groups "Treetop Development" group.
> To post to this group, send email to treet...@googlegroups.com.
> To unsubscribe from this group, send email to treetop-dev...@googlegroups.com
> .
> For more options, visit this group at http://groups.google.com/group/treetop-dev?hl=en
> .
>

Bill Tozier

unread,
Feb 19, 2010, 5:13:30 PM2/19/10
to Treetop Development

I have a strong feeling this design pattern you're talking about is
one of the missing "secrets" of using Treetop effectively, but it's
still not obvious to the layman or even smart casual user. Could we
put a little more joint effort into an example, please? I'd love to
understand just what you mean.

"Martin J. Dürst"

unread,
Feb 20, 2010, 12:10:45 AM2/20/10
to treet...@googlegroups.com

Hello Bill, others,

I created a very simple example, in the two attached files. Please feel
free to use it in treetop documentation or elsewhere as you see fit.

It essentially shows a conversion from the *concrete* syntax tree (parse
tree, tree of SyntaxNodes in treetop) to a much more compact abstract
syntax tree (AST) that is closer to the application.

The example prints out both forms, so that you can see the differences.
The AST is printed in one line, which may make it look a bit smaller
than it actually is, but on the other hand, the parse tree doesn't show
all the information in the SyntaxNode objects.

The size and memory differences are due to various reasons:
1) Treetop's use of grammar rules for lexical structure: even a simple
number gets decomposed into digits, which blows up the data
structures and eats a lot of memory.

[hint, hint: When can we get the regular expression feature included
in the treetop distribution?]

2) The fact that SyntaxNodes keep information that was interesting for
parsing, but is no longer useful (e.g. at which character position
each construct started and ended)

3) The fact that SyntaxNodes keep everything as strings, whereas in the
AST, we have a representation closer to the problem (numbers become
Integers, operands become Symbols)

Please note that the overall structure (an assignment from a number to
an identifier) may not look too different in both cases, but it may be
more obvious when using the usual drawing conventions:

Parse tree:
assignment
/ | \
abc = 123

AST:
=
/ \
abc 123

The example also shows how to do the conversion: To each grammar rule of
interest, you add a definition with the same method name ('ast' in the
example) to construct and return (a part of) the tree.

Hope this helps.

Regards, Martin.

--
#-# Martin J. D�rst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:due...@it.aoyama.ac.jp

simple_ast.treetop
simple_ast_example.rb

Bill Tozier

unread,
Feb 20, 2010, 10:29:02 AM2/20/10
to treet...@googlegroups.com

On Feb 20, 2010, at 12:10 AM, Martin J. Dürst wrote:

> I created a very simple example, in the two attached files. Please feel free to use it in treetop documentation or elsewhere as you see fit.
>
> It essentially shows a conversion from the *concrete* syntax tree (parse tree, tree of SyntaxNodes in treetop) to a much more compact abstract syntax tree (AST) that is closer to the application.

Yes, this is actually a nice, straightforward example of the differences between ASTs and parse trees. But Clifford (in response to the OP) seems to have been suggesting something a bit more complex and counterintuitive for the lay reader: creating new parsers /inside/ the modules executed by the rules themselves, so that small---tiny---chunks of source are parsed.

One part of that I think I understand---though only after a few weeks' of intense work :)---and I've modified your example to capture the sort of 'cascade' Clifford is suggesting.

In my trivial variation of your SimpleASTParser, the "number" rule does not itself build an ast instance. Instead, it fires up a new NumberParser, which parses the recognized text_value. That's where the #ast is executed. I've attached the three (3) files below.

I can see definitely see how Clifford's argument works: this will save memory in the long term, since only the parsers needed for one depth-first traversal of the parsetree will be created, and then as the traversal rolls back they will be garbage-collected.

What I'm not sure about, and what doesn't appear to be in the documentation anywhere, is how to use Parser#consume_all_input to make this even better.

I suppose my own example could be modified to handle that aspect as well, though maybe a bit more tree-depth would be helpful. I'll poke at it and see if I can explain it to myself; in the meantime, anybody: chime in!

cascading_ast.treetop
multiparser_ast_example.rb
number.treetop

Clifford Heath

unread,
Feb 20, 2010, 8:59:50 PM2/20/10
to treet...@googlegroups.com
On 21/02/2010, at 2:29 AM, Bill Tozier wrote:
> Yes, this is actually a nice, straightforward example of the
> differences between ASTs and parse trees. But Clifford (in response
> to the OP) seems to have been suggesting something a bit more
> complex and counterintuitive for the lay reader: creating new
> parsers /inside/ the modules executed by the rules themselves, so
> that small---tiny---chunks of source are parsed.

Well, no, not really. Martin got it about right.

Most languages have some kind of sentential form,
and it makes sense to parse just one sentence at a time.
In C or JS etc, that means one function or declaration.

What I was trying to say is don't try to use the syntax tree
as your AST, because it will often contain a node for every
character. In an AST, the individual characters rarely matter.

I think your revised example is pretty weird actually. Why
would you want to parse the number twice?

I go back to my original example code, which parses a sentence
at a time until it reaches the end of the file:

p=MyParser.new(input)
p.root = "sentence"


p.consume_all_input = false
begin
result = p.parse(input, :index => p.index || 0)

do_something_with(result.value)


end until p.index == input.length

Setting "p.root" tells the parser which rule to use, if not the
topmost rule.

Setting consume_all_input = false allows the parser to succeed
without consuming all input (so it returns a result). Passing
:index => p.index tells the parser to start the next sentence at
the end of the previous one.

Clifford Heath.

Bill Tozier

unread,
Feb 21, 2010, 6:41:52 AM2/21/10
to treet...@googlegroups.com

On Feb 20, 2010, at 8:59 PM, Clifford Heath wrote:

> On 21/02/2010, at 2:29 AM, Bill Tozier wrote:
>> Yes, this is actually a nice, straightforward example of the differences between ASTs and parse trees. But Clifford (in response to the OP) seems to have been suggesting something a bit more complex and counterintuitive for the lay reader: creating new parsers /inside/ the modules executed by the rules themselves, so that small---tiny---chunks of source are parsed.
>
> Well, no, not really. Martin got it about right.
>
> Most languages have some kind of sentential form,
> and it makes sense to parse just one sentence at a time.
> In C or JS etc, that means one function or declaration.
>
> What I was trying to say is don't try to use the syntax tree
> as your AST, because it will often contain a node for every
> character. In an AST, the individual characters rarely matter.
>
> I think your revised example is pretty weird actually. Why
> would you want to parse the number twice?

LOL. I stand corrected, I suppose.

"Martin J. Dürst"

unread,
Jun 4, 2010, 6:18:09 AM6/4/10
to treet...@googlegroups.com
Hello Clifford, others,

Sorry to go back to a very old thread.

On 2010/02/21 10:59, Clifford Heath wrote:
> On 21/02/2010, at 2:29 AM, Bill Tozier wrote:
>> Yes, this is actually a nice, straightforward example of the
>> differences between ASTs and parse trees. But Clifford (in response to
>> the OP) seems to have been suggesting something a bit more complex and
>> counterintuitive for the lay reader: creating new parsers /inside/ the
>> modules executed by the rules themselves, so that
>> small---tiny---chunks of source are parsed.
>
> Well, no, not really. Martin got it about right.
>
> Most languages have some kind of sentential form,
> and it makes sense to parse just one sentence at a time.
> In C or JS etc, that means one function or declaration.
>
> What I was trying to say is don't try to use the syntax tree
> as your AST, because it will often contain a node for every
> character. In an AST, the individual characters rarely matter.

Agreed, of course. But I have been wondering all along: Why does Treetop
construct a syntax tree in the first place? Things like yacc/bison don't
do this, instead of allowing the user to define an arbitrary number of
methods on each kind of syntax tree, they only allow the user to define
how to calculate one semantic value (which may be arbitrarily simple or
complex, but is usually a node in the AST or nothing if that part of the
syntax doesn't survive to the AST).

Regards, Martin.


> I think your revised example is pretty weird actually. Why
> would you want to parse the number twice?
>
> I go back to my original example code, which parses a sentence
> at a time until it reaches the end of the file:
>
> p=MyParser.new(input)
> p.root = "sentence"
> p.consume_all_input = false
> begin
> result = p.parse(input, :index => p.index || 0)
> do_something_with(result.value)
> end until p.index == input.length
>
> Setting "p.root" tells the parser which rule to use, if not the
> topmost rule.
>
> Setting consume_all_input = false allows the parser to succeed
> without consuming all input (so it returns a result). Passing
> :index => p.index tells the parser to start the next sentence at
> the end of the previous one.
>
> Clifford Heath.
>

--

Clifford Heath

unread,
Jun 4, 2010, 11:19:49 PM6/4/10
to treet...@googlegroups.com
On 04/06/2010, at 8:18 PM, Martin J. Dürst wrote:
>> What I was trying to say is don't try to use the syntax tree
>> as your AST, because it will often contain a node for every
>> character. In an AST, the individual characters rarely matter.
>
> Agreed, of course. But I have been wondering all along: Why does
> Treetop construct a syntax tree in the first place?

PEG parsers avoid having to analyse the grammar by interpreting
it blindly, using greedy matching and unguided backtracking. So
that this doesn't produce exponential execution time, it memoises
the results. Any result that is memoized for one rule at one point in
the input contains subtrees that are memoized in their respective
rules. Consequently the subtrees of a node can't realistically be
pruned from that node; they won't be GC'd because they will have
been memoized in subordinate rules anyway.

So the answer to your question "why", is that it adds no expense
over what is necessary for the algorithm to work in the first place.

I have thought it would be good to provide an option for Treetop
to produce an instrumented parser, which says for each rule, how
many different successful results were memoized, and how many
times the rule made use of a memoized result. It's my belief that
many rules will memoize results that will never get used, so the
report from an instrumented parser would allow you to annotate
the grammar saying "don't bother memoizing this". It could save
a lot of execution time and/or memory.

Clifford Heath.

Reply all
Reply to author
Forward
0 new messages