I've been having fun with Treetop the past couple of days, but I've
run into some performance issues. I'm extracting dates from text,
which involves parsing many fragments of long strings looking for
matching bits of text. The root node of the grammar is this... a date,
followed by excess stuff...
rule sentence
date .*
end
Finding that this was considerably slower than I'd expected I took a
look at the code and saw that '.*' had been transformed into this.
Note that a new SyntaxNode is created for every character.
loop do
if index < input_length
r3 = (SyntaxNode).new(input, index...(index + 1))
@index += 1
else
terminal_parse_failure("any character")
r3 = nil
end
if r3
s2 << r3
else
break
end
end
So, questions...
1) Is there a better way to say 'and ignore the following garbage'?
2) If this is the best way, then does there really need to be a node
per character?
3) If yes, then is there a way to wrap '.*' in such a way that it is
seen as a terminal and so mapped onto a single SyntaxNode?
To answer my own questions a bit I took a look at the generator code
and I can see that this is all a result of the literal transformation
from grammar to code.... there's no support for any 'global'
optimizations :-( Perhaps the meta-grammar could first convert the
grammar into a parse tree that could be transformed by another grammar
into a tree of operations that could then be transformed into the
literal implementation code. Then you'd have the opportunity to spot
optimization opportunities like this. [Re-reading that I realise that
I'm channelling ANTLR, which I used once to write an XPath to database
query processor... so maybe my thinking is tainted a bit there as to
how you'd go about achieving this.]
Oh, btw, I also wrote this... very quick and dirty... but seemed
reasonable...
I thought it'd map down onto a single Hash looking up... but actually
it generates a long list of nested if token=literal statement... and
the performance killer here is that the call to terminal_parse_failure
creates a new object of every failed match.
On 20/09/2008, at 2:01 PM, merre...@gmail.com wrote:
> The root node of the grammar is this... a date, > followed by excess stuff...
> rule sentence > date .* > end
You don't need to do that for the root node. Just set the option that allows Treetop to succeed on just part of the input:
parser.consume_all_input = false
> 1) Is there a better way to say 'and ignore the following garbage'?
We need to add Regexp support, which will produce a single node. Nathan was somewhat opposed to it philosophically as the PEG algorithm is of the same *order* as a Regexp algo, but the constant multiplier and memory costs should trump that IMO.
> 3) If yes, then is there a way to wrap '.*' in such a way that it is > seen as a terminal and so mapped onto a single SyntaxNode?
I suggested an alternate type of rule, introduced by the keyword "skip" instead of rule (but otherwise identical), that doesn't produce a node, or one that's immediately discarded. The implementation is so far an exercise for the reader ;-).
I have much more interest in implementing semantic predicates, which will substantially improve the power of Treetop to resolve the kinds of natural-language grammar I work with.
> Oh, btw, I also wrote this... very quick and dirty... but seemed > reasonable...
> I thought it'd map down onto a single Hash looking up... but actually > it generates a long list of nested if token=literal statement... and > the performance killer here is that the call to terminal_parse_failure > creates a new object of every failed match.
An optimisation of this case is possible, but not in slightly more general cases, where the terminal_parse_failure is needed to store the memoization that allows PEG to work at faster than exponential speeds.
On Sep 20, 2008, at 12:35 AM, Clifford Heath wrote:
> On 20/09/2008, at 2:01 PM, merre...@gmail.com wrote:
>> rule sentence >> date .* >> end
> You don't need to do that for the root node. Just set the option > that allows Treetop to succeed on just part of the input:
> parser.consume_all_input = false
Ah - perfect.
>> 1) Is there a better way to say 'and ignore the following garbage'?
> We need to add Regexp support, which will produce a single node. > Nathan was somewhat opposed to it philosophically as the PEG > algorithm is of the same *order* as a Regexp algo, but the constant > multiplier and memory costs should trump that IMO.
Yes - that'd improve performance greatly.
> I have much more interest in implementing semantic predicates, > which will substantially improve the power of Treetop to resolve > the kinds of natural-language grammar I work with.
>> I have much more interest in implementing semantic predicates, >> which will substantially improve the power of Treetop to resolve >> the kinds of natural-language grammar I work with. > Like so? > months = { Jan, Feb, etc } > rule month > [a-z]+ { months.includes?[token] } > end
Yes, but taking a leaf out of ANTLR's book, and because what you've shown is ambiguous, using }? to close the predicate. The main implementation issue is that the SyntaxNode has to be constructed to evaluated a predicate, and that means it can't (easily) happen within a sequence, only at the end.
It means you can look up symbol tables for things defined earlier in the same sentence, for example, so increases the power of the parser significantly.
There's also the more obvious problem with approaches like this that
they are case sensitive. When I try things like this I tend to find
that my solutions are either brittle or end up really complex.
On Sep 22, 3:14 am, Clifford Heath <clifford.he...@gmail.com> wrote:
> >> I have much more interest in implementing semantic predicates,
> >> which will substantially improve the power of Treetop to resolve
> >> the kinds of natural-language grammar I work with.
> > Like so?
> > months = { Jan, Feb, etc }
> > rule month
> > [a-z]+ { months.includes?[token] }
> > end
> Yes, but taking a leaf out of ANTLR's book, and because
> what you've shown is ambiguous, using }? to close the
> predicate. The main implementation issue is that the
> SyntaxNode has to be constructed to evaluated a predicate,
> and that means it can't (easily) happen within a sequence,
> only at the end.
> It means you can look up symbol tables for things defined
> earlier in the same sentence, for example, so increases the
> power of the parser significantly.