> Hello,
>
> I've been having fun with Treetop the past couple of days, but I've
> run into some performance issues. I'm extracting dates from text,
Have you run any benchmarks? I suppose if you could demonstrate that
it's *terribly* slow someone would be willing to whip up a patch.
Scott
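For anyone wanting numbers, here is a minimal sketch of how such a benchmark could look using Ruby's standard Benchmark module. It does not exercise Treetop itself: the two helper methods are hypothetical stand-ins that compare trying each month literal in turn against a single set lookup, since the original question is about extracting dates.

```ruby
require 'benchmark'
require 'set'

MONTHS = %w[January February March April May June July
            August September October November December].freeze
MONTH_SET = MONTHS.to_set.freeze

# Hypothetical stand-in for ordered choice: try each literal in turn.
def month_by_alternation(word)
  MONTHS.find { |m| m == word }
end

# Hypothetical stand-in for a single hash-based lookup.
def month_by_lookup(word)
  MONTH_SET.include?(word) ? word : nil
end

# A mix of hits (late and early in the list) and misses.
words = %w[December garbage March text] * 50_000

Benchmark.bm(12) do |x|
  x.report('alternation') { words.each { |w| month_by_alternation(w) } }
  x.report('lookup')      { words.each { |w| month_by_lookup(w) } }
end
```

Timings will vary by machine and Ruby version, so treat the result as a rough indication only. A benchmark of the real grammar would call the generated parser inside the report blocks instead.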
You don't need to do that for the root node. Just set the option
that allows Treetop to succeed on just part of the input:
parser.consume_all_input = false
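To illustrate what succeeding on just part of the input means, here is a small plain-Ruby sketch. It is not Treetop code: parse_prefix and its keyword argument are hypothetical and only mirror the idea behind the option, using StringScanner from the standard library.

```ruby
require 'strscan'

MONTHS = %w[January February March April May June July
            August September October November December].freeze
DATE = /\d{1,2} (?:#{MONTHS.join('|')}) \d{4}/

# Hypothetical helper: match a date at the start of the input and report
# how far the match got, optionally tolerating unconsumed trailing text.
def parse_prefix(input, consume_all_input: true)
  scanner = StringScanner.new(input)
  matched = scanner.scan(DATE)
  return nil if matched.nil?
  # In strict mode, trailing text turns a successful match into a failure.
  return nil if consume_all_input && !scanner.eos?

  { text: matched, consumed: scanner.pos }
end

strict  = parse_prefix('20 September 2008 and some garbage')
relaxed = parse_prefix('20 September 2008 and some garbage',
                       consume_all_input: false)
```

In strict mode the trailing text makes the whole parse fail. In relaxed mode the date is returned along with the offset where matching stopped.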
> 1) Is there a better way to say 'and ignore the following garbage'?
We need to add Regexp support, which will produce a single node.
Nathan was somewhat opposed to it philosophically as the PEG
algorithm is of the same *order* as a Regexp algo, but the constant
multiplier and memory costs should trump that IMO.
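The memory cost can be made concrete with a rough plain-Ruby sketch. The Node struct and both helpers are hypothetical and are not Treetop's SyntaxNode. The sketch assumes that a PEG-style '.*' builds one node per matched character plus a parent, while a Regexp-backed terminal builds one node for the whole run.

```ruby
require 'strscan'

# Hypothetical stand-in for a syntax tree node.
Node = Struct.new(:text, :children)

# PEG-style '.*': one child node per character, plus a parent node.
def match_rest_per_char(scanner)
  children = []
  children << Node.new(scanner.getch, []) until scanner.eos?
  Node.new(children.map(&:text).join, children)
end

# Regexp-style: a single terminal node covering the whole run.
def match_rest_as_terminal(scanner)
  Node.new(scanner.scan(/.*/m), [])
end

# Count a node and all of its descendants.
def count_nodes(node)
  1 + node.children.sum { |child| count_nodes(child) }
end

garbage  = ' and ignore the following garbage'
per_char = match_rest_per_char(StringScanner.new(garbage))
terminal = match_rest_as_terminal(StringScanner.new(garbage))
```

Both versions capture the same text, but the per-character version allocates one node for every character of input.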
> 3) If yes, then is there a way to wrap '.*' in such a way that it is
> seen as a terminal and so mapped onto a single SyntaxNode?
I suggested an alternate type of rule, introduced by the keyword
"skip" instead of rule (but otherwise identical), that doesn't produce
a node, or one that's immediately discarded. The implementation is
so far an exercise for the reader ;-).
I have much more interest in implementing semantic predicates,
which will substantially improve the power of Treetop to resolve
the kinds of natural-language grammar I work with.
> Oh, btw, I also wrote this... very quick and dirty... but seemed
> reasonable...
>
> rule month
> 'January' / 'February' / 'March' / 'April' / 'May' / 'June' /
> 'July' / 'August' / 'September' / 'October' / 'November' / 'December'
> end
>
> I thought it'd map down onto a single Hash lookup... but actually
> it generates a long list of nested if token=literal statements... and
> the performance killer here is that the call to terminal_parse_failure
> creates a new object for every failed match.
An optimisation of this case is possible, but not in slightly more
general cases, where the terminal_parse_failure is needed to store the
memoization that allows PEG to work at faster than exponential speeds.
Clifford Heath.
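The point about failures needing to be memoized can be sketched in plain Ruby. MemoParser below is hypothetical and far simpler than what Treetop generates. It assumes a memo table keyed by literal and position, and it caches failed attempts as well as successful ones, so that retrying the same literal at the same position after backtracking costs only a hash lookup.

```ruby
# Hypothetical packrat-style memo table, keyed by [literal, position].
class MemoParser
  attr_reader :attempts

  def initialize(input)
    @input = input
    @memo = {}
    @attempts = 0 # number of real (uncached) match attempts
  end

  # Try a literal at pos. Returns the position after the match, or nil
  # on failure. The outcome is cached either way.
  def literal(word, pos)
    key = [word, pos]
    return @memo[key] if @memo.key?(key)

    @attempts += 1
    @memo[key] = @input[pos, word.length] == word ? pos + word.length : nil
  end
end

parser = MemoParser.new('March 2008')
first  = parser.literal('January', 0) # real attempt, fails
second = parser.literal('January', 0) # served from the memo table
third  = parser.literal('March', 0)   # real attempt, succeeds
```

Note the use of key? rather than a plain lookup: a cached failure is stored as nil, so the presence of the key is what marks the attempt as already made.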
>
> On 20/09/2008, at 2:01 PM, merr...@gmail.com wrote:
>>
>> rule sentence
>> date .*
>> end
>
> You don't need to do that for the root node. Just set the option
> that allows Treetop to succeed on just part of the input:
>
> parser.consume_all_input = false
Ah - perfect.
>
>> 1) Is there a better way to say 'and ignore the following garbage'?
>
> We need to add Regexp support, which will produce a single node.
> Nathan was somewhat opposed to it philosophically as the PEG
> algorithm is of the same *order* as a Regexp algo, but the constant
> multiplier and memory costs should trump that IMO.
Yes - that'd improve performance greatly.
> I have much more interest in implementing semantic predicates,
> which will substantially improve the power of Treetop to resolve
> the kinds of natural-language grammar I work with.
Like so?
months = { Jan, Feb, etc }
rule month
[a-z]+ { months.includes?[token] }
end
That'd also simplify the calling code and improve performance.
Thanks.
Yes, but taking a leaf out of ANTLR's book, and because
what you've shown is ambiguous, using }? to close the
predicate. The main implementation issue is that the
SyntaxNode has to be constructed to evaluate a predicate,
and that means it can't (easily) happen within a sequence,
only at the end.
It means you can look up symbol tables for things defined
earlier in the same sentence, for example, so increases the
power of the parser significantly.
Clifford Heath.
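As a rough illustration of the idea, and not of any Treetop syntax, here is a plain-Ruby sketch of a match that a predicate can veto. The helper match_with_predicate is hypothetical. The predicate is an ordinary block, so it can consult any table in scope, such as a set of month names or a symbol table filled in earlier in the parse.

```ruby
require 'strscan'
require 'set'

MONTH_NAMES = %w[January February March April May June July
                 August September October November December].to_set.freeze

# Hypothetical helper: match a pattern, then let the block accept or
# reject the matched text. A rejection rewinds the scanner.
def match_with_predicate(scanner, pattern)
  start = scanner.pos
  text = scanner.scan(pattern)
  return nil if text.nil?
  return text if yield(text)

  scanner.pos = start # predicate rejected the match: backtrack
  nil
end

good = StringScanner.new('March 2008')
bad  = StringScanner.new('Monday 2008')

month     = match_with_predicate(good, /[A-Za-z]+/) { |t| MONTH_NAMES.include?(t) }
not_month = match_with_predicate(bad, /[A-Za-z]+/) { |t| MONTH_NAMES.include?(t) }
```

A rejected match leaves the scanner where it started, so another alternative can be tried from the same position.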