performance (dot star, list of literal strings)

15 views
Skip to first unread message

merr...@gmail.com

unread,
Sep 20, 2008, 12:01:52 AM9/20/08
to Treetop Development

Hello,

I've been having fun with Treetop the past couple of days, but I've
run into some performance issues. I'm extracting dates from text,
which involves parsing many fragments of long strings looking for
matching bits of text. The root node of the grammar is this... a date,
followed by excess stuff...

rule sentence
date .*
end

Finding that this was considerably slower than I'd expected I took a
look at the code and saw that '.*' had been transformed into this.
Note that a new SyntaxNode is created for every character.

loop do
if index < input_length
r3 = (SyntaxNode).new(input, index...(index + 1))
@index += 1
else
terminal_parse_failure("any character")
r3 = nil
end
if r3
s2 << r3
else
break
end
end

So, questions...

1) Is there a better way to say 'and ignore the following garbage'?
2) If this is the best way, then does there really need to be a node
per character?
3) If yes, then is there a way to wrap '.*' in such a way that it is
seen as a terminal and so mapped onto a single SyntaxNode?

To answer my own questions a bit I took a look at the generator code
and I can see that this is all a result of the literal transformation
from grammar to code.... there's no support for any 'global'
optimizations :-( Perhaps the meta-grammar could first convert the
grammar into a parse tree that could be transformed by another grammar
into a tree of operations that could then be transformed into the
literal implementation code. Then you'd have the opportunity to spot
optimization opportunities like this. [Re-reading that I realise that
I'm channelling ANTLR, which I used once to write an XPath to database
query processor... so maybe my thinking is tainted a bit there as to
how you'd go about achieving this.]

Oh, btw, I also wrote this... very quick and dirty... but seemed
reasonable...

rule month
'January' / 'February' / 'March' / 'April' / 'May' / 'June' /
'July' / 'August' / 'September' / 'October' / 'November' / 'December'
end

I thought it'd map down onto a single Hash looking up... but actually
it generates a long list of nested if token=literal statement... and
the performance killer here is that the call to terminal_parse_failure
creates a new object of every failed match.

John


--
John Merrells
http://www.johnmerrells.com
+1.415.244.5808

Scott Taylor

unread,
Sep 20, 2008, 1:04:30 AM9/20/08
to treet...@googlegroups.com

On Sep 20, 2008, at 12:01 AM, merr...@gmail.com wrote:

>
>
> Hello,
>
> I've been having fun with Treetop the past couple of days, but I've
> run into some performance issues. I'm extracting dates from text,

Have you run any benchmarks? I suppose if you could demonstrate that
it's *terribly* slow someone would be willing to whip up a patch.

Scott

Clifford Heath

unread,
Sep 20, 2008, 3:35:58 AM9/20/08
to treet...@googlegroups.com
On 20/09/2008, at 2:01 PM, merr...@gmail.com wrote:
> The root node of the grammar is this... a date,
> followed by excess stuff...
>
> rule sentence
> date .*
> end

You don't need to do that for the root node. Just set the option
that allows Treetop to succeed on just part of the input:

parser.consume_all_input = false

> 1) Is there a better way to say 'and ignore the following garbage'?

We need to add Regexp support, which will produce a single node.
Nathan was somewhat opposed to it philosophically as the PEG
algorithm is of the same *order* as a Regexp algo, but the constant
multiplier and memory costs should trump that IMO.

> 3) If yes, then is there a way to wrap '.*' in such a way that it is
> seen as a terminal and so mapped onto a single SyntaxNode?

I suggested an alternate type of rule, introduced by the keyword
"skip" instead of rule (but otherwise identical), that doesn't produce
a node, or one that's immediately discarded. The implementation is
so far an exercise for the reader ;-).

I have much more interest in implementing semantic predicates,
which will substantially improve the power of Treetop to resolve
the kinds of natural-language grammar I work with.

> Oh, btw, I also wrote this... very quick and dirty... but seemed
> reasonable...
>
> rule month
> 'January' / 'February' / 'March' / 'April' / 'May' / 'June' /
> 'July' / 'August' / 'September' / 'October' / 'November' / 'December'
> end
>
> I thought it'd map down onto a single Hash looking up... but actually
> it generates a long list of nested if token=literal statement... and
> the performance killer here is that the call to terminal_parse_failure
> creates a new object of every failed match.

An optimisation of this case is possible, but not in slightly more
general
cases, where the terminal_parse_failure is needed to store the
memoization
that allows PEG to work at faster than exponential speeds.

Clifford Heath.

John Merrells

unread,
Sep 22, 2008, 2:40:13 AM9/22/08
to treet...@googlegroups.com

On Sep 20, 2008, at 12:35 AM, Clifford Heath wrote:

>
> On 20/09/2008, at 2:01 PM, merr...@gmail.com wrote:
>>
>> rule sentence
>> date .*
>> end
>
> You don't need to do that for the root node. Just set the option
> that allows Treetop to succeed on just part of the input:
>
> parser.consume_all_input = false

Ah - perfect.

>
>> 1) Is there a better way to say 'and ignore the following garbage'?
>
> We need to add Regexp support, which will produce a single node.
> Nathan was somewhat opposed to it philosophically as the PEG
> algorithm is of the same *order* as a Regexp algo, but the constant
> multiplier and memory costs should trump that IMO.

Yes - that'd improve performance greatly.

> I have much more interest in implementing semantic predicates,
> which will substantially improve the power of Treetop to resolve
> the kinds of natural-language grammar I work with.

Like so?

months = { Jan, Feb, etc }
rule month
[a-z]+ { months.includes?[token] }
end

That'd also simplify the calling code and improve performance.

Thanks.

Clifford Heath

unread,
Sep 22, 2008, 3:14:35 AM9/22/08
to treet...@googlegroups.com
On 22/09/2008, at 4:40 PM, John Merrells wrote:
>> I have much more interest in implementing semantic predicates,
>> which will substantially improve the power of Treetop to resolve
>> the kinds of natural-language grammar I work with.
> Like so?
> months = { Jan, Feb, etc }
> rule month
> [a-z]+ { months.includes?[token] }
> end

Yes, but taking a leaf out of ANTLR's book, and because
what you've shown is ambiguous, using }? to close the
predicate. The main implementation issue is that the
SyntaxNode has to be constructed to evaluated a predicate,
and that means it can't (easily) happen within a sequence,
only at the end.

It means you can look up symbol tables for things defined
earlier in the same sentence, for example, so increases the
power of the parser significantly.

Clifford Heath.

Tom M

unread,
Sep 22, 2008, 9:00:11 PM9/22/08
to Treetop Development
There's also the more obvious problem with approaches like this that
they are case sensitive. When I try things like this I tend to find
that my solutions are either brittle or end up really complex.
Reply all
Reply to author
Forward
0 new messages