Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
performance (dot star, list of literal strings)
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  6 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
merrells@gmail.com  
View profile  
 More options Sep 20 2008, 12:01 am
From: "merre...@gmail.com" <merre...@gmail.com>
Date: Fri, 19 Sep 2008 21:01:52 -0700 (PDT)
Local: Sat, Sep 20 2008 12:01 am
Subject: performance (dot star, list of literal strings)

Hello,

I've been having fun with Treetop the past couple of days, but I've
run into some performance issues. I'm extracting dates from text,
which involves parsing many fragments of long strings looking for
matching bits of text. The root node of the grammar is this... a date,
followed by excess stuff...

  rule sentence
    date .*
  end

Finding that this was considerably slower than I'd expected I took a
look at the code and saw that '.*' had been transformed into this.
Note that a new SyntaxNode is created for every character.

      loop do
        if index < input_length
          r3 = (SyntaxNode).new(input, index...(index + 1))
          @index += 1
        else
          terminal_parse_failure("any character")
          r3 = nil
        end
        if r3
          s2 << r3
        else
          break
        end
      end

So, questions...

1) Is there a better way to say 'and ignore the following garbage'?
2) If this is the best way, then does there really need to be a node
per character?
3) If yes, then is there a way to wrap '.*' in such a way that it is
seen as a terminal and so mapped onto a single SyntaxNode?

To answer my own questions a bit I took a look at the generator code
and I can see that this is all a result of the literal transformation
from grammar to code.... there's no support for any 'global'
optimizations :-( Perhaps the meta-grammar could first convert the
grammar into a parse tree that could be transformed by another grammar
into a tree of operations that could then be transformed into the
literal implementation code. Then you'd have the opportunity to spot
optimization opportunities like this.  [Re-reading that I realise that
I'm channelling ANTLR, which I used once to write an XPath to database
query processor... so maybe my thinking is tainted a bit there as to
how you'd go about achieving this.]

Oh, btw, I also wrote this... very quick and dirty... but seemed
reasonable...

  rule month
    'January' / 'February' / 'March' / 'April' / 'May' / 'June' /
'July' / 'August' / 'September' / 'October' / 'November' / 'December'
  end

I thought it'd map down onto a single Hash looking up... but actually
it generates a long list of nested if token=literal statement... and
the performance killer here is that the call to terminal_parse_failure
creates a new object of every failed match.

John

--
John Merrells
http://www.johnmerrells.com
+1.415.244.5808


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Scott Taylor  
View profile  
 More options Sep 20 2008, 1:04 am
From: Scott Taylor <sc...@railsnewbie.com>
Date: Sat, 20 Sep 2008 01:04:30 -0400
Local: Sat, Sep 20 2008 1:04 am
Subject: Re: performance (dot star, list of literal strings)

On Sep 20, 2008, at 12:01 AM, merre...@gmail.com wrote:

> Hello,

> I've been having fun with Treetop the past couple of days, but I've
> run into some performance issues. I'm extracting dates from text,

Have you run any benchmarks?  I suppose if you could demonstrate that  
it's *terribly* slow someone would be willing to whip up a patch.

Scott


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Clifford Heath  
View profile  
 More options Sep 20 2008, 3:35 am
From: Clifford Heath <clifford.he...@gmail.com>
Date: Sat, 20 Sep 2008 17:35:58 +1000
Local: Sat, Sep 20 2008 3:35 am
Subject: Re: performance (dot star, list of literal strings)
On 20/09/2008, at 2:01 PM, merre...@gmail.com wrote:

> The root node of the grammar is this... a date,
> followed by excess stuff...

>  rule sentence
>    date .*
>  end

You don't need to do that for the root node. Just set the option
that allows Treetop to succeed on just part of the input:

        parser.consume_all_input = false

> 1) Is there a better way to say 'and ignore the following garbage'?

We need to add Regexp support, which will produce a single node.
Nathan was somewhat opposed to it philosophically as the PEG
algorithm is of the same *order* as a Regexp algo, but the constant
multiplier and memory costs should trump that IMO.

> 3) If yes, then is there a way to wrap '.*' in such a way that it is
> seen as a terminal and so mapped onto a single SyntaxNode?

I suggested an alternate type of rule, introduced by the keyword
"skip" instead of rule (but otherwise identical), that doesn't produce
a node, or one that's immediately discarded. The implementation is
so far an exercise for the reader ;-).

I have much more interest in implementing semantic predicates,
which will substantially improve the power of Treetop to resolve
the kinds of natural-language grammar I work with.

> Oh, btw, I also wrote this... very quick and dirty... but seemed
> reasonable...

>  rule month
>    'January' / 'February' / 'March' / 'April' / 'May' / 'June' /
> 'July' / 'August' / 'September' / 'October' / 'November' / 'December'
>  end

> I thought it'd map down onto a single Hash looking up... but actually
> it generates a long list of nested if token=literal statement... and
> the performance killer here is that the call to terminal_parse_failure
> creates a new object of every failed match.

An optimisation of this case is possible, but not in slightly more  
general
cases, where the terminal_parse_failure is needed to store the  
memoization
that allows PEG to work at faster than exponential speeds.

Clifford Heath.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
John Merrells  
View profile  
 More options Sep 22 2008, 2:40 am
From: John Merrells <merre...@gmail.com>
Date: Sun, 21 Sep 2008 23:40:13 -0700
Local: Mon, Sep 22 2008 2:40 am
Subject: Re: performance (dot star, list of literal strings)

On Sep 20, 2008, at 12:35 AM, Clifford Heath wrote:

> On 20/09/2008, at 2:01 PM, merre...@gmail.com wrote:

>> rule sentence
>>   date .*
>> end

> You don't need to do that for the root node. Just set the option
> that allows Treetop to succeed on just part of the input:

>    parser.consume_all_input = false

Ah - perfect.

>> 1) Is there a better way to say 'and ignore the following garbage'?

> We need to add Regexp support, which will produce a single node.
> Nathan was somewhat opposed to it philosophically as the PEG
> algorithm is of the same *order* as a Regexp algo, but the constant
> multiplier and memory costs should trump that IMO.

Yes - that'd improve performance greatly.

> I have much more interest in implementing semantic predicates,
> which will substantially improve the power of Treetop to resolve
> the kinds of natural-language grammar I work with.

Like so?

months = { Jan, Feb, etc }
rule month
   [a-z]+ { months.includes?[token] }
end

That'd also simplify the calling code and improve performance.

Thanks.

John

--
John Merrells
http://www.johnmerrells.com
+1.415.244.5808


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Clifford Heath  
View profile  
 More options Sep 22 2008, 3:14 am
From: Clifford Heath <clifford.he...@gmail.com>
Date: Mon, 22 Sep 2008 17:14:35 +1000
Local: Mon, Sep 22 2008 3:14 am
Subject: Re: performance (dot star, list of literal strings)
On 22/09/2008, at 4:40 PM, John Merrells wrote:

>> I have much more interest in implementing semantic predicates,
>> which will substantially improve the power of Treetop to resolve
>> the kinds of natural-language grammar I work with.
> Like so?
> months = { Jan, Feb, etc }
> rule month
>   [a-z]+ { months.includes?[token] }
> end

Yes, but taking a leaf out of ANTLR's book, and because
what you've shown is ambiguous, using }? to close the
predicate. The main implementation issue is that the
SyntaxNode has to be constructed to evaluated a predicate,
and that means it can't (easily) happen within a sequence,
only at the end.

It means you can look up symbol tables for things defined
earlier in the same sentence, for example, so increases the
power of the parser significantly.

Clifford Heath.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tom M  
View profile  
 More options Sep 22 2008, 9:00 pm
From: Tom M <thomas.mack...@gmail.com>
Date: Mon, 22 Sep 2008 18:00:11 -0700 (PDT)
Local: Mon, Sep 22 2008 9:00 pm
Subject: Re: performance (dot star, list of literal strings)
There's also the more obvious problem with approaches like this that
they are case sensitive.  When I try things like this I tend to find
that my solutions are either brittle or end up really complex.

On Sep 22, 3:14 am, Clifford Heath <clifford.he...@gmail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »