Help with a grammar problem

Mike Diehl

unread,

Sep 1, 2009, 10:33:22 PM9/1/09

to recde...@perl.org

Hi all,

I'm struggling with an RD grammar problem and am hoping you can help.

I've got some data that is embedded inside a file and I need to parse only the
embedded data and leave the "noise" untouched.

For example:

afaf asf af <DELIMITER> command command command </DELIMITER> asdf asd qer f a

I want to parse the command(s), remove the DELIMITERS and preserve everything
else.

In the past, I've looped over the file with a regex looking for the delimeters
and then running RD on the text inside. However, the cost of launching
several instances of the parser is very expensive, about 80% of runtime.

I'd like to be able to use one parser and have it "do" the entire file.

What I've tried amounts to this:

chunk: /.*?/ delimiter_start command(s) delimiter_end /.*?/

However, I think the first regex is eating too much.

Any suggestions on how to do this?

TIA.

--

Take care and have fun,
Mike Diehl.

Ted Zlatanov

unread,

Sep 2, 2009, 6:56:28 AM9/2/09

to Mike Diehl, recde...@perl.org

On Tue, 1 Sep 2009 20:33:22 -0600 Mike Diehl <mdi...@diehlnet.com> wrote:

MD> Hi all,
MD> I'm struggling with an RD grammar problem and am hoping you can help.

MD> I've got some data that is embedded inside a file and I need to parse only the
MD> embedded data and leave the "noise" untouched.

MD> For example:

MD> afaf asf af <DELIMITER> command command command </DELIMITER> asdf asd qer f a

MD> I want to parse the command(s), remove the DELIMITERS and preserve everything
MD> else.

MD> In the past, I've looped over the file with a regex looking for the delimeters
MD> and then running RD on the text inside. However, the cost of launching
MD> several instances of the parser is very expensive, about 80% of runtime.

MD> I'd like to be able to use one parser and have it "do" the entire file.

MD> What I've tried amounts to this:

MD> chunk: /.*?/ delimiter_start command(s) delimiter_end /.*?/

MD> However, I think the first regex is eating too much.

MD> Any suggestions on how to do this?

This seems reasonable. Can you show a full runnable example that fails?

Ted

Damian Conway

unread,

Sep 3, 2009, 3:50:58 AM9/3/09

to Mike Diehl, recde...@perl.org

Hi Mike,

> What I've tried amounts to this:
>

> chunk: /.*?/ delimiter_start command(s) delimiter_end /.*?/

Unfortunately that won't work, because every regex in a PRD grammar is
independent of the rest of the grammar, so even a minimal-matching .*?
eats everything.

Is there some reason you can't use something like:

my $parser = Parse::RecDescent->new($grammar);

$text =~ s{<DELIMITER> (.*?) </DELIMITER>}
{ $parser->parse($1); q{} }gexs;

???

Damian

Mike Diehl

unread,

Sep 3, 2009, 11:50:33 AM9/3/09

to Damian Conway, recde...@perl.org

On Thursday 03 September 2009 01:50:58 Damian Conway wrote:
> Hi Mike,
>
> > What I've tried amounts to this:
> >
> > chunk: /.*?/ delimiter_start command(s) delimiter_end /.*?/
>
> Unfortunately that won't work, because every regex in a PRD grammar is
> independent of the rest of the grammar, so even a minimal-matching .*?
> eats everything.

Ya, that's what I was suspecting. In hind sight, I should have figured that;
that's how I'd write it...

> Is there some reason you can't use something like:
>
> my $parser = Parse::RecDescent->new($grammar);
>
> $text =~ s{<DELIMITER> (.*?) </DELIMITER>}
> { $parser->parse($1); q{} }gexs;

That's what I was doing, but it seems I misinterpreted my profiling results.
I found from profiling that the function I use to create (once) and run the
parser accounted for 80% of runtime.

I assumed that since I only create the parser once (if !defined), creating the
parser wasn't where the cost was. So I decided that it must be due to
actually running the parser, which might run several times during program
execution. My conclusion was that I needed to rewrite the grammar so that
the parser would only run once.

It sounds like I may need to go back to the old algorithm and start tuning the
grammar.

Matthew Braid

unread,

Sep 3, 2009, 6:22:51 PM9/3/09

to Mike Diehl, Damian Conway, recde...@perl.org

Hi all,

Would there be some way of manipulating the skip re to do this?

Something along the lines of:

top: <skip: /NOT START DELIMETER/> chunk(s) eof
chunk: delimeter_start <skip: /NORMAL SKIP/> command(s) delimiter_end
eof: /\Z/

The problem there is defining a skip that won't skip a
delimeter_start. This probably won't allow delimeter_start to _not_
mean the start of a set of commands as well.

Not tested, but just a suggestion.

MB

2009/9/4 Mike Diehl <mdi...@diehlnet.com>:

Matthew Braid

unread,

Sep 6, 2009, 7:46:48 PM9/6/09

to Mike Diehl, Damian Conway, recde...@perl.org

Sorry, replying to myself, but I just stumbled across a similar
situation and my solution might help you too.

I needed to define a block like this:

perl until FLAG
PERL
FLAG;

which is like a 'here-doc' for inlining perl in another language that
doesn't require actually parsing the perl code. Like your input, I
need to match 'anything' up to the closing flag. I ended up using a
rule similar to your original solution, except instead of having a
/.*?/ match, I combined that with the next terminal. After playing
around a bit, I came up with the following test script that parses out
all valid chunks between 'START' and 'END' amongst other rubbish in
the input in one pass:

====== START CODE ======

use Parse::RecDescent;
use Data::Dumper;
#$::RD_TRACE = 1;

# assuming start/end delimiters of START and END
my $grammar = <<'STOP';

start:
chunk(s?)

chunk:
/.*?START/s command(s) 'END' # This is the important bit
{$item[2]}

command:
'test' ';'
{"TEST COMMAND"}

STOP

my $text = << 'STOP';

blah blah blayh

asdsd kjkl

START
test;
test;
END

kjsaljdlk
askd

START
test;
END

sad
asdgfdsf
gfsfg

STOP

my $res = Parse::RecDescent->new($grammar)->start($text);
print Data::Dumper::Dumper($res), "\n";

====== END CODE ======

Note that the /s modifier on the 'garbage scooping' re's is important
for this to work. Was scratching my head over that for a bit :)

The output of that is:

====== START OUTPUT ======

$VAR1 = [
[
'TEST COMMAND',
'TEST COMMAND'
],
[
'TEST COMMAND'
]
];

====== END OUTPUT ======

I haven't done any benchmarking, but that might be faster than
sequential parses of 'clean' data. My original solution anchored to
the end of the input with an eof marker and a 'trailing_guff' rule
that matched anything after the chunk(s?) subrule, but that is
unnecessary.

MB

2009/9/4 Matthew Braid <matt...@gmail.com>: