I'm struggling with an RD grammar problem and am hoping you can help.
I've got some data that is embedded inside a file and I need to parse only the
embedded data and leave the "noise" untouched.
For example:
afaf asf af <DELIMITER> command command command </DELIMITER> asdf asd qer f a
I want to parse the command(s), remove the DELIMITERS and preserve everything
else.
In the past, I've looped over the file with a regex looking for the delimiters
and then run RD on the text inside. However, launching several instances of
the parser is very expensive, about 80% of runtime.
I'd like to be able to use one parser and have it "do" the entire file.
What I've tried amounts to this:
chunk: /.*?/ delimiter_start command(s) delimiter_end /.*?/
However, I think the first regex is eating too much.
Any suggestions on how to do this?
TIA.
--
Take care and have fun,
Mike Diehl.
MD> What I've tried amounts to this:
MD> chunk: /.*?/ delimiter_start command(s) delimiter_end /.*?/
MD> However, I think the first regex is eating too much.
MD> Any suggestions on how to do this?
This seems reasonable. Can you show a full runnable example that fails?
Ted
> What I've tried amounts to this:
>
> chunk: /.*?/ delimiter_start command(s) delimiter_end /.*?/
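Unfortunately that won't work, because every regex in a PRD grammar is
matched independently of the rest of the grammar. A minimal-matching .*?
therefore can't stop "just before delimiter_start": on its own it is
satisfied by the empty string, so it never lines the parser up with the
delimiter.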
Is there some reason you can't use something like:
my $parser = Parse::RecDescent->new($grammar);
$text =~ s{<DELIMITER> (.*?) </DELIMITER>}
{ $parser->parse($1); q{} }gexs;
???
Damian
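For anyone who wants to try the substitution approach without a grammar to hand, here is a minimal sketch. The handle_commands() sub is a hypothetical stand-in for the call into the parser (it is not part of Parse::RecDescent), and the sample text is the one from the original question.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the call into the parser: it just records the
# commands found in each delimited block.
my @blocks;
sub handle_commands {
    my ($body) = @_;
    push @blocks, [ split ' ', $body ];
    return q{};    # replacement text: delimiters and commands disappear
}

my $text = 'afaf asf af <DELIMITER> command command command </DELIMITER> asdf asd qer f a';

# /x allows readable whitespace in the pattern, /s lets '.' span newlines,
# /e evaluates the replacement as code, /g repeats for every block.
$text =~ s{<DELIMITER> (.*?) </DELIMITER>}{ handle_commands($1) }gexs;

print "$text\n";
print scalar(@blocks), " block(s) handled\n";
```

Because of /x the spaces in the pattern are not matched literally, so the whitespace around the commands ends up inside $1 and the surrounding noise is left exactly as it was.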
Ya, that's what I was suspecting. In hindsight, I should have figured that;
that's how I'd write it...
> Is there some reason you can't use something like:
>
> my $parser = Parse::RecDescent->new($grammar);
>
> $text =~ s{<DELIMITER> (.*?) </DELIMITER>}
> { $parser->parse($1); q{} }gexs;
That's what I was doing, but it seems I misinterpreted my profiling results.
I found from profiling that the function I use to create (once) and run the
parser accounted for 80% of runtime.
I assumed that since I only create the parser once (if !defined), creating the
parser wasn't where the cost was. So I decided that it must be due to
actually running the parser, which might run several times during program
execution. My conclusion was that I needed to rewrite the grammar so that
the parser would only run once.
It sounds like I may need to go back to the old algorithm and start tuning the
grammar.
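One way to separate the two costs described above is to time construction and parsing independently inside the "create once, then run" function. This is only a sketch: build_parser() is a hypothetical stand-in for Parse::RecDescent->new($grammar), and the code ref it returns stands in for the start rule.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical stand-in for Parse::RecDescent->new($grammar): returns a
# code ref that plays the role of the parser's start rule.
my $builds = 0;
sub build_parser {
    $builds++;
    return sub { my @words = split ' ', $_[0]; return scalar @words };
}

my %elapsed = (build => 0, run => 0);
my $parser;

# Same shape as a "create once, then run" function, but with the two
# phases timed separately so a single profiler total can be split apart.
sub parse_block {
    my ($body) = @_;

    if (!defined $parser) {
        my $t0 = time();
        $parser = build_parser();
        $elapsed{build} += time() - $t0;
    }

    my $t0 = time();
    my $result = $parser->($body);
    $elapsed{run} += time() - $t0;

    return $result;
}

my @counts = map { parse_block($_) }
    ('command command command', 'command', 'command command');

printf "build: %.6fs over %d construction(s)\n", $elapsed{build}, $builds;
printf "run:   %.6fs over %d call(s)\n", $elapsed{run}, scalar @counts;
```

With the real parser dropped in, the two totals should show whether the 80% belongs to building the parser or to running the grammar.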
Would there be some way of manipulating the skip re to do this?
Something along the lines of:
top: <skip: /NOT START DELIMITER/> chunk(s) eof
chunk: delimiter_start <skip: /NORMAL SKIP/> command(s) delimiter_end
eof: /\Z/
The problem there is defining a skip pattern that won't also skip a
delimiter_start. It also probably won't allow a delimiter_start to _not_
mean the start of a set of commands.
Not tested, but just a suggestion.
MB
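As a rough idea of what a "NOT START DELIMITER" pattern could look like, here is a plain-Perl sketch using a negative lookahead. It assumes the literal delimiter <DELIMITER> from the original example and only exercises the regex itself; whether it behaves as hoped inside a <skip:...> directive is untested.

```perl
use strict;
use warnings;

# Candidate "NOT START DELIMITER" pattern: consume any character (newlines
# included, via the inline (?s) flag) that does not begin '<DELIMITER>'.
my $noise = qr/(?s:(?:(?!<DELIMITER>).)*)/;

my $text = "afaf asf\naf <DELIMITER> command </DELIMITER> asdf";

# Match the noise at the start of the text, then see what is left over.
my ($skipped) = $text =~ /\A($noise)/;
my $rest = substr $text, length $skipped;

print "skipped: [$skipped]\n";
print "rest:    [$rest]\n";
```

The lookahead is checked before every character, so the match stops right in front of the opening delimiter instead of swallowing it.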
2009/9/4 Mike Diehl <mdi...@diehlnet.com>:
I needed to define a block like this:
perl until FLAG
PERL
FLAG;
which is like a 'here-doc' for inlining perl in another language that
doesn't require actually parsing the perl code. Like your input, I
need to match 'anything' up to the closing flag. I ended up using a
rule similar to your original solution, except instead of having a
/.*?/ match, I combined that with the next terminal. After playing
around a bit, I came up with the following test script that parses out
all valid chunks between 'START' and 'END' amongst other rubbish in
the input in one pass:
====== START CODE ======
use Parse::RecDescent;
use Data::Dumper;
#$::RD_TRACE = 1;
# assuming start/end delimiters of START and END
my $grammar = <<'STOP';
start:
    chunk(s?)

chunk:
    /.*?START/s command(s) 'END'    # This is the important bit
    {$item[2]}

command:
    'test' ';'
    {"TEST COMMAND"}
STOP
my $text = << 'STOP';
blah blah blayh
asdsd kjkl
START
test;
test;
END
kjsaljdlk
askd
START
test;
END
sad
asdgfdsf
gfsfg
STOP
my $res = Parse::RecDescent->new($grammar)->start($text);
print Data::Dumper::Dumper($res), "\n";
====== END CODE ======
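Note that the /s modifier on the 'garbage scooping' regex is important
for this to work: without it, '.' does not match a newline, so /.*?START/
cannot reach a START that sits on a later line. Was scratching my head
over that for a bit :)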
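The difference can be checked in plain Perl, outside any grammar. The sketch below assumes the same START marker and anchors each pattern at the start of the text with \A, which is roughly how a regex terminal is tried at the current parse position.

```perl
use strict;
use warnings;

my $text = "blah blah\nasdsd\nSTART\ntest;\nEND\n";

# A bare minimal match anchored at the current position is satisfied by
# the empty string, so it consumes none of the leading noise.
my ($bare) = $text =~ /\A(.*?)/s;

# Without /s, '.' cannot cross the newlines in front of START, so the
# combined pattern fails to match at all.
my $without_s = $text =~ /\A.*?START/ ? 1 : 0;

# With /s the same pattern scoops up everything through START.
my ($with_s) = $text =~ /\A(.*?START)/s;

printf "bare match length:  %d\n", length $bare;
printf "matches without /s: %s\n", $without_s ? 'yes' : 'no';
printf "scooped with /s:    %d characters\n", length $with_s;
```

Folding the delimiter into the same regex is what gives the minimal match something to stop at; the /s is what lets it get there across line breaks.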
The output of that is:
====== START OUTPUT ======
$VAR1 = [
          [
            'TEST COMMAND',
            'TEST COMMAND'
          ],
          [
            'TEST COMMAND'
          ]
        ];
====== END OUTPUT ======
I haven't done any benchmarking, but that might be faster than
sequential parses of 'clean' data. My original solution anchored to
the end of the input with an eof marker and a 'trailing_guff' rule
that matched anything after the chunk(s?) subrule, but that is
unnecessary.
MB
2009/9/4 Matthew Braid <matt...@gmail.com>: