But if you want to re-use the match tree for something different (say,
syntax highlighting instead of semantic analysis), it's rather hard to
reconstruct the original text and to tell which part of it
was matched by which subrule. Currently you have to fiddle with $/.from
and $/.to, and sort all subrules by their respective $/.from and $/.to,
and then work out which part hasn't been matched by subrules.
This is a rather weird and error-prone procedure, and I wonder if we
should provide some easier way to access all chunks of text in the order
that they were matched.
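To make the from/to bookkeeping concrete, here is a rough sketch of that
manual procedure. It is written in Python rather than Perl 6 because the
Perl 6 syntax in this thread is still in flux; Python's m.start() and
m.end() stand in for $/.from and $/.to. The helper name chunks_from_spans
and the sample pattern are made up for illustration.

```python
import re

def chunks_from_spans(match):
    """Return (name, text) pairs covering the whole match in text order.

    name is the capture's name, or None for text that no capture claimed.
    """
    text = match.string
    # Collect (from, to, name) for every named group that took part.
    spans = [
        (match.start(name), match.end(name), name)
        for name in match.re.groupindex
        if match.start(name) != -1
    ]
    # Sort by start; for equal starts put the longer (outer) capture first.
    spans.sort(key=lambda s: (s[0], -s[1]))

    chunks = []
    pos = match.start()
    for start, end, name in spans:
        if start < pos:
            continue  # nested inside a capture that was already emitted
        if start > pos:
            chunks.append((None, text[pos:start]))  # uncaptured gap
        chunks.append((name, text[start:end]))
        pos = end
    if pos < match.end():
        chunks.append((None, text[pos:match.end()]))  # trailing gap
    return chunks

m = re.search(r"(?P<ident>[a-z]+)\s+(?P<num>\d+)\s+for\s+(?P<last>\d+)",
              "abc 234 for 456")
print(chunks_from_spans(m))
```

Note that the uncaptured ' for ' comes back as one lump: the spans alone
cannot tell the literal from the surrounding whitespace.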
I guess this description isn't very clear, so I'll try with an example:
"abc 234 def 789 for 456" ~~ mm/ [ <ident> \d+ ]**0..2 'for' (\d+) /;
$/.chunks would be this list:
    $<ident>[0],
    ' ',
    '234',
    ' ',
    $<ident>[1],
    ' ',
    '789',
    ' ',
    'for',
    ' ',
    '456'
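As a rough illustration of where such a list could come from, the sketch
below hand-codes a matcher for this one pattern in Python and records each
atom as it matches. Everything here is an assumption for illustration: the
names take and match_with_chunks are invented, whitespace between atoms is
treated as mandatory, and the quantifier is greedy with no backtracking
into earlier iterations, which is simpler than what a real regex engine
does.

```python
import re

IDENT = re.compile(r"[A-Za-z_]\w*")
DIGITS = re.compile(r"\d+")
WS = re.compile(r"\s+")
FOR = re.compile(r"for")

def take(regex, label, text, pos, chunks):
    """Match regex at pos; on success record (label, text) and return the new pos."""
    m = regex.match(text, pos)
    if m is None:
        return None
    chunks.append((label, m.group()))
    return m.end()

def match_with_chunks(text):
    # Stand-in for:  [ <ident> \d+ ]**0..2 'for' (\d+)
    chunks, pos = [], 0
    for i in range(2):                       # the **0..2 quantifier
        trial, p = list(chunks), pos         # work on a copy so we can roll back
        for regex, label in ((IDENT, f"ident[{i}]"), (WS, None),
                             (DIGITS, None), (WS, None)):
            p = take(regex, label, text, p, trial)
            if p is None:
                break
        if p is None:
            break                            # iteration failed: discard the copy
        chunks, pos = trial, p
    for regex, label in ((FOR, None), (WS, None), (DIGITS, None)):
        pos = take(regex, label, text, pos, chunks)
        if pos is None:
            return None                      # overall match failed
    return chunks

chunks = match_with_chunks("abc 234 def 789 for 456")
print([text for _, text in chunks])
```

The labels are kept next to the strings, so a consumer can still tell
which chunks came from <ident> and which were anonymous.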
I don't know if the syntax and exact semantics are very good, but IMHO
we should have some way of reconstructing a match that is closer to the
original string than to the structure of the matching regex.
(I also don't know if that's feasible in terms of efficiency)
Any ideas?
Moritz
--
Moritz Lenz
http://perlgeek.de/ | http://perl-6.de/ | http://sudokugarden.de/
Perhaps aliases...?
    m/ <this>+ <that>? <andthen=this>* /
This is probably not exactly what you're looking for, but
that would be what I would look at for this specific example.
Pm
I'm looking more for a general solution for which you don't have to
manipulate the rule itself, and which should ideally work with as little
knowledge of the rule as possible.
Just look at the hoops that STD5_dump_match (in the same dir as STD.pm)
has to jump through to get hold of the parse tree in the right order.
Larry
Yes, funny thing is I was just thinking about the same thing this
morning after Mitchell Charity noticed that elsifs were missing
from the tree. It will be relatively trivial to do this with STD,
since it already produces a general mapping from position to hash,
which it uses to cache whitespace matches and line numbers, but could
easily record what matched where. (See the .<_> hash for that.)
In my case, I was wanting to find the set of non-whitespace things
that are parsed but don't end up in the parse tree. Maybe the :keepall
modifier needs access to something like this as well.
It may also let me remove the kludge whereby ~ remembers the delimiters
on either side.
It could also revolutionize the implementation of split. :)
My big question is how best to make this ordered info available within
a Match, given that we currently use the Positional role for something
else. An argument could be made that this info is more important than
revealing $0,$1 etc at the top level of the Match, that is, that split
semantics are more natural than comb semantics for @($/). One data
point is that the STD grammar uses very little $0 and then only as
a named parameter that happens to have a numeric name. So we could
easily demote $0 etc to meaning $/.numbered[0] or some such. Of course,
it goes the other way too, and we can reveal the splits via a .split
method or some such. Plus we can have multiple levels of splitting
semantics, so then *they'd* be fighting over Positional if we made
one of them default.
So I'm thinking @($/) stays the way it is, but .splits might return
the top-level splits for a given rule, where strings are intermixed
with child tree nodes, whereas something like .allsplits might return
all the ordered strings along with mappings to what parsed them.
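To pin down the difference between the two views, here is a small Python
sketch. The Node class and the functions splits and allsplits are purely
hypothetical stand-ins for whatever a Match would actually expose; the
only point is the shape of the two return values.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Node:
    rule: str                                   # name of the rule that matched
    parts: List[Union[str, "Node"]] = field(default_factory=list)

def splits(node):
    """Top-level view: plain strings intermixed with child nodes."""
    return list(node.parts)

def allsplits(node, path=()):
    """Flattened view: every string in order, with the rules that parsed it."""
    path = path + (node.rule,)
    out = []
    for part in node.parts:
        if isinstance(part, Node):
            out.extend(allsplits(part, path))   # descend into the child
        else:
            out.append((part, path))
    return out

tree = Node("statement", [
    Node("ident", ["abc"]), " ",
    Node("number", ["234"]), " for ",
    Node("number", ["456"]),
])

print(splits(tree))      # three child nodes and two plain strings
print(allsplits(tree))   # five (string, rule-path) pairs
```

Joining the strings of either view gives back the original text, which is
the property the syntax-highlighting use case depends on.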
If we did that, then there's the question of whether .splits needs to
run the pattern lazily so that we can do a limited /':'/.splits(4)
and such. That may turn out to be abuse of the lazy system though.
And technically, that regex *isn't* binding the colons to a child
node, so there's a little semantic mismatch there as well, since a
split implemented in terms of .splits would look more like /.*?(':')/.
So maybe .splits is the wrong name. Suggestions welcome.
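For comparison, this is roughly what a lazy, limited split looks like when
built on an on-demand matcher, sketched in Python. re.finditer scans only
as far as it is asked to; the function name lazy_splits and its limit
parameter are made up here, and the sketch assumes the separator never
matches the empty string.

```python
import re

def lazy_splits(pattern, text, limit=None):
    """Yield fields and separators alternately, matching only on demand.

    With a limit, at most `limit` fields are produced and the last one
    keeps the unsplit remainder.
    """
    pos = 0
    fields = 0
    for m in re.finditer(pattern, text):
        if limit is not None and fields == limit - 1:
            break                      # stop matching; the rest stays whole
        yield text[pos:m.start()]      # field before the separator
        yield m.group()                # the separator itself
        pos = m.end()
        fields += 1
    yield text[pos:]                   # final field (or remainder)

parts = list(lazy_splits(":", "a:b:c:d:e:f", 4))
print(parts)          # fields and separators interleaved
print(parts[::2])     # just the fields, as an ordinary split would give
```

The separators show up in the stream even though the pattern does not
capture them, which is the semantic mismatch mentioned above.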
The cool thing about .allsplits is that if you're doing, say, syntax
highlighting on the fly in an editor, it might be relatively easy to
run down the list and determine top-level nodes that limit how much
needs to be reparsed. Contrariwise, with the "fate" system of STD it
might even be relatively easy to put the parser back into a state
that was deeply recursive and restart the parse at any point.
'Course, "relatively easy" is one o' them relative concepts... :)
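A very rough Python sketch of the "run down the list" step, assuming the
top-level chunks are available as plain strings in order. The names
chunk_starts and chunk_to_reparse are invented, and the sketch ignores the
hard part: an edit can change how neighbouring chunks parse.

```python
from bisect import bisect_right
from itertools import accumulate

def chunk_starts(chunks):
    """Start offset of each chunk, given the chunk texts in order."""
    ends = list(accumulate(len(c) for c in chunks))
    return [0] + ends[:-1]

def chunk_to_reparse(chunks, edit_pos):
    """Index of the top-level chunk that contains offset edit_pos."""
    starts = chunk_starts(chunks)
    # The last chunk whose start is <= edit_pos is the one holding the edit.
    return bisect_right(starts, edit_pos) - 1

top_level = ["my $x = 1;", "\n", "say $x;"]
print(chunk_starts(top_level))
print(chunk_to_reparse(top_level, 13))
```

Only the chunk found this way would need to be handed back to the parser,
as long as the reparse ends at the same boundary as before.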
Larry
That’s an intriguing observation. Another case for having some
XPath-ish facility in the language?
Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>