Store captures and non-captures in source-string order

Moritz Lenz

Oct 12, 2008, 5:44:05 AM

to Perl6

When we write regexes, we generally capture stuff in a way that makes
the following semantic analysis easier. For example, we could have a
regex m/ <this>+ <that>? <this>*/ if we're only interested in the match
trees of what <this> and <that> match, not their respective order.

But if you want to reuse the match tree for something different (say,
syntax highlighting instead of semantic analysis), it's rather hard to
reconstruct the original text and to tell which part of it was matched
by which subrule. Currently you have to fiddle with $/.from and $/.to,
sort all subrules by their respective $/.from and $/.to, and then work
out which parts haven't been matched by any subrule.
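
To make that manual procedure concrete, here is a minimal sketch in
Python rather than Perl 6. It assumes the capture positions have already
been collected as (name, from, to) triples; the function name chunks, the
'~' label for uncaptured text, and the hand-written offsets are all
illustrative assumptions, not output of any real Perl 6 engine.

```python
from typing import List, Tuple

# (capture name, from, to) with an end-exclusive "to" offset
Span = Tuple[str, int, int]


def chunks(text: str, start: int, end: int,
           captures: List[Span]) -> List[Tuple[str, str]]:
    """Return (label, substring) pairs in source-string order.

    Captured text is labelled with its capture name; text that no
    capture covers is labelled '~' (an arbitrary marker for this sketch).
    """
    result = []
    pos = start
    # Sort captures by where they begin, then by where they end.
    for name, frm, to in sorted(captures, key=lambda c: (c[1], c[2])):
        if frm < pos:
            raise ValueError(f"capture {name!r} overlaps an earlier one")
        if frm > pos:
            # Gap between the previous capture and this one.
            result.append(("~", text[pos:frm]))
        result.append((name, text[frm:to]))
        pos = to
    if pos < end:
        # Trailing text after the last capture.
        result.append(("~", text[pos:end]))
    return result


text = "abc 234 def 789 for 456"
# Offsets written by hand for illustration, deliberately out of order.
captures = [("0", 20, 23), ("ident", 0, 3), ("ident", 8, 11)]
ordered = chunks(text, 0, len(text), captures)
```

Note that this sketch lumps all uncaptured text between two captures
into a single piece (for instance ' 234 '); it cannot tell which atom of
the regex matched which part of the gap, because that information is not
present in the capture positions alone.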

This is a rather awkward and error-prone procedure, and I wonder if we
should provide some easier way to access all chunks of text in the order
that they were matched.

I guess this description isn't very clear, so I'll try with an example:

"abc 234 def 789 for 456" ~~ mm/ [ <ident> \d+ ]**0..2 'for' (\d+) /;
$/.chunks would be this list:

$<ident>[0],
' ',
'234',
' ',
$<ident>[1],
' ',
'789',
' ',
'for',
' ',
'456'

I don't know if the syntax and exact semantics are very good, but IMHO
we should have some way of reconstructing a match that is closer to the
original string than to the structure of the matching regex.

(I also don't know if that's feasible in terms of efficiency)

Any ideas?

Moritz

--
Moritz Lenz
http://perlgeek.de/ | http://perl-6.de/ | http://sudokugarden.de/

Patrick R. Michaud

Oct 12, 2008, 11:08:50 AM

to Moritz Lenz, Perl6

On Sun, Oct 12, 2008 at 11:44:05AM +0200, Moritz Lenz wrote:
> When we write regexes, we generally capture stuff in a way that makes
> the following semantic analysis easier. For example, we could have a
> regex m/ <this>+ <that>? <this>*/ if we're only interested in the match
> trees of what <this> and <that> match, not their respective order.

> [...]

> But if you want to reuse the match tree for something different (say,
> syntax highlighting instead of semantic analysis), it's rather hard to
> reconstruct the original text and to tell which part of it was matched
> by which subrule.

Perhaps aliases...?

m/ <this>+ <that>? <andthen=this>* /

This is probably not exactly what you're looking for, but
that would be what I would look at for this specific example.

Pm

Moritz Lenz

Oct 12, 2008, 11:34:49 AM

to Patrick R. Michaud, Perl6

I'm looking more for a general solution for which you don't have to
manipulate the rule itself, and which should ideally work with as little
knowledge of the rule as possible.

Just look at the hoops that STD5_dump_match (in the same dir as STD.pm)
has to jump through to get hold of the parse tree in the right order.

Larry Wall

Oct 13, 2008, 12:54:36 PM

to perl...@perl.org, Perl6

Or maybe we're not thinking big enough here. Maybe we're looking at
a generalized tree query language that, as limiting cases, defines the
.splits and .allsplits as (re)linearized query results, where .splits
linearizes the top level nodes, and .allsplits linearizes the leaves,
but many intermediate linearizations are possible. Don't want to
get stuck into binary thinking here...

Larry

Larry Wall

Oct 13, 2008, 12:47:30 PM

to perl...@perl.org, Perl6

Yes, funny thing is I was just thinking about the same thing this
morning after Mitchell Charity noticed that elsifs were missing
from the tree. It will be relatively trivial to do this with STD,
since it already produces a general mapping from position to hash,
which it uses to cache whitespace matches and line numbers, but could
easily record what matched where. (See the .<_> hash for that.)
In my case, I was wanting to find the set of non-whitespace things
that are parsed but don't end up in the parse tree. Maybe the :keepall
modifier needs access to something like this as well.

It may also let me remove the kludge whereby ~ remembers the delimiters
on either side.

It could also revolutionize the implementation of split. :)

My big question is how best to make this ordered info available within
a Match, given that we currently use the Positional role for something
else. An argument could be made that this info is more important than
revealing $0,$1 etc at the top level of the Match, that is, that split
semantics are more natural than comb semantics for @($/). One data
point is that the STD grammar uses very little $0 and then only as
a named parameter that happens to have a numeric name. So we could
easily demote $0 etc to meaning $/.numbered[0] or some such. Of course,
it goes the other way too, and we can reveal the splits via a .split
method or some such. Plus we can have multiple levels of splitting
semantics, so then *they'd* be fighting over Positional if we made
one of them default.

So I'm thinking @($/) stays the way it is, but .splits might return
the top-level splits for a given rule, where strings are intermixed
with child tree nodes, whereas something like .allsplits might return
all the ordered strings along with mappings to what parsed them.
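
As a rough illustration of the difference between those two levels,
here is a small Python sketch over a toy parse tree. The Node class and
the example tree are invented for this sketch, and the functions only
borrow the tentative names .splits and .allsplits; none of this is an
existing Perl 6 API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Node:
    """Toy parse-tree node: a rule name plus its parts in source order."""
    rule: str
    parts: List[Union[str, "Node"]] = field(default_factory=list)

    def text(self) -> str:
        # Reassemble the source text covered by this node.
        return "".join(p if isinstance(p, str) else p.text()
                       for p in self.parts)


def splits(node: Node) -> List[Union[str, Node]]:
    """Top-level view: plain strings intermixed with child nodes."""
    return list(node.parts)


def allsplits(node: Node,
              path: Tuple[str, ...] = ()) -> List[Tuple[str, Tuple[str, ...]]]:
    """Leaf view: every string, paired with the rules that parsed it."""
    path = path + (node.rule,)
    out = []
    for part in node.parts:
        if isinstance(part, str):
            out.append((part, path))
        else:
            out.extend(allsplits(part, path))
    return out


# Hypothetical tree for the text "abc 234 for 456".
tree = Node("TOP", [Node("ident", ["abc"]), " ", "234", " for ",
                    Node("num", ["456"])])
top_level = splits(tree)
leaves = allsplits(tree)
```

In this toy model, both views concatenate back to the original text;
they differ only in how deep the linearization goes, which is where
intermediate levels would fit in.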

If we did that, then there's the question of whether .splits needs to
run the pattern lazily so that we can do a limited /':'/.splits(4)
and such. That may turn out to be abuse of the lazy system though.
And technically, that regex *isn't* binding the colons to a child
node, so there's a little semantic mismatch there as well, since a
split implemented in terms of .splits would look more like /.*?(':')/.
So maybe .splits is the wrong name. Suggestions welcome.

The cool thing about .allsplits is that if you're doing, say, syntax
highlighting on the fly in an editor, it might be relatively easy to
run down the list and determine top-level nodes that limit how much
needs to be reparsed. Contrariwise, with the "fate" system of STD it
might even be relatively easy to put the parser back into a state
that was deeply recursive and restart the parse at any point.

'Course, "relatively easy" is one o' them relative concepts... :)

Larry

Aristotle Pagaltzis

Oct 13, 2008, 1:46:50 PM

to perl6-l...@perl.org

* Larry Wall <la...@wall.org> [2008-10-13 19:00]:

> Maybe we're looking at a generalized tree query language

That’s an intriguing observation. Another case for having some
XPath-ish facility in the language?

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>