After reading over Apocalypse 5 one more time, I noticed that balanced
matches (like capturing nested parenthetical comments ((like this))) had
been glossed over in the rejection of RFC 145. What was not even
mentioned in the rejection was the possibility of balanced expressions
that would take rules as their opening and closing delimiters. This
would be especially useful, for example, when capturing nested tables in
an HTML document, since not all tables look the same (<table> vs. <tAbLE
attrs...> for instance). You may object that this would just make the
regexp uglier, but what happens if we allow XML-ish rules, e.g.
$html =~ /<balanced opening=<table_start> closing=<table_end>>/;
where the "balanced" rule gets to play with %_{opening} and %_{closing}
to do its magic?
I am not saying that such a "balanced" rule would be easy to implement
in Perl (I personally think that the "balanced" rule is something that
should be more deeply tied to the Regex Engine), but I am proposing that
it can simultaneously be very useful and still look nice. Isn't that
justification enough?
Comments are appreciated,
Peter Behroozi
rule parenthesized { \( ( <-[()]> | <parenthesized> )* \) }
The key to balanced delimiters is recursion. A5 gives us convenient
recursion; therefore, it gives us balanced delimiters.
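
As a rough illustration of the recursive idea, here is a minimal Python sketch rather than Perl 6 rule syntax. The function name match_parenthesized is hypothetical, and the sketch assumes the body of the group is zero or more items, each either a non-parenthesis character or a nested group.

```python
def match_parenthesized(text, pos=0):
    """Return the index just past a balanced (...) group starting at pos, or None."""
    if pos >= len(text) or text[pos] != "(":
        return None                    # no opening delimiter here
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == ")":
            return i + 1               # closing delimiter: the group is complete
        if ch == "(":
            # Nested group: recurse, playing the role of <parenthesized>.
            end = match_parenthesized(text, i)
            if end is None:
                return None
            i = end
        else:
            i += 1                     # any other character, like <-[()]>
    return None                        # input ended before the group closed
```

For example, match_parenthesized("(a (b) c) tail") returns 9, the index just past the outer closing parenthesis, while an unclosed input such as "(a (b c" returns None.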
--Brent Dax <bren...@cpan.org>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)
He who fights and runs away wasted valuable running time with the
fighting.
That being said, there may well be a builtin <self> rule that refers
to the current rule without having to name it. That lets you write
anonymous recursive rules, or possibly a generic rule that could
have more than one name.
Larry
So that would mean to match nested tables, I would have to write
rule nested_tables {
    <start_table>
    [ <!before <start_table>> <!before <end_table>> . | <nested_tables> ]*
    <end_table>
}
or maybe even
rule balanced {
    @_[0]
    [ <!before @_[0]> <!before @_[1]> . | <self> ]*
    @_[1]
};
$html =~ /<balanced(<start_table>, <end_table>)>/;
Forgive any lookahead syntax errors on my part; that isn't as bad as I had
thought. Thanks for pointing that out.
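
The same shape can be sketched in Python, with compiled regular expressions standing in for the opening and closing rules. This is only an illustration of the idea, not how a Perl 6 engine would do it; the names match_balanced, TABLE_START and TABLE_END are hypothetical, and both patterns are assumed never to match the empty string.

```python
import re

# Hypothetical stand-ins for <start_table> and <end_table>.
TABLE_START = re.compile(r"<table\b[^>]*>", re.IGNORECASE)
TABLE_END = re.compile(r"</table\s*>", re.IGNORECASE)

def match_balanced(text, opening, closing, pos=0):
    """Return the index just past a balanced opening...closing span, or None.

    opening and closing are compiled patterns that must not match "".
    """
    m = opening.match(text, pos)
    if m is None:
        return None                    # no opening delimiter at pos
    i = m.end()
    while i < len(text):
        c = closing.match(text, i)
        if c is not None:
            return c.end()             # matching close found
        if opening.match(text, i) is not None:
            # Nested opener: recurse, playing the role of <self>.
            end = match_balanced(text, opening, closing, i)
            if end is None:
                return None
            i = end
        else:
            i += 1                     # neither delimiter: skip one character
    return None                        # input ended before the span closed
```

Called as match_balanced(html, TABLE_START, TABLE_END), it returns the end of the outermost table when html starts with an opening tag, and None when the tags do not balance.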
However, since you forced me to read through A5 again, I now have
another question :). Since we can now do
$string.tr %hash;
what happens when the keys of %hash have overlapping ranges by accident
or otherwise? Are there any other options than reporting an overlap
(hard), auto-sorting the key-value pairs (medium), or not allowing
hashes (easy)?
Peter Behroozi
I suspected as much, but didn't use it to avoid stepping on toes. :^)
Doing tr efficiently generally requires precompilation, so in the
case of a hash, the compiled result would be stored as a run-time
property. So we can really do whatever processing we want, on
the assumption that the hash will change much less frequently than
it gets used. Alternatively, we could restrict hashes to single
character translations. But under UTF-8 that doesn't guarantee a
constant string length (as measured in bytes). But I would guess that
hashes would be used for even longer sequences of characters too,
so some amount of preprocessing would be desirable to determine if
one key was a prefix of another. Maybe it wants to get translated
to some sort of trie parser. Really just depends on how much memory
we want to throw at it to make it fast.
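
To make the preprocessing idea concrete, here is a small Python sketch. It covers only literal multi-character keys, not character ranges, and it uses a precompiled regex alternation sorted longest-first as a simple stand-in for a trie. The names prefix_overlaps and compile_translation are hypothetical.

```python
import re

def prefix_overlaps(table):
    """List every (shorter, longer) pair of keys where one is a prefix of the other."""
    keys = sorted(table)
    return [(a, b) for a in keys for b in keys if a != b and b.startswith(a)]

def compile_translation(table):
    """Precompile a mapping once; return a function that translates strings."""
    if "" in table:
        raise ValueError("empty keys cannot be translated")
    if not table:
        return lambda s: s             # nothing to translate
    # Longest keys first, so "ab" wins over "a" at the same position.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda s: pattern.sub(lambda m: table[m.group(0)], s)
```

With the table {"a": "1", "ab": "2", "c": "3"}, prefix_overlaps reports [("a", "ab")], and the compiled translator turns "abca" into "231" because the longer key "ab" is tried before "a". Compiling once and reusing the result matches the assumption that the hash changes much less often than it is used.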
Larry