Balanced Matches in Regexps?

Peter Behroozi

unread,

Aug 17, 2002, 12:59:12 PM8/17/02

to perl6-l...@perl.org

Hello All,

After reading over Apocalypse 5 one more time, I noticed that balanced
matches (like capturing nested parenthetical comments ((like this))) had
been glossed over in the rejection of RFC 145. What was not even
mentioned in the rejection was the possibility of balanced expressions
that would take rules as their opening and closing delimiters. This
would be especially useful, for example, when capturing nested tables in
an HTML document, since not all tables look the same (<table> vs. <tAbLE
attrs...> for instance). You may object that this would just make the
regexp uglier, but what happens if we allow XML-ish rules, e.g.

$html =~ /<balanced opening=<table_start> closing=<table_end>>/;

where the "balanced" rule gets to play with %_{opening} and %_{closing}
to do its magic?

I am not saying that such a "balanced" rule would be easy to implement
in Perl (I personally think that the "balanced" rule is something that
should be more deeply tied to the Regex Engine), but I am proposing that
it can simultaneously be very useful and still look nice. Isn't that
justification enough?

Comments are appreciated,

Peter Behroozi

Brent Dax

unread,

Aug 17, 2002, 3:31:39 PM8/17/02

to Peter Behroozi, perl6-l...@perl.org

Peter Behroozi:
# After reading over Apocalypse 5 one more time, I noticed that
# balanced matches (like capturing nested parenthetical
# comments ((like this))) had been glossed over in the
# rejection of RFC 145. What was not even mentioned in the

rule parenthesized { $ ( <-[()]> | <parenthesized> ) $ }

The key to balanced delimiters is recursion. A5 gives us convenient
recursion; therefore, it gives us balanced delimiters.

--Brent Dax <bren...@cpan.org>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)

He who fights and runs away wasted valuable running time with the
fighting.

Larry Wall

unread,

Aug 17, 2002, 4:05:33 PM8/17/02

to Brent Dax, Peter Behroozi, perl6-l...@perl.org

On Sat, 17 Aug 2002, Brent Dax wrote:
: Peter Behroozi:

: # After reading over Apocalypse 5 one more time, I noticed that
: # balanced matches (like capturing nested parenthetical
: # comments ((like this))) had been glossed over in the
: # rejection of RFC 145. What was not even mentioned in the
:
: rule parenthesized { $ ( <-[()]> | <parenthesized> ) $ }
:
: The key to balanced delimiters is recursion. A5 gives us convenient
: recursion; therefore, it gives us balanced delimiters.

That being said, there may well be a builtin <self> rule that refers
to the current rule without having to name it. That lets you write
anonymous recursive rules, or possibly a generic rule that could
have more than one name.

Larry

Peter Behroozi

unread,

Aug 17, 2002, 4:16:22 PM8/17/02

to perl6-l...@perl.org

On Sat, 2002-08-17 at 14:31, Brent Dax wrote:
> Peter Behroozi:
> # After reading over Apocalypse 5 one more time, I noticed that
> # balanced matches (like capturing nested parenthetical
> # comments ((like this))) had been glossed over in the
> # rejection of RFC 145. What was not even mentioned in the
>
> rule parenthesized { $ ( <-[()]> | <parenthesized> ) $ }
>

So that would mean to match nested tables, I would have to write

rule nested_tables { <start_table> [ <!before <start_table>><!before
<end_table>> . | <nested_tables> ] <end_table> }

or maybe even

rule balanced { @_[0] [ <!before @_[0]><!before @_[1]> . | <self> ]
@_[1] };

$html =~ /<balanced(<start_table>, <end_table>)>/;

Forgiving lookahead syntax errors on my part, that isn't as bad as I had
thought. Thanks for pointing that out.

However, since you forced me to read through A5 again, I now have
another question :). Since we can now do

$string.tr %hash;

what happens when the keys of %hash have overlapping ranges by accident
or otherwise? Are there any other options than reporting an overlap
(hard), auto-sorting the key-value pairs (medium), or not allowing
hashes (easy)?

Peter Behroozi

Brent Dax

unread,

Aug 17, 2002, 4:21:26 PM8/17/02

to Larry Wall, Peter Behroozi, perl6-l...@perl.org

Larry Wall:
# That being said, there may well be a builtin <self> rule that
# refers to the current rule without having to name it. That
# lets you write anonymous recursive rules, or possibly a
# generic rule that could have more than one name.

I suspected as much, but didn't use it to avoid stepping on toes. :^)

Larry Wall

unread,

Aug 17, 2002, 8:04:30 PM8/17/02

to Peter Behroozi, perl6-l...@perl.org

On 17 Aug 2002, Peter Behroozi wrote:
: However, since you forced me to read through A5 again, I now have

: another question :). Since we can now do
:
: $string.tr %hash;
:
: what happens when the keys of %hash have overlapping ranges by accident
: or otherwise? Are there any other options than reporting an overlap
: (hard), auto-sorting the key-value pairs (medium), or not allowing
: hashes (easy)?

Doing tr efficiently generally requires precompilation, so in the
case of a hash, the compiled result would be stored as a run-time
property. So we can really do whatever processing we want, on
the assumption that the hash will change much less frequently than
it gets used. Alternatively, we could restrict hashes to single
character translations. But under UTF-8 that doesn't guarantee a
constant string length (as measured in bytes). But I would guess that
hashes would be used for even longer sequences of characters too,
so some amount of preprocessing would be desirable to determine if
one key was a prefix of another. Maybe it wants to get translated
to some sort of trie parser. Really just depends on how much memory
we want to throw at it to make it fast.

Larry