balanced paren regex's

ivo...@gmail.com

unread,

Jun 17, 2006, 9:51:57 AM6/17/06

to

[posted earlier in perl.modules, but no answer.]

dear perl users: I want to write a function that extracts "ordinary"
subroutines from perl code. (an equivalent task is extracting all
macros from a latex file.) I am not trying to be too clever. let's
presume I can recognize subs because subs and only subs always start at
the first character of a line and are not anonymous. a sub is followed
by a name and can contain nested expressions.

I can do plain pattern matching to find the first occurance of the
first sub: '^sub \w+'. but now I am stuck. I need to continue
on with a Text::Balanced expression right after, and after the
text::balanced is done, continue on with my regex search (\G).

my $text=
"
{ text }
expressions
sub a {
if (1==1) { print "true"; } # if string or comment could contain
unbalanced paren, even better
}
more expressions
sub b {
if (0) { print "false" }
}
";

and I want to call a subroutine getnextsub() that first will return
"sub a {\n if (1==1) { print \"true\"; } # if string or
comment could contain unbalanced paren, even better\n }"
and on the next call will return
"sub b {\n if (0) { print \"false\" }\n }";

this must be a fairly common problem (e.g., extracting tex macro
arguments, etc.), but short of mimicking C in writing very low-level
paren counter subroutines on individual characters, I cannot see how to
solve this. I do understand the issue of how to treat nested subs, but
this right now is only a secondary concern.

help appreciated.

sincerely, /iaw

Dr.Ruud

unread,

Jun 17, 2006, 10:20:59 AM6/17/06

to

ivo...@gmail.com schreef:

> I want to write a function that extracts "ordinary"
> subroutines from perl code.

Search CPAN on 'balanced' or on 'parse'.

--
Affijn, Ruud

"Gewoon is een tijger."

Xicheng Jia

unread,

Jun 17, 2006, 3:34:15 PM6/17/06

to

ivo...@gmail.com wrote:
[snip]

> and I want to call a subroutine getnextsub() that first will return
> "sub a {\n if (1==1) { print \"true\"; } # if string or
> comment could contain unbalanced paren, even better\n }"
> and on the next call will return
> "sub b {\n if (0) { print \"false\" }\n }";
>
> this must be a fairly common problem (e.g., extracting tex macro
> arguments, etc.), but short of mimicking C in writing very low-level
> paren counter subroutines on individual characters, I cannot see how to
> solve this. I do understand the issue of how to treat nested subs, but
> this right now is only a secondary concern.

Here is one way you may use(the iterator way from the book HOP):
#########################
use strict;
use warnings;

my $text= <<'END_TEST';

{ text }
expressions
sub a {
if (1==1) { print "true"; } # if string or comment could contain
unbalanced paren, even better
}
more expressions
sub b {
if (0) { print "false" }
}

safdsf
END_TEST

local our $n;
# pattern to track embedded braces
my $pattern = qr/
(?> (?{$n = 0})
(?:
[^{}]
|
\{ (?{$n++})
|
\} (?(?{$n != 0}) (?{$n--}) | (?!) )
)*
)(?(?{$n != 0})(?!))
/x;

# set the iterator
my $it = getnextsub($text);
my $count = 0;

# loop through the text and print out all functions
while (my $next_sub = $it->()) {
print "Subroutine-", ++$count, " is:\n${next_sub}\n\n";
}

# subroutine to set the iteratior
sub getnextsub {
my $text = shift;
return sub {
my $sub_def;
if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {
$sub_def = $1;
}
$sub_def;
}
}
####################################

Xicheng Jia

unread,

Jun 17, 2006, 7:52:02 PM6/17/06

to

BTW. you can make the subroutine "getnextsub" to skip any number of
function definitions.
#################

sub getnextsub {
my $text = shift;
return sub {

my $num_subs = shift || 1;
my ($sub_def, $cnt);
while ($text =~ s/(sub\s*\S+\s*{$pattern})//) {
if (++$cnt == $num_subs) {
$sub_def = $1;
last;
}
}
return $sub_def || "undefined\n";
}
}
################
# if you have subroutine definitions a, b, c, d, e, f
# and in that order, then

$it = getnextsub($text); # set the iterator
$next_sub = $it->(); #get sub a {...}
$next_sub = $it->(3); #get sub d {...}
$next_sub = $it->(); #get sub e {...}
$next_sub = $it->(2); #return "undefined"
$it = getnextsub($text); # reset the iterator

Xicheng :-)

Xicheng Jia

unread,

Jun 17, 2006, 8:11:03 PM6/17/06

to

=> > if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {

should change from \s* to \s+, and \w+ is enough to replace \S+

if ($text =~ s/(sub\s+\w+\s*{$pattern})//) {

Xicheng

Xicheng Jia

unread,

Jun 18, 2006, 1:19:55 AM6/18/06

to

Xicheng Jia wrote:
> Xicheng Jia wrote:

[snip]

> => > if ($text =~ s/(sub\s*\S+\s*{$pattern})//) {
>
> should change from \s* to \s+, and \w+ is enough to replace \S+
>
> if ($text =~ s/(sub\s+\w+\s*{$pattern})//) {
>

one more modification:

if ($text =~ s/.*?(sub\s+\w+\s*{$pattern})//) {

Xicheng

ivo...@gmail.com

unread,

Jun 18, 2006, 9:55:01 AM6/18/06

to

thank you very much. regards, /iaw

ivo...@gmail.com

unread,

Jun 18, 2006, 4:15:24 PM6/18/06

to

I am truly becoming greedy now. is there a good/clever way to keep
track on which lineno the match was made on (i.e., how many \n occurred
before)? [something similar to $., but in connection with a text
match.]

regards,

/iaw

Ben Morrow

unread,

Jun 17, 2006, 6:06:59 PM6/17/06

to

Quoth ivo...@gmail.com:

>
> [posted earlier in perl.modules, but no answer.]
>
> dear perl users: I want to write a function that extracts "ordinary"
> subroutines from perl code. (an equivalent task is extracting all
> macros from a latex file.) I am not trying to be too clever. let's
> presume I can recognize subs because subs and only subs always start at
> the first character of a line and are not anonymous. a sub is followed
> by a name and can contain nested expressions.
>
> I can do plain pattern matching to find the first occurance of the
> first sub: '^sub \w+'. but now I am stuck. I need to continue
> on with a Text::Balanced expression right after, and after the
> text::balanced is done, continue on with my regex search (\G).

You mentioned Text::Balanced; how does extract_codeblock not do what you
want?

Ben

--
The cosmos, at best, is like a rubbish heap scattered at random.
Heraclitus
benm...@tiscali.co.uk

Xicheng Jia

unread,

Jun 18, 2006, 5:36:33 PM6/18/06

to

sure you can. the key for this method (you might want to read [1] for
more introduction about iteration) is how to use "closure" in Perl
subroutines. I revised the previous subroutine again and fixed some
bugs. :

1) 's' modifier is added in the s/// expression, otherwise .*? can not
match multiple lines;
2) capture two parts: $1, and $2, and use something like $1 =~ tr/\n//;
to count the number of newlines in a substring.
3) "return" statement is revised so that you can use the iterator in a
while loop;
4) two variables added: $line_num to count newlines in the whole
matched text block. $lineno is the line_number containing the keyword
'sub' of your function declaration...

###################################

sub getnextsub {
my $text = shift;

my $line_num = 0;

return sub {
my $num_subs = shift || 1;

my ($sub_def, $cnt) = ("", 0);
while ($text =~ s/(.*?(sub\s*\S+\s*{$pattern}))//s) {
$line_num += ($1 =~ tr/\n//);
if (++$cnt == $num_subs) {
$sub_def = $2;
my $lineno = $line_num + 1 - ($sub_def =~ tr/\n//);
print "line_number is $lineno\n";
last;
}
}
print "undefined\n" if not $sub_def;
return $sub_def;
}
} # end of getnextsub #
###################################

don't know where you want the line numbers to go, so just print them
out.

Good luck
Xicheng

[1] "Higher-Order Perl: Transforming Programs with Programs", by M.J.
Dominus.