parenthesis

Michele Simionato

unread,

Nov 4, 2002, 3:24:31 PM11/4/02

to

Suppose I want to parse the following expression:

>>> exp='(a*(b+c*(2-x))+d)+f(s1)'

I want to extract the first part, i.e. '(a*(b+c*(2-x))+d)'.

Now if I use a greedy regular expression

>>> import re; greedy=re.compile('$.*$')

I obtain to much, the full expression:

>>> match=greedy.search(exp); match.group()

'(a*(b+c*(2-x))+d)+f(s1)'

On the other hand, if I use a nongreedy regular expression

>>> nongreedy=re.compile('$.*?$')

I obtain too little:

>>> match=nongreedy.search(exp); match.group()

'(a*(b+c*(2-x)'

Is there a way to specify a clever regular expression able to match
the first parenthesized group ? What I did, was to write a routine
to extract the first parenthesized group:

def parenthesized_group(exp):
nesting_level,out=0,[]
for c in exp:
out.append(c)
if c=='(': nesting_level+=1
elif c==')': nesting_level-=1
if nesting_level==0: break
return ''.join(out)

>>> print parenthesized_group(exp)

(a*(b+c*(2-x))+d)

Still, this seems to me not the best way to go and I would like to know
if this can be done with a regular expression. Notice that I don't need
to control all the nesting levels of the parenthesis, for me it is enough
to recognize the end of the first parenthesized group.

Obiously, I would like a general recipe valid for more complicate
expressions: in particular I cannot assume that the first group ends
right before a mathematical operator (like '+' in this case) since
these expressions are not necessarely mathematical expressions (as the
example could wrongly suggest). In general I have expressions of the
form

( ... contains nested expressions with parenthesis... )...other stuff

where other stuff may contain nested parenthesis. I can assume that
there are no errors, i.e. that all the internal open parenthesis are
matched by closing parenthesis.

Is this a problem which can be tackled with regular expressions ?

TIA,

--
Michele Simionato - Dept. of Physics and Astronomy
210 Allen Hall Pittsburgh PA 15260 U.S.A.
Phone: 001-412-624-9041 Fax: 001-412-624-9163
Home-page: http://www.phyast.pitt.edu/~micheles/

Joshua Marshall

unread,

Nov 4, 2002, 3:49:43 PM11/4/02

to

Regular expressions are not powerful enough to be used to match
strings when you need to be intelligent about nesting. There are
probably parser generators available--links anyone?

For your particular application, also take a look at the "parser"
Python module. It's a little ugly, since it gives you complete
(rather than abstract) syntax trees, but it may help you.

Matthew Knepley

unread,

Nov 4, 2002, 4:16:31 PM11/4/02

to

>>>>> ">" == Joshua Marshall <jmar...@mathworks.com> writes:

>> Regular expressions are not powerful enough to be used to match strings when you need to be intelligent about
>> nesting. There are probably parser generators available--links anyone?

I really like PLY from David Beazley, http://systems.cs.uchicago.edu/ply.

Matt

>> For your particular application, also take a look at the "parser" Python module. It's a little ugly, since it
>> gives you complete (rather than abstract) syntax trees, but it may help you.

--
"Failure has a thousand explanations. Success doesn't need one" -- Sir Alec Guiness

Mike C. Fletcher

unread,

Nov 4, 2002, 4:43:57 PM11/4/02

to

http://simpleparse.sf.net/

HTH,
Mike

Joshua Marshall wrote:

--
_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/

Bengt Richter

unread,

Nov 4, 2002, 5:05:11 PM11/4/02

to

Well, they don't count, so if you want to count you have to throw in
something extra. E.g., you could do this, to insert a delimiter after
a closing right paren, and then split on the delimiter. Probably not
wonderfully efficient, and I am just duplicating what you did, except
the regex separates the chunks for me.

>>> import re
>>> rx = re.compile(r'([()]|[^()]*)')
>>> class Addelim:
... def __init__(self): self.parens=0
... def __call__(self, m):
... s = m.group(1)
... if s=='(': self.parens+=1
... if self.parens==1 and s==')':
... self.parens=0
... return s+'\x00'
... if s==')': self.parens -=1
... return s
...
>>> for e in rx.sub(Addelim(),exp).split('\x00'): print e
...
(a*(b+c*(2-x))+d)
+f(s1)

Where exp was
>>> exp
'(a*(b+c*(2-x))+d)+f(s1)'

Regards,
Bengt Richter

Lee Harr

unread,

Nov 4, 2002, 8:42:16 PM11/4/02

to

In article <2259b0e2.02110...@posting.google.com>,

Michele Simionato wrote:
> Suppose I want to parse the following expression:
>
>>>> exp='(a*(b+c*(2-x))+d)+f(s1)'
>
> I want to extract the first part, i.e. '(a*(b+c*(2-x))+d)'.
>

n = 0
for c in range(len(exp)):
l = exp[c]
if l == '(':
n += 1
elif l == ')':
n -= 1
if not n:
out = exp[0:c+1]
break
print out

Bengt Richter

unread,

Nov 4, 2002, 9:25:57 PM11/4/02

to

On 4 Nov 2002 22:05:11 GMT, bo...@oz.net (Bengt Richter) wrote:

>On 4 Nov 2002 12:24:31 -0800, mi...@pitt.edu (Michele Simionato) wrote:
>
>>Suppose I want to parse the following expression:
>>
>>>>> exp='(a*(b+c*(2-x))+d)+f(s1)'
>>
>>I want to extract the first part, i.e. '(a*(b+c*(2-x))+d)'.
>>

[... previous version ...]

Wondering why I didn't just write:

>>> import re
>>> rx = re.compile(r'([()]|[^()]+)')
>>> class Addelim:
... def __init__(self, delim):
... self.parens=0; self.delim=delim

... def __call__(self, m):
... s = m.group(1)
... if s=='(': self.parens+=1
... if self.parens==1 and s==')':
... self.parens=0

... return s+self.delim

... if s==')': self.parens -=1
... return s
...

>>> exp = '(a*(b+c*(2-x))+d)+f(s1)'

It was natural to be able to specify the delimiter. And the + is probably
better than the * on the non-paren "[^()]+" part of the pattern.
Then using \n as delimiter to break into lines one can just print it.

>>> print rx.sub(Addelim('\n'),exp)
(a*(b+c*(2-x))+d)
+f(s1)

Which you could also use like:

>>> print rx.sub(Addelim('\n'),exp).splitlines()
['(a*(b+c*(2-x))+d)', '+f(s1)']

Or to get back to your original requirement,

>>> print rx.sub(Addelim('\n'),exp).splitlines()[0]
(a*(b+c*(2-x))+d)

But I suspect it would run faster to let a regex split the string and then use
a loop like yours on the pieces, which would be '(' or ')' or some other string
that you don't need to look at character by character. E.g.,

>>> rx = re.compile(r'([()])')
>>> ss = rx.split(exp)
>>> ss
['', '(', 'a*', '(', 'b+c*', '(', '2-x', ')', '', ')', '+d', ')', '+f', '(', 's1', ')', '']

Notice that the splitter matches wind up at the odd indices. I think that's generally true
when you put parens around the splitting expression, to return the matches as part of the list,
but I'm not 100% certain. Anyway, you could make use of that, something like:

>>>
>>> parens = 0
>>> endix = []
>>> for i in range(1,len(ss),2):
... if parens==1 and ss[i]==')':
... parens=0; endix.append(i+1)
... elif ss[i]=='(': parens += 1
... else: parens -= 1
...
>>> endix
[12, 16]

You could break the loop like you did if you just want the first expression,
or you could grab it by

>>> print ''.join(ss[:endix[0]])
(a*(b+c*(2-x))+d)

or list the bunch,

>>> lo=0
>>> for hi in endix:
... print ''.join(ss[lo:hi])
... lo = hi
...
(a*(b+c*(2-x))+d)
+f(s1)

or whatever. Which is not as slick, but probably faster if you had to do a bi-ig bunch of them.

I think when the fenceposts are simple, but you are mainly interested in the data between, splitting
on a fencepost regex and processing the resulting list can be simpler and faster than trying to
do it all with a complex regex.

Regards,
Bengt Richter

p...@quadstone.com

unread,

Nov 5, 2002, 8:19:57 AM11/5/02

to

> On 4 Nov 2002 22:05:11 GMT, bo...@oz.net (Bengt Richter) wrote:
>
> >On 4 Nov 2002 12:24:31 -0800, mi...@pitt.edu (Michele Simionato) wrote:
> >
> >>Suppose I want to parse the following expression:
> >>
> >>>>> exp='(a*(b+c*(2-x))+d)+f(s1)'
> >>
> >>I want to extract the first part, i.e. '(a*(b+c*(2-x))+d)'.
> >>

The method I've used in perl (! sorry, new to python :) with some success is to
generalize your routine that tracks the nesting level but actually modifies the
string to insert escaped versions of the parens. ie. call with something like:

escape_paren('(a*(b+c*(2-x))+d)+f(s1)','()')

where second arg is optional two character string with the opening and closing
brackets.

this returns a string like:

'<QL0>a*<QL1>b+c*<QL2>2-x<QR2><QR1>+d<QR0>+f<QL0>s1<QR0>'

then you can do whatever you like with regexp, and call an inverse routine

unescape_paren(escaped_str,'()')

to put back the parens (obviously the routines also need to handle (un)escaping
things that look like escaped parens).

Patrick

Michele Simionato

unread,

Nov 5, 2002, 12:56:54 PM11/5/02

to

bo...@oz.net (Bengt Richter) wrote in message news:<aq6qun$ib4$0...@216.39.172.122>...

Very interesting approach. But it is even more interesting to compare
its
speed with the simple minded approach. I thought your algorithm was
going to
be the fastest, since you do not split the initial string chars by
chars in Python, but let the regular expression do the job.
However a simple benchmark (not subtracting the overhead times) gives:

parenthesized_group: 130-140 microseconds
Addelim: 620-640 microseconds

The simple minded approach is more than four-five times faster!

I think this is rather instructive. Addelim is particular inefficient
for
long expressions, since it analizes the full expression whereas
parenthesized_group stops at the end of the first parenthesized group.
For fun I run Addelim on exp*100 (i.e. the 100 times the original
string): it takes more than 50000 microseconds whereas
parenthesized_group
is still on 140 microseconds.

It is good to have yet another proof of the dangers involved with
regular
expressions !

Bye,

Michele

Michele Simionato

unread,

Nov 5, 2002, 1:01:36 PM11/5/02

to

Lee Harr <mis...@frontiernet.net> wrote in message news:<use8fo2...@corp.supernews.com>...

This is the same I did.

Michele Simionato

unread,

Nov 5, 2002, 1:10:34 PM11/5/02

to

> Wondering why I didn't just write:
>
> >>> import re
> >>> rx = re.compile(r'([()]|[^()]+)')
> >>> class Addelim:
> ... def __init__(self, delim):
> ... self.parens=0; self.delim=delim
> ... def __call__(self, m):
> ... s = m.group(1)
> ... if s=='(': self.parens+=1
> ... if self.parens==1 and s==')':
> ... self.parens=0
> ... return s+self.delim
> ... if s==')': self.parens -=1
> ... return s
> ...
> >>> exp = '(a*(b+c*(2-x))+d)+f(s1)'
>
> It was natural to be able to specify the delimiter. And the + is probably
> better than the * on the non-paren "[^()]+" part of the pattern.

Not really. My benchmark gives essentially the same for "[^()]+*" and
"[^()]*", no sensible difference.

I strongly suspect that in this simple problem the simple approach is by far
the fastest.

Bye,

Michele

Michele Simionato

unread,

Nov 5, 2002, 1:42:57 PM11/5/02

to

Lee Harr <mis...@frontiernet.net> wrote in message news:<use8fo2...@corp.supernews.com>...

This is not the same that I did, indeed. Even if the logic is essentially the
same, you don't need to create the out list. It turns out the your approach
is 35% faster than mine, contrary to my naive expectations !

There-are-lies-damned-lies-and-benchmarks-ly yours,

Michele

Scott Gilbert

unread,

Nov 5, 2002, 3:39:33 PM11/5/02

to

mi...@pitt.edu (Michele Simionato) wrote:
>
> Suppose I want to parse the following expression:
>
> >>> exp='(a*(b+c*(2-x))+d)+f(s1)'
>

I'm not sure if you want to understand why simple regexps can't do
what you want, or if you just want a solution to your problem. Here
is one path you could try:

>>> from compiler import parse
>>> parse(exp)

Module(None, Stmt([Discard(Add((Add((Mul((Name('a'), Add((Name('b'),
Mul((Name('c'), Sub((Const(2), Name('x'))))))))), Name('d'))),
CallFunc(Name('f'), [Name('s1')], None, None))))]))

Cheers,
-Scott

Joshua Marshall

unread,

Nov 5, 2002, 4:03:35 PM11/5/02

to

I wasn't aware of this one. Much nicer than the output from the parser module.

Bengt Richter

unread,

Nov 5, 2002, 7:59:56 PM11/5/02

to

On 5 Nov 2002 09:56:54 -0800, mi...@pitt.edu (Michele Simionato) wrote:

>bo...@oz.net (Bengt Richter) wrote in message news:<aq6qun$ib4$0...@216.39.172.122>...
>> Well, they don't count, so if you want to count you have to throw in
>> something extra. E.g., you could do this, to insert a delimiter after
>> a closing right paren, and then split on the delimiter. Probably not

^^^^^^^^^^^^

>> wonderfully efficient, and I am just duplicating what you did, except

^^^^^^^^^^^^^^^^^^^^^-[1]

>> the regex separates the chunks for me.
>>
>> >>> import re
>> >>> rx = re.compile(r'([()]|[^()]*)')
>> >>> class Addelim:
>> ... def __init__(self): self.parens=0
>> ... def __call__(self, m):
>> ... s = m.group(1)
>> ... if s=='(': self.parens+=1
>> ... if self.parens==1 and s==')':
>> ... self.parens=0
>> ... return s+'\x00'
>> ... if s==')': self.parens -=1
>> ... return s
>> ...
>> >>> for e in rx.sub(Addelim(),exp).split('\x00'): print e
>> ...
>> (a*(b+c*(2-x))+d)
>> +f(s1)
>>
>> Where exp was
>> >>> exp
>> '(a*(b+c*(2-x))+d)+f(s1)'
>>
>> Regards,
>> Bengt Richter
>
>Very interesting approach. But it is even more interesting to compare

Yes, I had just learned from a post of Alex M that sub can take a function,
so I thought it would be interesting to use it ;-)

>its
>speed with the simple minded approach. I thought your algorithm was
>going to
>be the fastest, since you do not split the initial string chars by

I didn't think so [1] ;-)

>chars in Python, but let the regular expression do the job.

Well, doing a job that doesn't really need to be done never increases speed ;-)

I.e., for your original problem, there is no need to create an alternative string
with delimiters after expressions, never mind to split that string again to get
at just the first item. All that's needed is to look at the input string and count
to find out where the first expression ends.

The question then becomes, what is the fastest way to look for parentheses and make
sure you've counted some before detecting balanced count.

If parentheses were separated by thousands of characters, I would expect that a regex search
or finditer for [()] would go faster than a simple s[i] indexed scan, but for short expressions
such as the example, there's probably a crossover point. Certainly '(x)' can be scanned brute force
faster than you can get results from a regex object method.

For speed, I think it might be faster to count position but not use it to get the
characters, e.g.,

>>> def get_1st(exp):
... parens=i=0
... for c in exp:
... i += 1
... if c=='(': parens+=1
... elif c==')':
... parens -=1
... if not parens: return exp[:i]
...
>>> exp = '(a*(b+c*(2-x))+d)+f(s1)'
>>> get_1st(exp)
'(a*(b+c*(2-x))+d)'

This is a little different from Lee Harr's post (which BTW looks to me
like it depends on the first char being '('). I'd expect the above to run a
little faster.

>However a simple benchmark (not subtracting the overhead times) gives:
>
>parenthesized_group: 130-140 microseconds
>Addelim: 620-640 microseconds

Yes, Addelim has calling overhead for one, and is creating a new string by substitution
for another, so it's doing a lot of un-needed work.

>
>The simple minded approach is more than four-five times faster!
>
>I think this is rather instructive. Addelim is particular inefficient
>for
>long expressions, since it analizes the full expression whereas
>parenthesized_group stops at the end of the first parenthesized group.

Sure. The cost of generalizing requirements instead of sticking to specs ;-)

>For fun I run Addelim on exp*100 (i.e. the 100 times the original
>string): it takes more than 50000 microseconds whereas
>parenthesized_group
>is still on 140 microseconds.

Not exactly unanticipated, I'm sure ;-)

>
>It is good to have yet another proof of the dangers involved with
>regular
>expressions !

There are dangers in drawing too general conclusions from particular
examples too ;-)

Regards,
Bengt Richter

Michele Simionato

unread,

Nov 6, 2002, 9:59:39 AM11/6/02

to

bo...@oz.net (Bengt Richter) wrote in message

> For speed, I think it might be faster to count position but not use it to get the
> characters, e.g.,
>
> >>> def get_1st(exp):
> ... parens=i=0
> ... for c in exp:
> ... i += 1
> ... if c=='(': parens+=1
> ... elif c==')':
> ... parens -=1
> ... if not parens: return exp[:i]
> ...
> >>> exp = '(a*(b+c*(2-x))+d)+f(s1)'
> >>> get_1st(exp)
> '(a*(b+c*(2-x))+d)'
>
> This is a little different from Lee Harr's post (which BTW looks to me
> like it depends on the first char being '('). I'd expect the above to run a
> little faster.
>

You are right:

parenthesized_group: 130-140 microseconds
Addelim: 620-640 microseconds

Harr: 90-95 microseconds
get_1st: 60-65 microseconds

get_1st wins as the faster approach, more than twice my original
version
using lists (which however was not intended for speed).

> There are dangers in drawing too general conclusions from particular
> examples too ;-)
>

I agree, still it has been instructive to me to compare the various
approaches: I learned something more that just the fact that this is
not a problem for regular expressions (thing that I suspected).

Thanks to everybody,

Michele