Parsing nested parentheses (and other delimiters)

Amit Patel

unread,

Sep 12, 1997, 3:00:00 AM9/12/97

to

I've had this problem come up many times: I receive a string, and I
want to parse it. Regexps would do, EXCEPT for handling things like
nested parentheses, strings with backslash-quoted characters, etc.
Apparently I'm not the only one who could benefit from such a parser.
On another thread people are discussing the rfc822 parsing, which
doesn't seem to handle nested angle brackets or string quoting very
well.

Yes, I know, maybe I need a real parser.

However, this doesn't seem like something that requires a full blown
parser... just something a notch above string.split() or what a regexp
can do. I want to use a regexp like:

'<$.*$>' or '<$[^>]*$>'

Neither is correct, though. I don't want '.*' because it may grab too
many '>'s. I don't want '[^>]*' because it may not grab enough.
Instead I want a string that has proper nesting of <> and "".

I imagine this matcher function could be used as:

>>> m = make_nested_matcher('<','>')
>>> m.parse("Christiane Wieth <pager <1535494@\>telmi.de>> sent mail")
=> ['Christiane Wieth ','<pager <1535494@\>telmi.de>>',' sent mail']

I want output that splits the string, and respects the nesting levels
and quoting. A more sophisticated version would descend into the <>
enclosed text, and parse that as well.

Unfortunately I don't know where to find such a function, other than
in a full-blown parser system.

1. Is there something in the standard Python library that
would help?

2. Is there something in an undocumented lib (aiee) that
would help?

3. Is there contributed code somewhere that would help?

If the answer is 'no' to all three, then perhaps I'll have to sit down
and write it. :)

Thanks for any tips,

- Amit

Brad Howes

unread,

Sep 12, 1997, 3:00:00 AM9/12/97

to

>>>> Thus spake 'Amit Patel (am...@Xenon.Stanford.EDU)':

AP> I've had this problem come up many times: I receive a string, and I
AP> want to parse it. Regexps would do, EXCEPT for handling things like
AP> nested parentheses, strings with backslash-quoted characters, etc.
[snip]
AP> 1. Is there something in the standard Python library that
AP> would help?

AP> 2. Is there something in an undocumented lib (aiee) that
AP> would help?

AP> 3. Is there contributed code somewhere that would help?

How about this:

from string import splitfields, joinfields

#
# NestedParse -- parse a string into a list made up of groups delimited by
# the given left/right delimiter pair.
#
def NestedParse( s, leftDelim, rightDelim ):

#
# Replace any '[' and ']' with something that won't be in string (unsafe)
#
s = joinfields( splitfields( s, '[' ), '@LB@' )
s = joinfields( splitfields( s, ']' ), '@RB@' )

#
# Replace delimiters with string pairs that will create list of strings
# when eval'ed.
#
s = joinfields( splitfields( s, leftDelim ), '",["' )
s = joinfields( splitfields( s, rightDelim ), '"],"' )

#
# Finally, replace original substitutions
#
s = joinfields( splitfields( s, '@LB@' ), '[' )
s = joinfields( splitfields( s, '@RB@' ), ']' )

#
# Generate list of items
#
return eval( '["' + s + '"]' )

--
Brad Howes Motorola E-Mail ID: XBH001
EMT Development SMTP E-Mail: bho...@cssun3.corp.mot.com
Motorola Corporate - MD H1780 Voice: 602 441 1522 Fax: 602 441 5455

Michael Scharf

unread,

Sep 13, 1997, 3:00:00 AM9/13/97

to

This is a multi-part message in MIME format.
--------------856C0C77A67442CD6BB0E2A4
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

If you do not quote " and \ your might approach will run into problems:

>>> NestedParse('<">', '<','>')
Traceback (innermost last):
File "<stdin>", line 1, in ?
File "nestedparse.py", line 31, in NestedParse

return eval( '["' + s + '"]' )

File "<string>", line 1
["",["""],""]
^
SyntaxError: invalid token

Without quoting, you might get some strange side effects:
>>> NestedParse('",1+1,"', '<','>')
['', 2, '']
or worse:
>>> NestedParse('",open(\'/etc/passwd\').read(),"', '<','>')

to avoid the last one, use a super restricted environment
(if you don't trust the quoting!)

return eval( '["' + s + '"]', {"__builtins__": {} } )

Michael
--
''''\ Michael Scharf
` c-@@ TakeFive Software
` > http://www.TakeFive.com
\_ V mailto:Michael...@TakeFive.co.at
--------------856C0C77A67442CD6BB0E2A4
Content-Type: text/plain; charset=us-ascii; name="nestedparse.py"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline; filename="nestedparse.py"

# nestedparse.py
# original by Brad Howes <bho...@cssun3.corp.mot.com>
#
from string import joinfields, splitfields, find

_idmap = ''
for i in range(1,32): _idmap = _idmap + chr(i)

def FindQuoteString(s, start=''):
"""Find a substring that is not present in s."""
# The \000 will be in most cases the hit!
for c in _idmap:
if find(s,start+c)<0:
return start+c
# Please find a better algorythm....
return FindQuoteString(s,'\001')

def NestedParse( s, leftDelim, rightDelim ):

"""Parse a nested string into a list.

Parse a string into a list made up of groups
delimited by the given left/right delimiter pair."""

#
# Create strings that are not substrings of s
#
quote=FindQuoteString(s)
XL=quote+'L'
XR=quote+'R'

#
# Replace any '[' and ']' with something that won't be in string ...
#
s = joinfields( splitfields( s, '[' ), XL )
s = joinfields( splitfields( s, ']' ), XR )

#
# .. and also quotes and backslashed (quote first the backslashes!)
#
s = joinfields( splitfields( s, '\\' ), '\\\\' )
s = joinfields( splitfields( s, '"' ), '\\"' )

#
# Replace delimiters with string pairs that will create list of strings
# when eval'ed.
#
s = joinfields( splitfields( s, leftDelim ), '",["' )
s = joinfields( splitfields( s, rightDelim ), '"],"' )

#
# Finally, replace original substitutions
#

s = joinfields( splitfields( s, XL ), '[' )
s = joinfields( splitfields( s, XR ), ']' )

#
# Generate list of items
#
return eval( '["' + s + '"]' )

--------------856C0C77A67442CD6BB0E2A4--

John Viega

unread,

Sep 13, 1997, 3:00:00 AM9/13/97

to

In article <5vcabg$bkt$1...@nntp.Stanford.EDU> am...@Xenon.Stanford.EDU (Amit Patel) writes:

> I've had this problem come up many times: I receive a string, and I

> want to parse it. Regexps would do, EXCEPT for handling things like

> nested parentheses, strings with backslash-quoted characters, etc.

Yeah, matching things leads you into the realm of context-free
languages and beyond, which regular expressions just aren't powerful
enough to handle.

> Yes, I know, maybe I need a real parser.

Do you mean parser generator? Sure, you don't need a parser
generator. But knowing how to write a recursive decent parser is
invaluable for this sort of problem.

> I imagine this matcher function could be used as:
>
> >>> m = make_nested_matcher('<','>')
> >>> m.parse("Christiane Wieth <pager <1535494@\>telmi.de>> sent mail")
> => ['Christiane Wieth ','<pager <1535494@\>telmi.de>>',' sent mail']
>
> I want output that splits the string, and respects the nesting levels
> and quoting. A more sophisticated version would descend into the <>
> enclosed text, and parse that as well.

Okay, I whipped something up that does this for you. You could tweak
it for either speed or readibility, probably =) I present it here
just how it came out of my head.

Usage:
parser = Parser('<', '>')
parser.parse(somestring)

It accepts a left and right character. If you pass in a string of
more than 1 char for either one, it will not work.

If there is a mismatch in string quoting (" and ' both work), or in
your start and end delimiters, a ParseError is raised.

The return value is a list of items. If an item starts with your
delimiter, the return value is a list. For example, if you do:

parser = Parser('<', '>')
parser.parse("<foo> bar")

You'll get: [['foo'], ' bar']

Note that the list as the first item implies that we matched < and >.

This handles nested delimiters.

For example, if you do:
parser.parse("<foo <bar>>")

You'll get: [['foo ', ['bar']]]

You can use a \ in a string to escape any special character.

Quoted strings count as their own piece. I wasn't sure what you
wanted here, but if you do:
parser.parse("foo 'biz boz'")

You'll get:
['foo ', "'biz boz'"]

The alternative was to have it return ["foo 'biz boz'"], but that
shouldn't be too hard to change if you so desire.

Hope this helps!

------------%< snip %<----------------------%< snip %<------------

# Private class
class Token:
pass

ParseError = "ParseError"

# These variables are for private use only
LEFT_CHAR = 1
RIGHT_CHAR = 2
END = 3
OTHER = 4

class Parser:
def __init__(self, lchar, rchar):
self.lchar = lchar
self.rchar = rchar

def parse(self, str):
self.str = str
self.index = 0
self.strlen = len(str)
res, mismatch = self.line();
if mismatch:
raise ParseError
return res

# Don't call anything below this line yourself.

# line() returns a tuple of (matched_stuff, found_right_delim)
# The caller will know if the right delim matches a left delim
# or if it is an error.
def line(self):
results = []
while 1:
token = self.scan()
if token.id is RIGHT_CHAR:
return results, 1
elif token.id is LEFT_CHAR:
res, found_right = self.line()
if not found_right:
raise ParseError # No right delimiter
results.append(res)
elif token.id is END:
return results, 0
else:
results.append(token.value)

def scan(self):
t = Token()
t.id = OTHER
if self.index >= self.strlen:
t.id = END
return t
chr = self.str[self.index]
if chr == self.rchar:
t.id = RIGHT_CHAR
self.index = self.index + 1
return t
elif chr == self.lchar:
t.id = LEFT_CHAR
self.index = self.index + 1
return t
elif chr == '"' or chr == "'":
begin = self.index
while 1:
self.index = self.index + 1
if self.index >= self.strlen:
raise ParseError
if self.str[self.index] == '\\':
self.index = self.index + 1 # skip escaped char.
continue # You could do more complex
# escape handling depending
# on your needs.
if self.str[self.index] == chr:
break
self.index = self.index + 1
t.value = self.str[begin:self.index]
return t
else:
begin = self.index
while 1:
self.index = self.index + 1
if self.index >= self.strlen:
t.value = self.str[begin:]
return t
if self.str[self.index] == '\\': # see above comment.
self.index = self.index + 1
continue
if self.str[self.index] in ['"', "'", self.lchar, self.rchar]:
t.value = self.str[begin:self.index]
return t

Amit Patel

unread,

Sep 14, 1997, 3:00:00 AM9/14/97

to

>Brad Howes wrote:
>>
>>
>> How about this:
>>
>> from string import splitfields, joinfields
>>
>> #
>> # NestedParse -- parse a string into a list made up of groups delimited by
>> # the given left/right delimiter pair.
>> #

[code snipped]

Michael Scharf <Michael...@takefive.co.at> wrote:
>
>If you do not quote " and \ your might approach will run into problems:
>

>Without quoting, you might get some strange side effects:

Good points. :)

I might try using the 'parser' module to get the structure before it
is evaluated. Unfortunately it looks somewhat complicated. :)

Thanks for your help! I now have lots of ideas running around in my
head now :)

Amit

Amit Patel

unread,

Sep 14, 1997, 3:00:00 AM9/14/97

to

John Viega <jt...@adder.cs.virginia.edu> wrote:
>In article <5vcabg$bkt$1...@nntp.Stanford.EDU> am...@Xenon.Stanford.EDU (Amit Patel) writes:
>
>> I've had this problem come up many times: I receive a string, and I
>> want to parse it. Regexps would do, EXCEPT for handling things like
>> nested parentheses, strings with backslash-quoted characters, etc.
>
>Yeah, matching things leads you into the realm of context-free
>languages and beyond, which regular expressions just aren't powerful
>enough to handle.
>
>> Yes, I know, maybe I need a real parser.
>
>Do you mean parser generator? Sure, you don't need a parser
>generator. But knowing how to write a recursive decent parser is
>invaluable for this sort of problem.

(Yes, I meant a parser)

The main reason I was avoiding a recursive descent parser was that it
seemed like overkill. :) However, I'm starting to think that it's
probably the best way, and you've done all the work for me! :)
Thanks! :)

>Okay, I whipped something up that does this for you. You could tweak
>it for either speed or readibility, probably =) I present it here
>just how it came out of my head.
>

>Hope this helps!

Thanks!

- Amit

Andrew Kuchling

unread,

Sep 15, 1997, 3:00:00 AM9/15/97

to

am...@Xenon.Stanford.EDU (Amit Patel) wrote:
>want to parse it. Regexps would do, EXCEPT for handling things like
>nested parentheses, strings with backslash-quoted characters, etc.

>Apparently I'm not the only one who could benefit from such a parser.

Hmmm... the Perl developers are starting to experiment with
embedding code in a regex, using (?{print "Hello"}). The idea is that
it pushes regexes up toward the realm of parsers, and a program could
simply be a big regex that executed code as it cranked through the
data. Perhaps we should follow suit.

For example, let's assume any code executed in a regex has
magical access to the groups that have been matched so far, via a
group() function. (This isn't a serious interface proposal...)
Imagine writing a regex which only matched Python keywords:

import keyword
...
(\w+\b)(?{keyword.iskeyword( group(1) ) })

keyword.iskeyword() returns true or false depending on the string it
gets, so the expression will either match or fail depending on what
gets returned. Since regexes are first-class objects in Python, you
could write code something like:

parseExpr= regex.compile(""" spam spam spam ...
"(" # Match a parenthesis
"(?{ parseExpr.match( <some args> ) })" # A recursive regex! Run away!
")" # Match a parenthesis
)

This capability would be either very, very good or very, very evil. A
quotation from one of Douglas Adams' Doctor Who episodes comes to
mind: "The concept is staggering. Pointless, but staggering."

Andrew Kuchling
a...@magnet.com
http://starship.skyport.net/crew/amk/

John Viega

unread,

Sep 18, 1997, 3:00:00 AM9/18/97

to

In article <1997091514...@lemur.magnet.com> Andrew Kuchling <a...@magnet.com> writes:

> parseExpr= regex.compile(""" spam spam spam ...
> "(" # Match a parenthesis
> "(?{ parseExpr.match( <some args> ) })" # A recursive regex! Run away!
> ")" # Match a parenthesis
> )

In this example, it is not clear where you are consuming your input.
Here's what I mean: <some args> needs to know what has been matched so
far, so parseExpr can start scanning from there, and then it needs to
communicate somehow where parseExpr should continue scanning when the
recursive call returns.

I think perhaps that working on an LL(k) parser generator for Python
that has a straightforward and intuitive syntax is probably a better
solution, but then again, my opinion is that regexps are highly
overused.

M.-A. Lemburg

unread,

Sep 20, 1997, 3:00:00 AM9/20/97

to Amit Patel

Amit Patel wrote:
>
> I've had this problem come up many times: I receive a string, and I

> want to parse it. Regexps would do, EXCEPT for handling things like
> nested parentheses, strings with backslash-quoted characters, etc.

[...]

>
> 3. Is there contributed code somewhere that would help?
>

I'm currently working on something which might help you, called
tag tables. Basically what you do is assign objects to text slices
and because of it's ability to recurse, it can handle far more
than what regexps can (like recognizing context free languages,
which is what you have in mind here).

But I warn you: this stuff is still undergoing some heavy changes
and is a bit hard core when compared to regexps (but then: at least
you can see and even debug what's going on :-).

Have a look at:

http://starship.skyport.net/~lemburg/

And look for a 'tagging engine'.

BTW: a new version will be out in a few weeks (I'm adding lots
of things, because I started to use it in a real project and
some weaknesses showed up).

--
Marc-Andre Lemburg