Regexp: Matching pairs of characters

40 views
Skip to first unread message

Michael Barth

unread,
Nov 28, 2001, 10:53:18 AM11/28/01
to
Is it possible to match pairs of characters (e.g braces, brackets,
parenthesis, ...) with the regexp command?

Example:

regexp {\((.*)\)} "{(ab (cd)) (ef)" -> match

In the example match is 'ab (cd)) (ef' but it should be 'ab (cd)'. How
can you do this?

Regards,
Mike

Bob Techentin

unread,
Nov 28, 2001, 11:48:03 AM11/28/01
to

You can match "pairs" using a non-greedy quantifier. Just add "?" after
the ".*" in your pattern, and [regexp] will try to match the shortest
string instead of the longest. Like this:

% regexp -inline -all {\((.*?)\)} "junk (ab) (cd) (ef)"
(ab) ab (cd) cd (ef) ef

But counting "nested pairs" is trickier. You might be able to do it
with lookahead or back references. But I'm not that much of a regexp
expert.

Bob
--
Bob Techentin techenti...@mayo.edu
Mayo Foundation (507) 538-5495
200 First St. SW FAX (507) 284-9171
Rochester MN, 55901 USA http://www.mayo.edu/sppdg/

lvi...@yahoo.com

unread,
Nov 28, 2001, 12:15:04 PM11/28/01
to

According to Michael Barth <M.B...@de.bosch.com>:
:regexp {\((.*)\)} "{(ab (cd)) (ef)" -> match

:
:In the example match is 'ab (cd)) (ef' but it should be 'ab (cd)'. How
:can you do this?
:
Actually, the regular expression matched exactly what you said to:

Match a literal left paren, followed by 0 or more occurances of any
character (which would include more left and right parents) followed
by a literal right paren.

I don't know of a way to write a single regular expression which would
do what you are trying to do in a general way. Certainly I can
envision a regular expression to parse the specified string to give
you what you are asking for. It would, however, be unlikely to work
in general.


--
"I know of vanishingly few people ... who choose to use ksh." "I'm a minority!"
<URL: mailto:lvi...@cas.org> <URL: http://www.purl.org/NET/lvirden/>
Even if explicitly stated to the contrary, nothing in this posting
should be construed as representing my employer's opinions.

Phil Powell

unread,
Nov 28, 2001, 5:41:00 PM11/28/01
to
I have had little success with embedded special characters and
regexp/regsub. In my frustration I wrote a proc that just might help
you out, its output is a list consisting of everything inside each
pair of special characters.

proc extractContents {blah char closeChar} {
set foundChar 0; set charCounter 0; set contents ""; set contentsList
""
for {set i 0} {$i < [string length $blah]} {incr i} {
if {[string index $blah $i] == $char && !$foundChar} {
set foundChar 1; incr charCounter
} elseif {[string index $blah $i] == $closeChar && $foundChar &&
$charCounter <= 1} {
set foundChar 0; incr charCounter -1
set contents [string trim $contents]
lappend contentsList "$contents"; set contents ""
} else {
if {[string index $blah $i] == $char} {
incr charCounter
} elseif {[string index $blah $i] == $closeChar} {
incr charCounter -1
}
append contents [string index $blah $i]
}
}

return $contentsList
}

set blah "(ab (cd)) (ef)"
set blah2 [extractContents $blah "(" ")"]; # This will produce {ab
(cd)} ef

I hope that helps
Phil

Michael Barth <M.B...@de.bosch.com> wrote in message news:<3C05086E...@de.bosch.com>...

Bruce Hartweg

unread,
Nov 28, 2001, 7:20:06 PM11/28/01
to
Michael Barth wrote:

finding matching pairs that can be nested can't be done with an RE, you need
to parse the string differently - juts go thru char by char- keeping track of nest
depth with each open paren & when you find final closing paren you're done.

If you just HAVE to use regexp you could do the equivalent by indices -
get position of all open & close parens & then loop thru closeList and find
first index that is greater than than same element in openList but less than
next element in openlist


** UNTESTED CODE ***

set match ""
set openList [regexp -all -inline -indices -- {\(} $string]
set closeList [regexp -all -inline -indices -- {\)} $string]
set i 0
foreach close $closeList {
set open1 [lindex $openList $i]
set open2 [lindex $openList [incr i]]
if {$open1 < $close && $close < $open2}{
set match [string range $string [incr $open1] [incr $close -1]]
break
}
}

Bruce


miguel sofer

unread,
Nov 28, 2001, 7:43:05 PM11/28/01
to
Bruce Hartweg wrote:
>
> Michael Barth wrote:
>
> > Is it possible to match pairs of characters (e.g braces, brackets,
> > parenthesis, ...) with the regexp command?
> >
> > Example:
> >
> > regexp {\((.*)\)} "{(ab (cd)) (ef)" -> match
> >
> > In the example match is 'ab (cd)) (ef' but it should be 'ab (cd)'. How
> > can you do this?
>
> finding matching pairs that can be nested can't be done with an RE, you need
> to parse the string differently - juts go thru char by char- keeping track of nest
> depth with each open paren & when you find final closing paren you're done.


A nice alternative (IF the input string is guaranteed to be "well
behaved" and the amount of white space is inconsequential) is to let the
tcl list parser do the job:

set newStr [string map {( \{ ) \}} $inputString]

set firstMatch [string map {\{ ( \} )} [lindex $newStr 0]]

That is a big IF though ...

Miguel Sofer

Arjen Markus

unread,
Nov 29, 2001, 2:49:29 AM11/29/01
to
lvi...@yahoo.com wrote:
>
> According to Michael Barth <M.B...@de.bosch.com>:
> :regexp {\((.*)\)} "{(ab (cd)) (ef)" -> match
> :
> :In the example match is 'ab (cd)) (ef' but it should be 'ab (cd)'. How
> :can you do this?
> :
> Actually, the regular expression matched exactly what you said to:
>
> Match a literal left paren, followed by 0 or more occurances of any
> character (which would include more left and right parents) followed
> by a literal right paren.
>
> I don't know of a way to write a single regular expression which would
> do what you are trying to do in a general way. Certainly I can
> envision a regular expression to parse the specified string to give
> you what you are asking for. It would, however, be unlikely to work
> in general.
>

I recently posted some code that will look for someting like this. If
you
are interested, I can send it again.

Regards,

Arjen

Michael Barth

unread,
Nov 29, 2001, 3:31:18 AM11/29/01
to
All of your answers show me what I already expexted. It is impossible to
match nested pairs with the regexp command. So, the way to do it is to
go through the string and count the pairs up and down.

To do it in this manner you have posted a lot of ideas which will help
me solving my problem. Thank's a lot.

Michael

Donal K. Fellows

unread,
Nov 29, 2001, 6:57:08 AM11/29/01
to
Michael Barth wrote:
> All of your answers show me what I already expexted. It is impossible to
> match nested pairs with the regexp command.

Yes. This is a hard theoretical limit on what RE's can do, and you need
a different kind of matcher (with a stack or counter) to go beyond it.

However, Tcl does give you exactly the right tools for doing this.

> So, the way to do it is to
> go through the string and count the pairs up and down.

Try this, which returns all matches:

set s "(ab (cd)) (ef)"

set result {}
while {[regexp -indices -- {\(([^()]*)\)} $s rep where]} {
set sub [eval [list string range $s] $where]
lappend result $sub
set s [eval [list string replace $s] $rep [list \x80$sub\x81]]
}
set result [string map {\x80 ( \x81 )} $result]

Donal.
--
Donal K. Fellows http://www.cs.man.ac.uk/~fellowsd/ fell...@cs.man.ac.uk
-- Anyone using MFC desperatly needs a nasal enigma.
-- David Steuber <tras...@david-steuber.com>

Michael Barth

unread,
Nov 30, 2001, 6:48:25 AM11/30/01
to
"Donal K. Fellows" wrote:
[...]

> Try this, which returns all matches:
>
> set s "(ab (cd)) (ef)"
>
> set result {}
> while {[regexp -indices -- {\(([^()]*)\)} $s rep where]} {
> set sub [eval [list string range $s] $where]
> lappend result $sub
> set s [eval [list string replace $s] $rep [list \x80$sub\x81]]
> }
> set result [string map {\x80 ( \x81 )} $result]
>
> Donal.

That's very near on the behaviour I need. I like it. I will do it in
this way. Thank you.

If the code is finished and I am no more as much buzy, I'll report a
little bit more on my results and why it would be nice to have a
solution with regexp.

Michael


> --
> Donal K. Fellows http://www.cs.man.ac.uk/~fellowsd/ fell...@cs.man.ac.uk
> -- Anyone using MFC desperatly needs a nasal enigma.
> -- David Steuber <tras...@david-steuber.com>

*grin*

Michael Barth

unread,
Dec 4, 2001, 6:19:01 AM12/4/01
to
The solution I realized is similar to Donal's suggestion:

proc nestingpairs {stringname {open \{} {close \}}} {
# input: stringname name of the string to edit (call by ref)
# open opening character
# close closing character
# output: stringname substitude all pairs with @n
# return: $return substitution list
upvar $stringname string
set string [string map [list $open \x80 $close \x81] $string]
set RE {\x80([^\x80\x81]*)\x81}

set result {}
set count 0
while {[regexp -indices -- $RE $string rep where]} {
lappend result [eval [list string range $string] $where]
set string [eval [list string replace $string] $rep @$count]
incr count
}
return $result
}


The procedure replaces the nested pairs of the string stringname with
@n, where n is the index pointing to the replacement in the list of
return value.

The problem I had to solve was a little parser to interprete the header
of MAST code. Therefore I wrote some elements I need for parsing:

set blank {[[:blank:]]}
set space {[[:space:]]+}
set newline {\n}
set nonewline {[^\n]+}
set name {[[:alpha:]][[:alnum:]_]*}
set integer {[0-9]+}
set real {[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?}

set separ1 "(?:${blank}+|${blank}*,${blank}*)"
set separ2 "${blank}*,${blank}*"
set comment "(?n)#.*$"
set type "element(?:${blank}+component)?"
set node "${name}(?::${name})?"
set connects "(?:${noeqsign}+${node}(?:${separ1}${node})*)"
...

This is just like Bachus-Naur-Code to define syntax but in a way the
regexp command can understand:

regexp "($type)?template($name)($connections)" -> t n c

Unfortunatly regexp does not support nesting pairs of characters :-(
Therefore I replaced them with the procedure nestingpairs before
evaluating with regexp.

Michael

Reply all
Reply to author
Forward
0 new messages