Best way to extract from regex in if statement

bwgoudey

Apr 3, 2009, 9:14:41 PM

to pytho...@python.org

I have a lot of if/elif cases based on regular expressions that I'm using to
filter stdin and print to stdout. Often I want to print something
matched within the regular expression, and at the moment I've got a lot of cases
like:

...
elif re.match("^DATASET:\s*(.+) ", line):
    m=re.match("^DATASET:\s*(.+) ", line)
    print m.group(1)

which is ugly because of the duplication, but I can't think of a nicer way
of doing this that will allow for a lot of these sorts of cases. Any
suggestions?
--
View this message in context: http://www.nabble.com/Best-way-to-extract-from-regex-in-if-statement-tp22878967p22878967.html
Sent from the Python - python-list mailing list archive at Nabble.com.

Jon Clements

Apr 3, 2009, 9:56:54 PM

On 4 Apr, 02:14, bwgoudey <bwgou...@gmail.com> wrote:
> I have a lot of if/elif cases based on regular expressions that I'm using to
> filter stdin and print to stdout. Often I want to print something
> matched within the regular expression, and at the moment I've got a lot of cases
> like:
>
> ...
> elif re.match("^DATASET:\s*(.+) ", line):
>     m=re.match("^DATASET:\s*(.+) ", line)
>     print m.group(1)
>
> which is ugly because of the duplication, but I can't think of a nicer way
> of doing this that will allow for a lot of these sorts of cases. Any
> suggestions?

How about something like:

your_regexes = [
    re.compile('rx1'),
    re.compile('rx2'),
    # etc....
]

for line in lines:
    for rx in your_regexes:
        m = rx.match(line)
        if m:
            print m.group(1)
            # Stop at the first matching regex; leave the break out
            # if every regex should be tried against the line.
            break

Untested, but seems to make sense
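
A minimal runnable sketch of the loop above, written for Python 3 (the thread's own code is Python 2). The two patterns, the sample lines, and the helper name first_capture are assumptions made for illustration; they are not from the original post.

```python
import re

# Hypothetical patterns; each one captures a single group.
REGEXES = [
    re.compile(r"^DATASET:\s*(.+)"),
    re.compile(r"^AUTHORS:\s*(.+)"),
]

def first_capture(line, regexes=REGEXES):
    """Return group(1) of the first regex that matches, or None."""
    for rx in regexes:
        m = rx.match(line)
        if m:
            # Returning here plays the role of the break in the loop above.
            return m.group(1)
    return None

lines = ["DATASET: iris", "AUTHORS: Fisher", "unrelated text"]
results = [first_capture(line) for line in lines]
print(results)
```

Each pattern is compiled once and matched once per line, so the duplicated re.match call from the original question disappears.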

hth,

Jon

Tim Chase

Apr 3, 2009, 10:11:04 PM

to bwgoudey, pytho...@python.org

bwgoudey wrote:
> I have a lot of if/elif cases based on regular expressions that I'm using to
> filter stdin and print to stdout. Often I want to print something
> matched within the regular expression, and at the moment I've got a lot of cases
> like:
>
> ...
> elif re.match("^DATASET:\s*(.+) ", line):
>     m=re.match("^DATASET:\s*(.+) ", line)
>     print m.group(1)
>
>
> which is ugly because of the duplication, but I can't think of a nicer way
> of doing this that will allow for a lot of these sorts of cases. Any
> suggestions?

I've done this in the past with re-to-function pairings:

def action1(matchobj):
    print matchobj.group(1)
def action2(matchobj):
    print matchobj.group(3)
# ... other actions to perform

searches = [
    (re.compile(PATTERN1), action1),
    (re.compile(PATTERN2), action2),
    # other pattern-to-action pairs
]

# ...
line = ...
for regex, action in searches:
    m = regex.match(line)
    if m:
        action(m)
        break
else:
    no_match(line)

(note that that's a for/else loop, not an if/else pair)
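
For readers who want to run the pairing idea, here is a small Python 3 sketch. The patterns and the handler names (show_dataset, show_authors, dispatch) are hypothetical stand-ins for PATTERN1, action1 and so on, and no_match is given an assumed body because the post leaves it undefined.

```python
import re

def show_dataset(m):
    return "dataset=" + m.group(1)

def show_authors(m):
    return "authors=" + m.group(1)

def no_match(line):
    # Assumed fallback; the original post does not define it.
    return "unmatched: " + line

# Order matters: the first regex that matches wins.
SEARCHES = [
    (re.compile(r"^DATASET:\s*(.+)"), show_dataset),
    (re.compile(r"^AUTHORS:\s*(.+)"), show_authors),
]

def dispatch(line):
    for regex, action in SEARCHES:
        m = regex.match(line)
        if m:
            result = action(m)
            break
    else:
        # The else belongs to the for loop: it runs only when the loop
        # finished without hitting break, i.e. when nothing matched.
        result = no_match(line)
    return result

print(dispatch("DATASET: iris"))
print(dispatch("hello"))
```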

-tkc

George Sakkis

Apr 3, 2009, 10:24:19 PM

Or in case you want to handle each regexp differently, you can
construct a dict {regexp : callback_function} that picks the right
action depending on which regexp matched. As for how to populate the
dict, if most handlers are short expressions, lambda comes in pretty
handy, e.g.

{
    rx1: lambda match: match.group(1),
    rx2: lambda match: sum(map(int, match.groups())),
    ...
}
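
A runnable Python 3 sketch of such a dict, for illustration. The concrete patterns bound to rx1 and rx2, the dict name handlers, and the lookup function handle are assumptions; the snippet above only shows the dict literal.

```python
import re

# Hypothetical patterns bound to the names used in the dict above.
rx1 = re.compile(r"^NAME:\s*(\w+)")
rx2 = re.compile(r"^SUM:\s*(\d+)\s+(\d+)")

# Compiled pattern objects are hashable, so they can serve as dict keys.
handlers = {
    rx1: lambda match: match.group(1),
    rx2: lambda match: sum(map(int, match.groups())),
}

def handle(line):
    """Call the callback of the first regex that matches the line."""
    for rx, callback in handlers.items():
        m = rx.match(line)
        if m:
            return callback(m)
    return None

print(handle("NAME: bob"))
print(handle("SUM: 3 4"))
```

The two patterns here cannot match the same line, so the iteration order of the dict does not affect the result.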

If not, you can combine the handler definition with the mapping update
by using a simple decorator factory such as the following (untested):

def rxhandler(rx, mapping):
    rx = re.compile(rx)
    def deco(func):
        mapping[rx] = func
        return func
    return deco

d = {}

@rxhandler("^DATASET:\s*(.+) ", d)
def handle_dataset(match):
    ...

@rxhandler("^AUTHORS:\s*(.+) ", d)
def handle_authors(match):
    ...
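
The decorator factory can be exercised with a short Python 3 sketch like the one below. The handler bodies, the patterns without the trailing space, and the process function are assumptions added so the example runs end to end.

```python
import re

def rxhandler(rx, mapping):
    """Decorator factory: register func in mapping under the compiled rx."""
    compiled = re.compile(rx)
    def deco(func):
        mapping[compiled] = func
        return func  # the function itself is returned unchanged
    return deco

handlers = {}

@rxhandler(r"^DATASET:\s*(.+)", handlers)
def handle_dataset(match):
    return ("dataset", match.group(1))

@rxhandler(r"^AUTHORS:\s*(.+)", handlers)
def handle_authors(match):
    return ("authors", match.group(1))

def process(line):
    """Hypothetical driver: run the handler of the first matching regex."""
    for rx, func in handlers.items():
        m = rx.match(line)
        if m:
            return func(m)
    return None

print(process("AUTHORS: Fisher"))
```

Registration happens at definition time, so adding a new case only means writing one more decorated function.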

HTH,
George

Paul Rubin

Apr 3, 2009, 10:26:25 PM

bwgoudey <bwgo...@gmail.com> writes:
> elif re.match("^DATASET:\s*(.+) ", line):
>     m=re.match("^DATASET:\s*(.+) ", line)
>     print m.group(1)

Sometimes I like to make a special class that saves the result:

class Reg(object): # illustrative code, not tested
    def match(self, pattern, line):
        self.result = re.match(pattern, line)
        return self.result

Then your example would look something like:

save_re = Reg()
....
elif save_re.match("^DATASET:\s*(.+) ", line):
    print save_re.result.group(1)
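
A self-contained Python 3 sketch of the same idea in a full if/elif chain. The __init__ method, the classify function, and the two patterns are assumptions added for the example; only the match method comes from the post.

```python
import re

class Reg(object):
    def __init__(self):
        # Assumed addition: start with no result so the attribute always exists.
        self.result = None

    def match(self, pattern, line):
        # Store the match object (or None) and return it for use in a condition.
        self.result = re.match(pattern, line)
        return self.result

def classify(line):
    save_re = Reg()
    if save_re.match(r"^DATASET:\s*(.+)", line):
        return "dataset: " + save_re.result.group(1)
    elif save_re.match(r"^AUTHORS:\s*(.+)", line):
        return "authors: " + save_re.result.group(1)
    else:
        return "no match"

print(classify("DATASET: iris"))
```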

Tim Chase

Apr 3, 2009, 10:32:55 PM

to George Sakkis, pytho...@python.org

> Or in case you want to handle each regexp differently, you can
> construct a dict {regexp : callback_function} that picks the right
> action depending on which regexp matched.

One word of caution: dicts are unordered, so if more than one
regexp can match a given line, they either need to map to the
same function, or you need to use a list of regexp-to-function pairs
(see my previous post) for a deterministic order.
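
To illustrate why the order matters, here is a small Python 3 sketch with two overlapping, hypothetical patterns held in a list. (Dicts guarantee insertion order only from Python 3.7 on; a list makes the intended priority explicit on any version.)

```python
import re

# Overlapping patterns: the more specific one has to be tried first.
ORDERED = [
    (re.compile(r"^DATASET:\s*test"), "test dataset"),
    (re.compile(r"^DATASET:"), "generic dataset"),
]

def label(line, rules=ORDERED):
    """Return the label of the first rule whose regex matches."""
    for rx, name in rules:
        if rx.match(line):
            return name
    return "unknown"

print(label("DATASET: test-42"))
print(label("DATASET: prod"))
# With the rules reversed, the generic pattern shadows the specific one.
print(label("DATASET: test-42", list(reversed(ORDERED))))
```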

-tkc

Paul McGuire

Apr 4, 2009, 10:26:05 AM

On Apr 3, 9:26 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:

> bwgoudey <bwgou...@gmail.com> writes:
> > elif re.match("^DATASET:\s*(.+) ", line):
> >     m=re.match("^DATASET:\s*(.+) ", line)
> >     print m.group(1)
>
> Sometimes I like to make a special class that saves the result:
>
> class Reg(object): # illustrative code, not tested
>     def match(self, pattern, line):
>         self.result = re.match(pattern, line)
>         return self.result
>

I took this a little further, *and* lightly tested it too.

Since this idiom makes repeated references to the input line, I added
that to the constructor of the matching class.

By using __call__, I made the created object callable, taking the RE
expression as its lone argument and returning a boolean indicating
match success or failure. The result of the re.match call is saved in
self.matchresult.

By using __getattr__, the created object proxies for the results of
the re.match call.

I think the resulting code looks pretty close to the original C or
Perl idiom of cascading "elif (c=re_expr_match("..."))" blocks.
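
For comparison, on Python 3.8 and later (released well after this thread) assignment expressions allow that C-style idiom directly, with no helper class. The sketch below is illustrative only; the classify function and its patterns are assumptions.

```python
import re

def classify(line):
    # The := operator binds the match object inside the condition itself.
    if (m := re.match(r"\d+$", line)):
        return "numeric", m.group()
    elif (m := re.match(r"[a-z]+$", line)):
        return "lowercase", m.group()
    elif (m := re.match(r"[A-Z]+$", line)):
        return "uppercase", m.group()
    return "other", None

print(classify("123"))
print(classify("xyzzy"))
```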

(I thought about caching previously seen REs, or adding support for
compiled REs instead of just strings - after all, this idiom usually
occurs in a loop while iterating over some large body of text. It turns
out that the re module already caches previously compiled REs, so I
left my caching out in favor of that already being done in the std
lib.)

-- Paul

import re

class REmatcher(object):
    def __init__(self,sourceline):
        self.line = sourceline
    def __call__(self, regexp):
        self.matchresult = re.match(regexp, self.line)
        self.success = self.matchresult is not None
        return self.success
    def __getattr__(self, attr):
        return getattr(self.matchresult, attr)

This test:

test = """\
ABC
123
xyzzy
Holy Hand Grenade
Take the pebble from my hand, Grasshopper
"""

outfmt = "'%s' is %s [%s]"
for line in test.splitlines():
    matchexpr = REmatcher(line)
    if matchexpr(r"\d+$"):
        print outfmt % (line, "numeric", matchexpr.group())
    elif matchexpr(r"[a-z]+$"):
        print outfmt % (line, "lowercase", matchexpr.group())
    elif matchexpr(r"[A-Z]+$"):
        print outfmt % (line, "uppercase", matchexpr.group())
    elif matchexpr(r"([A-Z][a-z]*)(\s[A-Z][a-z]*)*$"):
        print outfmt % (line, "a proper word or phrase",
                        matchexpr.group())
    else:
        print outfmt % (line, "something completely different", "...")

Produces:
'ABC' is uppercase [ABC]
'123' is numeric [123]
'xyzzy' is lowercase [xyzzy]
'Holy Hand Grenade' is a proper word or phrase [Holy Hand Grenade]
'Take the pebble from my hand, Grasshopper' is something completely different [...]

Nick Craig-Wood

Apr 16, 2009, 3:14:10 AM

> import re
>
> class REmatcher(object):
>     def __init__(self,sourceline):
>         self.line = sourceline
>     def __call__(self, regexp):
>         self.matchresult = re.match(regexp, self.line)
>         self.success = self.matchresult is not None
>         return self.success
>     def __getattr__(self, attr):
>         return getattr(self.matchresult, attr)

That is quite similar to the one I use...

"""
Matcher class encapsulating a call to re.search for ease of use in conditionals.
"""

import re

class Matcher(object):
    """
    Matcher class

    m = Matcher()

    if m.search(r'add (\d+) (\d+)', line):
        do_add(m[0], m[1])
    elif m.search(r'mult (\d+) (\d+)', line):
        do_mult(m[0], m[1])
    elif m.search(r'help (\w+)', line):
        show_help(m[0])

    """
    def search(self, r, s):
        """
        Do a regular expression search and return the match object
        (None if there was no match).
        """
        self.value = re.search(r, s)
        return self.value
    def __getitem__(self, n):
        """
        Return the n'th matched () item.

        Note that indexing is zero-based, so the first matched item is matcher[0].
        """
        return self.value.group(n+1)
    def groups(self):
        """
        Return all the matched () items.
        """
        return self.value.groups()
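
A condensed Python 3 sketch showing the class in use. The evaluate function and its arithmetic stand in for the undefined do_add, do_mult and show_help from the docstring, and are assumptions made so the example can run.

```python
import re

class Matcher(object):
    """Trimmed copy of the class above: search plus zero-based group access."""
    def search(self, r, s):
        self.value = re.search(r, s)
        return self.value

    def __getitem__(self, n):
        # m[0] is the first parenthesised group, i.e. group(1).
        return self.value.group(n + 1)

def evaluate(line):
    m = Matcher()
    if m.search(r"add (\d+) (\d+)", line):
        return int(m[0]) + int(m[1])
    elif m.search(r"mult (\d+) (\d+)", line):
        return int(m[0]) * int(m[1])
    elif m.search(r"help (\w+)", line):
        return "help for " + m[0]
    return None

print(evaluate("add 2 3"))
print(evaluate("mult 4 5"))
```

Because this class uses re.search rather than re.match, the pattern may match anywhere in the line, not only at the start.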

--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick