
matching multiple regexs to a single line...


Alexander Sendzimir

Nov 12, 2002, 3:04:31 PM

How does one match multiple regexs to a single line as in a compiler in
Python? I'm used to doing the following in Perl, for example:

$line = <>;

if ( $line =~ /regex1/ )
{...}
elsif ( $line =~ /regex2/ )
{...}
elsif (...)
{ etc... }

To do this in Python and be able to get at the match if a regex succeeds
means I have to break up the if's because I haven't been able to get
Python to make an assignment in an if-statement. In other words,

if ( myMatch = re1.match( line ) ) :
    ...
elif ( myMatch = re2.match( line ) ) :
    ...

doesn't work for me. Perhaps I've missed something simple.

Thanks,

Alex
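A footnote for later readers of this archived thread: assignment in a condition only became possible in Python 3.8, via assignment expressions (PEP 572). A sketch in modern Python, well outside the 2002 context of this discussion (the patterns and strings are illustrative):

```python
import re

re1 = re.compile(r'one\s+(.*)')
re2 = re.compile(r'two\s+(.*)')
line = 'two This is message two.'

# Python 3.8+ only: ":=" binds the match object inside the condition,
# which is exactly the shape Alexander is asking for.
if m := re1.match(line):
    result = m.group(1)
elif m := re2.match(line):
    result = 'second: ' + m.group(1)
else:
    result = None

print(result)
```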

John Hunter

Nov 12, 2002, 4:18:01 PM

>>>>> "Alexander" == Alexander Sendzimir <li...@battleface.com> writes:

Alexander> To do this in Python and be able to get at the match if a
Alexander> regex succeeds means I have to break up the if's because

That's what I usually end up doing

for line in myfile:
    m = r1.match(line)
    if m:
        do_something()
        break

    m = r2.match(line)
    if m:
        do_something_else()
        break

John Hunter


Alexander Sendzimir

Nov 12, 2002, 4:35:18 PM
John,

Thanks for your response. This is what I was afraid of. This seems really
sloppy. You don't think there is any better way of doing such a thing? In
the mean time, what you propose is exactly what I've been doing. Hmmmm.

Alex

Trent Mick

Nov 12, 2002, 5:21:48 PM
> On Tue, 12 Nov 2002 15:18:01 +0000, John Hunter wrote:
> > That's what I usually end up doing
> >
> > for line in myfile:
> >     m = r1.match(line)
> >     if m:
> >         do_something()
> >         break
> >
> >     m = r2.match(line)
> >     if m:
> >         do_something_else()
> >         break

[Alexander Sendzimir wrote]
> Thanks for your response. This is what I was afraid of. This seems really
> sloppy. You don't think there is any better way of doing such a thing? In
> the mean time, what you propose is exactly what I've been doing. Hmmmm.

A slight mod on John's code makes it seem pretty clean to me:

patterns = [re.compile('...'),
            re.compile('...')]

for line in myfile:
    for pattern in patterns:
        match = pattern.match(line)
        if match:
            # do something with 'match'
            break
    else:
        raise "none of the patterns matched"

This scales well for adding more patterns. I usually use named groups in
my Python regexs so the single "do something" block for matching any of
the patterns works itself out.
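A small self-contained sketch of what Trent describes (the patterns, group names, and handler here are invented for illustration): giving each alternative the same named groups lets one "do something" block serve all of them.

```python
import re

# Illustrative alternatives: "key=value" and "key: value" forms,
# both exposing the same named groups.
patterns = [re.compile(r'(?P<key>\w+)=(?P<value>\w+)'),
            re.compile(r'(?P<key>\w+):\s*(?P<value>\w+)')]

def handle(line):
    for pattern in patterns:
        match = pattern.match(line)
        if match:
            # A single block works for every pattern, because each
            # pattern exposes the same named groups.
            return match.group('key'), match.group('value')
    raise ValueError('none of the patterns matched: %r' % line)

print(handle('host=example'))
print(handle('port: 8080'))
```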


> ... because I haven't been able to get
> Python to make an assignment in an if-statement. In other words,
>
> if ( myMatch = re1.match( line ) ) :
> ...
> elif ( myMatch = re2.match( line ) ) :
> ...
>
> doesn't work for me. Perhaps I've missed something simple.

No, you didn't miss anything. You cannot use assignment in the test part
of a Python if-statement.


Trent

--
Trent Mick
Tre...@ActiveState.com

John Hunter

Nov 12, 2002, 5:55:08 PM

Trent> A slight mod on John's code makes it seem pretty clean to me:
Trent>
Trent> patterns = [re.compile('...'),
Trent>             re.compile('...')]

That does look nicer, but doesn't allow for differential processing
depending on which re instance matched. To do that, you can pair the
regexs with functions in a tuple of (rgx,action) pairs, or a dict if
order doesn't matter. Admittedly the code below is more verbose than
the perl version, but at least the design is clean.

import re

r1 = re.compile(r'John (\w+) (.*)')
r2 = re.compile(r'Bill')
r3 = re.compile(r'Sally (.*)')

def func2(m):
    return 'just a func'

actions = ((r1, lambda m: 'I found ' + m.group(2)),
           (r2, func2),
           (r3, lambda m: m.group(1)),
           )

s1 = 'Sally was here'
s3 = 'John was Bill'
s5 = 'Bill and sally'
s2 = 'John was here'
s4 = 'No chance in hell'

lines = (s1, s2, s3, s4, s5)

for line in lines:
    for (rgx, func) in actions:
        m = rgx.match(line)
        if m:
            print func(m)
            break
    else:
        print 'Nobody matched line: %s' % line


Noel Minet

Nov 12, 2002, 6:02:18 PM
You can also use a more general regex and check if a group is None

something like

>>> import re
>>> regex_set = {'digit': r'\d+', 'lowercase': r'[a-z]+', 'uppercase': r'[A-Z]+'}
>>> tst = re.compile(''.join(['(?P<%s>^%s$)?' % r for r in regex_set.items()]))
>>>
>>> m = tst.match('1234')
>>> m.group('digit')
'1234'
>>> m.group('uppercase')
>>> m.group('lowercase')
>>>
>>> m = tst.match('UPPER')
>>> m.group('digit')
>>> m.group('uppercase')
'UPPER'
>>> m.group('lowercase')
>>>
>>> m = tst.match('lower')
>>> m.group('digit')
>>> m.group('uppercase')
>>> m.group('lowercase')
'lower'
>>>
>>> m = tst.match('Mixed')
>>> m.group('digit')
>>> m.group('uppercase')
>>> m.group('lowercase')
>>>

The drawback is that you compile all regexes

HTH


"Alexander Sendzimir" <li...@battleface.com> wrote in message
news:pan.2002.11.12....@battleface.com...

Alexander Sendzimir

Nov 12, 2002, 7:19:12 PM

Trent,

Nice response. Thanks. This is more in line with what I'm looking for. I
will work with this and see what happens.

Thanks again. And again to John.

Alex

On Tue, 12 Nov 2002 14:21:48 +0000, Trent Mick wrote:
> A slight mod on John's code makes it seem pretty clean to me:
>
> patterns = [re.compile('...'),
>             re.compile('...')]
>
> for line in myfile:
>     for pattern in patterns:
>         match = pattern.match(line)
>         if match:
>             # do something with 'match'
>             break
>     else:
>         raise "none of the patterns matched"
>
> This scales well for adding more patterns. I usually use named groups in
> my Python regexs so the single "do something" block for matching any of
> the patterns works itself out.

> Trent

Bengt Richter

Nov 13, 2002, 12:23:00 AM

Some might consider this cheating, but you could make a class instance
that you can use to grab a value in an expression for later use. E.g.,

>>> import re
>>> re1 = re.compile(r'one\s+(.*)')
>>> re2 = re.compile(r'two\s+(.*)')
>>> line = 'two This is message two.'

>>> class V:
...     def __init__(self, v=None): self.v = v
...     def __call__(self, *v):
...         if v: self.v = v[0]
...         return self.v
...
>>> v = V()

>>> if v(re1.match(line)):
...     print v().group(1)
... elif v(re2.match(line)):
...     print 'Use it differently:', v().group(1)
...
Use it differently: This is message two.

Regards,
Bengt Richter

Alexander Sendzimir

Nov 19, 2002, 9:45:38 AM

# # # SOLUTION BASED ON PREVIOUS POSTS # # #


# For consistency, I've written all the notes as Python comments. So
# anything that's not a comment is code. The original problem is to
# match multiple regular expressions to a single line until one
# regular expression matches. The first solution was a brute force
# approach which entailed a (possibly long) series of match-if-break
# statements. See previous posts in this thread for example. After
# some consultation and experimentation, I've devised the following
# two solutions based on various input from Trent Mick and John
# Hunter. Thanks to both of them.

# Both solutions are of the same design. The second solution is an
# optimization if there are many regular expressions to match with
# corresponding actions to be taken. It uses a dictionary linking
# names to actions.

# The basic design is a list of tuples of the form (name, regex) where
# name is an arbitrary string identifying the regular expression
# regex. In the inner for-loop, the regular expression is matched to
# the current input line. If the match succeeds, then the match object
# (mo) is defined and the if-statement is true and falls through to
# the next if-statement. The second solution replaces this last
# if-statement with a dictionary lookup so that the 'do something'
# comment is replaced with a call to a single handler.

#
# F I R S T S O L U T I O N
#

regexs = [
    ( 'regex_id1', sre.compile( r'regex1' ) ),
    ( 'regex_id2', sre.compile( r'regex2' ) ),
    ( 'regex_id3', sre.compile( r'regex3' ) ),
    .
    .
    .
    ( 'regex_idN', sre.compile( r'regexN' ) ) ]


for line in lines :
    for regex in regexs :
        mo = regex[1].match( line )
        if mo :
            if ( 'regex_id1' == regex[0] ) :
                # do something
                break
            elif ( 'regex_id2' == regex[0] ) :
                # do something
                break
            elif ( 'regex_id3' == regex[0] ) :
                # do something
                break
            .
            .
            .
            elif ( 'regex_idN' == regex[0] ) :
                # do something
                break
        else :
            pass


#
# S E C O N D S O L U T I O N
#

#
# define the handlers
#

def handler_id1() :
    pass

def handler_id2() :
    pass

def handler_id3() :
    pass

.
.
.

def handler_idN() :
    pass


regexs = [
    ( 'regex_id1', sre.compile( r'regex1' ) ),
    ( 'regex_id2', sre.compile( r'regex2' ) ),
    ( 'regex_id3', sre.compile( r'regex3' ) ),
    .
    .
    .
    ( 'regex_idN', sre.compile( r'regexN' ) ) ]


#
# define the dictionary of ids --> handlers
#

regex_actions = {
    'regex_id1' : handler_id1,
    'regex_id2' : handler_id2,
    'regex_id3' : handler_id3,
    .
    .
    .
    'regex_idN' : handler_idN }


for line in lines :
    for regex in regexs :
        mo = regex[1].match( line )
        if mo :
            regex_actions[ regex[0] ]( optional_arguments )
            break
        else :
            pass


# All this done and said, I wonder if it would be useful to have the
# capacity to assign a delegate method to a regular expression object?
# If the expression matches, then the delegate is called. If no
# delegate is assigned, then, of course, no action is taken which is
# the usual behavior.
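The delegate idea Alexander floats in his closing comment can be sketched with a tiny wrapper class. This class and its API are hypothetical, invented for illustration; nothing like it exists in the re module itself.

```python
import re

class DelegatingRegex:
    """Pair a compiled pattern with an optional callable delegate."""
    def __init__(self, pattern, delegate=None):
        self.regex = re.compile(pattern)
        self.delegate = delegate

    def match(self, line):
        mo = self.regex.match(line)
        if mo and self.delegate:
            return self.delegate(mo)   # delegate is called on a match
        return None                    # no match, or no delegate assigned

rx = DelegatingRegex(r'(\d+)', delegate=lambda mo: int(mo.group(1)))
print(rx.match('42 apples'))
```

When no delegate is assigned, a match takes no action, matching the "usual behavior" described above.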

Alex Martelli

Nov 19, 2002, 10:39:14 AM
Alexander Sendzimir wrote:
...

> regexs = [
>     ( 'regex_id1', sre.compile( r'regex1' ) ),
>     ( 'regex_id2', sre.compile( r'regex2' ) ),
>     ( 'regex_id3', sre.compile( r'regex3' ) ),

Not sure why you're using sre here instead of re. Anyway,
a MUCH faster way is to build a single RE pattern by:

onerepat = '(' + ')|('.join([r.pattern for n, r in regexs]) + ')'
onere = re.compile(onerepat)

of course, it would be even faster without that wasteful compile
to put the compiled-re into regexs followed by having to use the
r.pattern attribute to recover the starting pattern, but anyway...
Then, bind matchobj = onere.match(line) and check matchobj.lastindex.

This doesn't work if the patterns define groups of their own, but
a slightly more sophisticated approach can help -- use _named_
groups for the join...


Alex
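The named-group variant Alex alludes to might look like the following sketch. The identifiers and patterns are placeholders, and it assumes the individual patterns define no groups of their own.

```python
import re

# Hypothetical (identifier, pattern) pairs in the spirit of the thread.
regexs = [('number', r'\d+'),
          ('word',   r'[a-z]+'),
          ('upper',  r'[A-Z]+')]

# Wrap each pattern in a uniquely *named* group before joining with '|',
# so the match object itself reports which alternative fired.
onerepat = '|'.join('(?P<%s>%s)' % (name, pat) for name, pat in regexs)
onere = re.compile(onerepat)

for text in ('123', 'abc', 'XYZ'):
    mo = onere.match(text)
    print(text, mo.lastgroup)
```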

Alexander Sendzimir

Nov 19, 2002, 6:49:42 PM

# Alex, the reason I used sre is because I feel it is the right module to use
# with Python 2.2.2. Re is a wrapper. However, my knowledge in this area is
# limited and could stand to be corrected.

# I should have stated in my last post, that if speed is an issue,
# then I might not be coding in Python but rather a language that
# compiles to a real processor and not a virtual machine.

# I will say that what I don't like about the approach you propose is that it
# throws information away. Functionally, it leads to brittle code which is hard
# to maintain and can lead to errors which are very hard to track down if one
# is not familiar with the implementation.

# This now said, yours is definitely a fast approach and under the right
# circumstances would be very useful. In my experience, clear writing takes
# precedence over clever implementation (most of the time). There are
# exceptions.

# As for the code itself, the lastindex method counts groups. If
# there are groups within any of the regular expressions defined, then
# lastindex is non-linear with respect to the intended expressions.
# So, it becomes very difficult to determine which outermost expression
# lastindex refers to. Perhaps you can see the maintenance problems
# arising here?

# The code included here produces the following output:

# (abcd(cd){1,4})|(abcd)|(abcd(cd)?)|(ab)|(zh192)
# ab1234 6
# abcd1234 3
# abcdcd 1
# abcdcdcdcd 1
# zh192 7
# abcdefghij 3

# I've labelled the groups below for reference.

# (abcd(cd){1,4})|(abcd)|(abcd(cd)?)|(ab)|(zh192)
# 1    2          3      4    5      6    7

# As you can see some expressions are 'identified' by more than one
# group. This is not desirable and is difficult to maintain. If you
# change the grouping in any of the regular expressions, then you have
# some possibly heavy code changes to make down the line. The approach
# you commented on is more generalized and easier to change with required
# adjustments falling into place.

# There might a few other issues. However, I haven't had time to fully
# explore them.

# Thanks, Alex.

# Alex. ;-)


import re
import sys

lines = [
    r'ab1234',
    r'abcd1234',
    r'abcdcd',
    r'abcdcdcdcd',
    r'zh192',
    r'abcdefghij' ]

regex_patterns = [
    r'abcd(cd){1,4}',
    r'abcd',
    r'abcd(cd)?',
    r'ab',
    r'zh192' ]


regex_bigpat = '(' + ')|('.join( regex_patterns ) + ')'
print regex_bigpat

regex = re.compile( regex_bigpat )

for line in lines :
    mo = regex.match( line )
    if mo :
        print line, mo.lastindex
    else :
        print line, 'not matched'


sys.exit( 0 )

jdhu...@ace.bsd.uchicago.edu

Nov 19, 2002, 7:16:02 PM

Wouldn't it be a simpler and more readily maintainable design to drop the
regex_ids and use a list of tuples where the first element is a rgx
and the second element is an action? In your version you have to
maintain a dictionary and a list, as well as (somewhat superfluous)
regex ids.

import re # you do want the re wrapper, not sre
lines = map(str,range(3)) # some dummy lines for testing
handler1 = handler2 = handler3 = lambda x: None # some very dumb handlers

regexs = (
    ( re.compile( r'regex1' ), handler1 ),
    ( re.compile( r'regex2' ), handler2 ),
    ( re.compile( r'regex3' ), handler3 ),
    )

for line in lines :
    for (rgx, action) in regexs:
        m = rgx.match( line )
        if m: action(m)

John Hunter

Alexander Sendzimir

Nov 19, 2002, 8:27:33 PM
Come to think of it, you're right!

The regex ids are a hold over. Whether using a list or a
dictionary, regex and handler are sufficient.

Why re and not sre?

Thanks, John.


Alex

John Hunter

Nov 19, 2002, 9:17:02 PM
>>>>> "Alexander" == Alexander Sendzimir <li...@battleface.com> writes:

Alexander> Why re and no sre?

As I understand it, sre is an update to the regex engine of 1.5.2.
The module re.py is a compatibility wrapper which imports sre. From
python 2.2.2's re.py:

engine = "sre"
# engine = "pre"

if engine == "sre":
    # New unicode-aware engine
    from sre import *
    from sre import __all__
else:
    # Old 1.5.2 engine. This one supports 8-bit strings only,
    # and will be removed in 2.0 final.
    from pre import *
    from pre import __all__

So there is nothing wrong with importing sre, but import re does the
same thing on any reasonably modern system and is (AFAIK) The Way
To Do It <wink>.

From the Library Reference:

*Implementation note:* The `re' module has two distinct
implementations: `sre' is the default implementation and includes
Unicode support, but may run into stack limitations for some
patterns. Though this will be fixed for a future release of
Python, the older implementation (without Unicode support) is
still available as the `pre' module.

John Hunter


Alex Martelli

Nov 20, 2002, 4:59:47 AM
Alexander Sendzimir wrote:

> # Alex, the reason I used sre is because I feel it is the right module to
> # use with Python 2.2.2. Re is a wrapper. However, my knowledge in this
> # area is limited and could stand to be corrected.

sre is an implementation detail, not even LISTED among the modules at:

http://www.python.org/doc/current/lib/modindex.html

Depending on an implementation detail, not documented except for a
mention in an "implementation note" in the official Python docs, seems
a weird choice to me -- and I'm not sure what other knowledge except
what can easily be obtained by a cursory scan of Python's docs is
needed to show that.


> # I should have stated in my last post, that if speed is an issue,
> # then I might not be coding in Python but rather a language that
> # compiles to a real processor and not a virtual machine.

If you had mentioned that, I would have noticed that this is anything but
obvious when regular expression matching is a substantial part of the
computational load: when the process is spending most of its time in the re
engine, the quality of said engine may well dominate other performance
issues. That's quite an obvious thing, of course.


> # I will say that what I don't like about the approach you propose is that
> # it throws information away. Functionally, it leads to brittle code which
> # is hard to maintain and can lead to errors which are very hard to track
> # down if one is not familiar with the implementation.

I suspect it would be possible to construct utterances with which I
could disagree more thoroughly than I disagree with this one, but
it would take some doing. I think that, since re patterns support
the | operator, making use of that operator "throws away" no
information whatsoever and hence introduces no brittleness.


> # This now said, yours is definitely a fast approach and under the right
> # circumstances would be very useful. In my experience, clear writing
> # takes precedence over clever implementation (most of the time). There
> # are exceptions.

I do agree with this, and I think my coding is quite clear for
anybody with a decent grasp of regular expressions -- and people
WITHOUT such a grasp should simply stay away from RE's (for
production use) until they've acquired said grasp.


> # As for the code itself, the lastindex method counts groups. If
> # there are groups within any of the regular expressions defined, then

Really?! Oh, maybe THAT was why I had this in my post:

> This doesn't work if the patterns define groups of their own

which you didn't even have the decency to *QUOTE*?! Cheez... are
you so taken up with the "cleverness" of reducing your posts'
readability by making most of them into comments, that ordinary
conventions of Usenet, such as reading what you're responding to,
and quoting relevant snippets, are forgotten...?!

> # lastindex is non-linear with respect to the intended expressions.
> # So, it becomes very difficult to determine which outermost expression
> # lastindex refers to. Perhaps you can see the maintenance problems
> # arising here?

No, I always say "this doesn't work" when referring to something
I can't see *ANY* problems with -- doesn't everybody? [Not even
worth an emoticon, and I can't find one for "sneer", anyway...]

> # I've labelled the groups below for reference.

If instead of "labeling" in COMMENTS, you took the tiny trouble
of studying the FUNDAMENTALS of Python's regular expressions, and
named groups in particular, you might be able to see how to "label"
*IN THE EXPRESSION ITSELF*. As I continued in my post which you
didn't bother quoting,

> a slightly more sophisticated approach can help -- use named
> groups for the join...

I may as well complete this (even though, on past performance,
you may simply ignore this, not quote anything, and post a
"huge comment" re-saying what I already said -- there MIGHT be
some more normal people following this thread...:-): if the
re-patterns you're joining may in turn contain named groups,
you need to make the outer grouping naming unique, by any of
the usual "naming without a namespace" idioms such as unique
prefixing. Usual generality/performance tradeoffs also apply.

Most typically, documenting that the identifiers the user
passes must not be ALSO used as group names in the patterns
the user also passes will be sufficient -- doing things by
sensible convention rather than by mandated fiat is Pythonic.

In many cases it may not be a problem for client code to avoid
using groups in the RE patterns, and then the trivially simple
solution I gave last time works -- if you want, you can check
whether that's the case, via len(mo.groups()) where mo is the
match object you get, and fallback to the more sophisticated
approach, or raise a suitable exception, etc, otherwise.

In some other cases client code may not only need to have no
constraints against using groups in the RE patterns, but also
want "the match-object" as an argument to the action code. In
this case, an interesting approach is to synthesize a polymorphic
equivalent of the match object that would result without the
or-joining of the original patterns -- however, the tradeoffs
in terms of complication, performance, and generality, are a bit
different in this case.


> # (abcd(cd){1,4})|(abcd)|(abcd(cd)?)|(ab)|(zh192)
> # 1 2 3 4 5 6 7
>
> # As you can see some expressions are 'identified' by more than one
> # group. This is not desirable and is difficult to maintain. If you

Nope, that's not a problem at all -- the re module deals just
fine with nested groups. The issue is, rather, that since group
_numbering_ is flat, i.e. ignores nesting, the group numbers
corresponding to the outer-grouping depend on what other groups
are used in the caller-supplied re patterns. Identifying which
is the widest (thus from the outermost set of groups) group that
participated in the match is easy (go backwards from lastindex
0 or more steps to the last group that participated in the match)
but unless you've associated a name with that group it's then
non-trivial to recover the associated outer-identifier.


Alex
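A sketch of the unique-prefix idiom Alex describes, under the stated convention that caller-supplied patterns never use the `outer_` prefix themselves (the identifiers and patterns below are invented for illustration):

```python
import re

# Caller-supplied patterns that *do* contain unnamed groups of their own.
regexs = [('amount', r'(\d+)\.(\d+)'),
          ('name',   r'([A-Z])(\w+)')]

# The unique-prefix idiom: each outer group gets an 'outer_' name that
# the caller's patterns are assumed, by convention, never to use.
pat = '|'.join('(?P<outer_%s>%s)' % (n, p) for n, p in regexs)
onere = re.compile(pat)

mo = onere.match('Smith')
# groupdict() holds only the *named* groups, so the participating outer
# group can be recovered despite the unnamed nested groups.
matched = [n for n, v in mo.groupdict().items() if v is not None]
print(matched)
```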

ma...@pobox.com

Nov 20, 2002, 11:37:08 AM
Alex Martelli <al...@aleax.it> wrote:

> Alexander Sendzimir wrote:
>> # I will say that what I don't like about the approach you propose is that
>> # it throws information away. Functionally, it leads to brittle code which
>> # is hard to maintain and can lead to errors which are very hard to track
>> # down if one is not familiar with the implementation.
>
> I suspect it would be possible to construct utterances with which I
> could disagree more thoroughly than I disagree with this one, but
> it would take some doing. I think that, since re patterns support
> the | operator, making use of that operator "throws away" no
> information whatsoever and hence introduces no brittleness.

Just to prove that someone else is reading this thread (barely)... :-)

I can somewhat agree with Alexander, and I think you're overlooking the
way that this approach does force the regexps and their associated
handler code apart in the source code. Most of his objections, yes,
they're misguided, but the loss of the simple nearness of the pattern
with the action is one legitimate complaint against the
all-in-one-regex approach to this.

Of course, the reason the Perl pattern he started with was so clean and
straightforward was due to a globally shared "result of last re
applied" state, wasn't it? Talk about frangible design! OTOH, it does
fall into line with doing things by sensible convnetion rather than
strict rules, which you say (elsewhere, not quoted in reply) is Pythonic...

> some more normal people following this thread...:-): if the

I refuse to admit that I am "normal", please. :-)

> In some other cases client code may not only need to have no
> constraints against using groups in the RE patterns, but also
> want "the match-object" as an argument to the action code. In

This is one point on which I agree with Alexander: it seems to me to be
the usual case that the regexp both identifies and parses the target.
If it isn't being used in that dual mode, then the whole issue
addressed here (and in at least two other threads during the past week)
doesn't exist.

Alex Martelli

Nov 20, 2002, 6:25:55 PM
ma...@pobox.com wrote:
...

> I can somewhat agree with Alexander, and I think you're overlooking the
> way that this approach does force the regexps and their associated
> handler code apart in the source code. Most of his objections, yes,

So did the approach in the post I was replying to -- there was a list
of pairs, each made up of an identifier and a compiled re object, and
the code was elsewhere. Personally I like this approach, which is why
I accepted it as posted and built on it -- so I don't see why you're
claiming that this is a differentiator between Alexander's approach
and mine.


>> In some other cases client code may not only need to have no
>> constraints against using groups in the RE patterns, but also
>> want "the match-object" as an argument to the action code. In
>
> This is one point on which I agree with Alexander: it seems to me to be
> the usual case that the regexp both identifies and parses the target.

It's one case, but it's very far from being the only one.

> If it isn't being used in that dual mode, then the whole issue
> addressed here (and in at least two other threads during the past week)
> doesn't exist.

Why doesn't it? Alexander's post to which I was replying had no
groups at all in the regex patterns -- are you claiming he was
utterly missing the point of the thread, and his example totally
irrelevant?


Alex

Alexander Sendzimir

Nov 20, 2002, 9:35:28 PM

So, with the help of those that responded, I'm posting my preferred
solution to the original question of this thread which is, "how does one
apply multiple regexs to a single line as in a compiler in Python?"

There are several ways of solving this problem and each has its merits.
Please see my NOTE AT END about this.

Here's the generalized version:

<code>

import re

#
# define handlers for each matching pattern:
#

def reHandler1 ( matchObject ) :
    # do something with matchObject
    pass

def reHandler2 ( matchObject ) :
    # do something with matchObject
    pass

.
.
.

def reHandlerN ( matchObject ) :
    # do something with matchObject
    pass


#
# associate the regular expressions with their handlers
# in a list of tuples (pairs). The order of the pairs is
# the order in which matching will take place. Like lex.
#

regexs = [
    ( re.compile( r'...' ), reHandler1 ),
    ( re.compile( r'...' ), reHandler2 ),
    .
    .
    .
    ( re.compile( r'...' ), reHandlerN ) ]


#
# now, match each regular expression in turn against the current line.
# When a match is found, break to the next line. lines comes from
# wherever you want it to.
#

for line in lines :
    for ( regex, handler ) in regexs :
        mo = regex.match( line )
        if mo :
            handler( mo )
            break
        else :
            # did not match this expression
            pass

</code>

NOTE AT END

There are a number of different ways of solving this problem. Several have
been discussed in this thread. This is my preferred approach because it's
clean and simple in my opinion and makes use of a minimal knowledge
foundation. I like the fact that it is made up of only three parts: [1] the
handler definitions, [2] the regexs list, and [3] the matching loops.
Finally, each of these parts is consistent and relatively easy to understand.
I'm posting this solution because I feel it might be of use to others whether
new or experienced since I haven't been able to find any relevant information
on this topic in this news group.


ma...@pobox.com

Nov 20, 2002, 10:11:23 PM
Alex Martelli <al...@aleax.it> wrote:
> I accepted it as posted and built on it -- so I don't see why you're
> claiming that this is a differentiator between Alexander's approach
> and mine.

Because I had been trying to resist this thread, so when I gave in I
was responding somewhat to the whole context... and overlooked that
Alexander had first introduced that. Quite possibly I didn't read that
one all the way to that code?

>> This is one point on which I agree with Alexander: it seems to me to be
>> the usual case that the regexp both identifies and parses the target.
>
> It's one case, but it's very far from being the only one.

Right, and I didn't say it was. I notice you aren't disagreeing that
it's a very common style of use, and this both-identifying-and-parsing
use has been at least implied all along. It was explicitly described,
though not explicitly shown in the pseudo-code snippets, back in the
first post in this thread.

>> If it isn't being used in that dual mode, then the whole issue
>> addressed here (and in at least two other threads during the past week)
>> doesn't exist.
>
> Why doesn't it? Alexander's post to which I was replying had no
> groups at all in the regex patterns

Not in the sample code, no. There were groups in a lot of non-code
exposition, and as I say, they've been implicitly and explicitly a
major part of the motivation for this, at least IMO. After all, if you
don't have any groups, what do you need the match object for outside of
the conditional test? And then the motivating problem, or at least
what I have seen as the motivating problem, is the dual use of the
result of the regexp's application.

The elephant seems very like a pillar on this side... :-)

Alex Martelli

Nov 21, 2002, 6:20:37 AM
ma...@pobox.com wrote:
...

>>> This is one point on which I agree with Alexander: it seems to me to be
>>> the usual case that the regexp both identifies and parses the target.
>>
>> It's one case, but it's very far from being the only one.
>
> Right, and I didn't say it was. I notice you aren't disagreeing that
> it's a very common style of use,

No, I don't disagree it's common -- that's why I mentioned it myself!

> and this both-identifying-and-parsing
> use has been at least implied all along. It was explicitly described,
> though not explicitly shown in the pseudo-code snippets, back in the
> first post in this thread.

Guess I must have missed that, because I saw no "at least implied".
For such a crucial issue, making such an important difference, I find
myself hard put to believe that people would "just imply" it without
even showing it. So maybe there was a lot more talking at cross-purposes
throughout the thread -- you yourself remarked that you were commenting
on a post you "quite possibly didn't read all the way", and that kind
of thing seems to ensure there will be lots of misunderstanding.


>>> If it isn't being used in that dual mode, then the whole issue
>>> addressed here (and in at least two other threads during the past week)
>>> doesn't exist.
>>
>> Why doesn't it? Alexander's post to which I was replying had no
>> groups at all in the regex patterns
>
> Not in the sample code, no. There were groups in a lot of non-code
> exposition, and as I say, they've been implicitly and explicitly a
> major part of the motivation for this, at least IMO. After all, if you
> don't have any groups, what do you need the match object for outside of
> the conditional test?

The match object gives you more than just groups you put in the
pattern! It's quite typical, for example, that all you need to
know is the exact substring that was matched:

>>> import re
>>> are=re.compile('ab+c')
>>> mo = are.match('abbbcccdd')
>>> mo.group(0)
'abbbc'
>>>

See? No explicit groups in the re's pattern, yet pretty obvious
potential usefulness of the resulting match-object anyway.

So, if your question was meant to be rhetorical, it seems particularly
inappropriate to me. If instead it was asked in earnest, to learn
crucial facts you didn't know about how re's and mo's work or about
how they're often used, then I think you might usefully have refrained
from criticizing what you suspected you didn't fully understand. But
I do gather that trying to understand a subject before criticising
is quite an obsolete approach these days.

> And then the motivating problem, or at least
> what I have seen as the motivating problem, is the dual use of the
> result of the regexp's application.

You can have some level of dual use without _necessarily_ having any
group in the res' patterns. Of course you may often be interested
in having groups, but your repeated attempts to imply that it would
be _necessarily_ so seem quite misplaced to me.

> The elephant seems very like a pillar on this side... :-)

That can be a particularly dangerous partial-perception, should
the pachyderm suddenly decide to shuffle its feet.


I have already outlined quite briefly what I think could be one
interesting approach should one need to pass on a match object
that 'hides' the join-all-with-| approach to matching multiple re
patterns in one gulp -- synthesize a suitable object polymorphic
to a real matchobject. A completely different tack, simpler though
needing some time measurement to check its worth, is to do TWO
matches (still better than doing N one after the other...) --
one vs the patterns joined into one by |, just to identify which
of them is the first (if any) to match; then a second versus the
specific "first matching pattern" only, just to build the match
object one needs. At this point I'm not highly motivated to spend
more time and energy trying to help out with this, so I think I'll
leave it at this "suggestive and somewhat helpful handwaving" level
unless some other reader, perceptive enough to SEE how vastly
superior the join-all-patterns approach is, but needing to get
the specific match-object too, should express interest. As far
as I'm concerned, people who still can't see it probably don't want
to, and so they're welcome to wear out their CPUs looping uselessly
to their hearts' content.
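A minimal sketch of that two-pass tack (the pattern strings and handler names here are invented for illustration, and it assumes the individual patterns contain no capturing groups of their own):

```python
import re

# Hypothetical patterns and handlers -- illustrative only.
patterns = [r'\d+', r'[a-z]+']
handlers = [lambda mo: ('num', mo.group(0)),
            lambda mo: ('word', mo.group(0))]

# First pass: one combined match, just to learn WHICH pattern fires.
combined = re.compile('|'.join('(%s)' % p for p in patterns))

def dispatch(line):
    mo = combined.match(line)
    if mo is None:
        return None
    which = mo.lastindex - 1       # index of the alternative that matched
    # Second pass: re-match with the specific pattern, so the resulting
    # match object has that pattern's own group numbering.
    specific = re.compile(patterns[which]).match(line)
    return handlers[which](specific)

dispatch('42 apples')   # -> ('num', '42')
dispatch('apples')      # -> ('word', 'apples')
```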


Alex

John Hunter

unread,
Nov 22, 2002, 1:51:16 AM11/22/02
to
>>>>> "Alex" == Alex Martelli <al...@aleax.it> writes:

Alex> So, if your question was meant to be rhetorical, it seems
Alex> particularly inappropriate to me. If instead it was asked
Alex> in earnest, to learn crucial facts you didn't know about how
Alex> re's and mo's work or about how they're often used, then I
Alex> think you might usefully have refrained from criticizing
Alex> what you suspected you didn't fully understand. But I do
Alex> gather that trying to understand a subject before
Alex> criticizing is quite an obsolete approach these days.

I, for one, am interested in learning crucial facts.

For me, the launching point of this thread was on "differential
processing of different match objects". That is, if a string matches
rgx1 (with its own particular subgroups), do process1. If it matches
rgx2, with a (possibly) different set of subgroups, do process2.

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=mailman.1037141776.20667.python-list%40python.org

That the match object processing functions in the example code in
previous posts did not explicitly process match objects differently
was a time and typing convenience, as illustrated in the comment
string of the example below:
(http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&safe=off&selm=mailman.1037751635.2665.python-list%40python.org):

handler1 = handler2 = handler3 = lambda x: None # some very dumb handlers

That said, I too am unsatisfied with the M*N performance of the (rgx,
func) pattern:

for line in lines[:M]:
    for rgx in regexs[:N]:
        mo = rgx.match(line)
        if mo: do_something(mo)
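Written out as runnable code (the patterns and handlers below are made up for the sketch, not taken from the thread), the (rgx, func) version of that loop might look like:

```python
import re

# Illustrative (rgx, func) pairs -- names and patterns are assumptions.
table = [
    (re.compile(r'(\d+)\s+(\w+)'),
     lambda mo: ('qty', int(mo.group(1)), mo.group(2))),
    (re.compile(r'#\s*(.*)'),
     lambda mo: ('comment', mo.group(1))),
]

def process(lines):
    results = []
    for line in lines:            # M lines ...
        for rgx, func in table:   # ... times N regexes: O(M*N) attempts
            mo = rgx.match(line)
            if mo:
                results.append(func(mo))
                break             # first matching regex wins
    return results

process(['3 apples', '# a note', '???'])
# -> [('qty', 3, 'apples'), ('comment', 'a note')]
```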

You've mentioned named rgx's in your previous posts. In the case of
differential processing of match objects based on the regex match, is
there a more efficient way to process the mo's than this M*N approach?

Aside from these details, let me just say that you (Alex) rock.
Thanks for your posts, your recipes, and all that.

John Hunter

Alex Martelli

unread,
Nov 22, 2002, 4:51:49 AM11/22/02
to
John Hunter wrote:
...

> That said, I too am unsatisfied with the M*N performance of the (rgx,
> func) pattern:
>
> for line in lines[:M]:
>     for rgx in regexs[:N]:
>         mo = rgx.match(line)
>         if mo: do_something(mo)

If you need to process the match objects for ALL the RE's that match,
I don't think you can do _substantially_ better in general.

> You've mentioned named rgx's in your previous posts. In the case of
> differential processing of match objects based on the regex match, is
> there a more efficient way to process the mo's than this M*N approach?

If in a given application the case of 'no matches' is very frequent,
then a first-pass check on the line to see whether it does match at
least one of the RE's may give practical advantages, but I don't think
it can possibly change the O() behavior, just potentially give better
multipliers. And if you need to process mo's for all matches with
the various RE's, rather than just the first match with one of the RE's
taken in some order of priority, then I think that's about it in
terms of the speedups that you can get (without getting into detailed
processing of the patterns involved, and even then, whether you can
get any benefit _at all_ depends on WHAT set of patterns you have).
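A sketch of that first-pass check using one combined regex with named groups (the names and patterns below are invented for illustration; each sub-pattern is assumed to contain no named groups of its own):

```python
import re

# Illustrative named patterns -- assumptions, not from the thread.
patterns = {'num': r'\d+', 'word': r'[a-z]+'}
combined = re.compile('|'.join('(?P<%s>%s)' % kv for kv in patterns.items()))

def classify(line):
    mo = combined.match(line)
    if mo is None:
        return None          # the frequent no-match case exits after ONE attempt
    name = mo.lastgroup      # which named alternative matched
    # If the sub-pattern's own groups are needed, do a second,
    # specific match against just that pattern:
    return name, re.match(patterns[name], line).group(0)

classify('42x')     # -> ('num', '42')
classify('!!')      # -> None
```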


Alex
