reusing parts of a string in RE matches?

John Salerno

unread,

May 10, 2006, 10:29:12 AM5/10/06

to

I probably should find an RE group to post to, but my news server at
work doesn't seem to have one, so I apologize. But this is in Python
anyway :)

So my question is, how can find all occurrences of a pattern in a
string, including overlapping matches? I figure it has something to do
with look-ahead and look-behind, but I've only gotten this far:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')
m = pattern.findall(string)

This matches all the 'ab' followed by an 'a', but it doesn't include the
'a'. What I'd like to do is find all the 'aba' matches. A regular
findall() gives four results, but really there are seven.

Is there a way to do this with just an RE pattern, or would I have to
manually add the 'a' to the end of the matches?

Thanks.

Murali

unread,

May 10, 2006, 10:52:42 AM5/10/06

to

John Salerno wrote:
> So my question is, how can find all occurrences of a pattern in a
> string, including overlapping matches? I figure it has something to do
> with look-ahead and look-behind, but I've only gotten this far:
>
> import re
> string = 'abababababababab'
> pattern = re.compile(r'ab(?=a)')
> m = pattern.findall(string)
>

Why not something like

import re
string = 'abababababababab'

pattern = re.compile(r"^aba")
ans = []
for x in xrange(len(string)):
m = pattern.match(string[x:])
if m: ans.append( (x+m.start(),x+m.end()))

# now ans is a list of pairs (p,q) where the substring string[p:q]
matches pattern

- Murali

Bo Yang

unread,

May 10, 2006, 11:13:57 AM5/10/06

to pytho...@python.org

John Salerno 写道:

> I probably should find an RE group to post to, but my news server at
> work doesn't seem to have one, so I apologize. But this is in Python
> anyway :)
>
> So my question is, how can find all occurrences of a pattern in a
> string, including overlapping matches? I figure it has something to do
> with look-ahead and look-behind, but I've only gotten this far:
>
> import re
> string = 'abababababababab'
> pattern = re.compile(r'ab(?=a)')
>

Below from the python library reference

*|(?=...)|*
Matches if ... matches next, but doesn't consume any of the string.
This is called a lookahead assertion. For example, Isaac (?=Asimov)
will match |'Isaac '| only if it's followed by |'Asimov'|.

> m = pattern.findall(string)
>
> This matches all the 'ab' followed by an 'a', but it doesn't include the
> 'a'. What I'd like to do is find all the 'aba' matches. A regular
> findall() gives four results, but really there are seven.
>
>

I try the code , but I give seven results !

BartlebyScrivener

unread,

May 10, 2006, 11:45:17 AM5/10/06

to

Right about now somebody usually jumps in and shows you how to do this
without using regex and using string methods instead.

I'll watch.

rd

John Salerno

unread,

May 10, 2006, 11:56:28 AM5/10/06

to

Bo Yang wrote:

>> This matches all the 'ab' followed by an 'a', but it doesn't include
>> the 'a'. What I'd like to do is find all the 'aba' matches. A regular
>> findall() gives four results, but really there are seven.
>>
>>
> I try the code , but I give seven results !

Sorry, I meant that findall() only returns 4 results when searching for
'aba', when there are actually seven instances of 'aba'. This doesn't
involve the look-ahead RE.

John Salerno

unread,

May 10, 2006, 11:57:37 AM5/10/06

to

Heh heh, I'm sure you're right, but this is more just an exercise for me
in REs, so I'm curious how you might do it, unless the answer is that
it's just too complicated to be worth it (like Murali's example!) That
goes beyond just an RE pattern.

BartlebyScrivener

unread,

May 10, 2006, 1:48:54 PM5/10/06

to

I have to at least try :)

s = "abababababababab"

for x in range(len(s)):
... try:
... s.index("aba", x, x + 3)
... except ValueError:
... pass

rd

John Salerno

unread,

May 10, 2006, 2:28:44 PM5/10/06

to

yeah, looks like index() or find() can be used to do it instead of RE,
but still, i'd like to know if there's a way you can write an RE
expression to do it (and just an RE expression, without all the other
for loops and extra nonsense...otherwise i might as well just use string
methods)

mpet...@gmail.com

unread,

May 10, 2006, 2:52:43 PM5/10/06

to

>From the Python 2.4 docs:

findall( pattern, string[, flags])
Return a list of all ***non-overlapping*** matches of pattern in
string....

By design, the regex functions return non-overlapping patterns.

Without doing some kind of looping, I think you are out of luck.

If you pattern is fixed, then a solution might be:
>>>string = 'abababababababab'
>>>pat = 'aba'
>>>[pat for s in re.compile('(?='+pat+')').findall(string)]
['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

If the pattern is not fixed (i.e. 'a.a') then this method can still get
a count of overlapping matches, but cannot get the individual match
strings themselves.

A simple loop should do in this case, though:

>>> for i in range(len(string)):
... r= re.match(pat,string[i:])
... if r: print r.group()
...
aba
aba
aba
aba
aba
aba
aba

BartlebyScrivener

unread,

May 10, 2006, 3:07:20 PM5/10/06

to

>> otherwise i might as well just use string
>> methods

I think you're supposed to use string methods if you can, to avoid the
old adage about having two problems instead of one when using regex.

rd

John Salerno

unread,

May 10, 2006, 3:25:24 PM5/10/06

to

mpet...@gmail.com wrote:

>>>> string = 'abababababababab'
>>>> pat = 'aba'
>>>> [pat for s in re.compile('(?='+pat+')').findall(string)]
> ['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wow, I have no idea how to read that RE. First off, what does it match?
Should something come before the parentheses, and that will be what
matches? Also, what are the '+'s doing? Are they literal +s or still
being used as RE syntax?

John Salerno

unread,

May 10, 2006, 3:37:46 PM5/10/06

to

Nevermind, I get it! The point is that you *aren'* matching anything
(except the empty string), and this only to see how many times it
occurs, then you are replacing those occurrences with the pattern
string. So this is basically like a counter?

Kent Johnson

unread,

May 10, 2006, 3:47:27 PM5/10/06

to

John Salerno wrote:
> I probably should find an RE group to post to, but my news server at
> work doesn't seem to have one, so I apologize. But this is in Python
> anyway :)
>
> So my question is, how can find all occurrences of a pattern in a
> string, including overlapping matches?

You can specify a start location to re.search(), and get the location of
a match from a match object. This allows you to loop, searching the
string following the last match:

import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')

ans = []
start = 0
while True:
m = pattern.search(string, start)
if not m: break
ans.append( (m.start(), m.end()) )
start = m.start() + 1

print ans # => [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10), (10, 12), (12, 14)]

Kent

mpet...@gmail.com

unread,

May 10, 2006, 3:59:57 PM5/10/06

to

Exactly,

Now this will work as long as there are no wildcards in the pattern.
Thus, only with fixed strings. But if you have a fixed string, there
is really no need to use regex, as it will complicate you life for no
real reason (as opposed to simple string methods).

With a more complex pattern (like 'a.a': match any character between
two 'a' characters) this will get the length, but not what character is
between the a's.

To actually do that you will need to iterate through the string and
apply the pattern match (which matches only the beginning of a string)
to a indexed subset of the original (see example in the last post)

Ben Cartwright

unread,

May 10, 2006, 8:15:46 PM5/10/06

to

John Salerno wrote:
> So my question is, how can find all occurrences of a pattern in a
> string, including overlapping matches? I figure it has something to do
> with look-ahead and look-behind, but I've only gotten this far:
>
> import re
> string = 'abababababababab'
> pattern = re.compile(r'ab(?=a)')
> m = pattern.findall(string)
>
> This matches all the 'ab' followed by an 'a', but it doesn't include the
> 'a'. What I'd like to do is find all the 'aba' matches. A regular
> findall() gives four results, but really there are seven.
>
> Is there a way to do this with just an RE pattern, or would I have to
> manually add the 'a' to the end of the matches?

Yes, and no extra for loops are needed! You can define groups inside
the lookahead assertion:

>>> import re
>>> re.findall(r'(?=(aba))', 'abababababababab')

['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

--Ben

Murali

unread,

May 10, 2006, 10:10:13 PM5/10/06

to

> Yes, and no extra for loops are needed! You can define groups inside
> the lookahead assertion:
>
> >>> import re
> >>> re.findall(r'(?=(aba))', 'abababababababab')
> ['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wonderful and this works with any regexp, so

import re

def all_occurences(pat,str):
return re.findall(r'(?=(%s))'%pat,str)

all_occurences("a.a","abacadabcda") returns ["aba","aca","ada"] as
required.

- Murali

Ben Cartwright

unread,

May 10, 2006, 10:47:45 PM5/10/06

to

Careful. That won't work as expected for *all* regexps. Example:

>>> import re
>>> re.findall(r'(?=(a.*a))', 'abaca')
['abaca', 'aca']

Note that this does *not* find 'aba'. You might think that making it
non-greedy might help, but:

>>> re.findall(r'(?=(a.*?a))', 'abaca')
['aba', 'aca']

Nope, now it's not finding 'abaca'.

This is by design, though. From
http://www.regular-expressions.info/lookaround.html (a good read, by
the way):

"""As soon as the lookaround condition is satisfied, the regex engine
forgets about everything inside the lookaround. It will not backtrack
inside the lookaround to try different permutations."""

Moral of the story: keep lookahead assertions simple whenever
possible. :-)

--Ben

BartlebyScrivener

unread,

May 10, 2006, 11:29:13 PM5/10/06

to

Thanks, Ben. Quite an education!

rick

Mirco Wahab

unread,

May 11, 2006, 6:18:38 AM5/11/06

to

Hi mpeters42 & John

> With a more complex pattern (like 'a.a': match any character between
> two 'a' characters) this will get the length, but not what character is
> between the a's.

Lets take this as a starting point for another example
that comes to mind. You have a string of characters
interspersed with numbers: tx = 'a1a2a3A4a35a6b7b8c9c'

Now you try to find all _numbers_, which have
symmetrical characters (like a<-2->a) which
are not in 3/3/3... synced groups.

This can easy be done in P(ytho|nerl) etc. by
positive lookahead (even the same pattern does:)
Py:
import re
tx = 'a1a2a3A4a35a6b7b8c9c'
rg = r'(\w)(?=(.\1))'
print re.findall(rg, tx)
Pe:
$_ = 'a1a2a3A4a35a6b7b8c9c';
print /(\w)(?=(.)\1)/g;

(should find 1,2,7,9 only, python regex
written to var in order to prevent
clunky lines ;-)

BTW, Py Regex Engine seems to
be very close to the perl one:
Naive (!) matching of a pattern
with 14 o's (intersperded by
anything) against a string of
16 o's takes about exaclty the same
time here in Py(2.4.3) and Pe (5.8.7):

tl = 'oooooooooooooooo'
rg = r'o*o*o*o*o*o*o*o*o*o*o*o*o*o*[\W]'
print re.search(rg, tl)

Py: 101 sec
Pe: 109 sec

(which would find no match because there's
no \W-like character at the end of the
string here)

Regards

Mirco

John Salerno

unread,

May 11, 2006, 10:16:06 AM5/11/06

to

Mirco Wahab wrote:

> Py:
> import re
> tx = 'a1a2a3A4a35a6b7b8c9c'
> rg = r'(\w)(?=(.\1))'
> print re.findall(rg, tx)

The only problem seems to be (and I ran into this with my original
example too) that what gets returned by this code isn't exactly what you
are looking for, i.e. the numbers '1', '2', etc. You get a list of
tuples, and the second item in this tuple contains the number, but also
the following \w character.

So there still seems to be some work that must be done when dealing with
overlapping patterns/look-ahead/behind.

Oh wait, a thought just hit me. Instead of doing it as you did:

rg = r'(\w)(?=(.\1))'

Could you do:

rg = r'(\w)(?=(.)\1)'

That would at least isolate the number, although you'd still have to get
it out of the list/tuple.

John Salerno

unread,

May 11, 2006, 10:17:25 AM5/11/06

to

Ben Cartwright wrote:

> Yes, and no extra for loops are needed! You can define groups inside
> the lookahead assertion:
>
> >>> import re
> >>> re.findall(r'(?=(aba))', 'abababababababab')
> ['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']

Wow, that was like magic! :)

Mirco Wahab

unread,

May 11, 2006, 10:41:16 AM5/11/06

to

Hi John

> rg = r'(\w)(?=(.)\1)'
>
> That would at least isolate the number, although you'd still have to get
> it out of the list/tuple.

I have no idea how to do this
in Python in a terse way - but
I'll try ;-)

In Perl, its easy. Here, the
"match construct" (\w)(?=(.)\1)
returns all captures in a
list (a 1 a 2 a 4 b 7 c 9)
because we capture 2 fields per
comparision:
The first is (\w), needed for
backreference, the second is
the dot (.), which finds the
number in the center (or any-
thing else).

So, in perl you 'filter' by the
'grep' function on each list
element: grep{ /\d/ } - this
means, only numbers (\d) will
pass through:

$_ = 'a1a2a3Aa4a35a6b7b8c9c';
print grep{/\d/} /(\w)(?=(.)\1)/g;
#prints => 1 2 4 7 9

I'll try to fiddle somthing out
that works in Python too ...

Regards

M.

John Salerno

unread,

May 11, 2006, 11:35:38 AM5/11/06

to

Mirco Wahab wrote:

> I have no idea how to do this
> in Python in a terse way - but
> I'll try ;-)
>
> In Perl, its easy. Here, the
> "match construct" (\w)(?=(.)\1)
> returns all captures in a
> list (a 1 a 2 a 4 b 7 c 9)

Ah, I see the difference. In Python you get a list of tuples, so there
seems to be a little extra work to do to get the number out.

Mirco Wahab

unread,

May 11, 2006, 4:58:17 PM5/11/06

to

Hi John

> Ah, I see the difference. In Python you get a list of tuples, so there
> seems to be a little extra work to do to get the number out.

Dohh, after two cups of coffee
ans several bars of chocolate
I eventually mad(e) it ;-)

In Python, you have to deconstruct
the 2D-lists (here: long list of
short lists [a,2] ...) by
'slicing the slice':

char,num = list[:][:]

in a loop and using the apropriate element then:

import re

t = 'a1a2a3Aa4a35a6b7b8c9c';
r = r'(\w)(?=(.)\1)'
l = re.findall(r, t)

for a,b in (l[:][:]) : print b

In the moment, I find this syntax
awkward and arbitary, but my mind
should change if I'm adopted more
to this in the end ;-)

Regards,

M.

Message has been deleted

Mirco Wahab

unread,

May 11, 2006, 5:08:31 PM5/11/06

to

Hi John

> Ah, I see the difference. In Python you get a list of
> tuples, so there seems to be a little extra work to do
> to get the number out.

Dohh, after two cups of coffee

ans several bars of chocolate
I eventually mad(e) it ;-)

In Python, you have to deconstruct
the 2D-lists (here: long list of
short lists [a,2] ...) by
'slicing the slice':

char,num = list[:][:]

in a loop and using the apropriate element then:

import re

t = 'a1a2a3Aa4a35a6b7b8c9c';
r = r'(\w)(?=(.)\1)'
l = re.findall(r, t)

for a,b in l : print b

(l sould implicitly be decoded
sequentially as l[:][->a : ->b]
in the loop context.)

In the moment, I find this syntax

somehow hard to remember, but my

Fredrik Lundh

unread,

May 12, 2006, 4:48:25 AM5/12/06

to pytho...@python.org

Mirco Wahab wrote:

> In Python, you have to deconstruct
> the 2D-lists (here: long list of
> short lists [a,2] ...) by
> 'slicing the slice':
>
> char,num = list[:][:]
>
> in a loop and using the apropriate element then:
>
> import re
>
> t = 'a1a2a3Aa4a35a6b7b8c9c';
> r = r'(\w)(?=(.)\1)'
> l = re.findall(r, t)
>
> for a,b in (l[:][:]) : print b
>
> In the moment, I find this syntax
> awkward and arbitary, but my mind
> should change if I'm adopted more
> to this in the end ;-)

in contemporary Python, this is best done by a list comprehension:

l = [m[1] for m in re.findall(r, t)]

or, depending on what you want to do with the result, a generator
expression:

g = (m[1] for m in re.findall(r, t))

or

process(m[1] for m in re.findall(r, t))

if you want to avoid creating the tuples, you can use finditer instead:

l = [m.group(2) for m in re.finditer(r, t)]
g = (m.group(2) for m in re.finditer(r, t))

finditer is also a good tool to use if you need to do more things with
each match:

for m in re.finditer(r, t):
s = m.group(2)
... process s in some way ...

the code body will be executed every time the RE engine finds a match,
which can be useful if you're working on large target strings, and only
want to process the first few matches.

for m in re.finditer(r, t):
s = m.group(2)
if s == something:
break
... process s in some way ...

</F>

Mirco Wahab

unread,

May 13, 2006, 9:17:41 AM5/13/06

to

Hi Fredrik

you brought up some terse and
somehow expressive lines with
their own beauty ...

> [this] is best done by a list comprehension:

> l = [m[1] for m in re.findall(r, t)]
>

> or, [...] a generator expression:

> g = (m[1] for m in re.findall(r, t))
>
> or
> process(m[1] for m in re.findall(r, t))
>

> ... avoid creating the tuples, ... finditer instead:

> l = [m.group(2) for m in re.finditer(r, t)]
> g = (m.group(2) for m in re.finditer(r, t))
>
> finditer is also a good tool to use

> for m in re.finditer(r, t):
> s = m.group(2)
> ... process s in some way ...

... which made me wish to internalize such wisdom too ;-)

This looks almost beautiful, it made me stand up and
go to some large book stores in order to grab a good
book on python.

Sadly, there were none (except one small 'dictionary',
ISBN: 3826615123). I live in a fairly large city
in Germany w/three large bookstores in the center,
where one can get loads of PHP and Java books, lots
of C/C++ and the like - even some Ruby books (some
"Rails" too) on display (WTF).

Not that I wouldn't order books (I do that all the
time for 'original versions') but it makes one
sad-faced to see the small impact of the Python
language here today on bookstore-tournarounds ...

Thanks & regards

Mirco