So my question is, how can find all occurrences of a pattern in a
string, including overlapping matches? I figure it has something to do
with look-ahead and look-behind, but I've only gotten this far:
import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')
m = pattern.findall(string)
This matches all the 'ab' followed by an 'a', but it doesn't include the
'a'. What I'd like to do is find all the 'aba' matches. A regular
findall() gives four results, but really there are seven.
Is there a way to do this with just an RE pattern, or would I have to
manually add the 'a' to the end of the matches?
Thanks.
Why not something like
import re
string = 'abababababababab'
pattern = re.compile(r"^aba")
ans = []
for x in xrange(len(string)):
m = pattern.match(string[x:])
if m: ans.append( (x+m.start(),x+m.end()))
# now ans is a list of pairs (p,q) where the substring string[p:q]
matches pattern
- Murali
*|(?=...)|*
Matches if ... matches next, but doesn't consume any of the string.
This is called a lookahead assertion. For example, Isaac (?=Asimov)
will match |'Isaac '| only if it's followed by |'Asimov'|.
> m = pattern.findall(string)
>
> This matches all the 'ab' followed by an 'a', but it doesn't include the
> 'a'. What I'd like to do is find all the 'aba' matches. A regular
> findall() gives four results, but really there are seven.
>
>
I try the code , but I give seven results !
I'll watch.
rd
>> This matches all the 'ab' followed by an 'a', but it doesn't include
>> the 'a'. What I'd like to do is find all the 'aba' matches. A regular
>> findall() gives four results, but really there are seven.
>>
>>
> I try the code , but I give seven results !
Sorry, I meant that findall() only returns 4 results when searching for
'aba', when there are actually seven instances of 'aba'. This doesn't
involve the look-ahead RE.
Heh heh, I'm sure you're right, but this is more just an exercise for me
in REs, so I'm curious how you might do it, unless the answer is that
it's just too complicated to be worth it (like Murali's example!) That
goes beyond just an RE pattern.
s = "abababababababab"
for x in range(len(s)):
... try:
... s.index("aba", x, x + 3)
... except ValueError:
... pass
rd
yeah, looks like index() or find() can be used to do it instead of RE,
but still, i'd like to know if there's a way you can write an RE
expression to do it (and just an RE expression, without all the other
for loops and extra nonsense...otherwise i might as well just use string
methods)
findall( pattern, string[, flags])
Return a list of all ***non-overlapping*** matches of pattern in
string....
By design, the regex functions return non-overlapping patterns.
Without doing some kind of looping, I think you are out of luck.
If you pattern is fixed, then a solution might be:
>>>string = 'abababababababab'
>>>pat = 'aba'
>>>[pat for s in re.compile('(?='+pat+')').findall(string)]
['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']
If the pattern is not fixed (i.e. 'a.a') then this method can still get
a count of overlapping matches, but cannot get the individual match
strings themselves.
A simple loop should do in this case, though:
>>> for i in range(len(string)):
... r= re.match(pat,string[i:])
... if r: print r.group()
...
aba
aba
aba
aba
aba
aba
aba
I think you're supposed to use string methods if you can, to avoid the
old adage about having two problems instead of one when using regex.
rd
>>>> string = 'abababababababab'
>>>> pat = 'aba'
>>>> [pat for s in re.compile('(?='+pat+')').findall(string)]
> ['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']
Wow, I have no idea how to read that RE. First off, what does it match?
Should something come before the parentheses, and that will be what
matches? Also, what are the '+'s doing? Are they literal +s or still
being used as RE syntax?
Nevermind, I get it! The point is that you *aren'* matching anything
(except the empty string), and this only to see how many times it
occurs, then you are replacing those occurrences with the pattern
string. So this is basically like a counter?
You can specify a start location to re.search(), and get the location of
a match from a match object. This allows you to loop, searching the
string following the last match:
import re
string = 'abababababababab'
pattern = re.compile(r'ab(?=a)')
ans = []
start = 0
while True:
m = pattern.search(string, start)
if not m: break
ans.append( (m.start(), m.end()) )
start = m.start() + 1
print ans # => [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10), (10, 12), (12, 14)]
Kent
Now this will work as long as there are no wildcards in the pattern.
Thus, only with fixed strings. But if you have a fixed string, there
is really no need to use regex, as it will complicate you life for no
real reason (as opposed to simple string methods).
With a more complex pattern (like 'a.a': match any character between
two 'a' characters) this will get the length, but not what character is
between the a's.
To actually do that you will need to iterate through the string and
apply the pattern match (which matches only the beginning of a string)
to a indexed subset of the original (see example in the last post)
Yes, and no extra for loops are needed! You can define groups inside
the lookahead assertion:
>>> import re
>>> re.findall(r'(?=(aba))', 'abababababababab')
['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']
--Ben
Wonderful and this works with any regexp, so
import re
def all_occurences(pat,str):
return re.findall(r'(?=(%s))'%pat,str)
all_occurences("a.a","abacadabcda") returns ["aba","aca","ada"] as
required.
- Murali
Careful. That won't work as expected for *all* regexps. Example:
>>> import re
>>> re.findall(r'(?=(a.*a))', 'abaca')
['abaca', 'aca']
Note that this does *not* find 'aba'. You might think that making it
non-greedy might help, but:
>>> re.findall(r'(?=(a.*?a))', 'abaca')
['aba', 'aca']
Nope, now it's not finding 'abaca'.
This is by design, though. From
http://www.regular-expressions.info/lookaround.html (a good read, by
the way):
"""As soon as the lookaround condition is satisfied, the regex engine
forgets about everything inside the lookaround. It will not backtrack
inside the lookaround to try different permutations."""
Moral of the story: keep lookahead assertions simple whenever
possible. :-)
--Ben
rick
> With a more complex pattern (like 'a.a': match any character between
> two 'a' characters) this will get the length, but not what character is
> between the a's.
Lets take this as a starting point for another example
that comes to mind. You have a string of characters
interspersed with numbers: tx = 'a1a2a3A4a35a6b7b8c9c'
Now you try to find all _numbers_, which have
symmetrical characters (like a<-2->a) which
are not in 3/3/3... synced groups.
This can easy be done in P(ytho|nerl) etc. by
positive lookahead (even the same pattern does:)
Py:
import re
tx = 'a1a2a3A4a35a6b7b8c9c'
rg = r'(\w)(?=(.\1))'
print re.findall(rg, tx)
Pe:
$_ = 'a1a2a3A4a35a6b7b8c9c';
print /(\w)(?=(.)\1)/g;
(should find 1,2,7,9 only, python regex
written to var in order to prevent
clunky lines ;-)
BTW, Py Regex Engine seems to
be very close to the perl one:
Naive (!) matching of a pattern
with 14 o's (intersperded by
anything) against a string of
16 o's takes about exaclty the same
time here in Py(2.4.3) and Pe (5.8.7):
tl = 'oooooooooooooooo'
rg = r'o*o*o*o*o*o*o*o*o*o*o*o*o*o*[\W]'
print re.search(rg, tl)
Py: 101 sec
Pe: 109 sec
(which would find no match because there's
no \W-like character at the end of the
string here)
Regards
Mirco
> Py:
> import re
> tx = 'a1a2a3A4a35a6b7b8c9c'
> rg = r'(\w)(?=(.\1))'
> print re.findall(rg, tx)
The only problem seems to be (and I ran into this with my original
example too) that what gets returned by this code isn't exactly what you
are looking for, i.e. the numbers '1', '2', etc. You get a list of
tuples, and the second item in this tuple contains the number, but also
the following \w character.
So there still seems to be some work that must be done when dealing with
overlapping patterns/look-ahead/behind.
Oh wait, a thought just hit me. Instead of doing it as you did:
rg = r'(\w)(?=(.\1))'
Could you do:
rg = r'(\w)(?=(.)\1)'
That would at least isolate the number, although you'd still have to get
it out of the list/tuple.
> Yes, and no extra for loops are needed! You can define groups inside
> the lookahead assertion:
>
> >>> import re
> >>> re.findall(r'(?=(aba))', 'abababababababab')
> ['aba', 'aba', 'aba', 'aba', 'aba', 'aba', 'aba']
Wow, that was like magic! :)
I have no idea how to do this
in Python in a terse way - but
I'll try ;-)
In Perl, its easy. Here, the
"match construct" (\w)(?=(.)\1)
returns all captures in a
list (a 1 a 2 a 4 b 7 c 9)
because we capture 2 fields per
comparision:
The first is (\w), needed for
backreference, the second is
the dot (.), which finds the
number in the center (or any-
thing else).
So, in perl you 'filter' by the
'grep' function on each list
element: grep{ /\d/ } - this
means, only numbers (\d) will
pass through:
$_ = 'a1a2a3Aa4a35a6b7b8c9c';
print grep{/\d/} /(\w)(?=(.)\1)/g;
#prints => 1 2 4 7 9
I'll try to fiddle somthing out
that works in Python too ...
Regards
M.
> I have no idea how to do this
> in Python in a terse way - but
> I'll try ;-)
>
> In Perl, its easy. Here, the
> "match construct" (\w)(?=(.)\1)
> returns all captures in a
> list (a 1 a 2 a 4 b 7 c 9)
Ah, I see the difference. In Python you get a list of tuples, so there
seems to be a little extra work to do to get the number out.
> Ah, I see the difference. In Python you get a list of tuples, so there
> seems to be a little extra work to do to get the number out.
Dohh, after two cups of coffee
ans several bars of chocolate
I eventually mad(e) it ;-)
In Python, you have to deconstruct
the 2D-lists (here: long list of
short lists [a,2] ...) by
'slicing the slice':
char,num = list[:][:]
in a loop and using the apropriate element then:
import re
t = 'a1a2a3Aa4a35a6b7b8c9c';
r = r'(\w)(?=(.)\1)'
l = re.findall(r, t)
for a,b in (l[:][:]) : print b
In the moment, I find this syntax
awkward and arbitary, but my mind
should change if I'm adopted more
to this in the end ;-)
Regards,
M.
> Ah, I see the difference. In Python you get a list of
> tuples, so there seems to be a little extra work to do
> to get the number out.
Dohh, after two cups of coffee
ans several bars of chocolate
I eventually mad(e) it ;-)
In Python, you have to deconstruct
the 2D-lists (here: long list of
short lists [a,2] ...) by
'slicing the slice':
char,num = list[:][:]
in a loop and using the apropriate element then:
import re
t = 'a1a2a3Aa4a35a6b7b8c9c';
r = r'(\w)(?=(.)\1)'
l = re.findall(r, t)
for a,b in l : print b
(l sould implicitly be decoded
sequentially as l[:][->a : ->b]
in the loop context.)
In the moment, I find this syntax
somehow hard to remember, but my
> In Python, you have to deconstruct
> the 2D-lists (here: long list of
> short lists [a,2] ...) by
> 'slicing the slice':
>
> char,num = list[:][:]
>
> in a loop and using the apropriate element then:
>
> import re
>
> t = 'a1a2a3Aa4a35a6b7b8c9c';
> r = r'(\w)(?=(.)\1)'
> l = re.findall(r, t)
>
> for a,b in (l[:][:]) : print b
>
> In the moment, I find this syntax
> awkward and arbitary, but my mind
> should change if I'm adopted more
> to this in the end ;-)
in contemporary Python, this is best done by a list comprehension:
l = [m[1] for m in re.findall(r, t)]
or, depending on what you want to do with the result, a generator
expression:
g = (m[1] for m in re.findall(r, t))
or
process(m[1] for m in re.findall(r, t))
if you want to avoid creating the tuples, you can use finditer instead:
l = [m.group(2) for m in re.finditer(r, t)]
g = (m.group(2) for m in re.finditer(r, t))
finditer is also a good tool to use if you need to do more things with
each match:
for m in re.finditer(r, t):
s = m.group(2)
... process s in some way ...
the code body will be executed every time the RE engine finds a match,
which can be useful if you're working on large target strings, and only
want to process the first few matches.
for m in re.finditer(r, t):
s = m.group(2)
if s == something:
break
... process s in some way ...
</F>
you brought up some terse and
somehow expressive lines with
their own beauty ...
> [this] is best done by a list comprehension:
> l = [m[1] for m in re.findall(r, t)]
>
> or, [...] a generator expression:
> g = (m[1] for m in re.findall(r, t))
>
> or
> process(m[1] for m in re.findall(r, t))
>
> ... avoid creating the tuples, ... finditer instead:
> l = [m.group(2) for m in re.finditer(r, t)]
> g = (m.group(2) for m in re.finditer(r, t))
>
> finditer is also a good tool to use
> for m in re.finditer(r, t):
> s = m.group(2)
> ... process s in some way ...
... which made me wish to internalize such wisdom too ;-)
This looks almost beautiful, it made me stand up and
go to some large book stores in order to grab a good
book on python.
Sadly, there were none (except one small 'dictionary',
ISBN: 3826615123). I live in a fairly large city
in Germany w/three large bookstores in the center,
where one can get loads of PHP and Java books, lots
of C/C++ and the like - even some Ruby books (some
"Rails" too) on display (WTF).
Not that I wouldn't order books (I do that all the
time for 'original versions') but it makes one
sad-faced to see the small impact of the Python
language here today on bookstore-tournarounds ...
Thanks & regards
Mirco