What algorithm does Python use to evaluate: if substring in string

Tor Erik

unread,

Sep 9, 2006, 10:10:03 AM9/9/06

to

I would be surprised if it is the naive:

m = 0
s1 = "me"
s2 = "locate me"
s1len = len(s1)
s2len = len(s2)
found = False

while m + s1len <= s2len:
if s1 == s2len[m:m+s1len]:
found = True
break
m += 1

Alex Martelli

unread,

Sep 9, 2006, 10:54:31 AM9/9/06

to

Tor Erik <tore...@gmail.com> wrote:

> I would be surprised if it is the naive:

Yep -- it's "a mix between Boyer-Moore and Horspool with a few more
bells and whistles on the top", as documented and implemented in
Objects/stringlib/fastsearch.h in the Python sources and well discussed
and explained at http://effbot.org/zone/stringlib.htm .

Alex

Marc 'BlackJack' Rintsch

unread,

Sep 9, 2006, 10:58:32 AM9/9/06

to

In <eduhvs$46c$1...@news.uit.no>, Tor Erik wrote:

> I would be surprised if it is the naive:

Why?

I guess it simply calls an appropriate C library function.

Ciao,
Marc 'BlackJack' Rintsch

Tor Erik

unread,

Sep 9, 2006, 11:07:55 AM9/9/06

to

Ok. Two questions:

1. Is "a in b" simply an alias for "b.find(a)"?

2. Is this algorithm exclusive to Python 2.5, or is it contained in 2.4
aswell?

Alex Martelli

unread,

Sep 9, 2006, 1:25:35 PM9/9/06

to

Tor Erik <tore...@gmail.com> wrote:

> Alex Martelli wrote:
> > Tor Erik <tore...@gmail.com> wrote:
> >
> >> I would be surprised if it is the naive:
> >
> > Yep -- it's "a mix between Boyer-Moore and Horspool with a few more
> > bells and whistles on the top", as documented and implemented in
> > Objects/stringlib/fastsearch.h in the Python sources and well discussed
> > and explained at http://effbot.org/zone/stringlib.htm .
> >
> >
> > Alex
>
> Ok. Two questions:
>
> 1. Is "a in b" simply an alias for "b.find(a)"?

The 'in' operator can be minutely better optimized, but they share the
underlying algorithm (in 2.5).

> 2. Is this algorithm exclusive to Python 2.5, or is it contained in 2.4
> aswell?

It's 2.5 novelty. Look at the performance on the same machine (my 2.0
GHz MBP, MacOSX 10.4.7):

brain:~ alex$ python2.4 -mtimeit -s'x="foo";y="bar"*99+x+"baz"*77' 'x in
y'
100000 loops, best of 3: 9.04 usec per loop
brain:~ alex$ python2.4 -mtimeit -s'x="foo";y="bar"*99+x+"baz"*77'
'y.find(x)!=-1'
100000 loops, best of 3: 2.01 usec per loop
brain:~ alex$ python2.5 -mtimeit -s'x="foo";y="bar"*99+x+"baz"*77' 'x in
y'1000000 loops, best of 3: 0.452 usec per loop
brain:~ alex$ python2.5 -mtimeit -s'x="foo";y="bar"*99+x+"baz"*77'
'y.find(x)!=-1'
1000000 loops, best of 3: 0.842 usec per loop

find used to be way faster than 'in' -- now they share algorithms and
'in' can be more optimized (no need to track ``where'' it finds a match,
so to speak;-), so find is over twice as fast as it used to be, 'in' is
about 20 times as fast as it used to be, in this example -- it gets even
better if you look at larger and larger strings, e.g...:

brain:~ alex$ python2.4 -mtimeit -s'x="foo"*123;y="bar"*999+x+"baz"*777'
'x in y'
10000 loops, best of 3: 91.9 usec per loop
brain:~ alex$ python2.5 -mtimeit -s'x="foo"*123;y="bar"*999+x+"baz"*777'
'x in y'
100000 loops, best of 3: 3.01 usec per loop

here, we're going _30_ times as fast, not "just" 20;-).

Alex

Fredrik Lundh

unread,

Sep 11, 2006, 6:45:52 AM9/11/06

to pytho...@python.org

"Tor Erik" wrote:

> I would be surprised if it is the naive:

2.4 and earlier uses that algorithm (but with a better implementation).

And "naive" isn't really the right word here; on average, a brute force search is pretty
good for the find/index/in use case. Most fancy algorithms ignore the setup costs and
other overhead, which works well in theory, but not that well in practice.

</F>