I.e. I have a string delivered into my program and I want to see if
any of a set of strings is a substring of the string I have been
given. It's quite OK to stop at the first one found. Ideally the
strings being searched through will be the keys of a dictionary but
this isn't a necessity, they can just be in a list if it could be done
more efficiently using a list.
Is this the best one can do (ignoring the likelihood that I've got
some syntax wrong) :-
# l is the list
# str is the incoming string
answer = ""
for x in l:
if str.find(x) < 0:
continue
answer = x
--
Chris Green
I'd not use 'l' (confused with '1') or 'str' (a standard module) as
variable names. Your code checks every string in the list even when
it's found one... you can reverse the test and break when the first
one is found. Using 'in' rather than testing the return value of find
is nicer as a substring test. Finally, using the 'else' clause lets
you make it clear that answer is set to the empty string when no match
is found.
for answer in l:
if str in answer: break
else:
answer = ''
--
Paul Hankin
This was an order of magnitude faster for me than using str.find or
str.index. That was finding rare words in the entire word-list (w/
duplicates) of War and Peace.
Neither would I, it was just thrown together to show how I was
thinking.
> Your code checks every string in the list even when
> it's found one... you can reverse the test and break when the first
> one is found. Using 'in' rather than testing the return value of find
> is nicer as a substring test. Finally, using the 'else' clause lets
> you make it clear that answer is set to the empty string when no match
> is found.
>
> for answer in l:
> if str in answer: break
> else:
> answer = ''
>
OK, that does improve things somewhat, thanks.
--
Chris Green
However it's the wrong way around, in my case 'sample' is the longer
string and I want to know if s is in it. It's simple enough to do it
the other way around though:-
def foo(sample, strings):
for s in strings:
if s in sample:
return True
return False
Using in rather than find() and making it a function would seem to be
the way to go, thanks.
--
Chris Green
If you test against the same substrings over and over again, an
alternative would be to build a regular expression:
import re
search = re.compile('|'.join(re.escape(x)
for x in substrings)).search
p = search(somestring)
if p is not None:
print 'Found', p.group()
George
A different approach:
>>> words = ["he", "sh", "bla"]
>>> name = "blah"
>>> True in (word in name for word in words)
True
>>> name = "bling"
>>> True in (word in name for word in words)
False
Perhaps not as obvious or readable as Jeff's example, but is
essentially doing the same thing using generator syntax.
That would be an enormous regular expression and eat a lot of memory.
But over an enormous number of substrings, it would be O(log n),
rather than O(n).
That's pretty :)
You could use the Aho-Corasick algorithm <http://en.wikipedia.org/wiki/
Aho-Corasick_algorithm>.
I don't know if there's a Python implementation yet.
Dennis Benzinger
http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
Bye,
bearophile
It's even prettier in 2.5:
any(word in name for word in words)
George
And arguably the most readable yet!
http://nicolas.lehuen.com/download/pytst/ can do it as well.