Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

#elements of seq A in seq B

0 views
Skip to first unread message

Neal Becker

unread,
Aug 19, 2009, 7:19:24 PM8/19/09
to pytho...@python.org
What would be a time efficient way to count the number of occurrences of
elements of sequence A in sequence B? (in this particular case, these
sequences are strings, if that matters).

Jan Kaliszewski

unread,
Aug 19, 2009, 8:17:47 PM8/19/09
to Neal Becker, pytho...@python.org

If you mean: to count occurences of each element of A (separately) in B...

Hm, maybe something like this:

# result as a dict {<element of A>: <how many occurences in B>, ...}
dict((element, B.count(element)) for element in A)


If you mean: to count non overlaping occurences of string A in B
-- simply:

B.count(A)


Regards,
*j

--
Jan Kaliszewski (zuo) <z...@chopin.edu.pl>

Simon Forman

unread,
Aug 19, 2009, 10:12:14 PM8/19/09
to
On Aug 19, 8:17 pm, "Jan Kaliszewski" <z...@chopin.edu.pl> wrote:

You don't want to use count() in a case like this because it iterates
through B len(A) times, i.e. this algorithm is O(A x B).

If all you want is the total you can do it like this:

def f(a, b):
a = set(a)
return sum(item in a for item in b)


A = 'some string'
B = 'another string'

print f(A, B) # prints 12


If you want to know the count for each element you can use this:

from collections import defaultdict

def g(a, b):
a = set(a)
d = defaultdict(int)
for item in b:
if item in a:
d[item] += 1
return d

print g(A, B)

# prints defaultdict(<type 'int'>, {' ': 1, 'e': 1, 'g': 1, 'i': 1,
'o': 1, 'n': 2, 's': 1, 'r': 2, 't': 2})

HTH,
~Simon

Tomasz Rola

unread,
Aug 19, 2009, 10:45:46 PM8/19/09
to Neal Becker, pytho...@python.org, Tomasz Rola
On Wed, 19 Aug 2009, Neal Becker wrote:

> What would be a time efficient way to count the number of occurrences of
> elements of sequence A in sequence B? (in this particular case, these
> sequences are strings, if that matters).

If A and B are rather lengthy, then maybe build a tree from elements of A
and use it to count elements of B. I.e., tree nodes would be counters.
Building trees in Python, however, I guess it is a little bit tricky. Of
course it is possible. But maybe you would rather have this tree
maintained in C.

Regards
Tomasz Rola

--
** A C programmer asked whether computer had Buddha's nature. **
** As the answer, master did "rm -rif" on the programmer's home **
** directory. And then the C programmer became enlightened... **
** **
** Tomasz Rola mailto:tomas...@bigfoot.com **

John Machin

unread,
Aug 19, 2009, 11:34:51 PM8/19/09
to
On Aug 20, 12:12 pm, Simon Forman <sajmik...@gmail.com> wrote:
> On Aug 19, 8:17 pm, "Jan Kaliszewski" <z...@chopin.edu.pl> wrote:
> > If you mean: to count non overlaping occurences of string A in B
> > -- simply:
>
> >    B.count(A)
>
> You don't want to use count() in a case like this because it iterates
> through B len(A) times, i.e. this algorithm is O(A x B).

There's seems to be multiple confusion here. Jan was talking about
counting non-overlapping occurrences of string A in [string] B. In
that case B.count(A) is nowhere near O(A*B)... in fact in reasonable
cases it is sub-linear i.e. O(B/A)

See http://svn.python.org/view/python/trunk/Objects/stringlib/fastsearch.h?revision=68811&view=markup

And even a naive implementation of str.count() would not "iterate
through B len(A) times".

and list.count() can't be used on its own to solve Neal's problem; its
arg is a single element so it would have to be wrapped in a loop: sum
(B.count(a) for a in A) which would of course be O(A*B)

Simon Forman

unread,
Aug 20, 2009, 1:44:02 AM8/20/09
to
On Aug 19, 11:34 pm, John Machin <sjmac...@lexicon.net> wrote:
> On Aug 20, 12:12 pm, Simon Forman <sajmik...@gmail.com> wrote:
>
> > On Aug 19, 8:17 pm, "Jan Kaliszewski" <z...@chopin.edu.pl> wrote:
> > > If you mean: to count non overlaping occurences of string A in B
> > > -- simply:
>
> > >    B.count(A)
>
> > You don't want to use count() in a case like this because it iterates
> > through B len(A) times, i.e. this algorithm is O(A x B).
>
> There's seems to be multiple confusion here. Jan was talking about
> counting non-overlapping occurrences of string A in [string] B. In
> that case B.count(A) is nowhere near O(A*B)... in fact in reasonable
> cases it is sub-linear i.e. O(B/A)
>
> Seehttp://svn.python.org/view/python/trunk/Objects/stringlib/fastsearch....

>
> And even a naive implementation of str.count() would not "iterate
> through B len(A) times".
>
> and list.count() can't be used on its own to solve Neal's problem; its
> arg is a single element so it would have to be wrapped in a loop: sum
> (B.count(a) for a in A) which would of course be O(A*B)

First, thanks for pointing out str.count(). I didn't know of it and I
didn't notice that that was what was being used in Jan's code. (It's
blindingly obvious-- now that you've pointed it out. :] )


Second, I was referring to this:

dict((element, B.count(element)) for element in A)

not B.count(A). Sorry for not interspersing my comment better.

Regards,
~Simon

Neal Becker

unread,
Aug 20, 2009, 7:01:29 AM8/20/09
to pytho...@python.org
Jan Kaliszewski wrote:

> 20-08-2009 o 01:19:24 Neal Becker <ndbe...@gmail.com> wrote:
>
>> What would be a time efficient way to count the number of occurrences of
>> elements of sequence A in sequence B? (in this particular case, these
>> sequences are strings, if that matters).
>

> If you mean: to count occurences of each element of A (separately) in B...
>
> Hm, maybe something like this:
>
> # result as a dict {<element of A>: <how many occurences in B>, ...}

> dict((element, B.count(element)) for element in A)
>
>

> If you mean: to count non overlaping occurences of string A in B
> -- simply:
>
> B.count(A)
>
>

> Regards,
> *j
>

I meant #occurrences of characters from the set A in string B

Peter Otten

unread,
Aug 20, 2009, 8:05:12 AM8/20/09
to
Neal Becker wrote:

> I meant #occurrences of characters from the set A in string B

If a contains "few" characters:

n = sum(b.count(c) for c in a)

If a contains "many" characters:

identity = "".join(map(chr, range(256)))
n = len(b) - len(b.translate(identity, a))

Peter


Jan Kaliszewski

unread,
Aug 20, 2009, 7:25:38 PM8/20/09
to Simon Forman, pytho...@python.org
20-08-2009 o 04:12:14 Simon Forman <sajm...@gmail.com> wrote:

> If you want to know the count for each element you can use this:
>
> from collections import defaultdict
>
> def g(a, b):
> a = set(a)
> d = defaultdict(int)
> for item in b:
> if item in a:
> d[item] += 1
> return d
>
> print g(A, B)
>
> # prints defaultdict(<type 'int'>, {' ': 1, 'e': 1, 'g': 1, 'i': 1,
> 'o': 1, 'n': 2, 's': 1, 'r': 2, 't': 2})

Yeah, your sollution is better (and more interesting :-)). Thanks!

Jan Kaliszewski

unread,
Aug 20, 2009, 7:44:43 PM8/20/09
to Neal Becker, pytho...@python.org
20-08-2009 o 13:01:29 Neal Becker <ndbe...@gmail.com> wrote:

> I meant #occurrences of characters from the set A in string B

But:

1) separately for each element of A? (see Simon's sollution with
defaultdict)

2) or total number of all occurrences of elements of A? (see below)


20-08-2009 o 14:05:12 Peter Otten <__pet...@web.de> wrote:

> identity = "".join(map(chr, range(256)))
> n = len(b) - len(b.translate(identity, a))

Nice, though I'd prefer Simon's sollution:

a = set(a)
n = sum(item in a for item in b)

Regards,

Jan Kaliszewski

unread,
Aug 20, 2009, 9:42:42 PM8/20/09
to Neal Becker, pytho...@python.org
>> a = set(a)
>> n = sum(item in a for item in b)

> Why set? Does it matter if I say that items in A are already unique?

Sets are hash-based, so it's (most probably) far more efficient for
sets than for sequences (especially if we say about big/long ones).

Message has been deleted

Raymond Hettinger

unread,
Aug 21, 2009, 2:44:29 AM8/21/09
to

Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit
(Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import collections
>>> A = 'abc'
>>> B = 'abracadabra'
>>> collections.Counter(filter(set(A).__contains__, B))
Counter({'a': 5, 'b': 2, 'c': 1})

Raymond

Peter Otten

unread,
Aug 21, 2009, 3:43:51 AM8/21/09
to
Jan Kaliszewski wrote:

> 20-08-2009 o 13:01:29 Neal Becker <ndbe...@gmail.com> wrote:
>
>> I meant #occurrences of characters from the set A in string B
>
> But:
>
> 1) separately for each element of A? (see Simon's sollution with
> defaultdict)
>
> 2) or total number of all occurrences of elements of A? (see below)
>
>
> 20-08-2009 o 14:05:12 Peter Otten <__pet...@web.de> wrote:
>
>> identity = "".join(map(chr, range(256)))
>> n = len(b) - len(b.translate(identity, a))
>
> Nice, though I'd prefer Simon's sollution:
>
> a = set(a)
> n = sum(item in a for item in b)

Just to give you an idea why I posted the former:

$ python -m timeit -s"a = set('abc'); b = 'abcdefg'*10**5" 'sum(item in a
for item in b)'
10 loops, best of 3: 195 msec per loop

$ python -m timeit -s"a = 'abc'; b = 'abcdefg'*10**5; id=''.join(map(chr,
range(256)))" 'len(b) - len(b.translate(id, a))'
100 loops, best of 3: 4.97 msec per loop

Peter

0 new messages