
Merging lists has made my brain hurt.


Huw Lynes

Oct 2, 2002, 12:05:39 PM
Hi All,

I have a list containing an arbitrary number of other lists. The
internal lists contain strings. For example
lol = [
['aaa', 'bbb', 'ccc'],
['bbb', 'ccc', 'ddd'],
['ccc', 'ddd', 'eee']
]

I want to merge the three lists into a single list that only contains
the strings present in all three lists. In the above case I want to end
up with
['ccc']

This has me utterly stumped. Any help is appreciated. The only articles
about merging strings that I've managed to find so far have been about
merging strings so that you don't get repeats. Not quite what I'm
looking for.

Thanks,
Huw

Alex Martelli

Oct 2, 2002, 12:29:19 PM
Huw Lynes wrote:

The intersection of two "sets" (Sets per se are only added in Python
2.3, but you can use either lists OR dictionaries in Python 2.2 as
sets -- all you need is iterability and ability to use 'in' to test
membership) is pretty easy:

result = copy.copy(firstset)
for item in firstset:
    if item not in secondset:
        <delete item from result>

dictionaries are far faster than lists for membership tests and
for deletion of items. Whether it's worth building the dicts
corresponding to your lists depends on how long your lists are:
try both ways, pick the faster one if it matters.

With lists (assuming no duplications in list lol[0]):

result = lol[0][:]
for otherlist in lol[1:]:
    for item in result[:]:
        if item not in otherlist:
            result.remove(item)

With dicts (assuming hashable items -- strings are fine):

result = dict(zip(lol[0], lol[0]))
for otherlist in lol[1:]:
    otherdict = dict(zip(otherlist, otherlist))
    for item in result.keys():
        if item not in otherdict:
            del result[item]

result.keys() at the end is a list with all items you want
to keep, but in arbitrary order. If you want items to be
in the same order as they were in lol[0] (again assuming
no duplications in lol[0], in this case), then

[ item for item in lol[0] if item in result ]

is the final result you want.


Alex

Brian E Gallew

Oct 2, 2002, 12:34:23 PM
Then <h...@moving-picture.com> spoke up and said:
> NNTP-Posting-Host: soho6.sohonet.co.uk
> User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.0) Gecko/20020529

This seems pretty easy to me. Of course, it assumes that you can
destroy lol.

while len(lol) > 1:
    l1 = lol.pop()
    l2 = lol.pop()
    l3 = [x for x in l1 if x in l2]
    lol.append(l3)
lol = lol.pop()

Eddie Corns

Oct 2, 2002, 12:42:46 PM
Huw Lynes <h...@moving-picture.com> writes:

>Hi All,

>I have a list containing an arbitrary number of other lists. The
>internal lists contain strings. For example
>lol = [
>['aaa', 'bbb', 'ccc'],
>['bbb', 'ccc', 'ddd'],
>['ccc', 'ddd', 'eee']
>]

>I want to merge the three lists into a single list that only contains
>the strings present in all three lists. In the above case I want to end
>up with
>['ccc']

Basically you need to work out what is common between the first two, then
move on down the rest of the list, checking what is common between the
common list and each of the remaining items, and updating the common list
at each step.

So given a function like:

def inter (l1, l2):
    return [i1 for i1 in l1 if i1 in l2]

to return the intersection of two lists, you can then use reduce to do all
the hard work of applying it to the first two lists, then to the result and
the next, and so on, with:

common = reduce (inter, lol)
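
For the example lol above, a quick check (a sketch, using the inter
function just defined):

lol = [['aaa', 'bbb', 'ccc'], ['bbb', 'ccc', 'ddd'], ['ccc', 'ddd', 'eee']]
print reduce(inter, lol)    # -> ['ccc']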

Eddie

djw

Oct 2, 2002, 12:40:27 PM
Alex wrote:
<snip>

>
> The intersection of two "sets" (Sets per se are only added in Python
> 2.3, but you can use either lists OR dictionaries in Python 2.2 as
> sets -- all you need is iterability and ability to use 'in' to test
> membership) is pretty easy:
>
<snip>

I needed to do this in a program in 2.2.1, so I went and grabbed a copy
of set.py from the "Sandbox" in CVS. Works like a charm. Here is the link:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/sets/


Don

Mark McEahern

Oct 2, 2002, 12:31:58 PM
> I want to merge the three lists into a single list that only contains
> the strings present in all three lists. In the above case I want to end
> up with
> ['ccc']

Python 2.3 will include a set type:

http://www.python.org/dev/doc/devel/whatsnew/intro.html

In the meantime, try this recipe:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/138982

// m

Mike C. Fletcher

Oct 2, 2002, 12:31:44 PM
This is just off the top of my head, there are probably better ways...

PythonWin 2.2.1 (#34, Apr 9 2002, 19:34:33) [MSC 32 bit (Intel)] on win32.
Portions Copyright 1994-2001 Mark Hammond (mham...@skippinet.com.au) -
see 'Help/About PythonWin' for further copyright information.


>>> lol = [['aaa', 'bbb', 'ccc'], ['bbb', 'ccc', 'ddd'], ['ccc', 'ddd', 'eee']]

>>> def common( source ):
...     if not source:
...         return []
...     set = {}
...     for item in source[0]:
...         set[item] = 1
...     for other in source[1:]:
...         for item in other:
...             if set.has_key( item ):
...                 set[item] = set[item]+1
...     return [ key for (key,value) in set.items() if value == len(source)]
...
>>> common(lol)
['ccc']
>>>

I think Python 2.3 is going to have a sets module (hopefully backported
to Python 2.2.x eventually), which would probably be a more obvious approach.

HTH,
Mike

_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/

Robert

Oct 2, 2002, 3:28:03 PM

> lol = [
> ['aaa', 'bbb', 'ccc'],
> ['bbb', 'ccc', 'ddd'],
> ['ccc', 'ddd', 'eee']
> ]
>
> I want to merge the three lists into a single list that only contains
> the strings present in all three lists. In the above case I want to end
> up with
> ['ccc']

Using 'reduce' would be nice functional-style programming. However, calls
and copying -- creating, slicing, and modifying lists -- are very expensive,
while plain looping is so-la-la. So a simple 2D loop with break would almost
always outperform it with minimal cons-ing, and it still reads best for
newbies:

common=[]
for e in lol[0]:
    for l in lol:
        if e not in l: break
    else: common.append(e)

#(the else belongs to for!)
# in certain cases, when NX & NY are both huge, a dict would gain O(2)
# speed, but only if you play economically with one dict! :

common=[]
d={}
for l in lol:
    for e in l:
        if e in d: d[e]+=1
        else: d[e]=1
n=len(lol)
common = [e for e in d if d[e]==n]
print common

# (slightly improvable); but if you have your input sub-lists occasionally
# prepared as dicts (sets in the future), the first solution is again
# preferable for small & huge stuff without code modification

Robert


Max M

Oct 2, 2002, 5:56:23 PM
Robert wrote:

> common=[]
> d={}
> for l in lol:
>     for e in l:
>         if e in d: d[e]+=1
>         else: d[e]=1
> n=len(lol)
> common = [e for e in d if d[e]==n]
> print common


Slightly improved:

d = {}
for l in lol:
    for i in l:
        d[i] = d.setdefault(i, 0) + 1

n=len(lol)
common = [e for e in d if d[e]==n]
print common


Or the shortest one I can come up with. Don't know about the brain-pain
though ;-)

d = {}; common = []
for i in reduce(lambda x,y: x+y, lol):
    d[i] = d.setdefault(i, 0) + 1
    if d[i] == len(lol): common.append(i)
print common


or the perl version:

&_<$>& s_$
print _#($_{})


;-)

--


regards Max M

the Law of Inverse Squares. With sound, for example, a source twice as
far away from the detector (an ear?) provides just one-quarter of the
strength of signal. ESP has been said to show no fall-off at all, let
alone any diminution of strength. Well, we must admit that zero signal
won't show any change...

Max M

Oct 2, 2002, 6:24:28 PM
Max M wrote:

> Or the shortest one I can come up with. Don't know about the brain-pain
> though ;-)
>
> d = {}; common = []
> for i in reduce(lambda x,y: x+y, lol):
>     d[i] = d.setdefault(i, 0) + 1
>     if d[i] == len(lol): common.append(i)
> print common

Well, I found a shorter one (I can't help it; these short snippets are
like a good puzzle):

l = reduce(lambda x,y:x+y, lol)
l.sort()
common = [l[i] for i in range(len(l)) if l[i:i+len(lol)] == len(lol)*[l[i]]]
print common

Not nice to the brain, nor very efficient I guess. But fun.


;-)

regards Max M

Mark McEahern

Oct 2, 2002, 6:08:01 PM
[Max M]

> Slightly improved:
>
> d = {}
> for l in lol:
>     for i in l:
>         d[i] = d.setdefault(i, 0) + 1
> n=len(lol)
> common = [e for e in d if d[e]==n]
> print common

Improved yet more, since the above version assumes each list has
unique elements (which may have been one of the original requirements, I
don't know):

#!/usr/bin/env python

l = [1, 1, 1]
m = [2, 2, 2]
n = [3, 3, 3]

lol = [l, m, n]

d = {}
for l in lol:
    unique = dict(zip(l, l))
    for i in unique:
        d[i] = d.setdefault(i, 0) + 1

n = len(lol)
common = [e for e in d if d[e]==n]
print common

// m


Dennis Lee Bieber

Oct 2, 2002, 8:44:39 PM
Huw Lynes fed this fish to the penguins on Wednesday 02 October 2002
09:05 am:


>
> This has me utterly stumped. Any help is appreciated. The only articles
> about merging strings that I've managed to find so far have been about
> merging strings so that you don't get repeats. Not quite what I'm
> looking for.
>

There are no doubt better and faster methods... I haven't done enough
with the new features of 2.x -- my biggest exposure was writing an SMTP
send process for my Amiga back when 1.5 came out...


"""
Brute force scan for common items in nested list
"""

nlist = [ ['aaa', 'bbb', 'ccc'],
          ['bbb', 'ccc', 'ddd'],
          ['ccc', 'ddd', 'eee', 'fff'] ]


dct = {}
slcount = 0

for sl in nlist:
    slcount = slcount + 1
    for itm in sl:
        if dct.has_key(itm):
            dct[itm] = dct[itm] + 1
        else:
            dct[itm] = 1

olist = []
for itm in dct.keys():
    if dct[itm] == slcount:
        olist.append(itm)

print olist


--
> ============================================================== <
> wlf...@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG <
> wulf...@dm.net | Bestiaria Support Staff <
> ============================================================== <
> Bestiaria Home Page: http://www.beastie.dm.net/ <
> Home Page: http://www.dm.net/~wulfraed/ <

Huw Lynes

Oct 3, 2002, 1:41:29 PM
Thanks to everyone that gave advice.

I've decided to go with a slightly more verbose version of what Brian E
Gallew proposed, namely:

while len(lol) > 1:
    list1 = lol.pop()
    list2 = lol.pop()
    list3 = []
    for x in list1:
        if x in list2:
            list3.append(x)
    lol.append(list3)

final_list = lol.pop()

The set stuff looks interesting and I will take a look at it when I get
the chance. For now the method above solves my problem in a way that is
simple enough for me to understand in three months' time when I come back
to make changes to what I've just written.

Thanks,
Huw


--
| Huw Lynes | The Moving Picture Company |
| System Administrator | 127 Wardour Street |
|.........................| London, W1F 0NL |


gyromagnetic

Oct 4, 2002, 9:05:04 AM
Huw Lynes <h...@moving-picture.com> wrote in message news:<3D9B1953...@moving-picture.com>...


Hi Huw,
I caught this thread rather late, but here's another variation on the
theme. Not sure how it compares with the other approaches in terms of
speed, efficiency, etc.

-gyro

=====

#!/usr/bin/python

def common_entries(lol):
    mst = dict(zip(lol[0], [1]*len(lol[0])))
    for kk in mst.keys():
        for ll in lol[1:]:
            if kk in ll: mst[kk] += 1
    return [ss for ss,i in mst.items() if i == len(lol)]

Duncan Booth

Oct 4, 2002, 10:45:35 AM
gyroma...@excite.com (gyromagnetic) wrote in news:4620daca.02100...@posting.google.com:

> I caught this thread rather late, but here's another variation on the
> theme. Not sure how it compares with the other approaches in terms of
> speed, efficiency, etc.
>

Here is an alternative inspired by my misreading your solution:

>>> def common_entries(lol):
...     # give each list its own bit in msk: an item present in every
...     # list ends up with all bits set, i.e. i+1 == msk at the end
...     mst, msk = {}, 1
...     for ll in lol:
...         for kk in ll:
...             mst[kk] = mst.get(kk, 0) | msk
...         msk *= 2
...     return [ss for ss,i in mst.items() if i+1 == msk]

>>> lol = [
...     ['aaa', 'bbb', 'bbb', 'ccc'], ['bbb', 'ccc', 'ddd'],
...     ['ccc', 'ddd', 'eee']
...     ]
>>> common_entries(lol)
['ccc']

Thorsten Kampe

Oct 6, 2002, 12:03:11 PM
* Alex Martelli

> The intersection of two "sets" (Sets per se are only added in Python
> 2.3, but you can use either lists OR dictionaries in Python 2.2 as
> sets -- all you need is iterability and ability to use 'in' to test
> membership) is pretty easy:
> [...]

> dictionaries are far faster than lists for membership-tests and
> for deletion of items. Whether it's worth building the dicts
> corresponding to your lists depends on how long are your lists:
> try both ways, pick the faster one if it matters.

Treating lists a priori as sets for reasons of speed is definitely not
the way to go because you're "losing information". I think it's better
to treat them as tuples (in a mathematical sense) whereby you can
always delete 'duplicates' afterwards if you don't care about them.

Example: What's the intersection between [2, 1, 2, 3, 2] and
[0, 1, 2, 3, 2, 4]? An intersection (union, symmetric difference) of
tuples can only be meaningful for the /corresponding/ items of the
tuples/lists, so it's neither [2, 1, 3] nor [2, 1, 2, 3, 2] but
[2, 1, 2, 3].

I tried to write a reliable (though maybe slow) program that takes
care of this issue (especially that intersection is _commutative_):

#v+
def boolean(seq0, seq1, boolean_modus):
    """ perform 'and', 'not', 'or', or 'xor' operation on sequences """
    if boolean_modus == 'and':      # intersection
        seq1 = seq1[:]
        intersection = []
        for item in seq0:
            if item in seq1:
                intersection.append(item)
                seq1.remove(item)
        return intersection

    elif boolean_modus == 'not':    # set difference
        setdiff = seq0[:]
        for item in seq1:
            if item in setdiff:
                setdiff.remove(item)
        return setdiff

    elif boolean_modus == 'or':     # union
        return seq0 + boolean(seq1, seq0, 'not')

    elif boolean_modus == 'xor':    # symmetric difference
        return boolean(boolean(seq0, seq1, 'or'), boolean(seq0, seq1, 'and'), 'not')
#v-
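
For the example above, a quick check of the 'and' branch (a sketch,
using the boolean function just defined):

>>> boolean([2, 1, 2, 3, 2], [0, 1, 2, 3, 2, 4], 'and')
[2, 1, 2, 3]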


Thorsten

Alex Martelli

Oct 6, 2002, 3:05:17 PM
Thorsten Kampe wrote:
...

> Treating lists a priori as sets for reasons of speed is definitely not
> the way to go because you're "losing information". I think it's better

Of course, you have to know which information matters.

> to treat them as tuples (in a mathematical sense) whereby you can
> always delete 'duplicates' afterwards if you don't care about them.

If you don't care about duplicates you're much better off normalizing
first, and if you care about COUNT of duplicates but not ordering (i.e.,
a bag) you're generally still better off normalizing first.

> Example: What's the intersection between [2, 1, 2, 3, 2] and
> [0, 1, 2, 3, 2, 4]? An intersection (union, symmetric difference) of
> tuples can only be meaningful for the /corresponding/ items of the
> tuples/lists, so it's neither [2, 1, 3] nor [2, 1, 2, 3, 2] but
> [2, 1, 2, 3].

But why not [1, 2, 3, 2] instead? I know of no algebras defining
operations called intersection &c that are not commutative. What
you mean by "corresponding" there is rather mysterious to me,
but wouldn't it mean "same item in the SAME position"?


> I tried to write a reliable (though maybe slow) program that takes
> care of this issue (especially that intersection is _commutative_):

But it's NOT -- [1, 2, 3, 2] != [2, 1, 2, 3]. You're losing information,
specifically order information. A correct definition should be:

def intersect(seq1, seq2):
    return [ x for x, y in zip(seq1, seq2) if x == y ]

If what you want is to treat sequences as BAGS, then they're better
represented by dicts again -- order is the only information present
in a list and not in a dict, and you don't seem to consider it anyway.
So to do extensive processing of seqs-as-bags have converter funcs:

def seqToBag(seq):
    bag = {}
    for item in seq: bag[item] = 1 + bag.get(item, 0)
    return bag

def bagToSeq(bag):
    return [ x for key in bag for x in [key]*bag[key]]

and for example "intersection" of bags then becomes, e.g:

def intersectBags(bag1, bag2):
    result = {}
    for item in bag1:
        count = min(bag1[item], bag2.get(item, 0))
        if count>0: result[item] = count
    return result
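
A quick interactive check of these (a sketch; dicts are unordered, so
the exact key order may differ):

>>> b1 = seqToBag([2, 1, 2, 3, 2])      # {2: 3, 1: 1, 3: 1}
>>> b2 = seqToBag([0, 1, 2, 3, 2, 4])
>>> intersectBags(b1, b2)
{1: 1, 2: 2, 3: 1}
>>> bagToSeq(intersectBags(b1, b2))
[1, 2, 2, 3]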


Alex

Thorsten Kampe

Oct 7, 2002, 8:55:41 AM
* Alex Martelli

> Thorsten Kampe wrote:
>> Treating lists a priori as sets for reasons of speed is definitely
>> not the way to go because you're "losing information". I think it's
>> better
>
> Of course, you have to know which information matters.

I meant this: in a real world scenario it's often not practical to
say: use this program, but beware, it relies on an input without
duplicates or relies on all items being hashable, because you can't
tell in advance whether the conditions will be met.

Not for the example you were answering of course but for a general
purpose 'intersection program'.

>> to treat them as tuples (in a mathematical sense) whereby you can
>> always delete 'duplicates' afterwards if you don't care about them.
>
> If you don't care about duplicates you're much better off
> normalizing first, and if you care about COUNT of duplicates but not
> ordering (i.e., a bag) you're generally still better off normalizing
> first.

What do you mean exactly by "normalizing"?

>> Example: What's the intersection between [2, 1, 2, 3, 2] and [0, 1,
>> 2, 3, 2, 4]? An intersection (union, symmetric difference) of
>> tuples can only be meaningful for the /corresponding/ items of the
>> tuples/lists, so it's neither [2, 1, 3] nor [2, 1, 2, 3, 2] but [2,
>> 1, 2, 3].
>
> But why not [1, 2, 3, 2] instead? I know of no algebras defining
> operations called intersection &c that are not commutative.

Sorry, I wrote quite imprecisely. My interpretation of 'tuple' was
contradictory to the traditional mathematical meaning 'order and
repetition counts' but it meant "repetition counts, order doesn't".

> What you mean by "corresponding" there is rather mysterious to me,
> but wouldn't it mean "same item in the SAME position"?

I was asking myself: what would be a sensible generalization for the
common set operations (intersection, union, etc.) for tuples?

Precisely: what is (or should be) the intersection of [2, 2] and [2,
2, 2]? Treat them as sets? Then the result is '[2]'.

Treat them as tuples? The most common algorithm is: for each item in
first seq: if it is in second seq, then add it to the intersection.

But then:
intersect([2, 2], [2, 2, 2]) = [2, 2]
intersect([2, 2, 2], [2, 2]) = [2, 2, 2]
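
In code, the algorithm just described looks like this (a sketch;
naive_intersect is a hypothetical name):

def naive_intersect(seq1, seq2):
    # for each item in the first seq: keep it if it is in the second
    return [item for item in seq1 if item in seq2]

print naive_intersect([2, 2], [2, 2, 2])    # [2, 2]
print naive_intersect([2, 2, 2], [2, 2])    # [2, 2, 2] -- not commutative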

You "connect" the "corresponding" items (the first '2' in seq1 with
the first in seq2, the second '2' in seq1 with the second '2' in seq2,
the third '2' in seq1 with the third '2' in seq2 ... *ooops*)
[1, 2, 2, 2, 3]
| | | | <- "non proportional font required"
[0, 1, 2, 2, 3, 4]

So the intersection should contain all the items that can be
"paired"/"connected". The last '2' of seq2 has no 'partner' so it
shouldn't be part of the intersection.

>> I tried to write a reliable (though maybe slow) program that takes
>> care of this issue (especially that intersection is _commutative_):
>
> But it's NOT -- [1, 2, 3, 2] != [2, 1, 2, 3]. You're losing
> information, specifically order information. A correct definition
> should be:
>
> def intersect(seq1, seq2):
>     return [ x for x, y in zip(seq1, seq2) if x == y ]

This is correct for "if first item of seq1 = first item of seq2", but I
never had a use for such an "intersection". I could not see any "real
use" for intersect([1, 2, 3], [3, 1, 2]) = [].

> If what you want is to treat sequences as BAGS, then they're better
> represented by dicts again -- order is the only information present
> in a list and not in a dict, [...]

Well, usually not:
>>> {1: 2, 1: 2} == {1:2}
1

>>> [2, 2] == [2]
0

> So to do extensive processing of seqs-as-bags have converter funcs:
>
> def seqToBag(seq):
>     bag = {}
>     for item in seq: bag[item] = 1 + bag.get(item, 0)
>
> def bagToSeq(bag):
>     return [ x for key in bag for x in [key]*bag[key]]
>
> and for example "intersection" of bags then becomes, e.g:
>
> def intersectBags(bag1, bag2):
>     result = {}
>     for item in bag1:
>         count = min(bag1[item], bag2.get(item, 0))
>         if count>0: result[item] = count
          ^^^^^^^^^^ <- this is superfluous IMO
>     return result

Yes, definitely better than my "append to result, remove from original
list to prevent item being counted twice" version. And a lot faster
(except for the abnormal "very large seq0, very small seq1" case).

This is what I made of your code (it's a little optimized compared to
my previous version so that the loop goes always through the smaller
sequence.) I didn't change the 'union' and "symmetric difference" part
because you have to loop through count0 *and* count1.

#v+
def boolean(seq0, seq1, boolean_modus):
    """ perform 'and', 'not', 'or', or 'xor' operation on sequences """

    ## auxiliary functions
    def tocount(seq):    # convert to dict where key is the hashable representation
        count = {}       # of the object and value is the object count
        for item in seq:
            count[repr(item)] = count.get(repr(item), 0) + 1
        return count

    def toseq(count):    # convert back to sequence
        return [item for key in count for item in [eval(key)] * count[key]]

    if boolean_modus == 'and':      # intersection
        count0 = tocount(seq0)
        count1 = tocount(seq1)

        if count0 > count1:         # loop through the shorter list
            count0, count1 = count1, count0
        for item in count0:
            count0[item] = min(count0[item], count1.get(item, 0))
        return toseq(count0)

    elif boolean_modus == 'not':    # set difference
        count0 = tocount(seq0)
        count1 = tocount(seq1)

        if count0 < count1:         # loop through the shorter list
            for item in count0:
                count0[item] = count0[item] - count1.get(item, 0)
        else:
            for item in count1:
                count0[item] = count0.get(item, 0) - count1[item]
        return toseq(count0)
#v-

Alex Martelli

Oct 7, 2002, 11:22:28 AM
Thorsten Kampe wrote:
...

> I meant this: in a real world scenario it's often not practical to
> say: use this program, but beware, it's relies on an input without
> duplicates or relies on all items to be hashable, because you can't
> tell in advance whether the conditions will be met.

Right, but you can at least check whether they're met, and fall
back to less-efficient means if they aren't.

>> If you don't care about duplicates you're much better off
>> normalizing first, and if you care about COUNT of duplicates but not
>> ordering (i.e., a bag) you're generally still better off normalizing
>> first.
>
> What do you mean exactly by "normalizing"?

Sorry, I was sloppy here -- I mean "moving to a better way to
represent sets" (dicts, or Sets in Python 2.3). Even for bags,
there are more usable representations than "sequences with
repetitions in random order", btw -- see further.

> Sorry, I wrote quite imprecisely. My interpretation of 'tuple' was
> contradictory to the traditional mathematical meaning 'order and
> repetition counts' but it meant "repetition counts, order doesn't".

So what you want are 'bags'. OK by me, but "you can't tell in
advance whether the conditions will be met" (your words!) ALSO
applies -- and here it's quite a bit worse than wrt duplicates
or hashability, because you CAN'T check whether this "condition"
matters... if you're fed lists, there's no way to know whether
those lists' items' ordering are considered meaningful within
the given application program, or no.

>> What you mean by "corresponding" there is rather mysterious to me,
>> but wouldn't it mean "same item in the SAME position"?
>
> I was asking myself: what would be a sensible generalization for the
> common set operations (intersection, union, etc.) for tuples?

For BAGS (order meaningless), the natural representation is as
a mapping item-># of occurrences, and the natural generalization
of intersection and union use min and max (the natural gener. to
integers of and and or on booleans!) of course equating the
mapping to 0 with absence from the bag.
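
As a sketch of that generalization, here is a union counterpart to the
intersectBags function shown earlier in the thread (unionBags is a
hypothetical name; same bag-as-dict representation assumed):

def unionBags(bag1, bag2):
    # max of the two counts; an item missing from a bag counts as 0
    result = bag2.copy()
    for item in bag1:
        result[item] = max(bag1[item], bag2.get(item, 0))
    return result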


> Precisely: what is (or should be) the intersection of [2, 2] and [2,
> 2, 2]? Treat them as sets? Then the result is '[2]'.
>
> Treat them as tuples? The most common algorithm is: for each item in
> first seq: if it is in second seq, then add it to the intersection.
>
> But then:
> intersect([2, 2], [2, 2, 2]) = [2, 2]
> intersect([2, 2, 2], [2, 2]) = [2, 2, 2]
>
> You "connect" the "corresponding" items (the first '2' in seq1 with
> the first in seq2, the second '2' in seq1 with the second '2' in seq2,
> the third '2' in seq1 with the third '2' in seq2 ... *ooops*)
> [1, 2, 2, 2, 3]
> | | | | <- "non proportional font required"
> [0, 1, 2, 2, 3, 4]
>
> So the intersection should contain all the items that can be
> "paired"/"connected". The last '2' of seq2 has no 'partner' so it
> shouldn't be part of the intersection.

If order doesn't matter, then it does matter WHICH '2's of seq2
are or aren't part of the intersection, of course -- which is
part of why representing bags by mappings is so much more
natural than representing them by sequences with repetitions
in random order.

If you ARE constrained to use sequences (because mappings are
forbidden, e.g. by the non-hashable nature of items in a
context where you're only allowed to use dicts for mappings)
I think you're still better off simulating a mapping than
fighting with the unnatural representation of bags as
sequences with repetitions in random order -- when you don't
care about the order, it just gets in the way!

So, if I had to support non-hashables, I would fall back
to a representation as a list of two-items lists, meaning
item/count. I would actually wrap that up in all the
trappings of a mapping, and use dicts normally, falling
back to the inefficient linear representation only for
non-hashable items. In many cases, I could have a dict
for most stuff, with a small "look-aside list of pairs"
just for those few items that are non-hashable. Another
usable alternative is sometimes to make non-hashables
into hashables, e.g. via pickle (costly, to be sure, but
it can make the difference between O(N) and O(N squared)
work). But I'm getting sidetracked here, sorry.
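
A minimal sketch of such a look-aside arrangement (names hypothetical;
hashable items are counted in the dict, the rest in a small list of
[item, count] pairs):

def bagAdd(dictpart, listpart, item):
    try:
        dictpart[item] = dictpart.get(item, 0) + 1
    except TypeError:    # unhashable item: fall back to the linear list
        for pair in listpart:
            if pair[0] == item:
                pair[1] = pair[1] + 1
                return
        listpart.append([item, 1])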


>> But it's NOT -- [1, 2, 3, 2] != [2, 1, 2, 3]. You're losing
>> information, specifically order information. A correct definition
>> should be:
>>
>> def intersect(seq1, seq2):
>> return [ x for x, y in zip(seq1, seq2) if x == y ]
>
> This is correct for "if first item of seq1 = first item of seq2" but I
> had never use for such an "intersection". I could not see any "real
> use" for intersect([1, 2, 3], [3, 1, 2]) = [].

"What songs are in the same position in our hit list as they were
last week?" is an often-answered question on radio programs that
give "hit parades", for example. "none of them" is a valid
answer. If order is meaningful (and if you're using sequences
it should normally be), then throwing order-information away is
a very suspect design decision.
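
For instance, with the positional intersect above (song names
hypothetical):

last_week = ['songA', 'songB', 'songC']
this_week = ['songB', 'songA', 'songC']
print intersect(this_week, last_week)    # ['songC'] -- only songC kept its place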


>> If what you want is to treat sequences as BAGS, then they're better
>> represented by dicts again -- order is the only information present
>> in a list and not in a dict, [...]
>
> Well, usually not:
>>>> {1: 2, 1: 2} == {1:2}
> 1
>
>>>> [2, 2] == [2]
> 0

Not sure what you mean here. If you're representing bags by dicts
in this way, sequence [2, 2] is represented by dict {2: 2}, i.e.,
"item 2 is present twice". Here you don't lose information just
because you consider the two occurrences indistinguishable. More
often you do, e.g. there are three different ways to represent
baf {1:2, 2:1} as a sequence: [1,1,2], [1,2,1], [2,1,1]. Having
many ways to represent what is to be considered the "same" thing
to all intent and purposes should raise suspicion off the bat.

It doesn't matter that you can write dict {1:2} (representing
the same bag as [1,1] in sequence-representation) with many
different dictionary-display strings -- that's just an artifact
of the parser, not inherent in the in-memory representation of
the abstract datum "bag" as a concrete Python object. That you
can write {1:2,1:2} with the same meaning as {1:2} is almost as
irrelevant as the fact that you can add spaces, or a trailing
comma, at your pleasure -- it doesn't change the dictionary
object we're getting from the parser.

>> and for example "intersection" of bags then becomes, e.g:
>>
>> def intersectBags(bag1, bag2):
>>     result = {}
>>     for item in bag1:
>>         count = min(bag1[item], bag2.get(item, 0))
>>         if count>0: result[item] = count
>           ^^^^^^^^^^ <- this is superfluous IMO

I prefer to avoid redundant storing of "mapping to 0", since
to all intents and purposes they're equivalent to absence
from the bag. Once again, instinctively I prefer to find
a CANONICAL representation -- where each abstract object is
represented in ONE way as a concrete object. Maybe it's an
artefact of having worked too much in combinatorics (where
double-counting is a perennial worry:-), or maybe it's an
application of Python's motto "there should be one way,
and ideally only one way". Sometimes, for efficiency, I
have to make do with non-canonical implementations during
computation, and normalize them back to canonical form at
the end (e.g., computing with rationals, go back to the
canonical representation with mutually prime numerator and
denominator only after a series of computations), but I'm
always somewhat uncomfortable when such needs arise -- as
I consider them an optimization, I'm VERY loath to perform
them prematurely:-).

>> return result
>
> Yes, definitely better than my "append to result, remove from original
> list to prevent item being counted twice" version. And a lot faster
> (except for the abnormal "very large seq0, very small seq1" case).

Yes, you can check lengths and swap the bags if you want to
ensure O(min(len(bag1), len(bag2))) performance -- again this
seems to me to be more of an optimization than a "deep" issue,
but of course it may have practical importance.


> This is what I made of your code (it's a little optimized compared to
> my previous version so that the loop goes always through the smaller
> sequence.) I didn't change the 'union' and "symmetric difference" part
> because you have to loop through count0 *and* count1.

Given that you delegate some of the implementation of your 'or' to
your 'not', which goes to the trouble of converting back and
forth, I'd bet you would be somewhat better off by simplifyind --
rather than:

> if boolean_modus == 'and': # intersection
> count0 = tocount(seq0)
> count1 = tocount(seq1)
>
> if count0 > count1: # loop through the shorter list

that's NOT what you're testing for here -- use len(count0) etc
to test the lengths.

>         count0, count1 = count1, count0
>     for item in count0:
>         count0[item] = min(count0[item], count1.get(item, 0))
>     return toseq(count0)

have another auxiliary function:
def binop(seq0, seq1, op):
    count0 = tocount(seq0)
    count1 = tocount(seq1)
    if len(count0) > len(count1):
        count1, count0 = count0, count1
    for item in count0:
        count0[item] = op(count0[item], count1.get(item, 0))
    return toseq(count0)    # convert back to a sequence, as in your 'and' branch

and implement your 'and' by returning binop(seq0, seq1, min) and
your 'or' by returning binop(seq0, seq1, max) -- I think this
also clarifies the parallelism between the two cases.


A few more points:

> def boolean(seq0, seq1, boolean_modus):
>     """ perform 'and', 'not', 'or', or 'xor' operation on sequences """
>     ## auxiliary functions
>     def tocount(seq):    # convert to dict where key is the hashable representation
>         count = {}       # of the object and value is the object count
>         for item in seq:
>             count[repr(item)] = count.get(repr(item), 0) + 1
>         return count
>
>     def toseq(count):    # convert back to sequence
>         return [item for key in count for item in [eval(key)] * count[key]]

You're FAR more likely to meet objects where you can't count
on eval(repr(x)) == x, than non-hashable objects.

> elif boolean_modus == 'not':    # set difference
>     count0 = tocount(seq0)
>     count1 = tocount(seq1)
>
>     if count0 < count1:          # loop through the shorter list
>         for item in count0:
>             count0[item] = count0[item] - count1.get(item, 0)
>     else:
>         for item in count1:
>             count0[item] = count0.get(item, 0) - count1[item]

I don't understand this. This way you can end up with negative
values for several items in count0 -- what do they MEAN? You may
luck out in that [blah]*N for ANY N<=0 is [], but it seems quite
peculiar to me to COUNT on this (particularly w/o a comment!-).


> elif boolean_modus == 'xor':    # symmetric difference
>     return boolean(boolean(seq0, seq1, 'or'), boolean(seq0, seq1, 'and'), 'not')

This one could also be more speedily implemented with the
already presented binop and another little auxiliary function:

def maxminusmin(onecount, anothercount):
    return max(onecount, anothercount) - min(onecount, anothercount)

and your 'xor' is implementable as
return binop(seq0, seq1, maxminusmin)


Alex

Thorsten Kampe

Oct 7, 2002, 8:04:36 PM
> Thorsten Kampe wrote:
>> I meant this: in a real world scenario it's often not practical to
>> say: use this program, but beware, it relies on an input without
>> duplicates or relies on all items to be hashable, because you can't
>> tell in advance whether the conditions will be met.
>
> Right, but you can at least check whether they're met, and fall back
> to less-efficient means if they aren't.

Well, my demands for code are different: short, clean (modular) and
not O(n**2) which means the time to run a program should be linear or
logarithmic, not quadratic.

"Optimization" and error checking makes your code three or four times
as long (and without exception handling even more).

>> Sorry, I wrote quite imprecisely. My interpretation of 'tuple' was
>> contradictory to the traditional mathematical meaning 'order and
>> repetition counts' but it meant "repetition counts, order doesn't".
>
> So what you want are 'bags'. OK by me, but "you can't tell in
> advance whether the conditions will be met" (your words!) ALSO
> applies -- and here it's quite a bit worse than wrt duplicates or
> hashability, because you CAN'T check whether this "condition"
> matters... if you're fed lists, there's no way to know whether those
> lists' items' ordering are considered meaningful within the given
> application program, or no.

Sorry, if I run my program I have to know what I want, or I should get
a clue. But I cannot tell whether the input stream meets some special
conditions (except by letting the program do the testing and decide
which algorithm to use).

>> You "connect" the "corresponding" items (the first '2' in seq1 with
>> the first in seq2, the second '2' in seq1 with the second '2' in
>> seq2, the third '2' in seq1 with the third '2' in seq2 ... *ooops*)
>> [1, 2, 2, 2, 3]
>> | | | | <- "non proportional font required"
>> [0, 1, 2, 2, 3, 4]
>>
>> So the intersection should contain all the items that can be
>> "paired"/"connected". The last '2' of seq2 has no 'partner' so it
>> shouldn't be part of the intersection.
>
> If order doesn't matter, then it does matter WHICH '2's of seq2 are
> or aren't part of the intersection, of course [...]

That doesn't make sense to me. If order doesn't matter but the count
does, then it doesn't matter which two '2's of seq1 are part of the
intersection, only that it's two '2's (and not one or three).

[1, 2, 2, 2, 3]       [1, 2, 2, 2, 3]       [1, 2, 2, 2, 3]
 |  |  |  |      or    |  |  |  |      or    |  |  |  |
[0, 1, 2, 2, 3, 4]    [0, 1, 2, 2, 3, 4]    [0, 1, 2, 2, 3, 4]

> If order is meaningful (and if you're using sequences it should
> normally be), then throwing order-information away is a very suspect
> design decision.

The aspect of "order" I was thinking of, was the "output order".
What's the intersection of [1, 2, 3] and [3, 2, 1]? I could not think
of any "natural" reason to declare: "it's [1, 2, 3] and not [3, 2, 1]"
(or vice versa).

>>> If what you want is to treat sequences as BAGS, then they're
>>> better represented by dicts again -- order is the only information
>>> present in a list and not in a dict, [...]
>>
>> Well, usually not:
>>>>> {1: 2, 1: 2} == {1:2}
>> 1
>>
>>>>> [2, 2] == [2]
>> 0

> Not sure what you mean here. [...]

A "naive" explanation of the difference between dictionary and list
would be: in a list repetition and order matters, in a dictionary not:
[1, 2, 3] != [1, 3, 2] != [1, 2, 2, 3]; {1:11, 2:22} == {2: 22, 1:11}
== {1:11, 2:22, 1:11} (at least that's what the interpreter says)

> It doesn't matter that you can write dict {1:2} (representing the
> same bag as [1,1] in sequence-representation) with many different
> dictionary-display strings -- that's just an artifact of the parser,
> not inherent in the in-memory representation of the abstract datum
> "bag" as a concrete Python object. That you can write {1:2,1:2} with
> the same meaning as {1:2} is almost as irrelevant as the fact that
> you can add spaces, or a trailing comma, at your pleasure -- it
> doesn't change the dictionary object we're getting from the parser.

Yes, but what you say is not "naive" anymore but "indepth". ;-)

>>> and for example "intersection" of bags then becomes, e.g:
>>>
>>> def intersectBags(bag1, bag2):
>>>     result = {}
>>>     for item in bag1:
>>>         count = min(bag1[item], bag2.get(item, 0))
>>>         if count>0: result[item] = count
>>          ^^^^^^^^^^ <- this is superfluous IMO
>
> I prefer to avoid redundant storing of "mapping to 0", since to all
> intents and purposes they're equivalent to absence from the bag.

Yes, but what is more "costly"? The "if-branch" in the loop or the
possible "null objects" (1:0, 2:0) in memory? I don't know.

> Maybe it's an artefact of having worked too much in combinatorics

> (where double-counting is a perennial worry:-), [...]

Glad to hear that! So you're the chosen victim and get to have a quick
look at my "combinatoric" program that deals with permutations,
combinations and variations (with and without repetition).

>> This is what I made of your code (it's a little optimized compared to
>> my previous version so that the loop goes always through the smaller
>> sequence.) I didn't change the 'union' and "symmetric difference" part
>> because you have to loop through count0 *and* count1.

> [...]


>> count0 = tocount(seq0)
>> count1 = tocount(seq1)
>>
>> if count0 > count1: # loop through the shorter list
>
> that's NOT what you're testing for here -- use len(count0) etc
> to test the lengths.

Are you sure? I couldn't find a definition for "<" with dictionaries,
but "{111: 111} < {11:11, 22:22}" and "{111: 111, 22:22} > {11:11, 22:22}"
made me think: "'<' first tests the lengths and if they're the same the
keys are compared" -- but that was my assumption.

> have another auxiliary function:
> def binop(seq0, seq1, op):
>     count0 = tocount(seq0)
>     count1 = tocount(seq1)
>     if len(count0) > len(count1):
>         count1, count0 = count0, count1
>     for item in count0:
>         count0[item] = op(count0[item], count1.get(item, 0))
>
> and implement your 'and' by returning binop(seq0, seq1, min) and
> your 'or' by returning binop(seq0, seq1, max) -- I think this
> also clarifies the parallelism between the two cases.

It's not that easy. Union and symmetric difference take elements of
both sequences (e. g. union([11, 22], [22, 33])) so you have to loop
through both sequences!

So for the union, I think my "seq0 + (seq1 - seq0)" is shorter (and
probably faster).

> A few more points:
>> [...]


>> count[repr(item)] = count.get(repr(item), 0) + 1
>> return count

>> [...]


>>     return [item for key in count for item in [eval(key)] * count[key]]
>
> You're FAR more likely to meet objects where you can't count
> on eval(repr(x)) == x, than non-hashable objects.

When would that be?

>> elif boolean_modus == 'not':    # set difference
>>     count0 = tocount(seq0)
>>     count1 = tocount(seq1)
>>
>>     if count0 < count1:          # loop through the shorter list
>>         for item in count0:
>>             count0[item] = count0[item] - count1.get(item, 0)
>>     else:
>>         for item in count1:
>>             count0[item] = count0.get(item, 0) - count1[item]
>
> I don't understand this. This way you can end up with negative
> values for several items in count0 -- what do they MEAN? You may
> luck out in that [blah]*N for ANY N<=0 is [], but it seems quite
> peculiar to me to COUNT on this (particularly w/o a comment!-).

Yes, the implementation of "[item] * N" is quite useful for n < 1. It
means "no occurrence left for item in seq0": [2] - [2, 2, 2, 2] ->
{2: -3}

>> elif boolean_modus == 'xor':    # symmetric difference
>>     return boolean(boolean(seq0, seq1, 'or'), boolean(seq0, seq1, 'and'), 'not')
>
> This one could also be more speedily implemented with the
> already presented binop and another little auxiliary function:
>
> def maxminusmin(onecount, anothercount):
>     return max(onecount, anothercount) - min(onecount, anothercount)

That's abs(onecount - anothercount), isn't it?

> and your 'xor' is implementable as
> return binop(seq0, seq1, maxminusmin)

Same as for union: I'd have to loop through both lists, but that's
cheap compared to the triple tocount -> toseq conversion I do at the
moment.


Thorsten

Thorsten Kampe

Oct 8, 2002, 5:52:01 AM
* Thorsten Kampe

> * Alex Martelli
>> Thorsten Kampe wrote:
[massive snip]

>> have another auxiliary function:
>> def binop(seq0, seq1, op):
>>     count0 = tocount(seq0)
>>     count1 = tocount(seq1)
>>     if len(count0) > len(count1):
>>         count1, count0 = count0, count1
>>     for item in count0:
>>         count0[item] = op(count0[item], count1.get(item, 0))
>>
>> and implement your 'and' by returning binop(seq0, seq1, min) and
>> your 'or' by returning binop(seq0, seq1, max) -- I think this also
>> clarifies the parallelism between the two cases.
>
> It's not that easy. Union and symmetric difference take elements of
> both sequences (e. g. union([11, 22], [22, 33])) so you have to loop
> through both sequences!
>
> So for the union, I think my "seq0 + (seq1 - seq0)" is shorter (and
> probably faster).

Okay, I rewrote the "seq0 + (seq1 - seq0)" part:
#v+
elif boolean_modus == 'or':    # union
    union = {}
    for item in count0:
        union[item] = max(count0[item], count1.get(item, 0))
    for item in count1:
        union[item] = max(count0.get(item, 0), count1[item])
    return toseq(union)
#v-

Like I predicted, with "seq0 = seq1 = [1] * 1000000" it's about 0.8
sec slower, and with "seq0 = [1] * 1000000, seq1 = range(1000000)" it's
about 3.7 sec slower.

>>> elif boolean_modus == 'xor':    # symmetric difference
>>>     return boolean(boolean(seq0, seq1, 'or'), boolean(seq0, seq1, 'and'), 'not')
>>
>> This one could also be more speedily implemented with the already
>> presented binop and another little auxiliary function:
>>
>> def maxminusmin(onecount, anothercount):
>>     return max(onecount, anothercount) - min(onecount, anothercount)
>
> That's abs(onecount - anothercount), isn't it?
>
>> and your 'xor' is implementable as
>> return binop(seq0, seq1, maxminusmin)
>
> Same as for union: I'd have to loop through both lists, but that's
> cheap compared to the triple todict -> to seq conversion I do at the
> moment.

I rewrote this, too:
#v+
elif boolean_modus == 'xor':    # symmetric difference
    symdiff = {}
    for item in count0:
        symdiff[item] = abs(count0[item] - count1.get(item, 0))
    for item in count1:
        symdiff[item] = abs(count0.get(item, 0) - count1[item])
    return toseq(symdiff)
#v-

With the same tests as for union, it's three and two times as fast, respectively.

Thanks!


Thorsten

Thorsten Kampe

Oct 8, 2002, 8:50:22 AM
* Alex Martelli

> Maybe it's an artefact of having worked too much in combinatorics
> (where double-counting is a perennial worry:-), [...]

I would like you to have a look at my "combinatorics" program. It's a
straight translation from the original RPL (a mixture of Lisp and
Forth) code. RPL has no dictionaries, just lists, so the data
representation could be inefficient.

The code is loosely commented so here's what it does: it deals with
the whole field of combinatorics, that is permutation, variation and
combination with and without repetition.

The resulting list can be quite large, so for convenience the main
"if-branch" computes the length of the resulting list rather than the
list itself. If you're just interested in the count, you can always
give the actual list /or/ the size of the list as argument except for
permutation with repetition where you have to give the actual list.

'P' means 'permutation', 'C' combination and 'V' variation
'#' means return the /count/ otherwise the actual list
'R' means 'with repetition' otherwise it's without
combinatoric([1, 2, 3, 4], '#P') -> 24 # permutation without repetition
combinatoric(4, '#P') -> 24
combinatoric([1, 2, 3, 3], '#PR') -> 12 # permutation with repetition
combinatoric([1, 2, 3, 4, 5], 2, '#C') -> 10 # combination without repetition
combinatoric(5, 2, '#C') -> 10
combinatoric([1, 2, 3, 4, 5], 2, '#CR') -> 15 # combination with repetition
combinatoric(5, 2, '#CR') -> 15
combinatoric([1, 2, 3, 4, 5], 2, '#V') -> 20 # variation without repetition
combinatoric(5, 2, '#V') -> 20
combinatoric([1, 2, 3, 4, 5], 2, '#VR') -> 25 # variation with repetition
combinatoric(5, 2, '#VR') -> 25

combinatoric([1, 2, 3, 4], 2, 'C') -> [[2, 3], [3, 4], [1, 3], [2, 4], [1, 4], [1, 2]]
combinatoric([1, 2, 3, 3], 2, 'VR') -> [[1, 1], [1, 2], [1, 3], [1, 3], [2, 1], [2, 2],
[2, 3], [2, 3], [3, 1], [3, 2], [3, 3], [3, 3],
[3, 1], [3, 2], [3, 3], [3, 3]]


One comment on the permutation algorithm: GvR wrote a nice
demonstration (http://www.python.org/doc/current/ref/indentation.html)
that gives the permutations lexicographically sorted but is about 8
times slower for range(9) than my version.

I treat the argument list strictly as a tuple (where repetition
matters) so for combination and variation I'm actually generating the
lists for the index set range(len(seq)) and finally convert this
back to the original objects.

The program uses some standard external routines:
comb, fact, perm, quotient_set, set and cartes listed beneath the code
for combinatoric.

Thorsten

#v+
def combinatoric(seq, k, modus=''):
    # combination: order is irrelevant, variation: order is relevant
    #from function import comb, fact, perm
    if '#' in modus or '#' in str(k):
        if isinstance(seq, list) and k != '#PR':
            seq = len(seq)
        if modus == '#C':       # combination without repetition
            return comb(seq, k)
        elif modus == '#CR':    # combination with repetition
            return comb(seq+k-1, k)
        elif k == '#P':         # permutation without repetition
            return fact(seq)
        elif k == '#PR':        # permutation with repetition
            return reduce(lambda x, y: x / fact(len(y)),
                          [fact(len(seq))] + quotient_set(seq, lambda x: x).values())
        elif modus == '#V':     # variation without repetition
            return perm(seq, k)
        elif modus == '#VR':    # variation with repetition
            return seq ** k
    else:
        if k == 'P':            # permutation without repetition
            if len(seq) <= 1:
                return [seq]
            else:
                perms = []
                for permutation in combinatoric(seq[:-1], 'P'):
                    for i in range(len(seq)):
                        perms.append(permutation[:])
                        perms[-1].insert(i, seq[-1])
                return perms
        elif k == 'PR':         # permutation with repetition
            return set(combinatoric(seq, 'P'))
        else:                   # combination/variation with/without repetition
            if k == 1:
                return [[item] for item in seq]
            else:
                # auxiliary functions
                def setord(seq):    # remove sets with duplicates (order is relevant)
                    def sort(seq):
                        seq.sort()
                        return seq
                    return set([sort(item) for item in seq])

                def setrep(seq):    # remove sets with duplicates (repetition is relevant)
                    def delrep(seq):    # remove duplicates while maintaining order
                        result = []
                        for item in seq:
                            if item not in result:
                                result.append(item)
                        return result
                    return [item for item in seq if item == delrep(item)]

                cartesmodus = 'pair'
                # treat 'seq' as a tuple not as a set; use its index set instead
                result = range(len(seq))
                for i in range(k-1):
                    cartesprod = cartes(result, range(len(seq)), cartesmodus)
                    # filter the cartesian product by order or repetition
                    if modus == 'C':       # combination without repetition
                        result = setrep(setord(cartesprod))
                    elif modus == 'CR':    # combination with repetition
                        result = setord(cartesprod)
                    elif modus == 'V':     # variation without repetition
                        result = setrep(cartesprod)
                    elif modus == 'VR':    # variation with repetition
                        result = cartesprod
                    cartesmodus = 'triple'
                # convert the index sets back to the actual objects
                return [[seq[index] for index in indices] for indices in result]
#v-

#v+
def comb(n, k):
    return fact(n) / (fact(k) * fact(n-k))

def fact(integer):
    factorial = 1
    for factor in range(2, integer+1):
        factorial *= factor
    return factorial

def perm(n, k):
    return fact(n) / fact(n-k)

def cartes(seq0, seq1, modus='pair'):
    """ return the Cartesian Product of two sequences """
    if modus == 'pair':
        return [[item0, item1] for item0 in seq0 for item1 in seq1]
    elif modus == 'triple':
        return [item0 + [item1] for item0 in seq0 for item1 in seq1]

def set(seq):
    """ make seq a true set by removing duplicates """
    return [item[0] for item in quotient_set(seq, lambda x: x).values()]

def quotient_set(seq, func):
    """ partition seq into equivalence classes """
    quotient_set = {}
    for item in seq:
        quotient_set.setdefault(repr(func(item)), []).append(item)
    return quotient_set
#v-

Peter Dobcsanyi

Oct 9, 2002, 7:51:27 AM
Thorsten Kampe <thor...@thorstenkampe.de> wrote:
> The program uses some standard external routines:
> comb, fact, perm, quotient_set, set and cartes listed beneath the code
...

> #v+
> def comb(n, k):
>     return fact(n) / (fact(k) * fact(n-k))
>
> def fact(integer):
>     factorial = 1
>     for factor in range(2, integer+1):
>         factorial *= factor
>     return factorial
>
> def perm(n, k):
>     return fact(n) / fact(n-k)

Risking the sin of premature optimization :-), here is a version for
comb() not defined in terms of the factorial function.

def combo(n, k):
    if k < 0 or n < k:
        return 0
    if 2 * k > n:
        k = n - k
    b = 1
    if k > 0:
        for i in xrange(k):
            b = (b * (n - i)) / (i + 1)
    return b

Try to compare the running times of comb() and combo() with some really
"large" (depending on your machine) numbers.

perm() can be changed similarly.
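
As a quick sanity check (a sketch, assuming the factorial-based comb()
from earlier in the thread is also in scope):

for n in range(10):
    for k in range(n+1):
        assert combo(n, k) == comb(n, k)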

Peter

Alex Martelli

Oct 10, 2002, 6:00:38 AM
Peter Dobcsanyi wrote:

> Thorsten Kampe <thor...@thorstenkampe.de> wrote:
>> The program uses some standard external routines:
>> comb, fact, perm, quotient_set, set and cartes listed beneath the code
> ...
>> #v+
>> def comb(n, k):
>>     return fact(n) / (fact(k) * fact(n-k))
>>
>> def fact(integer):
>>     factorial = 1
>>     for factor in range(2, integer+1):
>>         factorial *= factor
>>     return factorial
>>
>> def perm(n, k):
>>     return fact(n) / fact(n-k)
>
> Risking the sin of premature optimization :-), here is a version for
> comb() not defined in terms of the factorial function.

I think a simpler optimization is to memoize fact (untested, I'm
just typing this in by memory):

_facts = [1]
def fact(integer):
    assert integer >= 0
    for i in range(len(_facts), integer+1):
        _facts.append(i*_facts[-1])
    return _facts[integer]

This makes it fast enough in an amortized sense to use freely
as a primitive. And you only need to pre-populate _facts in
the obvious way to a greater extent to reduce the startup
cost to be amortized.

You need _facts = [1L] if you want to support old versions
of Python, of course.


Alex

Peter Dobcsanyi

Oct 10, 2002, 8:49:20 AM
Peter Dobcsanyi <dpe...@designtheory.org> wrote:
> Risking the sin of premature optimization :-), here is a version for
> comb() not defined in terms of the factorial function.
>
> def combo(n, k):
>     if k < 0 or n < k:
>         return 0
>     if 2 * k > n:
>         k = n - k
>     b = 1
>     if k > 0:
>         for i in xrange(k):
>             b = (b * (n - i)) / (i + 1)
>     return b
>

Alex Martelli <al...@aleax.it> wrote:
> I think a simpler optimization is to memoize fact (untested, I'm
> just typing this in by memory):
>
> _facts = [1]
> def fact(integer):
>     assert integer >= 0
>     for i in range(len(_facts), integer+1):
>         _facts.append(i*_facts[-1])
>     return _facts[integer]
>
> This makes it fast enough in an amortized sense to use freely
> as a primitive. And you only need to pre-populate _facts in
> the obvious way to a greater extent to reduce the startup
> cost to be amortized.

I don't know which one is simpler, but for "large" numbers the memoizing
version is not a win in this case, it has a rather big memory footprint.

For example, clocking the time of comb(32000, 31000) I got:

            time        memory size

combo       0.02 sec    2.5 Mbyte
memoizing   0.42 sec    839 Mbyte  !!!

The time for the memoizing version was taken the second time around, so
_facts was already populated. Continuing with the populated _facts:

    n      k    combo   memoizing

32000  16000    1.51    1.99
15000   7500    0.35    0.42
15000  14000    0.02    0.16
10000   5000    0.17    0.19
10000   9000    0.02    0.09
 5000   2500    0.04    0.04
 5000   4500    0.01    0.02
 1000    500    0.0     0.0

Peter

Alex Martelli

Oct 10, 2002, 9:25:57 AM
Peter Dobcsanyi wrote:
...

> Alex Martelli <al...@aleax.it> wrote:
>> I think a simpler optimization is to memoize fact (untested, I'm
...

> I don't know which one is simpler, but for "large" numbers the memoizing
> version is not a win in this case, it has a rather big memory footprint.
>
> For example, clocking the time of comb(32000, 31000) I got:
>
>             time        memory size
>
> combo       0.02 sec    2.5 Mbyte
> memoizing   0.42 sec    839 Mbyte  !!!

Wow, hadn't considered that -- a wonder that everything isn't
thrashing to death with THAT much memory taken!

> The time for the memoizing version was taken the second time around, so
> _facts was already populated. Continuing with the populated _facts:
>
>     n      k    combo   memoizing
>
> 32000  16000    1.51    1.99
> 15000   7500    0.35    0.42
> 15000  14000    0.02    0.16
> 10000   5000    0.17    0.19
> 10000   9000    0.02    0.09
>  5000   2500    0.04    0.04
>  5000   4500    0.01    0.02
>  1000    500    0.0     0.0

Incidentally, if you do a lot of combinatorics and need reasonable
performance, you may want to look at gmpy.sf.net:

>>> import gmpy
>>> import time
>>> def timeco(n, k):
...     reps = [None]*10
...     start = time.clock()
...     for x in reps: gmpy.comb(n, k)
...     stend = time.clock()
...     return (stend-start)/10.0
...
>>> print timeco(32000, 16000)
0.059
>>> print timeco(32000, 31000)
0.001
>>> print timeco(15000, 7500)
0.013
>>> print timeco(15000, 14000)
0.001

Of course my machine's likely different from yours, but it's
oldish & nothing special, just a 1.2 GHz Athlon, 256M DDRAM.


Alex

Peter Dobcsanyi

Oct 10, 2002, 11:56:53 AM
Alex Martelli <al...@aleax.it> wrote:
> Incidentally, if you do a lot of combinatorics and need reasonable
> performance, you may want to look at gmpy.sf.net:

Actually, I do. Thank you for the pointer.

Peter

Alex Martelli

Oct 10, 2002, 5:42:51 PM
Peter Dobcsanyi wrote:

Peter was also kind enough in followup mail to point out that
the gmpy he downloaded doesn't handle GMP 4.* (which changed
a field name in a struct) -- so it seems the code I have in the
sourceforge CVS gmpy.c, i.e., only relevant snippet:

if(resob) {
#if __GNU_MP__ >= 4
    mpz_set(resob->z, randstate->_mp_seed);
#else
    mpz_set(resob->z, randstate->seed);
#endif
    result = (PyObject*)resob;

didn't make it to the last release -- I apologize and I'll try to make
another release package next week -- meanwhile I suggest that
people who want to try GMPY and have problems with GMP 4 could
try getting the CVS gmpy (it's stable, solid, tested) and build that.
Sorry for the inconvenience, everybody!-(


Alex

Thorsten Kampe

Oct 11, 2002, 10:08:05 AM
* Thorsten Kampe

> * Thorsten Kampe
>> * Alex Martelli
>>> Thorsten Kampe wrote:
[massive snip]
> I rewrote this, too:
> [...]

> With the same test as with union, it's three and two times as fast.
>
> Thanks!

While /replying/ to myself, I really didn't want to thank /myself/ but
*Alex Martelli* of course. Just wanted to point this out.

Thorsten
