Tough sorting problem: or, I'm confusing myself

david jensen

unread,

Apr 9, 2010, 11:03:01 AM4/9/10

to

Hi all,

I'm trying to find a good way of doing the following:

Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
value n-tuple (call them "scores" for clarity later). I'm currently
storing them in a dictionary, by doing:

----
res={}
for i in itertools.combinations( range( 2**m ) , n):
res[ i ] = getValues( i ) # getValues() is computationally
expensive
----

For each (n-1)-tuple, I need to find the two numbers that have the
highest scores versus them. I know this isn't crystal clear, but
hopefully an example will help: with m=n=3:

Looking at only the (1, 3) case, assuming:
getValues( (1, 2, 3) ) == ( -200, 125, 75 ) # this contains the
highest "other" score, where 2 scores 125
getValues( (1, 3, 4) ) == ( 50, -50, 0 )
getValues( (1, 3, 5) ) == ( 25, 300, -325 )
getValues( (1, 3, 6) ) == ( -100, 0, 100 ) # this contains the
second-highest, where 6 scores 100
getValues( (1, 3, 7) ) == ( 80, -90, 10 )
getValues( (1, 3, 8) ) == ( 10, -5, -5 )

I'd like to return ( (2, 125), (6, 100) ).

The most obvious (to me) way to do this would be not to generate the
res dictionary at the beginning, but just to go through each
combinations( range( 2**m), n-1) and try every possibility... this
will test each combination n times, however, and generating those
values is expensive. [e.g. (1,2,3)'s scores will be generated when
finding the best possibilities for (1,2), (1,3) and (2,3)]

What I'm doing now is ugly, and i think is where i'm confusing myself:

----
best2={}
for i in itertools.combinations( range( 2**m), n-1):
scorelist=[]
for j in range( 2**m ):
if j not in i:
k=list(i)
k.append(j)
k=tuple(sorted(k)) #gets the key for looking up the
scores in res
scorelist.append((j,res[k][k.index(j)]))
best2[i]=sorted(scorelist,key=lambda x: -x[1])[:2]
----

Am I missing an obviously better way?

Many thanks!

David

Raymond Hettinger

unread,

Apr 11, 2010, 3:39:25 PM4/11/10

to

The overall algorithm looks about right.
The inner-loop could be tighted-up a bit.
And you could replace the outer sort with a heap.

best2 = {}

for i in itertools.combinations(range( 2**m), n-1):

scorelist = []

for j in range( 2**m ):
if j not in i:

k = tuple(sorted(i + (j,)))
scorelist.append((j, res[k][k.index(j)]))
best2[i] = heapq.nlargest(2, scorelist,
key=operator.itemgetter(1))

Raymond

Paul McGuire

unread,

Apr 11, 2010, 7:22:16 PM4/11/10

to

Add a memoizing decorator to getValues, so that repeated calls will do
lookups instead of repeated calculations.

-- Paul

david jensen

unread,

Apr 13, 2010, 7:25:33 AM4/13/10

to

On Apr 11, 9:39 pm, Raymond Hettinger <pyt...@rcn.com> wrote:

> The overall algorithm looks about right.
> The inner-loop could be tighted-up a bit.
> And you could replace the outer sort with a heap.
>
> best2 = {}
> for i in itertools.combinations(range( 2**m), n-1):
> scorelist = []
> for j in range( 2**m ):
> if j not in i:
> k = tuple(sorted(i + (j,)))
> scorelist.append((j, res[k][k.index(j)]))
> best2[i] = heapq.nlargest(2, scorelist,
> key=operator.itemgetter(1))
>
> Raymond

Thanks for the ideas... I should have seen the k = tuple(sorted(i +
(j,))). I'm not sure a heap will help much, and at least to me,
doesn't improve readability.

Thanks for taking a look, I appreciate it!

David

david jensen

unread,

Apr 13, 2010, 7:31:31 AM4/13/10

to

Thanks for the idea... I'd thought of that, but didn't use it because
I don't think it improves complexity or readability: just makes it a
little more intuitive.

But thanks for your time!

David

Raymond Hettinger

unread,

Apr 14, 2010, 12:06:13 PM4/14/10

to

> I'm not sure a heap will help much, and at least to me,
> doesn't improve readability.

nlargest() should save quite a few comparisons and run much faster
than sorted().

Not sure what the readability issue is. The phrase "nlargest(2,
iterable)" does exactly what it says, finds the 2 largest elements
from an iterable. That makes the programmer's intent more clear than
the slower, but semanticly equivalent form: sorted(iterable)[:2].

The running time for your algorithm can be very long, depending on the
inputs. Do you need an exact answer or can you approximate it with
sampling random combinations (i.e. choose the best two scores out of
10000 samples).

Raymond

Paul Rubin

unread,

Apr 14, 2010, 12:03:02 PM4/14/10

to

Raymond Hettinger <pyt...@rcn.com> writes:
> Not sure what the readability issue is. The phrase "nlargest(2,
> iterable)" does exactly what it says, finds the 2 largest elements
> from an iterable. That makes the programmer's intent more clear than
> the slower, but semanticly equivalent form: sorted(iterable)[:2].

I think you meant

sorted(iterable, reverse=True)[:2]

Raymond Hettinger

unread,

Apr 14, 2010, 12:35:18 PM4/14/10

to

> > Not sure what the readability issue is. The phrase "nlargest(2,
> > iterable)" does exactly what it says, finds the 2 largest elements
> > from an iterable. That makes the programmer's intent more clear than
> > the slower, but semanticly equivalent form: sorted(iterable)[:2].
>
> I think you meant
>
> sorted(iterable, reverse=True)[:2]

:-)

Raymond

david jensen

unread,

Apr 15, 2010, 7:11:00 AM4/15/10

to

You are of course correct. I was running on lack of sleep. Thanks
again for having taken the time to help!
I need an exact answer, but don't anticipate 2**m or n to be huge so
even with e.g. m=10 and n=7 it should still be feasible. I can see how
sampling would become necessary if m is large though. Much larger, and
i'd have to think about optimizing the getValues function first
though.

thank you,
David

Lie Ryan

unread,

Apr 16, 2010, 8:17:00 AM4/16/10

to

or sorted(iterable)[-2:]

Peter Otten

unread,

Apr 16, 2010, 9:14:46 AM4/16/10

to pytho...@python.org

david jensen wrote:

After some tinkering I came up with

def calculate(getvalues):
best2 = defaultdict(list)
for t in combinations(range(2**m), n):
values = getvalues(t)
for i, k in enumerate(t):
best2[t[:i] + t[i+1:]].append((k, values[i]))
return dict((k, nlargest(2, v, key=itemgetter(1))) for k, v in
best2.iteritems())

which is concise but slower than your code with Raymond's improvements. I
then tried inlining the nlargest() operation:

def calculate_inlined(getvalues):
best2 = {}
for t in combinations(range(2**m), n):
values = getvalues(t)
for i, k in enumerate(t):
key = t[:i] + t[i+1:]
value = values[i]
if key not in best2:
best2[key] = [(k, value), (None, None)]
else:
a, b = best2[key]
if value > a[1]:
best2[key] = [(k, value), a]
elif value > b[1]:
best2[key] = [a, (k, value)]
return best2

This gives a speed boost (I measured with m=n=5) but is both messy and
brittle. I've got a hunch that there is a better way; I just don't see it at
the moment...

Peter