Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

heapq.merge with key=

313 views
Skip to first unread message

Kevin D. Smith

unread,
May 7, 2009, 5:23:31 PM5/7/09
to
I need the behavior of heapq.merge to merge a bunch of results from a
database. I was doing this with sorted(itertools.chain(...), key=...),
but I would prefer to do this with generators. My issue is that I need
the key= argument to sort on the correct field in the database.
heapq.merge doesn't have this argument and I don't understand the code
enough to know if it's possible to add it. Is this enhancement
possible without drastically changing the current code?

--
Kevin D. Smith

Chris Rebert

unread,
May 8, 2009, 12:48:43 AM5/8/09
to Kevin D. Smith, pytho...@python.org

I think so. Completely untested code:

def key(ob):
#code here

class Keyed(object):
def __init__(self, obj):
self.obj = obj
def __cmp__(self, other):
return cmp(key(self.obj), key(other.obj))

def keyify(gen):
for item in gen:
yield Keyed(item)

def stripify(gen):
for keyed in gen:
yield keyed.obj

merged = stripify(merge(keyify(A), keyify(B), keyify(C))) #A,B,C being
the iterables


Cheers,
Chris
--
http://blog.rebertia.com

bearoph...@lycos.com

unread,
May 8, 2009, 5:44:20 AM5/8/09
to
Looking for this, Kevin D. Smith?
http://code.activestate.com/recipes/502295/

Bye,
bearophile

Kevin D. Smith

unread,
May 8, 2009, 10:49:23 AM5/8/09
to
On 2009-05-07 23:48:43 -0500, Chris Rebert <cl...@rebertia.com> said:

> On Thu, May 7, 2009 at 2:23 PM, Kevin D. Smith <Kevin...@sas.com> wrote:
>> I need the behavior of heapq.merge to merge a bunch of results from a

>> database. 锟絀 was doing this with sorted(itertools.chain(...), key=
> ...), but
>> I would prefer to do this with generators. 锟組y issue is that I need
> the key
>> argument to sort on the correct field in the database. 锟絟eapq.merge


> doesn't
>> have this argument and I don't understand the code enough to know if it's

>> possible to add it. 锟絀s this enhancement possible without drasticall


> y
>> changing the current code?
>
> I think so. Completely untested code:
>
> def key(ob):
> #code here
>
> class Keyed(object):
> def __init__(self, obj):
> self.obj = obj
> def __cmp__(self, other):
> return cmp(key(self.obj), key(other.obj))
>
> def keyify(gen):
> for item in gen:
> yield Keyed(item)
>
> def stripify(gen):
> for keyed in gen:
> yield keyed.obj
>
> merged = stripify(merge(keyify(A), keyify(B), keyify(C))) #A,B,C being
> the iterables

Ah, that's not a bad idea. I think it could be simplified by letting
Python do the comparison work as follows (also untested).

def keyify(gen, key=lamda x:x):
for item in gen:
yield (key(item), item)

def stripify(gen):
for item in gen:
yield item[1]

After looking at the heapq.merge code, it seems like something like
this could easily be added to that code. If the next() calls were
wrapped with the tuple creating code above and the yield simply
returned the item. It would, of course, have to assume that the
iterables were sorted using the same key, but that's better than not
having the key option at all.

--
Kevin D. Smith

Peter Otten

unread,
May 9, 2009, 4:09:24 AM5/9/09
to
Kevin D. Smith wrote:

> On 2009-05-07 23:48:43 -0500, Chris Rebert <cl...@rebertia.com> said:
>
>> On Thu, May 7, 2009 at 2:23 PM, Kevin D. Smith <Kevin...@sas.com>
>> wrote:
>>> I need the behavior of heapq.merge to merge a bunch of results from a

>>> database. I was doing this with sorted(itertools.chain(...), key=
>> ...), but
>>> I would prefer to do this with generators. My issue is that I need
>> the key
>>> argument to sort on the correct field in the database. heapq.merge


>> doesn't
>>> have this argument and I don't understand the code enough to know if

>>> it's possible to add it. Is this enhancement possible without

This is not as general. It makes sorting unstable and can even fail if the
items are unorderable:

>>> def decorate(gen, key):
... for item in gen:
... yield key(item), item
...
>>> def undecorate(gen):
... for item in gen:
... yield item[-1]
...
>>> sorted(decorate([1j, -1j, 1+1j], key=lambda c: c.real))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: no ordering relation is defined for complex numbers

One way to fix it:

>>> def decorate(gen, key):
... for index, item in enumerate(gen):
... yield key(item), index, item
...
>>> sorted(decorate([1j, -1j, 1+1j], key=lambda c: c.real))
[(0.0, 0, 1j), (0.0, 1, -1j), (1.0, 2, (1+1j))]
>>> list(undecorate(_))
[1j, -1j, (1+1j)]

Peter

Scott David Daniels

unread,
May 9, 2009, 12:51:24 PM5/9/09
to
Peter Otten wrote:
> Kevin D. Smith wrote:
>
>> On 2009-05-07 23:48:43 -0500, Chris Rebert <cl...@rebertia.com> said:
>>
>>> On Thu, May 7, 2009 at 2:23 PM, Kevin D. Smith <Kevin...@sas.com>
>>> wrote:
>>>> I need the behavior of heapq.merge to merge a bunch of results from a
>>>> database.... I would prefer to do this with generators.... I need
>>>> the key argument to sort on the correct field in the database...
>>> I think so.... [some code]

>> Ah, that's not a bad idea. I think it could be simplified by letting
>> Python do the comparison work as follows (also untested).
>>
>> def keyify(gen, key=lamda x:x):
>> for item in gen:
>> yield (key(item), item)
> ... This is not as general. It makes sorting unstable and can even fail
> if the items are unorderable:
> One way to fix it:
>>>> def decorate(gen, key):
> ... for index, item in enumerate(gen):
> ... yield key(item), index, item
>>>> sorted(decorate([1j, -1j, 1+1j], key=lambda c: c.real))
> [(0.0, 0, 1j), (0.0, 1, -1j), (1.0, 2, (1+1j))]
>>>> list(undecorate(_))
> [1j, -1j, (1+1j)]

However, a heap gets faster if equal items are equal, so you
really want the decorated value to be the sole component of a
comparison. (and remember, sorts are now stable).
As in your code, key generation is likely best calculated once.
Here's another fix:

class Keyed(object):
def __init__(self, obj, key):
self.obj = obj
self.key = key

def __lt__(self, other): # cmp is old-style magic
return self.key < other.key

def __eq__(self, other):
return self.key == other.key

def __repr__(self):
return '%s(%r, %r)' % (
self.__class__.__name__, self.obj, self.key)

def kmerge(key, *streams):
keyed_streams = [(Keyed(obj, key(obj)) for obj in stream)
for stream in streams]
for element in heapq.merge(*keyed_streams):
yield element.obj

--Scott David Daniels
Scott....@Acm.Org

0 new messages