heaps in clojure

154 views
Skip to first unread message

Sunil S Nandihalli

unread,
Sep 13, 2011, 7:44:12 AM9/13/11
to clo...@googlegroups.com
Hi Everybody,
 I have a very large, but with finite size, collection. I would like to get like first 10 elements in the sorted list . I would use a heap if I were in c++ .. is there a inbuilt implementation of this in clojure? .. Is there some other way to achieve this? some sort of lazy sort would be perfect. I know I need the full collection to start with .. but that is fine.
Thanks,
Sunil.

Jonathan Fischer Friberg

unread,
Sep 13, 2011, 7:50:20 AM9/13/11
to clo...@googlegroups.com
There is the inbuilt sort function, also sort-by is useful.

In "The joy of clojure", there were an example of a lazy sort.
It can be found here:
http://www.manning.com/fogus/
In the file "q.clj" in the source code.

Jonathan

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Chouser

unread,
Sep 13, 2011, 9:45:26 AM9/13/11
to clo...@googlegroups.com

A seq on a sorted set should be pretty efficient.

(take 3 (sorted-set 8 2 1 4 6 9 7 3))
;=> (1 2 3)

--Chouser

Jason Wolfe

unread,
Sep 13, 2011, 3:09:41 PM9/13/11
to Clojure
There is java.util.PriorityQueue, which is heap-based:

http://download.oracle.com/javase/1,5.0/docs/api/java/util/PriorityQueue.html

-Jason


On Sep 13, 4:44 am, Sunil S Nandihalli <sunil.nandiha...@gmail.com>
wrote:

Sunil S Nandihalli

unread,
Sep 15, 2011, 1:59:13 AM9/15/11
to clo...@googlegroups.com
Hi chouser,
 Thanks for your response. but correct me if I am wrong.. wouldn't sorted-set completely sort the complete collection without actually taking into account that only the first few elements are needed?
Thanks,
Sunil.

Sunil S Nandihalli

unread,
Sep 15, 2011, 1:59:42 AM9/15/11
to clo...@googlegroups.com
Thanks Jason,
 I think some sort of priority queue may be the solution.
Thanks,
Sunil.

Mark Engelberg

unread,
Sep 15, 2011, 3:22:54 AM9/15/11
to clo...@googlegroups.com
You can do one linear pass over your data, accumulating a sorted set of the best 10 items you've found so far.  You seed the sorted set with the first 10 items from your list, then continue traversing your list.  With each new item you encounter, you ask if it is any better than the worst of the best-10-sorted-set.  If it is, then you add it to the sorted-set and boot out the worst.  Since the sorted set never exceeds 10 items, there is a small, constant bound for how long the operations on the sorted set will take.  Thus this is a O(n) algorithm.

David Powell

unread,
Sep 15, 2011, 3:34:49 AM9/15/11
to clo...@googlegroups.com

Does that work?

There is no guarantee that the top 10 of the overall list matches the top 10 of earlier prefixes, so the candidates that get discarded might be part of the overall top 10, and the elements that pushed them out could just be local maxima.

--
Dave

Mark Engelberg

unread,
Sep 15, 2011, 3:54:18 AM9/15/11
to clo...@googlegroups.com
If you maintain the invariant that at each point, your sorted set contains the top 10 you've seen so far, then from that invariant you can conclude that at the end of the traversal, your sorted set contains the top 10 for the overall list.

David Powell

unread,
Sep 15, 2011, 4:04:53 AM9/15/11
to clo...@googlegroups.com

But when an element is dropped from the list, you're effectively resetting its seen-count to zero.  It might be seen again, and it might (if you hadn't reset the seen-count), have ended up in the top 10.

Or have I misunderstood?

--
Dave

Mark Engelberg

unread,
Sep 15, 2011, 4:07:44 AM9/15/11
to clo...@googlegroups.com
The initial problem statement is to figure out what would be the first 10 items if you sorted the list.

I don't see anything about frequency or how many times you've seen a given item in the statement of the problem.

When I talk about a "best" item, I mean it is the first with regard to whatever comparison method you're using.

David Powell

unread,
Sep 15, 2011, 4:41:22 AM9/15/11
to clo...@googlegroups.com

Ah yeah.  Sorry, I'd superimposed something I was thinking about on to the original problem.

--
Dave

Jim Oly

unread,
Sep 16, 2011, 9:22:47 AM9/16/11
to Clojure
Using PriorityQueue should give a good, fast result. Here's my
implementation:

(defn smallest-n [n xs]
(let [q (java.util.PriorityQueue. xs)]
(for [i (range n)] (.poll q))))

(smallest-n 5 (shuffle (range 100)))
;; (0 1 2 3 4)

Jim

Mark Engelberg

unread,
Sep 16, 2011, 5:01:29 PM9/16/11
to clo...@googlegroups.com
That's really no different from just sorting the list and taking the first 5.  O(n log(n)).
And for really large data sets, this is going to consume a lot of memory.

The method I outlined would be O(n) and doesn't force the sequence to all be resident in memory at the same time.

Joop Kiefte

unread,
Sep 16, 2011, 8:06:45 PM9/16/11
to clo...@googlegroups.com
Pepijn de Vos had a blogpost about lazy sorts. One of them seems to be a particularly good fit for this problem: http://pepijndevos.nl/lazy-sorting/index.html (the heap-sort example seems the fastest of those for this use-case...).

Mark Engelberg

unread,
Sep 16, 2011, 8:16:10 PM9/16/11
to clo...@googlegroups.com
That heap-sort example is not really lazy.  The java.util.concurrent.PriorityBlockingQueue essentially sorts the whole thing, and then the heap-sort is just making a lazy seq over that.

Also, note that the "speed test" at that link is only working with lists of 1000, and the original poster described himself as dealing with large datasets.  I doubt these benchmarks will be applicable.
Reply all
Reply to author
Forward
0 new messages