well, presumably the sort needs to look at all the data in order to be
able to sort it, no?! so it is kinda just a tough cookie reality that
the memory will be used.
on the other hand, maybe you can do some form of the Schwartzian Transform.
Sort can't really be lazy without being unbelievably slow.
On Jun 22, 6:46 pm, "beatle...@gmail.com" <beatle...@gmail.com> wrote:
> (take 10 (sort (for [x (range 1000000)] (rand)))))
>
> Now the problem is the memory usage, as it does not merely uses memory
> space for 10 items, but it keeps a reference to the entire sequence.
> If I leave out the sort its all ok, and done lazily.
Selecting the top N out of n items is a O(n*sqrt(N)) operation so it's linear when n dominates N and thus must beat take + sort. Plus it won't retain the whole realized seq in memory.
I don't know if it has an official name but basically it's a modified tree-sort: for each item you insert a value in a sorted coll of size N and remove one item from this sorted coll, both ops are O(sqrt(N)) thus O(n*sqrt(N)) for processing an input whose length is n.
I'm away from a REPL but I had something like that in mind:
(defn top-n [coll n]
(let [add #(assoc %1 %2 (inc (%1 %2 0))
prune-min #(let [[[min n]] (rseq %)] (if (== 1 n) (dissoc % min) (assoc % (dec n))))
seed (reduce add sorted-map (take n coll))]
(reduce (comp prune-min add) seed (drop n coll))))
(top 3 [5 4 3 5 8 2 7]) ; should return {5 1, 7 1, 8 1}
; but
(top 3 [5 4 3 5 8 2]) ; should return {5 2, 8 1}
On Jun 23, 2009, at 2:25 AM, Jules wrote:
> Let N be the total number of elements in your collection (e.g.
> 1,000,000), and n the number of elements that you want to take out of
> this collection (e.g 10).
>
> By sorting the collection of N elements you spend N log N time. By
> repeatedly adding an to the small collection and removing the minimum
> you spend N log n time (if you have a good sorted collection that
> takes log n time to insert).
>
> The algorithm spelled out in imperative pseudocode is:
>
> def top_n(n,coll)
> let result = new SortedCollection(take n from coll)
> for x in (drop n from coll):
> result.add(x)
> result.deleteMinimum() // adding 1 element and removing 1 keeps
> the result collection at size n
> return result
>
> Hope this helps,
Yes, this explains it. :) Thanks!
—
Daniel Lyons
I have to admit I don't really understand your code, so my apologies if I've missed something obvious.I think if you consider each element of N and do an operation that costs sqrt(N) with it, you'd arrive at O(N*sqrt(N)), which I think would be worse than O(N log N)
which is what you'd get from just sorting the whole structure and taking the top n items. Is it really sqrt(N)? I'm under the impression insertion into a balanced binary tree is O(log(N)), but wouldn't you still need to have N of them? This algorithm reminds me of heap sort.
(defn top [n comptr coll]
(let [m (reduce #(assoc %1 %2 true) (sorted-map-by comptr)
(take n coll))]
(keys
(reduce #(let [t (assoc %1 %2 true)]
(dissoc t (first (last t)))) m (drop n coll)))))
Why a map with useless values? There's no sorted-set-by in
clojure.core. :(
So top is using about twice as much time, and one five-hundred-
thousandth as much memory. Lesson: use top when you don't want a big
lazy sequence residing in memory all at once and use sort when you
want speed. :)
This is an interesting characteristic I've noticed in sorting code. In
the past (I'm thinking of one Common Lisp app in particular) I've
spent time carefully crafting tree-based accumulators to collect
values into a sorted collection... only to find that it was slower
than just collecting the whole damn list and sorting it, even for very
large collections. Seems all those tree/map operations take time, and
library sort functions are fast!
I still don't like unconditional add then prune-min. It is more than
2x as fast to check and add only if needed:
(defn top-n
([n coll] (top-n n nil coll))
([n comparator coll]
(let [empty-map (if comparator (sorted-map-by comparator) (sorted-map))
add #(assoc %1 %2 (inc (%1 %2 0)))
seed (reduce add empty-map (take n coll))
add-top #(let [[[min n]] (rseq %)]
(if (> min %2)
(add (if (== 1 n) (dissoc % min) (assoc % min (dec n))) %2)
%))]
(reduce add-top seed (drop n coll)))))
Rich