# Lazy functional sorting

4 views

### Chris Okasaki

Aug 28, 1995, 3:00:00 AM8/28/95
to
Bob Buckley <cis...@icsea.levels.unisa.edu.au> wrote:
>I've been playing with sorting algorithms using Gofer. Here are some of
>my results. nmsort at the bottom seems to be the best overall!
>...

When I saw this subject line, I was expecting something similar to
Simon B Jones' recent "How useful is laziness?" post, i.e. how useful is
laziness in sorting? Most of the time, laziness plays no role -- the language
might as well be eager. However, there is an interesting case where laziness
*does* play an important role: when only some prefix (of unknown size) of the
resulting sorted list will be needed. A good example of when this happens
is Kruskal's minimal spanning tree algorithm. First you sort the edges, then
you process the edges (in sorted order) only until the resulting graph
is connected. This will usually require processing only some small fraction
of the edges.

So, let me rephrase the sorting problem: given n elements, generate the sorted
list containing the k smallest elements. Furthermore, suppose you don't know
k ahead of time. Now, how would you program a sorting routine to take
advantage of the fact that k may be much less than n? Ideally, we would like
an algorithm that ran in O(n + k log n) time.

It turns out that many sorting routines "just work right" under lazy
evaluation, especially quicksort and heapsort (but see below). Interestingly,
only heapsort seems suitable for the same problem under eager evaluation.

Functional heaps have been described several times in this newsgroup in
recent months. However, most of these posts described top-down
heap construction. Note that to support the desired O(n + k log n) time, it
is better to build the initial heap bottom-up than top-down. Top-down heap
construction (i.e. repeated inserts) requires O(n log n) time whereas
bottom-up heap construction requires only O(n) time.

This might be written in Gofer as

>data Heap a = Empty | Node a (Heap a) (Heap a)
>
>heapify :: Ord a => [a] -> Heap a
>heapify xs = fst (h (length xs) xs)
> where h 0 xs = (Empty, xs)
> h n (x:xs) = (siftDown x a b, zs)
> where m = n `div` 2
> (a, ys) = h m xs
> (b, zs) = h (n-m-1) ys
>
>siftDown :: Ord a => a -> Heap a -> Heap a -> Heap a
>siftDown x Empty Empty = Node x Empty Empty
>siftDown x (Node y a b) Empty
> | x <= y = Node x (Node y a b) Empty
> | y <= x = Node y (siftDown x a b) Empty
>siftDown x (Node y a b) (Node z c d)
> | x <= y && x <= z = Node x (Node y a b) (Node z c d)
> | y <= x && y <= z = Node y (siftDown x a b) (Node z c d)
> | z <= x && z <= y = Node z (Node y a b) (siftDown x c d)

(Note that I have written the above more for clarity than raw speed.
There is lots of room for improvement here...)

Chris
--
-------------------------------------------------------------------------------
| Chris Okasaki | Life is NOT a zero-sum game! |
| coka...@cs.cmu.edu | http://foxnet.cs.cmu.edu/people/cokasaki/home.html |
-------------------------------------------------------------------------------

### David Sands

Aug 29, 1995, 3:00:00 AM8/29/95
to
coka...@cs.cmu.edu (Chris Okasaki) writes:
> ...

>So, let me rephrase the sorting problem: given n elements, generate the sorted
>list containing the k smallest elements. Furthermore, suppose you don't know
>k ahead of time. Now, how would you program a sorting routine to take
>advantage of the fact that k may be much less than n? Ideally, we would like
>an algorithm that ran in O(n + k log n) time.

>It turns out that many sorting routines "just work right" under lazy
>evaluation, especially quicksort and heapsort (but see below).

...

<nifty sorting algorithm deleted>

The usual versions of functional quicksort are quadratic in the worst
case. Nevertheless, one might hope/expect that it takes O(nk) to
compute the first k elements. But in fact it is still quadratic in n
in the worst case, even for k=1 (A rigorous analysis of this
particular example is found in [*]). Did you have an "average"
case-analysis in mind? A better example would be mergesort [**],
which I believe has the desired O(n + k log n). I don't know what it
is like in practice.

Cheers,
Dave.

[*]
@Article{Sands:Naive,
semno = "D-173",
author = "D. Sands",
title = "{A Na\"{\i}ve Time Analysis and its
Theory of Cost Equivalence}",
journal = "The Journal of Logic and Computation",
year = "199X",
publisher = "Oxford University Press",
note = "Accepted, to appear
(Preliminary version available as TOPPS report D-173, 1993,
ftp://ftp.diku.dk/diku/semantics/papers/D-173.ps.Z
)",

[**] This is the kind of mergesort I had in mind (some old bit of Miranda(TM)
that I dug up). There is nothing special about it.

mergesort x
= ms x (#x)
where
ms y n = y, if n <= 1 || invariant: length y = n
= merge sort_floor sort_ceil, otherwise
where
k = n div 2
sort_floor = ms us k
sort_ceil = ms vs (k + (n mod 2))
(us,vs) = split k y || invariant: k <= length y
where split 0 ys = ([],ys)
split (m+1) (y:ys) = (y:u, v)
where (u,v) = split m ys

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
David Sands Email: da...@diku.dk
DIKU, University of Copenhagen Phone: +45 35 321 408
Universitetsparken 1 fax: +45 35 321 401
DK-2100 Copenhagen East, DENMARK W3: http://www.diku.dk/users/dave/

### Chris Okasaki

Aug 29, 1995, 3:00:00 AM8/29/95
to
I <coka...@cs.cmu.edu> wrote:
>So, let me rephrase the sorting problem: given n elements, generate the sorted
>list containing the k smallest elements. Furthermore, suppose you don't know
>k ahead of time. Now, how would you program a sorting routine to take
>advantage of the fact that k may be much less than n? Ideally, we would like
>an algorithm that ran in O(n + k log n) time.
>
>It turns out that many sorting routines "just work right" under lazy
>evaluation, especially quicksort and heapsort (but see below).

David Sands <da...@diku.dk> wrote:
>The usual versions of functional quicksort are quadratic in the worst
>case. Nevertheless, one might hope/expect that it takes O(nk) to
>compute the first k elements. But in fact it is still quadratic in n
>in the worst case, even for k=1 (A rigorous analysis of this
>particular example is found in [*]). Did you have an "average"
>case-analysis in mind? A better example would be mergesort [**],
>which I believe has the desired O(n + k log n). I don't know what it
>is like in practice.

You're right, I should have explicitly mentioned that the usual caveats
about quicksort being quadratic in the worst case apply. (While
it is certainly possible to use a linear-time median finding algorithm
to choose the pivot -- making quicksort O(n log n) in the worst case --
no one ever does.)

Still, it is very easy to visualize the behavior of quicksort when
only some prefix of the resulting list is demanded. What happens is that
within "most" of the calls to quicksort, the second recursive call is
ignored.

Mergesort does indeed also have the desired O(n + k log n) behavior, but it
is much harder to visualize. Consider the following picture that illustrates
what might happen for k=4:

1:2:3:4:_
/ \
1:3:7:_ 2:4:_
/ \ / \
1:3:nil 7:_ 2:4:_ 5:_
| | | | | | | |
1 3 7 6 2 4 8 5

Each interior node represents a call to merge. The underscores indicate the
portions of the merged lists that are never demanded (and hence are never
computed). Each of the first k elements participate in O(log n) of the
interior nodes, accounting for O(k log n) time. In addition, almost every
interior node generates one extra element, accounting for an extra O(n).
This, together with the O(n) overhead in splitting the orginial list and
setting up the calls to merge in the first place yields a total of

O(n + k log n).

Chris

### Bert THOMPSON

Aug 29, 1995, 3:00:00 AM8/29/95
to
cis...@icsea.levels.unisa.edu.au (Bob Buckley) writes:

|--

|I've been playing with sorting algorithms using Gofer. Here are some of
|my results. nmsort at the bottom seems to be the best overall!

[Stuff deleted.]

|-- natural merge-sort (see Knuth vol. 2, like Paulson's SAM-sort)
|-- this versions of runs seems to perform very well in Gofer
|-- and may suit foldr optimisation in GHC-0.26?
|-- in Gofer, this beats all the other sorting algorithms I've tested.

|runs xs = foldr op [] xs
| where op z xss@(xs@(x:_):xss') | z<=x = (z:xs):xss'
| | otherwise = [z]:xss
| op z xss = [z]:xss
|nmsort :: (Ord a) => [a] -> [a]
|nmsort [] = [] -- (foldb f []) is undefined
|nmsort xs = foldb merge (runs xs)

You can do a bit better than this by crossing over to insertion sort
on small input. (Well, my natural merge sort did anyway. Yours may
be quicker.) On small lists there's a lot of overhead in merging.

Bert.
----------------------------------------------------------------------
Bert Thompson a...@cs.mu.oz.au
----------------------------------------------------------------------