shuffle vector in Clojure


Shawn Hoover

Aug 26, 2008, 11:54:32 PM
to clo...@googlegroups.com
I needed to shuffle a vector so I decided to try doing it functionally. Below is the implementation and a couple exercise functions, inspired by http://en.wikipedia.org/wiki/Fisher-Yates_shuffle and http://szeryf.wordpress.com/2007/06/19/a-simple-shuffle-that-proved-not-so-simple-after-all/.

A version of sort-by that computes each key only once does most of the work; shuffle is trivial using that. I haven't tried it on data other than integers yet. The speed is decent under half a million integers, but the space cost is high, requiring a larger Java heap to shuffle a million. As soon as I got it working I discovered java.util.Collections/shuffle, which blows my implementation away on speed and space, comfortably shuffling 10 million.

I'm open to feedback on the algorithm or improving my Clojure. For one thing, it seems like there must be a cleaner way to handle my use of reduce in test-shuffle-uniformity. I like reduce for looping functionally, but in this case I just need to run n times regardless of the specific values in (range n). Is that use case common enough for some kind of reduce-times macro?
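[Ed.: such a reduce-times is a short helper; this is a hypothetical sketch, not code from the thread, and the name reduce-times is the one speculated about above.]

```clojure
;; Hypothetical reduce-times: fold f over an accumulator n times,
;; ignoring the loop index entirely. A sketch, not code from the thread.
(defn reduce-times [f init n]
  (reduce (fn [acc _] (f acc)) init (range n)))

;; e.g. run inc five times without caring about (range 5)'s values:
;; (reduce-times inc 0 5) => 5
```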

(defn sort-by-mem-keys
  "Like clojure/sort-by but only calls keyfn once on each item in coll."
  ([keyfn coll]
     (sort-by-mem-keys keyfn compare coll))
  ([keyfn #^java.util.Comparator comp coll]
     (let [key-vals (map (fn [x] [(keyfn x) x]) coll)
           sorted-pairs (sort (fn [x y] (.compare comp (first x) (first y)))
                              key-vals)]
       (map second sorted-pairs))))

(defn shuffle
  "Returns a seq of coll shuffled in O(n log n) time. Space is at least O(2N). Each item in coll is
  assigned a random number which becomes the sort key."
  [coll]
  (sort-by-mem-keys (fn [_] (rand)) coll))

(import '(java.util ArrayList Collections))

(defn shuffle-java
  "Shuffles coll using a Java ArrayList. Blows shuffle out of the water on
  speed and space."
  [coll]
  (let [l (ArrayList. coll)]
    (Collections/shuffle l)
    (seq l)))
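[Ed.: as an aside not in the original post, Collections/shuffle also has an overload that takes a java.util.Random, which makes the shuffle reproducible for testing. A hypothetical seeded variant of shuffle-java:]

```clojure
(import '(java.util ArrayList Collections Random))

;; Hypothetical seeded variant of shuffle-java: passing an explicit
;; java.util.Random makes the permutation reproducible.
(defn shuffle-seeded [coll seed]
  (let [l (ArrayList. coll)]
    (Collections/shuffle l (Random. seed))
    (seq l)))

;; Same seed, same permutation:
;; (= (shuffle-seeded (range 10) 42) (shuffle-seeded (range 10) 42)) => true
```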

(defn test-shuffle-uniformity
  "Shuffles a 3-item collection n times, mapping each result to the number of
  times it's hit. If the distribution isn't uniform we've let bias into the
  random numbers or the algorithm. Typical results on my machine:
  (2 1 3) : 16611
  (3 2 1) : 16771
  (1 3 2) : 16707
  (1 2 3) : 16766
  (3 1 2) : 16555
  (2 3 1) : 16590"
  ([n] (test-shuffle-uniformity n shuffle))
  ([n shuffle]
     (let [coll [1 2 3]
           results (reduce (fn [results _]
                             (let [result (shuffle coll)]
                               (assoc results result
                                      (inc (or (get results result)
                                               0)))))
                           {} (range n))]
       (doseq r results
         (println (key r) ":" (val r))))))

(defn test-shuffle-time
  "Runs shuffle on collections of varying sizes, printing out the times. I had
  to run java with -Xmx256m to get to a million without running out of
  memory."
  []
  (doseq n [1000 10000 100000 1000000]
    (print n ": ")
    (time (dorun (shuffle (range n))))))

Chouser

Aug 27, 2008, 12:23:36 AM
to clo...@googlegroups.com
On Tue, Aug 26, 2008 at 11:54 PM, Shawn Hoover <shawn....@gmail.com> wrote:
> For one thing, it seems like there must be a cleaner way to handle
> my use of reduce in test-shuffle-uniformity. I like reduce for
> looping functionally, but in this case I just need to run n times
> regardless of the specific values in (range n). Is that use case
> common enough for some kind of reduce-times macro?

I can't say I've needed such a thing very often. Your own
reduce-times macro might clarify your code a bit, though.

> (defn test-shuffle-uniformity
> "Shuffles a 3-item collection n times, mapping each result to the number of
> times it's hit. If the distribution isn't uniform we've let bias into the
> random numbers or the algorithm. Typical results on my machine:
> (2 1 3) : 16611
> (3 2 1) : 16771
> (1 3 2) : 16707
> (1 2 3) : 16766
> (3 1 2) : 16555
> (2 3 1) : 16590"
> ([n] (test-shuffle-uniformity n shuffle))
> ([n shuffle]
> (let [coll [1 2 3]
> results (reduce (fn [results _]
> (let [result (shuffle coll)]
> (assoc results result
> (inc (or (get results result)
> 0)))))
> {} (range n))]
> (doseq r results
> (println (key r) ":" (val r))))))

You could simplify your reduce fn a bit by using merge-with:

(defn test-shuffle-uniformity
  "Shuffles a 3-item collection n times, mapping each result to the number of
  times it's hit. If the distribution isn't uniform we've let bias into the
  random numbers or the algorithm. Typical results on my machine:
  (2 1 3) : 16611
  (3 2 1) : 16771
  (1 3 2) : 16707
  (1 2 3) : 16766
  (3 1 2) : 16555
  (2 3 1) : 16590"
  ([n] (test-shuffle-uniformity n shuffle))
  ([n shuffle]
     (let [coll [1 2 3]
           results (reduce (fn [results _]
                             (merge-with + results {(shuffle coll) 1}))
                           {} (range n))]
       (doseq r results
         (println (key r) ":" (val r))))))

--Chouser

Chouser

Aug 27, 2008, 3:11:24 AM
to clo...@googlegroups.com
> You could simplify your reduce fn a bit by using merge-with:
Or even:

(defn test-shuffle-uniformity
  "Shuffles a 3-item collection n times, mapping each result to the number of
  times it's hit. If the distribution isn't uniform we've let bias into the
  random numbers or the algorithm. Typical results on my machine:
  (2 1 3) : 16611
  (3 2 1) : 16771
  (1 3 2) : 16707
  (1 2 3) : 16766
  (3 1 2) : 16555
  (2 3 1) : 16590"
  ([n] (test-shuffle-uniformity n shuffle))
  ([n shuffle]
     (let [coll [1 2 3]
           results (apply merge-with + {}
                          (take n (repeatedly (fn [] {(shuffle coll) 1}))))]
       (doseq [k v] results
         (println k ":" v)))))

It'd be nice to use the #(...) syntax instead of (fn [] ...), but that
won't work here since #({key val}) would try to call the map with no
arguments. You'd have to say #(hash-map key val) and (fn [] {key
val}) is arguably clearer.

Perhaps "repeatedly" could be made to take an optional argument n, but
again I don't know how often that would be useful.

Also note the destructuring in doseq to save a few keystrokes.


I was able to get almost a 20% speed boost by using a Java array for
each item instead of a vector:

(defn sort-by-mem-keys
  "Like clojure/sort-by but only calls keyfn once on each item in coll."
  ([keyfn coll]
     (sort-by-mem-keys keyfn compare coll))
  ([keyfn #^java.util.Comparator comp coll]
     (let [key-vals (map #(to-array [(keyfn %) %]) coll)
           sorted-pairs (sort (fn [x y] (.compare comp (aget x 0) (aget y 0)))
                              key-vals)]
       (map #(aget % 1) sorted-pairs))))

Or you can rely on the fact that vectors are comparable themselves.
The first member is the most significant, so I'd think this would
produce acceptable results. It's more succinct and appears to double
the speed:

(defn sort-by-mem-keys
  "Like clojure/sort-by but only calls keyfn once on each item in coll."
  ([keyfn coll]
     (let [key-vals (map (fn [x] [(keyfn x) x]) coll)]
       (map second (sort key-vals)))))

It's still no match for shuffle-java, though.

--Chouser

Rich Hickey

Aug 27, 2008, 9:59:09 AM
to Clojure


On Aug 26, 11:54 pm, "Shawn Hoover" <shawn.hoo...@gmail.com> wrote:
> I needed to shuffle a vector so I decided to try doing it functionally.
> Below is the implementation and a couple exercise functions, inspired by
> http://en.wikipedia.org/wiki/Fisher-Yates_shuffle and
> http://szeryf.wordpress.com/2007/06/19/a-simple-shuffle-that-proved-n....
>
> A version of sort-by that computes each key only once does most of the work;
> shuffle is trivial using that. I haven't tried it on data other than
> integers yet. The speed is decent under half a million integers, but the
> space cost is high, requiring a larger Java heap to sort a million. As soon
> as I got it working I discovered java.util.Collections/shuffle, which blows
> my implementation away on speed and space, comfortably sorting 10 million.
>
> I'm open to feedback on the algorithm or improving my Clojure. For one
> thing, it seems like there must be a cleaner way to handle my use of reduce
> in test-shuffle-uniformity. I like reduce for looping functionally, but in
> this case I just need to run n times regardless of the specific values in
> (range n). Is that use case common enough for some kind of reduce-times
> macro?
>

While I appreciate the intellectual exercise (and would point you to
http://okmij.org/ftp/Haskell/perfect-shuffle.txt if you haven't seen
it), one of the nice things about Clojure is that you aren't trapped
without mutability when you need it, and you need it here since there
is a known O(n) in-place implementation and you can't do better than
O(n log n) otherwise. Your implementation leveraging Java's
Collections/shuffle _is_ functional, it neither alters its arg nor has
any other side effects, and is the right way to implement this, since
Collections/shuffle has seen a lot of use/testing. See Clojure's sort,
for instance. It's ok to leverage Java's speed inside a functional
interface. For example, Clojure's persistent maps and vectors couldn't
be nearly as fast as they are if they were implemented purely
functionally.
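[Ed.: the known O(n) in-place algorithm Rich refers to is Fisher-Yates. A hypothetical sketch, not code from the thread, confining the mutation to a freshly allocated Java array so the function stays functional from the caller's point of view:]

```clojure
;; Sketch of the O(n) in-place Fisher-Yates shuffle: walk the array
;; from the end, swapping each slot with a uniformly chosen earlier
;; (or same) slot. Mutation touches only the private array.
(defn shuffle-fy [coll]
  (let [a (to-array coll)]
    (loop [i (dec (alength a))]
      (when (pos? i)
        (let [j   (rand-int (inc i))   ; pick j uniformly in [0, i]
              tmp (aget a i)]
          (aset a i (aget a j))        ; swap a[i] and a[j]
          (aset a j tmp)
          (recur (dec i)))))
    (seq a)))
```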

Rich

Shawn Hoover

Aug 27, 2008, 2:20:33 PM
to clo...@googlegroups.com
On Wed, Aug 27, 2008 at 12:11 AM, Chouser <cho...@gmail.com> wrote:

> You could simplify your reduce fn a bit by using merge-with:
> Or even:
>
> (defn test-shuffle-uniformity
>   "Shuffles a 3-item collection n times, mapping each result to the number of
>   times it's hit. If the distribution isn't uniform we've let bias into the
>   random numbers or the algorithm. Typical results on my machine:
>   (2 1 3) : 16611
>   (3 2 1) : 16771
>   (1 3 2) : 16707
>   (1 2 3) : 16766
>   (3 1 2) : 16555
>   (2 3 1) : 16590"
>   ([n] (test-shuffle-uniformity n shuffle))
>   ([n shuffle]
>      (let [coll [1 2 3]
>            results (apply merge-with + {}
>                           (take n (repeatedly (fn [] {(shuffle coll) 1}))))]
>        (doseq [k v] results
>          (println k ":" v)))))

Interesting suggestion. I think reduce expresses my intent a little better, but I need to keep merge-with in mind for other uses.
 
> It'd be nice to use the #(...) syntax instead of (fn [] ...), but that
> won't work here since #({key val}) would try to call the map with no
> arguments. You'd have to say #(hash-map key val) and (fn [] {key
> val}) is arguably clearer.

Heh, having #( ) available gives me a concision addiction. I'm trying to heed the advice of the docs not to replace fn everywhere, but it's always tempting to use. I definitely tried to write #(rand) in shuffle, but ran into the same no-arguments problem.

> Also note the destructuring in doseq to save a few keystrokes.

Ah, I didn't realize at the time that doseq supported destructuring, but that rocks.
 
> Or you can rely on the fact that vectors are comparable themselves.
> The first member is the most significant, so I'd think this would
> produce acceptable results. It's more succinct and appears to double
> the speed:
>
> (defn sort-by-mem-keys
>   "Like clojure/sort-by but only calls keyfn once on each item in coll."
>   ([keyfn coll]
>      (let [key-vals (map (fn [x] [(keyfn x) x]) coll)]
>        (map second (sort key-vals)))))

That's very clean. When duplicate keys arise, that implementation probably behaves much like the original, assuming the values in coll are Comparable. Either way, I haven't tested mine for uniformity with large inputs, and given the superiority of the Java solution, I'm done.

Thanks for the pointers!

Shawn

Shawn Hoover

Aug 27, 2008, 2:44:05 PM
to clo...@googlegroups.com

On Wed, Aug 27, 2008 at 6:59 AM, Rich Hickey <richh...@gmail.com> wrote:
> While I appreciate the intellectual exercise (and would point you to
> http://okmij.org/ftp/Haskell/perfect-shuffle.txt if you haven't seen
> it), one of the nice things about Clojure is that you aren't trapped
> without mutability when you need it, and you need it here since there
> is a known O(n) in-place implementation and you can't do better than
> O(n log n) otherwise. Your implementation leveraging Java's
> Collections/shuffle _is_ functional, it neither alters its arg nor has
> any other side effects, and is the right way to implement this, since
> Collections/shuffle has seen a lot of use/testing. See Clojure's sort,
> for instance. It's ok to leverage Java's speed inside a functional
> interface. For example, Clojure's persistent maps and vectors couldn't
> be nearly as fast as they are if they were implemented purely
> functionally.
>
> Rich

It was a good exercise, but it's getting old given the superior easy solution :)

Thanks, I hadn't seen the perfect-shuffle you linked. I still think the sort by random keys is pretty good for how simple it is, but the little nugget about random number bias when N! > M helps show how it can fall short of perfection for large sequences.

Now that you say it, it's obvious that shuffle-java is functional. I only discovered late in the game how sort uses toArray. To do it over without Collections/shuffle, I would throw out sort-by-mem-keys, get a Java array, and shuffle that in place.

Frankly, any opportunity to call Java is satisfying just because the integration is so seamless! (As long as it doesn't harm concurrency in the program, of course.)

Shawn