Trouble using r/fold to optimise a reduce

127 views
Skip to first unread message

Divyansh Prakash

unread,
Apr 2, 2016, 4:24:47 PM4/2/16
to Clojure
Hi! 
I'm solving the problem described here
I've got the solution, but it takes ~50 s to compute.
I tried optimising it by replacing the main reduce operation with r/fold, but it doesn't seem to have any effect on the performance for whatever n (batch size) I use.
Any suggestions?

Note: I'm using a MacBook Pro with 8 cores. JDK 7. Clojure 1.8.

Francis Avila

unread,
Apr 2, 2016, 5:08:26 PM4/2/16
to Clojure
Your input collection to r/fold is provided by cmds-from-input which returns a lazy-seq (from map) which is not a parallel-foldable type. You can try mapv instead: vectors are parallel-foldable. (Note only PersistentVector and PersistentHashMap have useful coll-fold implementations: all other objects (including sets) fall back on normal reduction: https://github.com/clojure/clojure/blob/d5708425995e8c83157ad49007ec2f8f43d8eac8/src/clj/clojure/core/reducers.clj#L347-L367)

Additionally, I'm not sure how this algorithm could be parallelized because the order in which you apply the toggle operation matters! I suspect if you make the mapv change I suggest you will get different final answers.

Divyansh Prakash

unread,
Apr 2, 2016, 7:07:48 PM4/2/16
to Clojure
Thanks a lot, Francis!
I've made the changes you suggest - replaced map with mapv and modified apply-cmd to take/return a vector instead of a set (it converts to set internally).
My understanding is that reducers preserve the order of operation, so it should give the correct answer.
Instead, the simple version now takes ~120 s to run, whereas the r/fold version fails with a "UnsupportedOperationException: nth not supported on this type: Long" somewhere inside reducers/foldvec/fn. Strange.

Divyansh Prakash

unread,
Apr 2, 2016, 7:35:12 PM4/2/16
to Clojure
Even stranger - the parallel version seems to at least produce an output (not sure how correct) if run for the first 50 commands instead of 300.


On Sunday, April 3, 2016 at 2:38:26 AM UTC+5:30, Francis Avila wrote:

Divyansh Prakash

unread,
Apr 2, 2016, 7:45:16 PM4/2/16
to Clojure
Just verified - it works and gives the correct answer for shorter collections. 
No performance boost though. And fails (?) on larger collections.
Not sure what's happening. The foldvec implementation is also pretty hard to understand.


On Sunday, April 3, 2016 at 2:38:26 AM UTC+5:30, Francis Avila wrote:

Divyansh Prakash

unread,
Apr 2, 2016, 7:50:42 PM4/2/16
to Clojure
Updated code here.

Francis Avila

unread,
Apr 2, 2016, 11:36:10 PM4/2/16
to Clojure
You misunderstand how r/fold works.

My understanding is that reducers preserve the order of operation, so it should give the correct answer.

This is not quite true. There are two fundamental principles behind reducers. The first is that the process of reduction, stripped of all non-essentials, is "apply this function individually to every item in a collection, accumulating the result". Notice that there is no mention of order.

The second principle is, when reduction has been thus "reduced" to its essentials, it is possible for a *collection* to control how reductions are done over itself. Whether a reduction process has a particular order is a non-essential property that is provided or not-provided by the collection itself. The IReduceInit protocol is purposefully agnostic to order. Vectors implement it in an ordered way, but hash maps have no meaningful order.

So that's IReduceInit. The coll-fold protocol takes advantage of delegating the reduction implementation to the collection by allowing collections to provide a potentially-parallel reduction strategy. How this works is that the collection divides itself up into chunks, runs a normal reduction over each chunk using the same init value, then applies a "combine" function to combine the chunks. In the case of vectors, the chunks are combined in order, so the operation you are performing does not need to be commutative. However, the combine function still needs to be *associative*, because each individual sub-reduction is not initialized with the result of the previous reduction.


Even stranger - the parallel version seems to at least produce an output (not sure how correct) if run for the first 50 commands instead of 300.

The problem here is misunderstanding the "combine" step of fold. You supply the same function apply-cmd for both reducing and combining. However, the reducing function is called with a set of "on" lights and a command, but the combine function is called with two sets of on lights (i.e., it is called with the result of two reductions). So in the combine phase, r/fold is calling apply-cmd something like this: (apply-cmd #{[1 1]} #{[2 2]}), which is not what it expects, hence your strange exception about Longs.

Anyway, the combine function must be associative, but this advent problem is *not* associative, because you cannot rearrange the order in which the toggle functions are run. Each toggle function must always have the final state of every previous line done. You could compute the commands *in between* the toggle commands in parallel (essentially coalescing them into a single "turn on" command), but you still need to run the whole sequence of toggle commands, from start to finish, in order.

Here is a simple example to illustrate:

; Our commands. :toggle simplified to illustrate the point.
(def cmds [[:on #{1 2}] [:toggle 2] [:on #{3}]])
;=> #'advent.a6/cmds
; Our reduction function, with initializing arity
(defn apply-cmd
 
([] #{})
 
([on [cmd arg]]
   
(case cmd
     
:on (into on arg)
     
:toggle (if (contains? on arg)
               
(disj on arg)
               
(conj on arg)))))
;=> #'advent.a6/apply-cmd
;; The correct answer, for reference
(r/reduce apply-cmd cmds)
;=> #{1 3}




Now let's perform the steps an r/fold would perform. We have three commands and will do two chunks, but we will divide the chunks differently. Here is what happens when we split between the second and third command:


; Chunk one
(
r/reduce apply-cmd (subvec cmds 0 2))
;=> #{1}
; Chunk two
(r/reduce apply-cmd (subvec cmds 2))
;=> #{3}
; Combine
(into *1 *2)
;=> #{1 3}
; Looks ok...


Now lets try splitting between the first and second command:

; Chunk one
(
r/reduce apply-cmd (subvec cmds 0 1))
;=> #{1 2}
; Chunk two
(r/reduce apply-cmd (subvec cmds 1))
;=> #{3 2}
;; Combine
(into *1 *2)
;=> #{1 3 2}
; WRONG ANSWER!!


Notice *how you split the commands up* affects the final answer, which means this algorithm can't be safely implemented with r/fold!

Francis Avila

unread,
Apr 2, 2016, 11:38:56 PM4/2/16
to Clojure


Even stranger - the parallel version seems to at least produce an output (not sure how correct) if run for the first 50 commands instead of 300.

I forgot to mention: the reason why r/fold sometimes seems to still work is because there is only one chunk (i.e., no parallelism), so the combine function is never called.

This is also why r/fold worked when you used a lazy-sequence of commands instead of a vector--combine was never called because there was no parallelism.

Francis Avila

unread,
Apr 3, 2016, 2:32:48 AM4/3/16
to Clojure
I had some fun with playing around with faster solutions. https://gist.github.com/favila/0573e3f644dea252bdaaed5be9d1519f

The biggest speedup comes from avoiding set creation in expanded-range (i.e., the function that produces the collection of affected coordinates) and ensuring that the ops run on the accumulating set of on-lights using transients.  clojure.set/* functions require both items be sets and does not use transients internally, so it was much slower.

Another big speedup comes from encoding the light coordinates more efficiently. You can encode a light as a number (in my case, a long, with high bits the x coordinate and low bits the y coordinate) instead of a vector. This creates fewer objects which are easier to hash.

Finally, I tried an approach which doesn't use sets, but instead naively creates a 1000x1000 array of booleans and mutates it in place with every op. This is the fastest approach: 4 seconds on a 2010-era i3! I'm sure a proper matrix library (e.g. core.matrix) could do even better.

Divyansh Prakash

unread,
Apr 3, 2016, 3:27:42 AM4/3/16
to Clojure
Thanks a lot for taking the time out to explain this stuff in detail! 
Will go through your solution shortly.
Reply all
Reply to author
Forward
0 new messages