Herhut, Stephan A
Dec 6, 2012, 1:59:46 PM
to dev-tech-js-en...@lists.mozilla.org
Dear all,
after (several) conversations with Niko and Rick on potential parallel implementations for scatter, I felt the urge to document the current proposals and also put my recent thoughts up for discussion. I think we all agree that there is no single good implementation for scatter. Depending on how it is invoked, scatter serves different purposes. I have currently identified three use cases, each with unique characteristics that call for a different implementation. Note that my classification is driven by expressed intent, not actual behavior.
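To keep us on the same page, here is a minimal sequential sketch of the scatter semantics as I read them from the signatures below; the function and parameter names are mine and this is of course not the actual runtime code. Every source element is written to result[v[i]]; colliding writes are combined with the collision function f if one is given and are an error otherwise; untouched cells receive the default value d; and l, if given, sets the result length.

```typescript
// Sequential reference semantics for scatter (illustrative only; names are
// invented, and plain number[] stands in for ParallelArray data).
function scatterSequential(
  source: number[],
  indices: number[],                     // the scatter vector v
  defaultValue = 0,                      // d: value for cells nothing is scattered to
  f?: (a: number, b: number) => number,  // collision function
  length = source.length                 // l: length of the result
): number[] {
  const result = new Array<number>(length).fill(defaultValue);
  const written = new Array<boolean>(length).fill(false);
  for (let i = 0; i < source.length; i++) {
    const t = indices[i];
    if (written[t]) {
      // A collision: without a collision function this is an error.
      if (!f) throw new Error(`collision at index ${t}`);
      result[t] = f(result[t], source[i]);
    } else {
      result[t] = source[i];
      written[t] = true;
    }
  }
  return result;
}
```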
1) permutation
PA.scatter(v)
These are the operations where the programmer does not specify a reduction operator for collisions. Therefore, I assume the programmer does not expect collisions. Most importantly, the runtime in this case only needs to detect collisions but not handle them specifically. Throwing an error is good enough.
I would implement this using a single result buffer, which all threads update, while partitioning the input (the scatter vector) across threads. Collisions are detected by keeping a write log for each thread. Once all threads have finished, the write logs are compared, and if any index was written by more than one thread, we throw an error. The checking phase can also be performed in parallel. Moreover, it suffices to store one bit per index position. The advantage of this approach is that it avoids synchronization between threads, at the cost of some extra bookkeeping. One main drawback is that it may trigger bad cache behavior due to the rather messy writes (pun from Rick).
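To make this concrete, a rough sketch of this first variant (sequential, with invented names; the outer loop over workers stands in for the parallel threads):

```typescript
// Approach 1a: shared result buffer, partitioned input, one write log per
// worker (conceptually one bit per index), separate checking phase.
function scatterPermutationShared(
  source: number[],
  indices: number[],
  numWorkers: number
): number[] {
  const n = source.length;
  const result = new Array<number>(n);
  const logs: boolean[][] = Array.from({ length: numWorkers }, () =>
    new Array<boolean>(n).fill(false)
  );

  // Write phase: each worker handles a contiguous chunk of the input and
  // writes into the shared result buffer without synchronization.
  const chunk = Math.ceil(n / numWorkers);
  for (let w = 0; w < numWorkers; w++) {
    for (let i = w * chunk; i < Math.min((w + 1) * chunk, n); i++) {
      const target = indices[i];
      logs[w][target] = true;     // record the write in this worker's log
      result[target] = source[i];
    }
  }

  // Check phase: an index written by more than one worker is a collision.
  // This scan can itself be split across workers.
  for (let idx = 0; idx < n; idx++) {
    let writers = 0;
    for (let w = 0; w < numWorkers; w++) {
      if (logs[w][idx]) writers++;
    }
    if (writers > 1) throw new Error(`collision at index ${idx}`);
  }
  return result;
}
```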
Alternatively, one might implement it by partitioning the result buffer instead of the input. In this scenario, each worker thread is assigned a portion of the result buffer, while the input (scatter vector) is shared across all threads. Each thread then scans the input for indices in its range and updates the output buffer and its write log accordingly. If a conflict is detected, the operation is aborted. This approach has potentially better cache behavior, at the price of additional traversals of the input vector.
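A corresponding sketch of this alternative, with the same caveats as above:

```typescript
// Approach 1b: partitioned result buffer, shared input. Each worker scans
// the whole scatter vector but only performs the writes that land in its
// own output range, so conflicts can be detected on the spot.
function scatterPermutationPartitioned(
  source: number[],
  indices: number[],
  numWorkers: number
): number[] {
  const n = source.length;
  const result = new Array<number>(n);
  const written = new Array<boolean>(n).fill(false); // write log; one owning worker per slot

  const chunk = Math.ceil(n / numWorkers);
  for (let w = 0; w < numWorkers; w++) {
    const lo = w * chunk;
    const hi = Math.min(lo + chunk, n);
    // In the runtime each worker would traverse the input concurrently.
    for (let i = 0; i < n; i++) {
      const target = indices[i];
      if (target < lo || target >= hi) continue; // not this worker's range
      if (written[target]) throw new Error(`collision at index ${target}`); // abort
      written[target] = true;
      result[target] = source[i];
    }
  }
  return result;
}
```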
2) few collisions
PA.scatter(v, d, f)
Here, the programmer has specified a collision function, yet no new length is given. My crystal ball in this case predicts few collisions. However, unlike before, we need to handle each collision specifically.
The first approach above no longer works here, as we have to detect collisions as they happen. Niko considered using CAS (compare-and-swap). It can be applied either to the stored value itself or to an additional shared write log. However, CAS is costly, in particular because there is not much other work going on to hide its latency.
For this scenario, the second approach above seems to be the clear winner.
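Sketched in the same style, approach 1b extended with the collision function would look roughly like this (again invented names, with a sequential loop standing in for the parallel workers):

```typescript
// Use case 2: partitioned result buffer, shared input, collisions resolved
// locally with the user-supplied collision function f. Since each result
// slot has exactly one owning worker, no synchronization is needed.
function scatterFewCollisions(
  source: number[],
  indices: number[],
  defaultValue: number,
  f: (a: number, b: number) => number,
  numWorkers: number
): number[] {
  const n = source.length;
  const result = new Array<number>(n).fill(defaultValue);
  const written = new Array<boolean>(n).fill(false);

  const chunk = Math.ceil(n / numWorkers);
  for (let w = 0; w < numWorkers; w++) {
    const lo = w * chunk;
    const hi = Math.min(lo + chunk, n);
    for (let i = 0; i < n; i++) {
      const target = indices[i];
      if (target < lo || target >= hi) continue; // not this worker's range
      result[target] = written[target] ? f(result[target], source[i]) : source[i];
      written[target] = true;
    }
  }
  return result;
}
```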
3) histogram-style operation
PA.scatter(v, d, f, l) where l << v.length
As the programmer has specified a new length, many collisions are to be expected. As in 2), we have to handle each of them. However, as the result buffer is small relative to the input data, we might be able to tolerate the cost of allocating and merging multiple buffers.
The approach for 2) could lead to a highly unbalanced work distribution, for example when most indices fall into the output range of a single worker. My proposal is to use a variant of the first algorithm from use case 1, with one result buffer and write log per thread. Each thread then computes its local result, resolving conflicts as they happen. Once all threads have finished, the local result buffers are combined into a global result, potentially in parallel. This is also the point where empty cells are filled with the default value.
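A sketch of this third variant (same caveats: invented names, sequential loops in place of parallel threads):

```typescript
// Use case 3: one private result buffer and write log per worker, local
// conflict resolution with f, then a merge phase that combines the local
// buffers and fills untouched cells with the default value.
function scatterHistogram(
  source: number[],
  indices: number[],
  defaultValue: number,
  f: (a: number, b: number) => number,
  length: number,
  numWorkers: number
): number[] {
  const n = source.length;
  const locals = Array.from({ length: numWorkers }, () => new Array<number>(length));
  const logs = Array.from({ length: numWorkers }, () =>
    new Array<boolean>(length).fill(false)
  );

  // Local phase: partition the input; each worker scatters into its own
  // small buffer, resolving its own collisions with f as they happen.
  const chunk = Math.ceil(n / numWorkers);
  for (let w = 0; w < numWorkers; w++) {
    for (let i = w * chunk; i < Math.min((w + 1) * chunk, n); i++) {
      const t = indices[i];
      locals[w][t] = logs[w][t] ? f(locals[w][t], source[i]) : source[i];
      logs[w][t] = true;
    }
  }

  // Merge phase: combine the local buffers with f; cells no worker wrote
  // receive the default value. This loop can also be run in parallel.
  const result = new Array<number>(length);
  for (let idx = 0; idx < length; idx++) {
    let acc: number | undefined;
    for (let w = 0; w < numWorkers; w++) {
      if (!logs[w][idx]) continue;
      acc = acc === undefined ? locals[w][idx] : f(acc, locals[w][idx]);
    }
    result[idx] = acc === undefined ? defaultValue : acc;
  }
  return result;
}
```

For instance, with f = (a, b) => a + b and a source of all ones, this computes a plain histogram of the index vector.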
So, in summary, we have three implementations of scatter:
1) a) shared buffer, local write log
1) b) partitioned buffer, partitioned write log
2) partitioned buffer, partitioned write log
3) local buffer, local write log
The tipping point between 2 and 3 depends on the imbalance and the buffer sizes involved. Only experimentation will tell.
This is my current state of thought. Please amend with any ideas you had.
Stephan