
parallel implementation of scatter


Herhut, Stephan A

Dec 6, 2012, 1:59:46 PM
to dev-tech-js-en...@lists.mozilla.org
Dear all,



After (several) conversations with Niko and Rick on potential parallel implementations for scatter, I felt the urge to document the current proposals and also put my latest thoughts up for discussion. I think we all agree that there is no single good implementation for scatter. Depending on how it is invoked, scatter serves different purposes. So far I have identified three use cases that have unique characteristics and call for different implementations. Note that my classification is driven by expressed intent, not actual behavior.



1) permutation



PA.scatter(v)



These are the operations where the programmer does not specify a reduction operator for collisions. Therefore, I assume the programmer does not expect collisions. Most importantly, the runtime in this case only needs to detect collisions but not handle them specifically. Throwing an error is good enough.
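For reference, the intended semantics of this collision-free case can be sketched sequentially (a minimal sketch with invented names, not the actual River Trail internals):

```javascript
// Reference semantics for PA.scatter(v) with no conflict handler:
// result[v[i]] = input[i], and any output index targeted more than
// once is an error. Hypothetical helper, sequential on purpose.
function scatterPermute(input, indices) {
  var result = new Array(input.length);
  var written = new Array(input.length).fill(false); // one flag per slot
  for (var i = 0; i < input.length; i++) {
    var tgt = indices[i];
    if (written[tgt]) throw new Error("scatter collision at index " + tgt);
    written[tgt] = true;
    result[tgt] = input[i];
  }
  return result;
}
```

For example, `scatterPermute(['a','b','c','d'], [1,0,3,2])` yields `['b','a','d','c']`, while a scatter vector like `[0,0]` throws.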



I would implement this using a single result buffer, which all threads update, with the input (the scatter vector) partitioned among them. Collisions are detected by keeping a write log for each thread. Once all threads have finished, the write logs are compared, and if one index was written by more than one thread we throw an error. The checking phase can also be performed in parallel. Moreover, it suffices to store one bit per index position. The advantage of this approach is that it avoids synchronization between threads, at the cost of some extra bookkeeping. One main drawback is that it may trigger bad cache behavior due to the rather messy writes (pun from Rick).
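A minimal sketch of this first scheme, with the worker threads simulated by the outer loop (all names invented; not the actual implementation):

```javascript
// Use case 1a (sketch): shared result buffer, one private write log
// per simulated thread, one entry per output index. Note that an
// intra-thread duplicate must be caught during the write phase,
// since a single bit cannot record two writes to the same slot.
function scatterSharedBuffer(input, indices, numThreads) {
  var n = input.length;
  var result = new Array(n);
  var logs = [];
  var chunk = Math.ceil(n / numThreads);
  // "Parallel" phase: each thread writes its chunk of the input into
  // the shared buffer without synchronization, logging its targets.
  for (var t = 0; t < numThreads; t++) {
    var log = new Uint8Array(n); // one bit per index would suffice
    for (var i = t * chunk; i < Math.min((t + 1) * chunk, n); i++) {
      var tgt = indices[i];
      if (log[tgt]) throw new Error("collision at index " + tgt);
      result[tgt] = input[i];
      log[tgt] = 1;
    }
    logs.push(log);
  }
  // Check phase (itself parallelizable): a collision occurred iff
  // some output index appears in more than one thread's log.
  var seen = new Uint8Array(n);
  for (var t = 0; t < numThreads; t++) {
    for (var j = 0; j < n; j++) {
      if (logs[t][j]) {
        if (seen[j]) throw new Error("collision at index " + j);
        seen[j] = 1;
      }
    }
  }
  return result;
}
```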



Alternatively, one might implement it by partitioning the write buffer instead of the input buffer. In this scenario, each worker thread is assigned a portion of the result buffer but the input (scatter vector) is shared across all threads. Each thread then scans the input for indices in its range and updates the output buffer and write log accordingly. If a conflict is detected, the operation is aborted. This approach has potentially better cache behavior at the price of additional traversals of the input vector.
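The alternative can be sketched the same way, again with threads simulated by a loop (names hypothetical): each thread owns a slice [lo, hi) of the result and scans the entire scatter vector for targets in that slice.

```javascript
// Use case 1b (sketch): partitioned result buffer. Every simulated
// thread traverses the whole scatter vector and performs only the
// writes that land in its own slice; a per-slice write log detects
// collisions. No cross-thread check phase is needed.
function scatterPartitioned(input, indices, numThreads) {
  var n = input.length;
  var result = new Array(n);
  var slice = Math.ceil(n / numThreads);
  for (var t = 0; t < numThreads; t++) {
    var lo = t * slice, hi = Math.min(lo + slice, n);
    var log = new Uint8Array(hi - lo);
    for (var i = 0; i < n; i++) {          // full input traversal per thread
      var tgt = indices[i];
      if (tgt < lo || tgt >= hi) continue; // not this thread's slice
      if (log[tgt - lo]) throw new Error("collision at index " + tgt);
      log[tgt - lo] = 1;
      result[tgt] = input[i];
    }
  }
  return result;
}
```

The extra cost mentioned above is visible in the sketch: the input vector is traversed once per thread instead of once in total.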



2) few collisions



PA.scatter(v, d, f)



Here, the programmer has specified a collision function, yet no new length is given. My crystal ball in this case predicts few collisions. However, unlike before, we need to handle each collision explicitly.



The first approach above no longer works here, as we have to detect collisions as they happen. Niko considered using CAS. It could be applied either to store the value itself or to update an additional shared write log. However, CAS is costly, in particular as there is not much else going on to hide its latency.



For this scenario, the second approach above seems to be the clear winner.
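Under that assumption, the partitioned scheme extends naturally with a collision function (a sketch with invented names; the argument order passed to f, existing value first, is an assumption for illustration, as is taking d to be the default for unwritten cells):

```javascript
// Use case 2 (sketch): partitioned result buffer, but a collision is
// resolved via f instead of aborting. Since each output slot is owned
// by exactly one simulated thread, no synchronization is needed.
function scatterWithConflict(input, indices, defaultValue, f, numThreads) {
  var n = input.length;
  var result = new Array(n).fill(defaultValue); // d fills empty cells
  var slice = Math.ceil(n / numThreads);
  for (var t = 0; t < numThreads; t++) {
    var lo = t * slice, hi = Math.min(lo + slice, n);
    var written = new Uint8Array(hi - lo);
    for (var i = 0; i < n; i++) {
      var tgt = indices[i];
      if (tgt < lo || tgt >= hi) continue; // not this thread's slice
      result[tgt] = written[tgt - lo] ? f(result[tgt], input[i]) : input[i];
      written[tgt - lo] = 1;
    }
  }
  return result;
}
```

For example, with a summing conflict function, `scatterWithConflict([1,2,3,4], [0,0,2,2], 0, (a, b) => a + b, 2)` yields `[3,0,7,0]`.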



3) histogram style operation



PA.scatter(v, d, f, l) where l << v.length



As the programmer has specified a new length, many collisions are to be expected. As in 2) we have to handle each of them. However, as the result buffer is small relative to the input data, we might be able to tolerate the cost of allocating and merging multiple buffers.



The approach for 2) could lead to a highly unbalanced work distribution. My proposal is to use a variant of the first algorithm from use case 1, with one result buffer and write log per thread. Each thread then computes its local result, resolving conflicts as they happen. Once all threads are finished, the local result buffers are combined into a global result, potentially in parallel. This is also the point where empty cells are filled with the default value.
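A sketch of this histogram-style variant (hypothetical names, threads again simulated by a loop): each thread scatters its chunk of the input into a private buffer of length l, then a merge pass combines the private buffers with f and fills empty cells with the default.

```javascript
// Use case 3 (sketch): one private result buffer plus write log per
// simulated thread. Conflicts within a thread resolve via f as they
// happen; the merge phase combines the per-thread buffers, also via
// f, and writes the default value into cells nobody touched.
function scatterHistogram(input, indices, defaultValue, f, length, numThreads) {
  var locals = [], hits = [];
  var chunk = Math.ceil(input.length / numThreads);
  for (var t = 0; t < numThreads; t++) {        // "parallel" phase
    var local = new Array(length), hit = new Uint8Array(length);
    for (var i = t * chunk; i < Math.min((t + 1) * chunk, input.length); i++) {
      var tgt = indices[i];
      local[tgt] = hit[tgt] ? f(local[tgt], input[i]) : input[i];
      hit[tgt] = 1;
    }
    locals.push(local);
    hits.push(hit);
  }
  // Merge phase (parallelizable over index ranges of the result).
  var result = new Array(length);
  for (var j = 0; j < length; j++) {
    var acc, any = false;
    for (var t = 0; t < numThreads; t++) {
      if (!hits[t][j]) continue;
      acc = any ? f(acc, locals[t][j]) : locals[t][j];
      any = true;
    }
    result[j] = any ? acc : defaultValue;
  }
  return result;
}
```

Counting into three buckets, for instance, `scatterHistogram([1,1,1,1,1,1], [0,1,0,2,1,0], 0, (a, b) => a + b, 3, 2)` yields `[3,2,1]`.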



So, in summary, we have three implementations of scatter:



1) a) shared buffer, local write log

1) b) partitioned buffer, partitioned write log

2) partitioned buffer, partitioned write log

3) local buffer, local write log



The tipping point between 2 and 3 depends on the imbalance and the buffer sizes involved. Only experimentation will tell.



This is my current state of thought. Please amend with any ideas you had.



Stephan

Hudson, Rick

Dec 6, 2012, 2:19:38 PM
to Herhut, Stephan A, dev-tech-js-en...@lists.mozilla.org
I would implement "partitioned result buffer, partitioned write log" and then measure the implementation. If we are memory bound, then I would not expect your other solutions to be winners, but I'm almost always wrong before I build and measure. If use case 3 partitions share cache lines, then they might result in lots of MESI / RFO messages, but if that is not the case then I'd still bet on "partitioned result buffer, partitioned write log".

Note that there really isn't a difference between partitioned write log and local write log since only the thread owning the partition will need to look at the log.

- Rick
_______________________________________________
dev-tech-js-engine-rivertrail mailing list dev-tech-js-en...@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-tech-js-engine-rivertrail

Herhut, Stephan A

Dec 6, 2012, 2:32:40 PM
to Hudson, Rick, dev-tech-js-en...@lists.mozilla.org
For use case 3 I am thinking of small result lengths (like 255 for an RGB histogram; how many buckets does your typical classifier have...). So cache behavior might be a killer. I am still more worried about imbalance, though.

I use "partitioned write log" to mean one write log that all threads share but that models a global state, whereas local write log refers to a write log per thread where the global state has to be explicitly constructed.

Stephan

Hudson, Rick

Dec 6, 2012, 2:45:11 PM
to Herhut, Stephan A, dev-tech-js-en...@lists.mozilla.org
Build and measure. 255 buckets at 4 bytes / bucket with 4 threads and cache line of 64 (or even 128) bytes means no cache line conflicts if everything is aligned properly.
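For concreteness, that arithmetic can be spelled out (figures from the sentence above; the per-partition alignment is the stated assumption):

```javascript
// Back-of-envelope check of the cache-line claim (assumed figures).
var buckets = 255, bytesPerBucket = 4, threads = 4, lineBytes = 64;
var bytesPerPartition = Math.ceil(buckets / threads) * bytesPerBucket; // 64 buckets * 4 B = 256 B
var linesPerPartition = bytesPerPartition / lineBytes;                 // 4 cache lines per thread
// With each partition padded and aligned to a cache-line boundary, no
// two threads ever touch the same line, so no false sharing occurs.
```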

Why do you need to explicitly construct the global state for a write log if the result buffer is partitioned?

Herhut, Stephan A

Dec 6, 2012, 3:16:48 PM
to Hudson, Rick, dev-tech-js-en...@lists.mozilla.org
I actually agree. I was just replying to add information that was missing in my earlier email regarding the size of l. Again, caches _might_ be an issue but unless we measure we won't know. Imbalance will probably be the bigger problem.

Of course we do not need to construct a global state for the write log in the algorithm you propose. That is why I call those partitioned write logs. In the other algorithm where each thread uses its own write log, which I refer to as local write log, a global state has to be explicitly constructed. This was in reply to your suggestion that partitioned and local write logs are essentially the same.

Stephan

> -----Original Message-----
> From: Hudson, Rick
> Sent: Thursday, December 06, 2012 11:45 AM
> To: Herhut, Stephan A; dev-tech-js-en...@lists.mozilla.org
> Subject: RE: parallel implementation of scatter

Hudson, Rick

Dec 6, 2012, 3:18:53 PM
to Herhut, Stephan A, dev-tech-js-en...@lists.mozilla.org
Yes, that makes sense. I think we are on the same page here.