Hello All,
I'm new to the ispc mailing list, but I've been following the project itself with great interest since its first public release. I've been interested in SIMD parallelism for years at this point (since I read the first public spec for the AVX2 instruction set), and I have a particular interest in using it to more efficiently perform tasks which, while inherently parallel, fall outside the computation-heavy-inner-loop category that, as low-hanging fruit, received most of the initial attention as CPUs began to include wider and more general SIMD units.
Does anyone know of a way to induce ispc to emit the vpconflictd instruction (or its equivalent on targets without such an instruction)? If not, I think it would be a worthwhile addition to the Cross-Program Instance Operations already supported, in the form of a function that, given a varying parameter, returns a lane mask of "conflicting" lower-numbered lanes.
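To make the proposed semantics concrete, here is a scalar C model of such a primitive. The name lanemask_conflicts is just a placeholder, not an existing ispc function: for each lane it computes a bitmask of the lower-numbered lanes holding the same value, which is essentially what vpconflictd produces per element.

```c
#include <stdint.h>

#define LANES 8

/* Scalar model of the proposed cross-lane primitive (hypothetical name
 * "lanemask_conflicts"): for each lane i, out[i] is a bitmask of the
 * lower-numbered lanes j < i whose value equals lane i's value.
 * This matches the per-element semantics of AVX-512CD's vpconflictd. */
static void lanemask_conflicts(const uint32_t v[LANES], uint32_t out[LANES]) {
    for (int i = 0; i < LANES; i++) {
        uint32_t mask = 0;
        for (int j = 0; j < i; j++)
            if (v[j] == v[i])
                mask |= 1u << j;
        out[i] = mask;
    }
}
```

For example, with lane values {7, 1, 9, 7, 4, 9, 2, 7}, lane 3 gets mask 0x01 (it conflicts with lane 0) and lane 7 gets mask 0x09 (lanes 0 and 3); a lane with a unique value gets mask 0.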
In particular, the SPMD model is attractive for realizing significant performance gains on a certain class of problems where the core activity is pulling incoming events off a queue, looking up the correct state machine instance, and processing the event in the context of that instance, potentially producing an output event and generally updating the instance as a result. Some practical examples of this type of problem are:
- Packet processing (where the state machine that is looked up and updated might be, for instance, the connection the packet belongs to).
- Simulations (where the state machine instances could be the individual agents whose behavior the simulation models).
- Scanning streams of events for specified patterns (where each state machine instance is a potential match-in-progress).
- and many more...
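As a point of reference, the scalar version of the loop described above might look like the following sketch (all names here, such as Event, Instance, lookup_instance, and step, are illustrative stand-ins, not from any real API):

```c
#include <stddef.h>

/* Scalar sketch of the event loop described above. */
typedef struct { unsigned conn_id; int delta; } Event;
typedef struct { int state; } Instance;

#define N_INSTANCES 1024

static Instance table[N_INSTANCES];  /* the state machine instances */

/* Look up the state machine instance an event belongs to (a real
 * system would use a hash table keyed on the connection). */
static Instance *lookup_instance(unsigned conn_id) {
    return &table[conn_id % N_INSTANCES];
}

/* Process one event in the context of its instance; a stand-in for a
 * real transition function that might also emit an output event. */
static void step(Instance *inst, const Event *ev) {
    inst->state += ev->delta;
}

/* The core loop: pull events, look up the instance, apply the event. */
static void process_queue(const Event *q, size_t n) {
    for (size_t i = 0; i < n; i++)
        step(lookup_instance(q[i].conn_id), &q[i]);
}
```

The SPMD version would run several iterations of this loop in parallel across lanes, which is exactly where the conflict problem described below appears.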
One peculiarity of this class of applications is that, while they are almost "embarrassingly parallel", there is the problem of data-dependent resource conflicts. If you're processing packets, for instance, and you have thousands of active connections, most of the time you can grab the next eight packets off the queue and each will belong to a different connection (so they can be processed completely in parallel). Sometimes, however rarely, you will grab the next eight packets and discover that two of them belong to the same connection, and thus they must be serialized with respect to one another to avoid incorrect results (i.e. the state machine for that connection must be updated with the result of processing the first packet before the second is processed, since the first may trigger a state transition that completely changes the context in which the second must be evaluated).
Unlike many problems to which SIMD parallelism is applied, this class of problem does not have static loop-carried dependencies that are predictable in advance. Rather, it has dynamic (data-dependent) dependencies which are generally rare, but correctness requires that they be identified so that no two events requiring serialization are processed concurrently. Intel was clearly aware of how much this class of problem could benefit from SIMD parallelism: the AVX-512CD instruction set extension provides instructions like vpconflictd/vpconflictq which, in combination with a few other AVX-512 instructions, offer an effective way to detect when two work items must be serialized because of a shared resource. (For instance, if you're looking up state machine instances in a hash table, you could use vpconflictd to detect when two lanes will access the same bucket.) Detecting such lane-to-lane hazards early and dispatching a batch as two smaller batches to avoid the hazard allows the program logic for processing each event to remain simple and efficient.
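Here is a scalar sketch of that split-and-redispatch idea, assuming a batch of eight keys (hash-bucket indices, say): lanes whose key duplicates a lower-numbered lane are deferred to a later sub-batch, so no sub-batch ever contains two events for the same resource. The helper names are hypothetical, and process_batch is a stub for the real SIMD kernel.

```c
#include <stdint.h>

#define LANES 8

/* Stand-in for the real (SIMD) kernel that processes one conflict-free
 * sub-batch; a real implementation would gather the state machine
 * instances for these keys and step them all in parallel. */
static void process_batch(const uint32_t keys[], int n) {
    (void)keys; (void)n;
}

/* Split one batch of LANES events into conflict-free sub-batches and
 * process them in order. Returns the number of sub-batches needed
 * (1 in the common no-conflict case). */
static int dispatch(const uint32_t keys[LANES]) {
    uint32_t pending = (1u << LANES) - 1;  /* lanes not yet processed */
    int rounds = 0;
    while (pending) {
        uint32_t sub[LANES];
        uint32_t taken = 0;
        int n = 0;
        for (int i = 0; i < LANES; i++) {
            if (!(pending & (1u << i)))
                continue;
            int dup = 0;  /* does key[i] conflict with a lane already taken? */
            for (int j = 0; j < n; j++)
                if (sub[j] == keys[i]) { dup = 1; break; }
            if (!dup) {
                sub[n++] = keys[i];
                taken |= 1u << i;
            }
        }
        process_batch(sub, n);
        pending &= ~taken;
        rounds++;
    }
    return rounds;
}
```

With vpconflictd the inner duplicate scan collapses into one instruction per batch, but the dispatch structure stays the same: the common case is one full-width pass, and only genuinely conflicting lanes pay for serialization.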
One might wonder why it makes sense to go through the effort of parallelizing a seemingly computationally trivial task like updating simple state machines in response to a stream of events. There are two reasons, which surprised me quite a bit when I first started looking into this:
- When you're processing a stream of events, each of which can update one state machine instance out of thousands, it doesn't take many instances to reach the point where your cache hit rate when looking up state machines becomes unavoidably low. Beyond some threshold number of instances, your event processing rate is ultimately limited by DRAM latency. If you can use a gather to look up eight state machines from your hash table in parallel, the component loads the CPU breaks the gather into end up, on average, dispatched to different DRAM channels and banks. In my experiments, the upshot is that the average time to fetch eight items at random from a gigabyte-sized hash table is only about 2x the average time to fetch a single item with a normal scalar load.
- Many simple state machines (if implemented in the most elementary way, with if/else or switch/case constructs) are computationally trivial but very "branchy", which is especially hard on modern CPUs with deep pipelines and a relatively high cost for mispredicted branches. This makes it advantageous to re-state (no pun intended) these machines in terms of data flow rather than control flow: use lookup tables, or compute both sides of an if/else and select the correct result with a conditional move or similar (modern CPUs easily exploit the instruction-level parallelism created by computing both branches and selecting one later). Once most of your control flow is re-stated as data flow, it is only a small step to use SIMD instructions to process multiple event+instance pairs in parallel.
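As a small illustration of the lookup-table restatement, here is a made-up three-state machine whose entire transition logic is a table index rather than a chain of branches:

```c
#include <stdint.h>
#include <stddef.h>

enum { N_STATES = 3, N_INPUTS = 2 };

/* next[state][input]: what would have been an if/else or switch/case
 * cascade becomes a single data lookup. The machine itself (3 states,
 * 2 input symbols) is an invented example. */
static const uint8_t next[N_STATES][N_INPUTS] = {
    /* state 0 */ {1, 0},
    /* state 1 */ {2, 0},
    /* state 2 */ {2, 0},
};

/* Feed a sequence of input symbols through the machine; the loop body
 * contains no data-dependent branches, so nothing to mispredict. */
static int run(const uint8_t *inputs, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s = next[s][inputs[i]];
    return s;
}
```

Once the transition is a pure lookup like this, the SIMD version is nearly mechanical: a gather from the same table with a vector of state/input indices advances eight machines at once.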
I feel that as more of the low-hanging fruit is picked, and as SIMD instruction sets that are more general and closer to orthogonal (compare AVX-512 with SSE, for instance) become more widely available, it will become more and more important to apply this sort of parallelism to general-purpose computing tasks as well, rather than only to number crunching.
Cheers,
-lars