Data dependent lane conflicts

48 views
Skip to first unread message

Lars Friend

unread,
Jul 18, 2020, 8:23:36 PM7/18/20
to Intel SPMD Program Compiler Users
Hello All,

    I'm new to the ispc mailing list but I've been following the project itself with great interest since its first public release.  I've been interested in using SIMD parallelism for years at this point (since I read the first public spec for the AVX2 instruction set) and have a particular interest in using them to more efficiently perform tasks which, while inherently parallel, fall outside the computation-heavy-inner-loop category which, as low-hanging fruit, received the most initial attention as CPUs began to include wider and more general SIMD units.

    Does anyone know if there is a way to induce ispc to emit the vpconflictd instruction (or its equivalent on targets without such an instruction)?  If not, I think it would be a worthwhile thing to add to the Cross-Program Instance Operations already supported, in the form of a function that, given a varying parameter, would return a lanemask of "conflicting" lower-numbered lanes.

    In particular, the SPMD model is particularly attractive for realizing significant performance gains for certain class of problem where the core activity involves pulling incoming events off a queue, looking up the correct state machine instance, and processing the event in the context of its state machine instance, potentially producing an output event and generally updating the state machine instance as a result.  Some practical examples of this type of problem are:
  • Packet processing (where the state machine that is looked up and updated may be the connection the packet belongs to for instance).
  • Simulations (where the state machine instances could be the individual agents whose behavior the simulation models).
  • Scanning streams of events for specified patterns (where each state machine instance is a potential match-in-progress).
  • and many more...
    One peculiarity about this class of applications is that while they are almost "embarrassingly parallel" there is the problem of data dependent resource conflicts; If you're processing packets, for instance, and you have thousands of active connections most of the time you can grab the next eight packets off the queue and each will belong to a different connection (thus they can be processed completely in parallel).  Sometimes, however rarely, you will grab the next eight packets off the queue and discover that two of them belong to the same connection and thus they must be serialized with respect to one another to avoid incorrect results (i.e. the state machine for that connection must be updated with the result of processing the first packet before attempting to process the second packet since the first packet may initiate a state transition that fully changes the context in which the second must be evaluated).

    Unlike many problems to which SIMD parallelism is applied, this class of problem does not have static loop-carried dependencies that are predictable in advance.  Rather it has dynamic (data dependent) dependencies which are generally rare, but correctness requires that they be identified such that no two events that require serialization are processed concurrently.  Intel was clearly aware of the degree to which this class of problem would benefit from SIMD parallelism and in the AVX-512CD instruction set extension they provided instructions like vpconflictd/vpconflictq which, in combination with a few other AVX-512 instructions, provide an effective way to detect when it is required to serialize two work items based on their shared need of a resource.  (For instance, if you're looking up state machine instances in a hash table you could use vpconflictd to detect when two lanes are going to require access to the same bucket).  Detecting such lane-to-lane hazards early and dispatching a batch as two smaller batches to avoid the hazard allows the program logic for processing each event to remain simple and efficient.

    One might wonder why it makes sense to go through the effort of parallelizing a seemingly computationally trivial task like updating simple state machines in light of a stream of events.  There are two reasons which surprised me quite a bit when I first started looking into this:
  1. When you're processing a stream of events each of which can update one state machine instance out of thousands, it doesn't take too many state machine instances to reach the point where your cache hit rate in looking up the state machines becomes unavoidably low.  This means that beyond some threshold number of state machine instances your event processing rate is ultimately limited by DRAM latency.  If you can use a gather to look up eight state machines from your hash table in parallel the component loads which the CPU breaks the gather into end up, on average, being dispatched to different DRAM channels and banks.  My experiments have shown that the upshot of this is that the average time it takes to fetch eight items at random from a gigabyte-sized hash table is only about 2x the average time to fetch a single item with a normal scalar load instruction.
  2. Many simple state machines (if implemented in the most elementary way, with if/else or switch/case constructs) end up being computationally trivial but very "branchy" which is especially hard on modern CPUs with deep pipelines and a relatively high cost for mispredicted branches.  This makes it advantageous to re-state (no pun intended) these state machines in terms of data flow rather than control flow (e.g. use look-up tables, compute both sides of an if/else and use a conditional move or similar to select the correct one.  Modern CPUs can easily exploit the instruction-level parallelism generated by computing both branches and selecting one later).  Once you've re-stated most of your control flow in terms of data flow already, it is only a small step to use SIMD instructions to process multiple event+instance pairs in parallel.
    I feel that as more of the low-hanging fruit is picked, and as SIMD instruction sets that are more general and closer to orthogonal (compare AVX-512 with SSE for instance) become more widely available it will become more and more important to apply this sort of parallelism to general purpose computing tasks as well (rather than only number crunching).

        Cheers,
                -lars

Dmitry Babokin

unread,
Jul 21, 2020, 3:12:52 AM7/21/20
to ispc-...@googlegroups.com
Hi Lars,

Thank you for a detailed motivating examples describing the need for conflict detection capability in the language!

Currently there's no way to generate vpconflictd instruction in ISPC. I see two paths to generate it:
  (1) introduce a library function, which has semantic of vpconflitd/q and emulate it on the platforms that don't support it - as you suggested, or
  (2) introduce a generic way to call arbitrary intrinsics as proposed in https://github.com/ispc/ispc/issues/1803

Given the importance of the instruction for many applications that you outlined and as you see value for emulating it, I would suggest going with a dedicated library function. I would say that your description of the problem even sounds like we are missing a special type of "foreach" loop, which would serialize conflicting iterations. If we get enough motivating examples and prove that we can implement it efficiently enough, I would support it as well.

I can implement the library function pretty easily. Could you please file an issue for that? 

Also, if you can contribute an example (as a self sufficient program that we can distribute as part of our example set) of the algorithm, where this instruction is useful, what would be really cool. All the cases that you've described do make sense and sound very convincing, but I personally don't have experience with these kinds of algorithms. This also sounds like an interesting assignment for someone who is studying algorithms class.

And one more question, do you see the use for VP2INTERSECTD instructions in these algorithms?

Dmitry.

--
You received this message because you are subscribed to the Google Groups "Intel SPMD Program Compiler Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ispc-users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ispc-users/2c895a62-f77f-47c7-a187-c52fce4cc807o%40googlegroups.com.

Lars Friend

unread,
Jul 21, 2020, 11:58:07 PM7/21/20
to Intel SPMD Program Compiler Users
I can certainly create an issue for this.  However, there is one bit of subtlety which makes things a bit more interesting, so I figure I'll give it a bit more thought and try to write something up this weekend.

One tricky bit is that the simplest motivating examples, e.g.:



typedef struct {
    uint32 account
;
    int32  delta_cents
;
} acct_delta_t;
// for compactness, bounds checking omitted
void apply_deltas(uniform int32 balances[], uniform acct_delta_t deltas[], uniform int num_deltas)
{
   
foreach (i = 0 ... num_deltas) {
        uint32 acct
= deltas[i].account;
        int32 cents
= deltas[i].delta_cents;
        balances
[acct] += cents;
   
}
}


(which will obviously get the wrong answer if two instances in one gang gather, add, and scatter, one of those scatters will win and the balance will not come out correct. Really, this is just a class of mutual exclusion problem that is actually *much* simpler than the traditional forms of these problems because the gang advances in lock step so we *can* use something like vpconflictd to check in advance).

The above example is the simplest case imaginable.  However, you could easily imagine a closely related form of this problem (which is much closer to real-world) where your transactions look like this instead:

typedef struct {
    uint32 from_account
;
    uint32 to_account
;
    uint32 cents
;
} fancy_acct_delta_t;

where you'd need to check that no two transactions in your batch conflict with one another in _either_ account number...  This isn't hard for a human to do with some thought or for a compiler to construct algebraically but of course it is possible to get into a situation where the non-trivial cost of the conflict detection may suggest a different approach altogether (e.g. hashing all incoming items into N separate queues so you know a priori that you can grab the head item off each queue and know it will not conflict, though this carries the opportunity cost of needing to partition your resources N ways and live with any uneven/adverserial hash abberations that may leave some partitions underutilized and others full).

As for vp2intersectd, it seems like it ought to be useful for _something_ but just what hasn't come to mind yet.  I expect once I get the opportunity to play with it a bit more something may resolve itself.

    Cheers,
        -lars

Dmitry Babokin

unread,
Jul 22, 2020, 1:02:57 AM7/22/20
to ispc-...@googlegroups.com
Ok, so I read it as:
(1) we may want have more general mechanism to deal with conflicts.
(2) for vpconflictd instruction the question if it needs to be exposed as a library function (available on all platforms with fall-back implementation) or as an intrinsic, which is available only for AVX512 instruction set.

Dmitry.

--
You received this message because you are subscribed to the Google Groups "Intel SPMD Program Compiler Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ispc-users+...@googlegroups.com.

Lars Friend

unread,
Jul 22, 2020, 11:03:49 AM7/22/20
to ispc-...@googlegroups.com
Yes, that's essentially it. I suspect that most people will just use the general conflict mechanism any time they have a pattern where a received/computed/derived field selects which data are in play for any given lane dynamically at runtime.  People who have some other oddball use for that function could still use the intrinsic (which would emit the special instruction on platforms that had it, or an equivalent sequence otherwise).

    -lars

        -lars

Manodeep Sinha

unread,
Jul 23, 2020, 12:08:17 AM7/23/20
to Intel SPMD Program Compiler Users
Hi all,

Correlation functions are another example where such capability in ispc would be fantastic. Correlation functions require computing pairwise separations (up to some Rmax) and then computing a histogram of these separations. Correlation functions are widely used in Astronomy (100k+ references in papers). A trivial code for a correlation function is:

for(int64_t i=0;i<Npart;i++) {
    for(int64_t j=0;j<Npart;j++) {
        const double dx = xpos[i] - xpos[j];
        const double dy = ypos[i] - ypos[j];
        const double dz = zpos[i] - zpos[j];
           
        const double r2 = dx*dx + dy*dy + dz*dz;
        if(r2 < sqr_rmin || r2 >= sqr_rmax) continue;
        const double r = sqrt(r2);
        const int index = convert_sep_to_histogram_index(r);
        histogram[index]++;
    }
}

where, the convert_sep_to_histogram function scales "r" into [0, nbins-1]. The pairwise computation can be done with SIMD but the histogram update needs to be serialised in general. If ispc adds the equivalent of the vpconflict instruction that would be great!

Here is a self-contained simple correlation function repo (with tests): https://github.com/manodeep/simple-corr

Cheers,
Manodeep
Reply all
Reply to author
Forward
0 new messages