--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/CAFPX7RLpMw1N7Rr-%2BnBzD9KMzKFZZ4tPHB7K%3DQpZ86RvYs-jOg%40mail.gmail.com.
Hi Stephan,having being involved in the OpenMP committee for a while now, I like the proposal and think it will be useful to define parallel constructs within MLIR. I believe the current proposal below may need further clarification, esp with respect to synchronization.Starting with a 1D parallel region:loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { ... }
the traditional model is a fork/join model, where all the treads work in parallel until all iterations are done and then they wait for others in a barrier, after which only the main tread continue execution.When extending this model more dimensions, the proposal has to be specific if, for exampleloop.parallel (%iv1, %iv2) = (%lb1, %lb2) to (%ub1, %ub2) step (%s1, %s2) { ... }is simply a shorthand forloop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) {loop.parallel (%iv2) = (%lb2) to (%ub2) step (%s2) { ... }
}or not. If it is, this means that the code execute N+1 barriers (where N is the trip count of the inner loop), as each of the threads that participate to one inner parallel loop are expected to synchronize at the end of it.
While this is often desired (and the default model in OpenMP), it is not always best. There are cases, esp when 2+D tiling where you want all of the tiles to go in parallel. OpenMP has a fairly unsatisfying approach to this (collapsing multiple loops in a parallel).For the barriers, we also have to be a bit more specific on which threads are expected to synchronize with which other threads. I also know it is customary in GPU to use a synchronization barrier inside loops, as there is the expectation that each of the loop iterations is mapped to a thread in a very large pool of warps. This is typically not the case on CPUs, where a single CPU thread is sequentially iterating over the parallel iterations that were mapped to that CPU. This is why barriers are not allowed in parallel-for loops (like the construct you propose), and if you need such a barrier, the parallel loop with a barrier would need to be split into two parallel loops.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/OF26FDE6BF.F559A4FD-ON002584BA.006AD745-002584BA.006D483A%40notes.na.collabserv.com.
Hi Stephan,having being involved in the OpenMP committee for a while now, I like the proposal and think it will be useful to define parallel constructs within MLIR. I believe the current proposal below may need further clarification, esp with respect to synchronization.Starting with a 1D parallel region:loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { ... }
the traditional model is a fork/join model, where all the treads work in parallel until all iterations are done and then they wait for others in a barrier, after which only the main tread continue execution.When extending this model more dimensions, the proposal has to be specific if, for exampleloop.parallel (%iv1, %iv2) = (%lb1, %lb2) to (%ub1, %ub2) step (%s1, %s2) { ... }is simply a shorthand forloop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) {loop.parallel (%iv2) = (%lb2) to (%ub2) step (%s2) { ... }
}or not. If it is, this means that the code execute N+1 barriers (where N is the trip count of the inner loop), as each of the threads that participate to one inner parallel loop are expected to synchronize at the end of it. While this is often desired (and the default model in OpenMP), it is not always best. There are cases, esp when 2+D tiling where you want all of the tiles to go in parallel. OpenMP has a fairly unsatisfying approach to this (collapsing multiple loops in a parallel).
For the barriers, we also have to be a bit more specific on which threads are expected to synchronize with which other threads. I also know it is customary in GPU to use a synchronization barrier inside loops, as there is the expectation that each of the loop iterations is mapped to a thread in a very large pool of warps. This is typically not the case on CPUs, where a single CPU thread is sequentially iterating over the parallel iterations that were mapped to that CPU. This is why barriers are not allowed in parallel-for loops (like the construct you propose), and if you need such a barrier, the parallel loop with a barrier would need to be split into two parallel loops.
My suggestion would be to state that a "loop.parallel (%iv1...%ivn)" correspond with a loop where all the iterations 1..n are fully parallel with only a single barrier at the end. This is good for 2D tiling. If there are inner loops that need to be synchronized, then using a nested construct can be employed. I would not allow a barrier in the middle, unless it is a shorthand for splitting the loop into two:loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) {//abarrier//b}is a shorthand forloop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { /* a */ }loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { /* b */ }where either no local variables can span the barrier, or they have to be expanded into arrays. If you come from GPU land, this sort of come for free when using thread-private locals... but this is not the general CPU mode.
Hi Stephan,
+Jeremy +Tim Shen
On Fri, Nov 22, 2019 at 11:15 AM 'Stephan Herhut' via MLIR
<ml...@tensorflow.org> wrote:
> I am looking for feedback on a parallel loop operation. It is very much inspired by GPU code generation needs and I would be particularly interested in feedback on expressiveness and whether it fits other targets needs. I propose a syntax but admittedly not much thought went into its design and the choice of names :)
>
> Cheers
> Stephan
>
> Rationale
> ----
>
> Our current lowering pipeline from structured ops to the GPU dialect uses structured loops (from the loop dialect) as an intermediate step. However, structured loops have sequential semantics and hence we lose the inherent parallelism that structures ops provide by construction. One possible solution would be to lower directly to the GPU dialect from structured ops. While this is feasible, it would require to implement fusion and tiling on the level of GPU operations. These transformations, however, are of a general nature and would apply to similar dialects as well. Consequently, I am proposing to add a further kind of loop that models the parallelism from structured ops but is more generic in order to support loop-fusion and tiling transformations.
>
> To enable transformations like fusion, we will need some form of dependency analysis between loops. This proposal aims to be agnostic of such an analysis. Like with structured sequential loops, I expect to see this kind of loop in multiple dialects, e.g., an affine version for affine analysis. For presentation, I will focus on the (non-affine) loop dialect.
>
> Proposed Syntax
> ----
>
> Below is an example of a parallel loop (assuming it lives in the loop dialect).
>
> loop.parallel (%iv1, %iv2, %iv3) = (%lb1, %lb2, %lb3) to (%ub1, %ub2, %ub3) step (%s1, %s2, %s3) {
> // some computation
> }
>
> Other than in the existing sequential loops, a parallel loop iterates over a multi-dimensional iteration space. In other words, it combines a nest of multiple loops into a single operation. The %iv are bound ssa-values representing a point in such iteration space. The %lb, %ub and %s are uses of ssa-values representing the lower bound, upper bound and step, respectively.
>
> Semantics
> ----
>
> Like with the for operation, the lower bound is inclusive and the upper bound is exclusive.
Can the IVs wrap?
> The body of the parallel loop is executed in parallel, i.e., its side-effects can be perceived in any order as long as there is a corresponding sequential interleaving of operations.
This seems a bit restrictive IMO. If I had the following loop:
loop.parallel (%iv1) = (0) to (2) step (1) {
if (%iv == 0) {
*%ptr_a = 5
*%ptr_b = 10
} else {
%r0 = *%ptr_b
%r1 = *%ptr_a
}
}
I would like to be able to reorder the two stores if ptr_a and ptr_b
don't alias. But I think this current semantic does not allow this
because no sequential interleaving allows (assuming all locations are
zero initially) r0 = 10 && r1 = 0.
What do you think about the stronger invariant: there are no data
races between parallel iterations (data race us undefined behavior).
The only "communication" allowed is via structured reduction
primitives.
> I will later refer to the computation of one point in the iteration space of the loop as a thread of execution. Figuratively, it corresponds to executing the body of the loop for one valid set of iv values.
>
> I propose to use a multi-dimensional construct mostly for convenience. It ensures structurally that loop-bands we want to transform into a GPU call later keep their form. With usual loop-nests, loop invariant code motion or transformations on bound and step values can insert extra computations between bands which have to be cleaned up. If required, the n-dimensional loop can still be split into more traditional 1d-loop nests or mixed-dimensional loop nests. Doing so does not impact semantics.
>
> Synchronization
> ----
>
> To support fusing parallel loops even in the presence of dependencies, I propose a synchronization primitive with the following syntax:
>
> loop.barrier {levels = 1}
Is there a natural extension of loop.barrier to sequential loops?
Otherwise we'll have to say that some parallel loops cannot be
trivially (assuming we only care about correctness) lowered into
sequential loops. I think this is fine but is still worth mentioning.
> The loop does not specify the order in which values get reduced. Consequently, the reduction operation needs to be associative.
I'd rather say: the reduction operation will be applied in arbitrary
order, not that it is associative. For instance, I can imagine the
frontend reduces with `lambda x, y: return x` because they don't
really care about which value "wins".
> Loops also support multiple results, in which case the body needs to contain multiple yield operations. For example, computing the minimum and maximum at the same time would look like the following
>
> %res0, %res1 = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) {
> %v0 = load %mem[%iv1, %iv2, %iv3]
> "loop.yield"(%v0) {
> bb0 (%lhs, %rhs):
> %min = min %lhs, %rhs
> return %min
> }
> "loop.yield"(%v0) {
> bb0 (%lhs, %rhs):
> %max = max %lhs, %rhs
> return %max
> }
> }
>
> In the above example, res0 will contain the minimum and res1 the maximum after the loop has finished.
>
> An alternative would be to have a single variadic yield operation that computes multiple results at once. I am leaning against that approach as it would require to keep inputs to reductions alive across barriers. With separate yields, one can return a value early before joining the barrier.
>
> For some scenarios it might be helpful to also specify a neutral value for the reduction. This could be done as an extra set of operands to the loop, for example
>
> %res = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) initial (%init0) {
> %v0 = load %mem[%iv1, %iv2, %iv3]
> "loop.yield"(%v0) {
How does the loop.yield "know" its initial value? Should we instead have:
"loop.yield"(%v0, %init0) {
or something like that?
> bb0 (%lhs, %rhs):
> %sum = add %lhs, %rhs
> return %sum
> }
> }
>
> An alternative to the body region that specifies the computation would be to specify an operation kind (like add). This would be helpful when lowering to platforms that have a set of built-in reduction operations (SPIR-V for example has this concept). While I consider this use-case important, I also think that using a region is preferable for its added expressiveness. Also, even if the parallel loop would support built-in operations, we would need to raise this information out of higher-level dialects that might not have this notion. For example, HLO would not provide this information as it also allows for arbitrary functions in reductions. If we have to raise anyway, we can also do this when lowering out of parallel loops.
>
> --
> Stephan
>
> --
> You received this message because you are subscribed to the Google Groups "MLIR" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
> To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/CAFPX7RLpMw1N7Rr-%2BnBzD9KMzKFZZ4tPHB7K%3DQpZ86RvYs-jOg%40mail.gmail.com.
--
>
>
> It knows in the same way that it knows which result value of the for loop it is associated with: By order of appearance in the loop body. I would also be fine with passing it explicitly.
I think having an explicit use-def chain (both for loop.yield inputs
and outputs) is useful here. Otherwise I suspect it will be easy to
reorder a loop.yield and "forget" to update the corresponding
input/output.
-- Sanjoy
--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/CABBcqdGb_DXhqpKbNLEEFJR1SBVy4kDCMQzbCq%3D5m0mcLmaFdg%40mail.gmail.com.
A common pattern in the code we want to generate are reductions. It is often beneficial to fuse them into the parallel computation that produces the value to be reduced. To facilitate this on the loop level, I propose to add reduction support to parallel loops. This could be done in the form of a special yield operation. An example would look like the following
%res = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) {
%v0 = load %mem[%iv1, %iv2, %iv3]
"loop.yield"(%v0) {
bb0 (%lhs, %rhs):
%sum = add %lhs, %rhs
return %sum
}
}
The yield operations receives as its sole argument the value of the reduction produced by the current thread. It also needs to specify how values are actually reduced. This is done by providing a region that expects two arguments of the same type as the value yielded. The region computes and returns the produced value.
The loop does not specify the order in which values get reduced. Consequently, the reduction operation needs to be associative.
Loops also support multiple results, in which case the body needs to contain multiple yield operations. For example, computing the minimum and maximum at the same time would look like the following
%res0, %res1 = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) {
%v0 = load %mem[%iv1, %iv2, %iv3]
"loop.yield"(%v0) {
bb0 (%lhs, %rhs):
%min = min %lhs, %rhs
return %min
}
"loop.yield"(%v0) {
bb0 (%lhs, %rhs):
%max = max %lhs, %rhs
return %max
}
}
In the above example, res0 will contain the minimum and res1 the maximum after the loop has finished.
An alternative would be to have a single variadic yield operation that computes multiple results at once. I am leaning against that approach as it would require to keep inputs to reductions alive across barriers. With separate yields, one can return a value early before joining the barrier.
For some scenarios it might be helpful to also specify a neutral value for the reduction. This could be done as an extra set of operands to the loop, for example
%res = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) initial (%init0) {
%v0 = load %mem[%iv1, %iv2, %iv3]
"loop.yield"(%v0) {
bb0 (%lhs, %rhs):
%sum = add %lhs, %rhs
return %sum
}
}
An alternative to the body region that specifies the computation would be to specify an operation kind (like add). This would be helpful when lowering to platforms that have a set of built-in reduction operations (SPIR-V for example has this concept). While I consider this use-case important, I also think that using a region is preferable for its added expressiveness. Also, even if the parallel loop would support built-in operations, we would need to raise this information out of higher-level dialects that might not have this notion. For example, HLO would not provide this information as it also allows for arbitrary functions in reductions. If we have to raise anyway, we can also do this when lowering out of parallel loops.--Stephan
--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/CAFPX7RLpMw1N7Rr-%2BnBzD9KMzKFZZ4tPHB7K%3DQpZ86RvYs-jOg%40mail.gmail.com.
-- Sanjoy
I am looking for feedback on a parallel loop operation. It is very much inspired by GPU code generation needs and I would be particularly interested in feedback on expressiveness and whether it fits other targets needs. I propose a syntax but admittedly not much thought went into its design and the choice of names :)
CheersStephan
Rationale
----
Our current lowering pipeline from structured ops to the GPU dialect uses structured loops (from the loop dialect) as an intermediate step. However, structured loops have sequential semantics and hence we lose the inherent parallelism that structures ops provide by construction. One possible solution would be to lower directly to the GPU dialect from structured ops. While this is feasible, it would require to implement fusion and tiling on the level of GPU operations. These transformations, however, are of a general nature and would apply to similar dialects as well. Consequently, I am proposing to add a further kind of loop that models the parallelism from structured ops but is more generic in order to support loop-fusion and tiling transformations.
To enable transformations like fusion, we will need some form of dependency analysis between loops. This proposal aims to be agnostic of such an analysis. Like with structured sequential loops, I expect to see this kind of loop in multiple dialects, e.g., an affine version for affine analysis. For presentation, I will focus on the (non-affine) loop dialect.
Proposed Syntax
----
Below is an example of a parallel loop (assuming it lives in the loop dialect).
loop.parallel (%iv1, %iv2, %iv3) = (%lb1, %lb2, %lb3) to (%ub1, %ub2, %ub3) step (%s1, %s2, %s3) {
// some computation
}
Other than in the existing sequential loops, a parallel loop iterates over a multi-dimensional iteration space. In other words, it combines a nest of multiple loops into a single operation. The %iv are bound ssa-values representing a point in such iteration space. The %lb, %ub and %s are uses of ssa-values representing the lower bound, upper bound and step, respectively.
Semantics
----
Like with the for operation, the lower bound is inclusive and the upper bound is exclusive. The body of the parallel loop is executed in parallel, i.e., its side-effects can be perceived in any order as long as there is a corresponding sequential interleaving of operations. I will later refer to the computation of one point in the iteration space of the loop as a thread of execution. Figuratively, it corresponds to executing the body of the loop for one valid set of iv values.
----Stephan
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/CAFPX7RLpMw1N7Rr-%2BnBzD9KMzKFZZ4tPHB7K%3DQpZ86RvYs-jOg%40mail.gmail.com.
Hi Stephan-Would we need some modelling of atomic/locking operations and the concept of shared/private variables ?
Below is an example of a parallel loop (assuming it lives in the loop dialect).
loop.parallel (%iv1, %iv2, %iv3) = (%lb1, %lb2, %lb3) to (%ub1, %ub2, %ub3) step (%s1, %s2, %s3) {
// some computation
}
Hi Stephan,
...
only a single barrier at the end. This is good for 2D tiling. If there are inner loops that need to be synchronized, then using a nested construct can be employed. I would not allow a barrier in the middle, unless it is a shorthand for splitting the loop into two:
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) {//abarrier//b}is a shorthand for
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { /* a */ }loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { /* b */ }
Thanks Stephan for the proposal!Hi Alexandre, long time :)Others have noted the "must run in parallel", we also had this discussion internally.Allow me to dump here some pseudo IR that describes how I've been thinking about the problem.We can go the "omp pragma style" implicit parallelism.We can additionally think of this in terms of explicit SPMD style loops (cuda-style and/or MPI style).```
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) on (%proc_id) out_of (%num_proc) {...
loop.sync(%proc_id) // implicitly for all in [0, %num_proc], optionally with some extra information for a subset of processors}```I don't have a very strong preference for the next point but here is how I have been thinking.In terms of representation, I would think `loop` is a low-level concept so I would think that we want 1 loop op per iterator.Higher-levels of IR like structuredops/linalg have "multi-for" semantics.This allows a 1-iterator -> N-D virtual processor dimension mapping.```
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) on (%proc_id_dim1, %proc_id_dim2) out_of (%num_proc_dim_1, %num_proc_dim_2) {...
loop.sync(% proc_id_dim1, % proc_id_dim2) // implicitly for all in [0, %num_proc], optionally with some extra information for a subset of processors}```Then you should start seeing how we can map one loop to both (gridDim.x and blockDim.x).This is still compatible with imperfectly nested parallelism that I think we really want as Alex mentioned.
In the multi-for context, mapping k-iterators to N-processor dimensions I think it will be trickier to explain what maps where (in addition to how do we express imperfectly nested parallel loops).Dropping some parallelism seems simple:```
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) on (%proc_id_dim2) out_of (%num_proc_dim_2) {...
loop.sync(% proc_id_dim2) // implicitly for all in [0, %num_proc], optionally with some extra information for a subset of processors}```Here is after dropping %1
--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/59b947c6-7f60-4ef3-a95c-411dcc82d23a%40tensorflow.org.
If I have to restate what you meant, the loop.yield has an implicit barrier at the end of it.I am not sure as stated it would be trivial to split a loop.parallel with a loop.barrier into two separate loop.parallel operations. To do that you need additional structural constraints on the ops. Off the top of my head1) The loop.parallel must have a region with a single basic block.2) loop.barrier must be only in the region within a loop.parallel.Both of these are traits on ops in MLIR. So that might be OK.
-- Sanjoy
--Mahesh
Thanks Stephan for the proposal!Several folks already mentioned w.r.t. CPU, data race, and atomic/locking semantics, etc. Assuming that we want to support all the following lowering targets:* single CPU thread.* multiple CPU threads.* GPU.
Revolving around this idea, other comments are inline.On Fri, Nov 22, 2019 at 11:15 AM 'Stephan Herhut' via MLIR <ml...@tensorflow.org> wrote:I am looking for feedback on a parallel loop operation. It is very much inspired by GPU code generation needs and I would be particularly interested in feedback on expressiveness and whether it fits other targets needs. I propose a syntax but admittedly not much thought went into its design and the choice of names :)
CheersStephan
Rationale
----
Our current lowering pipeline from structured ops to the GPU dialect uses structured loops (from the loop dialect) as an intermediate step. However, structured loops have sequential semantics and hence we lose the inherent parallelism that structures ops provide by construction. One possible solution would be to lower directly to the GPU dialect from structured ops. While this is feasible, it would require to implement fusion and tiling on the level of GPU operations. These transformations, however, are of a general nature and would apply to similar dialects as well. Consequently, I am proposing to add a further kind of loop that models the parallelism from structured ops but is more generic in order to support loop-fusion and tiling transformations.
To enable transformations like fusion, we will need some form of dependency analysis between loops. This proposal aims to be agnostic of such an analysis. Like with structured sequential loops, I expect to see this kind of loop in multiple dialects, e.g., an affine version for affine analysis. For presentation, I will focus on the (non-affine) loop dialect.
Proposed Syntax
----
Below is an example of a parallel loop (assuming it lives in the loop dialect).
loop.parallel (%iv1, %iv2, %iv3) = (%lb1, %lb2, %lb3) to (%ub1, %ub2, %ub3) step (%s1, %s2, %s3) {
// some computation
}
Other than in the existing sequential loops, a parallel loop iterates over a multi-dimensional iteration space. In other words, it combines a nest of multiple loops into a single operation. The %iv are bound ssa-values representing a point in such iteration space. The %lb, %ub and %s are uses of ssa-values representing the lower bound, upper bound and step, respectively.
Semantics
----
Like with the for operation, the lower bound is inclusive and the upper bound is exclusive. The body of the parallel loop is executed in parallel, i.e., its side-effects can be perceived in any order as long as there is a corresponding sequential interleaving of operations. I will later refer to the computation of one point in the iteration space of the loop as a thread of execution. Figuratively, it corresponds to executing the body of the loop for one valid set of iv values.If we also consider CPUs as lowering targets, we may want to look at their memory models. IIRC Power and ARM don't guarantee a global total order of stores (https://www.youtube.com/watch?v=VogqOscJYvk).I agree with Sanjoy's example (of exchanging store orders), in that for architecture inclusiveness we don't need to over specify the ordering.
------Stephan
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/CAFPX7RLpMw1N7Rr-%2BnBzD9KMzKFZZ4tPHB7K%3DQpZ86RvYs-jOg%40mail.gmail.com.Regards,Tim Shen
To me this feels somewhat like a ‘fast-loose-math’ type of tradeoff to me. It’s really relevant in floating point only, since integer ops are perfectly commutative and associate anyway, so reduction order is completely hidden from the user. In many ML use cases, runtime nondeterminism of addition order is absolutely acceptable, and enforcing it strictly in all cases definitely may lead to performance losses. I feel like it’s use case specific enough regardless that having an attribute which implies strictness may be a reasonable solution. If it’s on, optimization passes / lowering which would lead to non-determinism would be disabled, otherwise they would be allowed.
-- 
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
mlir+uns...@tensorflow.org.
To view this discussion on the web visit 
https://groups.google.com/a/tensorflow.org/d/msgid/mlir/CAP8FB4J4hTcgN%2Bb-DjFvi4pvasd2P7MT6c4_WQ7V-R_3my8R%3Dg%40mail.gmail.com.
----- Original message -----
From: "Bruestle, Jeremy" <jeremy....@intel.com>
To: Sanjoy Das <san...@google.com>
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/9D6ABEAE-6F8A-4C56-9A58-BFB6CE640A0A%40intel.com .
Just catching up after vacation... I'm definitely in favor of the overall proposal, which is somewhat in line with our work on Stripe. Hopefully I'm not too late to the game. Some comments:
1) The decision as to whether or not to include barriers is a significant choice:
Adding barriers prevents the simple lowering to serial loops. For example, imagine the following case which implements an 'in-place' transpose:
loop.parallel (%i, %j) = (0) to (10) step (1) {
%x = load %ref[%i, %j]
loop.barrier { levels = 1 }
store %x, %ref[%j, %i]
}
Due to the definition and use of the scalar %x across the barrier, this cannot be split into two independent parallel for loops, and the loop as is cannot be lowered to a serial implementation. A more complex lowering to serial form may consider 'spilling' all values that cross the barrier into a newly allocated buffer whose size is the total loop range, but this requires significantly more complexity during lowering. If barrier support was removed, all parallel loops would be able to be safely lowered to serial loops trivially. That said, some hardware (GPU's in particular) do natively support barrier semantics, and it's hard to represent 'fused' parallel loops properly without a barrier instruction. I'm not sure what the right solution here is, I just want to clarify that barriers are a heavyweight concept in terms of semantic impact. Notably, the stipe implementation decided to forgo barriers (see #3 below)
2) Reduction support:
- I think that most reductions are in practice both associative and commutative, and some methods of implementing reductions will benefit from allowing the reduction to be commutative. For example, multiple threads and atomic in-memory add require commutative reductions, since the order the additions across threads is arbitrary.
- I think that nothing should be specified regarding the specific order or method of performing the reduction. In cases where the reduction is truly exactly commutative and associative (say integer addition), there is no reason to limit the method of reduction. Where reduction is 'weakly' associative (say floating point addition), it's true that the method of reduction can effect results, but if your problem is truly numerically unstable enough that the order matters, you need to be using other techniques.
- It may be useful to specify or require an 'identity' value (i.e. 0 in the case of add, -inf in the case of max), to be specified for an reduction. There are two possible reasons: A) In the case of a parallel for loop with a trip-count of zero, the result of the reduction is otherwise unspecified, B) In some implementations (say threaded atomic add again) either the output must be pre-initialized to the identity value, or one trip of the parallel loop must be peeled off to estabilish an initial value.
Presuming we add identity, I would suggest describing the semantics as follows:
The output of the reduction must be equivalent the evaluation of some tree of calls to the reduction region, where there may be any number of identity leaf nodes, and each value yielded must appears in exactly one leaf. If there are zero values yielded, the result must be the identity. Perhaps the wording there is a bit turgid, but conceptually what I'm trying to convey is: order is arbitrary (as is associativity) and any number of 'identity' elements may be combined in. The second bit is to allow hardware that has say multiple accumulators (each of which is independently initialized to the identity) which are combined at the end (perhaps by say a tree rollup) to be valid.
Of course for normal reductions (say integer addition), all of the above is trivially true, but if we want to support arbitrary reductions via a region as opposed to fixed set, I think it's important to clarify just exactly how much flexibility we want to allow in terms of hardware specific lowering. In my opinion, we should provide maximal flexibility. Now of course, as defined above if you do use an inexactly associative reduction like floating point addition, you may be any order of additions, but I think once you allow any parallel reduction at all, you've pretty much decided that such a thing is acceptable, and to require deterministic behavior in such a case basically prevents most real world use cases and optimizations almost immediately. Note: the additional requirements of an identity element per above have no effect of floating point add, as 0.0 is an exact identity in floating point addition anyway.
3) Comparison to Stripe/Tile
As a side note, Stripe and Tile implement parallel loops and reductions, in a similar but distinct way.
- Our parallel loop has basically identical semantics, other than that we have chosen not to support barriers, since lowering to serial implementation is very important. Instead, we do any final barrier requiring fusions post lowering to GPU.
- We support an 'in-memory' reduction semantics, where there is an 'aggregate' instruction similar to 'store' but with apparently 'atomic' in memory reductions. These reductions currently come from a fixed set (each with it's own identity values), although we have considered the region based approach. However we decided against it due to the fact that we had no specific use cases, and it makes pattern matching much harder to do with DRR.
We semantically allow arbitrary reordering, changes in associativity, etc, and despite looking like 'in-memory' reductions, in practice most aggregations are lowered to some in-register operations, adder trees, etc, since basically none of the platforms of interest support atomic floating point addition (the common case) The advantage of the 'in-memory' reduction semantics is that it allows a single parallel loops to produce multiple independent reductions. It does require some complex aliasing analysis however, which we can get away with due to the affine nature of the accesses.
parallel proposal
When designing a parallel abstraction, we have some basic design choices. Here are two.
GPU-centric vs Target agnostic:
Code such as below are inherently GPU-centric as it assume that each iteration is mapped to a single hardware thread:
loop.parallel (%i, %j) = (0) to (10) step (1) {
%x = load %ref[%i, %j]
loop.barrier { levels = 1 }
store %x, %ref[%j, %i]
}
as if one thread was mapped two iterations, you would have a deadlock as the barrier expect the computations prior to the barrier to be completed for all iterations. It is always possible to map this example to something that would work on a CPU by splitting the loop, but then the simple loop implied by the construct is not so simple anymore. I.e. there is a gap between what is described with the actual behavior.
High-level vs multi-level
Consider a reduction on a GPU. We may have a high level way to say "reduce this across warps and thread-blocks" and be done with it (aka a high-level representation). Alternatively, we may want a representation that also let us lower such reduction into a lower optimized form where we would have each warp performing a reduction on the data assigned to them, have a phase where data is combined to one value per thread block, and finally having one value for all thread blocks.
For CPUs, we may want to express some corse parallelism for tiles, and then using threads within a core to process in parallel work within a tile.
Such multi-level approach forces you to have nested parallelism with barriers.
Suggestion
There is nothing wrong with doing a GPU-centric version of parallel loop, it would probably be more appropriate in a GPU-specific dialect.
I would argue for a target-agnostic representation, structured in a way to map relatively easily into GPU-specific loops, CPUs, as well as more specialized hardware such as systolic arrays.
If we go that route, probably having two types of parallel loops, one with a barrier joining all the thread at the end, and one without, that would be a good start. Using Nicolas' suggested notation:
loop.parallel (%i) = (0) to (10) by (1) on (%proc_id) out_of (%num_proc) { A }
would execute the work A for all iterations and then when each participating thread has completed, they would wait in a barrier (e.g. a kernel at the thread-block level, or a CUDA syncthreads for threads in all warps.
A non blocking version would be
loop.parallel_nowait (%j) = (0) to (10) by (1) on (%proc_id) out_of (%num_proc) { B }
where threads don’t synchronize at the end of the parallel loop.
Loops with multiple indices:
loo.parallel(%i, %j) = (0, 0) to (10, 10) by (1,1) {A}
corresponds to
loo.parallel(%i) = (0) to (10) by (1) {
loop.parallel_nowait (%j) = (0) to (10) by (1) {
A
}
}
where the above is essentially a CUDA kernel if the outer loop is iterating over thread blocks and the inner loop is iterating over the warps.
Loops with no sequential code between parallel regions are “perfect” GPU code; sequences of inner parallel loops can be merged into one big inner loop with CUDA syncthreads when there is one thread per data point.
Hi Jeremy. Thanks for the comments!On Tue, Dec 3, 2019 at 7:10 PM Bruestle, Jeremy <jeremy....@intel.com> wrote:Just catching up after vacation... I'm definitely in favor of the overall proposal, which is somewhat in line with our work on Stripe. Hopefully I'm not too late to the game. Some comments:
1) The decision as to whether or not to include barriers is a significant choice:
Adding barriers prevents the simple lowering to serial loops. For example, imagine the following case which implements an 'in-place' transpose:
loop.parallel (%i, %j) = (0) to (10) step (1) {
%x = load %ref[%i, %j]
loop.barrier { levels = 1 }
store %x, %ref[%j, %i]
}
Due to the definition and use of the scalar %x across the barrier, this cannot be split into two independent parallel for loops, and the loop as is cannot be lowered to a serial implementation. A more complex lowering to serial form may consider 'spilling' all values that cross the barrier into a newly allocated buffer whose size is the total loop range, but this requires significantly more complexity during lowering. If barrier support was removed, all parallel loops would be able to be safely lowered to serial loops trivially. That said, some hardware (GPU's in particular) do natively support barrier semantics, and it's hard to represent 'fused' parallel loops properly without a barrier instruction. I'm not sure what the right solution here is, I just want to clarify that barriers are a heavyweight concept in terms of semantic impact. Notably, the stipe implementation decided to forgo barriers (see #3 below)I agree with the notion that barriers are heavyweight and model a behavior that is not available on all targets. They are also somewhat optional. As long as one designs the lowering pipeline such that no higher-level dialect creates barriers one can use the parallel loop without. We could make this a design criterion in that transformations on parallel loops should not rely on barriers unless they are strictly required to express the generated code. As an example, we could have parallel loop fusion that does not insert barriers instead of having a naive fusion that inserts barriers and an optimization to remove them (just an example).
--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/CAFPX7R%2Bew6taO4XQjgPcOa7CxXemCMBEUSbUP9pT%2B2QQfESppA%40mail.gmail.com.
parallel proposal
When designing a parallel abstraction, we have some basic design choices. Here are two.
GPU-centric vs Target agnostic:
Code such as below are inherently GPU-centric as it assume that each iteration is mapped to a single hardware thread:
loop.parallel (%i, %j) = (0) to (10) step (1) {
%x = load %ref[%i, %j]
loop.barrier { levels = 1 }
store %x, %ref[%j, %i]
}
as if one thread was mapped two iterations, you would have a deadlock as the barrier expect the computations prior to the barrier to be completed for all iterations. It is always possible to map this example to something that would work on a CPU by splitting the loop, but then the simple loop implied by the construct is not so simple anymore. I.e. there is a gap between what is described with the actual behavior.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/OF8EC26B4A.522016F4-ON002584C6.00550F0D-002584C6.00551D20%40notes.na.collabserv.com.
loop.parallel (%i0, %j0) = (0) to (10) step (5) {
%x[%i1, %j1] = load %ref[%i0+%i1, %j0 + %j1]
loop.barrier { levels = 1 }
store %x[%i1, %j1], %ref[%i0+%i1, %j0+%j1i]
}
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/CANF-O%3DaPEhtJZ0zL4t0oSBK759tE-rQgAk%2Bz2K5erWz6yQX%3DCw%40mail.gmail.com.
From:
Mehdi AMINI <joke...@gmail.com>
Date: Wednesday, December 4, 2019 at 8:23 AM
To: Stephan Herhut <her...@google.com>
Cc: "Bruestle, Jeremy" <jeremy....@intel.com>, MLIR <ml...@tensorflow.org>
Subject: Re: [mlir] [RFC][GPU] Parallel Loop Operation/Dialect
On Wed, Dec 4, 2019 at 3:12 PM 'Stephan Herhut' via MLIR <ml...@tensorflow.org> wrote:
Hi Jeremy. Thanks for the comments!
On Tue, Dec 3, 2019 at 7:10 PM Bruestle, Jeremy <jeremy....@intel.com> wrote:
Just catching up after vacation... I'm definitely in favor of the overall proposal, which is somewhat in line with our work on Stripe. Hopefully I'm not too late to the game. Some comments:
1) The decision as to whether or not to include barriers is a significant choice:
Adding barriers prevents the simple lowering to serial loops. For example, imagine the following case which implements an 'in-place' transpose:
loop.parallel (%i, %j) = (0) to (10) step (1) {
%x = load %ref[%i, %j]
loop.barrier { levels = 1 }
store %x, %ref[%j, %i]
}
Due to the definition and use of the scalar %x across the barrier, this cannot be split into two independent parallel for loops, and the loop as is cannot be lowered to a serial implementation. A more complex lowering to serial form may consider 'spilling' all values that cross the barrier into a newly allocated buffer whose size is the total loop range, but this requires significantly more complexity during lowering. If barrier support was removed, all parallel loops would be able to be safely lowered to serial loops trivially. That said, some hardware (GPU's in particular) do natively support barrier semantics, and it's hard to represent 'fused' parallel loops properly without a barrier instruction. I'm not sure what the right solution here is, I just want to clarify that barriers are a heavyweight concept in terms of semantic impact. Notably, the stipe implementation decided to forgo barriers (see #3 below)
I agree with the notion that barriers are heavyweight and model a behavior that is not available on all targets. They are also somewhat optional. As long as one designs the lowering pipeline such that no higher-level dialect creates barriers one can use the parallel loop without. We could make this a design criterion in that transformations on parallel loops should not rely on barriers unless they are strictly required to express the generated code. As an example, we could have parallel loop fusion that does not insert barriers instead of having a naive fusion that inserts barriers and an optimization to remove them (just an example).
Is barrier pessimizing anything fundamentally? I understand that some extra memory can be required, but that seems like a fundamental problem: if you split into two loops instead of having as barrier, you also need the buffer to materialize the data reuse from the first loop to the second one, wouldn't you?
A fundamental question to me is: are there cases where you *can't* perform this transformations?
--
Mehdi
I believe such a transformation can always be done (or at least I can’t think of any counter-examples), however I would argue that barriers (especially level=2 or above) complicate all of the analysis as well as lowering extensively, and moreover, barriers of this sort are really only meaningful in GPU like parallelism (and even then, only within a given work group). Having thought about it more, I’m now of the opinion that making the general case less complex and more amenable to analysis (i.e. removing barriers) and moving barrier semantics into the GPU specific dialect is probably a better way to go. Additionally, any ‘loop splitting’ algorithm would need to make concrete buffers at some point, which including thing like deciding layout, etc. In the GPU case, you could use loop fusion with a barrier after lowering, which is even easier than normal loop fusion, and additionally only ‘removes’ buffers so pass ordering with respect to layout is more flexible. In most cases, the original representation is unfused anyway, so it also seems more natural in terms of overall pass/lowering flow to add loop fusion than to make a ‘loop splitting’.
-Jeremy
parallel proposal
When designing a parallel abstraction, we have some basic design choices. Here are two.
GPU-centric vs Target agnostic:
Code such as below are inherently GPU-centric as it assume that each iteration is mapped to a single hardware thread:
loop.parallel (%i, %j) = (0) to (10) step (1) {
%x = load %ref[%i, %j]
loop.barrier { levels = 1 }
store %x, %ref[%j, %i]
}
as if one thread was mapped two iterations, you would have a deadlock as the barrier expect the computations prior to the barrier to be completed for all iterations. It is always possible to map this example to something that would work on a CPU by splitting the loop, but then the simple loop implied by the construct is not so simple anymore. I.e. there is a gap between what is described with the actual behavior.
High-level vs multi-level
Consider a reduction on a GPU. We may have a high level way to say "reduce this across warps and thread-blocks" and be done with it (aka a high-level representation). Alternatively, we may want a representation that also let us lower such reduction into a lower optimized form where we would have each warp performing a reduction on the data assigned to them, have a phase where data is combined to one value per thread block, and finally having one value for all thread blocks.
For CPUs, we may want to express some corse parallelism for tiles, and then using threads within a core to process in parallel work within a tile.
Such multi-level approach forces you to have nested parallelism with barriers.
Suggestion
There is nothing wrong with doing a GPU-centric version of parallel loop, it would probably be more appropriate in a GPU-specific dialect.
I would argue for a target-agnostic representation, structured in a way to map relatively easily into GPU-specific loops, CPUs, as well as more specialized hardware such as systolic arrays.
If we go that route, probably having two types of parallel loops, one with a barrier joining all the thread at the end, and one without, that would be a good start. Using Nicolas' suggested notation:
loop.parallel (%i) = (0) to (10) by (1) on (%proc_id) out_of (%num_proc) { A }
would execute the work A for all iterations and then when each participating thread has completed, they would wait in a barrier (e.g. a kernel at the thread-block level, or a CUDA syncthreads for threads in all warps.
A non blocking version would be
loop.parallel_nowait (%j) = (0) to (10) by (1) on (%proc_id) out_of (%num_proc) { B }
where threads don’t synchronize at the end of the parallel loop.
Loops with multiple indices:
loo.parallel(%i, %j) = (0, 0) to (10, 10) by (1,1) {A}
corresponds to
loo.parallel(%i) = (0) to (10) by (1) {
loop.parallel_nowait (%j) = (0) to (10) by (1) {
A
}
}
where the above is essentially a CUDA kernel if the outer loop is iterating over thread blocks and the inner loop is iterating over the warps.
Loops with no sequential code between parallel regions are “perfect” GPU code; sequences of inner parallel loops can be merged into one big inner loop with CUDA syncthreads when there is one thread per data point.
From:
Stephan Herhut <her...@google.com>
Date: Wednesday, December 4, 2019 at 6:12 AM
To: "Bruestle, Jeremy" <jeremy....@intel.com>
Cc: MLIR <ml...@tensorflow.org>
Subject: Re: [mlir] [RFC][GPU] Parallel Loop Operation/Dialect
Hi Jeremy. Thanks for the comments!
I prefer the first one, since the idea of a non-loop-invariant identity seems like madness to me, although it is a bit less syntactically local unfortunately.
3) Comparison to Stripe/Tile
As a side note, Stripe and Tile implement parallel loops and reductions, in a similar but distinct way.
- Our parallel loop has basically identical semantics, other than that we have chosen not to support barriers, since lowering to serial implementation is very important. Instead, we do any final barrier requiring fusions post lowering to GPU.
- We support an 'in-memory' reduction semantics, where there is an 'aggregate' instruction similar to 'store' but with apparently 'atomic' in memory reductions. These reductions currently come from a fixed set (each with it's own identity values), although we have considered the region based approach. However we decided against it due to the fact that we had no specific use cases, and it makes pattern matching much harder to do with DRR.
I assume easier pattern matching here refers to fixed ops vs. regions? That is true. Would it be good enough for pattern matching to expect a form where the body of the yield is essentially a function call? We could also add a version of yield that gets a function as attribute but then has an empty region.
Yes, currently matching inner regions in DRR has been causing us issues, and we recently rewrote one of our core constructions to be regionless to reduce complexity. However I imagine DRR will improve in regard to support for regions. Still, if you are lowering to hardware that has special support some fixed set of reductions, you will always end up pattern matching back into what is effectively an enum anyway. The function as an attribute is interesting to me, and actually, it brings up a thought: Isn’t a reduction effectively a combination of the actual binary reduction function *and* the identity? Maybe some method where reductions were declared elsewhere and then only referenced in the yield would help both pattern matching and handling of identity. Most reductions will be used over and over again throughout a program (and in fact, in most ML use cases, all reduction will be max or add).
We semantically allow arbitrary reordering, changes in associativity, etc, and despite looking like 'in-memory' reductions, in practice most aggregations are lowered to some in-register operations, adder trees, etc, since basically none of the platforms of interest support atomic floating point addition (the common case) The advantage of the 'in-memory' reduction semantics is that it allows a single parallel loops to produce multiple independent reductions. It does require some complex aliasing analysis however, which we can get away with due to the affine nature of the accesses.
This is interesting. The proposal supports multiple reductions via multiple yield operations in a single body. I am not sure that is the same as what your approach supports. Do you see any use cases from your experience that could not be modelled with multiple yields? So in essence, you can have
%prod, %sum = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) identity (%id0, %id1) {
  %v0 = load %mem[%iv1, %iv2, %iv3]
  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %prod = mul %lhs, %rhs
      return %prod
  }  
  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %sum = add %lhs, %rhs
      return %sum
  }  
}
Let me begin by saying that what follows in purely an aside, as I think the semantics for the proposal are good, and our use case is distinct and I wouldn’t suggest adopting it. That said…
So we allow something roughly like (in a close version of the new syntax):
loop.parallel (%i, %j) = (0, 0) to (10, 10) step(1, 1) {
  %v0 = load %mat[%i, %j]
  aggregate add %vec[%i], %v0
}
Which does reduces a matrix into a vector.
This can be implementing in the existing proposal as:
loop.parallel (%i) = (0) to (10) step(1) {
  %sum = loop.parallel (%j) = (0) to (10) step(1) identity(0) {
    %v0 = load %mat[%i, %j]
loop.yield (%v0) {
    bb0 (%lhs, %rhs):
      %sum = add %lhs, %rhs
      return %sum
    }
}
store %vec[%i], %sum
}
However in general this transformation is complex to implement. For example, in tile we can actually write (again in a different syntax):
loop.parallel (%i, %j) = (0, 0) to (10, 10) step(1, 1) {
  %v0 = load %mat[%i, %j]
  aggregate add %vec[%i + %j], %v0
}
This can be converted into two-loop form (and indeed we do eventually do that), however it requires an affine transformation of the index space along the lines of:
%x = %i + %j
%y = %j
So that
%i -> %x - %y
%j -> %y
Additionally, constraints must be added as well to limit loop bound on the reading side, and loop sizes change so we get something like:
loop.parallel (%x) = (0) to (19) step(1) {
  %bound = max %x, 10
  %sum = loop.parallel (%y) = (0) to (%bound) step(1) identity(0) {
    %v0 = load %mat[%x - %y, %y]
loop.yield (%v0) {
    bb0 (%lhs, %rhs):
      %sum = add %lhs, %rhs
      return %sum
    }
}
store %vec[%x], %sum
}
Now, I’m not in any way suggesting we should adopt something like our ‘aggregate’ for this proposal. In fact, I think it’s a poor choice, as lowering ‘aggregate’ to efficient code is complex and relies on a number of ‘affine/polyhedral’ tricks and probably should be kept in tile/stripe only, or other similar dialects. I just think it’s an interesting thing to compare/contrast.
-Jeremy
----- Original message -----
From: Stephan Herhut <her...@google.com>
To: Alexandre Eichenberger <al...@us.ibm.com>, Nicolas Vasilache <n...@google.com>
Cc: George Karpenkov <ches...@google.com>, "Bruestle, Jeremy" <jeremy....@intel.com>, MLIR <ml...@tensorflow.org>, Sanjoy Das <san...@google.com>
Subject: [EXTERNAL] Re: [mlir] [RFC][GPU] Parallel Loop Operation/Dialect
Date: Wed, Dec 4, 2019 12:37 PM
On Wed, Dec 4, 2019 at 4:29 PM Alexandre Eichenberger <al...@us.ibm.com> wrote:parallel proposal
When designing a parallel abstraction, we have some basic design choices. Here are two.
GPU-centric vs Target agnostic:
Code such as below are inherently GPU-centric as it assume that each iteration is mapped to a single hardware thread:
loop.parallel (%i, %j) = (0) to (10) step (1) {
%x = load %ref[%i, %j]
loop.barrier { levels = 1 }
store %x, %ref[%j, %i]
}
as if one thread was mapped two iterations, you would have a deadlock as the barrier expect the computations prior to the barrier to be completed for all iterations. It is always possible to map this example to something that would work on a CPU by splitting the loop, but then the simple loop implied by the construct is not so simple anymore. I.e. there is a gap between what is described with the actual behavior.
I don't see the barrier as GPU specific. What I agree with is that more uses of the barrier have a direct mapping to GPU than to CPU. In its most general form, it is not supported by any platform without splitting loops. Even for GPUs, one might want to tile multiple levels to reduce the number of threads and hence will need to split the loop and materialize values.I also see the barrier as orthogonal to the parallel loop itself. One can perfectly well use parallel loop without using barriers.
AEE: Barriers are used both on CPUs and GPUs, absolutely. What is unusual with the above representation is that it reflect a "typical" GPU usage where there is one thread per iteration (with guaranteed progress on the threads). It's a level of abstraction that is more removed from most implementation on CPUs (and GPUs, as you mentioned where you may want to tile for performance). I would rather not have a representation that relies on the one thread per iteration & guaranteed forward progress for all threads.
High-level vs multi-level
Consider a reduction on a GPU. We may have a high level way to say "reduce this across warps and thread-blocks" and be done with it (aka a high-level representation). Alternatively, we may want a representation that also let us lower such reduction into a lower optimized form where we would have each warp performing a reduction on the data assigned to them, have a phase where data is combined to one value per thread block, and finally having one value for all thread blocks.
For CPUs, we may want to express some corse parallelism for tiles, and then using threads within a core to process in parallel work within a tile.
Such multi-level approach forces you to have nested parallelism with barriers.
I would argue that this is a specific lowering (or implementation detail) of parallel execution on CPUs. Can this not be modelled in the same way that it is on GPUs? The difference being that the inner loops after tiling are mapped to sequential loops. Something likeloop.parallel (%iv1, %iv2) [...] { A }could turn into (ignoring boundaries for now)loop.parallel (%iv1_s) = (%lb1) to (%ub1) step (%ub1 / proc_ids) {for %inner = 0 to %ub1 / proc_ids {%iv1 = %iv1_s + %inner;for %iv2 = %lb2 to %ub2 step %s2 {A}}}where now only the outermost dimension is parallel and has proc_ids many "threads". In a next lowering step, one could then offload the parallel op to workers. This could have been done gradually by a transformation that can tile loop.parallel in interesting ways and a second one that turns some loop.parallel into sequential loops.
AEE: I agree with your point. If you wanted the inner loop to go in parallel, e.g. within the threads on a core, or the threads in a warp, you a need nested parallelism representation. You also need to define if there is a barrier at the end of the parallel loop.
Suggestion
There is nothing wrong with doing a GPU-centric version of parallel loop, it would probably be more appropriate in a GPU-specific dialect.
I would argue for a target-agnostic representation, structured in a way to map relatively easily into GPU-specific loops, CPUs, as well as more specialized hardware such as systolic arrays.
If we go that route, probably having two types of parallel loops, one with a barrier joining all the thread at the end, and one without, that would be a good start. Using Nicolas' suggested notation:
loop.parallel (%i) = (0) to (10) by (1) on (%proc_id) out_of (%num_proc) { A }
I do not understand the meaning of the "on (%proc_id) out_of (%num_proc)" part here.
AEE: I reused here the notation from Nicolas in a prior post on this exchange. %num_proc expresses the total number of threads in the parallel loop, and %proc_id provides an id for a given threads (range 0.. %num_proc-1, inclusive)
AEE: the difference is in the number of barriers executed. Assume N parallel threads in the outer loop, and M threads for each of the inner loops. In the second of 3 examples above, N + N*M barriers are executed. In the first, the two loops can be considered as a single unit that can be mapped to threads. For example, if the two loops were participating in a 2D mapping of tiles, this 2D tiles would be mapped all together to threads without forcing one row of tiles to completely finish before starting the next row. So if you have X threads participating to the two loops, you would have X barriersThe semantic of the 3rd example above, some_op() is executed once before guaranteeing that all of the work in the inner loop is completedThanksStephan
--Stephan