[RFC][GPU] Parallel Loop Operation/Dialect

519 views
Skip to first unread message

Stephan Herhut

unread,
Nov 22, 2019, 2:15:05 PM11/22/19
to MLIR
I am looking for feedback on a parallel loop operation. It is very much inspired by GPU code generation needs and I would be particularly interested in feedback on expressiveness and whether it fits other targets needs. I propose a syntax but admittedly not much thought went into its design and the choice of names :)

Cheers
  Stephan

Rationale
----

Our current lowering pipeline from structured ops to the GPU dialect uses structured loops (from the loop dialect) as an intermediate step. However, structured loops have sequential semantics and hence we lose the inherent parallelism that structures ops provide by construction. One possible solution would be to lower directly to the GPU dialect from structured ops. While this is feasible, it would require to implement fusion and tiling on the level of GPU operations. These transformations, however, are of a general nature and would apply to similar dialects as well. Consequently, I am proposing to add a further kind of loop that models the parallelism from structured ops but is more generic in order to support loop-fusion and tiling transformations.

To enable transformations like fusion, we will need some form of dependency analysis between loops. This proposal aims to be agnostic of such an analysis. Like with structured sequential loops, I expect to see this kind of loop in multiple dialects, e.g., an affine version for affine analysis. For presentation, I will focus on the (non-affine) loop dialect.

Proposed Syntax
----

Below is an example of a parallel loop (assuming it lives in the loop dialect).

loop.parallel (%iv1, %iv2, %iv3) = (%lb1, %lb2, %lb3) to (%ub1, %ub2, %ub3) step (%s1, %s2, %s3) {
  // some computation
}

Other than in the existing sequential loops, a parallel loop iterates over a multi-dimensional iteration space. In other words, it combines a nest of multiple loops into a single operation. The %iv are bound ssa-values representing a point in such iteration space. The %lb, %ub and %s are uses of ssa-values representing the lower bound, upper bound and step, respectively.

Semantics
----

Like with the for operation, the lower bound is inclusive and the upper bound is exclusive. The body of the parallel loop is executed in parallel, i.e., its side-effects can be perceived in any order as long as there is a corresponding sequential interleaving of operations. I will later refer to the computation of one point in the iteration space of the loop as a thread of execution. Figuratively, it corresponds to executing the body of the loop for one valid set of iv values.

I propose to use a multi-dimensional construct mostly for convenience. It ensures structurally that loop-bands we want to transform into a GPU call later keep their form. With usual loop-nests, loop invariant code motion or transformations on bound and step values can insert extra computations between bands which have to be cleaned up. If required, the n-dimensional loop can still be split into more traditional 1d-loop nests or mixed-dimensional loop nests. Doing so does not impact semantics.

Synchronization
----

To support fusing parallel loops even in the presence of dependencies, I propose a synchronization primitive with the following syntax:

loop.barrier {levels = 1}

where levels gives the number of surrounding parallel loop operations that are impacted by this constraint. Semantically, a barrier operation constrains the execution order of the threads defined by the surrounding loops. All impacted threads need to first reach the barrier before any of them may continue. As an example, consider

loop.parallel (%iv1, %iv2, %iv3) = (%lb1, %lb2, %lb3) to (%ub1, %ub2, %ub3) step (%s1, %s2, %s3) {
  %v0 = load %mem[%iv1, %iv2, %iv3]
  %v2 = add %v2, %v2
  loop.barrier { levels = 1 }
  store %v2, %othermem[%iv1, %iv2, %iv3]
}

Here, all threads first read a value from mem before storing to othermem. So even if both memories alias, this operation is safe. Consequently, it is always safe to fuse two parallel loops by concatenating their bodies and inserting a barrier in between.

Lowering the above example to GPU or a sequential implementation requires that all values that are live across the barrier are preserved. On the GPU this can be expressed in the target IR like ptx (within limits) while a sequential lowering for CPU could for example allocate a region of memory to store intermediate values. An example would be

%tmp = alloc(%ub1, %ub2, %ub3) : memref<?x?x?xf32>
loop.for %iv1 = %lb1 to %ub1 step %s1 {
  loop.for %iv2 = %lb2 to %ub2 step %s2 {
    loop.for %iv3 = %lb3 to %ub3 step %s3 {
      %v0 = load %mem[%iv1, %iv2, %iv3]
      %v2 = add %v2, %v2
      store %v2, %tmp[%iv1, %iv2, %iv3]
    }
  }
}
loop.for %iv1 = %lb1 to %ub1 step %s1 {
  loop.for %iv2 = %lb2 to %ub2 step %s2 {
    loop.for %iv3 = %lb3 to %ub3 step %s3 {
      %v2 = load %tmp[%iv1, %iv2, %iv3]
      store %v2, %othermem[%iv1, %iv2, %iv3]
    }
  }
}

Ideally, one would want to elide the barrier operation. This impacts the order of computation and hence may only be done if it is safe. In the above example, the sync can be elided if mem and othermem do not alias.

Tiling
----

Parallel loops support tiling in the natural way by manipulating the step value and inserting nested parallel loops into the body. Below is an example (with inline constants for readability)

loop.parallel (%o1, %o2, %o3) = (0, 0, 0) to (32, 32, 32) step (4, 4, 4) {
  loop.parallel (%t1, %t2, %t3) = (0, 0, 0) to (4, 4, 4) step (1, 1, 1) {
    %iv1 = add %o1, %t1
    %iv2 = add %o2, %t2
    %iv3 = add %o3, %t3
    %v0 = load %mem[%iv1, %iv2, %iv3]
    %v2 = add %v2, %v2
    store %v2, %othermem[%iv1, %iv2, %iv3]
  }
}

Alternatively, if the shape allows (or padding is applied) tiling can also be performed by reshaping the iteration space of the loop. In the above example, one could for example use a shape of [8,4,8,4,8,4].

Reductions
----

A common pattern in the code we want to generate are reductions. It is often beneficial to fuse them into the parallel computation that produces the value to be reduced. To facilitate this on the loop level, I propose to add reduction support to parallel loops. This could be done in the form of a special yield operation. An example would look like the following

%res = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) {
  %v0 = load %mem[%iv1, %iv2, %iv3]
  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %sum = add %lhs, %rhs
      return %sum
  }  
}

The yield operations receives as its sole argument the value of the reduction produced by the current thread. It also needs to specify how values are actually reduced. This is done by providing a region that expects two arguments of the same type as the value yielded. The region computes and returns the produced value.

The loop does not specify the order in which values get reduced. Consequently, the reduction operation needs to be associative.

Loops also support multiple results, in which case the body needs to contain multiple yield operations. For example, computing the minimum and maximum at the same time would look like the following

%res0, %res1 = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) {
  %v0 = load %mem[%iv1, %iv2, %iv3]
  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %min = min %lhs, %rhs
      return %min
  }  
  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %max = max %lhs, %rhs
      return %max
  }  
}

In the above example, res0 will contain the minimum and res1 the maximum after the loop has finished.

An alternative would be to have a single variadic yield operation that computes multiple results at once. I am leaning against that approach as it would require to keep inputs to reductions alive across barriers. With separate yields, one can return a value early before joining the barrier.

For some scenarios it might be helpful to also specify a neutral value for the reduction. This could be done as an extra set of operands to the loop, for example

%res = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) initial (%init0) {
  %v0 = load %mem[%iv1, %iv2, %iv3]
  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %sum = add %lhs, %rhs
      return %sum
  }  
}

An alternative to the body region that specifies the computation would be to specify an operation kind (like add). This would be helpful when lowering to platforms that have a set of built-in reduction operations (SPIR-V for example has this concept). While I consider this use-case important, I also think that using a region is preferable for its added expressiveness. Also, even if the parallel loop would support built-in operations, we would need to raise this information out of higher-level dialects that might not have this notion. For example, HLO would not provide this information as it also allows for arbitrary functions in reductions. If we have to raise anyway, we can also do this when lowering out of parallel loops.

--
Stephan

Alexandre Eichenberger

unread,
Nov 22, 2019, 2:53:52 PM11/22/19
to her...@google.com, ml...@tensorflow.org
Hi Stephan,
 
having being involved in the OpenMP committee for a while now, I like the proposal and think it will be useful to define parallel constructs within MLIR. I believe the current proposal below may need further clarification, esp with respect to synchronization.
 
Starting with a 1D parallel region:
 
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { ... }
 
the traditional model is a fork/join model, where all the treads work in parallel until all iterations are done and then they wait for others in a barrier, after which only the main tread continue execution.
 
When extending this model more dimensions, the proposal has to be specific if, for example
 
loop.parallel (%iv1, %iv2) = (%lb1, %lb2) to (%ub1, %ub2) step (%s1, %s2) { ... }
 
is simply a shorthand for 
 
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { 
  loop.parallel (%iv2) = (%lb2) to (%ub2) step (%s2) { ... }
}
 
or not. If it is, this means that the code execute N+1 barriers (where N is the trip count of the inner loop), as each of the threads that participate to one inner parallel loop are expected to synchronize at the end of it. While this is often desired (and the default model in OpenMP), it is not always best. There are cases, esp when 2+D tiling where you want all of the tiles to go in parallel. OpenMP has a fairly unsatisfying approach to this (collapsing multiple loops in a parallel). 
 
For the barriers, we also have to be a bit more specific on which threads are expected to synchronize with which other threads. I also know it is customary in GPU to use a synchronization barrier inside loops, as there is the expectation that each of the loop iterations is mapped to a thread in a very large pool of warps. This is typically not the case on CPUs, where a single CPU thread is sequentially iterating over the parallel iterations that were mapped to that CPU. This is why barriers are not allowed in parallel-for loops (like the construct you propose), and if you need such a barrier, the parallel loop with a barrier would need to be split into two parallel loops.
 
My suggestion would be to state that a "loop.parallel (%iv1...%ivn)" correspond with a loop where all the iterations 1..n are fully parallel with only a single barrier at the end. This is good for 2D tiling. If there are inner loops that need to be synchronized, then using a nested construct can be employed. I would not allow a barrier in the middle, unless it is a shorthand for splitting the loop into two:
 
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { 
  //a
  barrier
  //b
}
 
is a shorthand for
 
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { /* a */ }
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { /* b */ }
 
where either no local variables can span the barrier, or they have to be expanded into arrays. If you come from GPU land, this sort of come for free when using thread-private locals... but this is not the general CPU mode.
 

Alexandre

-----------------------------------------------------------------------------------------------------
Alexandre Eichenberger, Principal RSM, Advanced Compiler Technologies
- research: compiler optimization (OpenMP, GPU, SIMD)
- info: al...@us.ibm.com http://www.research.ibm.com/people/a/alexe
- phone: 914-945-1812 (work), 914-312-3618 (cell)
 
 
--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/CAFPX7RLpMw1N7Rr-%2BnBzD9KMzKFZZ4tPHB7K%3DQpZ86RvYs-jOg%40mail.gmail.com.
 

Mehdi AMINI

unread,
Nov 22, 2019, 3:03:09 PM11/22/19
to Alexandre Eichenberger, her...@google.com, ml...@tensorflow.org
On Fri, Nov 22, 2019 at 11:53 AM Alexandre Eichenberger <al...@us.ibm.com> wrote:
Hi Stephan,
 
having being involved in the OpenMP committee for a while now, I like the proposal and think it will be useful to define parallel constructs within MLIR. I believe the current proposal below may need further clarification, esp with respect to synchronization.
 
Starting with a 1D parallel region:
 
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { ... }
 
the traditional model is a fork/join model, where all the treads work in parallel until all iterations are done and then they wait for others in a barrier, after which only the main tread continue execution.
 
When extending this model more dimensions, the proposal has to be specific if, for example
 
loop.parallel (%iv1, %iv2) = (%lb1, %lb2) to (%ub1, %ub2) step (%s1, %s2) { ... }
 
is simply a shorthand for 
 
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { 
  loop.parallel (%iv2) = (%lb2) to (%ub2) step (%s2) { ... }
}
 
or not. If it is, this means that the code execute N+1 barriers (where N is the trip count of the inner loop), as each of the threads that participate to one inner parallel loop are expected to synchronize at the end of it.

Is it semantically important when the loops are perfectly nested?
How do you observe the difference if there are these extra barriers or not?

While this is often desired (and the default model in OpenMP), it is not always best. There are cases, esp when 2+D tiling where you want all of the tiles to go in parallel. OpenMP has a fairly unsatisfying approach to this (collapsing multiple loops in a parallel). 
 
For the barriers, we also have to be a bit more specific on which threads are expected to synchronize with which other threads. I also know it is customary in GPU to use a synchronization barrier inside loops, as there is the expectation that each of the loop iterations is mapped to a thread in a very large pool of warps. This is typically not the case on CPUs, where a single CPU thread is sequentially iterating over the parallel iterations that were mapped to that CPU. This is why barriers are not allowed in parallel-for loops (like the construct you propose), and if you need such a barrier, the parallel loop with a barrier would need to be split into two parallel loops.

The need for splitting is also needed on a GPU as well in general if the whole loop isn’t mapped to a single “work-group” I believe. But when the bounds and the targets properties are know during the lowering of the loop, you can use the most efficient construct.

The important part to me would be to know it is always trivial to split a loop or if there are correctness aspects that can make it tricky.
(The problem of array expansion you mention below for example seems like an important point to address if we go this route).


Sanjoy Das

unread,
Nov 22, 2019, 4:45:31 PM11/22/19
to Stephan Herhut, Tim Shen, MLIR, jeremy....@intel.com
Hi Stephan,

+Jeremy +Tim Shen

On Fri, Nov 22, 2019 at 11:15 AM 'Stephan Herhut' via MLIR
<ml...@tensorflow.org> wrote:
> I am looking for feedback on a parallel loop operation. It is very much inspired by GPU code generation needs and I would be particularly interested in feedback on expressiveness and whether it fits other targets needs. I propose a syntax but admittedly not much thought went into its design and the choice of names :)
>
> Cheers
> Stephan
>
> Rationale
> ----
>
> Our current lowering pipeline from structured ops to the GPU dialect uses structured loops (from the loop dialect) as an intermediate step. However, structured loops have sequential semantics and hence we lose the inherent parallelism that structures ops provide by construction. One possible solution would be to lower directly to the GPU dialect from structured ops. While this is feasible, it would require to implement fusion and tiling on the level of GPU operations. These transformations, however, are of a general nature and would apply to similar dialects as well. Consequently, I am proposing to add a further kind of loop that models the parallelism from structured ops but is more generic in order to support loop-fusion and tiling transformations.
>
> To enable transformations like fusion, we will need some form of dependency analysis between loops. This proposal aims to be agnostic of such an analysis. Like with structured sequential loops, I expect to see this kind of loop in multiple dialects, e.g., an affine version for affine analysis. For presentation, I will focus on the (non-affine) loop dialect.
>
> Proposed Syntax
> ----
>
> Below is an example of a parallel loop (assuming it lives in the loop dialect).
>
> loop.parallel (%iv1, %iv2, %iv3) = (%lb1, %lb2, %lb3) to (%ub1, %ub2, %ub3) step (%s1, %s2, %s3) {
> // some computation
> }
>
> Other than in the existing sequential loops, a parallel loop iterates over a multi-dimensional iteration space. In other words, it combines a nest of multiple loops into a single operation. The %iv are bound ssa-values representing a point in such iteration space. The %lb, %ub and %s are uses of ssa-values representing the lower bound, upper bound and step, respectively.
>
> Semantics
> ----
>
> Like with the for operation, the lower bound is inclusive and the upper bound is exclusive.

Can the IVs wrap?

> The body of the parallel loop is executed in parallel, i.e., its side-effects can be perceived in any order as long as there is a corresponding sequential interleaving of operations.

This seems a bit restrictive IMO. If I had the following loop:

loop.parallel (%iv1) = (0) to (2) step (1) {
if (%iv == 0) {
*%ptr_a = 5
*%ptr_b = 10
} else {
%r0 = *%ptr_b
%r1 = *%ptr_a
}
}

I would like to be able to reorder the two stores if ptr_a and ptr_b
don't alias. But I think this current semantic does not allow this
because no sequential interleaving allows (assuming all locations are
zero initially) r0 = 10 && r1 = 0.

What do you think about the stronger invariant: there are no data
races between parallel iterations (data race us undefined behavior).
The only "communication" allowed is via structured reduction
primitives.

> I will later refer to the computation of one point in the iteration space of the loop as a thread of execution. Figuratively, it corresponds to executing the body of the loop for one valid set of iv values.
>
> I propose to use a multi-dimensional construct mostly for convenience. It ensures structurally that loop-bands we want to transform into a GPU call later keep their form. With usual loop-nests, loop invariant code motion or transformations on bound and step values can insert extra computations between bands which have to be cleaned up. If required, the n-dimensional loop can still be split into more traditional 1d-loop nests or mixed-dimensional loop nests. Doing so does not impact semantics.
>
> Synchronization
> ----
>
> To support fusing parallel loops even in the presence of dependencies, I propose a synchronization primitive with the following syntax:
>
> loop.barrier {levels = 1}

Is there a natural extension of loop.barrier to sequential loops?

Otherwise we'll have to say that some parallel loops cannot be
trivially (assuming we only care about correctness) lowered into
sequential loops. I think this is fine but is still worth mentioning.
Just to be more complete, a full example of a sum-reduction will look like:

%res = loop.parallel (%iv1) = (0) to (1000) step (1) {
%v0 = load %mem[%iv1]
%sum = "loop.yield"(%v0) {
bb0 (%lhs, %rhs):
%sum = add %lhs, %rhs
return %sum
}
if %iv1 == 0:
store %result, %sum
}

> The loop does not specify the order in which values get reduced. Consequently, the reduction operation needs to be associative.

I'd rather say: the reduction operation will be applied in arbitrary
order, not that it is associative. For instance, I can imagine the
frontend reduces with `lambda x, y: return x` because they don't
really care about which value "wins".

> Loops also support multiple results, in which case the body needs to contain multiple yield operations. For example, computing the minimum and maximum at the same time would look like the following
>
> %res0, %res1 = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) {
> %v0 = load %mem[%iv1, %iv2, %iv3]
> "loop.yield"(%v0) {
> bb0 (%lhs, %rhs):
> %min = min %lhs, %rhs
> return %min
> }
> "loop.yield"(%v0) {
> bb0 (%lhs, %rhs):
> %max = max %lhs, %rhs
> return %max
> }
> }
>
> In the above example, res0 will contain the minimum and res1 the maximum after the loop has finished.
>
> An alternative would be to have a single variadic yield operation that computes multiple results at once. I am leaning against that approach as it would require to keep inputs to reductions alive across barriers. With separate yields, one can return a value early before joining the barrier.
>
> For some scenarios it might be helpful to also specify a neutral value for the reduction. This could be done as an extra set of operands to the loop, for example
>
> %res = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) initial (%init0) {
> %v0 = load %mem[%iv1, %iv2, %iv3]
> "loop.yield"(%v0) {

How does the loop.yield "know" its initial value? Should we instead have:

"loop.yield"(%v0, %init0) {

or something like that?


> bb0 (%lhs, %rhs):
> %sum = add %lhs, %rhs
> return %sum
> }
> }
>
> An alternative to the body region that specifies the computation would be to specify an operation kind (like add). This would be helpful when lowering to platforms that have a set of built-in reduction operations (SPIR-V for example has this concept). While I consider this use-case important, I also think that using a region is preferable for its added expressiveness. Also, even if the parallel loop would support built-in operations, we would need to raise this information out of higher-level dialects that might not have this notion. For example, HLO would not provide this information as it also allows for arbitrary functions in reductions. If we have to raise anyway, we can also do this when lowering out of parallel loops.
>
> --
> Stephan
>

Stephan Herhut

unread,
Nov 22, 2019, 7:10:34 PM11/22/19
to Alexandre Eichenberger, MLIR
Hi Alexandre.

Thanks for the feedback. Comments inline.

On Fri, Nov 22, 2019 at 11:53 AM Alexandre Eichenberger <al...@us.ibm.com> wrote:
Hi Stephan,
 
having being involved in the OpenMP committee for a while now, I like the proposal and think it will be useful to define parallel constructs within MLIR. I believe the current proposal below may need further clarification, esp with respect to synchronization.
 
Starting with a 1D parallel region:
 
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { ... }
 
the traditional model is a fork/join model, where all the treads work in parallel until all iterations are done and then they wait for others in a barrier, after which only the main tread continue execution.
 
When extending this model more dimensions, the proposal has to be specific if, for example
 
loop.parallel (%iv1, %iv2) = (%lb1, %lb2) to (%ub1, %ub2) step (%s1, %s2) { ... }
 
is simply a shorthand for 
 
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { 
  loop.parallel (%iv2) = (%lb2) to (%ub2) step (%s2) { ... }
}
 
or not. If it is, this means that the code execute N+1 barriers (where N is the trip count of the inner loop), as each of the threads that participate to one inner parallel loop are expected to synchronize at the end of it. While this is often desired (and the default model in OpenMP), it is not always best. There are cases, esp when 2+D tiling where you want all of the tiles to go in parallel. OpenMP has a fairly unsatisfying approach to this (collapsing multiple loops in a parallel). 

The semantics of loop.parallel does not require the sync at the end of the inner loop in your example. For perfect loop nests, it is not observable in what order the inner loops terminate as there are no instructions that follow after the inner loop completes. If it was not a perfect loop nest, then you are correct that an implementation would need to make sure that all threads of the inner loop terminate before the remainder of the loop body executes. This could also be expressed by splitting the loop. So for example, if you had

loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { 
  loop.parallel (%iv2) = (%lb2) to (%ub2) step (%s2) { ... }
  someothercode
}

would be equivalent to

loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { 
  loop.parallel (%iv2) = (%lb2) to (%ub2) step (%s2) { ... }
}
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { 
  someothercode
}

with the requirement to preserve data that is between the split parts of the loop. Or, if the outer loop already provides enough parallelism, one could decide to parallelize the outer loop and run the inner loop sequential or the other way round. The loop.parallel operation only describes the constraints.
 
For the barriers, we also have to be a bit more specific on which threads are expected to synchronize with which other threads. I also know it is customary in GPU to use a synchronization barrier inside loops, as there is the expectation that each of the loop iterations is mapped to a thread in a very large pool of warps. This is typically not the case on CPUs, where a single CPU thread is sequentially iterating over the parallel iterations that were mapped to that CPU. This is why barriers are not allowed in parallel-for loops (like the construct you propose), and if you need such a barrier, the parallel loop with a barrier would need to be split into two parallel loops.

That would be a valid lowering strategy when mapping a loop.parallel to something useful for execution on CPU.
 
 My suggestion would be to state that a "loop.parallel (%iv1...%ivn)" correspond with a loop where all the iterations 1..n are fully parallel with only a single barrier at the end. This is good for 2D tiling. If there are inner loops that need to be synchronized, then using a nested construct can be employed. I would not allow a barrier in the middle, unless it is a shorthand for splitting the loop into two:
 
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { 
  //a
  barrier
  //b
}
 
is a shorthand for
 
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { /* a */ }
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { /* b */ }
 
where either no local variables can span the barrier, or they have to be expanded into arrays. If you come from GPU land, this sort of come for free when using thread-private locals... but this is not the general CPU mode. 

It is a shorthand if there are no values live across the barrier. If there are values live across, a CPU implementation would need to materialize them. My goal with this operation is to have a flexible way to express these kinds of constraints. When using this in a pipeline for CPU, one might not generate barriers inside of loops because that pattern is not beneficial for the expected lowering. Not having a barrier on the other hand would preclude us from expressing this, even if the target supports it.

Cheers
  Stephan


--
Stephan

Stephan Herhut

unread,
Nov 22, 2019, 7:31:32 PM11/22/19
to Sanjoy Das, Tim Shen, MLIR, Bruestle, Jeremy
On Fri, Nov 22, 2019 at 1:45 PM Sanjoy Das <san...@google.com> wrote:
Hi Stephan,

+Jeremy +Tim Shen

On Fri, Nov 22, 2019 at 11:15 AM 'Stephan Herhut' via MLIR
<ml...@tensorflow.org> wrote:
> I am looking for feedback on a parallel loop operation. It is very much inspired by GPU code generation needs and I would be particularly interested in feedback on expressiveness and whether it fits other targets needs. I propose a syntax but admittedly not much thought went into its design and the choice of names :)
>
> Cheers
>   Stephan
>
> Rationale
> ----
>
> Our current lowering pipeline from structured ops to the GPU dialect uses structured loops (from the loop dialect) as an intermediate step. However, structured loops have sequential semantics and hence we lose the inherent parallelism that structures ops provide by construction. One possible solution would be to lower directly to the GPU dialect from structured ops. While this is feasible, it would require to implement fusion and tiling on the level of GPU operations. These transformations, however, are of a general nature and would apply to similar dialects as well. Consequently, I am proposing to add a further kind of loop that models the parallelism from structured ops but is more generic in order to support loop-fusion and tiling transformations.
>
> To enable transformations like fusion, we will need some form of dependency analysis between loops. This proposal aims to be agnostic of such an analysis. Like with structured sequential loops, I expect to see this kind of loop in multiple dialects, e.g., an affine version for affine analysis. For presentation, I will focus on the (non-affine) loop dialect.
>
> Proposed Syntax
> ----
>
> Below is an example of a parallel loop (assuming it lives in the loop dialect).
>
> loop.parallel (%iv1, %iv2, %iv3) = (%lb1, %lb2, %lb3) to (%ub1, %ub2, %ub3) step (%s1, %s2, %s3) {
>   // some computation
> }
>
> Other than in the existing sequential loops, a parallel loop iterates over a multi-dimensional iteration space. In other words, it combines a nest of multiple loops into a single operation. The %iv are bound ssa-values representing a point in such iteration space. The %lb, %ub and %s are uses of ssa-values representing the lower bound, upper bound and step, respectively.
>
> Semantics
> ----
>
> Like with the for operation, the lower bound is inclusive and the upper bound is exclusive.

Can the IVs wrap?

What do you mean by wrap?
 
> The body of the parallel loop is executed in parallel, i.e., its side-effects can be perceived in any order as long as there is a corresponding sequential interleaving of operations.

This seems a bit restrictive IMO.  If I had the following loop:

loop.parallel (%iv1) = (0) to (2) step (1) {
  if (%iv == 0) {
    *%ptr_a = 5
    *%ptr_b = 10
  } else {
    %r0 = *%ptr_b
    %r1 = *%ptr_a
  }
}

I would like to be able to reorder the two stores if ptr_a and ptr_b
don't alias.  But I think this current semantic does not allow this
because no sequential interleaving allows (assuming all locations are
zero initially) r0 = 10 && r1 = 0.

The semantic of the loop does not allow this because generally it would be observable. If you can prove that for all execution orders that are valid under the semantics of the operation your change does not impact the expected overall outcome, you can still decide to do this rewrite. That is no different than reordering loads/stores in sequential code. That is also not generally allowed by the semantics of the program but you can do it if you prove it to be ok.

The semantics of the parallel loop are specified such that I can do certain reorderings of execution without proving them to be correct. So if you had

loop.parallel (%iv1) - (0) to (2) step (1) {
  *ptr = %iv1
}

then the value of ptr after the loop can be 0 or 1.   

What do you think about the stronger invariant: there are no data
races between parallel iterations (data race us undefined behavior).
The only "communication" allowed is via structured reduction
primitives.

We could specify that if there are data races, then the outcome is undefined. That is weaker (and probably works better with actual memory models). I did 
 
> I will later refer to the computation of one point in the iteration space of the loop as a thread of execution. Figuratively, it corresponds to executing the body of the loop for one valid set of iv values.
>
> I propose to use a multi-dimensional construct mostly for convenience. It ensures structurally that loop-bands we want to transform into a GPU call later keep their form. With usual loop-nests, loop invariant code motion or transformations on bound and step values can insert extra computations between bands which have to be cleaned up. If required, the n-dimensional loop can still be split into more traditional 1d-loop nests or mixed-dimensional loop nests. Doing so does not impact semantics.
>
> Synchronization
> ----
>
> To support fusing parallel loops even in the presence of dependencies, I propose a synchronization primitive with the following syntax:
>
> loop.barrier {levels = 1}

Is there a natural extension of loop.barrier to sequential loops?

Otherwise we'll have to say that some parallel loops cannot be
trivially (assuming we only care about correctness) lowered into
sequential loops.  I think this is fine but is still worth mentioning.

It has no meaning is sequential loops but there is a well defined lowering for a parallel loop with barrier into sequential loop nests (see example below).
No, there is no store at the end. The result of the reduction will be in %res when the loop completes.  
 
> The loop does not specify the order in which values get reduced. Consequently, the reduction operation needs to be associative.

I'd rather say: the reduction operation will be applied in arbitrary
order, not that it is associative.  For instance, I can imagine the
frontend reduces with `lambda x, y: return x` because they don't
really care about which value "wins".

Agreed. I just wanted to emphasise that deterministic results require an associative operation. In reality this is often not the case, not even addition on floating is associative.
 
> Loops also support multiple results, in which case the body needs to contain multiple yield operations. For example, computing the minimum and maximum at the same time would look like the following
>
> %res0, %res1 = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) {
>   %v0 = load %mem[%iv1, %iv2, %iv3]
>   "loop.yield"(%v0) {
>     bb0 (%lhs, %rhs):
>       %min = min %lhs, %rhs
>       return %min
>   }
>   "loop.yield"(%v0) {
>     bb0 (%lhs, %rhs):
>       %max = max %lhs, %rhs
>       return %max
>   }
> }
>
> In the above example, res0 will contain the minimum and res1 the maximum after the loop has finished.
>
> An alternative would be to have a single variadic yield operation that computes multiple results at once. I am leaning against that approach as it would require to keep inputs to reductions alive across barriers. With separate yields, one can return a value early before joining the barrier.
>
> For some scenarios it might be helpful to also specify a neutral value for the reduction. This could be done as an extra set of operands to the loop, for example
>
> %res = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) initial (%init0) {
>   %v0 = load %mem[%iv1, %iv2, %iv3]
>   "loop.yield"(%v0) {

How does the loop.yield "know" its initial value?  Should we instead have:

   "loop.yield"(%v0, %init0) {

or something like that?

It knows in the same way that it knows which result value of the for loop it is associated with: By order of appearance in the loop body. I would also be fine with passing it explicitly.

>     bb0 (%lhs, %rhs):
>       %sum = add %lhs, %rhs
>       return %sum
>   }
> }
>
> An alternative to the body region that specifies the computation would be to specify an operation kind (like add). This would be helpful when lowering to platforms that have a set of built-in reduction operations (SPIR-V for example has this concept). While I consider this use-case important, I also think that using a region is preferable for its added expressiveness. Also, even if the parallel loop would support built-in operations, we would need to raise this information out of higher-level dialects that might not have this notion. For example, HLO would not provide this information as it also allows for arbitrary functions in reductions. If we have to raise anyway, we can also do this when lowering out of parallel loops.
>
> --
> Stephan
>
> --
> You received this message because you are subscribed to the Google Groups "MLIR" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
> To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/CAFPX7RLpMw1N7Rr-%2BnBzD9KMzKFZZ4tPHB7K%3DQpZ86RvYs-jOg%40mail.gmail.com.


--
Stephan

Dibyendu Das

unread,
Nov 22, 2019, 9:57:21 PM11/22/19
to Stephan Herhut, MLIR
Hi Stephan-

Would we need some modelling of atomic/locking operations and the concept of shared/private variables ?

-Thx
Dibyendu

--

Sanjoy Das

unread,
Nov 23, 2019, 12:36:14 AM11/23/19
to Stephan Herhut, Tim Shen, MLIR, Bruestle, Jeremy
On Fri, Nov 22, 2019 at 4:31 PM Stephan Herhut <her...@google.com> wrote:
>> > Like with the for operation, the lower bound is inclusive and the upper bound is exclusive.
>>
>> Can the IVs wrap?
>
> What do you mean by wrap?

Overflow (signed or unsigned). E.g.

loop.parallel (%iv1) = (INT_MAX-1) to (INT_MAX+1 /* == INT_MIN */) step (1) {
...
}

>> > The body of the parallel loop is executed in parallel, i.e., its side-effects can be perceived in any order as long as there is a corresponding sequential interleaving of operations.
>>
>> This seems a bit restrictive IMO. If I had the following loop:
>>
>> loop.parallel (%iv1) = (0) to (2) step (1) {
>> if (%iv == 0) {
>> *%ptr_a = 5
>> *%ptr_b = 10
>> } else {
>> %r0 = *%ptr_b
>> %r1 = *%ptr_a
>> }
>> }
>>
>> I would like to be able to reorder the two stores if ptr_a and ptr_b
>> don't alias. But I think this current semantic does not allow this
>> because no sequential interleaving allows (assuming all locations are
>> zero initially) r0 = 10 && r1 = 0.
>
> The semantic of the loop does not allow this because generally it would be observable. If you can prove that for all execution orders that are valid under the semantics of the operation your change does not impact the expected overall outcome, you can still decide to do this rewrite. That is no different than reordering loads/stores in sequential code.

It is somewhat different, depending on the programming language. For
instance, in C++ non-atomic variables cannot race so reordering the
two stores in the following program is fine:

void f() {
int* a = malloc(sizeof(int));
int* b = malloc(sizeof(int));

// escape a and b to another thread.

*a = 10;
*b = 20;
}

> That is also not generally allowed by the semantics of the program but you can do it if you prove it to be ok.
>
> The semantics of the parallel loop are specified such that I can do certain reorderings of execution without proving them to be correct. So if you had
>
> loop.parallel (%iv1) - (0) to (2) step (1) {
> *ptr = %iv1
> }
>
> then the value of ptr after the loop can be 0 or 1.
>
>> What do you think about the stronger invariant: there are no data
>> races between parallel iterations (data race us undefined behavior).
>> The only "communication" allowed is via structured reduction
>> primitives.
>
>
> We could specify that if there are data races, then the outcome is undefined. That is weaker (and probably works better with actual memory models). I did

It also allows more optimization. E.g. if the loop body was:

a[%iv] = 300;
p[%iv] = 20;
a[%iv+1] = 400;

I would like to optimize the two stores to `a` to a vector store
(assuming `p` and `a` don't alias). But vectorizing the store to `a`
is equivalent reordering the store to `p` to either before or after
both the stores.

>> Otherwise we'll have to say that some parallel loops cannot be
>> trivially (assuming we only care about correctness) lowered into
>> sequential loops. I think this is fine but is still worth mentioning.
>
> It has no meaning is sequential loops but there is a well defined lowering for a parallel loop with barrier into sequential loop nests (see example below).

Ok, but this means that a loop.parallel is not a normal loop with an
extra constraint on the legality of inter-iteration dependence. There
are cases where a loop.parallel _has_ to be run in parallel to have a
correct semantic. I'm not objecting to this, but IMO it should be
explicitly called out.

>> Just to be more complete, a full example of a sum-reduction will look like:
>>
>> %res = loop.parallel (%iv1) = (0) to (1000) step (1) {
>> %v0 = load %mem[%iv1]
>> %sum = "loop.yield"(%v0) {
>> bb0 (%lhs, %rhs):
>> %sum = add %lhs, %rhs
>> return %sum
>> }
>> if %iv1 == 0:
>> store %result, %sum
>> }
>
> No, there is no store at the end. The result of the reduction will be in %res when the loop completes.

Oops, I didn't notice the `%res = ` in your example.

One advantage of modeling "loop.yield"(%v0) as returning a value is
that it lets us represent operations like batchnorm (where we divide
every element with the mean of the entire batch) using a single
parallel loop. However, this means loop.yield has barrier semantics,
which may or may not be a good thing.

>> I'd rather say: the reduction operation will be applied in arbitrary
>> order, not that it is associative. For instance, I can imagine the
>> frontend reduces with `lambda x, y: return x` because they don't
>> really care about which value "wins".
>
> Agreed. I just wanted to emphasise that deterministic results require an associative operation. In reality this is often not the case, not even addition on floating is associative.

Ack.

>> How does the loop.yield "know" its initial value? Should we instead have:
>>
>> "loop.yield"(%v0, %init0) {
>>
>> or something like that?
>
>
> It knows in the same way that it knows which result value of the for loop it is associated with: By order of appearance in the loop body. I would also be fine with passing it explicitly.

I think having an explicit use-def chain (both for loop.yield inputs
and outputs) is useful here. Otherwise I suspect it will be easy to
reorder a loop.yield and "forget" to update the corresponding
input/output.

-- Sanjoy

Mehdi AMINI

unread,
Nov 23, 2019, 3:47:51 AM11/23/19
to Sanjoy Das, Stephan Herhut, Tim Shen, MLIR, Bruestle, Jeremy
I'd be wary of the "_has_ to be run in parallel" for correctness, but from what I understand there is always a valid lowering to a sequential loop? (even though not just by trivially "removing the parallel annotation")
I'm not sure about this: seems like %init0 is referring to the SSA value outside of the loop here: it does not translate to the accumulator during the reduction.

My mental model is that the parallel.for loop maintains an accumulator internally, it isn't visible to the body of the loop, and is only available from within the loop.yield region.
Another way to see it could be that the loop.yield is a "critical section" that protect access to the accumulator.

From this point of view, it makes sense to me for the loop.yield to be responsible to materialize it as a value.

 
>
>
> It knows in the same way that it knows which result value of the for loop it is associated with: By order of appearance in the loop body. I would also be fine with passing it explicitly.

I think having an explicit use-def chain (both for loop.yield inputs
and outputs) is useful here.  Otherwise I suspect it will be easy to
reorder a loop.yield and "forget" to update the corresponding
input/output.

I was reading loop.yield as being a terminator, isn't it? (this is also why it should not return a value)

-- 
Mehdi


-- Sanjoy


--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.

Sanjoy Das

unread,
Nov 23, 2019, 11:16:25 AM11/23/19
to Mehdi AMINI, Stephan Herhut, Tim Shen, MLIR, Bruestle, Jeremy
On Sat, Nov 23, 2019 at 12:47 AM Mehdi AMINI <joke...@gmail.com> wrote:
>> Ok, but this means that a loop.parallel is not a normal loop with an
>> extra constraint on the legality of inter-iteration dependence. There
>> are cases where a loop.parallel _has_ to be run in parallel to have a
>> correct semantic. I'm not objecting to this, but IMO it should be
>> explicitly called out.
>
>
> I'd be wary of the "_has_ to be run in parallel" for correctness, but from what I understand there is always a valid lowering to a sequential loop? (even though not just by trivially "removing the parallel annotation")

Yes, which is why IMO the current framing is ok. You could also say
that the current framing allows for a single "blocking" operation,
loop.barrier(), with a well defined semantics.

>> >> How does the loop.yield "know" its initial value? Should we instead have:
>> >>
>> >> "loop.yield"(%v0, %init0) {
>> >>
>> >> or something like that?
>
> I'm not sure about this: seems like %init0 is referring to the SSA value outside of the loop here: it does not translate to the accumulator during the reduction.
>
> My mental model is that the parallel.for loop maintains an accumulator internally, it isn't visible to the body of the loop, and is only available from within the loop.yield region.
> Another way to see it could be that the loop.yield is a "critical section" that protect access to the accumulator.
>
> From this point of view, it makes sense to me for the loop.yield to be responsible to materialize it as a value.

I'm not sure I fully understand the problem you're trying to
illustrate, but semantically, I see loop.yield as doing:

"loop.yield"(%val, %init) {
with_lock {
for an arbitrary number of times {
*accum = red_fn(*accum, %init)
}

*accum = red_fn(*accum, %val)

for an arbitrary number of times {
*accum = red_fn(*accum, %init)
}
}
}

As opposed to just being the initial value of *accum. This lets us
justify, for instance, splitting a single parallel loop reducing 100
elements into two parallel loops reducing 50 elements each followed by
a final add. Splitting up a reduction loop this way requires
"inserting" the initial value into the larger reduction expression
(i.e. from ((((0+a)+b)+c)+d) to "((0+a)+b) + ((0+c)+d)").

>> I think having an explicit use-def chain (both for loop.yield inputs
>> and outputs) is useful here. Otherwise I suspect it will be easy to
>> reorder a loop.yield and "forget" to update the corresponding
>> input/output.
>
> I was reading loop.yield as being a terminator, isn't it? (this is also why it should not return a value)

We definitely need to allow multiple reductions in the same loop but
(I'm assuming) terminators would disallow this. This "multi output
reduction fusion" optimization is important in practice; we use it to
compute sum(x) and sum(x*x) (mean and variance) during
batch-norm-training, for instance.

-- Sanjoy

Nicolas Vasilache

unread,
Nov 23, 2019, 11:57:33 AM11/23/19
to MLIR
Thanks Stephan for the proposal!

Hi Alexandre, long time :)

Others have noted the "must run in parallel", we also had this discussion internally.

Allow me to dump here some pseudo IR that describes how I've been thinking about the problem.
We can go the "omp pragma style" implicit parallelism.
We can additionally think of this in terms of explicit SPMD style loops (cuda-style and/or MPI style).

```
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) on (%proc_id) out_of (%num_proc) { 
   ... 
   loop.sync(%proc_id) // implicitly for all in [0, %num_proc], optionally with some extra information for a subset of processors 
}
```

I don't have a very strong preference for the next point but here is how I have been thinking.
In terms of representation, I would think `loop` is a low-level concept so I would think that we want 1 loop op per iterator.
Higher-levels of IR like structuredops/linalg have "multi-for" semantics.

This allows a 1-iterator -> N-D virtual processor dimension mapping.

```
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) on (%proc_id_dim1, %proc_id_dim2) out_of (%num_proc_dim_1, %num_proc_dim_2) { 
   ... 
   loop.sync(%
 proc_id_dim1, % proc_id_dim2) // implicitly for all in [0, %num_proc], optionally with some extra information for a subset of processors 
}
```

Then you should start seeing how we can map one loop to both (gridDim.x and blockDim.x).
This is still compatible with imperfectly nested parallelism that I think we really want as Alex mentioned.

In the multi-for context, mapping k-iterators to N-processor dimensions I think it will be trickier to explain what maps where (in addition to how do we express imperfectly nested parallel loops).

Dropping some parallelism seems simple:
```
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) on (%proc_id_dim2) out_of (%num_proc_dim_2) { 
   ... 
   loop.sync(
% proc_id_dim2) // implicitly for all in [0, %num_proc], optionally with some extra information for a subset of processors 
}
```

Here is after dropping %1

Mahesh Ravishankar

unread,
Nov 24, 2019, 3:21:12 AM11/24/19
to Stephan Herhut, MLIR
Thanks Stephan for the RFC. I like the idea of a multi-dimensional parallel for loop. So +1 for that idea. I have some questions for the synchronization and reduction aspects below

On Fri, Nov 22, 2019 at 11:15 AM 'Stephan Herhut' via MLIR <ml...@tensorflow.org> wrote:
IMO it is better to not club parallel for and reductions together. THey are really two separate computations. One is more like a map (using terminology from functional programming), and the other is like a reduce (or more concretely a map-reduce). For example, seems like in the reduction case you cannot have any more operations following the "loop.yield" because that would require some additional synchronization to guarantee that the reduction has completed. If this is the case, as an alternative, would it make sense to have a loop.parallel and a loop.parallel_reduce. To maintain the experessiblity of what you intend here. A loop.parallel_reduce can have two regions. The first region can have the same semantics as the region of the loop.parallel operation. These second region can be the body of the reduction function.
 
A common pattern in the code we want to generate are reductions. It is often beneficial to fuse them into the parallel computation that produces the value to be reduced. To facilitate this on the loop level, I propose to add reduction support to parallel loops. This could be done in the form of a special yield operation. An example would look like the following

%res = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) {
  %v0 = load %mem[%iv1, %iv2, %iv3]
  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %sum = add %lhs, %rhs
      return %sum
  }  
}

The yield operations receives as its sole argument the value of the reduction produced by the current thread. It also needs to specify how values are actually reduced. This is done by providing a region that expects two arguments of the same type as the value yielded. The region computes and returns the produced value.

The loop does not specify the order in which values get reduced. Consequently, the reduction operation needs to be associative.

I saw some discussion on this, but I think a better semantics is to state that the reduction operation is associative and commutative. 
 

Loops also support multiple results, in which case the body needs to contain multiple yield operations. For example, computing the minimum and maximum at the same time would look like the following

%res0, %res1 = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) {
  %v0 = load %mem[%iv1, %iv2, %iv3]
  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %min = min %lhs, %rhs
      return %min
  }  
  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %max = max %lhs, %rhs
      return %max
  }  
}

In the above example, res0 will contain the minimum and res1 the maximum after the loop has finished.

An alternative would be to have a single variadic yield operation that computes multiple results at once. I am leaning against that approach as it would require to keep inputs to reductions alive across barriers. With separate yields, one can return a value early before joining the barrier.

I am missing where the barrier comes into the picture here. So to me having a single "reduction function" which takes a single input and can produce multiple outputs seems better.
 

For some scenarios it might be helpful to also specify a neutral value for the reduction. This could be done as an extra set of operands to the loop, for example

%res = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) initial (%init0) {
  %v0 = load %mem[%iv1, %iv2, %iv3]
  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %sum = add %lhs, %rhs
      return %sum
  }  
}

An alternative to the body region that specifies the computation would be to specify an operation kind (like add). This would be helpful when lowering to platforms that have a set of built-in reduction operations (SPIR-V for example has this concept). While I consider this use-case important, I also think that using a region is preferable for its added expressiveness. Also, even if the parallel loop would support built-in operations, we would need to raise this information out of higher-level dialects that might not have this notion. For example, HLO would not provide this information as it also allows for arbitrary functions in reductions. If we have to raise anyway, we can also do this when lowering out of parallel loops.

--
Stephan

--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.

Sanjoy Das

unread,
Nov 25, 2019, 11:08:08 AM11/25/19
to Mahesh Ravishankar, Stephan Herhut, MLIR
I'm not sure if this is the case in Stephan's proposal. The reduced
value is only "available" at loop termination so the property you
mention, that it requires some additional synchronization, comes "for
free".

I later suggested that we should make the reduced value available in
the loop body to express more complex algorithms (like batchnorm) in a
single loop. That representation has the problem you point out
(loop.yield implies a loop.barrier) but IMO that's not a big problem
because we can lower these loops the same way we lower loop.barrier:
by splitting the larger loop into smaller barrier-less loops.

> If this is the case, as an alternative, would it make sense to have a loop.parallel and a loop.parallel_reduce. To maintain the experessiblity of what you intend here. A loop.parallel_reduce can have two regions. The first region can have the same semantics as the region of the loop.parallel operation. These second region can be the body of the reduction function.
>
>>
>> A common pattern in the code we want to generate are reductions. It is often beneficial to fuse them into the parallel computation that produces the value to be reduced. To facilitate this on the loop level, I propose to add reduction support to parallel loops. This could be done in the form of a special yield operation. An example would look like the following
>>
>> %res = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) {
>> %v0 = load %mem[%iv1, %iv2, %iv3]
>> "loop.yield"(%v0) {
>> bb0 (%lhs, %rhs):
>> %sum = add %lhs, %rhs
>> return %sum
>> }
>> }
>>
>> The yield operations receives as its sole argument the value of the reduction produced by the current thread. It also needs to specify how values are actually reduced. This is done by providing a region that expects two arguments of the same type as the value yielded. The region computes and returns the produced value.
>>
>> The loop does not specify the order in which values get reduced. Consequently, the reduction operation needs to be associative.
>
>
> I saw some discussion on this, but I think a better semantics is to state that the reduction operation is associative and commutative.

As written "reduction operation is associative and commutative"
suggests that using a non-commutative or non-associative operation has
undefined behavior (i.e. the results can be arbitrarily bad).
However, some users might want to use non-associative or
non-commutative operations and be okay with the non-determinism that
results, or have some side-information that even though the operation
is non-associative/non-commutative, for the given inputs they are
associative / commutiative.

To allow such use cases IMO it is better to specify the behavior of
loop.yield directly ("it applies the reduction function in arbitrary
order") than as a constraint on the reduction function itself. We can
then add a footnote that if the reduction function is commutative and
associative and the init value is the operation identity then
loop.yield will produce deterministic results.

-- Sanjoy

Mahesh Ravishankar

unread,
Nov 25, 2019, 3:14:58 PM11/25/19
to Sanjoy Das, Stephan Herhut, MLIR
If I have to restate what you meant, the loop.yield has an implicit barrier at the end of it.
I am not sure as stated it would be trivial to split a loop.parallel with a loop.barrier into two separate loop.parallel operations. To do that you need additional structural constraints on the ops. Off the top of my head
1) The loop.parallel must have a region with a single basic block.
2) loop.barrier must be only in the region within a loop.parallel.
Both of these are traits on ops in MLIR. So that might be OK.
+1.
 

-- Sanjoy


--
Mahesh

Tim Shen

unread,
Nov 25, 2019, 4:37:45 PM11/25/19
to Stephan Herhut, MLIR
Thanks Stephan for the proposal!

Several folks already mentioned w.r.t. CPU, data race, and atomic/locking semantics, etc. Assuming that we want to support all the following lowering targets:
* single CPU thread.
* multiple CPU threads.
* GPU.

Revolving around this idea, other comments are inline.

On Fri, Nov 22, 2019 at 11:15 AM 'Stephan Herhut' via MLIR <ml...@tensorflow.org> wrote:
I am looking for feedback on a parallel loop operation. It is very much inspired by GPU code generation needs and I would be particularly interested in feedback on expressiveness and whether it fits other targets needs. I propose a syntax but admittedly not much thought went into its design and the choice of names :)

Cheers
  Stephan

Rationale
----

Our current lowering pipeline from structured ops to the GPU dialect uses structured loops (from the loop dialect) as an intermediate step. However, structured loops have sequential semantics and hence we lose the inherent parallelism that structures ops provide by construction. One possible solution would be to lower directly to the GPU dialect from structured ops. While this is feasible, it would require to implement fusion and tiling on the level of GPU operations. These transformations, however, are of a general nature and would apply to similar dialects as well. Consequently, I am proposing to add a further kind of loop that models the parallelism from structured ops but is more generic in order to support loop-fusion and tiling transformations.

To enable transformations like fusion, we will need some form of dependency analysis between loops. This proposal aims to be agnostic of such an analysis. Like with structured sequential loops, I expect to see this kind of loop in multiple dialects, e.g., an affine version for affine analysis. For presentation, I will focus on the (non-affine) loop dialect.

Proposed Syntax
----

Below is an example of a parallel loop (assuming it lives in the loop dialect).

loop.parallel (%iv1, %iv2, %iv3) = (%lb1, %lb2, %lb3) to (%ub1, %ub2, %ub3) step (%s1, %s2, %s3) {
  // some computation
}

Other than in the existing sequential loops, a parallel loop iterates over a multi-dimensional iteration space. In other words, it combines a nest of multiple loops into a single operation. The %iv are bound ssa-values representing a point in such iteration space. The %lb, %ub and %s are uses of ssa-values representing the lower bound, upper bound and step, respectively.

Semantics
----

Like with the for operation, the lower bound is inclusive and the upper bound is exclusive. The body of the parallel loop is executed in parallel, i.e., its side-effects can be perceived in any order as long as there is a corresponding sequential interleaving of operations. I will later refer to the computation of one point in the iteration space of the loop as a thread of execution. Figuratively, it corresponds to executing the body of the loop for one valid set of iv values.

If we also consider CPUs as lowering targets, we may want to look at their memory models. IIRC Power and ARM don't guarantee a global total order of stores (https://www.youtube.com/watch?v=VogqOscJYvk).

I agree with Sanjoy's example (of exchanging store orders), in that for architecture inclusiveness we don't need to over specify the ordering.
As a thought experiment, what happens if we take loop.barrier out of the public API and use it merely as a lowering implementation detail (e.g. within the GPU dialect)? My only finding of the impact is that users would lose the ability to guide the fusion process. What if we then add an attribute-based fusion annotation for users to guide the fusion process? Would it justify the removal of the loop.barrier abstraction?
What does it lower to on CUDA? I assume that it's atomicAdd (as today's XLA/GPU do for reduction). OTOH, a deterministic, cross-workgroup reduction on GPU can be implemented in tree reduction. It requires launching multiple kernels.

If we have to talk about memory models nonetheless, maybe we simply expose a set of atomic ops, rather than specifically creating an abstraction "yield"?
 

--
Stephan

--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/CAFPX7RLpMw1N7Rr-%2BnBzD9KMzKFZZ4tPHB7K%3DQpZ86RvYs-jOg%40mail.gmail.com.

--
Regards,
Tim Shen

Stephan Herhut

unread,
Nov 26, 2019, 6:30:47 PM11/26/19
to Dibyendu Das, MLIR
On Fri, Nov 22, 2019 at 6:57 PM Dibyendu Das <dibyendu...@gmail.com> wrote:
Hi Stephan-

Would we need some modelling of atomic/locking operations and the concept of shared/private variables ?

These could be defined in some auxiliary dialect depending on the hardware/platform the code is being lowered to. I see the parallel loop more as a construct to define the semantics of parallel computation wheres atomics are more related to the assumed memory model. Another way to look at this is that the loop describes which parallel executions of the body are legal and the atomics would be a way to ensure that this is actually true.

Cheers
  Stephan


--
Stephan

Uday Bondhugula

unread,
Nov 27, 2019, 7:40:18 AM11/27/19
to MLIR


On Saturday, November 23, 2019 at 12:45:05 AM UTC+5:30, Stephan Herhut wrote:


Below is an example of a parallel loop (assuming it lives in the loop dialect).

loop.parallel (%iv1, %iv2, %iv3) = (%lb1, %lb2, %lb3) to (%ub1, %ub2, %ub3) step (%s1, %s2, %s3) {
  // some computation
}

The 'loop' name looks a bit misleading here - this is really a loop nest or a "multifor"? Incidentally, there was a workshop paper on a primitive like this (although not in an IR setting and perhaps not relevant for the design here):

On Saturday, November 23, 2019 at 1:23:52 AM UTC+5:30, Alexandre Eichenberger wrote:
Hi Stephan, 
 ...  
 
 only a single barrier at the end. This is good for 2D tiling. If there are inner loops that need to be synchronized, then using a nested construct can be employed. I would not allow a barrier in the middle, unless it is a shorthand for splitting the loop into two:
 
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { 
  //a
  barrier
  //b
}
 
is a shorthand for
 
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { /* a */ }
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) { /* b */ }


Using the first one as a shorthand for the second is a bit confusing because both imply different thread mapping constraints even if they are semantically equivalent: if it was OpenMP, in the first form, the same thread is guaranteed to execute the chunk %iv = k of 'a' and that of 'b'. However, in the second form, that binding is lost, and both loops could be independently chunked and run in parallel (with the synchronization in between of course). Is this accurate? There are performance implications when it gets to the runtime.

~ Uday



~ Uday

Stephan Herhut

unread,
Nov 27, 2019, 12:52:51 PM11/27/19
to Nicolas Vasilache, MLIR
On Sat, Nov 23, 2019 at 8:57 AM Nicolas Vasilache <nicolas....@gmail.com> wrote:
Thanks Stephan for the proposal!

Hi Alexandre, long time :)

Others have noted the "must run in parallel", we also had this discussion internally.

Allow me to dump here some pseudo IR that describes how I've been thinking about the problem.
We can go the "omp pragma style" implicit parallelism.
We can additionally think of this in terms of explicit SPMD style loops (cuda-style and/or MPI style).

```
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) on (%proc_id) out_of (%num_proc) { 
   ... 
   loop.sync(%proc_id) // implicitly for all in [0, %num_proc], optionally with some extra information for a subset of processors 
}
```

I don't have a very strong preference for the next point but here is how I have been thinking.
In terms of representation, I would think `loop` is a low-level concept so I would think that we want 1 loop op per iterator.
Higher-levels of IR like structuredops/linalg have "multi-for" semantics.

This allows a 1-iterator -> N-D virtual processor dimension mapping.

```
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) on (%proc_id_dim1, %proc_id_dim2) out_of (%num_proc_dim_1, %num_proc_dim_2) { 
   ... 
   loop.sync(%
 proc_id_dim1, % proc_id_dim2) // implicitly for all in [0, %num_proc], optionally with some extra information for a subset of processors 
}
```

Then you should start seeing how we can map one loop to both (gridDim.x and blockDim.x).
This is still compatible with imperfectly nested parallelism that I think we really want as Alex mentioned.

So if I understand correctly (and more details would certainly help) the processor mapping encodes that this 1d loop is to be executed using two "dimensions" in the hardware. How is the split encoded? In terms of a parallel loop, this would be written as a 2d loop-nest with explicit step and bound values that encode these constraints. Is that equivalent?

Assuming so, how would you in your proposal encode things like explicit prefetch of some memory into local storage within a block? Could you give an example?
 
In the multi-for context, mapping k-iterators to N-processor dimensions I think it will be trickier to explain what maps where (in addition to how do we express imperfectly nested parallel loops).

Dropping some parallelism seems simple:
```
loop.parallel (%iv1) = (%lb1) to (%ub1) step (%s1) on (%proc_id_dim2) out_of (%num_proc_dim_2) { 
   ... 
   loop.sync(
% proc_id_dim2) // implicitly for all in [0, %num_proc], optionally with some extra information for a subset of processors 
}
```

Here is after dropping %1

Can you expand this a little? It seems this is an example of what would be needed but I cannot fully connect the dots.

Thanks
  Stephan
 
--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.

Stephan Herhut

unread,
Nov 27, 2019, 1:08:26 PM11/27/19
to Mahesh Ravishankar, Sanjoy Das, MLIR
I have no data but was simply slightly uncomfortable making yield have barrier semantics because it restricts execution order without me having a good example use. For batchnorm, does it happen in practice that it is beneficial to make it one loop or would one not prefer to fuse the element-wise after the reductions into some following operation? 
 
If I have to restate what you meant, the loop.yield has an implicit barrier at the end of it.
I am not sure as stated it would be trivial to split a loop.parallel with a loop.barrier into two separate loop.parallel operations. To do that you need additional structural constraints on the ops. Off the top of my head
1) The loop.parallel must have a region with a single basic block.
2) loop.barrier must be only in the region within a loop.parallel.
Both of these are traits on ops in MLIR. So that might be OK.

It gets more complicated with nested parallel loops and a sync across such a nest. For example

loop.parallel (...) {
  loop.parallel (...) {
    loop.barrier {levels = 2}
  }
}

This would sync across both loops and to make sure this can be split, it needs to be a direct loop nest, which is a constraint that we cannot express just using those traits. Loop.parallel is not in general required to be a perfect nest but only conditionally on whether it contains a barrier. It is still a property that can be verified for the barrier. I also expect to get constraints on the placement of loop.barrier out of the discussions on convergence properties. Maybe that will allow us to express these constraints.
I agree. I was pretty sloppy in the RFC on this.

Cheers
  Stephan
 

-- Sanjoy


--
Mahesh


--
Stephan

Stephan Herhut

unread,
Nov 27, 2019, 1:40:06 PM11/27/19
to Tim Shen, MLIR
On Mon, Nov 25, 2019 at 1:37 PM Tim Shen <tim...@google.com> wrote:
Thanks Stephan for the proposal!

Several folks already mentioned w.r.t. CPU, data race, and atomic/locking semantics, etc. Assuming that we want to support all the following lowering targets:
* single CPU thread.
* multiple CPU threads.
* GPU.

I would add non-GPU dedicated hardware that has a SPMD-/SIMT like model.
 

Revolving around this idea, other comments are inline.

On Fri, Nov 22, 2019 at 11:15 AM 'Stephan Herhut' via MLIR <ml...@tensorflow.org> wrote:
I am looking for feedback on a parallel loop operation. It is very much inspired by GPU code generation needs and I would be particularly interested in feedback on expressiveness and whether it fits other targets needs. I propose a syntax but admittedly not much thought went into its design and the choice of names :)

Cheers
  Stephan

Rationale
----

Our current lowering pipeline from structured ops to the GPU dialect uses structured loops (from the loop dialect) as an intermediate step. However, structured loops have sequential semantics and hence we lose the inherent parallelism that structures ops provide by construction. One possible solution would be to lower directly to the GPU dialect from structured ops. While this is feasible, it would require to implement fusion and tiling on the level of GPU operations. These transformations, however, are of a general nature and would apply to similar dialects as well. Consequently, I am proposing to add a further kind of loop that models the parallelism from structured ops but is more generic in order to support loop-fusion and tiling transformations.

To enable transformations like fusion, we will need some form of dependency analysis between loops. This proposal aims to be agnostic of such an analysis. Like with structured sequential loops, I expect to see this kind of loop in multiple dialects, e.g., an affine version for affine analysis. For presentation, I will focus on the (non-affine) loop dialect.

Proposed Syntax
----

Below is an example of a parallel loop (assuming it lives in the loop dialect).

loop.parallel (%iv1, %iv2, %iv3) = (%lb1, %lb2, %lb3) to (%ub1, %ub2, %ub3) step (%s1, %s2, %s3) {
  // some computation
}

Other than in the existing sequential loops, a parallel loop iterates over a multi-dimensional iteration space. In other words, it combines a nest of multiple loops into a single operation. The %iv are bound ssa-values representing a point in such iteration space. The %lb, %ub and %s are uses of ssa-values representing the lower bound, upper bound and step, respectively.

Semantics
----

Like with the for operation, the lower bound is inclusive and the upper bound is exclusive. The body of the parallel loop is executed in parallel, i.e., its side-effects can be perceived in any order as long as there is a corresponding sequential interleaving of operations. I will later refer to the computation of one point in the iteration space of the loop as a thread of execution. Figuratively, it corresponds to executing the body of the loop for one valid set of iv values.

If we also consider CPUs as lowering targets, we may want to look at their memory models. IIRC Power and ARM don't guarantee a global total order of stores (https://www.youtube.com/watch?v=VogqOscJYvk).

I agree with Sanjoy's example (of exchanging store orders), in that for architecture inclusiveness we don't need to over specify the ordering. 

This is tricky to specify at the level of abstraction that the parallel loop is. I want to essentially express that threads can run in any order with their instructions interleaved in any way but that instructions within a thread cannot be reordered. Sanjoy's suggested rewrite should be possible if you can prove that changing the order of stores does not change the expected result but the parallel loop by itself does not allow the reordering. On the other hand, lowerings to some architecture can change the order of instructions by combining threads (like vectorizing) or running them all sequentially. That is allowed within the semantics of the loop.

Now, memory models. My (potentially naive) approach has been to simply make this orthogonal assuming that hardware can guarantee ordering within a single thread and I would be happy to weaken the requirements for loop.parallel to essentially state that any execution order is valid as long as it keeps the instructions and effects from a single thread consistent with that thread itself locally but not globally. Would this cover all cases?
The barrier is also useful beyond fusion for expressing things like explicit prefetch. Also, I do not see how putting barrier into a side-dialect would make a difference. Even if it is in the loop dialect, lowerings to targets that do not support it may not implement it. You won't be able to use rewritings that produce it but that would be the same if the barrier operation was in a different dialect.
The goal of having a reduction in the parallel loop is to not nail down the implementation strategy for the reduction. One could perfectly well translate a yield into a store into a temporary buffer (assuming the computation of the operand is non-trivial) and then have a sequence of kernels to do tree-style reduce. The reduce might also be nested inside a parallel outer loop, in which case it could turn into a sequential thing.
My view here is that once you lower it to the level of atomics, you essentially have made the decision on what kind of hardware you are expecting. That is also a valid approach but it would essentially mean that we lower directly into the gpu dialect from higher-level dialects.

Cheers
  Stephan
 

--
Stephan

--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/CAFPX7RLpMw1N7Rr-%2BnBzD9KMzKFZZ4tPHB7K%3DQpZ86RvYs-jOg%40mail.gmail.com.

--
Regards,
Tim Shen


--
Stephan

Sanjoy Das

unread,
Nov 27, 2019, 2:04:55 PM11/27/19
to Stephan Herhut, Tim Shen, MLIR
On Wed, Nov 27, 2019 at 10:40 AM 'Stephan Herhut' via MLIR
<ml...@tensorflow.org> wrote:
>> If we also consider CPUs as lowering targets, we may want to look at their memory models. IIRC Power and ARM don't guarantee a global total order of stores (https://www.youtube.com/watch?v=VogqOscJYvk).
>>
>> I agree with Sanjoy's example (of exchanging store orders), in that for architecture inclusiveness we don't need to over specify the ordering.
>
>
> This is tricky to specify at the level of abstraction that the parallel loop is. I want to essentially express that threads can run in any order with their instructions interleaved in any way but that instructions within a thread cannot be reordered. Sanjoy's suggested rewrite should be possible if you can prove that changing the order of stores does not change the expected result but the parallel loop by itself does not allow the reordering. On the other hand, lowerings to some architecture can change the order of instructions by combining threads (like vectorizing) or running them all sequentially. That is allowed within the semantics of the loop.
>
> Now, memory models. My (potentially naive) approach has been to simply make this orthogonal assuming that hardware can guarantee ordering within a single thread and I would be happy to weaken the requirements for loop.parallel to essentially state that any execution order is valid as long as it keeps the instructions and effects from a single thread consistent with that thread itself locally but not globally. Would this cover all cases?

I would prefer the following:

1. Data races on "ordinary" memory locations is undefined behavior
2. Within a *single thread* we guarantee ordering
3. Maybe introduce some way to specially mark "cross-thread" memory
operations that are allowed to race

Without (1) we won't be able to do rematerlization-like transformations. E.g.

int* ptr = .. argument into the loop body
int r0 = *ptr;
if (r0 == 42) {
assert(r0 == 42);
}

=>

int* ptr = .. argument into the loop body
int r0 = *ptr;
if (r0 == 42) {
int r1 = *ptr; // REMAT
assert(r1 == 42);
}

Since some other thread could have racily written to `ptr` between the
time we loaded it the first time and we loaded it the second. This
means in the transformed program the assertion could fail even though
it could never have failed in the original program.

> The goal of having a reduction in the parallel loop is to not nail down the implementation strategy for the reduction. One could perfectly well translate a yield into a store into a temporary buffer (assuming the computation of the operand is non-trivial) and then have a sequence of kernels to do tree-style reduce. The reduce might also be nested inside a parallel outer loop, in which case it could turn into a sequential thing.

Ideally we would want the user (LHLO->PLOOP) to be able to dictate a
reduction strategy (e.g. "use tree reduction") that the codegen
pipeline *has to use*. In theory it is an implementation detail, but
for floating point arithmetic tree reductions are significantly
different than atomic reductions so the user might have a strong
opinion about this.

> My view here is that once you lower it to the level of atomics, you essentially have made the decision on what kind of hardware you are expecting. That is also a valid approach but it would essentially mean that we lower directly into the gpu dialect from higher-level dialects.

+1

-- Sanjoy

Bruestle, Jeremy

unread,
Dec 3, 2019, 1:10:21 PM12/3/19
to Stephan Herhut, MLIR
Just catching up after vacation... I'm definitely in favor of the overall proposal, which is somewhat in line with our work on Stripe. Hopefully I'm not too late to the game. Some comments:

1) The decision as to whether or not to include barriers is a significant choice:

Adding barriers prevents the simple lowering to serial loops. For example, imagine the following case which implements an 'in-place' transpose:

loop.parallel (%i, %j) = (0) to (10) step (1) {
%x = load %ref[%i, %j]
loop.barrier { levels = 1 }
store %x, %ref[%j, %i]
}

Due to the definition and use of the scalar %x across the barrier, this cannot be split into two independent parallel for loops, and the loop as is cannot be lowered to a serial implementation. A more complex lowering to serial form may consider 'spilling' all values that cross the barrier into a newly allocated buffer whose size is the total loop range, but this requires significantly more complexity during lowering. If barrier support was removed, all parallel loops would be able to be safely lowered to serial loops trivially. That said, some hardware (GPU's in particular) do natively support barrier semantics, and it's hard to represent 'fused' parallel loops properly without a barrier instruction. I'm not sure what the right solution here is, I just want to clarify that barriers are a heavyweight concept in terms of semantic impact. Notably, the stipe implementation decided to forgo barriers (see #3 below)

2) Reduction support:

- I think that most reductions are in practice both associative and commutative, and some methods of implementing reductions will benefit from allowing the reduction to be commutative. For example, multiple threads and atomic in-memory add require commutative reductions, since the order the additions across threads is arbitrary.
- I think that nothing should be specified regarding the specific order or method of performing the reduction. In cases where the reduction is truly exactly commutative and associative (say integer addition), there is no reason to limit the method of reduction. Where reduction is 'weakly' associative (say floating point addition), it's true that the method of reduction can effect results, but if your problem is truly numerically unstable enough that the order matters, you need to be using other techniques.
- It may be useful to specify or require an 'identity' value (i.e. 0 in the case of add, -inf in the case of max), to be specified for an reduction. There are two possible reasons: A) In the case of a parallel for loop with a trip-count of zero, the result of the reduction is otherwise unspecified, B) In some implementations (say threaded atomic add again) either the output must be pre-initialized to the identity value, or one trip of the parallel loop must be peeled off to estabilish an initial value.

Presuming we add identity, I would suggest describing the semantics as follows:

The output of the reduction must be equivalent the evaluation of some tree of calls to the reduction region, where there may be any number of identity leaf nodes, and each value yielded must appears in exactly one leaf. If there are zero values yielded, the result must be the identity. Perhaps the wording there is a bit turgid, but conceptually what I'm trying to convey is: order is arbitrary (as is associativity) and any number of 'identity' elements may be combined in. The second bit is to allow hardware that has say multiple accumulators (each of which is independently initialized to the identity) which are combined at the end (perhaps by say a tree rollup) to be valid.

Of course for normal reductions (say integer addition), all of the above is trivially true, but if we want to support arbitrary reductions via a region as opposed to fixed set, I think it's important to clarify just exactly how much flexibility we want to allow in terms of hardware specific lowering. In my opinion, we should provide maximal flexibility. Now of course, as defined above if you do use an inexactly associative reduction like floating point addition, you may be any order of additions, but I think once you allow any parallel reduction at all, you've pretty much decided that such a thing is acceptable, and to require deterministic behavior in such a case basically prevents most real world use cases and optimizations almost immediately. Note: the additional requirements of an identity element per above have no effect of floating point add, as 0.0 is an exact identity in floating point addition anyway.

3) Comparison to Stripe/Tile

As a side note, Stripe and Tile implement parallel loops and reductions, in a similar but distinct way.

- Our parallel loop has basically identical semantics, other than that we have chosen not to support barriers, since lowering to serial implementation is very important. Instead, we do any final barrier requiring fusions post lowering to GPU.

- We support an 'in-memory' reduction semantics, where there is an 'aggregate' instruction similar to 'store' but with apparently 'atomic' in memory reductions. These reductions currently come from a fixed set (each with it's own identity values), although we have considered the region based approach. However we decided against it due to the fact that we had no specific use cases, and it makes pattern matching much harder to do with DRR. We semantically allow arbitrary reordering, changes in associativity, etc, and despite looking like 'in-memory' reductions, in practice most aggregations are lowered to some in-register operations, adder trees, etc, since basically none of the platforms of interest support atomic floating point addition (the common case) The advantage of the 'in-memory' reduction semantics is that it allows a single parallel loops to produce multiple independent reductions. It does require some complex aliasing analysis however, which we can get away with due to the affine nature of the accesses.

-Jeremy

Sanjoy Das

unread,
Dec 3, 2019, 2:23:18 PM12/3/19
to Bruestle, Jeremy, Stephan Herhut, MLIR, George Karpenkov
On Tue, Dec 3, 2019 at 10:10 AM Bruestle, Jeremy
<jeremy....@intel.com> wrote:
> Of course for normal reductions (say integer addition), all of the above is trivially true, but if we want to support arbitrary reductions via a region as opposed to fixed set, I think it's important to clarify just exactly how much flexibility we want to allow in terms of hardware specific lowering. In my opinion, we should provide maximal flexibility. Now of course, as defined above if you do use an inexactly associative reduction like floating point addition, you may be any order of additions, but I think once you allow any parallel reduction at all, you've pretty much decided that such a thing is acceptable, and to require deterministic behavior in such a case basically prevents most real world use cases and optimizations almost immediately. Note: the additional requirements of an identity element per above have no effect of floating point add, as 0.0 is an exact identity in floating point addition anyway.

+1

This is what I had in mind when I suggested the executable semantics
of loop.yield be

with_lock {
// red_fn is the reduction function. The pre and post loops
"insert" an arbitrary number of identity elements into the reduction.

for an arbitrary number of times {
*accum = red_fn(*accum, %init)
}

*accum = red_fn(*accum, %val)

for an arbitrary number of times {
*accum = red_fn(*accum, %init)
}
}

There is another hair to split here: can each dynamic instance of the
reduction produce a different value (i.e. there is "run to run"
non-determinism of the same compiled binary) or is every execution
expected to produce the same value, even if is the result of arbitrary
tiling optimizations that can change in unpredictable ways across
compiler versions?

I think there is value in having the latter semantic:

1. This is something XLA users have asked for, so we know there is
some demand. +CC George who has looked at this in the context of XLA.
2. It allows us to remat reductions. For instance, if there is
run-to-run non-determinism them we can't optimize "x = reduce(...); if
(not_nan(x) && x != x) { abort(); }" to "x = reduce(...); x_remat =
reduce(...); if (not_nan(x) && x != x_remat) { abort(); }".

But that has a performance penalty (not sure how much).

-- Sanjoy

George Karpenkov

unread,
Dec 3, 2019, 2:47:21 PM12/3/19
to Sanjoy Das, Bruestle, Jeremy, Stephan Herhut, MLIR
> - I think that most reductions are in practice both associative and commutative, and some methods of implementing reductions will benefit from allowing the reduction to be commutative.  For example, multiple threads and atomic in-memory add require commutative reductions, since the order the additions across threads is arbitrary. 

The downside of this approach is that it will lead to lack of run-to-run determinism.

A contract "reduction is deterministic as long as input size is smaller than X" could be very useful to have, than determinism and precision can be recovered by rewriting a reduction into a sequence of reductions using tree reduction (e.g. the reduction is split into small-enough patches, each patch is reduced deterministically into a new buffer, than that new buffer is reduced).
This approach allows to preserve both determinism and performance.

Bruestle, Jeremy

unread,
Dec 3, 2019, 2:52:19 PM12/3/19
to Sanjoy Das, Stephan Herhut, MLIR, George Karpenkov
So requiring reductions to be deterministic across multiple runs does prevent certain reduction techniques (for example, some GPU's support atomic floating point addition, and if one launched multiple work groups accumulating into a single dram location, this would be non-deterministic addition order at runtime). Maybe this could be an attribute since I do see the use case for 'deterministic at runtime, flexible at compile time', and most of the cases we deal could handle such a limitation.
--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/CABBcqdHpM86%2BQ0SZgF3pE4cDeRZW8jEh3-5sWbUzTcY0vMx4fA%40mail.gmail.com.


Bruestle, Jeremy

unread,
Dec 3, 2019, 3:00:17 PM12/3/19
to George Karpenkov, Sanjoy Das, Stephan Herhut, MLIR

To me this feels somewhat like a ‘fast-loose-math’ type of tradeoff to me.  It’s really relevant in floating point only, since integer ops are perfectly commutative and associate anyway, so reduction order is completely hidden from the user.  In many ML use cases, runtime nondeterminism of addition order is absolutely acceptable, and enforcing it strictly in all cases definitely may lead to performance losses.  I feel like it’s use case specific enough regardless that having an attribute which implies strictness may be a reasonable solution.  If it’s on, optimization passes / lowering which would lead to non-determinism would be disabled, otherwise they would be allowed. 

--

You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.

Alexandre Eichenberger

unread,
Dec 3, 2019, 3:06:16 PM12/3/19
to jeremy....@intel.com, ches...@google.com, her...@google.com, ml...@tensorflow.org, san...@google.com
with regards to deterministic reductions.
 
In my experience, customers that are trying to tune their code like deterministic reduction (when using exactly the same number of threads/warps/blocks) when dealing with algorithms where precision of the reductions has an impact on the number of iteration and/or code path. Otherwise, it is hard for them to be sure if an optimization helped or not.
 
In production mode, customers are often willing to have a faster algorithm that are not deterministic with respect to reductions.

Alexandre

-----------------------------------------------------------------------------------------------------
Alexandre Eichenberger, Principal RSM, Advanced Compiler Technologies
- research: compiler optimization (OpenMP, GPU, SIMD)
- info: al...@us.ibm.com http://www.research.ibm.com/people/a/alexe
- phone: 914-945-1812 (work), 914-312-3618 (cell)
 
 
----- Original message -----
From: "Bruestle, Jeremy" <jeremy....@intel.com>
To: Sanjoy Das <san...@google.com>

Stephan Herhut

unread,
Dec 4, 2019, 9:12:39 AM12/4/19
to Bruestle, Jeremy, MLIR
Hi Jeremy. Thanks for the comments! 

On Tue, Dec 3, 2019 at 7:10 PM Bruestle, Jeremy <jeremy....@intel.com> wrote:
Just catching up after vacation... I'm definitely in favor of the overall proposal, which is somewhat in line with our work on Stripe.  Hopefully I'm not too late to the game.  Some comments:

1) The decision as to whether or not to include barriers is a significant choice:

Adding barriers prevents the simple lowering to serial loops.  For example, imagine the following case which implements an 'in-place' transpose:

loop.parallel (%i, %j) = (0) to (10) step (1) {
  %x = load %ref[%i, %j]
  loop.barrier { levels = 1 }
  store %x, %ref[%j, %i]
}

Due to the definition and use of the scalar %x across the barrier, this cannot be split into two independent parallel for loops, and the loop as is cannot be lowered to a serial implementation.  A more complex lowering to serial form may consider 'spilling' all values that cross the barrier into a newly allocated buffer whose size is the total loop range, but this requires significantly more complexity during lowering.  If barrier support was removed, all parallel loops would be able to be safely lowered to serial loops trivially.  That said, some hardware (GPU's in particular) do natively support barrier semantics, and it's hard to represent 'fused' parallel loops properly without a barrier instruction.  I'm not sure what the right solution here is, I just want to clarify that barriers are a heavyweight concept in terms of semantic impact.  Notably, the stipe implementation decided to forgo barriers (see #3 below)

I agree with the notion that barriers are heavyweight and model a behavior that is not available on all targets. They are also somewhat optional. As long as one designs the lowering pipeline such that no higher-level dialect creates barriers one can use the parallel loop without. We could make this a design criterion in that transformations on parallel loops should not rely on barriers unless they are strictly required to express the generated code. As an example, we could have parallel loop fusion that does not insert barriers instead of having a naive fusion that inserts barriers and an optimization to remove them (just an example). 

I will leave out the barrier operation in the first iteration of parallel loops, so we can see whether need arises and how we handle it later.
 
2) Reduction support: 

- I think that most reductions are in practice both associative and commutative, and some methods of implementing reductions will benefit from allowing the reduction to be commutative.  For example, multiple threads and atomic in-memory add require commutative reductions, since the order the additions across threads is arbitrary. 
- I think that nothing should be specified regarding the specific order or method of performing the reduction.  In cases where the reduction is truly exactly commutative and associative (say integer addition), there is no reason to limit the method of reduction.  Where reduction is 'weakly' associative (say floating point addition), it's true that the method of reduction can effect results, but if your problem is truly numerically unstable enough that the order matters, you need to be using other techniques.
- It may be useful to specify or require an 'identity' value (i.e. 0 in the case of add, -inf in the case of max), to be specified for an reduction.  There are two possible reasons:  A) In the case of a parallel for loop with a trip-count of zero, the result of the reduction is otherwise unspecified, B) In some implementations (say threaded atomic add again) either the output must be pre-initialized to the identity value, or one trip of the parallel loop must be peeled off to estabilish an initial value.

Presuming we add identity, I would suggest describing the semantics as follows:

The output of the reduction must be equivalent the evaluation of some tree of calls to the reduction region, where there may be any number of identity leaf nodes, and each value yielded must appears in exactly one leaf.  If there are zero values yielded, the result must be the identity.  Perhaps the wording there is a bit turgid, but conceptually what I'm trying to convey is: order is arbitrary (as is associativity) and any number of 'identity' elements may be combined in.  The second bit is to allow hardware that has say multiple accumulators (each of which is independently initialized to the identity) which are combined at the end (perhaps by say a tree rollup) to be valid.

Of course for normal reductions (say integer addition), all of the above is trivially true, but if we want to support arbitrary reductions via a region as opposed to fixed set, I think it's important to clarify just exactly how much flexibility we want to allow in terms of hardware specific lowering. In my opinion, we should provide maximal flexibility.  Now of course, as defined above if you do use an inexactly associative reduction like floating point addition, you may be any order of additions, but I think once you allow any parallel reduction at all, you've pretty much decided that such a thing is acceptable, and to require deterministic behavior in such a case basically prevents most real world use cases and optimizations almost immediately.  Note:  the additional requirements of an identity element per above have no effect of floating point add, as 0.0 is an exact identity in floating point addition anyway.

The original proposal had the idea of identity in the form of extra arguments to the parallel loop. Something like

%res = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) identity (%id0) {

  %v0 = load %mem[%iv1, %iv2, %iv3]
  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %sum = add %lhs, %rhs
      return %sum
  }  
}

I had originally called this initial value but identity describes it better I think as there is no requirement to use it initially (or at all). An alternative here would be to provide the identity value to the yield operation, something like

%res = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) {
  %v0 = load %mem[%iv1, %iv2, %iv3]
  "loop.yield"(%v0, %id0) {  // <-- identity value here

    bb0 (%lhs, %rhs):
      %sum = add %lhs, %rhs
      return %sum
  }  
}

which moves it somewhat closer to the use. On the other hand, this would allow for the identity value to not be loop invariant (at least based purely on scoping), which I don't like.

3) Comparison to Stripe/Tile

As a side note, Stripe and Tile implement parallel loops and reductions, in a similar but distinct way.

- Our parallel loop has basically identical semantics, other than that we have chosen not to support barriers, since lowering to serial implementation is very important.  Instead, we do any final barrier requiring fusions post lowering to GPU.

- We support an 'in-memory' reduction semantics, where there is an 'aggregate' instruction similar to 'store' but with apparently 'atomic' in memory reductions.  These reductions currently come from a fixed set (each with it's own identity values), although we have considered the region based approach.  However we decided against it due to the fact that we had no specific use cases, and it makes pattern matching much harder to do with DRR. 

I assume easier pattern matching here refers to fixed ops vs. regions? That is true. Would it be good enough for pattern matching to expect a form where the body of the yield is essentially a function call? We could also add a version of yield that gets a function as attribute but then has an empty region. 
 
We semantically allow arbitrary reordering, changes in associativity, etc, and despite looking like 'in-memory' reductions, in practice most aggregations are lowered to some in-register operations, adder trees, etc, since basically none of the platforms of interest support atomic floating point addition (the common case) The advantage of the 'in-memory' reduction semantics is that it allows a single parallel loops to produce multiple independent reductions.  It does require some complex aliasing analysis however, which we can get away with due to the affine nature of the accesses.

This is interesting. The proposal supports multiple reductions via multiple yield operations in a single body. I am not sure that is the same as what your approach supports. Do you see any use cases from your experience that could not be modelled with multiple yields? So in essence, you can have 

%prod, %sum = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) identity (%id0, %id1) {

  %v0 = load %mem[%iv1, %iv2, %iv3]
  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %prod = mul %lhs, %rhs
      return %prod

  }  
  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %sum = add %lhs, %rhs
      return %sum
  }  
}

Thanks
  Stephan
--
Stephan

Stephan Herhut

unread,
Dec 4, 2019, 10:06:36 AM12/4/19
to Alexandre Eichenberger, Bruestle, Jeremy, George Karpenkov, MLIR, Sanjoy Das
Replying to this entire thread of discussion instead of single messages. If I understand concerns correctly, I think having an attribute on the parallel loop and/or yield operation that forces it to use some deterministic order seems the right concept to tackle these issues. Just to reiterate,

- for the XLA reduce example, I would expect the rematerialization to happen at the XLA ops level, so we could tag XLA ops that have been rematerialized accordingly and apply the attribute if they get lowered to parallel loops (ignoring specifics for now). 
- if one wants to implement tree reduction and rely on the smaller reductions to be deterministic, those could be tagged, as well.
- for performance benchmarking, this could also be exposed by pragmas or the like.

There will be no promises wrt. performance (e.g., adding the attribute might make things slower) and we also cannot really enforce that all lowerings from the parallel loop operation will support this. Still makes sense to spec it so we have a common mechanism.

Cheers
  Stephan

--
Stephan

Alexandre Eichenberger

unread,
Dec 4, 2019, 10:29:51 AM12/4/19
to her...@google.com, ches...@google.com, jeremy....@intel.com, ml...@tensorflow.org, san...@google.com

parallel proposal

 

When designing a parallel abstraction, we have some basic design choices. Here are two.

 

GPU-centric vs Target agnostic:

 

Code such as below are inherently GPU-centric as it assume that each iteration is mapped to a single hardware thread:

 

loop.parallel (%i, %j) = (0) to (10) step (1) {

    %x = load %ref[%i, %j]

    loop.barrier { levels = 1 }

    store %x, %ref[%j, %i]

}

 

as if one thread was mapped two iterations, you would have a deadlock as the barrier expect the computations prior to the barrier to be completed for all iterations. It is always possible to map this example to something that would work on a CPU by splitting the loop, but then the simple loop implied by the construct is not so simple anymore. I.e. there is a gap between what is described with the actual behavior.

 

High-level vs multi-level

 

Consider a reduction on a GPU. We may have a high level way to say "reduce this across warps and thread-blocks" and be done with it (aka a high-level representation). Alternatively, we may want a representation that also let us lower such reduction into a lower optimized form where we would have each warp performing a reduction on the data assigned to them, have a phase where data is combined to one value per thread block, and finally having one value for all thread blocks. 

 

For CPUs, we may want to express some corse parallelism for tiles, and then using threads within a core to process in parallel work within a tile.

 

Such multi-level approach forces you to have nested parallelism with barriers. 

 

Suggestion

 

There is nothing wrong with doing a GPU-centric version of parallel loop, it would probably be more appropriate in a GPU-specific dialect.

 

I would argue for a target-agnostic representation, structured in a way to map relatively easily into GPU-specific loops, CPUs, as well as more specialized hardware such as systolic arrays.

 

If we go that route, probably having two types of parallel loops, one with a barrier joining all the thread at the end, and one without, that would be a good start. Using Nicolas' suggested notation:

 

loop.parallel (%i) = (0) to (10) by (1) on (%proc_id) out_of (%num_proc) { A }

 

would execute the work A for all iterations and then when each participating thread has completed, they would wait in a barrier (e.g. a kernel at the thread-block level, or a CUDA syncthreads for threads in all warps.

 

A non blocking version would be

 

loop.parallel_nowait  (%j) = (0) to (10) by (1) on (%proc_id) out_of (%num_proc) { B }

 

where threads don’t synchronize at the end of the parallel loop.

 

Loops with multiple indices:

 

loo.parallel(%i, %j) = (0, 0) to (10, 10) by (1,1) {A}

 

corresponds to

 

loo.parallel(%i) = (0) to (10) by (1) {

      loop.parallel_nowait  (%j) = (0) to (10) by (1) {

        A

    }

}

 

where the above is essentially a CUDA kernel if the outer loop is iterating over thread blocks and the inner loop is iterating over the warps. 

 

Loops with no sequential code between parallel regions are “perfect” GPU code; sequences of inner parallel loops can be merged into one big inner loop with CUDA syncthreads when there is one thread per data point.

Mehdi AMINI

unread,
Dec 4, 2019, 11:23:21 AM12/4/19
to Stephan Herhut, Bruestle, Jeremy, MLIR
On Wed, Dec 4, 2019 at 3:12 PM 'Stephan Herhut' via MLIR <ml...@tensorflow.org> wrote:
Hi Jeremy. Thanks for the comments! 

On Tue, Dec 3, 2019 at 7:10 PM Bruestle, Jeremy <jeremy....@intel.com> wrote:
Just catching up after vacation... I'm definitely in favor of the overall proposal, which is somewhat in line with our work on Stripe.  Hopefully I'm not too late to the game.  Some comments:

1) The decision as to whether or not to include barriers is a significant choice:

Adding barriers prevents the simple lowering to serial loops.  For example, imagine the following case which implements an 'in-place' transpose:

loop.parallel (%i, %j) = (0) to (10) step (1) {
  %x = load %ref[%i, %j]
  loop.barrier { levels = 1 }
  store %x, %ref[%j, %i]
}

Due to the definition and use of the scalar %x across the barrier, this cannot be split into two independent parallel for loops, and the loop as is cannot be lowered to a serial implementation.  A more complex lowering to serial form may consider 'spilling' all values that cross the barrier into a newly allocated buffer whose size is the total loop range, but this requires significantly more complexity during lowering.  If barrier support was removed, all parallel loops would be able to be safely lowered to serial loops trivially.  That said, some hardware (GPU's in particular) do natively support barrier semantics, and it's hard to represent 'fused' parallel loops properly without a barrier instruction.  I'm not sure what the right solution here is, I just want to clarify that barriers are a heavyweight concept in terms of semantic impact.  Notably, the stipe implementation decided to forgo barriers (see #3 below)

I agree with the notion that barriers are heavyweight and model a behavior that is not available on all targets. They are also somewhat optional. As long as one designs the lowering pipeline such that no higher-level dialect creates barriers one can use the parallel loop without. We could make this a design criterion in that transformations on parallel loops should not rely on barriers unless they are strictly required to express the generated code. As an example, we could have parallel loop fusion that does not insert barriers instead of having a naive fusion that inserts barriers and an optimization to remove them (just an example). 

Is barrier pessimizing anything fundamentally? I understand that some extra memory can be required, but that seems like a fundamental problem: if you split into two loops instead of having as barrier, you also need the buffer to materialize the data reuse from the first loop to the second one, wouldn't you?

A fundamental question to me is: are there cases where you *can't* perform this transformations?

-- 
Mehdi

 
--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.

Mehdi AMINI

unread,
Dec 4, 2019, 11:39:28 AM12/4/19
to Alexandre Eichenberger, Stephan Herhut, George Karpenkov, Bruestle, Jeremy, MLIR, Sanjoy Das
On Wed, Dec 4, 2019 at 4:29 PM Alexandre Eichenberger <al...@us.ibm.com> wrote:

parallel proposal

 

When designing a parallel abstraction, we have some basic design choices. Here are two.

 

GPU-centric vs Target agnostic:

 

Code such as below are inherently GPU-centric as it assume that each iteration is mapped to a single hardware thread:

 

loop.parallel (%i, %j) = (0) to (10) step (1) {

    %x = load %ref[%i, %j]

    loop.barrier { levels = 1 }

    store %x, %ref[%j, %i]

}

 

as if one thread was mapped two iterations, you would have a deadlock as the barrier expect the computations prior to the barrier to be completed for all iterations. It is always possible to map this example to something that would work on a CPU by splitting the loop, but then the simple loop implied by the construct is not so simple anymore. I.e. there is a gap between what is described with the actual behavior.


The "gap" can be intended: it seems common that the high-level representation differs from the target specific execution and the lowering is making it up for. Having this representation at a high-level does not preclude from splitting after and performing other optimizations later after the target properties is known. But the strategy to materialize the lower level can be expressed in a target-specific (there could be multiple strategy for lowering from this).
Also, I am not sure why you're seeing it at this level as a "GPU centric" representation as is? As far as I know, in general you cannot have a global synchronization on the GPU (at least in the regular OpenCL/Cuda model) without breaking it into two kernel launch (so splitting the loops into two just like on a CPU). 
Adding the processor structure as you mention later (from Nicolas earlier) would indeed make it more like a GPU specific representation, but I see this as a lower-level representation though.

With known loop bounds and target specific knowledge, one could have this lowered in a single loop (both on CPU and GPU depending on what you know about the execution environment).

-- 
Mehdi


Alexandre Eichenberger

unread,
Dec 4, 2019, 12:07:21 PM12/4/19
to joke...@gmail.com, ches...@google.com, her...@google.com, jeremy....@intel.com, ml...@tensorflow.org, san...@google.com
Mehdi,
 
My experience is tainted by OpenMP. For the code below to work, you have to have one independent thread per loop iteration. This is the case for GPU code as they deal well with large number of threads. This is not the case for CPU or other accelerators with smaller number of threads, where typically a block of iterations are mapped to a single thread. In this second case, the code needs to be eventually lowered to something like this
 
// assuming exactly 4 threads
 

loop.parallel (%i0, %j0) = (0) to (10) step (5) {

    %x = allocate array of 5x5 ints
    for (%i1, %j1) = (0,0) to (5,5) {

        %x[%i1, %j1] = load %ref[%i0+%i1, %j0 + %j1]

    }

    loop.barrier { levels = 1 }    

    for (%i1, %j1) = (0,0) to (5,5) {

      store %x[%i1, %j1], %ref[%i0+%i1, %j0+%j1i]

    }
    free %x

}

Namely scalar expand %x, create an outerloop that has one iteration per available thread, iterate with an innerloop for all data points mapped to that thread.
 
This is why I do not consider this representation to be neutral with respect to thread-limited processors/accelerators. It is a perfectly fine representation if the underlying assumptions are well understood. 

Bruestle, Jeremy

unread,
Dec 4, 2019, 12:36:29 PM12/4/19
to Mehdi AMINI, Stephan Herhut, MLIR

 

 

From: Mehdi AMINI <joke...@gmail.com>
Date: Wednesday, December 4, 2019 at 8:23 AM
To: Stephan Herhut <her...@google.com>
Cc: "Bruestle, Jeremy" <jeremy....@intel.com>, MLIR <ml...@tensorflow.org>
Subject: Re: [mlir] [RFC][GPU] Parallel Loop Operation/Dialect

 

On Wed, Dec 4, 2019 at 3:12 PM 'Stephan Herhut' via MLIR <ml...@tensorflow.org> wrote:

Hi Jeremy. Thanks for the comments! 

 

On Tue, Dec 3, 2019 at 7:10 PM Bruestle, Jeremy <jeremy....@intel.com> wrote:

Just catching up after vacation... I'm definitely in favor of the overall proposal, which is somewhat in line with our work on Stripe.  Hopefully I'm not too late to the game.  Some comments:

1) The decision as to whether or not to include barriers is a significant choice:

Adding barriers prevents the simple lowering to serial loops.  For example, imagine the following case which implements an 'in-place' transpose:

loop.parallel (%i, %j) = (0) to (10) step (1) {
  %x = load %ref[%i, %j]
  loop.barrier { levels = 1 }
  store %x, %ref[%j, %i]
}

Due to the definition and use of the scalar %x across the barrier, this cannot be split into two independent parallel for loops, and the loop as is cannot be lowered to a serial implementation.  A more complex lowering to serial form may consider 'spilling' all values that cross the barrier into a newly allocated buffer whose size is the total loop range, but this requires significantly more complexity during lowering.  If barrier support was removed, all parallel loops would be able to be safely lowered to serial loops trivially.  That said, some hardware (GPU's in particular) do natively support barrier semantics, and it's hard to represent 'fused' parallel loops properly without a barrier instruction.  I'm not sure what the right solution here is, I just want to clarify that barriers are a heavyweight concept in terms of semantic impact.  Notably, the stipe implementation decided to forgo barriers (see #3 below)

 

I agree with the notion that barriers are heavyweight and model a behavior that is not available on all targets. They are also somewhat optional. As long as one designs the lowering pipeline such that no higher-level dialect creates barriers one can use the parallel loop without. We could make this a design criterion in that transformations on parallel loops should not rely on barriers unless they are strictly required to express the generated code. As an example, we could have parallel loop fusion that does not insert barriers instead of having a naive fusion that inserts barriers and an optimization to remove them (just an example). 

 

Is barrier pessimizing anything fundamentally? I understand that some extra memory can be required, but that seems like a fundamental problem: if you split into two loops instead of having as barrier, you also need the buffer to materialize the data reuse from the first loop to the second one, wouldn't you?

 

A fundamental question to me is: are there cases where you *can't* perform this transformations?

 

-- 

Mehdi

 

I believe such a transformation can always be done (or at least I can’t think of any counter-examples), however I would argue that barriers (especially level=2 or above) complicate all of the analysis as well as lowering extensively, and moreover, barriers of this sort are really only meaningful in GPU like parallelism (and even then, only within a given work group).  Having thought about it more, I’m now of the opinion that making the general case less complex and more amenable to analysis  (i.e. removing barriers) and moving barrier semantics into the GPU specific dialect is probably a better way to go.  Additionally, any ‘loop splitting’ algorithm would need to make concrete buffers at some point, which including thing like deciding layout, etc.  In the GPU case, you could use loop fusion with a barrier after lowering, which is even easier than normal loop fusion, and additionally only ‘removes’ buffers so pass ordering with respect to layout is more flexible.  In most cases, the original representation is unfused anyway, so it also seems more natural in terms of overall pass/lowering flow to add loop fusion than to make a ‘loop splitting’.

 

-Jeremy

Stephan Herhut

unread,
Dec 4, 2019, 12:37:24 PM12/4/19
to Alexandre Eichenberger, Nicolas Vasilache, George Karpenkov, Bruestle, Jeremy, MLIR, Sanjoy Das
On Wed, Dec 4, 2019 at 4:29 PM Alexandre Eichenberger <al...@us.ibm.com> wrote:

parallel proposal

 

When designing a parallel abstraction, we have some basic design choices. Here are two.

 

GPU-centric vs Target agnostic:

 

Code such as below are inherently GPU-centric as it assume that each iteration is mapped to a single hardware thread:

 

loop.parallel (%i, %j) = (0) to (10) step (1) {

    %x = load %ref[%i, %j]

    loop.barrier { levels = 1 }

    store %x, %ref[%j, %i]

}

 

as if one thread was mapped two iterations, you would have a deadlock as the barrier expect the computations prior to the barrier to be completed for all iterations. It is always possible to map this example to something that would work on a CPU by splitting the loop, but then the simple loop implied by the construct is not so simple anymore. I.e. there is a gap between what is described with the actual behavior. 


I don't see the barrier as GPU specific. What I agree with is that more uses of the barrier have a direct mapping to GPU than to CPU. In its most general form, it is not supported by any platform without splitting loops. Even for GPUs, one might want to tile multiple levels to reduce the number of threads and hence will need to split the loop and materialize values. 
I also see the barrier as orthogonal to the parallel loop itself. One can perfectly well use parallel loop without using barriers.
 

High-level vs multi-level

 

Consider a reduction on a GPU. We may have a high level way to say "reduce this across warps and thread-blocks" and be done with it (aka a high-level representation). Alternatively, we may want a representation that also let us lower such reduction into a lower optimized form where we would have each warp performing a reduction on the data assigned to them, have a phase where data is combined to one value per thread block, and finally having one value for all thread blocks. 

 

For CPUs, we may want to express some corse parallelism for tiles, and then using threads within a core to process in parallel work within a tile.

 

Such multi-level approach forces you to have nested parallelism with barriers.  


I would argue that this is a specific lowering (or implementation detail) of parallel execution on CPUs. Can this not be modelled in the same way that it is on GPUs? The difference being that the inner loops after tiling are mapped to sequential loops. Something like

loop.parallel (%iv1, %iv2) [...] { A }

could turn into (ignoring boundaries for now)

loop.parallel (%iv1_s) = (%lb1) to (%ub1) step (%ub1 / proc_ids) {
  for %inner = 0 to %ub1 / proc_ids { 
  %iv1 = %iv1_s + %inner;
    for %iv2 = %lb2 to %ub2 step %s2 {
      A
    }
  }
}

where now only the outermost dimension is parallel and has proc_ids many "threads". In a next lowering step, one could then offload the parallel op to workers. This could have been done gradually by a transformation that can tile loop.parallel in interesting ways and a second one that turns some loop.parallel into sequential loops.
 

Suggestion

 

There is nothing wrong with doing a GPU-centric version of parallel loop, it would probably be more appropriate in a GPU-specific dialect.

 

I would argue for a target-agnostic representation, structured in a way to map relatively easily into GPU-specific loops, CPUs, as well as more specialized hardware such as systolic arrays.

 

If we go that route, probably having two types of parallel loops, one with a barrier joining all the thread at the end, and one without, that would be a good start. Using Nicolas' suggested notation:

 

loop.parallel (%i) = (0) to (10) by (1) on (%proc_id) out_of (%num_proc) { A } 


I do not understand the meaning of the "on (%proc_id) out_of (%num_proc)" part here. 
 

would execute the work A for all iterations and then when each participating thread has completed, they would wait in a barrier (e.g. a kernel at the thread-block level, or a CUDA syncthreads for threads in all warps.

 

A non blocking version would be

 

loop.parallel_nowait  (%j) = (0) to (10) by (1) on (%proc_id) out_of (%num_proc) { B }

 

where threads don’t synchronize at the end of the parallel loop.

 

Loops with multiple indices:

 

loo.parallel(%i, %j) = (0, 0) to (10, 10) by (1,1) {A}

 

corresponds to

 

loo.parallel(%i) = (0) to (10) by (1) {

      loop.parallel_nowait  (%j) = (0) to (10) by (1) {

        A

    }

}

 

where the above is essentially a CUDA kernel if the outer loop is iterating over thread blocks and the inner loop is iterating over the warps. 

 

Loops with no sequential code between parallel regions are “perfect” GPU code; sequences of inner parallel loops can be merged into one big inner loop with CUDA syncthreads when there is one thread per data point.


Why is the second form easier to lower to CPU? I think it is more general in what can be expressed but I have to play a bit with it. What is the difference between

loo.parallel(%i) = (0) to (10) by (1) {     
  loop.parallel_nowait  (%j) = (0) to (10) by (1) {
    A
  }
}

and 

loo.parallel(%i) = (0) to (10) by (1) {     
  loop.parallel  (%j) = (0) to (10) by (1) {
    A
  }
}

If I understand correctly, the outer loop can start all instances of the inner loop in both cases. So in both cases all threads can run fully in parallel. If we add a sequential instruction into top variant like 

loo.parallel(%i) = (0) to (10) by (1) {     
  loop.parallel_nowait  (%j) = (0) to (10) by (1) {
    A
  }
  some_op();
}

what would semantics be? Does some_op run concurrently with all iterations of the inner parallel loop? Or does it run as soon as one iteration completed?

Thanks
  Stephan


--
Stephan

Stephan Herhut

unread,
Dec 4, 2019, 12:50:33 PM12/4/19
to Bruestle, Jeremy, Mehdi AMINI, MLIR
I agree with the assessment in general wrt. loop fusion. I just wonder whether there are targets other than gpu that could make use of barriers so that it makes sense to express this on the parallel loop level. I also think it would be fair for transformations/lowerings to not support barriers. The idea being that after certain transformations (like fusing loops which introduces barriers) one can no longer perform others.

Cheers
  Stephan

--
Stephan

Bruestle, Jeremy

unread,
Dec 4, 2019, 1:20:49 PM12/4/19
to Stephan Herhut, MLIR

 

 

From: Stephan Herhut <her...@google.com>
Date: Wednesday, December 4, 2019 at 6:12 AM
To: "Bruestle, Jeremy" <jeremy....@intel.com>
Cc: MLIR <ml...@tensorflow.org>
Subject: Re: [mlir] [RFC][GPU] Parallel Loop Operation/Dialect

 

Hi Jeremy. Thanks for the comments! 

 

I prefer the first one, since the idea of a non-loop-invariant identity seems like madness to me, although it is a bit less syntactically local unfortunately.

 

3) Comparison to Stripe/Tile

As a side note, Stripe and Tile implement parallel loops and reductions, in a similar but distinct way.

- Our parallel loop has basically identical semantics, other than that we have chosen not to support barriers, since lowering to serial implementation is very important.  Instead, we do any final barrier requiring fusions post lowering to GPU.

- We support an 'in-memory' reduction semantics, where there is an 'aggregate' instruction similar to 'store' but with apparently 'atomic' in memory reductions.  These reductions currently come from a fixed set (each with it's own identity values), although we have considered the region based approach.  However we decided against it due to the fact that we had no specific use cases, and it makes pattern matching much harder to do with DRR. 

 

I assume easier pattern matching here refers to fixed ops vs. regions? That is true. Would it be good enough for pattern matching to expect a form where the body of the yield is essentially a function call? We could also add a version of yield that gets a function as attribute but then has an empty region. 

 

Yes, currently matching inner regions in DRR has been causing us issues, and we recently rewrote one of our core constructions to be regionless to reduce complexity.  However I imagine DRR will improve in regard to support for regions.  Still, if you are lowering to hardware that has special support some fixed set of reductions, you will always end up pattern matching back into what is effectively an enum anyway.  The function as an attribute is interesting to me, and actually, it brings up a thought:  Isn’t a reduction effectively a combination of the actual binary reduction function *and* the identity?  Maybe some method where reductions were declared elsewhere and then only referenced in the yield would help both pattern matching and handling of identity.  Most reductions will be used over and over again throughout a program (and in fact, in most ML use cases, all reduction will be max or add).

 

We semantically allow arbitrary reordering, changes in associativity, etc, and despite looking like 'in-memory' reductions, in practice most aggregations are lowered to some in-register operations, adder trees, etc, since basically none of the platforms of interest support atomic floating point addition (the common case) The advantage of the 'in-memory' reduction semantics is that it allows a single parallel loops to produce multiple independent reductions.  It does require some complex aliasing analysis however, which we can get away with due to the affine nature of the accesses.

 

This is interesting. The proposal supports multiple reductions via multiple yield operations in a single body. I am not sure that is the same as what your approach supports. Do you see any use cases from your experience that could not be modelled with multiple yields? So in essence, you can have 

 

%prod, %sum = loop.parallel (%iv1, %iv2, %iv3) = (%lb0, %lb1, %lb2) to (%ub0, %ub1, %ub2) step (%s0, %s1, %s2) identity (%id0, %id1) {
  %v0 = load %mem[%iv1, %iv2, %iv3]

  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %prod = mul %lhs, %rhs
      return %prod
  }  
  "loop.yield"(%v0) {
    bb0 (%lhs, %rhs):
      %sum = add %lhs, %rhs
      return %sum
  }  
}

 

Let me begin by saying that what follows in purely an aside, as I think the semantics for the proposal are good, and our use case is distinct and I wouldn’t suggest adopting it.  That said…

 

So we allow something roughly like (in a close version of the new syntax):

 

loop.parallel (%i, %j) = (0, 0) to (10, 10) step(1, 1) {
  %v0 = load %mat[%i, %j]
  aggregate add %vec[%i], %v0

}

Which does reduces a matrix into a vector.

 

This can be implementing in the existing proposal as:

 

loop.parallel (%i) = (0) to (10) step(1) {
  %sum = loop.parallel (%j) = (0) to (10) step(1) identity(0) {
    %v0 = load %mat[%i, %j]

    loop.yield (%v0) {

    bb0 (%lhs, %rhs):
      %sum = add %lhs, %rhs
      return %sum
    }

  }

  store %vec[%i], %sum  

}

However in general this transformation is complex to implement.  For example, in tile we can actually write (again in a different syntax):

 

loop.parallel (%i, %j) = (0, 0) to (10, 10) step(1, 1) {
  %v0 = load %mat[%i, %j]
  aggregate add %vec[%i + %j], %v0

}

This can be converted into two-loop form (and indeed we do eventually do that), however it requires an affine transformation of the index space along the lines of:

 

%x = %i + %j

%y = %j

 

So that

 

%i -> %x - %y

%j -> %y

 

Additionally, constraints must be added as well to limit loop bound on the reading side, and loop sizes change so we get something like:

 

loop.parallel (%x) = (0) to (19) step(1) {

  %bound = max %x, 10
  %sum = loop.parallel (%y) = (0) to (%bound) step(1) identity(0) {
    %v0 = load %mat[%x - %y, %y]

    loop.yield (%v0) {

    bb0 (%lhs, %rhs):
      %sum = add %lhs, %rhs
      return %sum
    }

  }

  store %vec[%x], %sum  

}

Now, I’m not in any way suggesting we should adopt something like our ‘aggregate’ for this proposal.  In fact, I think it’s a poor choice, as lowering ‘aggregate’ to efficient code is complex and relies on a number of ‘affine/polyhedral’ tricks and probably should be kept in tile/stripe only, or other similar dialects.  I just think it’s an interesting thing to compare/contrast.

 

-Jeremy 

 

 

Alexandre Eichenberger

unread,
Dec 4, 2019, 2:28:06 PM12/4/19
to her...@google.com, ches...@google.com, jeremy....@intel.com, ml...@tensorflow.org, n...@google.com, san...@google.com
Responses below

Alexandre

-----------------------------------------------------------------------------------------------------
Alexandre Eichenberger, Principal RSM, Advanced Compiler Technologies
- research: compiler optimization (OpenMP, GPU, SIMD)
- info: al...@us.ibm.com http://www.research.ibm.com/people/a/alexe
- phone: 914-945-1812 (work), 914-312-3618 (cell)
 
 
----- Original message -----
From: Stephan Herhut <her...@google.com>
To: Alexandre Eichenberger <al...@us.ibm.com>, Nicolas Vasilache <n...@google.com>
Cc: George Karpenkov <ches...@google.com>, "Bruestle, Jeremy" <jeremy....@intel.com>, MLIR <ml...@tensorflow.org>, Sanjoy Das <san...@google.com>
Subject: [EXTERNAL] Re: [mlir] [RFC][GPU] Parallel Loop Operation/Dialect
Date: Wed, Dec 4, 2019 12:37 PM
 
On Wed, Dec 4, 2019 at 4:29 PM Alexandre Eichenberger <al...@us.ibm.com> wrote:

parallel proposal

 

When designing a parallel abstraction, we have some basic design choices. Here are two.

 

GPU-centric vs Target agnostic:

 

Code such as below are inherently GPU-centric as it assume that each iteration is mapped to a single hardware thread:

 

loop.parallel (%i, %j) = (0) to (10) step (1) {

    %x = load %ref[%i, %j]

    loop.barrier { levels = 1 }

    store %x, %ref[%j, %i]

}

 

as if one thread was mapped two iterations, you would have a deadlock as the barrier expect the computations prior to the barrier to be completed for all iterations. It is always possible to map this example to something that would work on a CPU by splitting the loop, but then the simple loop implied by the construct is not so simple anymore. I.e. there is a gap between what is described with the actual behavior. 

 
I don't see the barrier as GPU specific. What I agree with is that more uses of the barrier have a direct mapping to GPU than to CPU. In its most general form, it is not supported by any platform without splitting loops. Even for GPUs, one might want to tile multiple levels to reduce the number of threads and hence will need to split the loop and materialize values. 
I also see the barrier as orthogonal to the parallel loop itself. One can perfectly well use parallel loop without using barriers.
 
AEE: Barriers are used both on CPUs and GPUs, absolutely. What is unusual with the above representation is that it reflect a "typical" GPU usage where there is one thread per iteration (with guaranteed progress on the threads). It's a level of abstraction that is more removed from most implementation on CPUs (and GPUs, as you mentioned where you may want to tile for performance). I would rather not have a representation that relies on the one thread per iteration & guaranteed forward progress for all threads.
 

High-level vs multi-level

 

Consider a reduction on a GPU. We may have a high level way to say "reduce this across warps and thread-blocks" and be done with it (aka a high-level representation). Alternatively, we may want a representation that also let us lower such reduction into a lower optimized form where we would have each warp performing a reduction on the data assigned to them, have a phase where data is combined to one value per thread block, and finally having one value for all thread blocks. 

 

For CPUs, we may want to express some corse parallelism for tiles, and then using threads within a core to process in parallel work within a tile.

 

Such multi-level approach forces you to have nested parallelism with barriers.  

 
I would argue that this is a specific lowering (or implementation detail) of parallel execution on CPUs. Can this not be modelled in the same way that it is on GPUs? The difference being that the inner loops after tiling are mapped to sequential loops. Something like
 
loop.parallel (%iv1, %iv2) [...] { A }
 
could turn into (ignoring boundaries for now)
 
loop.parallel (%iv1_s) = (%lb1) to (%ub1) step (%ub1 / proc_ids) {
  for %inner = 0 to %ub1 / proc_ids { 
  %iv1 = %iv1_s + %inner;
    for %iv2 = %lb2 to %ub2 step %s2 {
      A
    }
  }
}
 
where now only the outermost dimension is parallel and has proc_ids many "threads". In a next lowering step, one could then offload the parallel op to workers. This could have been done gradually by a transformation that can tile loop.parallel in interesting ways and a second one that turns some loop.parallel into sequential loops.
 
AEE: I agree with your point. If you wanted the inner loop to go in parallel, e.g. within the threads on a core, or the threads in a warp, you a need nested parallelism representation. You also need to define if there is a barrier at the end of the parallel loop.
 
 

Suggestion

 

There is nothing wrong with doing a GPU-centric version of parallel loop, it would probably be more appropriate in a GPU-specific dialect.

 

I would argue for a target-agnostic representation, structured in a way to map relatively easily into GPU-specific loops, CPUs, as well as more specialized hardware such as systolic arrays.

 

If we go that route, probably having two types of parallel loops, one with a barrier joining all the thread at the end, and one without, that would be a good start. Using Nicolas' suggested notation:

 

loop.parallel (%i) = (0) to (10) by (1) on (%proc_id) out_of (%num_proc) { A } 

 
I do not understand the meaning of the "on (%proc_id) out_of (%num_proc)" part here. 
 
AEE: I reused here the notation from Nicolas in a prior post on this exchange. %num_proc expresses the total number of threads in the parallel loop, and %proc_id provides an id for a given threads (range 0.. %num_proc-1, inclusive)
AEE: the difference is in the number of barriers executed. Assume N parallel threads in the outer loop, and M threads for each of the inner loops. In the second of 3 examples above, N  + N*M barriers are executed. In the first, the two loops can be considered as a single unit that can be mapped to threads. For example, if the two loops were participating in a 2D mapping of tiles, this 2D tiles would be mapped all together to threads without forcing one row of tiles to completely finish before starting the next row. So if you have X threads participating to the two loops, you would have X barriers
 
The semantic of the 3rd example above, some_op() is executed once before guaranteeing that all of the work in the inner loop is completed
 
Thanks
  Stephan
 
 
 
--
Stephan
 

Reply all
Reply to author
Forward
0 new messages