[RFC][GPU] Functions and Memory Modeling


Alex Zinenko

Oct 31, 2019, 1:03:28 PM
to MLIR, Christian Sigg, Mahesh Ravishankar
If you are not interested in the GPU pipeline, you can stop reading now.


We propose to introduce a new function-like operation to the GPU
dialect to make the execution model explicit and to introduce support
for GPU memory hierarchy. This aligns with the overall goal of the
dialect to expose the common abstraction layer for GPU devices, in
particular by providing an MLIR unit of semantics (i.e. an operation)
to define the execution model. This RFC deals with both the execution
model and the memory hierarchy because memory visibility is related to
the execution model.

Function Definition
===================

The operation "gpu.func" defines a function that can be executed on a
GPU. It supports memory attribution and has a particular execution
model.

GPU functions are either kernels or regular functions. The former can
be launched through the "gpu.launch_func" operation. The latter can be
called from kernels and regular GPU functions, and also support memory
attribution. Standard functions can be called from GPU functions. See
below for the call rules.

The memory attribution defines SSA values that correspond to memory
buffers allocated in the memory hierarchy of the GPU (see below).

The operation has one attached region that corresponds to the body of
the function. The region arguments consist of the function arguments
without modification, followed by the buffers defined in the memory
attribution. The body of a GPU function, when launched, is executed by
multiple work items. There are no guarantees on the order in which
work items execute, or on the connection between them. In particular,
work items are not necessarily executed in lock-step. Synchronization
ops such as "gpu.barrier" should be used to coordinate work items.
Declarations of GPU functions, i.e. functions without a body region,
are not supported.

The control flow inside the GPU function is an arbitrary CFG and can
use higher-level control flow ops such as structured conditionals or
loops. Each Op is control-dependent on a set of values; that is,
whether control flow reaches this Op is conditioned on these values.
By default, the control flow depends on the operands of the terminator
Ops in the predecessor blocks and on the operands of the Ops to which
the surrounding regions belong. The terminator or the region-containing
Op can further specify which of its operands affect the control flow.
For example, successors of `std.cond_br` only control-depend on the
value of its first operand, which expresses the condition. Inside a
non-kernel GPU function, an Op is also transitively control-dependent
on any values that the call-site operation is control-dependent on.

An Op can require this set to remain fixed during transformations,
i.e. the control flow cannot be made dependent on more or fewer values
than before the transformation. The “more” part is equivalent to
LLVM’s convergent property [1]; the “less” part is equivalent to its
inverse. The “same control-affecting value set” requirement is
therefore more restrictive than the LLVM model, which has been
demonstrated to be insufficient for some memory-related operations
[2]. The restriction is intended to be decomposed into several
independent elements, possibly along the lines of what is proposed in
[2] and being discussed in LLVM [3], but this relaxation will be
proposed and discussed separately due to its complexity. Any Op that
requires the set of control-affecting values to remain fixed must have
either a "gpu.func" or a "gpu.launch" op as an ancestor.

Within GPU-function bodies, we conservatively assume that
side-effecting (or unregistered) operations require the set of
control-affecting values to remain fixed. Standard and Affine loads
and stores do not require this property.
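
To make this concrete, here is a small hand-written sketch (not part of
the RFC text); the function name is made up, and the standard-dialect ops
are written in their custom form.

gpu.func @illustrate(%cond: i1, %buf: memref<32xf32, 1>, %i: index) {
  // Ops in ^then are control-dependent on %cond only (the first operand
  // of cond_br), not on %buf or %i.
  cond_br %cond, ^then, ^exit
^then:
  // gpu.barrier requires the set {%cond} to stay fixed: a transformation
  // may neither guard it with an additional condition nor remove the
  // existing guard.
  gpu.barrier
  // A plain load carries no such requirement and can move across control
  // flow as usual.
  %v = load %buf[%i] : memref<32xf32, 1>
  br ^exit
^exit:
  gpu.return
}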


Syntax
------

op ::= `gpu.func` symbol-ref-id `(` argument-list `)`
       (`->` function-result-list)?
       memory-attribution `kernel`? function-attributes? region

memory-attribution ::= (`workgroup` `(` ssa-id-and-type-list `)`)?
                       (`private` `(` ssa-id-and-type-list `)`)?

Example
-------

gpu.func @foo(%arg0: index)
    workgroup(%shared: memref<32xf32, 3>)
    private(%private: memref<1xf32, 5>)
    kernel
    attributes {qux: "quux"} {
  gpu.return
}

The generic form illustrates the concept:

"gpu.func"(%arg: index) {sym_name: "foo", kernel, qux: "quux"} ({
^bb0(%arg0: index, %shared: memref<32xf32, 3>, %private: memref<1xf32, 5>):
  "gpu.return"() : () -> ()
}) : (index) -> ()

Note the non-default memory spaces used in memref types in memory-attribution.


Function Calls
==============

GPU functions can be called from other GPU functions through the
"gpu.call" operation. It is similar to "std.call" but supports GPU
functions as callees. "gpu.call" requires transformations not to
modify the set of values it is control-dependent upon, unless the body
of the callee is known not to contain any op that has the same
requirement. In particular, calls to external GPU functions must
remain control-dependent on the same set of values.

GPU functions with "kernel" attribute can be launched from standard
functions using "gpu.launch_func" operation. They cannot be called
directly.
GPU functions cannot be launched from within other GPU functions, even
transitively. An attempt to launch a GPU kernel from within another
kernel leads to undefined behavior. This can be relaxed in the future
to support dynamic kernel launch.

Standard functions can be called from GPU functions using "std.call".

Recursive calls are only allowed for non-kernel GPU functions with no
memory attribution. Rationale: kernel GPU functions are intended to be
launched from the host and cannot call themselves from the device;
memory attribution requires buffers to be live during the function
execution, and a potentially unbounded recursion makes it impossible
to evaluate the total amount of special memory required by the
function.

Syntax
------

op ::= `gpu.call` symbol-ref-id `(` ssa-id-list `)` attribute-dict?
       `:` function-type

Example
-------

gpu.call @foo(%0) : (index) -> ()
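
A slightly fuller sketch (illustrative only; the names @helper,
@main_kernel and @log_value are hypothetical): a kernel calls a regular
GPU function through "gpu.call" and a standard function through
"std.call".

gpu.func @helper(%x: f32) {
  gpu.return
}

gpu.func @main_kernel(%arg: memref<32xf32, 1>) kernel {
  %c0 = constant 0 : index
  %v = load %arg[%c0] : memref<32xf32, 1>
  // Call into another (non-kernel) GPU function.
  gpu.call @helper(%v) : (f32) -> ()
  // Call into a standard function defined elsewhere in the module.
  call @log_value(%v) : (f32) -> ()
  gpu.return
}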


Memory Hierarchy
================

The GPU memory hierarchy is reflected through memref memory spaces. We
use memory space 1 for global memory, 3 for workgroup memory, 4 for
constant memory and 5 for private memory, following the NVVM spec [4].
In the longer term, we should consider redesigning memory spaces in
memref to use strings or other named identifiers so as to avoid
clashes between memory spaces used by different devices. Core IR
changes will be proposed separately.

Data in workgroup memory can be accessed by any work item in the
group, with the same subscripts pointing to the same element of the
memref in all work items. Since work items may be executed
concurrently, a workgroup synchronization may be required to satisfy a
data dependency between multiple work items.

Data in private memory can only be accessed by one work item. The same
subscripts used to access an element of a memref in private memory
will point to (conceptually) different addresses for different work
items.

For both memory spaces, data may be accessed using existing load and
store operations, e.g. the standard or affine ones.

Other memory types (constant, texture) can be added in the future.
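
As an illustration of the memory-space encoding (a hand-written sketch,
not from the RFC; the name @scale is made up), a non-kernel GPU function
may receive a global-memory buffer as an argument while declaring
workgroup and private buffers through attribution:

gpu.func @scale(%in: memref<1024xf32, 1>, %i: index)
    workgroup(%wg: memref<32xf32, 3>)
    private(%priv: memref<1xf32, 5>) {
  // Assume %i is a per-work-item index in [0, 32).
  %v = load %in[%i] : memref<1024xf32, 1>   // global memory, space 1
  store %v, %wg[%i] : memref<32xf32, 3>     // workgroup memory, space 3
  %c0 = constant 0 : index
  store %v, %priv[%c0] : memref<1xf32, 5>   // private memory, space 5
  gpu.return
}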


Memory Attribution
==================

Memory buffers are defined at the function level, either in
"gpu.launch" or in "gpu.func" ops. This encoding makes it clear where
the memory belongs and makes the lifetime of the memory visible. The
memory is only accessible while the kernel is launched or the function
is currently invoked (i.e. there is an activation on the stack). The
latter is stricter than actual GPU implementations, but using static
memory at the function level is just a convenience. It is also always
possible to pass pointers to workgroup memory into other functions,
provided they expect the correct memory space.
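
For instance (a hypothetical sketch; @reduce_tile and @kern are made-up
names), a buffer declared in the workgroup attribution of a kernel can be
passed to a callee whose argument type carries the same memory space:

gpu.func @reduce_tile(%tile: memref<32xf32, 3>) {
  // ... operates on the caller's workgroup buffer ...
  gpu.return
}

gpu.func @kern(%in: memref<32xf32, 1>, %tid: index)
    workgroup(%wg: memref<32xf32, 3>)
    kernel {
  %v = load %in[%tid] : memref<32xf32, 1>
  store %v, %wg[%tid] : memref<32xf32, 3>
  gpu.barrier
  gpu.call @reduce_tile(%wg) : (memref<32xf32, 3>) -> ()
  gpu.return
}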

The buffers are considered live throughout the execution of the GPU
function body. The absence of memory attribution syntax means that the
function does not require special buffers.
Rationale: although the underlying models declare memory buffers at
the module level, we chose to do it at the function level to provide
some structuring for the lifetime of those buffers; this avoids the
incentive to use the buffers for communicating between different
kernels or launches of the same kernel, which should be done through
function arguments instead; we chose not to use an `alloca`-style
approach, which would require a more complex lifetime analysis,
following the principles of MLIR that promote structure and
representing analysis results in the IR.

The buffers are required to be of a statically-shaped `memref` type.
Rationale: the amount of special memory must be statically known in
most underlying models, and dynamic allocation of special memory
requires additional manipulation. Memref is the only type that
currently supports memory spaces.
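
A minimal illustration of the static-shape rule (hand-written, not from
the RFC):

// Valid: the workgroup buffer has a static shape.
gpu.func @ok() workgroup(%buf: memref<32xf32, 3>) {
  gpu.return
}

// Invalid: a dynamically-shaped buffer cannot be attributed.
// gpu.func @not_ok() workgroup(%buf: memref<?xf32, 3>) { gpu.return }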

Example
-------

gpu.launch blocks(%c1, %c1, %c1) in (%bx, %by, %bz)
           threads(%c32, %c1, %c1) in (%tx, %ty, %tz)
           workgroup(%wg_memory : memref<32xf32, 3>)
           private(%pr_memory : memref<1xf32, 5>) {
  /* some operations that write to %wg_memory */
  /* ... */

  // Each work item reads a different element of the workgroup memory.
  %42 = load %wg_memory[%tx] : memref<32xf32, 3>

  // Each work item writes the value into its private memory space.
  %c0 = constant 0 : index
  store %42, %pr_memory[%c0] : memref<1xf32, 5>

  // Passing the buffer into a callee.
  call @callee(%pr_memory) : (memref<1xf32, 5>) -> ()
}

Translation
===========

When translating to an external IR, memory attribution may need to be
performed at the module level. In the simplest implementation, we
naively collect all buffers required by every GPU function in the
module and concatenate them. In the future, this can be improved with
a buffer allocation pass (which can be generic and shared with other
dialects), given the lifetime information extracted from the function
call graph. This might require supporting memory attribution at the
module level and "passing" the buffers to the GPU functions, which is
left for future work, along with the buffer allocation itself.
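
A sketch of the naive scheme described above (the names are made up, and
the resulting module-level layout is only paraphrased in comments, since
the exact lowered form depends on the target dialect):

gpu.func @kernel_fn() workgroup(%a: memref<32xf32, 3>) kernel {
  gpu.call @device_fn() : () -> ()
  gpu.return
}

gpu.func @device_fn() workgroup(%b: memref<64xf32, 3>) {
  gpu.return
}

// Naive translation: %a (32 x f32) and %b (64 x f32) are collected into a
// single module-level workgroup buffer of 96 x f32; the code generated for
// @kernel_fn addresses elements [0, 32) and that for @device_fn addresses
// elements [32, 96).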


Other Modifications
===================

gpu.launch
----------

"gpu.launch" becomes essentially an inlined version of "gpu.func"
combined with "gpu.launch_func". It's body region has the same
semantics and constraints as the body of "gpu.func", and "gpu.launch"
op also supports memory attribution.

gpu.launch_func
---------------

The "gpu.launch_func" operation now only supports symbols that are
defined by a "gpu.func" operation with "kernel" attribute.


Alternatives Considered
=======================

Memory attribution at module level
----------------------------------

Both NVVM and SPIR-V use module-level values to define buffers in
workgroup-level memory, as well as in global and constant memory. Any
function within the module can directly access such values, and they
can be used to pass data between the functions within one kernel
invocation.

Conceptually, this representation is non-obvious because the same
buffer may seem to be reused between invocations of different kernels,
but this is not the case. It is only live during the execution of a
kernel, and different kernels in the same module can reuse the same
buffer with no guarantees about its content.

Memory attribution at the function level provides a middle ground
between a detailed lifetime analysis starting from SSA values and a
potential module-level scheme that offers no lifetime structure at all.

gpu.alloca
----------

Workgroup and private memory could be allocated using a newly
introduced static allocation operation, "gpu.alloca". This corresponds
less well to the static module-level allocation approach taken by the
lower-level abstractions we intend to generalize. It would create the
misconception of buffers being dynamically allocated, with the
possibility of adding new buffers in case of, e.g., recursive function
calls. Furthermore, it could be used with unstructured control flow,
making lifetime analysis more complex.


Out of Scope
============

- Launching kernels from kernels (a.k.a. dynamic parallelism)
- Stack management


[1] https://llvm.org/docs/LangRef.html#function-attributes
[2] http://lists.llvm.org/pipermail/llvm-dev/2016-October/106431.html
[3] http://lists.llvm.org/pipermail/llvm-dev/2019-October/135929.html
[4] https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#address-spaces

--
-- Alex

Mahesh Ravishankar

Oct 31, 2019, 3:41:46 PM
to Alex Zinenko, MLIR, Christian Sigg
Thanks a lot for the proposal Alex. Comments from me are below.


On Thu, Oct 31, 2019 at 10:03 AM Alex Zinenko <zin...@google.com> wrote:
If you are not interested in the GPU pipeline, you can stop reading now.


We propose to introduce a new function-like operation to the GPU
dialect to make the execution model explicit and to introduce support
for GPU memory hierarchy. This aligns with the overall goal of the
dialect to expose the common abstraction layer for GPU devices, in
particular by providing an MLIR unit of semantics (i.e. an operation)
to define the execution model. This RFC deals with both the execution
model and the memory hierarchy because memory visibility is related to
the execution model.

Function Definition
===================

The operation "gpu.func" defines a function that can be executed on a
GPU. It supports memory attribution and has a particular execution
model.

GPU functions are either kernels or regular functions. The former can
be launched through "gpu.launch_func operation"s. The latter can be
called from kernels and regular GPU functions, and also support memory
attribution. Standard functions can be called from GPU functions. See
below for call rules.

This sounds a bit ambiguous to me. Typically kernels need to have a fixed API (CUDA requires __global__ functions to have a void return type; SPIR-V/Vulkan requires the signature to be void(void)). Functions called within a kernel don't have this restriction. How about keeping GPU functions just for kernels and using Standard functions for functions called within GPU functions? You could then just disallow calling a GPU function from other GPU functions or Standard functions to disallow dynamic parallelism (for now).

The memory attribution defines SSA values that correspond to memory
buffers allocated in the memory hierarchy of the GPU (see below).

The operation has one attached region that corresponds to the body of
the function. The region arguments consist of the function arguments
without modification, followed by buffers defined in memory
annotations. The body of a GPU function, when launched, is executed by
multiple work items. There are no guarantees on the order in which
work items execute, or on the connection between them. In particular,
work items are not necessarily executed in lock-step. Synchronization
ops such as "gpu.barrier" should be used to coordinate work items.
Declarations of GPU functions, i.e. not having the body region, are
not supported.

Could you explain a bit more the rationale for not supporting this? I see this as an orthogonal linking issue that is out of scope of this proposal, so I'm not sure why it is mentioned here.
 

The control flow inside the GPU function is an arbitrary CFG and can
use higher-level control flow ops such as structured conditionals or
loops. Each Op is control-dependent on a set of values.  That is, the
control flow reaching this Op is conditioned on these values.  By
default, the control flow depends on the operands of the terminator Op
in the predecessor blocks; and the operands of the Ops to which the
surrounding regions belong.
 
This makes sense to me, but is there any special consideration needed for operations containing regions not isolated from above?
I am not sure we need to explicitly mark private memory. Any "allocation" within the thread is private by default. In NVVM, any alloca in the device-side functions becomes private/local memory AFAIK. I haven't had enough experience with SPIR-V variables with the Private Storage Class, but I think the allocation can be promoted to module scope during conversion from GPU to SPIR-V. The reason I am bringing this up is that a std.func called by a gpu.func might itself allocate private memory, and having to specify it for a gpu.func while having implicit private-memory semantics for a std.func called by a gpu.func seems inconsistent.
Is there a concept of "generic pointers"? Then do we want to support address space cast operations?
 

The buffers are considered live throughout the execution of the GPU
function body. The absence of memory attribution syntax means that the
function does not require special buffers.
Rationale: although the underlying models declare memory buffers at
the module level, we chose to do it at the function level to provide
some structuring for the lifetime of those buffers; this avoids the
incentive to use the buffers for communicating between different
kernels or launches of the same kernel, which should be done through
function arguments intead; we chose not to use `alloca`-style approach
that would require more complex lifetime analysis following the
principles of MLIR that promote structure and representing analysis
results in the IR.

The buffers are required to be of a statically-shaped `memref` type.
Rationale: the amount of special memory must be statically known on
most underlying model, dynamic allocation of special memory requires
additional manipulation. Memref is the only type that currently
supports memory spaces.

This seems unnecessary. CUDA for example can have a "dynamic shape" shared memory size. The size is specified on the host side during the launch. What additional manipulation are you referring to here?
I think this is only relevant for private memory right? For shared memory you cannot allocate additional shared memory within the kernel. For private memory though I don't fully follow why the GPU function needs to take an opinion on it. It's up to the client creating the gpu.func operation to manage the private memory. What if the std.func called within gpu.func is recursive and has an alloca. We cannot concatenate the buffers as suggested here in that case.
 


Other Modifications
===================

gpu.launch
----------

"gpu.launch" becomes essentially an inlined version of "gpu.func"
combined with "gpu.launch_func". It's body region has the same
semantics and constraints as the body of "gpu.func", and "gpu.launch"
op also supports memory attribution.

gpu.launch_func
---------------

The "gpu.launch_func" operation now only supports symbols that are
defined by a "gpu.func" operation with "kernel" attribute.


Alternatives Considered
=======================

Memory attribution at module level
----------------------------------

Both NVVM and SPIR-V use module-level values to define buffers in
workgroup-level memory, as well as in global and constant memory.  Any
function within the module can directly access such values, and they
can be used to pass data between the functions within one kernel
invocation.
 
The fact that the workgroup-level memory is represented at module scope is really an artifact and shouldn't be relied upon AFAIK. Global and constant are intended to be live across kernel boundaries, and that seems pretty natural. The proposal here does not restrict this for global/constant AFAIK. The only remaining one is private. To me, the scope of that should be limited to within the gpu.func op. The host need not specify how much private memory to use.


--
Mahesh

Lei Zhang

Oct 31, 2019, 6:49:42 PM
to Mahesh Ravishankar, Alex Zinenko, MLIR, Christian Sigg
Awesome, thanks for the RFC, Alex! Some comments inlined. :)

On Thu, Oct 31, 2019 at 3:41 PM 'Mahesh Ravishankar' via MLIR <ml...@tensorflow.org> wrote:
Thanks a lot for the proposal Alex. Comments from me are below.


On Thu, Oct 31, 2019 at 10:03 AM Alex Zinenko <zin...@google.com> wrote:
If you are not interested in the GPU pipeline, you can stop reading now.


We propose to introduce a new function-like operation to the GPU
dialect to make the execution model explicit and to introduce support
for GPU memory hierarchy. This aligns with the overall goal of the
dialect to expose the common abstraction layer for GPU devices, in
particular by providing an MLIR unit of semantics (i.e. an operation)
to define the execution model. This RFC deals with both the execution
model and the memory hierarchy because memory visibility is related to
the execution model.

Function Definition
===================

The operation "gpu.func" defines a function that can be executed on a
GPU. It supports memory attribution and has a particular execution
model.

GPU functions are either kernels or regular functions. The former can
be launched through "gpu.launch_func operation"s. The latter can be
called from kernels and regular GPU functions, and also support memory
attribution. Standard functions can be called from GPU functions. See
below for call rules.

This sounds a bit ambiguous to me. Typically kernels need to have a fixed API (CUDA requires __global__ to be void return type. SPIR-V/Vulkan requires the signature to be void(void)). Functions called within a kernel don't have this restriction. How about keep GPU functions just for kernels and use Standard functions for functions called within GPU functions. You can then just disallow calling a GPU function within other GPU function or Standard functions to disallow dynamic parallelism (for now).

+1. Both CUDA and Vulkan (and others?) have special requirements on kernel signatures, so IMO it's better to model it that way.
 

The memory attribution defines SSA values that correspond to memory
buffers allocated in the memory hierarchy of the GPU (see below).

The operation has one attached region that corresponds to the body of
the function. The region arguments consist of the function arguments
without modification, followed by buffers defined in memory
annotations. The body of a GPU function, when launched, is executed by
multiple work items. There are no guarantees on the order in which
work items execute, or on the connection between them. In particular,
work items are not necessarily executed in lock-step. Synchronization
ops such as "gpu.barrier" should be used to coordinate work items.
Declarations of GPU functions, i.e. not having the body region, are
not supported.

Could you explain a bit more the rationale for not supporting this. I see this as an orthogonal linking issue that is out of scope of this proposal. So not sure why it is mentioned here.
 

The control flow inside the GPU function is an arbitrary CFG

I'm not sure this works for Vulkan compute. We'll need structured control flow actually.
SPIR-V variables belonging to the Private storage class are just variables available to all the functions in the current invocation (or work item) but not across invocations. 
As a background, Vulkan compute does not allow recursion at all. So for std functions called by gpu functions, they also cannot be recursive.
 

Syntax
------

op ::= `gpu.call` symbol-ref-id `(` ssa-id-list `)` attribute-dict?
`:` function-type

Example
-------

gpu.call @foo(%0) : (index) -> ()


Memory Hierarchy
================

GPU memory hierarchy is reflected through memref memory spaces. We use
the memory space 1 for global memory, 3 for workgroup memory, 4 for
constant memory and 5 for private memory, following the NVVM spec [4].
In the longer term, we should consider redesigning memory spaces in
memref to use strings or other named identifiers so as to avoid
clashes between memory spaces used by different devices. Core IR
changes will be proposed separately.

In SPIR-V there are more storage classes than this: https://www.khronos.org/registry/spir-v/specs/unified1/SPIRV.html#_a_id_storage_class_a_storage_class and we'd like to utilize some of them (e.g., PushConstant) for Vulkan compute. I guess it's fine to keep consistent with the number assignment with NVVM here but we'd need to use other numbers too. :) 


Data in the workgroup memory can be accessed by any work item in the
group, same subscripts pointing to the same element of the memref in
all work items.  Since work items may be executed concurrently, a
workgroup synchronization may be required to satisfy a data dependency
between multiple work items.

Data in private memory can only be accessed by one work item. Same
subscripts used to access an element of a memref in private memory
will point to (conceptually) different addresses for different work
items.
For both memory spaces, data may be accessed using existing load and
store instructions, i.e. standard or affine ones.

Other memory types (constant, texture) can be added in the future.


Memory Attribution
==================

Memory buffers are defined at the function level, either in
"gpu.launch" or in "gpu.func" ops. This encoding makes it clear where
the memory belongs makes the lifetime of the memory visible. The
memory is only accessible while the kernel is launched/the function is
currently invoked (there is an activation on the stack). The latter is
more strict than actual gpu implementations but using static memory at
the function level is just for convenience. It is also always possible
to pass pointers to the shared memory into other functions, provided
they expect the correct memory space.

Is there a concept of "generic pointers"? Then do we want to support address space cast operations?

FYI: generic pointers are not supported by Vulkan compute. Is CUDA relying on generic pointers resolving to different storage class extensively?
What about the buffers used by other functions called from a gpu function? Are they collected to the entry gpu function?
 
In the future, this can be improved using
a buffer allocation path (that can be generic and shared with other
dialects) given the lifetime information extracted from the function
call graph. This might require supporting memory attribution at the
module level and "passing" the buffers to the GPU functions, which is
left for future work, along with the buffer allocation itself.

I think this is only relevant for private memory right? For shared memory you cannot allocate additional shared memory within the kernel. For private memory though I don't fully follow why the GPU function needs to take an opinion on it. It's up to the client creating the gpu.func operation to manage the private memory. What if the std.func called within gpu.func is recursive and has an alloca. We cannot concatenate the buffers as suggested here in that case.
 


Other Modifications
===================

gpu.launch
----------

"gpu.launch" becomes essentially an inlined version of "gpu.func"
combined with "gpu.launch_func". It's body region has the same
semantics and constraints as the body of "gpu.func", and "gpu.launch"
op also supports memory attribution.

Maybe not relevant to this RFC, but should we consider deduplicating the mechanisms here then? If it does not provide a useful abstraction differentiation, having fewer mechanisms is generally better IMO. :)
 

Mehdi AMINI

Nov 1, 2019, 2:48:37 AM
to Alex Zinenko, MLIR, Christian Sigg, Mahesh Ravishankar
On Thu, Oct 31, 2019 at 10:03 AM 'Alex Zinenko' via MLIR <ml...@tensorflow.org> wrote:
If you are not interested in the GPU pipeline, you can stop reading now.


We propose to introduce a new function-like operation to the GPU
dialect to make the execution model explicit and to introduce support
for GPU memory hierarchy. This aligns with the overall goal of the
dialect to expose the common abstraction layer for GPU devices, in
particular by providing an MLIR unit of semantics (i.e. an operation)
to define the execution model. This RFC deals with both the execution
model and the memory hierarchy because memory visibility is related to
the execution model.

Thanks! That's a great step forward for the definition of the GPU dialect.
It isn't clear to me how much of this can be specified like this in a dialect at the moment: I am not sure we can do this without making this SIMT/convergent property a "core" concept in MLIR. In particular it isn't clear to me that this restriction can be imposed by a "convergent" operation on the enclosing operations that aren't in the gpu dialect. For example what about a convergent operation inside an affine loop nest in a gpu function? What about the affine tiling operation? It seems that the affine dialect would need to be aware of the convergent property to assess the validity.

This (the convergent part) alone is likely something we should start a separate thread about, as I'm afraid of derailing too much the rest of the discussion on all the other aspects.
Seems like this could be just a verifier failure? Why do we need to leave this UB?
 
This can be relaxed in the future
to support dynamic kernel launch.

Standard function can be called from GPU functons using "std.call".

Is the difference that a GPU function has the convergent property? Are there other aspects that makes a difference between std.call and gpu.call?
 

Recursive calls are only allowed for non-kernel GPU functions with no
memory attribution. Rationale: kernel GPU functions are intended to be
called from host, they cannot call themselves from device; memory
attribution requires buffers to be live during the function execution,
a potentially unbounded recursion makes it impossible to evaluate the
total amount of special memory required by the function.

Syntax
------

op ::= `gpu.call` symbol-ref-id `(` ssa-id-list `)` attribute-dict?
`:` function-type

Example
-------

gpu.call @foo(%0) : (index) -> ()


Memory Hierarchy
================

GPU memory hierarchy is reflected through memref memory spaces. We use
the memory space 1 for global memory, 3 for workgroup memory, 4 for
constant memory and 5 for private memory, following the NVVM spec [4]. 
In the longer term, we should consider redesigning memory spaces in
memref to use strings or other named identifiers so as to avoid
clashes between memory spaces used by different devices. Core IR
changes will be proposed separately.

+1: we really need better than integer, this is an unfortunate design choice of LLVM :(
(actually, if someone wants to help fixing LLVM, I think it is possible but requires some effort)
 

Data in the workgroup memory can be accessed by any work item in the
group, same subscripts pointing to the same element of the memref in
all work items.  Since work items may be executed concurrently, a
workgroup synchronization may be required to satisfy a data dependency
between multiple work items.

Data in private memory can only be accessed by one work item. Same
subscripts used to access an element of a memref in private memory
will point to (conceptually) different addresses for different work
items.
For both memory spaces, data may be accessed using existing load and
store instructions, i.e. standard or affine ones.

I'm not sure about the intent behind this last sentence: I don't think the GPU dialect should be restrictive with respect to this, someone should be able to add their own load in their own dialect and compose this well within a kernel. Am I missing something? Or was this sentence just intended as an "example"?
 

Mehdi AMINI

Nov 1, 2019, 2:55:50 AM
to Lei Zhang, Mahesh Ravishankar, Alex Zinenko, MLIR, Christian Sigg
This means that some control flow cannot be lowered when targeting Vulkan, but that shouldn't be a blocker in the definition of the GPU dialect though?
Same as above: it seems like the GPU dialect should not be the common denominator of the targets, otherwise it won't be general enough.


 
 

Syntax
------

op ::= `gpu.call` symbol-ref-id `(` ssa-id-list `)` attribute-dict?
`:` function-type

Example
-------

gpu.call @foo(%0) : (index) -> ()


Memory Hierarchy
================

GPU memory hierarchy is reflected through memref memory spaces. We use
the memory space 1 for global memory, 3 for workgroup memory, 4 for
constant memory and 5 for private memory, following the NVVM spec [4].
In the longer term, we should consider redesigning memory spaces in
memref to use strings or other named identifiers so as to avoid
clashes between memory spaces used by different devices. Core IR
changes will be proposed separately.

In SPIR-V there are more storage classes than this: https://www.khronos.org/registry/spir-v/specs/unified1/SPIRV.html#_a_id_storage_class_a_storage_class and we'd like to utilize some of them (e.g., PushConstant) for Vulkan compute. I guess it's fine to keep consistent with the number assignment with NVVM here but we'd need to use other numbers too. :) 

We need strings really, these can be namespaced and specialized then.
What deduplication do you have in mind? Having an op with a region is really convenient in MLIR. Something that I'd really like to see is to stop making it IsolatedFromAbove though.
 

Alex Zinenko

Nov 1, 2019, 2:10:13 PM
to Mahesh Ravishankar, MLIR, Christian Sigg
On Thu, Oct 31, 2019 at 8:41 PM Mahesh Ravishankar <ravish...@google.com> wrote:
Thanks a lot for the proposal Alex. Comments from me are below.


On Thu, Oct 31, 2019 at 10:03 AM Alex Zinenko <zin...@google.com> wrote:
If you are not interested in the GPU pipeline, you can stop reading now.


We propose to introduce a new function-like operation to the GPU
dialect to make the execution model explicit and to introduce support
for GPU memory hierarchy. This aligns with the overall goal of the
dialect to expose the common abstraction layer for GPU devices, in
particular by providing an MLIR unit of semantics (i.e. an operation)
to define the execution model. This RFC deals with both the execution
model and the memory hierarchy because memory visibility is related to
the execution model.

Function Definition
===================

The operation "gpu.func" defines a function that can be executed on a
GPU. It supports memory attribution and has a particular execution
model.

GPU functions are either kernels or regular functions. The former can
be launched through "gpu.launch_func operation"s. The latter can be
called from kernels and regular GPU functions, and also support memory
attribution. Standard functions can be called from GPU functions. See
below for call rules.

This sounds a bit ambiguous to me. Typically kernels need to have a fixed API (CUDA requires __global__ to be void return type. SPIR-V/Vulkan requires the signature to be void(void)). Functions called within a kernel don't have this restriction. How about keep GPU functions just for kernels and use Standard functions for functions called within GPU functions. You can then just disallow calling a GPU function within other GPU function or Standard functions to disallow dynamic parallelism (for now).

One of the key aspects of GPU functions is the memory attribution and scoping. We cannot have that on standard functions.
With the `kernel` attribute, it's possible to add an attribute verifier to check the signature. That being said, the signature requirements are different for CUDA and SPIR-V, so I feel like we can generalize here instead of taking the common denominator. The lowering can then use SPIR-V-specific mechanism of passing the arguments into the kernel, or do it as a pre-transformation.

We could alternatively consider separate gpu.kernel and gpu.func, both with memory attribution, but the former with signature constraints. I don't think these two things are sufficiently different to justify the complexity of having two function-like ops.
 

The memory attribution defines SSA values that correspond to memory
buffers allocated in the memory hierarchy of the GPU (see below).

The operation has one attached region that corresponds to the body of
the function. The region arguments consist of the function arguments
without modification, followed by buffers defined in memory
annotations. The body of a GPU function, when launched, is executed by
multiple work items. There are no guarantees on the order in which
work items execute, or on the connection between them. In particular,
work items are not necessarily executed in lock-step. Synchronization
ops such as "gpu.barrier" should be used to coordinate work items.
Declarations of GPU functions, i.e. not having the body region, are
not supported.

Could you explain a bit more the rationale for not supporting this. I see this as an orthogonal linking issue that is out of scope of this proposal. So not sure why it is mentioned here.

I'd probably rephrase this as "currently not supported". I haven't figured out a clear story for lowering wrt memory attribution. Should declarations include memory attribution? Are they allowed to have it at all?
 
 

The control flow inside the GPU function is an arbitrary CFG and can
use higher-level control flow ops such as structured conditionals or
loops. Each Op is control-dependent on a set of values.  That is, the
control flow reaching this Op is conditioned on these values.  By
default, the control flow depends on the operands of the terminator Op
in the predecessor blocks; and the operands of the Ops to which the
surrounding regions belong.
 
This makes sense to me, but is there any special consideration needed for operations containing regions not isolated from above?

The set of control-affecting values is constructed across regions, so it does not look necessary. Implicitly captured values cannot affect the control flow unless they are used in a control-flow Op, at which point we include them to the control-affecting set regardless of them being captured.
Interesting point, thanks! I would argue that having multiple mechanisms for different kinds of memory increases the cognitive overhead of the model, so I'd prefer to only have one. There are arguments against the alloca approach below.

In the same way an allocation can be promoted to module scope like you mentioned, it can be promoted to the function scope when converting to GPU. We actually don't have an "alloca" equivalent in the dialects that lower to GPU. It is still possible to inject llvm.alloca with the right address space somewhere, but I don't think we can actively prevent this kind of behavior.
Not yet. Address space casting should likely happen in core/standard on memrefs.
 
 

The buffers are considered live throughout the execution of the GPU
function body. The absence of memory attribution syntax means that the
function does not require special buffers.
Rationale: although the underlying models declare memory buffers at
the module level, we chose to do it at the function level to provide
some structuring for the lifetime of those buffers; this avoids the
incentive to use the buffers for communicating between different
kernels or launches of the same kernel, which should be done through
function arguments intead; we chose not to use `alloca`-style approach
that would require more complex lifetime analysis following the
principles of MLIR that promote structure and representing analysis
results in the IR.

The buffers are required to be of a statically-shaped `memref` type.
Rationale: the amount of special memory must be statically known on
most underlying model, dynamic allocation of special memory requires
additional manipulation. Memref is the only type that currently
supports memory spaces.

This seems unnecessary. CUDA for example can have a "dynamic shape" shared memory size. The size is specified on the host side during the launch. What additional manipulation are you referring to here?

Exactly that - providing the shared memory size from host. We don't have support for that in gpu.launch. This can be relaxed later, but looks superfluous for this proposal.
Memory attribution (the term is intended to differentiate from allocation) gives you structural scoping rules, as opposed to CFG dominance scoping, which are arguably easier to analyze. The intention here is for each function to forward-declare its requirements in terms of shared memory. Translating to a lower-level dialect, we can trace all GPU functions called from a kernel and obtain the total shared memory size required at the kernel level.

There is no alloca in MLIR. If there had been, we could have removed support for lowering it directly to LLVM IR alloca when going to the GPU and replaced it with attribution.
Recursive functions probably require more detailed consideration. What's the support of those in NVVM and SPIR-V?
 
 


Other Modifications
===================

gpu.launch
----------

"gpu.launch" becomes essentially an inlined version of "gpu.func"
combined with "gpu.launch_func". It's body region has the same
semantics and constraints as the body of "gpu.func", and "gpu.launch"
op also supports memory attribution.

gpu.launch_func
---------------

The "gpu.launch_func" operation now only supports symbols that are
defined by a "gpu.func" operation with "kernel" attribute.


Alternatives Considered
=======================

Memory attribution at module level
----------------------------------

Both NVVM and SPIR-V use module-level values to define buffers in
workgroup-level memory, as well as in global and constant memory.  Any
function within the module can directly access such values, and they
can be used to pass data between the functions within one kernel
invocation.
 
The fact that the workgroup-level memory is represented at module scope is really an artifact and shouldn't be relied upon AFAIK. Global and constant are intended to be live across kernel boundaries, and that seems pretty natural. The proposal here does not restrict this for global/constant AFAIK. The only remaining one is private. To me it scope of that should be limited to within the gpu.func op. The host need not specify how much private memory to use.

Both shared and private are scoped at the function level in the proposal. There is no reliance on module-level declarations, this is actually a rejected alternative.


--
-- Alex

Mahesh Ravishankar

Nov 1, 2019, 2:12:02 PM
to Lei Zhang, Alex Zinenko, MLIR, Christian Sigg
AFAIK, in CUDA you cannot annotate address-space information on pointer arguments, which means if you pass a pointer to a function it has to be passed as a generic pointer. In PTX there are no pointer types at all (they are represented as untyped 32-bit/64-bit values). So you need generic pointers to be supported for CUDA. If you are not doing separate compilation (where you can have __device__ functions defined across multiple compilation units that are linked together) or if you use LTO, you could propagate the address space information across function boundaries without changing the ABI. I am pretty sure at the NVVM level as well, pointer arguments do not have address space information attached to them (unlike SPIR-V, where pointers have explicit storage class annotations AFAIK). So yeah, I would say CUDA/NVVM relies heavily on generic pointers. The underlying hardware supports the resolution of generic to specific address spaces based on the pointer value (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#generic-addressing)


--
Mahesh

Alex Zinenko

Nov 1, 2019, 2:26:11 PM
to Lei Zhang, Mahesh Ravishankar, MLIR, Christian Sigg
They have different requirements, so I would say it is reasonable to check them at the lowering time.
 
 

The memory attribution defines SSA values that correspond to memory
buffers allocated in the memory hierarchy of the GPU (see below).

The operation has one attached region that corresponds to the body of
the function. The region arguments consist of the function arguments
without modification, followed by buffers defined in memory
annotations. The body of a GPU function, when launched, is executed by
multiple work items. There are no guarantees on the order in which
work items execute, or on the connection between them. In particular,
work items are not necessarily executed in lock-step. Synchronization
ops such as "gpu.barrier" should be used to coordinate work items.
Declarations of GPU functions, i.e. not having the body region, are
not supported.

Could you explain a bit more the rationale for not supporting this. I see this as an orthogonal linking issue that is out of scope of this proposal. So not sure why it is mentioned here.
 

The control flow inside the GPU function is an arbitrary CFG

I'm not sure this works for Vulkan compute. We'll need structured control flow actually.

A single block with loop operations inside fits into "arbitrary CFG" description. Disallowing CFG inside GPU regions looks like imposing a specific order of lowerings for no good reason.
This also sounds like something that should be checked in the lowering. Unless all potential targets forbid recursion, at which point we can just forbid it in the dialect.
 
 

Syntax
------

op ::= `gpu.call` symbol-ref-id `(` ssa-id-list `)` attribute-dict?
`:` function-type

Example
-------

gpu.call @foo(%0) : (index) -> ()


Memory Hierarchy
================

GPU memory hierarchy is reflected through memref memory spaces. We use
the memory space 1 for global memory, 3 for workgroup memory, 4 for
constant memory and 5 for private memory, following the NVVM spec [4].
In the longer term, we should consider redesigning memory spaces in
memref to use strings or other named identifiers so as to avoid
clashes between memory spaces used by different devices. Core IR
changes will be proposed separately.

In SPIR-V there are more storage classes than this: https://www.khronos.org/registry/spir-v/specs/unified1/SPIRV.html#_a_id_storage_class_a_storage_class and we'd like to utilize some of them (e.g., PushConstant) for Vulkan compute. I guess it's fine to keep consistent with the number assignment with NVVM here but we'd need to use other numbers too. :) 

And the numbers conflict... Maybe we should consider having our own numbers and really look into dialect-specific memory spaces as Mehdi suggested below?
Address space cast can be useful independently of the GPU dialect.
To the entry function, and then across entry functions if there is more than one in a module.
 
 
In the future, this can be improved using
a buffer allocation path (that can be generic and shared with other
dialects) given the lifetime information extracted from the function
call graph. This might require supporting memory attribution at the
module level and "passing" the buffers to the GPU functions, which is
left for future work, along with the buffer allocation itself.

I think this is only relevant for private memory right? For shared memory you cannot allocate additional shared memory within the kernel. For private memory though I don't fully follow why the GPU function needs to take an opinion on it. It's up to the client creating the gpu.func operation to manage the private memory. What if the std.func called within gpu.func is recursive and has an alloca. We cannot concatenate the buffers as suggested here in that case.
 


Other Modifications
===================

gpu.launch
----------

"gpu.launch" becomes essentially an inlined version of "gpu.func"
combined with "gpu.launch_func". It's body region has the same
semantics and constraints as the body of "gpu.func", and "gpu.launch"
op also supports memory attribution.

Maybe not relevant to this RFC, but should we consider deduplicate the mechanisms here then? If it does not provide a useful abstraction differentiation, having less mechanisms is generally better IMO. :)

The inlined version allows one to perform code motion without caring about inter-procedural aspects. I'd say it's quite useful.


--
-- Alex

Alex Zinenko

Nov 1, 2019, 2:38:32 PM
to Mehdi AMINI, MLIR, Christian Sigg, Mahesh Ravishankar
Indeed, I tried to be as conservative as possible, but you are right about the issue. It looks even worse in case of unregistered operations, the conservative thing to do is to assume they are convergent and anti-convergent.

I'll start a separate thread for this, just a thread without a concrete proposal for now.
We may be calling into an external function which launches a kernel, but we have no way to verify it since it's external.
 
 
This can be relaxed in the future
to support dynamic kernel launch.

Standard function can be called from GPU functons using "std.call".

Is the difference that a GPU function has the convergent property? Are there other aspects that makes a difference between std.call and gpu.call?

std.call cannot know about gpu.functions.
We may want to detach std.call from std.func and make it applicable to any function-like symbol instead.
As an example indeed. "May be accessed" means you are allowed to use these, but we don't restrict that. I'll rephrase.


--
-- Alex

Alex Zinenko

Nov 1, 2019, 2:39:47 PM
to Mehdi AMINI, Lei Zhang, Mahesh Ravishankar, MLIR, Christian Sigg
+1. This is exactly what I had in mind.


--
-- Alex

Mahesh Ravishankar

Nov 1, 2019, 3:30:01 PM
to Alex Zinenko, Christian Sigg, MLIR
Thanks for the responses Alex.  So is it fair to say that the introduction of gpu.func and gpu.call has more to do with convergence properties, and to limit transformations that might affect the control-dependence of these ops? If so, then that makes sense to me now.

--
Mahesh

Lei Zhang

Nov 1, 2019, 10:38:45 PM
to Alex Zinenko, Mahesh Ravishankar, MLIR, Christian Sigg
I didn't mean we should disallow CFG inside GPU functions completely. :) I wanted to point out the requirement from Vulkan compute's perspective. It's just that "arbitrary CFG" sounds too broad to me. Based on my understanding, OpenCL is less restrictive than Vulkan compute and requires a reducible CFG. So neither supports arbitrary CFG. Not exactly sure about the CUDA side, though. Mehdi's argument that we want this to be generic and not just a common denominator of the targets makes sense to me. But on the other side, it also means that later down the pipeline we need to verify and reject invalid ones. So if there is common stuff that can be checked earlier, we'd avoid doing the work in all downstream consumers. :)

Sanjoy Das

Nov 2, 2019, 12:37:41 AM
to Alex Zinenko, MLIR, Christian Sigg, Mahesh Ravishankar, Tim Shen, George Karpenkov
[CC Tim and George]

Hi Alex,
Do we actually need a CFG? If we can get by only using structured
control flow then maintaining the invariant you mention below might be
easier.

> use higher-level control flow ops such as structured conditionals or
> loops. Each Op is control-dependent on a set of values. That is, the
> control flow reaching this Op is conditioned on these values. By
> default, the control flow depends on the operands of the terminator Op
> in the predecessor blocks; and the operands of the Ops to which the
> surrounding regions belong. The terminator or the region-containing Op
> can further specify which of their operands affect the control flow.

Will this be specified as an OpInterface?

> For example, successors of `std.cond_br` only control-depend on the
> value of its first argument that expresses the condition. Inside a
> non-kernel GPU function, the Op is also transitively control-dependent
> on any values the call-site operation is.
>
> An op can require this set to remain fixed during transformations,
> i.e. the control flow cannot be made dependent on more or less values
> than before the transformation. The “more” part is equivalent to
> LLVM’s convergent property [1]. The “less” part---to its inverse. The
> “same control-affecting value set” is therefore more restrictive than
> the LLVM model as the latter has been demonstrated to be insufficient
> for some memory-related operations [2]. The restriction is intended to
> be decomposed into several independent elements, possibly along the
> lines of what is proposed in [2] and being discussed in LLVM [3], but
> this relaxation will be proposed and discussed separately due to its
> complexity. Any op that requires the set of control-affecting values
> to remain fixed must have either a "gpu.func" or a "gpu.launch" op as
> an ancestor.
>
> Within GPU-function bodies, we conservatively assume that
> side-effecting (or unregistered) operations require the set of
> control-affecting operations to be fixed. Standard and Affine loads
> and stores do not require this property.

Can we do better here? Although I was not involved in the process, I
suspect LLVM's design was heavily influenced by the requirement to
retrofit within a CFG-based compiler. Maybe this is a good time to
reevaluate this since we are not bound by LLVM's constraints.
Will this concatenation step happen for global memory?

Alex Zinenko

Nov 4, 2019, 5:21:51 AM
to Mahesh Ravishankar, Christian Sigg, MLIR
On Fri, Nov 1, 2019 at 8:29 PM Mahesh Ravishankar <ravish...@google.com> wrote:
Thanks for the responses Alex.  So is it fair to say that the introduction of gpu.func and gpu.call has more to do with convergence properties, and to limit transformations that might affect the control-dependence of these ops?

Partly, yes. It defines the execution model, which we only had implicit until now. Both convergence and memory are a part of that model.


--
-- Alex

Alex Zinenko

Nov 4, 2019, 5:57:27 AM
to Sanjoy Das, MLIR, Christian Sigg, Mahesh Ravishankar, Tim Shen, George Karpenkov
I'd be happy to have structured control flow only, but I see several problems. One, "structured control flow" is not a core concept in MLIR, unlike CFG. There are dialects that provide Ops with control flow semantics, but it's opaque to the core. Since MLIR wants to design for an open set of Ops, it's undesirable to restrict the body of a function to only use specific dialects, which is the only possibility to restrict it to structured control flow right now. Unless we turn structured control flow into some core-level trait. Two, it would become impossible for IR that uses CFG to target GPU without some sort of raising, which sounds unfortunate for a dialect that is positioned as a lowering target.

Slightly diverging from the topic, I think you are pointing at a concept that we lost during the unification of cfgfunc and mlfunc, where the latter had semantic requirements of structured control flow embedded into the IR, but was too connected to the polyhedral representation to survive. Maybe we should consider some more native identification of what control flow is since it's one of the most compelling aspects of MLIR for potential users.
 

> use higher-level control flow ops such as structured conditionals or
> loops. Each Op is control-dependent on a set of values.  That is, the
> control flow reaching this Op is conditioned on these values.  By
> default, the control flow depends on the operands of the terminator Op
> in the predecessor blocks; and the operands of the Ops to which the
> surrounding regions belong. The terminator or the region-containing Op
> can further specify which of their operands affect the control flow.

Will this be specified as a OpInterface?

It could be, indeed.
MLIR is also CFG-based and, in a sense, it's worse than LLVM due to unregistered ops that may actually introduce custom control flow without you knowing. We discussed this a couple of times, and it feels like there is some notion of divergent/convergent control flow that should be made core to the IR. In this proposal, the attempt is to make the model compatible with LLVM's because we are ultimately targeting LLVM IR, as well as conservative enough to support future relaxation.

LLVM is also reconsidering their design, so it would be interesting to keep in touch with that effort.

Any suggestions on modeling this are more than welcome!
This proposal does not include global memory because the Ops are (currently) at the device level of abstraction while global memory is expected to be allocated on host and passed into device functions. Since concatenation is essentially a placeholder for buffer assignment, it would be reasonable to try and reuse a buffer assignment infrastructure when we have one.


--
-- Alex

Trent Lo

unread,
Nov 4, 2019, 4:18:29 PM11/4/19
to Mahesh Ravishankar, Alex Zinenko, Christian Sigg, MLIR

Hi Mahesh,

 

Thanks for the clarification. Now I know where the usage of the term comes from. Then, we may choose to use the terms crossworkgroup and workgroup (corresponding to global and local in OpenCL) when the spec is written down.
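For reference, a rough mapping of the candidate terms as I understand it (not normative; the SPIR-V column assumes the OpenCL-flavored storage classes):

  proposed attribution    CUDA          OpenCL    SPIR-V storage class
  workgroup               __shared__    local     Workgroup
  private                 local         private   Function / Private
  (global, not in RFC)    global        global    CrossWorkgroup / StorageBuffer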

 

Trent

 

(I intended to reply to the mlir mailing thread. I do not know why my outlook drops the mlir mailing address for me.)

 

From: Mahesh Ravishankar <ravish...@google.com>
Sent: Monday, November 4, 2019 11:14 AM
To: Trent Lo <tre...@nvidia.com>
Cc: Alex Zinenko <zin...@google.com>; Christian Sigg <cs...@google.com>
Subject: Re: [mlir] [RFC][GPU] Functions and Memory Modeling

 

Hi Trent,

 

Nice to hear from you. I think the origin of these terms is SPIR-V, where __shared__ is referred to as Workgroup. Since we want the GPU dialect to not be CUDA-specific, we are looking for terminology accepted by a standards committee like Khronos (which defines the OpenCL and SPIR-V standards) but that still maps to CUDA. I do second that we be consistent with the terminology and provide explicit documentation of what each term means and how it maps to, say, CUDA or SPIR-V.

 

Also, feel free to respond to the common thread. I think people do expect the MLIR mailing list to have heavy traffic.

 

Thanks

--

Mahesh

 

On Mon, Nov 4, 2019 at 10:45 AM Trent Lo <tre...@nvidia.com> wrote:

Hi Alex,

Generally the proposal looks fine to me.

One small request:
Could we refer to the "workgroup-level" memory attribute as "shared" (which is aligned with the CUDA terminology)? I also don't mind completely borrowing the OpenCL terms (i.e., global/local/private), but using "workgroup" alongside "global"/"private" to refer to memory attribution does not sound symmetric and may be confusing.

Thanks,
Trent




 

--

Mahesh

Mahesh Ravishankar

unread,
Nov 4, 2019, 4:20:28 PM11/4/19
to Alex Zinenko, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
+1 to support only structured control flow. I understand that it is tricky to have only structured control flow in MLIR as a whole, but for the GPU dialect we can allow only structured control flow. Most people targeting the GPU dialect are interested in running kernels on GPU hardware, and for all GPU hardware I am aware of, having structured control flow will help reason about convergence and, I think, will not result in a loss of expressivity (at least for the most common use cases). For now, we do have use cases for supporting structured control flow in the GPU dialect, but I am not aware of use cases for supporting unstructured control flow at this point. To add to that (as Lei mentioned earlier), for SPIR-V we will have to "rediscover" structured control flow if we are using a compilation flow through the GPU dialect, which is a fairly involved problem. If the GPU dialect has structured control flow and dialects like Linalg or Affine can target it, lowering from the GPU dialect to SPIR-V would be easier.


--
Mahesh

Ben Vanik

unread,
Nov 4, 2019, 5:05:30 PM11/4/19
to Mahesh Ravishankar, Alex Zinenko, Christian Sigg, George Karpenkov, MLIR, Sanjoy Das, Tim Shen
As someone who spent a decent chunk of time messing with the relooper in emscripten for asmjs/webassembly: going from arbitrary CFGs to structured control flow is a painful, edge-case-filled pit of sadness. Best to avoid it if at all possible, as even a Herculean effort falls short of handling it correctly and performantly in all cases ;)

Mehdi AMINI

unread,
Nov 5, 2019, 12:08:23 AM11/5/19
to Mahesh Ravishankar, Alex Zinenko, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
I understand why it is more convenient, but it isn't clear to me why having a reducible CFG should be forbidden outright.
More than that, because "structured control flow" is not a trait or anything well defined in MLIR, the only way to achieve such a restriction is to strictly control the set of operations (as in, list them explicitly) that can go in a gpu.func (and these operations must apply the same treatment to any nested region they could hold). This would greatly limit the set of operations that could be used in this dialect, and it seems a bit overly restrictive to me at this point.

-- 
Mehdi



 

Mahesh Ravishankar

unread,
Nov 5, 2019, 2:39:43 AM11/5/19
to Mehdi AMINI, Alex Zinenko, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
Yes, there would need to be a set of ops in the GPU dialect that can be used in a gpu.func to express structured control flow. I don't think it has to be an either-or situation w.r.t. structured vs. unstructured control flow. It would be good to avoid the situation where dialects downstream of the GPU dialect have to rediscover the structured control flow from an arbitrary CFG in all situations. If we need to allow unstructured control flow in the GPU dialect, i.e., if it cannot live only in the LLVM dialect (or LLVM for that matter), then the GPU dialect can allow unstructured control flow, and backend dialects, like SPIR-V, that cannot support it can reject such inputs.


--
Mahesh

Alex Zinenko

unread,
Nov 5, 2019, 12:02:52 PM11/5/19
to Mahesh Ravishankar, Mehdi AMINI, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
Indeed, allowing CFG in the GPU dialect does not mean we require using it. There is also no requirement for us to be able to legalize everything that is expressible into SPIR-V or any other dialect. The conversion can check that gpu.func contains only the operations the conversion supports (e.g., loops and branches) and report a failure otherwise. I expect that, as we go further into the land of capabilities and device versions, there will be other features present in the GPU dialect that are not supported by specific lowering pipelines. For now, I would suggest we phrase it as a recommendation, but not enforce it in the verifier.

As a matter of fact, we do have a use case where we need a CFG inside GPU functions -- lowering. We lower the body of the (currently standard) function to the NVVM dialect, which only has CFG. Even more, the fact that the NVVM dialect exists means that we need to somehow model convergence on CFGs, even if we made it more conservative than LLVM because we do not intend to apply transformations at that level.
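To illustrate, here is the same gpu.func body in a structured form and after lowering the loop to branches; loop.for from the loop dialect, the standard arithmetic and branch ops, and the gpu.func/gpu.return syntax sketched earlier in the thread are used purely as examples:

// Structured form: a SPIR-V lowering could check for and accept this shape.
gpu.func @scale(%n : index, %a : f32, %buf : memref<?xf32>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  loop.for %i = %c0 to %n step %c1 {
    %v = load %buf[%i] : memref<?xf32>
    %s = mulf %a, %v : f32
    store %s, %buf[%i] : memref<?xf32>
  }
  gpu.return
}

// The same body as a CFG, e.g. after lowering loop.for on the way to NVVM.
gpu.func @scale_lowered(%n : index, %a : f32, %buf : memref<?xf32>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  br ^header(%c0 : index)
^header(%i : index):
  %cond = cmpi "slt", %i, %n : index
  cond_br %cond, ^body, ^exit
^body:
  %v = load %buf[%i] : memref<?xf32>
  %s = mulf %a, %v : f32
  store %s, %buf[%i] : memref<?xf32>
  %next = addi %i, %c1 : index
  br ^header(%next : index)
^exit:
  gpu.return
}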


--
-- Alex

Mahesh Ravishankar

unread,
Nov 5, 2019, 1:00:08 PM11/5/19
to Alex Zinenko, Mehdi AMINI, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
+1 for that.
 

As a matter of fact, we do have a use case where we need a CFG inside GPU functions -- lowering. We lower the body of the (currently standard) function to the NVVM dialect, which only has CFG. Even more, the fact that the NVVM dialect exists means that we need to somehow model convergence on CFGs, even if we made it more conservative than LLVM because we do not intend to apply transformations at that level.

Don't want to belabor the point more, but this seems like it is pointing to modeling convergence at the NVVM/LLVM dialect level. But I don't really have skin in this anymore. For SPIR-V and for IREE, we will have to rely on structured control flow anyway, so I don't have any major concerns. Thanks for the clarifications!
 

Stephan Herhut

unread,
Nov 5, 2019, 1:33:05 PM11/5/19
to Lei Zhang, Mahesh Ravishankar, Alex Zinenko, MLIR, Christian Sigg
The special requirements on the signature of kernel functions seem like a problem best solved during lowering from the gpu dialect to the target dialect. One could simply emit adaptor functions for kernel invocations or clone gpu functions with a modified signature. Having actual return values at the gpu dialect level seems like a much nicer model to me.
I see the main motivation for gpu.func in memory attribution at the function level and in allowing synchronization in function bodies. Both concepts only make sense for functions that have a GPU execution model in mind.
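To illustrate the adaptor idea (hypothetical: neither results on gpu.func nor this particular gpu.call syntax are part of the current proposal, so take this as a sketch only):

// A gpu.func with a result, which is the convenient form at the gpu level.
gpu.func @scaled(%a : f32, %x : memref<?xf32>, %i : index) -> f32 {
  %v = load %x[%i] : memref<?xf32>
  %r = mulf %a, %v : f32
  gpu.return %r : f32
}

// The adaptor a lowering could emit for targets whose kernels must return
// void: the result is stored into an extra output buffer instead.
gpu.func @scaled_kernel(%a : f32, %x : memref<?xf32>, %i : index,
                        %out : memref<?xf32>) {
  %r = gpu.call @scaled(%a, %x, %i) : (f32, memref<?xf32>, index) -> f32
  store %r, %out[%i] : memref<?xf32>
  gpu.return
}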
 

The memory attribution defines SSA values that correspond to memory
buffers allocated in the memory hierarchy of the GPU (see below).

The operation has one attached region that corresponds to the body of
the function. The region arguments consist of the function arguments
without modification, followed by buffers defined in memory
annotations. The body of a GPU function, when launched, is executed by
multiple work items. There are no guarantees on the order in which
work items execute, or on the connection between them. In particular,
work items are not necessarily executed in lock-step. Synchronization
ops such as "gpu.barrier" should be used to coordinate work items.
Declarations of GPU functions, i.e. not having the body region, are
not supported.

Could you explain a bit more the rationale for not supporting this? I see it as an orthogonal linking issue that is out of scope for this proposal, so I am not sure why it is mentioned here.
 

The control flow inside the GPU function is an arbitrary CFG

I'm not sure this works for Vulkan compute. We'll need structured control flow actually. 

This brings back a general question in how dialects compose. I see arbitrary control flow as the same category as supporting some special operation. A specific lowering of the gpu dialect to a target might disallow arbitrary control flow, like it might not support some operations. 

An interesting question in the same direction is whether we want to allow synchronization in the context of arbitrary control flow or whether we limit it to structured control flow, at least initially, to simplify specifying the semantics.
Should this rather require that the path in the call graph from this operation to an enclosing gpu.launch only consists of gpu.func functions? That would scope the analysis more and give us an anchor for restrictions on the control flow.
 

Mehdi AMINI

unread,
Nov 5, 2019, 1:36:37 PM11/5/19
to Mahesh Ravishankar, Alex Zinenko, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
I see convergence as orthogonal to CFG: it affects both representations the same way. I'm not sure if you are implying here that convergence would apply only at the NVVM/LLVM dialect level because of the CFG support.
 
-- 
Mehdi

Mahesh Ravishankar

unread,
Nov 5, 2019, 2:14:32 PM11/5/19
to Mehdi AMINI, Alex Zinenko, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
The use case that Alex pointed to suggests that modeling convergence on CFGs in the GPU dialect is motivated by the fact that NVVM has CFGs. You could then reason about convergence properties, as defined here, in the NVVM dialect instead and not necessarily bubble it up to the GPU dialect. A more compelling use case would be if something "above" the GPU dialect had unstructured control flow that needs to be translated to NVVM. Then there is no choice but to model convergence in the presence of unstructured control flow in the GPU dialect as well.
But to reiterate, I am fine having these in the GPU dialect and backends that don't support unstructured control flow rejecting it.


--
Mahesh

Alex Zinenko

unread,
Nov 6, 2019, 5:20:27 AM11/6/19
to Stephan Herhut, Lei Zhang, Mahesh Ravishankar, MLIR, Christian Sigg
Indeed, gpu.func is intended as an anchor for the GPU programming model, which includes both the execution model (comprising synchronizations) and the memory model. That being said, it's possible that the execution model will have to be reflected in the core IR as Mehdi mentioned above.

 

The memory attribution defines SSA values that correspond to memory
buffers allocated in the memory hierarchy of the GPU (see below).

The operation has one attached region that corresponds to the body of
the function. The region arguments consist of the function arguments
without modification, followed by buffers defined in memory
annotations. The body of a GPU function, when launched, is executed by
multiple work items. There are no guarantees on the order in which
work items execute, or on the connection between them. In particular,
work items are not necessarily executed in lock-step. Synchronization
ops such as "gpu.barrier" should be used to coordinate work items.
Declarations of GPU functions, i.e. not having the body region, are
not supported.

Could you explain a bit more the rationale for not supporting this? I see it as an orthogonal linking issue that is out of scope for this proposal, so I am not sure why it is mentioned here.
 

The control flow inside the GPU function is an arbitrary CFG

I'm not sure this works for Vulkan compute. We'll need structured control flow actually. 

This brings back a general question in how dialects compose. I see arbitrary control flow as the same category as supporting some special operation. A specific lowering of the gpu dialect to a target might disallow arbitrary control flow, like it might not support some operations. 

An interesting question in the same direction is whether we want to allow synchronization in the context of arbitrary control flow or whether we limit it to structured control flow, at least initially, to simplify specifying the semantics.

This still goes back to the problem that structured control flow is not a concept in MLIR.
Assuming you meant std.func, because one cannot launch from the GPU. There are two issues: the call graph is the result of an analysis, and we don't necessarily want to build op definitions on it; and functions can be called from other modules, which we cannot analyze.

Alex Zinenko

unread,
Nov 6, 2019, 5:40:00 AM11/6/19
to Mahesh Ravishankar, Mehdi AMINI, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
I'm inclined to say we cannot model convergence only in the GPU dialect, or in any single dialect. We just don't have a proper place for it yet. It sounds more like a general property that should exist across dialects if we want to be able to mix them (which I assume we do in MLIR in general). At this point, we cannot define it for structured control flow only, because we cannot restrict all potential dialects to use only structured control flow. My point is that we already have a dialect with unstructured control flow that is mixed with other dialects and that needs convergence properties.

We can however try to define the model in a way that makes reasoning simple if the control flow is guaranteed to be structured.

Stephan Herhut

unread,
Nov 6, 2019, 8:32:49 AM11/6/19
to Alex Zinenko, Lei Zhang, Mahesh Ravishankar, MLIR, Christian Sigg
No, I was looking at the path from an op with an annotation back to the kernel launch. In other words, the part of the call-graph that will be executed on the GPU. Assuming that convergence is a gpu.func specific annotation, and assuming that one would want to verify some basic properties along the call chain, having all functions be gpu.funcs seemed helpful. However, as convergence might become a generic mlir concept, this makes less sense.

I still would not want to limit it to the immediate target or body of a launch. I have not thought about this too deeply but doesn't convergence compose? If I have an op that is convergent inside a function, the function itself is also convergent. As I understand it, this is not something the proposal currently supports.


--
Stephan

Stephan Herhut

unread,
Nov 6, 2019, 8:49:10 AM11/6/19
to Alex Zinenko, Mahesh Ravishankar, Mehdi AMINI, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
If we define convergence outside of gpu (or other SIMT dialects) how do we control its effect? I am thinking about code reuse with more relaxed SPMD dialects. Assume we had some code in a higher-level dialect that uses e.g. a sync which is marked convergent to support SIMT. If I use this code in a non-SIMT setting, e.g. by using a launch operation for a less restricted SPMD target, how do I make optimizations ignore the convergent property in that part of the code? If convergence was just an annotation on functions or instances of operations, one could simply strip the attribute. If it becomes a property of operations, one would need to lower to a different dialect.

I think we also had the default wrong in the original proposal. Unknown operations and operations without annotations should be considered as having no constraints. If the lowering to a specific target supports those operations and requires convergence properties, the operations should have been annotated accordingly. If operations are not supported by a target, translation will fail. After all, convergence controls how operations are moved but not whether they can be elided altogether. Or is this a too naive view?

Cheers
  Stephan
 

Alex Zinenko

unread,
Nov 7, 2019, 5:18:12 AM11/7/19
to Stephan Herhut, Lei Zhang, Mahesh Ravishankar, MLIR, Christian Sigg
Sorry, I did not anchor your comment to the right location. AFAICS, we don't have other properties that would need to be verified on a call graph. I think the proposal actually achieves the same property:
1) ops with a fixed control-affecting set are only allowed within gpu.func and gpu.launch;
2) gpu.func can only be called from another gpu.func;
=> ops with a fixed control-affecting set cannot be present in funcs whose call graph up to gpu.launch_func contains anything but gpu.func.
Not being attached to the call graph makes this restriction carry over easily to the gpu.launch case.

As a clarification, the fixed control-affecting set is not an Attribute (assuming annotation = attribute); it is a semantic property of the operation, potentially a trait.
 

I still would not want to limit it to the immediate target or body of a launch.

It is not. The proposal says a `gpu.func` or a `gpu.launch` should be an ancestor of such an operation, not an immediate ancestor, i.e. the parent. I'll clarify.
 
I have not thought about this too deeply but doesn't convergence compose? If I have an op that is convergent inside a function, the function itself is also convergent. As I understand it, this is not something the proposal currently supports.

It says

"gpu.call" requires transformations not to modify the
set of values it is control-dependent upon unless the body of the
function is known not to contain any op that has the same requirement.

which sounds exactly like what you described, except on calls rather than on function definitions. It is not applicable to the "gpu.func" op itself because that op is not control-dependent on anything. We could have a notion of a function that requires all calls to it to have a fixed control-affecting set, but that sounds too indirect to me.
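Spelled out on a small example (using std.cond_br together with the proposed gpu.call and gpu.barrier ops; the syntax is illustrative only):

// @sync_helper contains gpu.barrier, whose control-affecting set must stay
// fixed, so any gpu.call to it inherits the same requirement.
gpu.func @sync_helper() {
  gpu.barrier
  gpu.return
}

gpu.func @kernel(%cond : i1) {
  // The call control-depends on %cond (assumed uniform across the workgroup);
  // transformations may neither add values to nor remove values from that
  // set, e.g. by hoisting the call above the branch.
  cond_br %cond, ^then, ^exit
^then:
  gpu.call @sync_helper() : () -> ()
  br ^exit
^exit:
  gpu.return
}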


--
-- Alex

Alex Zinenko

unread,
Nov 7, 2019, 5:42:03 AM11/7/19
to Stephan Herhut, Mahesh Ravishankar, Mehdi AMINI, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
I don't get why a higher-level dialect would have an operation marked convergent, or carry any other property relating to a lower-level execution model. It sounds like an abstraction leak. If a dialect has a sync operation, it already deals with some sort of multi-threaded execution model, so it may as well be aware of convergence. If it isn't, the execution model it reflects is different, and may or may not be compatible with a SIMT model. It's fine if it is not; we should then do some transformations before lowering to this level.

What I am worried about if convergence has to be a core property, is that all operations will have to position themselves with respect to it. Like tf.graph and std.module will need to say they are not requiring convergence, which makes little sense at their respective levels of abstraction.

If convergence was just an annotation on functions or instances of operations, one could simply strip the attribute. If it becomes a property of operations, one would need to lower to a different dialect.

It's already not an attribute, precisely because it's too easy to remove it.
 

I think we also had the default wrong in the original proposal. Unknown operations and operations without annotations should be considered as having no constraints. If the lowering to a specific target supports those operations and requires convergence properties, the operations should have been annotated accordingly. If operations are not supported by a target, translation will fail. After all, convergence controls how operations are moved but not whether they can be elided altogether. Or is this a too naive view?

For abstract code motion transformations (CSE, constant-prop, LICM, etc.) in presence of unknown/unregistered ops, the reasonable behavior is to stay conservative. Imagine doing LICM on loops with NVVM inside and NVVM not being registered. I do understand it looks like it could hamper some transformations, but the general contract for unknown ops is "we cannot transform them, but at least we don't break them". If you want transformations to be aware of your ops, register them.

I don't think convergence prevents you from removing an op completely.


--
-- Alex

Stephan Herhut

unread,
Nov 7, 2019, 5:42:58 AM11/7/19
to Alex Zinenko, Lei Zhang, Mahesh Ravishankar, MLIR, Christian Sigg
Thanks. I somehow completely missed the paragraph about function calls. For me "the body of the function is known not to contain any op" is a property of the function op, so we have the same idea in mind.

Cheers
  Stephan


--
Stephan

Stephan Herhut

unread,
Nov 7, 2019, 7:04:06 AM11/7/19
to Alex Zinenko, Mahesh Ravishankar, Mehdi AMINI, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
If you have a generic transformation that can operate on different levels of abstraction, you have to introduce convergence at all levels where the transformation would be applicable in order to prevent it from happening. The example in my mind was a generic lowering of a high-level op (at the level of reduction/convolution/etc.), first to a shared level of abstraction with parallel loops that support syncing across their iterations. This can then be lowered both to SPMD, where sync is not convergent and different threads can sync on different sync instructions, and to a SIMT dialect, where sync is convergent. If convergence is a trait, then the parallel-loops-level op will have to carry it, as otherwise transformations applied at that level would be illegal when later lowered to a SIMT target.

What I am worried about if convergence has to be a core property, is that all operations will have to position themselves with respect to it. Like tf.graph and std.module will need to say they are not requiring convergence, which makes little sense at their respective levels of abstraction.

We share this concern. I was under the impression that the proposal was to make it a general concept on the op-trait level. 
 
If convergence was just an annotation on functions or instances of operations, one could simply strip the attribute. If it becomes a property of operations, one would need to lower to a different dialect.

It's already not an attribute, precisely because it's too easy to remove it.

My concern is that going for operation traits swings too far in the other direction, making it too hard to ignore.
 
 

I think we also had the default wrong in the original proposal. Unknown operations and operations without annotations should be considered as having no constraints. If the lowering to a specific target supports those operations and requires convergence properties, the operations should have been annotated accordingly. If operations are not supported by a target, translation will fail. After all, convergence controls how operations are moved but not whether they can be elided altogether. Or is this a too naive view?

For abstract code motion transformations (CSE, constant-prop, LICM, etc.) in presence of unknown/unregistered ops, the reasonable behavior is to stay conservative. Imagine doing LICM on loops with NVVM inside and NVVM not being registered. I do understand it looks like it could hamper some transformations, but the general contract for unknown ops is "we cannot transform them, but at least we don't break them". If you want transformations to be aware of your ops, register them.

I agree with this in the general case, and for other properties like side-effecting my intuition also goes towards being conservative. Convergence, however, seems less "general" as a property. It feels much more reasonable to me to specify a dialect that ignores convergence than to specify a dialect that ignores whether ops are side-effecting. If I then wanted to use such a dialect in my lowering, I would have to retrofit convergence. This is hard if it is a trait.

Maybe specifying convergence as a dialect/operation interface would be enough to resolve my concerns here.

Cheers
  Stephan


--
Stephan

Sanjoy Das

unread,
Nov 7, 2019, 11:57:21 AM11/7/19
to Stephan Herhut, Alex Zinenko, Mahesh Ravishankar, Mehdi AMINI, MLIR, Christian Sigg, Tim Shen, George Karpenkov
Hi,

I want to first give a bit of context on why I even suggested using
something more structured than CFGs to model convergence.

I think modeling convergence as a "structural restriction" on the set
of control dependencies is unusual and backwards. Ideally, we'd
define an executable semantics for the "weird" GPU operations like
shuffle and ballot, and the structural restrictions on the control
dependencies would automatically follow from these executable
semantics. Other examples of such reasoning:

a. We don't say "the set of control deps on a volatile store must
remain the same", that is just a consequence of the following
executable semantics "the sequence of volatile stores made by the
program must remain the same".

b. While we do like to say "SSA defs dominate uses", this is again
just a consequence of "we don't ever use an undefined SSA register".

c. Refinement optimizations are naturally modeled by executable
semantics: a program in general has N (could be infinite) traces
associated with it and the executable semantics involves making a
non-deterministic choice between these N traces. Refinement
optimizations (for instance, CSE'ing two relaxed atomic reads) just
make this set of traces smaller, which is fine because executable
semantics was just "arbitrarily choosing" between these N traces
anyway (so it could have "decided to choose" from the smaller set of
traces under the "as if" rule).



Given that we are designing a new IR for GPUs, can we directly encode
the executable semantics of the GPU kernels in the IR so that we don't
have to specify convergence as a structural restriction?

Note: I don't think structured control flow immediately solves the
problem but it could be a good starting point for a more principled
representation.

Btw, do we have a list of the "odd" GPU functions we need to support?
If not I can start a doc to document these first.

-- Sanjoy

Mehdi AMINI

unread,
Nov 7, 2019, 12:21:04 PM11/7/19
to Stephan Herhut, Alex Zinenko, Mahesh Ravishankar, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
I mentioned it in the thread before: making convergence a core part of MLIR goes beyond introducing "gpu.func", and I'd like to see a complete RFC that focuses purely on this topic. The only way I can see convergence being introduced by gpu.func (or any dialect) is if this dialect is very strict about the operations contained in the region it defines. For instance, just like tf_executor does, the gpu dialect could specify that only gpu dialect operations are allowed in the region (which would exclude standard ops, affine ops, etc.).
Even there, I am not sure the generic passes could be made to operate on such a dialect.

-- 
Mehdi

Uday Bondhugula

unread,
Nov 7, 2019, 11:55:08 PM11/7/19
to MLIR
Sorry, I'm late to this discussion - but I've read the other messages. A few questions - please feel free to point out if these were already answered in one of your responses. 

1) What if you restricted the gpu.func to just contain a single block terminated with a return (i.e., banning br and cond_br operations)? The available control flow operations would include std.if, std.for, affine.if, affine.for (all of them are structured). And other structured variants could be introduced/supported if needed. Do you need anything more general for the current GPU targets and in the foreseeable future?

2) What will be the lowering path for gpu.func? Would this directly be converted to a function concept in one of the things that the GPU dialect is lowered to - like NVVM, SPIR-V, ...? (like this thread mentions https://groups.google.com/a/tensorflow.org/forum/#!searchin/mlir/stephan%7Csort:date/mlir/dqdgxRHOOrc/TMkQainqAAAJ ) Would you ever need gpu.func to hold an arbitrary CFG list of basic blocks? I think in one of your responses you mentioned you'd need that while lowering away higher level structured ops like for ops - but if you wanted to convert to such a traditional CFG list of blocks form, wouldn't you also get out of gpu.func with the latter becoming <something_else>.func or std.func / llvm.func?

3) What happens to the earlier thought process of having the GPU kernels as a nested region within the code that is actually launching the kernel so that things remain intra-function / intra-op as opposed to inter procedural -- for eg. to facilitate propagation of constants from the launch site into GPU kernels/functions. Is there an "inlined" approach to gpu.func?

~ Uday

On Thursday, October 31, 2019 at 10:33:28 PM UTC+5:30, Alex Zinenko wrote:
If you are not interested in the GPU pipeline, you can stop reading now.


use higher-level control flow ops such as structured conditionals or
loops. Each Op is control-dependent on a set of values.  That is, the
control flow reaching this Op is conditioned on these values.  By
default, the control flow depends on the operands of the terminator Op
in the predecessor blocks; and the operands of the Ops to which the
surrounding regions belong. The terminator or the region-containing Op
can further specify which of their operands affect the control flow.
For example, successors of `std.cond_br` only control-depend on the
value of its first argument that expresses the condition. Inside a
non-kernel GPU function, the Op is also transitively control-dependent
on any values the call-site operation is.

An op can require this set to remain fixed during transformations,
i.e. the control flow cannot be made dependent on more or less values
than before the transformation. The “more” part is equivalent to
LLVM’s convergent property [1]. The “less” part---to its inverse. The
“same control-affecting value set” is therefore more restrictive than
the LLVM model as the latter has been demonstrated to be insufficient
for some memory-related operations [2]. The restriction is intended to
be decomposed into several independent elements, possibly along the
lines of what is proposed in [2] and being discussed in LLVM [3], but
this relaxation will be proposed and discussed separately due to its
complexity. Any op that requires the set of control-affecting values
to remain fixed must have either a "gpu.func" or a "gpu.launch" op as
an ancestor.

Within GPU-function bodies, we conservatively assume that
side-effecting (or unregistered) operations require the set of
control-affecting operations to be fixed. Standard and Affine loads
and stores do not require this property.


Alex Zinenko

unread,
Nov 8, 2019, 3:58:57 AM11/8/19
to Stephan Herhut, Mahesh Ravishankar, Mehdi AMINI, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
I need to wrap my head around this problem. My thinking is that the transformation would only be problematic if the lowering from non-loop dialect to loop-with-sync dialect somehow assumed convergence of the sync. It's only once we have introduced it that we have to preserve it.
 

What I am worried about if convergence has to be a core property, is that all operations will have to position themselves with respect to it. Like tf.graph and std.module will need to say they are not requiring convergence, which makes little sense at their respective levels of abstraction.

We share this concern. I was under the impression that the proposal was to make it a general concept on the op-trait level.

The proposal above is to try modeling it at the GPU dialect level. It's problematic because we cannot propagate it to unknown ops with regions that could appear inside GPU functions. So we were discussing whether it should be promoted to a core property and if so, with which mechanism (traits, interfaces, attributes). I would suggest factoring out this discussion into a separate thread while we put at least some definition of the execution model in writing.
 
 
 
If convergence was just an annotation on functions or instances of operations, one could simply strip the attribute. If it becomes a property of operations, one would need to lower to a different dialect.

It's already not an attribute, precisely because it's too easy to remove it.

My concern is that going for operation traits swings too far in the other direction, making it too hard to ignore.

What would be the middle ground?
 
 
 

I think we also had the default wrong in the original proposal. Unknown operations and operations without annotations should be considered as having no constraints. If the lowering to a specific target supports those operations and requires convergence properties, the operations should have been annotated accordingly. If operations are not supported by a target, translation will fail. After all, convergence controls how operations are moved but not whether they can be elided altogether. Or is this a too naive view?

For abstract code motion transformations (CSE, constant-prop, LICM, etc.) in presence of unknown/unregistered ops, the reasonable behavior is to stay conservative. Imagine doing LICM on loops with NVVM inside and NVVM not being registered. I do understand it looks like it could hamper some transformations, but the general contract for unknown ops is "we cannot transform them, but at least we don't break them". If you want transformations to be aware of your ops, register them.

I agree with this in the general case, and for other properties like side-effecting my intuition also goes towards being conservative. Convergence, however, seems less "general" as a property. It feels much more reasonable to me to specify a dialect that ignores convergence than to specify a dialect that ignores whether ops are side-effecting. If I then wanted to use such a dialect in my lowering, I would have to retrofit convergence. This is hard if it is a trait.

My current thinking is to have "allows-divergence" (i.e. the op can be made control-dependent on more values than before) and "allows-convergence" (i.e. the op can be made control-dependent on fewer values than before) properties on operations. For registered operations, the default would be to allow both, because there are more operations that don't care than operations that do. For unregistered operations, however, we conservatively assume the inverse.
 

Maybe specifying convergence as a dialect/operation interface would be enough to resolve my concerns here.

I actually don't think there is a significant difference between traits and interfaces: we can query whether an abstract operation has a trait just as well as whether it implements a certain interface, plus the result of calling an interface function.


--
-- Alex

Alex Zinenko

unread,
Nov 8, 2019, 4:25:59 AM11/8/19
to Sanjoy Das, Stephan Herhut, Mahesh Ravishankar, Mehdi AMINI, MLIR, Christian Sigg, Tim Shen, George Karpenkov
This all makes sense to me. The problem with MLIR is its open operation set: transformations should work on it, even on unknown/unregistered operations. So we will have to extract parts of the weird semantics and make them somehow reusable across different dialects, not only GPU. A typical case is intermixing GPU dialects with other things, such as loops, that are not aware (and arguably shouldn't be aware) of the semantics of the surrounding or contained ops. Structural properties, on the other hand, are expressed in terms of the core IR and don't require semantics propagation between different dialects.
 

Btw, do we have a list of the "odd" GPU functions we need to support?
If not I can start a doc to document these first.

It would be great to have a list, but here again, we should also think about how new operations can be added.
 

-- Sanjoy


--
-- Alex

Alex Zinenko

unread,
Nov 8, 2019, 4:40:21 AM11/8/19
to Uday Bondhugula, MLIR
On Fri, Nov 8, 2019 at 5:55 AM 'Uday Bondhugula' via MLIR <ml...@tensorflow.org> wrote:
Sorry, I'm late to this discussion - but I've read the other messages. A few questions - please feel free to point out if these were already answered in one of your responses. 

1) What if you restricted the gpu.func to just contain a single block terminated with a return (i.e., banning br and cond_br operations)? The available control flow operations would include std.if, std.for, affine.if, affine.for (all of them are structured). And other structured variants could be introduced/supported if needed. Do you need anything more general for the current GPU targets and in the foreseeable future?

gpu.func(...) {
  "unknown.op"() ({
  ^bb1:
    "i.can.do.whatever.here"(): () -> ()
  ^bb2:
    "there.is.no.verification"() : () -> ()
  }) : () -> ()
}

The gpu.func here only has one block. The restriction wouldn't work unless we allow an operation to restrict _all_ descendants in the region tree, not only immediate descendants, but that arguably is a problem for op composition.
The viable solution is to restrict the body of the function to only contain specific operations (e.g. ifs and fors). Even then, we are basically trusting that those operations won't evolve to break the assumptions of the GPU dialect.


2) What will be the lowering path for gpu.func? Would this directly be converted to a function concept in one of the things that the GPU dialect is lowered to - like NVVM, SPIR-V, ...? (like this thread mentions https://groups.google.com/a/tensorflow.org/forum/#!searchin/mlir/stephan%7Csort:date/mlir/dqdgxRHOOrc/TMkQainqAAAJ )

It's up to specific lowerings to decide what the right abstraction is. For SPIR-V, it will indeed be functions. For NVVM, it's an LLVM function in a special module that will include the memory allocations.
 
Would you ever need gpu.func to hold an arbitrary CFG list of basic blocks?

Depending on the order of lowerings, it may end up containing NVVM + LLVM even today. I wouldn't want to impose a specific lowering order, again for composability reasons.
 
I think in one of your responses you mentioned you'd need that while lowering away higher level structured ops like for ops - but if you wanted to convert to such a traditional CFG list of blocks form, wouldn't you also get out of gpu.func with the latter becoming <something_else>.func or std.func / llvm.func?

It's the question of timing. We want lowering to be more progressive and I don't see a strong enough reason to impose a fixed order here.
 

3) What happens to the earlier thought process of having the GPU kernels as a nested region within the code that is actually launching the kernel so that things remain intra-function / intra-op as opposed to inter procedural -- for eg. to facilitate propagation of constants from the launch site into GPU kernels/functions. Is there an "inlined" approach to gpu.func?

Nothing changes. You still have gpu.launch, which is "inlined"; there is a part of the proposal that adds the execution model and the memory attribution to its region. gpu.func is actually needed because we need a semantics anchor for the outlined case.
 

Alex Zinenko

unread,
Nov 8, 2019, 4:49:58 AM11/8/19
to Mehdi AMINI, Stephan Herhut, Mahesh Ravishankar, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
+1. It would be even better to have an open discussion to prepare the RFC to start with.

Partly off-topic here, but since we are modeling LLVM IR as a dialect, we will have to model LLVM's convergence model somehow even if we can do better for the GPU dialect. Otherwise, we cannot run transformations on the LLVM dialect. Not sure we actually want to, but we already do CSE :)


--
-- Alex

Stephan Herhut

unread,
Nov 8, 2019, 5:20:30 AM11/8/19
to Alex Zinenko, Mehdi AMINI, Mahesh Ravishankar, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
Let's remove the parts concerning convergence from the proposal and ship/refine the other parts independently. I do not see a rush to define execution semantics in the context of this RFC.

+1 to having some form of open design meeting to lay the groundwork for convergence.

Cheers
  Stephan


--
Stephan

Stephan Herhut

unread,
Nov 8, 2019, 5:35:22 AM11/8/19
to Alex Zinenko, Mahesh Ravishankar, Mehdi AMINI, Sanjoy Das, MLIR, Christian Sigg, Tim Shen, George Karpenkov
Just to clarify: my thinking was that I could retrofit the convergence interface onto a dialect without modifying the dialect's definition. Think of it as using a dialect as a library and specifying, independently, additional interfaces it implements. I am not sure whether one can register interfaces on a dialect independently of its dialect registration, but that could be fixed.

Cheers
  Stephan


--
Stephan