If you are not interested in the GPU pipeline, you can stop reading now.
We propose to introduce a new function-like operation to the GPU
dialect to make the execution model explicit and to introduce support
for GPU memory hierarchy. This aligns with the overall goal of the
dialect to expose the common abstraction layer for GPU devices, in
particular by providing an MLIR unit of semantics (i.e. an operation)
to define the execution model. This RFC deals with both the execution
model and the memory hierarchy because memory visibility is related to
the execution model.
Function Definition
===================
The operation "gpu.func" defines a function that can be executed on a
GPU. It supports memory attribution and has a particular execution
model.
GPU functions are either kernels or regular functions. The former can
be launched through "gpu.launch_func" operations. The latter can be
called from kernels and regular GPU functions, and also support memory
attribution. Standard functions can be called from GPU functions. See
below for call rules.
The memory attribution defines SSA values that correspond to memory
buffers allocated in the memory hierarchy of the GPU (see below).
The operation has one attached region that corresponds to the body of
the function. The region arguments consist of the function arguments
without modification, followed by the buffers defined in the memory
attribution. The body of a GPU function, when launched, is executed by
multiple work items. There are no guarantees on the order in which
work items execute, or on the connection between them. In particular,
work items are not necessarily executed in lock-step. Synchronization
ops such as "gpu.barrier" should be used to coordinate work items.
Declarations of GPU functions, i.e. not having the body region, are
not supported.
The control flow inside the GPU function is an arbitrary CFG and can
use higher-level control flow ops such as structured conditionals or
loops. Each Op is control-dependent on a set of values. That is, the
control flow reaching this Op is conditioned on these values. By
default, the control flow depends on the operands of the terminator Op
in the predecessor blocks; and the operands of the Ops to which the
surrounding regions belong.
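As a minimal illustration of the default rule (block and value names are
hypothetical, not part of the proposal):

  cond_br %cond, ^then, ^else
^then:
  // Ops here are, by default, control-dependent on %cond, the operand of
  // the cond_br terminator of the predecessor block.
  ...
^else:
  ...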
The buffers are considered live throughout the execution of the GPU
function body. The absence of memory attribution syntax means that the
function does not require special buffers.
Rationale: although the underlying models declare memory buffers at
the module level, we chose to do it at the function level to provide
some structuring for the lifetime of those buffers; this avoids the
incentive to use the buffers for communicating between different
kernels or launches of the same kernel, which should be done through
function arguments instead. We chose not to use an `alloca`-style
approach that would require a more complex lifetime analysis, following
the principles of MLIR that promote structure and representing analysis
results in the IR.
The buffers are required to be of a statically-shaped `memref` type.
Rationale: the amount of special memory must be statically known in
most underlying models; dynamic allocation of special memory requires
additional manipulation. Memref is the only type that currently
supports memory spaces.
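For illustration only, a kernel with memory attribution could look like the
sketch below. The concrete syntax (the workgroup/private keywords, the
"kernel" attribute spelling and "gpu.return") is not fixed by this proposal,
and the names are hypothetical:

  gpu.func @scale(%arg0 : memref<1024xf32>)
      workgroup(%scratch : memref<256xf32, 3>)
      private(%tmp : memref<1xf32, 5>)
      attributes { kernel } {
    // %scratch is shared by all work items of a workgroup, %tmp is
    // per-work-item; both are statically shaped, as required above.
    ...
    gpu.return
  }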
Other Modifications
===================
gpu.launch
----------
"gpu.launch" becomes essentially an inlined version of "gpu.func"
combined with "gpu.launch_func". It's body region has the same
semantics and constraints as the body of "gpu.func", and "gpu.launch"
op also supports memory attribution.
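A sketch of what this could look like; the attribution clause on
"gpu.launch" and the constant names are assumptions of this example, not a
fixed syntax:

  gpu.launch blocks(%bx, %by, %bz) in (%nbx = %c8, %nby = %c1, %nbz = %c1)
             threads(%tx, %ty, %tz) in (%ntx = %c256, %nty = %c1, %ntz = %c1)
             workgroup(%scratch : memref<256xf32, 3>) {
    // The body has the same semantics and constraints as a "gpu.func" body.
    ...
  }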
gpu.launch_func
---------------
The "gpu.launch_func" operation now only supports symbols that are
defined by a "gpu.func" operation with the "kernel" attribute.
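For example, in generic op form (the exact operand order and attribute
spelling are illustrative; @scale refers to the hypothetical kernel sketched
earlier):

  "gpu.launch_func"(%c8, %c1, %c1, %c256, %c1, %c1, %buf)
      { kernel = @scale }
      : (index, index, index, index, index, index, memref<1024xf32>) -> ()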
Alternatives Considered
=======================
Memory attribution at module level
----------------------------------
Both NVVM and SPIR-V use module-level values to define buffers in
workgroup-level memory, as well as in global and constant memory. Any
function within the module can directly access such values, and they
can be used to pass data between the functions within one kernel
invocation.
Thanks a lot for the proposal Alex. Comments from me are below.

On Thu, Oct 31, 2019 at 10:03 AM Alex Zinenko <zin...@google.com> wrote:
> GPU functions are either kernels or regular functions.

This sounds a bit ambiguous to me. Typically kernels need to have a fixed API (CUDA requires __global__ functions to have a void return type; SPIR-V/Vulkan requires the signature to be void(void)). Functions called within a kernel don't have this restriction. How about keeping GPU functions just for kernels and using Standard functions for functions called within GPU functions? You can then just disallow calling a GPU function from another GPU function or a Standard function to disallow dynamic parallelism (for now).
> Declarations of GPU functions, i.e. not having the body region, are
> not supported.

Could you explain a bit more the rationale for not supporting this? I see this as an orthogonal linking issue that is out of scope of this proposal, so I am not sure why it is mentioned here.
gpu.call
--------
Syntax
------
op ::= `gpu.call` symbol-ref-id `(` ssa-id-list `)` attribute-dict?
`:` function-type
Example
-------
gpu.call @foo(%0) : (index) -> ()
Memory Hierarchy
================
GPU memory hierarchy is reflected through memref memory spaces. We use
memory space 1 for global memory, 3 for workgroup memory, 4 for
constant memory and 5 for private memory, following the NVVM spec [4].
In the longer term, we should consider redesigning memory spaces in
memref to use strings or other named identifiers so as to avoid
clashes between memory spaces used by different devices. Core IR
changes will be proposed separately.
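With this mapping, the hierarchy becomes visible directly in the types, for
example:

  memref<1024xf32, 1>   // global memory
  memref<256xf32, 3>    // workgroup memory
  memref<64xf32, 4>     // constant memory
  memref<4xf32, 5>      // private (per-work-item) memory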
Data in the workgroup memory can be accessed by any work item in the
group, with the same subscripts pointing to the same element of the memref in
all work items. Since work items may be executed concurrently, a
workgroup synchronization may be required to satisfy a data dependency
between multiple work items.
Data in private memory can only be accessed by one work item. The same
subscripts used to access an element of a memref in private memory
will point to (conceptually) different addresses for different work
items.
For both memory spaces, data may be accessed using existing load and
store instructions, i.e. standard or affine ones.
Other memory types (constant, texture) can be added in the future.
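A small sketch of such an access pattern, assuming %val and %idx are defined
earlier and %scratch is a workgroup-attributed buffer:

  %tid = "gpu.thread_id"() {dimension = "x"} : () -> index
  store %val, %scratch[%tid] : memref<256xf32, 3>    // write own slot
  gpu.barrier                                        // make writes visible to the workgroup
  %other = load %scratch[%idx] : memref<256xf32, 3>  // read another work item's slot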
Memory Attribution
==================
Memory buffers are defined at the function level, either in
"gpu.launch" or in "gpu.func" ops. This encoding makes it clear where
the memory belongs makes the lifetime of the memory visible. The
memory is only accessible while the kernel is launched/the function is
currently invoked (there is an activation on the stack). The latter is
more strict than actual gpu implementations but using static memory at
the function level is just for convenience. It is also always possible
to pass pointers to the shared memory into other functions, provided
they expect the correct memory space.Is there a concept of "generic pointers"? Then do we want to support address space cast operations?
In the future, this can be improved using
a buffer allocation path (that can be generic and shared with other
dialects) given the lifetime information extracted from the function
call graph. This might require supporting memory attribution at the
module level and "passing" the buffers to the GPU functions, which is
left for future work, along with the buffer allocation itself.

I think this is only relevant for private memory, right? For shared memory you cannot allocate additional shared memory within the kernel. For private memory, though, I don't fully follow why the GPU function needs to take an opinion on it. It's up to the client creating the gpu.func operation to manage the private memory. What if the std.func called within gpu.func is recursive and has an alloca? We cannot concatenate the buffers as suggested here in that case.
This can be relaxed in the future
to support dynamic kernel launch.
Standard functions can be called from GPU functions using "std.call".
Recursive calls are only allowed for non-kernel GPU functions with no
memory attribution. Rationale: kernel GPU functions are intended to be
called from the host, so they cannot call themselves from the device; memory
attribution requires buffers to be live during the function execution, and
a potentially unbounded recursion makes it impossible to evaluate the
total amount of special memory required by the function.
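A sketch of these rules (function names and types are hypothetical): a
non-kernel GPU function without memory attribution may recurse and may call
standard functions, e.g.:

  gpu.func @helper(%x : f32, %n : index) -> f32 {
    ...
    %y = call @scalar_op(%x) : (f32) -> f32               // std.call into a standard function
    %z = gpu.call @helper(%y, %n) : (f32, index) -> f32   // recursion: allowed, no attribution
    gpu.return %z : f32
  }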
> GPU memory hierarchy is reflected through memref memory spaces. We use
> memory space 1 for global memory, 3 for workgroup memory, 4 for
> constant memory and 5 for private memory, following the NVVM spec [4].

In SPIR-V there are more storage classes than this: https://www.khronos.org/registry/spir-v/specs/unified1/SPIRV.html#_a_id_storage_class_a_storage_class and we'd like to utilize some of them (e.g., PushConstant) for Vulkan compute. I guess it's fine to keep the number assignment consistent with NVVM here, but we'd need to use other numbers too. :)
> By default, the control flow depends on the operands of the terminator
> Op in the predecessor blocks; and the operands of the Ops to which the
> surrounding regions belong.

This makes sense to me, but is there any special consideration needed for operations containing regions that are not isolated from above?
> The buffers are required to be of a statically-shaped `memref` type.
> Rationale: the amount of special memory must be statically known in
> most underlying models; dynamic allocation of special memory requires
> additional manipulation.

This seems unnecessary. CUDA, for example, can have a "dynamic shape" shared memory size: the size is specified on the host side during the launch. What additional manipulation are you referring to here?
> Both NVVM and SPIR-V use module-level values to define buffers in
> workgroup-level memory, as well as in global and constant memory. Any
> function within the module can directly access such values, and they
> can be used to pass data between the functions within one kernel
> invocation.

The fact that the workgroup-level memory is represented at module scope is really an artifact and shouldn't be relied upon AFAIK. Global and constant are intended to be live across kernel boundaries, and that seems pretty natural. The proposal here does not restrict this for global/constant AFAIK. The only remaining one is private. To me, the scope of that should be limited to within the gpu.func op. The host need not specify how much private memory to use.
> The control flow inside the GPU function is an arbitrary CFG

I'm not sure this works for Vulkan compute. We'll need structured control flow actually.
> "gpu.launch" becomes essentially an inlined version of "gpu.func"
> combined with "gpu.launch_func".

Maybe not relevant to this RFC, but should we consider deduplicating the mechanisms here then? If it does not provide a useful abstraction differentiation, having fewer mechanisms is generally better IMO. :)
> Standard functions can be called from GPU functions using "std.call".

Is the difference that a GPU function has the convergent property? Are there other aspects that make a difference between std.call and gpu.call?
Thanks for the responses Alex. So is it fair to say that the introduction of gpu.func and gpu.call has more to do with convergence properties and with limiting transformations that might affect the control-dependence of these ops?
> use higher-level control flow ops such as structured conditionals or
> loops. Each Op is control-dependent on a set of values. That is, the
> control flow reaching this Op is conditioned on these values. By
> default, the control flow depends on the operands of the terminator Op
> in the predecessor blocks; and the operands of the Ops to which the
> surrounding regions belong. The terminator or the region-containing Op
> can further specify which of their operands affect the control flow.
Will this be specified as an OpInterface?
Hi Mahesh,
Thanks for the clarification. Now I know where the usage of the term is from. Then, we may choose to use the terms crossworkgroup and workgroup (corresponding to global and local in OpenCL) when the spec is written down.
Trent
(I intended to reply to the mlir mailing thread. I do not know why my outlook drops the mlir mailing address for me.)
From: Mahesh Ravishankar <ravish...@google.com>
Sent: Monday, November 4, 2019 11:14 AM
To: Trent Lo <tre...@nvidia.com>
Cc: Alex Zinenko <zin...@google.com>; Christian Sigg <cs...@google.com>
Subject: Re: [mlir] [RFC][GPU] Functions and Memory Modeling
Hi Trent,
Nice to hear from you. I think the origin of these terms is SPIR-V, where __shared__ is referred to as Workgroup. Since we want the GPU dialect to not be CUDA-specific, we are looking for terminology accepted by a standards committee like Khronos (which defines the standards for OpenCL and SPIR-V) but that still maps to CUDA. I do second that we should be consistent with the terminology and provide explicit documentation of what each thing means and how it maps to, say, CUDA or SPIR-V.
Also, feel free to respond to the common thread. I think people do expect the MLIR mailing list to have heavy traffic.
Thanks
--
Mahesh
On Mon, Nov 4, 2019 at 10:45 AM Trent Lo <tre...@nvidia.com> wrote:
Hi Alex,
Generally the proposal looks fine to me.
One small request:
Could we refer to the "workgroup-level" memory attribute as "shared" (which is aligned with the CUDA terminology)? I also don't mind if we completely borrow the OpenCL terms (i.e., global/local/private), but using workgroup alongside global/private to refer to memory attribution does not sound symmetric and may be confusing.
Thanks,
Trent
As a matter of fact, we do have a use case where we need a CFG inside GPU functions -- lowering. We lower the body of the (currently standard) function to the NVVM dialect, which only has CFG. Even more, the fact that the NVVM dialect exists means that we need to somehow model convergence on CFGs, even if we made it more conservative than LLVM because we do not intend to apply transformations at that level.
> I'm not sure this works for Vulkan compute. We'll need structured control flow actually.

This brings back a general question of how dialects compose. I see arbitrary control flow as being in the same category as supporting some special operation: a specific lowering of the gpu dialect to a target might disallow arbitrary control flow, just as it might not support some operations. An interesting question in the same direction is whether we want to allow synchronization in the context of arbitrary control flow, or whether we limit it to structured control flow, at least initially, to simplify specifying the semantics.
I still would not want to limit it to the immediate target or body of a launch.
I have not thought about this too deeply but doesn't convergence compose? If I have an op that is convergent inside a function, the function itself is also convergent. As I understand it, this is not something the proposal currently supports.
If convergence was just an annotation on functions or instances of operations, one could simply strip the attribute. If it becomes a property of operations, one would need to lower to a different dialect.
I think we also had the default wrong in the original proposal. Unknown operations and operations without annotations should be considered as having no constraints. If the lowering to a specific target supports those operations and requires convergence properties, the operations should have been annotated accordingly. If operations are not supported by a target, translation will fail. After all, convergence controls how operations are moved but not whether they can be elided altogether. Or is this a too naive view?
What I am worried about if convergence has to be a core property, is that all operations will have to position themselves with respect to it. Like tf.graph and std.module will need to say they are not requiring convergence, which makes little sense at their respective levels of abstraction.
> If convergence was just an annotation on functions or instances of operations, one could simply strip the attribute.

It's already not an attribute, precisely because it's too easy to remove it.

> I think we also had the default wrong in the original proposal. Unknown operations and operations without annotations should be considered as having no constraints.

For abstract code motion transformations (CSE, constant-prop, LICM, etc.) in the presence of unknown/unregistered ops, the reasonable behavior is to stay conservative. Imagine doing LICM on loops with NVVM inside and NVVM not being registered. I do understand it looks like it could hamper some transformations, but the general contract for unknown ops is "we cannot transform them, but at least we don't break them". If you want transformations to be aware of your ops, register them.
The control flow inside the GPU function is an arbitrary CFG and can
use higher-level control flow ops such as structured conditionals or
loops. Each Op is control-dependent on a set of values. That is, the
control flow reaching this Op is conditioned on these values. By
default, the control flow depends on the operands of the terminator Op
in the predecessor blocks; and the operands of the Ops to which the
surrounding regions belong. The terminator or the region-containing Op
can further specify which of their operands affect the control flow.
For example, successors of `std.cond_br` only control-depend on the
value of its first argument that expresses the condition. Inside a
non-kernel GPU function, an Op is also transitively control-dependent
on any values that the call-site operation is control-dependent on.
An op can require this set to remain fixed during transformations,
i.e. the control flow cannot be made dependent on more or fewer values
than before the transformation. The “more” part is equivalent to
LLVM’s convergent property [1]. The “less” part---to its inverse. The
“same control-affecting value set” is therefore more restrictive than
the LLVM model as the latter has been demonstrated to be insufficient
for some memory-related operations [2]. The restriction is intended to
be decomposed into several independent elements, possibly along the
lines of what is proposed in [2] and being discussed in LLVM [3], but
this relaxation will be proposed and discussed separately due to its
complexity. Any op that requires the set of control-affecting values
to remain fixed must have either a "gpu.func" or a "gpu.launch" op as
an ancestor.
Within GPU-function bodies, we conservatively assume that
side-effecting (or unregistered) operations require the set of
control-affecting values to be fixed. Standard and Affine loads
and stores do not require this property.
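As a small illustration of why the set must stay fixed, consider a barrier
under a conditional (assuming %cond is uniform across the workgroup):

  cond_br %cond, ^bb1, ^bb2
^bb1:
  gpu.barrier   // control-dependent on %cond
  br ^bb2
^bb2:
  ...

Hoisting the barrier above the cond_br would remove %cond from its
control-affecting value set, and sinking additional convergent ops under the
conditional would grow theirs; both transformations are disallowed for such
ops.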
> What I am worried about if convergence has to be a core property is that all operations will have to position themselves with respect to it.

We share this concern. I was under the impression that the proposal was to make it a general concept on the op-trait level.
> It's already not an attribute, precisely because it's too easy to remove it.

My concern is that going for operation traits swings too far in the other direction, making it too hard to ignore.
> For abstract code motion transformations (CSE, constant-prop, LICM, etc.) in the presence of unknown/unregistered ops, the reasonable behavior is to stay conservative.

I agree with this in the general case, and for other properties like side-effecting my intuition also goes towards being conservative. Convergence, however, seems less "general" as a property. It feels much more reasonable to me to specify a dialect that ignores convergence than to specify a dialect that ignores whether ops are side-effecting. If I then wanted to use such a dialect in my lowering, I would have to retrofit convergence, which is hard if it is a trait.
Maybe specifying convergence as a dialect/operation interface would be enough to resolve my concerns here.
Btw, do we have a list of the "odd" GPU functions we need to support?
If not I can start a doc to document these first.
-- Sanjoy
Sorry, I'm late to this discussion, but I've read the other messages. A few questions; please feel free to point out if these were already answered in one of your responses.

1) What if you restricted gpu.func to just contain a single block terminated with a return (i.e., banning br and cond_br operations)? The available control flow operations would include std.if, std.for, affine.if, affine.for (all of them structured), and other structured variants could be introduced/supported if needed. Do you need anything more general for the current GPU targets and in the foreseeable future?

2) What will be the lowering path for gpu.func? Would this directly be converted to a function concept in one of the things the GPU dialect is lowered to, like NVVM, SPIR-V, ...? (like this thread mentions: https://groups.google.com/a/tensorflow.org/forum/#!searchin/mlir/stephan%7Csort:date/mlir/dqdgxRHOOrc/TMkQainqAAAJ) Would you ever need gpu.func to hold an arbitrary CFG list of basic blocks? I think in one of your responses you mentioned you'd need that while lowering away higher-level structured ops like for ops; but if you wanted to convert to such a traditional CFG list-of-blocks form, wouldn't you also get out of gpu.func, with the latter becoming <something_else>.func or std.func / llvm.func?

3) What happens to the earlier thought process of having the GPU kernels as a nested region within the code that is actually launching the kernel, so that things remain intra-function / intra-op as opposed to interprocedural -- e.g., to facilitate propagation of constants from the launch site into GPU kernels/functions? Is there an "inlined" approach to gpu.func?