[llvm-dev] NVPTX codegen for llvm.sin (and friends)


Johannes Doerfert via llvm-dev

Mar 10, 2021, 2:41:19 PM
to Artem Belevich, jhole...@nvidia.com, llvm...@lists.llvm.org
Artem, Justin,

I am running into a problem and I'm curious if I'm missing something or
if the support is simply missing.
Am I correct to assume the NVPTX backend does not deal with `llvm.sin`
and friends?

This is what I see, with some variations: https://godbolt.org/z/PxsEWs
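For reference, a minimal reproducer of the kind that link shows (a sketch; the exact diagnostic varies with type and fast-math flags, but f64 fails cleanly):

```
; llc -march=nvptx64 cannot select this: there is no NVPTX lowering
; for llvm.sin.f64 and no libm to emit a libcall to
define double @foo(double %x) {
  %r = call double @llvm.sin.f64(double %x)
  ret double %r
}

declare double @llvm.sin.f64(double)
```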

If this is missing in the backend, is there a plan to get it working?
I'd really like to have the intrinsics in the middle end rather than
__nv_cos, not to mention that -ffast-math does emit intrinsics and
then crashes the backend.

~ Johannes


--
───────────────────
∽ Johannes (he/his)

_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Artem Belevich via llvm-dev

Mar 10, 2021, 3:38:55 PM
to Johannes Doerfert, llvm...@lists.llvm.org
On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert <johannes...@gmail.com> wrote:
Artem, Justin,

I am running into a problem and I'm curious if I'm missing something or
if the support is simply missing.
Am I correct to assume the NVPTX backend does not deal with `llvm.sin`
and friends?

Correct. It can't deal with anything that may need to lower to a standard library call. 

This is what I see, with some variations: https://godbolt.org/z/PxsEWs

If this is missing in the backend, is there a plan to get it working?
I'd really like to have the intrinsics in the middle end rather than
__nv_cos, not to mention that -ffast-math does emit intrinsics and
then crashes the backend.

It all boils down to the fact that PTX does not have the standard libc/libm which LLVM could lower the calls to, nor does it have a 'linking' phase where we could link such a library in, if we had it.

Libdevice bitcode does provide the implementations for some of the functions (though with a __nv_ prefix) and clang links it in in order to avoid generating IR that LLVM can't handle, but that's a workaround that does not help LLVM itself.

--Artem

 

~ Johannes


--
───────────────────
∽ Johannes (he/his)



--
--Artem Belevich

William Moses via llvm-dev

Mar 10, 2021, 3:50:03 PM
to Artem Belevich, llvm...@lists.llvm.org
Since clang (and arguably any other frontend that uses it) should link in libdevice, could we lower these intrinsics to the libdevice code?

For example, consider compiling the simple device function below:

```
// /mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S -emit-llvm  --cuda-path=/usr/local/cuda-11.0 -L/usr/local/cuda-11.0/lib64 --cuda-gpu-arch=sm_37
__device__ double f(double x) {
    return cos(x);
}
```

The LLVM module for it is as follows:

```
...
define dso_local double @_Z1fd(double %x) #0 {
entry:
  %__a.addr.i = alloca double, align 8
  %x.addr = alloca double, align 8
  store double %x, double* %x.addr, align 8
  %0 = load double, double* %x.addr, align 8
  store double %0, double* %__a.addr.i, align 8
  %1 = load double, double* %__a.addr.i, align 8
  %call.i = call contract double @__nv_cos(double %1) #7
  ret double %call.i
}

define internal double @__nv_cos(double %a) #1 {
  %q.i = alloca i32, align 4
```

Obviously we would need to do something to ensure these functions don't get deleted prior to their use in lowering from intrinsic to libdevice.
...

Johannes Doerfert via llvm-dev

Mar 10, 2021, 3:57:24 PM
to William Moses, Artem Belevich, llvm...@lists.llvm.org
Right. We could keep the definition of __nv_cos and friends
around. Right now, -ffast-math might just crash on the user,
which is arguably a bad thing. I can also see us benefiting
in various other ways from llvm.cos uses instead of __nv_cos
(assuming precision is according to the user requirements but
that is always a condition).

It could be as simple as introducing __nv_cos into
"llvm.used" and a backend matching/rewrite pass.

If the backend knew the libdevice location it could even pick
the definitions from there. Maybe we could link libdevice late
instead of eager?

Trying to figure out a good way to have the cake and eat it too.

~ Johannes

Artem Belevich via llvm-dev

Mar 10, 2021, 4:25:56 PM
to Johannes Doerfert, llvm...@lists.llvm.org
On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <johannes...@gmail.com> wrote:
Right. We could keep the definition of __nv_cos and friends
around. Right now, -ffast-math might just crash on the user,
which is arguably a bad thing. I can also see us benefiting
in various other ways from llvm.cos uses instead of __nv_cos
(assuming precision is according to the user requirements but
that is always a condition).

It could be as simple as introducing __nv_cos into
"llvm.used" and a backend matching/rewrite pass.

If the backend knew the libdevice location it could even pick
the definitions from there. Maybe we could link libdevice late
instead of eager?

It's possible, but it would require plumbing CUDA SDK awareness into LLVM. While the clang driver can deal with that, LLVM currently can't. The bitcode library path would have to be provided by the user.

The standard library as bitcode raises some questions.
* When do we want to do the linking? If we do it at the beginning, then the question is how to make sure unused functions are not eliminated before we may need them, as we don't know a priori what's going to be needed. We also do want the unused functions to be gone after we're done. Linking it in early would allow optimizing the code better at the expense of having to optimize a lot of code we'll throw away. Linking it in late has less overhead, but leaves the linked-in bitcode unoptimized, though it's probably in the ballpark of what would happen with a real library call. I.e. no inlining, etc.

* It incorporates linking into LLVM, which is not LLVM's job. Arguably, the line should be drawn at the lowering to libcalls as it's done for other back-ends. However, we're also constrained by the need to have the linking done before we generate PTX, which prevents doing it after LLVM is done generating an object file.

One thing that may work within the existing compilation model is to pre-compile the standard library into PTX and then textually embed relevant functions into the generated PTX, thus pushing the 'linking' phase past the end of LLVM's compilation and making it look closer to the standard compile/link process. This way we'd only enable libcall lowering in NVPTX, assuming that the library functions will be magically available out there. Injection of PTX could be done with an external script outside of LLVM and it could be incorporated into the clang driver. Bonus points for the fact that this scheme is compatible with -fgpu-rdc out of the box -- assemble the PTX with `ptxas -rdc` and then actually link with the library, instead of injecting its PTX before invoking ptxas.

--Artem

Trying to figure out a good way to have the cake and eat it too.

~ Johannes


On 3/10/21 2:49 PM, William Moses wrote:
> Since clang (and arguably any other frontend that uses it) should link in
> libdevice, could we lower these intrinsics to the libdevice code?

The linking happens *before* LLVM gets to work on IR.
As I said, it's a workaround, not the solution. It's possible for LLVM to still attempt lowering something in the IR into a libcall and we would not be able to deal with that. It happens to work well enough in practice.

Do you have an example where you see the problem with -ffast-math?


--
--Artem Belevich

Johannes Doerfert via llvm-dev

Mar 10, 2021, 4:55:57 PM
to Artem Belevich, llvm...@lists.llvm.org

On 3/10/21 3:25 PM, Artem Belevich wrote:
> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <
> johannes...@gmail.com> wrote:
>
>> Right. We could keep the definition of __nv_cos and friends
>> around. Right now, -ffast-math might just crash on the user,
>> which is arguably a bad thing. I can also see us benefiting
>> in various other ways from llvm.cos uses instead of __nv_cos
>> (assuming precision is according to the user requirements but
>> that is always a condition).
>>
>> It could be as simple as introducing __nv_cos into
>> "llvm.used" and a backend matching/rewrite pass.
>>
>> If the backend knew the libdevice location it could even pick
>> the definitions from there. Maybe we could link libdevice late
>> instead of eager?
>>
> It's possible, but it would require plumbing in CUDA SDK awareness into
> LLVM. While clang driver can deal with that, LLVM currently can't. The
> bitcode library path would have to be provided by the user.

The PTX backend could arguably be CUDA SDK aware; IMHO, it would
even be fine if the middle-end does the remapping to get inlining
and folding benefits also after __nv_cos is used. See below.


> The standard library as bitcode raises some questions.

Which standard library? CUDAs libdevice is a bitcode library, right?


> * When do we want to do the linking? If we do it at the beginning, then the
> question is how to make sure unused functions are not eliminated before we
> may need them, as we don't know apriori what's going to be needed. We also
> do want the unused functions to be gone after we're done. Linking it in
> early would allow optimizing the code better at the expense of having to
> optimize a lot of code we'll throw away. Linking it in late has less
> overhead, but leaves the linked in bitcode unoptimized, though it's
> probably in the ballpark of what would happen with a real library call.
> I.e. no inlining, etc.
>
> * It incorporates linking into LLVM, which is not LLVM's job. Arguably, the
> line should be drawn at the lowering to libcalls as it's done for other
> back-ends. However, we're also constrained to by the need to have the
> linking done before we generate PTX which prevents doing it after LLVM is
> done generating an object file.

I'm confused. Clang links in libdevice.bc early. If we make sure
`__nv_cos` is not deleted early, we can at any point "lower" `llvm.cos`
to `__nv_cos`, which is available. After the lowering we can remove
the artificial uses of `__nv_XXX` functions that we used to keep the
definitions around, so they are gone from the final result.
We get the benefit of having `llvm.cos` for some of the pipeline,
and we know it does not have all the bad effects of `__nv_cos`, which
is defined with inline assembly. We also get the benefit of inlining
`__nv_cos` and folding the implementation based on the arguments. Finally,
this should work with the existing pipeline: the linking is the same
as before, all we do is keep the definitions alive longer and
lower `llvm.cos` to `__nv_cos` in a middle-end pass.
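Sketched in IR, that lowering is a plain call-site rewrite (assuming the libdevice definition was kept alive):

```
; before the middle-end lowering pass
%c = call fast double @llvm.cos.f64(double %x)

; after it, with libdevice's definition still present in the module
%c = call fast double @__nv_cos(double %x)
```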

This might be similar to the PTX solution you describe below but I feel
we get the inline benefit from this without actually changing the pipeline
at all.

~ Johannes

Artem Belevich via llvm-dev

Mar 10, 2021, 5:38:34 PM
to Johannes Doerfert, llvm...@lists.llvm.org
On Wed, Mar 10, 2021 at 1:55 PM Johannes Doerfert <johannes...@gmail.com> wrote:

On 3/10/21 3:25 PM, Artem Belevich wrote:
> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <
> johannes...@gmail.com> wrote:
>
>> Right. We could keep the definition of __nv_cos and friends
>> around. Right now, -ffast-math might just crash on the user,
>> which is arguably a bad thing. I can also see us benefiting
>> in various other ways from llvm.cos uses instead of __nv_cos
>> (assuming precision is according to the user requirements but
>> that is always a condition).
>>
>> It could be as simple as introducing __nv_cos into
>> "llvm.used" and a backend matching/rewrite pass.
>>
>> If the backend knew the libdevice location it could even pick
>> the definitions from there. Maybe we could link libdevice late
>> instead of eager?
>>
> It's possible, but it would require plumbing in CUDA SDK awareness into
> LLVM. While clang driver can deal with that, LLVM currently can't. The
> bitcode library path would have to be provided by the user.

The PTX backend could arguably be CUDA SDK aware, IMHO, it would
even be fine if the middle-end does the remapping to get inlining
and folding benefits also after __nv_cos is used. See below.


> The standard library as bitcode raises some questions.

Which standard library? CUDAs libdevice is a bitcode library, right?

It's whatever LLVM will need to lower libcalls to. libdevice bitcode is the closest approximation of that we have at the moment.
 

> * When do we want to do the linking? If we do it at the beginning, then the
> question is how to make sure unused functions are not eliminated before we
> may need them, as we don't know apriori what's going to be needed. We also
> do want the unused functions to be gone after we're done. Linking it in
> early would allow optimizing the code better at the expense of having to
> optimize a lot of code we'll throw away. Linking it in late has less
> overhead, but leaves the linked in bitcode unoptimized, though it's
> probably in the ballpark of what would happen with a real library call.
> I.e. no inlining, etc.
>
> * It incorporates linking into LLVM, which is not LLVM's job. Arguably, the
> line should be drawn at the lowering to libcalls as it's done for other
> back-ends. However, we're also constrained to by the need to have the
> linking done before we generate PTX which prevents doing it after LLVM is
> done generating an object file.

I'm confused. Clang links in libdevice.bc early.
 
Yes. Because that's where it has to happen if we want to keep LLVM unaware of CUDA SDK.
It does not have to be the case if/when LLVM can do the linking itself.
 
If we make sure
`__nv_cos` is not deleted early, we can at any point "lower" `llvm.cos`
to `__nv_cos` which is available. After the lowering we can remove
the artificial uses of `__nv_XXX` functions that we used to keep the
definitions around in order to remove them from the final result.

This is the 'link early' approach, I should've been explicit that it's 'link early *everything*' as opposed to linking only what's needed at the beginning.
It would work at the expense of having to process/optimize 500KB worth of bitcode for every compilation, whether it needs it or not.
 
We get the benefit of having `llvm.cos` for some of the pipeline,
we know it does not have all the bad effects while `__nv_cos` is defined
with inline assembly. We also get the benefit of inlining `__nv_cos`
and folding the implementation based on the arguments. Finally,
this should work with the existing pipeline, the linking is the same
as before, all we do is to keep the definitions alive longer and
lower `llvm.cos` to `__nv_cos` in a middle end pass.

Again, I agree that it is doable.

 

This might be similar to the PTX solution you describe below but I feel
we get the inline benefit from this without actually changing the pipeline
at all.

So, to summarize:
* link the library as bitcode early, add artificial placeholders for everything, compile, remove placeholders and DCE unused stuff away.
  Pros: 
     - we're already doing most of it before clang hands off IR to LLVM, so it just pushes it a bit lower in the compilation.
  Cons: 
     - runtime cost of optimizing libdevice bitcode, 
     - libdevice may be required for all NVPTX compilations? 

* link the library as bitcode late.
   Pros: 
     - lower runtime cost than link-early approach.
   Cons:
     - We'll need to make sure that NVVMReflect pass processes the library.
     - fewer optimizations on the library functions. Some of the code gets DCE'ed away after NVVMReflect and the rest could be optimized better.
     - libdevice may be required for all NVPTX compilations? 
* 'link' with the library as PTX appended as text to LLVM's output and let ptxas do the 'linking'
  Pros:  LLVM remains agnostic of CUDA SDK installation details. All it does is allow lowering libcalls, leaving their resolution to the external tools.
  Cons: Need to have the PTX library somewhere and need to integrate the 'linking' into the compilation process somehow.

None of these is particularly good. If the runtime overhead of link-early is acceptable, then it may be a winner here, by a very small margin.
link-as-PTX may be better conceptually as it keeps linking and compilation separate.

As for the practical steps, here's what we need:
- allow libcall lowering in NVPTX, possibly guarded by a flag. This is needed for all of the approaches above.
- teach LLVM how to link in bitcode (and, possibly, control early/late mode)
- teach clang driver to delegate libdevice linking to LLVM.

This will allow us to experiment with all three approaches and see what works best.

--Artem



--
--Artem Belevich

Johannes Doerfert via llvm-dev

Mar 10, 2021, 6:45:01 PM
to Artem Belevich, llvm...@lists.llvm.org

I think if we embed knowledge about the nv_XXX functions we can
even get away without the cons you listed for early linking above.

For early link I'm assuming an order similar to [0] but I also discuss
the case where we don't link libdevice early for a TU.

Link early:
1) clang emits module.bc and links in libdevice.bc but with the
   `optnone`, `noinline`, and "used" attribute for functions in
   libdevice. ("used" is not an attribute but could as well be.)
   At this stage module.bc might call __nv_XXX or llvm.XXX freely
   as defined by -ffast-math and friends.
2) Run some optimizations in the middle end, maybe till the end of
   the inliner loop, unsure.
3) Run a libcall lowering pass and another NVVMReflect pass (or the
   only instance thereof). We effectively remove all llvm.XXX calls
   in favor of __nv_XXX now. Note that we haven't spent (much) time
   on the libdevice code as it is optnone and most passes are good
   at skipping those. To me, it's unclear whether the used parts should
   be optimized before we inline them anyway, to avoid redoing
   the optimizations over and over (per call site). That needs
   measuring I guess. Also note that we can still retain the current
   behavior for direct calls to __nv_XXX if we mark the call sites
   as `alwaysinline`, or at least the behavior is almost like the
   current one is.
4) Run an always inliner pass on the __nv_XXX calls because it is
   something we would do right now. Alternatively, remove `optnone`
   and `noinline` from the __nv_XXX calls.
5) Continue with the pipeline as before.


As mentioned above, `optnone` avoids spending time on the libdevice
until we "activate" it. At that point (globals) DCE can be scheduled
to remove all unused parts right away. I don't think this is (much)
more expensive than linking libdevice early right now.
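In IR, the carried-along definitions from step 1 of the early-link scheme might look like this (a sketch; the verifier requires noinline alongside optnone, so both are set):

```
; linked in from libdevice.bc, kept alive but not optimized yet
define double @__nv_cos(double %a) #0 {
  ; body from libdevice
  ...
}

attributes #0 = { noinline optnone }
```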

Link late, aka. translation units without libdevice:
1) clang emits module.bc but does not link in libdevice.bc, it will be
   made available later. We still can mix __nv_XXX and llvm.XXX calls
   freely as above.
2) Same as above.
3) Same as above.
4) Same as above but effectively a no-op, no __nv_XXX definitions are
   available.
5) Same as above.


I might misunderstand something about the current pipeline but from [0]
and the experiments I ran locally it looks like the above should cover all
the cases. WDYT?

~ Johannes


P.S. If the rewrite capability (aka libcall lowering) is generic we could
     use the scheme for many other things as well.


[0] https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice

Artem Belevich via llvm-dev

Mar 10, 2021, 7:22:52 PM
to Johannes Doerfert, llvm...@lists.llvm.org
WDYM by `embed knowledge about the nv_XXX functions`? By linking those functions in? Or do you mean that we should just declare them before/instead of linking libdevice in?

 

For early link I'm assuming an order similar to [0] but I also discuss
the case where we don't link libdevice early for a TU.

That link just describes the steps needed to use libdevice. It does not deal with how/where it fits in the LLVM pipeline.
The gist is that NVVMReflect replaces some conditionals with constants. libdevice uses that as a poor man's IR preprocessor, conditionally enabling different implementations and relying on DCE and constant folding to remove unused parts and eliminate the now-useless branches.
While running NVVMReflect alone will make libdevice code valid and usable, it would still benefit from further optimizations. I do not know to what degree, though.
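For reference, the libdevice pattern looks roughly like this (a sketch; __CUDA_FTZ is one of the real reflect parameters):

```
@.str = private unnamed_addr constant [11 x i8] c"__CUDA_FTZ\00"

declare i32 @__nvvm_reflect(i8*)

; NVVMReflect folds the call to a constant; constant folding and DCE
; then remove the dead branch and the unused path
%ftz = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i32 0, i32 0))
%use.ftz = icmp ne i32 %ftz, 0
br i1 %use.ftz, label %ftz.path, label %ieee.path
```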
 

Link early:
1) clang emits module.bc and links in libdevice.bc but with the
    `optnone`, `noinline`, and "used" attribute for functions in
    libdevice. ("used" is not an attribute but could as well be.)
    At this stage module.bc might call __nv_XXX or llvm.XXX freely
    as defined by -ffast-math and friends.

That could work. Just carrying extra IR around would probably be OK.
We may want to do NVVMReflect as soon as we have it linked in and, maybe, allow optimizing the functions that are explicitly used already.
 
2) Run some optimizations in the middle end, maybe till the end of
    the inliner loop, unsure.
3) Run a libcall lowering pass and another NVVMReflect pass (or the
    only instance thereof). We effectively remove all llvm.XXX calls
    in favor of __nv_XXX now. Note that we haven't spend (much) time
    on the libdevice code as it is optnone and most passes are good
    at skipping those. To me, it's unclear if the used parts should
    not be optimized before we inline them anyway to avoid redoing
    the optimizations over and over (per call site). That needs
    measuring I guess. Also note that we can still retain the current
    behavior for direct calls to __nv_XXX if we mark the call sites
    as `alwaysinline`, or at least the behavior is almost like the
    current one is.
4) Run an always inliner pass on the __nv_XXX calls because it is
    something we would do right now. Alternatively, remove `optnone`
    and `noinline` from the __nv_XXX calls.
5) Continue with the pipeline as before.


SGTM.
 

As mentioned above, `optnone` avoids spending time on the libdevice
until we "activate" it. At that point (globals) DCE can be scheduled
to remove all unused parts right away. I don't think this is (much)
more expensive than linking libdevice early right now.

Link late, aka. translation units without libdevice:
1) clang emits module.bc but does not link in libdevice.bc, it will be
    made available later. We still can mix __nv_XXX and llvm.XXX calls
    freely as above.
2) Same as above.
3) Same as above.
4) Same as above but effectively a no-op, no __nv_XXX definitions are
    available.
5) Same as above.


I might misunderstand something about the current pipeline but from [0]
and the experiments I run locally it looks like the above should cover all
the cases. WDYT?


The `optnone` trick may indeed remove much of the practical differences between the early/late approaches.
In principle it should work.

Next question is -- is libdevice sufficient to satisfy LLVM's assumptions about the standard library?
While it does provide most of the equivalents of libm functions, the set is not complete and some of the functions differ from their libm counterparts.
The differences are minor, so we should be able to deal with it by generating a few wrapper functions for the odd cases.
Here's what clang does to provide math functions using libdevice: thin wrappers in clang/lib/Headers/__clang_cuda_math.h that forward to the __nv_* functions.

The most concerning aspect of libdevice is that we don't know for how long we'll be able to use the libdevice bitcode. My understanding is that IR does not guarantee binary stability and at some point we may just be unable to use it. Ideally we need our own libm for GPUs.

--Artem
 


--
--Artem Belevich

Johannes Doerfert via llvm-dev

Mar 10, 2021, 7:56:47 PM
to Artem Belevich, llvm...@lists.llvm.org
I mean by providing the "libcall lowering" pass, i.e., the knowledge
that llvm.cos maps to __nv_cos.

>
>
>> For early link I'm assuming an order similar to [0] but I also discuss
>> the case where we don't link libdevice early for a TU.
>>
> That link just describes the steps needed to use libdevice. It does not
> deal with how/where it fits in the LLVM pipeline.
> The gist is that NVVMreflect replaces some conditionals with constants.
> libdevice uses that as a poor man's IR preprocessor, conditionally enabling
> different implementations and relying on DCE and constant folding to remove
> unused parts and eliminate the now useless branches.
> While running NVVM alone will make libdevice code valid and usable, it
> would still benefit from further optimizations. I do not know to what
> degree, though.
>
>
>> Link early:
>> 1) clang emits module.bc and links in libdevice.bc but with the
>> `optnone`, `noinline`, and "used" attribute for functions in
>> libdevice. ("used" is not an attribute but could as well be.)
>> At this stage module.bc might call __nv_XXX or llvm.XXX freely
>> as defined by -ffast-math and friends.
>>
> That could work. Just carrying extra IR around would probably be OK.
> We may want to do NVVMReflect as soon as we have it linked in and, maybe,
> allow optimizing the functions that are explicitly used already.

Right. NVVMReflect can be run twice and with `alwaysinline`
on the call sites of __nv_XXX functions we will actually
inline and optimize them while the definitions are just "dragged
along" in case we need them later.

Right now, clang will generate any llvm intrinsic and we crash, so anything
else is probably a step in the right direction. Eventually, we should
"lower" all intrinsics that the NVPTX backend can't handle, or at least emit
a nice error message. Preferably, clang would know what we can't deal with
and not generate intrinsic calls for those in the first place.


>
> The most concerning aspect of libdevice is that we don't know when we'll no
> longer be able to use the libdevice bitcode? My understanding is that IR
> does not guarantee binary stability and at some point we may just be unable
> to use it. Ideally we need our own libm for GPUs.

For OpenMP I did my best to avoid writing libm (code) for GPUs by
piggybacking on CUDA and libc++ implementations; I hope it will stay that way.
That said, if the need arises we might really have to port libc++ to the
GPUs.

Back to the problem with libdevice. I agree that NVIDIA's solution of
shipping a .bc library is suboptimal, but with the existing, or an extended,
auto-upgrader we might be able to make that work reasonably well for the
foreseeable future. That problem is orthogonal to what we are discussing
above, I think.

~ Johannes

William Moses via llvm-dev

Mar 10, 2021, 9:44:46 PM
to Johannes Doerfert, llvm...@lists.llvm.org
We could also consider doing something slightly broader.

For example, we could attach to the llvm.cos call/declaration metadata or an attribute that points to the actual __nv_cos function. Then, in a subsequent lowering pass, the corresponding intrinsic with the relevant attribute has its uses replaced by the actual function.

Johannes Doerfert via llvm-dev

Mar 11, 2021, 1:54:19 PM
to William Moses, llvm...@lists.llvm.org
I certainly agree we should try to avoid a hard-coded mapping
in C++.

I could see something like:

```
__attribute__((implementation("llvm.cos")))
double __nv_cos(...) { ... }

```

and a pass that transforms all calls to a function with an
"implementation" to calls to that implementation. Maybe
later we attach a score/priority ;)

Artem Belevich via llvm-dev

Mar 11, 2021, 2:37:41 PM
to Johannes Doerfert, llvm...@lists.llvm.org
On Thu, Mar 11, 2021 at 10:54 AM Johannes Doerfert <johannes...@gmail.com> wrote:
I certainly agree we should try to avoid a hard-coded mapping
in C++.

I could see something like:

```
__attribute__((implementation("llvm.cos")))
double __nv_cos(...) { ... }

```

and a pass that transforms all calls to a function with an
"implementation" to calls to that implementation. Maybe
later we attach a score/priority ;)


I'm not sure how that would work.
Where would you place that `__attribute__((implementation))` ? We do not have the definitions for `__nv_*` as they come from NVIDIA-provided bitcode. We could add the attribute to the declaration in `__clang_cuda_libdevice_declares.h`. 
How does LLVM handle the differences in function attributes between function declaration and definition? Will there be trouble when we link in the actual __nv_cos from the libdevice that would not have that attribute?

Another potential gotcha is that for the functions that can't be directly mapped 1:1 to `__nv_*` counterparts, we'd still need to provide the implementation ourselves. We will not know whether the implementation will be used until after the substitution pass, so we'll need to make sure it's not DCE'd until then. It appears to be the same issue (though on a smaller scale) as with linking in libdevice directly.

Let's take a step back and figure out what are the issues we want to solve.

The top-level goal is to provide implementation for LLVM intrinsics. For now let's stick with libm-related ones.
What we have is the libdevice bitcode which uses different function names and provides a subset of the functionality we need.
What we miss is 
  - something to connect LLVM's libcalls to the GPU-side implementation,
  - additional code to provide implementations for the functions that are missing or different in libdevice.

Considering that we want this to work in LLVM, the additional code would have to be bitcode, and it would have to exist in addition to libdevice.
Our options for the mapping between LLVM intrinsics and the implementation are
* intrinsic -> __nv_* equivalent mapping pass
   This would still need additional bitcode for the missing/different functions.
* lower libcalls to the standard libm APIs, implement libm -> __nv_* mapping in our own bitcode.

Considering that additional bitcode is needed in both cases, I believe that the second approach makes more sense. 
LLVM does not need to know or care about what's provided by libdevice, and we'd have more flexibility, compared to what we could do in the mapping pass. It also makes it easy to substitute a different implementation, if we have or need one.

WDYT?

--Artem


--
--Artem Belevich

Johannes Doerfert via llvm-dev

Mar 11, 2021, 7:11:03 PM
to Artem Belevich, llvm...@lists.llvm.org

I really hope to avoid any additional bitcode; there are too many
drawbacks and basically no benefits, IMHO.


> LLVM does not need to know or care about what's provided by libdevice, and
> we'd have more flexibility, compared to what we could do in the mapping
> pass. It also makes it easy to substitute a different implementation, if we
> have or need one.

I agree that LLVM (core) should not know about __nv_*, that's why I
suggested the `__attribute__((implements("...")))` approach. My preferred
solution is still to annotate our declarations of __nv_* and point to the
llvm.intrinsics (name) from there. If we have a missing mapping, we point
to an intrinsic from a definition that lives in the Clang headers next to
the __nv_* declarations.

This does not yet work because -mlink-builtin-bitcode (which I assume
triggers the llvm-link logic) will drop the attributes of a declaration
if a definition is found. I think that should not be the case; instead,
the union of the attributes should be kept.

The benefit I see for the above is that the mapping is tied to the
declarations and doesn't live in a tablegen file far away. It works well
even if we can't map 1:1, and we could even restrict the "used" attribute
to anything that has an "implements" attribute. So:

```
__nv_A() { ... } // called, inlined and optimized as before, DCE'ed after.

__nv_B() { ... } // not called, DCE'ed.

__attribute__((implements("llvm.C")))
__nv_C() { ... } // calls are inlined and optimized as before, not DCE'ed
                 // though because of the attribute. Replaces llvm.C as
                 // callee in the special pass.
```

So "implements" gives you a way to statically replace a function declaration
or definition with another one. I could see it being used to provide other
intrinsics to platforms with backends that don't support them.

Does that make some sense?

~ Johannes

Artem Belevich via llvm-dev

Mar 11, 2021, 9:00:17 PM
to Johannes Doerfert, llvm...@lists.llvm.org
Could you elaborate on the drawbacks? 

The fact is that we already depend on the external bitcode (libdevice in this case), though right now we're trying to keep that to clang only. The current approach is not sound in principle and is rather brittle in practice. Nor is clang the only source of IR for LLVM to compile, so it leaves LLVM-only users without a good solution. There are already a handful of JIT compilers that each do their own gluing of libdevice into the IR they want to compile for NVPTX. I think we do have a very good reason to deal with that in LLVM itself.

While I agree that additional bitcode is a hassle, I think it would be a net positive change for LLVM usability for NVPTX users.
The external bitcode would not be required for those who do not need libdevice now, so the change should not be disruptive.

> LLVM does not need to know or care about what's provided by libdevice, and
> we'd have more flexibility, compared to what we could do in the mapping
> pass. It also makes it easy to substitute a different implementation, if we
> have or need one.

I agree that LLVM (core) should not know about __nv_*, that's why I
suggested
the `__attribute__((implements("...")))` approach. My preferred solution
is still to annotate our declarations of __nv_* and point to the
llvm.intrinsics (name) from there. If we have a missing mapping, we
point to an
intrinsic from a definition that lives in the Clang headers next to the
__nv_*  declarations.

We may have slightly different end goals in mind.
I was thinking of making the solution work for LLVM. I.e. users would be free to use llvm.sin with NVPTX back-end with a few documented steps needed to make it work (basically "pass additional -link-libm-bitcode=path/to/bitcode_libm.bc").

Your scenario above suggests that the goal is to allow clang to generate both llvm intrinsics and the glue which would then be used by LLVM to make it work for clang, but not in general. It's an improvement compared to what we have now, but I still think we should try a more general solution.
 

This does not yet work because -mlink-builtin-bitcode (which I assume
triggers the llvm-link logic) will drop the attributes of a declaration
if a definition is found. I think that should not be the case anyway
such that the union of attributes is set.

The benefit I see for the above is that the mapping is tied to the
declarations and doesn't live in a tablegen file far away. It works well
even if we can't map 1:1, and we could even restrict the "used" attribute
to anything that has an "implements" attribute.

I do not think we need tablegen for anything here. I was thinking of just compiling a real math library (or a wrapper on top of libdevice) from C/C++ sources.

Our approaches are not mutually exclusive. If there's a strong opposition to providing a bitcode libm for NVPTX, implementing it somewhere closer to clang would still be an improvement, even if it's not as general as I'd like. It should still be possible to allow LLVM to lower libcalls in NVPTX to standard libm API, enabled with a flag, and just let the end users who are interested (e.g. JITs) to provide their own implementation.

--Artem

 


--
--Artem Belevich

Johannes Doerfert via llvm-dev

Mar 11, 2021, 11:26:38 PM
to Artem Belevich, llvm...@lists.llvm.org

Bitcode comes with all the problems libdevice itself has wrt.
compatibility. It is also hard to update and maintain. You basically
maintain IR or you maintain C(++) as I suggest. Also, bitcode is
platform specific. I can imagine building a bitcode file during the
build but shipping one means you have to know ABI and datalayout or
hope they are the same everywhere.

>>> LLVM does not need to know or care about what's provided by libdevice,
>> and
>>> we'd have more flexibility, compared to what we could do in the mapping
>>> pass. It also makes it easy to substitute a different implementation, if
>> we
>>> have or need one.
>> I agree that LLVM (core) should not know about __nv_*, that's why I
>> suggested
>> the `__attribute__((implements("...")))` approach. My preferred solution
>> is still to annotate our declarations of __nv_* and point to the
>> llvm.intrinsics (name) from there. If we have a missing mapping, we
>> point to an
>> intrinsic from a definition that lives in the Clang headers next to the
>> __nv_* declarations.
>>
> We may have slightly different end goals in mind.
> I was thinking of making the solution work for LLVM. I.e. users would be
> free to use llvm.sin with NVPTX back-end with a few documented steps needed
> to make it work (basically "pass additional
> -link-libm-bitcode=path/to/bitcode_libm.bc").
>
> Your scenario above suggests that the goal is to allow clang to generate
> both llvm intrinsics and the glue which would then be used by LLVM to make
> it work for clang, but not in general. It's an improvement compared to what
> we have now, but I still think we should try a more general solution.
>

My scenario doesn't disallow a bitcode approach for non-clang
frontends, nor does it stop them from simply building the glue code
with clang and packaging it themselves. It does however allow us to
maintain C(++) code rather than IR, which is by itself a big win.


>> This does not yet work because -mlink-builtin-bitcode (which I assume
>> triggers the llvm-link logic) will drop the attributes of a declaration
>> if a definition is found. I think that should not be the case anyway
>> such that the union of attributes is set.
>>
>> The benefit I see for the above is that the mapping is tied to the
>> declarations and doesn't live in a tablegen file far away. It works well
>> even if we can't map 1:1, and we could even restrict the "used" attribute
>> to anything that has an "implements" attribute.
>
> I do not think we need tablegen for anything here. I was thinking of just
> compiling a real math library (or a wrapper on top of libdevice) from C/C++
> sources.

I did not understand your suggestion before. Agreed, no tablegen.


>
> Our approaches are not mutually exclusive. If there's a strong opposition
> to providing a bitcode libm for NVPTX, implementing it somewhere closer to
> clang would still be an improvement, even if it's not as general as I'd
> like. It should still be possible to allow LLVM to lower libcalls in NVPTX
> to standard libm API, enabled with a flag, and just let the end users who
> are interested (e.g. JITs) to provide their own implementation.

Right. And their own implementation could be trivially created for
them as bc file:

`clang -emit-llvm-bc $clang_src/.../__clang_cuda_cmath.h -femit-all-decls`

Or am I missing something here?

~ Johannes

Johannes Doerfert via llvm-dev

Mar 12, 2021, 11:48:55 AM
to Artem Belevich, llvm...@lists.llvm.org
I prototyped the LLVM-Core parts last night:

https://reviews.llvm.org/D98516

If this is something we support I'll write an RFC, also
for the missing clang parts.

~ Johannes

[EOM]

Artem Belevich via llvm-dev

Mar 12, 2021, 1:51:08 PM
to Johannes Doerfert, llvm...@lists.llvm.org
We already have this problem, so it does not make things (much) worse than they are.
Considering that we'll be able to keep the library in sync with LLVM, the compatibility is less of a problem as the library that would come with LLVM would be built with/for exactly that LLVM version.
 
It is also hard to update and maintain. You basically
maintain IR or you maintain C(++) as I suggest.

We seem to agree that the implementation of such a library would be in C/C++.
 
Also, bitcode is platform specific. I can imagine building a bitcode file during the
build but shipping one means you have to know ABI and datalayout or
hope they are the same everywhere.

Agreed. We will likely need multiple variants. We will compile specifically for NVPTX or AMDGPU and we will know specific ABI and the data layout for them regardless of the host we're building on.

It appears to me that the difference vs. what we have now is that we'll need to have the libm sources somewhere, a process to build them for particular GPUs (that may need to be done out of tree as it may need the CUDA/HIP SDKs), and a way to incorporate such libraries into the LLVM distribution.

OK. I'll agree that that may be a bit too much for now.

Agreed.
 

>> This does not yet work because -mlink-builtin-bitcode (which I assume
>> triggers the llvm-link logic) will drop the attributes of a declaration
>> if a definition is found. I think that should not be the case anyway
>> such that the union of attributes is set.
>>
>> The benefit I see for the above is that the mapping is tied to the
>> declarations and doesn't live in a tablegen file far away. It works well
>> even if we can't map 1:1, and we could even restrict the "used" attribute
>> to anything that has an "implements" attribute.
>
> I do not think we need tablegen for anything here. I was thinking of just
> compiling a real math library (or a wrapper on top of libdevice) from C/C++
> sources.

I did not understand your suggestion before. Agreed, no tablegen.


>
> Our approaches are not mutually exclusive. If there's a strong opposition
> to providing a bitcode libm for NVPTX, implementing it somewhere closer to
> clang would still be an improvement, even if it's not as general as I'd
> like. It should still be possible to allow LLVM to lower libcalls in NVPTX
> to standard libm API, enabled with a flag, and just let the end users who
> are interested (e.g. JITs) to provide their own implementation.

Right. And their own implementation could be trivially created for
them as bc file:

`clang -emit-llvm-bc $clang_src/.../__clang_cuda_cmath.h -femit-all-decls`

Or am I missing something here?

I think we're on the same page. Let's see where the attribute(implementation) gets us.

--Artem



--
--Artem Belevich

James Y Knight via llvm-dev

Mar 12, 2021, 5:39:22 PM
to Artem Belevich, llvm...@lists.llvm.org
On Fri, Mar 12, 2021 at 1:51 PM Artem Belevich via llvm-dev <llvm...@lists.llvm.org> wrote:
Also, bitcode is platform specific. I can imagine building a bitcode file during the
build but shipping one means you have to know ABI and datalayout or
hope they are the same everywhere.

Agreed. We will likely need multiple variants. We will compile specifically for NVPTX or AMDGPU and we will know specific ABI and the data layout for them regardless of the host we're building on.

It appears to me is the the difference vs what we have now is that we'll need to have the libm sources somewhere, the process to build them for particular GPUs (that may need to be done out of the tree as it may need CUDA/HIP SDKs) and having to incorporate such libraries into llvm distribution.

OK. I'll agree that that may be a bit too much for now.

It sounded before like you were saying the library should effectively be function aliases for standard libm names, to call __nv_ names. Isn't it utterly trivial to generate such a bitcode file as part of the toolchain build, without requiring any external SDKs?
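For example, something as small as this per function (a sketch; IR aliases can't point at declarations, so a thin forwarding wrapper stands in for an alias):

```
; bitcode 'libm' mapping the standard name onto libdevice's symbol
declare double @__nv_sin(double)

define double @sin(double %x) {
  %r = tail call double @__nv_sin(double %x)
  ret double %r
}
```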

Artem Belevich via llvm-dev

Mar 12, 2021, 6:00:02 PM
to James Y Knight, llvm...@lists.llvm.org
That's true for most, but not all functions provided by libdevice. We'd still need something that's a bit more involved.

--Artem


 


--
--Artem Belevich

William Moses via llvm-dev

Apr 28, 2021, 6:56:55 PM
to Artem Belevich, llvm...@lists.llvm.org
Hi all,

Reviving this thread as Johannes and I recently had some time to take a look and do some additional design work. We'd love any thoughts on the following proposal.

Background:

Standard math (and potentially other) functions for both AMD and NVIDIA GPUs don't exist, and LLVM-based tools must instead call architecture-specific functions that perform similar computations.

For example in clang/lib/Headers/__clang_hip_math.h:

__DEVICE__
double sqrt(double __x) { return __ocml_sqrt_f64(__x); }

and clang/lib/Headers/__clang_cuda_math.h:

__DEVICE__ double sqrt(double __a) { return __nv_sqrt(__a); }

In the case of CUDA, the definitions of these functions are found by immediately linking against the corresponding CUDA libdevice.bc.

This design presents several problems:

1) It is illegal to use llvm math intrinsics in GPU code, as these functions do not have definitions.

While in theory we could define the lowering of these intrinsics to be a table which looks up the correct __nv_sqrt, this would require the definition of all such functions to remain or otherwise be available. As it's undesirable for the LLVM backend to be aware of CUDA paths, etc, this means that the original definitions brought in by merging libdevice.bc must be maintained. Currently these are deleted if they are unused (as libdevice has them marked as internal).

2) GPU math functions aren't able to be optimized, unlike standard math functions.

Since LLVM has no idea what these foreign functions are, they cannot be optimized. This is problematic in two ways. First, these functions do not have all the relevant attributes one might expect (inaccessiblememonly, willreturn, etc). Second, they cannot benefit from instcombine-style optimizations that recognize math intrinsics. For example, a call to sin(0) from source code will remain a call to __ocml_sin_f64(0) [if on AMD] rather than being replaced with 0.
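Concretely (a sketch): the intrinsic form folds away, while the vendor call is opaque and stays:

```
; constant folding knows llvm.sin and reduces this to 0.0
%a = call double @llvm.sin.f64(double 0.0)

; LLVM knows nothing about the vendor function, so this call remains
%b = call double @__nv_sin(double 0.0)
```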

These two design issues make it difficult for tools that wish to generate GPU code (frontends, target offloading, the Enzyme AD tool, etc.) and prevent such code from simply being optimized effectively.

Design Constraints:

To remedy the problems described above we need a design that meets the following:
* Does not require modifying libdevice.bc or other code shipped by a vendor-specific installation
* Allows llvm math intrinsics to be lowered to device-specific code
* Keeps definitions of code used to implement intrinsics around until after all potentially relevant intrinsics (including those created by LLVM passes) have been lowered.

Initial Design:

To remedy this we propose a refined version of the implements mechanism described above. Specifically, consider the example below:

define internal double @my_cos_fast(double %d) {
  ...
}

declare double @my_cos(double %d)

define double @foo(double %d, float %f) {
  %c1 = tail call fast double @llvm.cos.f64(double %d)
  %c2 = tail call fast double @cos(double %d)
  ret double %c2
}

declare double @cos(double) !metadata !1
declare double @llvm.cos.f64(double) !metadata !0

!0 = !{!"implemented_by", double(double)* @my_cos}
!1 = !{!"implemented_by", double(double)* @my_cos_fast}

Here, each function that we may want to provide an implementation for (in this case cos and llvm.cos.f64), has a metadata tag "implemented_by" followed by the function which it will be replaced with. The type signature of the original function and its implementation must match.

The implemented_by metadata will be defined to ensure both the replacement and the replacee will be kept around (for example to ensure that LLVM passes that generate a call to llvm.cos will still have a definition).

After all passes that could generate such intrinsics and instruction simplifications have run, a new LLVM optimization pass replaces uses of each such function with its implementation.

Proposed Patches:

1) Allow metadata on declaration [not just definition]

2) Tell GlobalOpt and other passes not to delete globals using/used in implemented_by

3) Write implementedby pass that scans all functions, replaces call, removes metadata

4) Add Clang attributes to expose implements and use in nvptx/amd headers


Cheers,
Billy

Jon Chesterfield via llvm-dev

Apr 29, 2021, 7:25:36 PM
to llvm-dev, llvm-dev...@lists.llvm.org
Date: Wed, 28 Apr 2021 18:56:32 -0400
From: William Moses via llvm-dev <llvm...@lists.llvm.org>
To: Artem Belevich <t...@google.com>
...


Hi all,

Reviving this thread as Johannes and I recently had some time to take a
look and do some additional design work. We'd love any thoughts on the
following proposal.

Keenly interested in this. Simplification (subjective) of the metadata proposal at the end. Some extra background info first, though, as GPU libm is a really interesting design space. When I did the bring-up for a different architecture ~3 years ago, iirc I found the complete set:
- clang lowering libm (named functions) to intrinsics
- clang lowering intrinsic to libm functions
- optimisation passes that transform libm and ignore intrinsics
- optimisation passes that transform intrinsics and ignore libm
- selectiondag represents some intrinsics as nodes
- strength reduction, e.g. cos(double) -> cosf(float) under fast-math

I then wrote some more IR passes related to opencl-style vectorisation and some combines to fill in the gaps (which have not reached upstream). So my knowledge here is out of date but clang/llvm wasn't a totally consistent lowering framework back then.

CUDA ships an IR library containing functions similar to libm. ROCm does something similar, also as IR. We do an impedance-matching scheme in inline headers, which blocks various optimisations and poses some challenges for Fortran.

 *Background:*
...

While in theory we could define the lowering of these intrinsics to be a
table which looks up the correct __nv_sqrt, this would require the
definition of all such functions to remain or otherwise be available. As
it's undesirable for the LLVM backend to be aware of CUDA paths, etc, this
means that the original definitions brought in by merging libdevice.bc must
be maintained. Currently these are deleted if they are unused (as libdevice
has them marked as internal).

The deleting is its own hazard in the context of fast-math, as the function can be deleted, and then later an optimisation creates a reference to it, which doesn't link. It also prevents the backend from (safely) assuming the functions are available, which is moderately annoying for lowering some SDag ISD nodes.

 2) GPU math functions aren't able to be optimized, unlike standard math
functions.

This one is bad.

*Design Constraints:*


To remedy the problems described above we need a design that meets the
following:
* Does not require modifying libdevice.bc or other code shipped by a
vendor-specific installation
* Allows llvm math intrinsics to be lowered to device-specific code
* Keeps definitions of code used to implement intrinsics until after all
potential relevant intrinsics (including those created by LLVM passes) have
been lowered.

Yep, constraints sound right. Back ends can emit calls to these functions too, but I think nvptx/amdgcn do not. Perhaps they would like to be able to in places.

 *Initial Design:*

... metadata / aliases ...

Design would work, lets us continue with the header files we have now. Avoids some tedious programming, i.e. if we approached this as the usual back end lowering, where intrinsics / isd nodes are emitted as named function calls. That can be mostly driven by a table lookup as the function arity is limited. It is (or at least was) quite tedious to program that in ISel. Doing basically the same thing for SDag + GIsel / ptx + gcn, with associated tests, is also unappealing.

The set of functions near libm is small and known. We would need to mark 'sin' as 'implemented by' slightly different functions for nvptx and amdgcn, and some of them need thin wrapper code (e.g. modf in amdgcn takes an argument by pointer). It would be helpful for the Fortran runtime libraries effort if the implementation didn't use inline code in headers.

There's very close to a 1:1 mapping between the two gpu libraries, even some extensions to libm exist in both. Therefore we could write a table,
{llvm.sin.f64, "sin", __nv_sin, __ocml_sin},
with NULL or similar for functions that aren't available.

A function level IR pass, called late in the pipeline, crawls the call instructions and rewrites based on simple rules and that table. That is, would rewrite a call to llvm.sin.f64 to a call to __ocml_sin. Exactly the same net effect as a header file containing metadata annotations, except we don't need the metadata machinery and we can use a single trivial IR pass for N architectures (by adding a column). Pass can do the odd ugly thing like impedance match function type easily enough.
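A minimal sketch of such a pass (the pass name, the table entries, and the target selector are illustrative, not a settled design):

```
// ExpandLibmIntrinsics: rewrite llvm.* math intrinsics to the
// architecture-specific library names via a lookup table.
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
using namespace llvm;

namespace {
struct LibmEntry {
  Intrinsic::ID IID;    // e.g. llvm.sin.f64
  const char *NVName;   // NVPTX column
  const char *OCMLName; // AMDGPU column
};

// One row per intrinsic; a real table would also key on the element
// type (this sketch pretends everything is f64).
const LibmEntry Table[] = {
    {Intrinsic::sin, "__nv_sin", "__ocml_sin_f64"},
    {Intrinsic::cos, "__nv_cos", "__ocml_cos_f64"},
};

struct ExpandLibmIntrinsics : PassInfoMixin<ExpandLibmIntrinsics> {
  bool IsNVPTX = true; // in a real pass, derived from the target triple

  PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
    bool Changed = false;
    for (Instruction &I : make_early_inc_range(instructions(F))) {
      auto *CI = dyn_cast<CallInst>(&I);
      Function *Callee = CI ? CI->getCalledFunction() : nullptr;
      if (!Callee)
        continue;
      for (const LibmEntry &E : Table) {
        if (Callee->getIntrinsicID() != E.IID)
          continue;
        // Same function type, so operands and fast-math flags carry over.
        FunctionCallee Impl = F.getParent()->getOrInsertFunction(
            IsNVPTX ? E.NVName : E.OCMLName, Callee->getFunctionType());
        CI->setCalledOperand(Impl.getCallee());
        Changed = true;
        break;
      }
    }
    return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
  }
};
} // namespace
```

Running it once late, just before instruction selection, matches the placement suggested here; as noted later in the thread, it may need to run again if later passes materialize new intrinsics.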

The other side of the problem - that functions once introduced have to hang around until we are sure they aren't needed - is the same as in your proposal. My preference would be to introduce the libdevice functions immediately after the lowering pass above, but we can inject it early and tag them to avoid erasure instead. Kind of need that to handle the cos->cosf transform anyway.

Quite similar to the 'in theory ... table' suggestion, which I like because I remember it being far simpler than the sdag rewrite rules.

Thanks!

Jon

Johannes Doerfert via llvm-dev

Sep 7, 2021, 12:15:14 PM
to llvm-dev, Jon Chesterfield
+bump

Jon did respond positively to the proposal. I think the table implementation
vs the "implemented_by" implementation is something we can experiment with.
I'm in favor of the latter as it is more general and can be used in other
places more easily, e.g., by providing source annotations. That said, having
the table version first would be a big step forward too.

I'd say, if we hear some other positive voices towards this we go ahead with
patches on phab. After an end-to-end series is approved we merge it
together.

That said, people should chime in if they (dis)like the approach to get math
optimizations (and similar things) working on the GPU.

~ Johannes

Artem Belevich via llvm-dev

Sep 7, 2021, 12:36:18 PM
to Johannes Doerfert, llvm-dev, Jon Chesterfield, Liu, Yaxun (Sam)
On Tue, Sep 7, 2021 at 9:15 AM Johannes Doerfert <johannes...@gmail.com> wrote:
+bump

Jon did respond positive to the proposal. I think the table implementation
vs the "implemented_by" implementation is something we can experiment with.
I'm in favor of the latter as it is more general and can be used in other
places more easily, e.g., by providing source annotations. That said, having
the table version first would be a big step forward too.

I'd say, if we hear some other positive voices towards this we go ahead with
patches on phab. After an end-to-end series is approved we merge it
together.

That said, people should chime in if they (dis)like the approach to get math
optimizations (and similar things) working on the GPU.

I do like this approach for CUDA and NVPTX. I think HIP/AMDGPU may benefit from it, too (+cc: yaxun.liu@).

This will likely also be useful for things other than math functions.
E.g. it may come handy for sanitizer runtimes (+cc: eugenis@)  that currently rely on LLVM *not* materializing libcalls they can't provide when they are building the runtime itself.

--Artem


--
--Artem Belevich

Artem Belevich via llvm-dev

Nov 17, 2021, 2:20:53 PM
to Johannes Doerfert, llvm-dev, Jon Chesterfield, Liu, Yaxun (Sam)
bump.

On Tue, Sep 7, 2021 at 9:36 AM Artem Belevich <t...@google.com> wrote:

On Tue, Sep 7, 2021 at 9:15 AM Johannes Doerfert <johannes...@gmail.com> wrote:
+bump

Jon did respond positive to the proposal. I think the table implementation
vs the "implemented_by" implementation is something we can experiment with.
I'm in favor of the latter as it is more general and can be used in other
places more easily, e.g., by providing source annotations. That said, having
the table version first would be a big step forward too.

I'd say, if we hear some other positive voices towards this we go ahead with
patches on phab. After an end-to-end series is approved we merge it
together.

I think we've got as much interest expressed (or not) as we can reasonably expect for something that most back-ends do not care about.
I vote for moving forward with the patches.

--Artem



--
--Artem Belevich

Jon Chesterfield via llvm-dev

Nov 17, 2021, 2:40:46 PM
to Artem Belevich, llvm-dev, Liu, Yaxun (Sam)
Thanks for the ping.

The IR pass that rewrote llvm.libm intrinsics to architecture specific ones I wrote years ago was pretty trivial. I'm up for re-implementing that.

Essentially type out a (hash)table with entries like {llvm.sin.f64, "sin", __nv_sin, __ocml_sin} and do the substitution as a pass called 'ExpandLibmIntrinsics' or similar, run somewhere before instruction selection for nvptx / amdgpu / other.

Could factor it differently if we don't like having the nv/oc names next to each other, pass could take the corresponding lookup table as an argument.

Main benefit over the implemented-in-terms-of metadata approach is it's trivial to implement and dead simple. Lowering in IR means doing it once instead of once in sdag and once in gisel. I'll write the pass (from scratch, annoyingly, as the last version I wrote is still closed source) if people seem in favour.

Thanks all,

Jon

Artem Belevich via llvm-dev

Nov 17, 2021, 3:05:32 PM
to Jon Chesterfield, llvm-dev, Liu, Yaxun (Sam)
On Wed, Nov 17, 2021 at 11:40 AM Jon Chesterfield <jonathanch...@gmail.com> wrote:
Thanks for the ping.

The IR pass that rewrote llvm.libm intrinsics to architecture specific ones I wrote years ago was pretty trivial. I'm up for re-implementing that.

Essentially type out a (hash)table with entries like {llvm.sin.f64, "sin", __nv_sin, __ocml_sin} and do the substitution as a pass called 'ExpandLibmIntrinsics' or similar, run somewhere before instruction selection for nvptx / amdgpu / other.

Could factor it differently if we don't like having the nv/oc names next to each other, pass could take the corresponding lookup table as an argument.

Main benefit over the implemented-in-terms-of metadata approach is it's trivial to implement and dead simple. Lowering in IR means doing it once instead of once in sdag and once in gisel. I'll write the pass (from scratch, annoyingly, as the last version I wrote is still closed source) if people seem in favour.

SGTM. 
Providing a fixed set of replacements for specific intrinsics is all NVPTX needs now.
Expanding intrinsics late may miss some optimization opportunities, 
so we may consider doing it earlier and/or more than once, in case we happen to materialize new intrinsics in the later passes.

--Artem


--
--Artem Belevich

Jon Chesterfield via llvm-dev

Nov 17, 2021, 3:17:17 PM
to Artem Belevich, llvm-dev, Liu, Yaxun (Sam)
SGTM. 
Providing a fixed set of replacements for specific intrinsics is all NVPTX needs now.
Expanding intrinsics late may miss some optimization opportunities, 
so we may consider doing it earlier and/or more than once, in case we happen to materialize new intrinsics in the later passes.

Good old phase ordering. I don't think we've got any optimisations that target the nv/oc named functions and would personally prefer to never implement any.

We do have ones that target llvm.libm, and some that target extern C functions with the same names as libm. There's some code in clang that converts some libm functions into llvm intrinsics, and I think some other code in clang that converts in the other direction. Maybe dependent on various math flags.

So it seems we either canonicalise libm-like code and rearrange optimisations to work on the canonical form, or we write optimisations that know there are N names for essentially the same function. I'd prefer to go with the canonical form approach, e.g. we could rewrite calls to __nv_sin into calls to sin early on in the pipeline (or ignore them? seems likely applications call libm functions directly), and rewrite calls to sin to __nv_sin late on, with optimisations written against sin.

Thanks!

Liu, Yaxun (Sam) via llvm-dev

Nov 17, 2021, 3:49:07 PM
to Jon Chesterfield, Artem Belevich, llvm-dev



+1 but we may want to put it under a clang option in the beginning in case it causes perf degradation.

 

Sam

 


Roman Lebedev via llvm-dev

Nov 17, 2021, 3:56:20 PM
to Liu, Yaxun (Sam), Jon Chesterfield, llvm-dev
I would like to note that there's prior (and generic!) art in this
area - ReplaceWithVeclib (https://reviews.llvm.org/D95373)
Presumably the NVPTX backend only needs to declare
the wanted replacements, and they //should// already happen.


Roman

Artem Belevich via llvm-dev

Nov 17, 2021, 3:56:42 PM
to Jon Chesterfield, llvm-dev, Liu, Yaxun (Sam)
I should've phrased it better. What I mean is that because the __nv_* functions are provided as IR, replacing intrinsics with calls to __nv_* functions may provide further IR optimization opportunities -- inlining, CSE, DCE, etc. I didn't mean the optimizations based on known semantics of the functions. I agree that those should be done for canonical calls only.

--Artem

 

Thanks!



--
--Artem Belevich