I am running into a problem and I'm curious if I'm missing something or
if the support is simply missing.
Am I correct to assume the NVPTX backend does not deal with `llvm.sin`
and friends?
This is what I see, with some variations: https://godbolt.org/z/PxsEWs
If this is missing in the backend, is there a plan to get it working? I'd
really like to have the intrinsics in the middle end rather than __nv_cos,
not to mention that -ffast-math does emit the intrinsics and then crashes.
~ Johannes
--
───────────────────
∽ Johannes (he/his)
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Right. We could keep the definition of __nv_cos and friends
around. Right now, -ffast-math might just crash on the user,
which is arguably a bad thing. I can also see us benefiting
in various other ways from llvm.cos uses instead of __nv_cos
(assuming precision is according to the user requirements but
that is always a condition).
It could be as simple as introducing __nv_cos into
"llvm.used" and a backend matching/rewrite pass.
If the backend knew the libdevice location it could even pick
the definitions from there. Maybe we could link libdevice late
instead of eager?
Trying to figure out a good way to have the cake and eat it too.
~ Johannes
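For illustration, the `llvm.used` trick might look roughly like this at the
IR level (the exact encoding, e.g. an appending global in the
`llvm.metadata` section, is a sketch, not a final design; the function body
is elided):

```llvm
; Keep the libdevice definition alive until a late rewrite pass has run.
@llvm.used = appending global [1 x i8*]
             [i8* bitcast (double (double)* @__nv_cos to i8*)],
             section "llvm.metadata"

declare double @llvm.cos.f64(double)

; Linked in from libdevice.bc; body elided here.
define double @__nv_cos(double %x) { ... }

; The backend matching/rewrite pass would then turn
;   %r = call double @llvm.cos.f64(double %v)
; into
;   %r = call double @__nv_cos(double %v)
```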
On 3/10/21 2:49 PM, William Moses wrote:
> Since clang (and arguably any other frontend that uses it) should link in
> libdevice, could we lower these intrinsics to the libdevice code?
On 3/10/21 3:25 PM, Artem Belevich wrote:
> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <
> johannes...@gmail.com> wrote:
>
>> Right. We could keep the definition of __nv_cos and friends
>> around. Right now, -ffast-math might just crash on the user,
>> which is arguably a bad thing. I can also see us benefiting
>> in various other ways from llvm.cos uses instead of __nv_cos
>> (assuming precision is according to the user requirements but
>> that is always a condition).
>>
>> It could be as simple as introducing __nv_cos into
>> "llvm.used" and a backend matching/rewrite pass.
>>
>> If the backend knew the libdevice location it could even pick
>> the definitions from there. Maybe we could link libdevice late
>> instead of eager?
>>
> It's possible, but it would require plumbing in CUDA SDK awareness into
> LLVM. While clang driver can deal with that, LLVM currently can't. The
> bitcode library path would have to be provided by the user.
The PTX backend could arguably be CUDA SDK aware; IMHO, it would
even be fine if the middle end does the remapping to get inlining
and folding benefits also after __nv_cos is used. See below.
> The standard library as bitcode raises some questions.
Which standard library? CUDA's libdevice is a bitcode library, right?
> * When do we want to do the linking? If we do it at the beginning, then the
> question is how to make sure unused functions are not eliminated before we
> may need them, as we don't know a priori what's going to be needed. We also
> do want the unused functions to be gone after we're done. Linking it in
> early would allow optimizing the code better at the expense of having to
> optimize a lot of code we'll throw away. Linking it in late has less
> overhead, but leaves the linked in bitcode unoptimized, though it's
> probably in the ballpark of what would happen with a real library call.
> I.e. no inlining, etc.
>
> * It incorporates linking into LLVM, which is not LLVM's job. Arguably, the
> line should be drawn at the lowering to libcalls as it's done for other
> back-ends. However, we're also constrained by the need to have the
> linking done before we generate PTX which prevents doing it after LLVM is
> done generating an object file.
I'm confused. Clang links in libdevice.bc early.
If we make sure
`__nv_cos` is not deleted early, we can at any point "lower" `llvm.cos`
to `__nv_cos` which is available. After the lowering we can remove
the artificial uses of `__nv_XXX` functions that we used to keep the
definitions around in order to remove them from the final result.
We get the benefit of having `llvm.cos` for part of the pipeline, and we
know it does not have all the bad effects of `__nv_cos`, which is defined
with inline assembly.
and folding the implementation based on the arguments. Finally,
this should work with the existing pipeline, the linking is the same
as before; all we do is keep the definitions alive longer and
lower `llvm.cos` to `__nv_cos` in a middle end pass.
This might be similar to the PTX solution you describe below but I feel
we get the inline benefit from this without actually changing the pipeline
at all.
I think if we embed knowledge about the __nv_XXX functions we can
even get away without the cons you listed for early linking above.
For early link I'm assuming an order similar to [0] but I also discuss
the case where we don't link libdevice early for a TU.
Link early:
1) clang emits module.bc and links in libdevice.bc but with the
`optnone`, `noinline`, and "used" attribute for functions in
libdevice. ("used" is not an attribute but could as well be.)
At this stage module.bc might call __nv_XXX or llvm.XXX freely
as defined by -ffast-math and friends.
2) Run some optimizations in the middle end, maybe till the end of
the inliner loop, unsure.
3) Run a libcall lowering pass and another NVVMReflect pass (or the
only instance thereof). We effectively remove all llvm.XXX calls
in favor of __nv_XXX now. Note that we haven't spent (much) time
on the libdevice code as it is optnone and most passes are good
at skipping those. To me, it's unclear whether the used parts should
be optimized before we inline them anyway, to avoid redoing the
optimizations over and over (per call site). That needs measuring,
I guess. Also note that we can still retain the current behavior
for direct calls to __nv_XXX if we mark the call sites as
`alwaysinline`; at least the behavior stays almost the same as it
is now.
4) Run an always inliner pass on the __nv_XXX calls because it is
something we would do right now. Alternatively, remove `optnone`
and `noinline` from the __nv_XXX calls.
5) Continue with the pipeline as before.
As mentioned above, `optnone` avoids spending time on the libdevice
until we "activate" it. At that point (globals) DCE can be scheduled
to remove all unused parts right away. I don't think this is (much)
more expensive than linking libdevice early right now.
Link late, aka. translation units without libdevice:
1) clang emits module.bc but does not link in libdevice.bc, it will be
made available later. We still can mix __nv_XXX and llvm.XXX calls
freely as above.
2) Same as above.
3) Same as above.
4) Same as above but effectively a no-op, no __nv_XXX definitions are
available.
5) Same as above.
I might misunderstand something about the current pipeline, but from [0]
and the experiments I ran locally it looks like the above should cover all
the cases. WDYT?
~ Johannes
P.S. If the rewrite capability (aka libcall lowering) is generic we could
use the scheme for many other things as well.
[0] https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice
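For illustration, step 1 of the early-link scheme might leave module.bc in
a state like this (the attribute spelling and the use of
`llvm.compiler.used` are assumptions of this sketch; the body is elided):

```llvm
; libdevice definitions linked in, shielded from optimization and DCE.
@llvm.compiler.used = appending global [1 x i8*]
                      [i8* bitcast (double (double)* @__nv_cos to i8*)],
                      section "llvm.metadata"

; Body from libdevice.bc, elided here.
define double @__nv_cos(double %x) #0 { ... }

attributes #0 = { noinline optnone }
```

Step 3 then rewrites `llvm.cos` calls to `@__nv_cos`, and step 4 drops
`optnone`/`noinline` (or marks the call sites `alwaysinline`) so the
surviving definitions can be inlined and the rest DCE'ed.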
>
>
>> For early link I'm assuming an order similar to [0] but I also discuss
>> the case where we don't link libdevice early for a TU.
>>
> That link just describes the steps needed to use libdevice. It does not
> deal with how/where it fits in the LLVM pipeline.
> The gist is that NVVMReflect replaces some conditionals with constants.
> libdevice uses that as a poor man's IR preprocessor, conditionally enabling
> different implementations and relying on DCE and constant folding to remove
> unused parts and eliminate the now useless branches.
> While running NVVMReflect alone will make libdevice code valid and usable, it
> would still benefit from further optimizations. I do not know to what
> degree, though.
>
>
>> Link early:
>> 1) clang emits module.bc and links in libdevice.bc but with the
>> `optnone`, `noinline`, and "used" attribute for functions in
>> libdevice. ("used" is not an attribute but could as well be.)
>> At this stage module.bc might call __nv_XXX or llvm.XXX freely
>> as defined by -ffast-math and friends.
>>
> That could work. Just carrying extra IR around would probably be OK.
> We may want to do NVVMReflect as soon as we have it linked in and, maybe,
> allow optimizing the functions that are explicitly used already.
Right. NVVMReflect can be run twice and with `alwaysinline`
on the call sites of __nv_XXX functions we will actually
inline and optimize them while the definitions are just "dragged
along" in case we need them later.
Right now, clang will generate any LLVM intrinsic and we crash, so anything
else is probably a step in the right direction. Eventually, we should "lower"
all intrinsics that the NVPTX backend can't handle, or at least emit a nice
error message. Preferably, clang would know what we can't deal with and not
generate intrinsic calls for those in the first place.
>
> The most concerning aspect of libdevice is that we don't know when we'll no
> longer be able to use the libdevice bitcode? My understanding is that IR
> does not guarantee binary stability and at some point we may just be unable
> to use it. Ideally we need our own libm for GPUs.
For OpenMP I did my best to avoid writing libm (code) for GPUs by
piggybacking on the CUDA and libc++ implementations, and I hope it will
stay that way. That said, if the need arises we might really have to port
libc++ to the GPUs.
Back to the problem with libdevice. I agree that the solution of NVIDIA
to ship a .bc library is suboptimal but with the existing, or an extended,
auto-upgrader we might be able to make that work reasonably well for the
foreseeable future. That problem is orthogonal to what we are discussing
above, I think.
~ Johannes
I could see something like:
```
__attribute__((implementation("llvm.cos")))
double __nv_cos(...) { ... }
```
and a pass that transforms all calls to a function with an
"implementation" to calls to that implementation. Maybe
later we attach a score/priority ;)
I certainly agree we should try to avoid a hard-coded mapping
in C++.
I really hope to avoid any additional bitcode; there are too many
drawbacks and basically no benefits, IMHO.
> LLVM does not need to know or care about what's provided by libdevice, and
> we'd have more flexibility, compared to what we could do in the mapping
> pass. It also makes it easy to substitute a different implementation, if we
> have or need one.
I agree that LLVM (core) should not know about __nv_*; that's why I
suggested the `__attribute__((implements("...")))` approach. My preferred
solution is still to annotate our declarations of __nv_* and point to the
llvm intrinsic (name) from there. If we have a missing mapping, we point to
an intrinsic from a definition that lives in the Clang headers next to the
__nv_* declarations.
This does not yet work because -mlink-builtin-bitcode (which I assume
triggers the llvm-link logic) will drop the attributes of a declaration
if a definition is found. I think that should not be the case; instead,
the union of the attributes should be kept.
The benefit I see for the above is that the mapping is tied to the
declarations and doesn't live in a tablegen file far away. It works well
even if we can't map 1:1, and we could even restrict the "used" attribute
to anything that has an "implements" attribute. So:
```
__nv_A() { ... } // called, inlined and optimized as before, DCE'ed after.
__nv_B() { ... } // not called, DCE'ed.
__attribute__((implements("llvm.C")))
__nv_C() { ... } // calls are inlined and optimized as before, not DCE'ed
// though because of the attribute. Replaces llvm.C as
// callee in the special pass.
```
So "implements" gives you a way to statically replace a function declaration
or definition with another one. I could see it being used to provide other
intrinsics to platforms with backends that don't support them.
Does that make some sense?
~ Johannes
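At the IR level, the rewrite done by the special pass would amount to
something like this (the `!implements` metadata encoding is hypothetical,
not an existing metadata kind; bodies are elided):

```llvm
; The frontend attached the mapping to the libdevice definition:
define double @__nv_C(double %x) !implements !0 { ... }
!0 = !{!"llvm.C"}

; Before the pass:
;   %r = call double @llvm.C(double %v)
; After the pass, llvm.C's calls are redirected:
;   %r = call double @__nv_C(double %v)
```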
Bitcode comes with all the problems libdevice itself has wrt.
compatibility. It is also hard to update and maintain. You basically
maintain IR or you maintain C(++) as I suggest. Also, bitcode is
platform specific. I can imagine building a bitcode file during the
build but shipping one means you have to know ABI and datalayout or
hope they are the same everywhere.
>>> LLVM does not need to know or care about what's provided by libdevice,
>> and
>>> we'd have more flexibility, compared to what we could do in the mapping
>>> pass. It also makes it easy to substitute a different implementation, if
>> we
>>> have or need one.
>> I agree that LLVM (core) should not know about __nv_*, that's why I
>> suggested
>> the `__attribute__((implements("...")))` approach. My preferred solution
>> is still to annotate our declarations of __nv_* and point to the
>> llvm.intrinsics (name) from there. If we have a missing mapping, we
>> point to an
>> intrinsic from a definition that lives in the Clang headers next to the
>> __nv_* declarations.
>>
> We may have slightly different end goals in mind.
> I was thinking of making the solution work for LLVM. I.e. users would be
> free to use llvm.sin with NVPTX back-end with a few documented steps needed
> to make it work (basically "pass additional
> -link-libm-bitcode=path/to/bitcode_libm.bc").
>
> Your scenario above suggests that the goal is to allow clang to generate
> both llvm intrinsics and the glue which would then be used by LLVM to make
> it work for clang, but not in general. It's an improvement compared to what
> we have now, but I still think we should try a more general solution.
>
My scenario doesn't disallow a bitcode approach for non-clang
frontends, nor does it disallow them to simply build the glue code
with clang and package it themselves. It does however allow us to
maintain C(++) code rather than IR, which is by itself a big win.
>> This does not yet work because -mlink-builtin-bitcode (which I assume
>> triggers the llvm-link logic) will drop the attributes of a declaration
>> if a definition is found. I think that should not be the case anyway
>> such that the union of attributes is set.
>>
>> The benefit I see for the above is that the mapping is tied to the
>> declarations and doesn't live in a tablegen file far away. It works well
>> even if we can't map 1:1, and we could even restrict the "used" attribute
>> to anything that has an "implements" attribute.
>
> I do not think we need tablegen for anything here. I was thinking of just
> compiling a real math library (or a wrapper on top of libdevice) from C/C++
> sources.
I did not understand your suggestion before. Agreed, no tablegen.
>
> Our approaches are not mutually exclusive. If there's a strong opposition
> to providing a bitcode libm for NVPTX, implementing it somewhere closer to
> clang would still be an improvement, even if it's not as general as I'd
> like. It should still be possible to allow LLVM to lower libcalls in NVPTX
> to standard libm API, enabled with a flag, and just let the end users who
> are interested (e.g. JITs) to provide their own implementation.
Right. And their own implementation could be trivially created for
them as bc file:
`clang -emit-llvm-bc $clang_src/.../__clang_cuda_cmath.h -femit-all-decls`
Or am I missing something here?
~ Johannes
https://reviews.llvm.org/D98516
If this is something we support I'll write an RFC, also
for the missing clang parts.
~ Johannes
[EOM]
> Also, bitcode is platform specific. I can imagine building a bitcode file
> during the build but shipping one means you have to know ABI and datalayout
> or hope they are the same everywhere.

Agreed. We will likely need multiple variants. We will compile specifically
for NVPTX or AMDGPU, and we will know the specific ABI and data layout for
them regardless of the host we're building on.

It appears to me that the difference vs. what we have now is that we'll need
to have the libm sources somewhere, a process to build them for particular
GPUs (which may need to be done out of tree as it may need the CUDA/HIP
SDKs), and a way to incorporate such libraries into the LLVM distribution.

OK. I'll agree that that may be a bit too much for now.
1) Allow metadata on declarations [not just definitions]
2) Tell GlobalOpt and other passes not to delete globals using/used in
implemented_by
3) Write an implemented_by pass that scans all functions, replaces the
calls, and removes the metadata
4) Add Clang attributes to expose implements and use them in the
nvptx/amdgpu headers
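For step 1, the declarations in the wrapper headers might carry the
mapping like this (the `!implemented_by` spelling is illustrative, not an
existing metadata kind):

```llvm
; Declaration only: __nv_cos is marked as the implementation of llvm.cos.f64.
declare double @__nv_cos(double) !implemented_by !0
!0 = !{!"llvm.cos.f64"}
```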
Date: Wed, 28 Apr 2021 18:56:32 -0400
From: William Moses via llvm-dev <llvm...@lists.llvm.org>
To: Artem Belevich <t...@google.com>
...
Hi all,
Reviving this thread as Johannes and I recently had some time to take a
look and do some additional design work. We'd love any thoughts on the
following proposal.
...
While in theory we could define the lowering of these intrinsics to be a
table which looks up the correct __nv_sqrt, this would require the
definition of all such functions to remain or otherwise be available. As
it's undesirable for the LLVM backend to be aware of CUDA paths, etc, this
means that the original definitions brought in by merging libdevice.bc must
be maintained. Currently these are deleted if they are unused (as libdevice
has them marked as internal).
*Design Constraints:*
To remedy the problems described above we need a design that meets the
following:
* Does not require modifying libdevice.bc or other code shipped by a
vendor-specific installation
* Allows llvm math intrinsics to be lowered to device-specific code
* Keeps definitions of code used to implement intrinsics until after all
potentially relevant intrinsics (including those created by LLVM passes)
have been lowered.
... metadata / aliases ...
Jon did respond positively to the proposal. I think the table implementation
vs. the "implemented_by" implementation is something we can experiment with.
I'm in favor of the latter as it is more general and can be used in other
places more easily, e.g., by providing source annotations. That said, having
the table version first would be a big step forward too.
I'd say, if we hear some other positive voices towards this we go ahead with
patches on phab. After an end-to-end series is approved we merge it
together.
That said, people should chime in if they (dis)like the approach to get math
optimizations (and similar things) working on the GPU.
~ Johannes
+bump
On Tue, Sep 7, 2021 at 9:15 AM Johannes Doerfert <johannes...@gmail.com>
wrote:
> +bump
Thanks for the ping.

The IR pass I wrote years ago that rewrote llvm.libm intrinsics to
architecture-specific ones was pretty trivial. I'm up for re-implementing
that. Essentially, type out a (hash) table with entries like {llvm.sin.f64,
"sin", __nv_sin, __ocml_sin} and do the substitution in a pass called
'ExpandLibmIntrinsics' or similar, run somewhere before instruction
selection for nvptx/amdgpu/other.

We could factor it differently if we don't like having the nv/oc names next
to each other; the pass could take the corresponding lookup table as an
argument.

The main benefit over the implemented-in-terms-of metadata approach is that
it's trivial to implement and dead simple. Lowering in IR means doing it
once instead of once in SDAG and once in GISel. I'll write the pass (from
scratch, annoyingly, as the last version I wrote is still closed source) if
people seem in favour.
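For NVPTX, the table-driven expansion described here would boil down to a
rewrite like the following (the table entry and pass name are as sketched
in the discussion, nothing final):

```llvm
; Table entry: { llvm.sin.f64, "sin", __nv_sin, __ocml_sin }
; Before ExpandLibmIntrinsics (NVPTX selects the __nv_ column):
  %r = call double @llvm.sin.f64(double %x)
; After:
  %r = call double @__nv_sin(double %x)
```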
SGTM. Providing a fixed set of replacements for specific intrinsics is all
NVPTX needs now. Expanding intrinsics late may miss some optimization
opportunities, so we may consider doing it earlier and/or more than once, in
case we happen to materialize new intrinsics in the later passes.
+1 but we may want to put it under a clang option in the beginning in case it causes perf degradation.
Sam
From: Jon Chesterfield <jonathanch...@gmail.com>
Sent: Wednesday, November 17, 2021 3:17 PM
To: Artem Belevich <t...@google.com>
Cc: Johannes Doerfert <johannes...@gmail.com>; llvm-dev <llvm...@lists.llvm.org>; Arsenault, Matthew <Matthew....@amd.com>; Evgenii Stepanov <eug...@google.com>; Liu, Yaxun (Sam) <Yaxu...@amd.com>
Subject: Re: [llvm-dev] NVPTX codegen for llvm.sin (and friends)
Roman
Thanks!