RFC: Finishing ROCM backend so as to enable by default

262 views
Skip to first unread message

Stella Laurenzo

unread,
Aug 29, 2023, 12:53:01 AM8/29/23
to iree-discuss
Hi Folks,

We've had in-tree code for the ROCM backend for a while. This has received some downstream attention and testing by some users, but it is increasing in importance and focus. I'd like to ultimately arrive at some feedback on what it would take to call it good enough to enable by default in IREE. I've got no expectations on a timeline for this, and I'm just looking for feedback. To the extent possible, I'd like to avoid a fully general discussion on what the criteria is for an arbitrary backend and focus on this specific one (I expect others are all going to need to be uniquely evaluated anyway).

In my mind, there are a handful of ways that this backend is presently insufficient and would need to be addressed:

  1. Lack of testing: This one is obvious and represents the first tier of support, in my mind. I'd like to provide some incentive to move downstream testing to an appropriate granularity of upstream testing as we evaluate maturity. For now, given the experimental nature of the PkgCI (and that it isn't blocking anyone), I've ok'd Stanley to enable ROCM here so we can assess. If it causes problems, we'll disable it again. While the PkgCI handle integration testing, we will still need a build-bot to do runtime/CTS testing with a real device.
  2. Runtime Maturity: The ROCM backend was effectively ported from a very early version of the CUDA backend, which for reasons of history, had always been the least mature of our supported backends. Due to some heroic work be Lei Zhang and Ben Vanik, the CUDA backend is now in pretty good shape. We need the ROCM backend to make similar investments, primarily so that it no longer depends on deprecated or unsupportable parts of the HAL. Having this continue to diverge keeps us from moving important parts of the runtime forward, and it can't fall to the maintainers of the rest of the backends and infra to keep ROCM going. I expect that Ben Vanik and Lei Zhang would need to provide a thumbs up/review before calling this in good shape.
  3. Compiler Backend Maturity: The ROCM compiler backend is implemented within the "LLVMGPU" codebase, along with the CUDA backend. Both have maintenance needs that make us hesitate to just add "one more thing" until we see that things are moving towards a better supportable state. There has been some recent coordination between Quinn/Lei/Mahesh/Kunwar/Harsh on this which makes me think that this can be solved and there is a will to do so. Still, I would expect that Mahesh owns the bit to say we're ready to more generally support this.
  4. Build and CI Infra: There is a variety of AMD hardware to test and no to limited ability to acquire most of it on a public cloud. With the PkgCI work, I've addressed these issues, and it is fairly straight-forward to bolt runners onto the project, and the testing setup supports enough annotations to let us to manage the variants. I expect that as we transition more integration testing to PkgCI, this will be a non-issue. It has also proven to be easier to manage bespoke Linux-based AMD CI systems than some alternatives, effectively just requiring user-land package installs in (fully ephemeral) docker images and a single host-side argument to map the device into the container. I think that as long as people are committed to keeping these runners going, there is little overhead to the overall project to enable testing on them.
  5. Committed staffing: We need to know that people are on the hook to not just fix the ROCM backend when it breaks but also to work proactively to keep it up to standards so that it isn't a drag on core development work. This will require people who can do meaningful amounts of compiler and runtime maintenance as needed.
Anything else?
- Stella

Scott Todd

unread,
Aug 29, 2023, 11:10:41 AM8/29/23
to Stella Laurenzo, iree-discuss
Thanks for the RFC! Agreed that code maturity should improve and we would need CI runners capable of ensuring test coverage.

I'm curious how much we think can still be shared between the compiler backends. If code diverges significantly, that could become trickier to support.

I'll also suggest two more items:

Continuous benchmarks: Nice to have, especially early to track trends and spot regressions. Ideally we could plug in to the existing benchmark infrastructure, though I also get the feeling that we're due for a benchmarking refresh at some point soon (easier PyTorch integration, more runners, LLM workloads, etc.). I think once we have AMD CI runners for builds/tests, hooking them up for benchmarks should be a matter of plumbing and provisioning enough resources to service the benchmark jobs too. Whether it's worth doing that plumbing would depend on priorities for the backend and how well they match with the current status of the benchmarking infrastructure.

Documentation: https://openxla.github.io/iree/guides/deployment-configurations/gpu-cuda-rocm/ could be split into two separate pages, as both backends mature and grow in their own ways. We could also give some guidance on when to use one API over another when there are multiple options (e.g. try both, use Vulkan if you want to interop with a graphics application and CUDA/ROCm if you want to use vendor-specific features).

--
You received this message because you are subscribed to the Google Groups "iree-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to iree-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/iree-discuss/f15ecfc8-43e3-464d-9397-c5e8ab25fea0n%40googlegroups.com.

Ben Vanik

unread,
Aug 29, 2023, 12:35:56 PM8/29/23
to iree-discuss
Thanks Stella, those points match my feelings on things. In-tree targets have certain latency requirements on fixes/improvements/evolution and experimental/ allows targets that can't meet those to still live in-tree - the alternative is for them to live out-of-tree (which we support via IREE_EXTERNAL_HAL_DRIVERS) so that they can be maintained at a different rate and with different test frequency/depth. In my mind new targets should be started out-of-tree, at some level of maturity/commitment/expected longevity move in-tree to experimental/, and then when enabled for scaling moved out of experimental/. It's ok for a target that is in experimental/ to stay there awhile if the alternative is to move it out-of-tree - until the fixes just recently I'd have made the question around getting ROCM out of experimental/ "should we move ROCM to its own repo?" instead as it'd set a good precedent for other targets people build out living out-of-tree. Getting a template repo setup with the runtime and compiler portions + test harness and such would be really useful in general. The HAL CTS also needs a rewrite and to be made much more exhaustive so that it's easier both for in-tree targets and out-of-tree targets to be integrated without deep target-specific debugging being required.

As for ROCM in particular (vs other future targets) the two things that gate such a migration out of experimental/ and into the main drivers/ tree for me are #2 (hopefully soon ROCM will be the only synchronous-only backend, at which time it'll go more unsupported as the compiler starts emitting async-only code) and #3 (given that the ROCM and CUDA codegen sides share so much and CUDA is unowned and has some deep tech debt/chaos to recover from). If the runtime side is made async in line with the others then it won't be too bad to support (it may just lack features of other targets that get more activity) and could be moved out of experimental but the compiler side will be the bigger lift. A better HAL CTS would make having a less maintained target in-tree is less costly but without a maintained codegen story and a primary maintainer of the whole end-to-end we need the ability to disable ROCM as issues arise.

Once cuda2 replaces cuda doing a snapshot and hipification (or whatever) of it may be a good starting point for solving #2. cuda2 will still need some architectural changes as the HAL gets closer to v1 but having a rocm2 that was easier to port changes between would make it easier to keep both improving together.

Che-Yu Wu

unread,
Aug 29, 2023, 12:42:20 PM8/29/23
to Ben Vanik, iree-discuss
Thanks for the RFC! Here are the comments on some CI-related items:

Continuous benchmarks: 
Totally agree we should have a refresh on the current benchmark infrastructure. And I'm happy to see PkgCI move to an e2e python-driven benchmark flow (I ran into many issues with combining Python and CMake build system). For regression tracking if needed we can upload numbers to https://perf.iree.dev and it should work (it only needs an API key and the key is stored in Github action secrets)

Build and CI Infra: As we bring in more non-GCP runners, I think it will be great to document all runners we have, including their specifications, which runner labels to request them in CI, and the points of contact when runners fail or misbehave.

Mahesh Ravishankar

unread,
Aug 29, 2023, 4:28:41 PM8/29/23
to Che-Yu Wu, Ben Vanik, iree-discuss
Thanks Stella for raising the issue.

Regarding:
  1. Compiler Backend Maturity: The ROCM compiler backend is implemented within the "LLVMGPU" codebase, along with the CUDA backend. Both have maintenance needs that make us hesitate to just add "one more thing" until we see that things are moving towards a better supportable state. There has been some recent coordination between Quinn/Lei/Mahesh/Kunwar/Harsh on this which makes me think that this can be solved and there is a will to do so. Still, I would expect that Mahesh owns the bit to say we're ready to more generally support this.

For some context, the CUDA backend needs some maintenance. The primary change here would be 
1) Make the Tensor core pipeline that is currently the most performant thanks to Manish Gupta's work, modernized to use the tensors -> vectors path. This is currently not the case, and Hanhan was working on fixing this. He had some progress but those need to land and things like shared memory usage etc. need to be flushed out to make this the pipeline to target tensor cores
2) After (1) we deprecate some pipelines in the CUDA backends that were built as various experimentation points. 
All of this should be directly relevant for RoCM. Since this work is just coming online, using it to build in RoCM support from the get go, with good testing and CI support would be great! I think it is valuable to have RoCM path lit up enough to have a way to isolate issues that might be specific to the SPIR-V driver compiler on AMD hardware. I estimate this work to take a couple of months to reach a satisfactory state. Really I am looking for someone to drive this forward (cause this is really a full time endeavor), but things are all set up to move forward here. So having RoCM as a first-class supported backend seems viable to me.

Lei Zhang

unread,
Aug 30, 2023, 1:43:54 PM8/30/23
to Mahesh Ravishankar, Che-Yu Wu, Ben Vanik, iree-discuss
Thanks Stella! It would be awesome to see improvements and eventually default support for the ROCm platform! 

I think we have a good coverage of major issues by you and Scott. Among them, I personally think committed staffing is the top--all other issues would need to have people taking on them. :) Once it's enabled by default there will be ongoing maintenance. We should have at least two people comfortable to pick up tasks there and fix issues, for both the runtime and the compiler side. Doesn't mean they need to be exclusively full time sitting on this; it's more about experience and familiarity with the codebase and can jump in when others might not be available / busy with something else. For the runtime Ben will always be the last line of defense but I wouldn't want to just put the daily maintenance onto his plate. I'm always happy to help; and we should still find some other person(s) mainly responsible for it. Whoever is responsible should have a good setup of the environment to reproduce issues, given that that's the typical first blockers of fixing any issues and not everybody can have all device variants. 

Regarding making the rocm runtime production ready, I agree with Ben that waiting for the cuda2 to replace cuda and then start there would be better. We are not too far away from that--all the progress is tracked in this issue with some major remaining tasks specifically here. I also have a few outstanding performance improvement patches not listed there; those aren't architectural and so nonblocking.  

Thanks,
Lei


Stella Laurenzo

unread,
Aug 31, 2023, 9:44:23 PM8/31/23
to Che-Yu Wu, Ben Vanik, iree-discuss
On Tue, Aug 29, 2023 at 9:42 AM 'Che-Yu Wu' via iree-discuss <iree-d...@googlegroups.com> wrote:
Thanks for the RFC! Here are the comments on some CI-related items:

Continuous benchmarks: 
Totally agree we should have a refresh on the current benchmark infrastructure. And I'm happy to see PkgCI move to an e2e python-driven benchmark flow (I ran into many issues with combining Python and CMake build system). For regression tracking if needed we can upload numbers to https://perf.iree.dev and it should work (it only needs an API key and the key is stored in Github action secrets)

I didn't quite get there last week and will try to this week: I was just going to stream out all compile/latency metrics to a file as the tests run (one row per metric) and then combine/report/put in an artifact. For the moment, I've just been eyeballing logs to see that things don't regress.
 

Stella Laurenzo

unread,
Oct 4, 2023, 9:29:05 PM10/4/23
to iree-discuss
Hi folks - Would we be ok enabling the ROCM compiler target by default at head and in deployed packages?

My rationale is that:
  • We've done work on it to better manage bitcode and tool dependency, and I'd really like to start testing the mechanics of the compiler target at head.
  • It has no external dependencies outside of the project.
  • Our tools flows are heavily based on being able to use nightly/deployed compiler binaries, and having these available by default will really help solve the testing chicken/egg issues on the runtime and CI.
  • The runtime would still remain off by default so there is little risk of this becoming GA before we are ready with the runtime (which requires more work).
  • We can disable it at any time if it causes issues.
  • It'd be good to start getting some signal on how burdensome this is to carry while it is still not strictly required.
Thoughts?
- Stella

Anush Elangovan

unread,
Oct 4, 2023, 10:06:43 PM10/4/23
to Stella Laurenzo, iree-discuss
This would be great. The ROCM backend has been around for a while and it would be good to get it into a more supported shape as we continue to invest in it.

Thanks

--
You received this message because you are subscribed to the Google Groups "iree-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to iree-discuss...@googlegroups.com.

Ben Vanik

unread,
Oct 5, 2023, 12:00:56 PM10/5/23
to Anush Elangovan, Stella Laurenzo, iree-discuss
I'm +1 on enabling in the compiler.

Scott Todd

unread,
Oct 5, 2023, 12:06:56 PM10/5/23
to Ben Vanik, Anush Elangovan, Stella Laurenzo, iree-discuss
I'm also fine with enabling the compiler target in the CMake build and releases by default. It's actually already been enabled in the Bazel build for a while (can see `//compiler/src/iree/compiler/Dialect/HAL/Target/ROCM/test:smoketest.mlir.test (cached) PASSED in 0.2s` in CI logs).

My only hesitation would be the point in the original message here about LLVMGPU codegen needing some maintenance. It seems like the actual compiler target parts (HAL/Target) are pretty small, so there might not be too much cost there.

Jacques Pienaar

unread,
Oct 5, 2023, 12:11:51 PM10/5/23
to Scott Todd, Ben Vanik, Anush Elangovan, Stella Laurenzo, iree-discuss
I see this as a good staging step. Given setup we have easy outs if we discover pain points while can track progress/stability improvements. Low risk and good signal, so SGTM.

-- Jacques 

Stella Laurenzo

unread,
Oct 5, 2023, 12:39:55 PM10/5/23
to Jacques Pienaar, Scott Todd, Ben Vanik, Anush Elangovan, iree-discuss
Ok, I'll triage any issues and burn down to enable shortly.



On Thu, Oct 5, 2023 at 9:11 AM Jacques Pienaar <jpie...@google.com> wrote:
I see this as a good staging step. Given setup we have easy outs if we discover pain points while can track progress/stability improvements. Low risk and good signal, so SGTM.

-- Jacques 

On Thu, Oct 5, 2023, 9:06 AM 'Scott Todd' via iree-discuss <iree-d...@googlegroups.com> wrote:
I'm also fine with enabling the compiler target in the CMake build and releases by default. It's actually already been enabled in the Bazel build for a while (can see `//compiler/src/iree/compiler/Dialect/HAL/Target/ROCM/test:smoketest.mlir.test (cached) PASSED in 0.2s` in CI logs).

My only hesitation would be the point in the original message here about LLVMGPU codegen needing some maintenance. It seems like the actual compiler target parts (HAL/Target) are pretty small, so there might not be too much cost there.

These things get better with testing, use and engagement, so this incremental step will help there. Limited people want to spend time on things that step 1 is "suit up and go into the monster's lair".

Stella Laurenzo

unread,
Mar 7, 2024, 3:47:37 PM3/7/24
to Stella Laurenzo, Jacques Pienaar, Scott Todd, Ben Vanik, Anush Elangovan, iree-discuss
Since the last update, the "hip" runtime driver has been completely rewritten and we believe it to be more feature complete than the ROCM driver. We're at the point where it would be beneficial to have this available by default in binaries so that we can leverage it in mainline testing flows. I'll be sending a patch shortly to enable by default.



--
- Stella

Lei Zhang

unread,
Mar 21, 2024, 11:46:22 AM3/21/24
to iree-discuss
Just wanted to update here--we have landed https://github.com/openxla/iree/commit/5f2743baaa79160a3883854e60e5188822dceeb1 which moved hip into the hal/drivers directory and enabled building it by default.
Reply all
Reply to author
Forward
0 new messages