Very bad performance when xla_gpu_enable_triton_gemm=true on MI210/MI300X


Steeve Morin

Sep 10, 2024, 6:38:12 PM
to OpenXLA Discuss
Hello,

We've seen that enabling xla_gpu_enable_triton_gemm severely degrades performance (roughly 50-100x) on AMD CDNA GPUs (MI-series chips).
For instance, Llama 3 8B runs at about 1 tok/s with Triton enabled vs ~66 tok/s with it disabled.

Since it's enabled by default, we were wondering if we did something wrong.

Also, we've seen that NVIDIA recommends disabling it on their GPUs as well: https://github.com/NVIDIA/JAX-Toolbox?tab=readme-ov-file#environment-variables, although we haven't seen any difference on NVIDIA so far.

Should we just go ahead and disable it altogether?
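
(For reference, this is roughly how we disable it today; a minimal sketch in JAX terms, assuming the standard XLA_FLAGS mechanism. Any frontend that forwards XLA_FLAGS to XLA works the same way.)

import os

# XLA flags are read when the backend initializes, so set this
# before the first JAX import.
os.environ["XLA_FLAGS"] = "--xla_gpu_enable_triton_gemm=false"

import jax
import jax.numpy as jnp

x = jnp.ones((4096, 4096), dtype=jnp.bfloat16)
y = jax.jit(lambda a: a @ a)(x)
print(y.block_until_ready().shape)  # GEMMs now take the BLAS path instead of Triton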

Thank you,
Steeve

George Karpenkov

Sep 11, 2024, 10:54:20 AM
to Steeve Morin, Chao Chen, OpenXLA Discuss
We've never tuned Triton tilings for AMD GPUs, and we've never debugged their performance. I'd check which exact kernels are slow in the timeline and potentially file bugs on the Triton AMD backend.
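
For example, with a JAX frontend you can capture a timeline like this (a minimal sketch; the shapes and output path are placeholders):

import jax
import jax.numpy as jnp

@jax.jit
def f(a, b):
    return a @ b

a = jnp.ones((8192, 8192), dtype=jnp.bfloat16)
b = jnp.ones((8192, 8192), dtype=jnp.bfloat16)
f(a, b).block_until_ready()  # compile outside the trace

# Produces a profile viewable in TensorBoard/Perfetto; slow Triton GEMM
# fusions should stand out as long-running kernel slices in the timeline.
with jax.profiler.trace("/tmp/xla-trace"):
    f(a, b).block_until_ready()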

+@Chao Chen can someone from AMD comment on Triton-on-AMD performance? 


> Also, we've seen that NVIDIA recommends disabling it on their GPUs as well: https://github.com/NVIDIA/JAX-Toolbox?tab=readme-ov-file#environment-variables, although we haven't seen any difference on NVIDIA so far.

That's a different question; I think we got mostly aligned with NVIDIA on removing that recommendation, so it shouldn't be valid anymore.


Jay Furmanek

Sep 11, 2024, 11:24:39 AM
to OpenXLA Discuss, George Karpenkov, Steeve Morin, Chao Chen
Hi guys,
We have the Triton GEMM rewriter enabled for AMD, but we don't have the auto-tuner enabled yet (we're working on that).
That "xla_gpu_enable_triton_gemm" flag should probably be off by default for AMD until we finish there.

@Steeve
What Llama 3 model are you using? We can use that one to ensure Triton helps there before re-enabling it once we have the tuner in.
Is this JAX, or are you using Torch-XLA? Also, which GPU are you using? Triton likely won't ever be enabled for anything older than MI200.
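
In the meantime, one way to quantify the gap on your own workload is to time it under both flag settings; a hypothetical sketch (XLA_FLAGS is only read at backend init, so each setting needs a fresh process):

import os, subprocess, sys, textwrap

BENCH = textwrap.dedent("""
    import time, jax, jax.numpy as jnp
    f = jax.jit(lambda a, b: a @ b)
    x = jnp.ones((8192, 8192), dtype=jnp.bfloat16)
    f(x, x).block_until_ready()  # compile + warm up
    t0 = time.perf_counter()
    for _ in range(10):
        f(x, x).block_until_ready()
    print((time.perf_counter() - t0) / 10)
""")

for value in ("true", "false"):
    env = dict(os.environ, XLA_FLAGS=f"--xla_gpu_enable_triton_gemm={value}")
    out = subprocess.run([sys.executable, "-c", BENCH], env=env,
                         capture_output=True, text=True)
    print(f"triton_gemm={value}: {out.stdout.strip()} s/iter")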

George Karpenkov

Sep 11, 2024, 12:02:42 PM
to Jay Furmanek, OpenXLA Discuss, Steeve Morin, Chao Chen
Ah, I guess it's not surprising then; without an autotuner the performance will be quite bad.

Steeve Morin

Sep 11, 2024, 1:11:10 PM
to OpenXLA Discuss, George Karpenkov, Steeve Morin, Chao Chen, Jay Furmanek
Thank you all!!

We’re using our own Llama 3 implementation built on our framework, ZML, which is based on Zig, StableHLO/XLA, and Bazel.
We’d love to give you folks access to the repo; send me your GitHub handles if you’re interested.

More context:
https://x.com/steeve/status/1819005278467015044?s=46&t=IF2rTl8IrJJyT2pzj-XmBQ

Quick side question: we usually run compilations in parallel (in different threads). Does that impact autotuning?

Steeve

George Karpenkov

Sep 12, 2024, 4:26:00 AM
to Steeve Morin, OpenXLA Discuss, Chao Chen, Jay Furmanek
> Quick side question: we usually run compilations in parallel (in different threads). Does that impact autotuning?

No, there's a mutex. Don't use more than one process for a single GPU, though (that probably won't work anyway due to the BFC allocator).
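
Concretely, the safe pattern is many compilation threads within a single process; a minimal sketch, assuming a JAX frontend (shapes are placeholders):

from concurrent.futures import ThreadPoolExecutor

import jax
import jax.numpy as jnp

def compile_matmul(n):
    # Distinct shapes force distinct compilations (and autotuning runs).
    x = jnp.ones((n, n), dtype=jnp.bfloat16)
    return jax.jit(lambda a: a @ a).lower(x).compile()

# Several compilations in flight at once within one process is fine:
# XLA serializes the GPU autotuner behind a mutex. Keep it to a single
# process per GPU, though.
with ThreadPoolExecutor(max_workers=4) as pool:
    executables = list(pool.map(compile_matmul, [512, 1024, 2048, 4096]))
print(f"compiled {len(executables)} executables")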