Very bad performance when xla_gpu_enable_triton_gemm=true on MI210/MI300X


Steeve Morin

Sep 10, 2024, 6:38:12 PM
to OpenXLA Discuss
Hello,

We've seen that enabling xla_gpu_enable_triton_gemm severely degrades performance (roughly 50-100x) on AMD CDNA GPUs (MI-series chips).
For instance, Llama 3 8B runs at about 1 tok/s with Triton enabled vs ~66 tok/s with it disabled.

Since it's enabled by default, we were wondering if we did something wrong.

Also, we've seen that NVIDIA recommends disabling it on their GPUs as well: https://github.com/NVIDIA/JAX-Toolbox?tab=readme-ov-file#environment-variables, although we haven't seen any difference on NVIDIA so far.

Should we just go ahead and disable it altogether?
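
(For reference, this is roughly how we disable it today; a minimal sketch in JAX terms, assuming the standard XLA_FLAGS mechanism. Any frontend that forwards XLA_FLAGS to XLA works the same way.)

import os

# XLA flags are read when the backend initializes, so set this
# before the first JAX import.
os.environ["XLA_FLAGS"] = "--xla_gpu_enable_triton_gemm=false"

import jax
import jax.numpy as jnp

x = jnp.ones((4096, 4096), dtype=jnp.bfloat16)
y = jax.jit(lambda a: a @ a)(x)
print(y.block_until_ready().shape)  # GEMMs now take the BLAS path instead of Triton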

Thank you,
Steeve

George Karpenkov

Sep 11, 2024, 10:54:20 AM
to Steeve Morin, Chao Chen, OpenXLA Discuss
We've never tuned Triton tilings for AMD GPUs, and we've never debugged their performance. I'd check which exact kernels are slow in the timeline and potentially file bugs on the Triton AMD backend.
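
For example, with a JAX frontend you can capture a timeline like this (a minimal sketch; the shapes and output path are placeholders):

import jax
import jax.numpy as jnp

@jax.jit
def f(a, b):
    return a @ b

a = jnp.ones((8192, 8192), dtype=jnp.bfloat16)
b = jnp.ones((8192, 8192), dtype=jnp.bfloat16)
f(a, b).block_until_ready()  # compile outside the trace

# Produces a profile viewable in TensorBoard/Perfetto; slow Triton GEMM
# fusions should stand out as long-running kernel slices in the timeline.
with jax.profiler.trace("/tmp/xla-trace"):
    f(a, b).block_until_ready()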

+@Chao Chen can someone from AMD comment on Triton-on-AMD performance? 


> Also, we've seen that NVIDIA recommends disabling it on their GPUs as well: https://github.com/NVIDIA/JAX-Toolbox?tab=readme-ov-file#environment-variables, although we haven't seen any difference on NVIDIA so far.

That's a different question; I think we got mostly aligned with NVIDIA on removing that recommendation, so it shouldn't be valid anymore.


Jay Furmanek

Sep 11, 2024, 11:24:39 AM
to OpenXLA Discuss, George Karpenkov, Steeve Morin, Chao Chen
Hi guys,
We have the Triton GEMM rewriter enabled for AMD, but we don't have the auto-tuner enabled yet (we're working on that).
That "xla_gpu_enable_triton_gemm" flag should probably be off by default for AMD until we finish there.

@Steeve
What Llama 3 model are you using? We can use that one to ensure Triton helps there before re-enabling it once we have the tuner in.
Is this JAX, or are you using Torch-XLA? Also, which GPU are you using? Triton likely won't ever be enabled for anything older than MI200.
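
In the meantime, one way to quantify the gap on your own workload is to time it under both flag settings; a hypothetical sketch (XLA_FLAGS is only read at backend init, so each setting needs a fresh process):

import os, subprocess, sys, textwrap

BENCH = textwrap.dedent("""
    import time, jax, jax.numpy as jnp
    f = jax.jit(lambda a, b: a @ b)
    x = jnp.ones((8192, 8192), dtype=jnp.bfloat16)
    f(x, x).block_until_ready()  # compile + warm up
    t0 = time.perf_counter()
    for _ in range(10):
        f(x, x).block_until_ready()
    print((time.perf_counter() - t0) / 10)
""")

for value in ("true", "false"):
    env = dict(os.environ, XLA_FLAGS=f"--xla_gpu_enable_triton_gemm={value}")
    out = subprocess.run([sys.executable, "-c", BENCH], env=env,
                         capture_output=True, text=True)
    print(f"triton_gemm={value}: {out.stdout.strip()} s/iter")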

George Karpenkov

Sep 11, 2024, 12:02:42 PM
to Jay Furmanek, OpenXLA Discuss, Steeve Morin, Chao Chen
Ah, I guess it's not surprising then; without an autotuner the performance will be quite bad.

Steeve Morin

Sep 11, 2024, 1:11:10 PM
to OpenXLA Discuss, George Karpenkov, Steeve Morin, Chao Chen, Jay Furmanek
Thank you all!!

We’re using our own Llama 3 implementation built on our framework, ZML, which is based on Zig, StableHLO/XLA, and Bazel.
We’d love to give you folks access to the repo; send me your GitHub handles if you’re interested.

More context:
https://x.com/steeve/status/1819005278467015044?s=46&t=IF2rTl8IrJJyT2pzj-XmBQ

Quick side question: we usually run compilations in parallel (in different threads). Does that impact autotuning?

Steeve

George Karpenkov

Sep 12, 2024, 4:26:00 AM
to Steeve Morin, OpenXLA Discuss, Chao Chen, Jay Furmanek
> Quick side question: we usually run compilations in parallel (in different threads). Does that impact autotuning?

No, there's a mutex. Don't use more than one process for a single GPU, though (that probably won't work anyway due to the BFC allocator).
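
Concretely, the safe pattern is many compilation threads within a single process; a minimal sketch, assuming a JAX frontend (shapes are placeholders):

from concurrent.futures import ThreadPoolExecutor

import jax
import jax.numpy as jnp

def compile_matmul(n):
    # Distinct shapes force distinct compilations (and autotuning runs).
    x = jnp.ones((n, n), dtype=jnp.bfloat16)
    return jax.jit(lambda a: a @ a).lower(x).compile()

# Several compilations in flight at once within one process is fine:
# XLA serializes the GPU autotuner behind a mutex. Keep it to a single
# process per GPU, though.
with ThreadPoolExecutor(max_workers=4) as pool:
    executables = list(pool.map(compile_matmul, [512, 1024, 2048, 4096]))
print(f"compiled {len(executables)} executables")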