Starting in <45min: “Optimizing attention for modern hardware” - Tri Dao (Princeton & Together AI)

Nadav Timor

Apr 10, 2025, 11:16:55 AM
to Faster LLM Inference Seminar

Date & Time:

Today, April 10, at 12:00 PM ET



Abstract:

Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention pioneered an approach to speeding up attention on GPUs by minimizing memory reads/writes. We describe recent advances in this area, including optimizations for Hopper GPUs that exploit the asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization, (2) interleave block-wise matmul and softmax operations, and (3) use block quantization and incoherent processing to leverage hardware support for FP8 low precision. These techniques allow us to reach up to 85% of the theoretical maximum TFLOPS. We will then cover advanced techniques and optimizations for inference, such as persistent kernels, load balancing, and GQA packing. Finally, we examine new attention variants designed specifically for inference efficiency and test-time compute.
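For those new to the topic, the core idea behind FlashAttention's memory-read/write minimization is tiling combined with an online (streaming) softmax: keys and values are processed in blocks, so the full N x N attention matrix is never written to slow memory. The sketch below is illustrative only, not the speaker's kernel; the function name flash_attention_sketch and the block size Bc are made up for this example, and a real implementation keeps the blocks in GPU shared memory and registers rather than looping in Python.

import numpy as np

def flash_attention_sketch(Q, K, V, Bc=64):
    """Single-head attention computed block-by-block over K/V using an online softmax."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    row_max = np.full(N, -np.inf)   # running max of scores for each query row
    row_sum = np.zeros(N)           # running softmax denominator for each row
    for start in range(0, N, Bc):
        Kb, Vb = K[start:start + Bc], V[start:start + Bc]
        S = (Q @ Kb.T) * scale                       # scores for this K/V block only
        new_max = np.maximum(row_max, S.max(axis=1))
        corr = np.exp(row_max - new_max)             # rescale previously accumulated stats
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * corr + P.sum(axis=1)
        O = O * corr[:, None] + P @ Vb               # unnormalized running output
        row_max = new_max
    return O / row_sum[:, None]

# Sanity check against naive attention that materializes the full score matrix.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 32)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_sketch(Q, K, V), ref, atol=1e-6)

The Hopper-specific optimizations discussed in the talk (warp-specialization, TMA-driven asynchrony, FP8 block quantization) build on this same blocked structure but overlap the data movement and matmul/softmax work that the loop above performs serially.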


Bio:

Tri Dao is an Assistant Professor at Princeton University and Chief Scientist of Together AI. He completed his PhD in Computer Science at Stanford. He works at the intersection of machine learning and systems, and his research interests include hardware-efficient algorithms and sequence models with long-range memory. His work has received the COLM 2024 Outstanding Paper Award and the ICML 2022 Outstanding Paper Runner-Up Award.


Registration:

https://faster-llms.vercel.app


We look forward to your participation!
