SWE-Glu SF: [Programming Week] Distributed Training Puzzles


Cheikh Fiteni

Sep 28, 2024, 12:10:02 PM
to SWE-Glu SF Papers Reading Group

Weekly update to P(Glu),

Our meeting will be Sunday, September 29th, 2:30 PM @ 848 Divisadero Street. In lieu of a paper reading, this week we’ll be changing things up and programming through Sasha Rush’s LLM Distributed Training Puzzles.


While you don’t have to read any papers ahead of the next meeting, we recommend you read through the Colab and reach out with any questions. If, like us, you’re too curious to wait, these papers and this great EleutherAI post on Transformer (Training) Math might be a great start.


Finally, for the deeply curious, open source has us covered with a production distributed training codebase (supporting heterogeneous nodes!) at OpenDiLoCo.


Why distributed training is cool:

  1. Forces you to think about the massively parallel structure of GPUs, with their SIMD lanes and control-flow-free primitives, which make up the largest part of tech budgets today.

  2. Organizes around distributed-systems challenges like fault tolerance and OOM errors, with constraints that call for good old-fashioned checkpointing.

  3. Makes the scaling hypothesis possible, at least until physics allows single-chip bus widths to grow wide enough.
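
If you want a feel for the flavor of the puzzles before Sunday, here is a minimal sketch of data parallelism in plain Python: each "worker" computes a gradient on its own data shard, an all-reduce averages the gradients, and every worker applies the identical update. This is our own illustration, not taken from the Colab; the names (`loss_grad`, `all_reduce_mean`, `WORLD_SIZE`) are hypothetical, and in a real setup the all-reduce would be a collective like `torch.distributed.all_reduce` over NCCL.

```python
# Toy data-parallel training step, simulated in a single process.
# Hypothetical sketch: loss_grad / all_reduce_mean / WORLD_SIZE are
# illustrative names, not part of the puzzles' actual API.

WORLD_SIZE = 4
LR = 0.01

def loss_grad(w, shard):
    # d/dw of the mean squared error 0.5*(w*x - y)^2 over this worker's shard
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    # In a real job this is a collective (e.g. NCCL all-reduce);
    # here we just average in-process.
    m = sum(values) / len(values)
    return [m] * len(values)

# Fit y = 2x with the data split round-robin across workers.
data = [(x, 2.0 * x) for x in range(1, 9)]
shards = [data[i::WORLD_SIZE] for i in range(WORLD_SIZE)]

w = 0.0
for step in range(50):
    grads = [loss_grad(w, s) for s in shards]   # compute in parallel
    grads = all_reduce_mean(grads)              # communicate
    w -= LR * grads[0]                          # identical update on every worker

print(round(w, 3))  # -> 2.0
```

The key property to notice: because every worker sees the same averaged gradient, all replicas stay bit-identical without ever exchanging parameters, only gradients. The puzzles build from this up to pipeline and tensor parallelism.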


Whether you’re a newcomer to SWE-Glu or you’ve been to every meeting, we hope you can join us for a practical week and get your feet wet.


Best,
Cheikh and Sasha

P.S. If you are somehow reading this email but not on our listserv, join it here. If you are on our listserv, send it to your friends.