astra-sim learner


童青峰

Oct 6, 2025, 11:18:06 PM
to astrasi...@googlegroups.com, willi...@gatech.edu

Dear developers of Astra-sim:

I'm currently learning about astra-sim developed by your team, and I'm taking the ISCA 2022 tutorial course. While working on the four exercises, I encountered the same error in all of them. However, when I changed the configuration files for workload, system, and network to more complex ones - specifically using network: sample_Torus3D.json, system: sample_torus_sys.txt, and workload: microAllReduce.txt - the execution was successful.

I'm wondering if this is due to later changes in the main code that introduced new requirements for the configuration files. Could you please help resolve this issue?

Additionally, I have another question. I'm planning to purchase one or two servers for my own use, considering GPU configurations from options like NVIDIA RTX 5090, RTX A6000, V100, A100, and H100. I have a budget of approximately $80,000. For better development work with Astra-sim, what configuration would you recommend? I understand that current simulation scales typically involve thousands or even tens of thousands of A100 or H100 GPUs, so having just 1-2 servers wouldn't be sufficient for validation experiments. However, I believe it would be helpful for actually testing matrix operations or collective algorithms and collecting real runtime data, which makes me think purchasing servers with specific GPU models is necessary.
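(Editorial illustration, not part of the original message.) One way a small server can feed simulation work is by supplying measured collective runtimes to compare against an analytical or simulated estimate. The sketch below uses the standard alpha-beta cost model for a ring all-reduce. It is not ASTRA-sim code; the function names, parameters, and the sample numbers are all hypothetical and chosen only for illustration.

```python
def ring_allreduce_time(num_gpus, message_bytes, link_latency_s, link_bandwidth_Bps):
    """Alpha-beta estimate for a ring all-reduce.

    A ring all-reduce runs 2 * (p - 1) steps (reduce-scatter, then all-gather),
    and each step sends one chunk of size message_bytes / p over one link.
    """
    if num_gpus < 2:
        return 0.0  # nothing to communicate with a single GPU
    steps = 2 * (num_gpus - 1)
    chunk_bytes = message_bytes / num_gpus
    return steps * (link_latency_s + chunk_bytes / link_bandwidth_Bps)


def relative_error(predicted_s, measured_s):
    """Relative gap between a model prediction and a measured runtime."""
    return abs(predicted_s - measured_s) / measured_s


if __name__ == "__main__":
    # Placeholder inputs (assumptions, not measurements from any real system):
    # 8 GPUs, 256 MiB message, 5 us per-hop latency, 100 GB/s per link.
    predicted = ring_allreduce_time(8, 256 * 2**20, 5e-6, 100e9)
    measured = 5.0e-3  # hypothetical value standing in for a real measurement
    print(f"predicted: {predicted * 1e3:.3f} ms")
    print(f"relative error vs. measurement: {relative_error(predicted, measured):.1%}")
```

In practice the measured value would come from timing the collective on the purchased server, and the latency and bandwidth parameters would be fitted to those measurements before extrapolating to larger scales in simulation.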

Based on your practical experience, could you explain in which aspects servers are actually used during simulation experiments, especially for collecting real experimental data? Also, which server model would you recommend for my current situation? Thank you for your kind help and suggestions.

Best regards,

Qfeng

Yoo, Jinsun

Nov 4, 2025, 1:59:37 AM
to astrasi...@googlegroups.com, Won, William Jonghoon, 童青峰
Hello, 
Thanks for reaching out to the mailing list and apologies for the late reply!

  1. Tutorial
Could you try the MICRO 2024 tutorial, which is our most recent tutorial?

  2. Server configuration
This would heavily depend on what specific research question you wish to answer. Do you simply want to validate ASTRA-sim? Do you want to search for the optimal collective scheduling for a specific workload? etc.

Best,
Jinsun

From: '童青峰' via ASTRA-sim Users <astrasi...@googlegroups.com>
Sent: Monday, October 6, 2025 11:17 PM
To: astrasi...@googlegroups.com <astrasi...@googlegroups.com>; Won, William Jonghoon <willi...@gatech.edu>
Subject: astra-sim learner
 

童青峰

Dec 7, 2025, 9:05:38 PM
to Yoo, Jinsun, astrasi...@googlegroups.com, Won, William Jonghoon

Hello Jinsun:


Hope you’re doing well. Thank you again for your helpful response last time—it gave me a clearer perspective on Astra-Sim. I’ve been continuing to explore the platform and its capabilities in simulating large-scale training workloads, and I’ve been really impressed with its depth and flexibility.

In my current project, I am exploring how simulation platforms can not only predict performance but also actively guide the design of architectures and training strategies. Specifically, I aim to develop a platform that supports fine-grained, customizable simulations of microscopic behaviors—similar to how ANSYS enables detailed structural analysis—to evaluate novel, user-defined strategies for communication, memory, and computation optimization. The goal is to move beyond mimicking existing hardware or strategies and instead provide actionable suggestions for accelerating training, such as recommending optimal parallelization schemes or communication primitives.

I have noticed that Astra-Sim’s published work primarily emphasizes improving simulation accuracy for given configurations, which is of course essential for reliable predictions. However, I am particularly interested in expanding this capability toward a more exploratory and proactive paradigm—where the platform can serve as a "strategy lab" for designing and evaluating innovative acceleration strategies before they are implemented in real systems.

As an experienced developer of Astra-Sim, I would greatly value your thoughts on the following:

  1. How do you view the difference between a simulation platform focused on accuracy (descriptive simulation) and one aimed at guiding strategy design (prescriptive simulation)? Do you think Astra-Sim’s architecture can naturally evolve to support the latter?

  2. Based on your experience, what are the main challenges in enabling simulation platforms to support custom, fine-grained strategies for training acceleration? Are there any extensions or modifications to Astra-Sim that you would recommend for this purpose?

  3. Are there any existing features or APIs in Astra-Sim (e.g., custom schedulers, network models) that you feel are particularly suitable for users to experiment with novel strategies?

  4. Could you share any insights or lessons from your work on how to effectively model and evaluate untested strategies in simulation? Are there best practices for balancing model fidelity with flexibility?

  5. Are there other technologies, research papers, or simulation tools (e.g., Chakra, Alpa, performance modeling frameworks) you would recommend for someone looking to build a strategy-oriented simulation platform?

I believe your expertise and perspective would be incredibly helpful in shaping the direction of my work. I am also open to any suggestions you might have regarding collaboration, references, or potential future developments in this area.

Thank you very much for your time and contributions to the community. I look forward to hearing your thoughts and advice.

Best regards,
Qfeng




-----Original Message-----
From: "Yoo, Jinsun" <jin...@gatech.edu>
Sent: 2025-11-04 14:59:32 (Tuesday)
To: "astrasi...@googlegroups.com" <astrasi...@googlegroups.com>, "Won, William Jonghoon" <willi...@gatech.edu>, 童青峰 <202512...@std.uestc.edu.cn>
Subject: Re: astra-sim learner