Hi AstraSim team,
I hope you are well. We have been trying to use AstraSim with NS3 and have a few basic questions that we were hoping someone could answer. Here they are:
Looking forward to the responses.
Thank you.
Best,
Jit Gupta.
Juniper Business Use Only
--network-configuration flag (ref: https://astra-sim.github.io/astra-sim-docs/network-backend/ns3-network-backend.html#configuring-the-ns3-backend)

CC_MODE will change the congestion control algorithms: 1: DCQCN, 3: HPCC, 7: TIMELY, 8: DCTCP, 10: HPCC-PINT.

TOPOLOGY_FILE item of the configuration passed to --network-configuration, a 1 server with 8 GPU example would look like the text example in the link. (Please ignore the Ring image to the right. I have just found this error & will fix this). A 2 server setup with 4 GPU example would look like this, where switches 8/9 are the intra-node connection, and switch 10 is the inter-node connection.
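The 2-server layout described above can be sketched with a small generator script. This is only an illustration under stated assumptions: it reads "2 servers with 4 GPUs" as 4 GPUs per server (nodes 0-7), and it assumes the HPCC-style topology layout, where the first line holds the node count, switch count, and link count, the second line lists the switch IDs, and each remaining line is "src dst rate delay error_rate". The function name build_topology and the link rate and delay values are placeholders, not values from the AstraSim docs.

```python
def build_topology(num_servers=2, gpus_per_server=4,
                   rate="100Gbps", delay="0.001ms", error_rate=0):
    """Return topology-file text for a two-tier layout (assumed format)."""
    num_gpus = num_servers * gpus_per_server
    # One intra-node switch per server, numbered right after the GPUs.
    intra = [num_gpus + s for s in range(num_servers)]
    # A single inter-node switch, numbered last.
    inter = num_gpus + num_servers
    switches = intra + [inter]

    links = []
    for s in range(num_servers):
        # Connect each GPU of server s to that server's intra-node switch.
        for g in range(gpus_per_server):
            links.append((s * gpus_per_server + g, intra[s]))
        # Connect the intra-node switch to the inter-node switch.
        links.append((intra[s], inter))

    lines = [
        f"{num_gpus + len(switches)} {len(switches)} {len(links)}",
        " ".join(str(sw) for sw in switches),
    ]
    # rate, delay and error_rate are illustrative placeholders.
    lines += [f"{a} {b} {rate} {delay} {error_rate}" for a, b in links]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_topology(), end="")
```

With the defaults this yields 11 nodes, of which 8, 9 and 10 are switches, and 10 links. The header line and link fields should be checked against the text example in the linked documentation before use.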
Hi Jinsun,
Thank you so much for your responses. We tried out some of the things you mentioned in your answers and ran into some further issues. Could you help us with the following questions?
[#1]. The first issue is that when we run astra-sim with ns3, the program hangs in the terminal, regardless of the workload or the number of NPUs. The terminal output is as shown below, and in some instances nothing further is printed even after more than an hour of waiting. We have to press Ctrl+C to exit the ns3 simulation manually, with no indication of whether ns3 has finished. For the Llama2, Resnet50, and DLRM workloads, output is written to the fct.txt and qlen.txt files but not to the other output files such as pfc.txt or flow.txt.

How do we fix ns3 getting stuck and not writing to certain output files? Similar issues are listed on GitHub: https://github.com/astra-sim/astra-sim/issues/172 and https://github.com/astra-sim/astra-sim/issues/240. We tried two things: we reduced the SIMULATOR_STOP_TIME parameter in the config.txt file in scratch/config, and we applied the potential fix discussed in issue 240 (https://github.com/astra-sim/astra-sim/commit/95898d9d6c165e7b09519d96203c99db30fa3c3e) by moving Simulator::Stop before Simulator::Run in the main method in AstraSimNetwork.cc. Neither change solved the problem.
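To make the hang described in #1 easier to reproduce and report, a run can be wrapped in a timeout and the output files checked afterwards. The following is a minimal sketch, not part of AstraSim: the helper names run_with_timeout and report_outputs are hypothetical, and the file names are simply the ones mentioned in this message.

```python
import subprocess
from pathlib import Path


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s."""
    try:
        # subprocess.run kills the child process when the timeout expires.
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


def report_outputs(out_dir,
                   names=("fct.txt", "qlen.txt", "pfc.txt", "flow.txt")):
    """Classify each expected output file as missing, empty, or has data."""
    report = {}
    for name in names:
        path = Path(out_dir) / name
        if not path.exists():
            report[name] = "missing"
        elif path.stat().st_size == 0:
            report[name] = "empty"
        else:
            report[name] = "has data"
    return report
```

Here cmd would be the same argument list used to launch the simulation by hand. A None return value separates "still running at the deadline" from "exited with an error", and the report shows which of the four files received data.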
For reference, our steps for running ns3 are as follows.

[#2]. The second question: if we want to simulate running a workload on 32 GPUs, for example, do we need 32 Chakra traces? When we simulate a workload of 8 traces with 32 GPUs, the ns3 simulator stops immediately without writing to the output files and prints the message below on the terminal. We can change the physical topology file to scale up to 8, 16, 32, etc. GPUs, but do we also need to scale up the number of traces? We are not sure this is feasible, because simulating thousands of GPUs in a cluster would then require access to an actual cluster of that size to collect the Chakra traces.
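The mismatch described in #2 can be checked before launching a run. The sketch below assumes, without confirmation from this thread, that one trace file is expected per simulated NPU and that the files are named <prefix>.<rank>.et. The function name missing_trace_ranks is hypothetical.

```python
import re
from pathlib import Path


def missing_trace_ranks(trace_dir, prefix, num_npus):
    """Return the ranks in 0..num_npus-1 with no '<prefix>.<rank>.et' file."""
    # Assumed naming convention: one file per rank, e.g. 'bert.0.et'.
    pattern = re.compile(re.escape(prefix) + r"\.(\d+)\.et")
    found = set()
    for path in Path(trace_dir).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            found.add(int(match.group(1)))
    return sorted(set(range(num_npus)) - found)
```

For a directory holding 8 traces and a 32-NPU topology, this returns ranks 8 through 31, which makes the gap visible before the simulator exits.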

[#3]. The third question: we generated 4 Chakra traces by instrumenting a BERT model running across 4 GPUs. When we run astra-sim + ns3 on these 4 traces with 4 GPUs, the same hang as in #1 occurs, but this time nothing is written to any of the output files, including fct.txt and qlen.txt. What could be causing this?
Thank you.
Best,
Jit.
From: Yoo, Jinsun <jin...@gatech.edu>
Date: Wednesday, January 15, 2025 at 1:38 PM
To: Jit Gupta <gj...@juniper.net>, astrasi...@googlegroups.com <astrasi...@googlegroups.com>
Subject: Re: AstraSim Queries