AstraSim Queries


Jit Gupta

Dec 18, 2024, 8:25:12 PM
to astrasi...@googlegroups.com

Hi AstraSim team,

 

Hope you are well. We have been trying to use AstraSim with NS3 and have run into some basic questions. We were hoping someone could answer them:

 

  1. Is there a repository of Chakra traces that we can use to run on AstraSim?
  2. We are trying to use Chakra traces generated from an 8-GPU cluster and run them on AstraSim+NS3 with a different topology (e.g. 16 GPUs, 32 GPUs, etc.). How do I do this?
  3. How do I try out a topology with different intra-node configurations, e.g. servers with 1 GPU, 2 GPUs, 4 GPUs, 8 GPUs, etc.?
  4. Does NS3 implement any congestion control mechanism that makes it congestion-aware, e.g. DCQCN?
  5. Is there a way to hook up an actual network to AstraSim? Has any work been done on this front? The network would take the place of NS3 or the Analytical Network backend.

 

Looking forward to the responses.

Thank you.

Best,

Jit Gupta.


Juniper Business Use Only

Yoo, Jinsun

Jan 15, 2025, 4:38:12 PM
to Jit Gupta, astrasi...@googlegroups.com
Hi Jit, 
Hope you are doing well & my apologies for the late reply. 


  1. Is there a repository of Chakra traces that we can use to run on AstraSim?
Please refer to this Google Drive link for some real Chakra traces we have collected: https://drive.google.com/drive/u/4/folders/1r6OngjmeEZkYezOW_7h6nG-0DD-rfxkI
(You will need to join the Chakra working group within MLCommons: https://mlcommons.wpenginepowered.com/working-groups/research/chakra/)
We also have a tool to synthetically generate Chakra traces for transformer models: https://github.com/astra-sim/symbolic_tensor_graph

  4. Does NS3 implement any congestion control mechanism that makes it congestion-aware, e.g. DCQCN?
This link points to the configuration files that set up the ns3 simulator: https://github.com/astra-sim/astra-network-ns3/tree/astra-sim/scratch/config
These configurations are passed to astra-sim with the --network-configuration flag (ref: https://astra-sim.github.io/astra-sim-docs/network-backend/ns3-network-backend.html#configuring-the-ns3-backend)
Within the configuration file, CC_MODE selects the congestion control algorithm: 1: DCQCN, 3: HPCC, 7: TIMELY, 8: DCTCP, 10: HPCC-PINT.
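
As a minimal sketch of switching algorithms, the helper below rewrites the CC_MODE line of a configuration file's text. It assumes the file holds one whitespace-separated "KEY VALUE" pair per line. The function name set_cc_mode and the SOME_OTHER_KEY entry are hypothetical and only there for illustration; the mode numbers are the ones listed above.

```python
import re

# Mode numbers as listed in the reply above.
CC_MODES = {"DCQCN": 1, "HPCC": 3, "TIMELY": 7, "DCTCP": 8, "HPCC-PINT": 10}


def set_cc_mode(config_text: str, algorithm: str) -> str:
    """Return config_text with its CC_MODE line set for the given algorithm.

    Assumes one whitespace-separated "KEY VALUE" pair per line.
    """
    mode = CC_MODES[algorithm.upper()]
    new_line = f"CC_MODE {mode}"
    pattern = re.compile(r"^CC_MODE[ \t]+\S+[ \t]*$", flags=re.MULTILINE)
    if pattern.search(config_text):
        # Replace the existing CC_MODE line and leave everything else untouched.
        return pattern.sub(new_line, config_text, count=1)
    # No CC_MODE line yet: append one at the end.
    separator = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return f"{config_text}{separator}{new_line}\n"


# SOME_OTHER_KEY is a made-up entry standing in for the rest of the file.
sample = "SOME_OTHER_KEY 42\nCC_MODE 1\n"
updated = set_cc_mode(sample, "HPCC")
print(updated)
```

In practice the text would be read from, and written back to, the file passed via --network-configuration.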

  2. We are trying to use Chakra traces generated from an 8-GPU cluster and run them on AstraSim+NS3 with a different topology (e.g. 16 GPUs, 32 GPUs, etc.). How do I do this?
  3. How do I try out a topology with different intra-node configurations, e.g. servers with 1 GPU, 2 GPUs, 4 GPUs, 8 GPUs, etc.?
  • [Q3] The physical topology is specified in the TOPOLOGY_FILE item of the configuration passed to --network-configuration. An example with 1 server of 8 GPUs would look like the text example in the link. (Please ignore the Ring image to its right. I have just found this error & will fix it.) A setup with 2 servers of 4 GPUs each would look like this, where switches 8 and 9 are the intra-node connections and switch 10 is the inter-node connection:
11 3 16
8 9 10 
8 0 400Gbps ...
8 1 400Gbps ...
8 2 400Gbps ...
8 3 400Gbps ...
9 4 400Gbps ...
9 5 400Gbps ...
...
10 0 50Gbps ...
...
10 7 50Gbps ...
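
To try other server and GPU counts, a file in this layout can be generated rather than written by hand. The sketch below is only an illustration. It assumes the first line is "total nodes, number of switches, number of links" (11 = 8 GPUs + 3 switches, and 16 = 8 intra-node + 8 inter-node links in the example above), the second line lists the switch IDs, and each later line is one link. The function name build_topology and its parameters are hypothetical. The trailing per-link fields elided as "..." above are passed through unchanged as link_tail, so copy real values from one of the topology files shipped with the backend.

```python
def build_topology(num_servers, gpus_per_server, intra_bw, inter_bw, link_tail):
    """Build topology-file text: one switch per server plus one inter-node switch.

    link_tail holds the trailing per-link fields (elided as "..." in the
    example above) and is copied verbatim onto every link line.
    """
    num_gpus = num_servers * gpus_per_server
    # GPUs take IDs 0..num_gpus-1; switch IDs follow directly after them.
    intra_switches = [num_gpus + s for s in range(num_servers)]
    switches = list(intra_switches)
    inter_switch = None
    if num_servers > 1:
        # A single server needs no inter-node switch.
        inter_switch = num_gpus + num_servers
        switches.append(inter_switch)

    links = []
    for server, switch in enumerate(intra_switches):
        first_gpu = server * gpus_per_server
        for gpu in range(first_gpu, first_gpu + gpus_per_server):
            links.append(f"{switch} {gpu} {intra_bw} {link_tail}")
    if inter_switch is not None:
        for gpu in range(num_gpus):
            links.append(f"{inter_switch} {gpu} {inter_bw} {link_tail}")

    header = f"{num_gpus + len(switches)} {len(switches)} {len(links)}"
    switch_line = " ".join(str(s) for s in switches)
    return "\n".join([header, switch_line, *links]) + "\n"


# Reproduces the 2-server, 4-GPU example above, keeping the elided fields as "...".
text = build_topology(2, 4, "400Gbps", "50Gbps", "...")
print(text)
```

Calling build_topology(4, 8, ...) would give a 32-GPU layout on the same pattern. Whether the simulator accepts it also depends on the logical topology and workload settings, which this sketch does not touch.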

  5. Is there a way to hook up an actual network to AstraSim? Has any work been done on this front? The network would take the place of NS3 or the Analytical Network backend.
Yes, we have been thinking about this problem & we will keep you updated! :) 


Best,
Jinsun

 




Jit Gupta

Jan 17, 2025, 2:51:54 PM
to Yoo, Jinsun, astrasi...@googlegroups.com

Hi Jinsun,

 

Thank you so much for your responses. We tried out some of the things you mentioned in your answers and ran into some further issues. Can you help us with these questions?

[#1]. The first issue is that when we run astra-sim with ns3, the program hangs in the terminal, regardless of the workload or the number of NPUs. The terminal output is shown below; nothing further is printed, even after more than an hour of waiting in some instances. We have to press Ctrl+C to exit the ns3 simulation, and there is no indication of when ns3 has finished. For the Llama2, Resnet50, and DLRM workloads, output is written to fct.txt and qlen.txt but not to the other output files, such as pfc.txt or flow.txt.

 

[Image: terminal output]

 

How do we fix ns3 getting stuck and not writing to certain output files? Similar issues are reported on GitHub: https://github.com/astra-sim/astra-sim/issues/172 and https://github.com/astra-sim/astra-sim/issues/240. We reduced the SIMULATOR_STOP_TIME parameter in the config.txt file in scratch/config, and we also tried the potential fix discussed in issue 240 (https://github.com/astra-sim/astra-sim/commit/95898d9d6c165e7b09519d96203c99db30fa3c3e), moving Simulator::Stop before Simulator::Run in the main function of AstraSimNetwork.cc. Neither change solved the problem.

 

For reference, our steps to running ns3 are as follows.

  1. We first compile the analytical backend to avoid the issue described here: https://github.com/astra-sim/astra-sim/issues/238
  2. We then compile the ns3 backend. The build file is shown below. The parameters are just the default files in the examples folder, and the physical topology file is the default 8_nodes_1_switch_topology.txt in the scratch/topology folder. When we run different workloads, we change the workload parameter, and we change the physical and logical topology files following the existing structure of those files.
  3. Finally, we run astra-sim + ns3 with ./build/astra_ns3/build.sh -r

[Image: build file]

 

 

[#2]. The second question: if we want to simulate a workload on 32 GPUs, for example, do we need 32 Chakra traces? When we simulate a workload of 8 traces on 32 GPUs, the ns3 simulator stops immediately without writing to the output files and prints the message below in the terminal. We can change the physical topology file to scale up to 8, 16, 32, etc. GPUs, but do we also need to scale up the number of traces? We are not sure this is feasible, because simulating a cluster with thousands of GPUs would then require access to an actual cluster of that size to collect the Chakra traces.

 

[Image: terminal message]

 

[#3]. The third question: we generated 4 Chakra traces by instrumenting a BERT model running across 4 GPUs. When we run astra-sim + ns3 on these 4 traces with 4 GPUs, the simulation hangs as in #1, but this time nothing is written to any output file, including fct.txt and qlen.txt. What could be causing this?

 

Thank you.

Best,

Jit.

 

 



Yoo, Jinsun

Jan 17, 2025, 5:11:23 PM
to Jit Gupta, astrasi...@googlegroups.com
Hi Jit, let me inspect this issue & get back to you, ETA end of weekend. 
In the meantime, could you send the Chakra traces for the BERT model?

Best,
Jinsun
