
--
You received this message because you are subscribed to the Google Groups "ASTRA-sim Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to astrasim-user...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/astrasim-users/TL0P290MB0558632734EB9F34AC47525ED267A%40TL0P290MB0558.ISRP290.PROD.OUTLOOK.COM.
For more options, visit https://groups.google.com/d/optout.

Hello Jalil.
Thank you very much for your detailed response.
Yes, both the computation (COMP) and communication (COMM) nodes show a duration value, as can be seen in the Chakra ET nodes themselves.
"replay-only" is zero by default, but I've added "replay-only": 0 anyway, and also "roofline-enabled": 1.
The "roofline-enabled" flag causes the HTSim simulation to time out (".......Warning: Simulation timed out"), so I removed this flag.
The following graphs visualize the logs I've added in AstraSim and HTSim. The HTSim COMM logs are out of sync with the AstraSim COMM logs, both in duration and in start/end times:
<image.png>
<image.png>
Best,
Jalil
On Wed, May 28, 2025 at 7:58 AM Sharon Fraiman <sha...@optimalnets.com> wrote:
Hello.
I'm running the AstraSim simulation together with the HTSim network backend, feeding it Chakra workload execution files from Meta (Llama2).
I am printing logs in AstraSim at the start and end of a flow (from Workload::issue and Workload::callback), and on the HTSim side in tcp.cpp, also at flow start and end, printing the time in ticks.
The timing that AstraSim outputs seems to match exactly the duration in the Chakra ET files (after running chakra_jsonizer on the ET files), while the HTSim flow start and end times do not seem connected to the AstraSim logs.
Can you please help me understand the issue?
Thanks,
Sharon Fraiman
Optimal Nets
Hello,
Thank you for the note and I wanted to follow up on the email thread.
We will have Adam Latos reply to your message tomorrow EU time.
He is trying to set up his Google account with the proper Marvell account.
If that does not work in a timely manner, he will respond by email.
Regards,
Senad
From: Krishna, Tushar <tus...@ece.gatech.edu>
Sent: Friday, May 30, 2025 11:58 AM
To: Morris, Jalil <jmo...@g.harvard.edu>; Sharon Fraiman <sha...@optimalnets.com>
Cc: astrasi...@googlegroups.com; Lior Friedman <li...@optimalnets.com>; Senad Durakovic <sdura...@marvell.com>
Subject: [EXTERNAL] Re: Running the AstraSim simulation together with the HTSim network backend
I'm cc'ing Senad Durakovic from Marvell, whose team contributed the HTsim network backend to ASTRA-sim. Senad Durakovic, can you please forward this to any relevant folks who can respond?
Thanks,
Tushar
On May 29, 2025 at 5:24 AM -0400, Sharon
Hi Adam,
Thank you for your reply.
I believe I've identified the issue (though not yet the root cause): missing
dependencies between collective communication nodes and computation nodes that
should depend on them.
Issue Details:
I'm using Llama2 Chakra files from the MLCommons public Google Drive as workload
input for Astra-Sim simulation:
- Path: `MLC Public -> Working Groups (Public) -> Chakra -> Chakra Traces -> v0.0.4 -> Llama2`
- Specific file tested: `llama_chakra.0.et`
What I Found:
When I applied `chakra_jsonizer` to convert the ET file to JSON, I discovered
that certain collective communication nodes have no dependent computation or
communication nodes waiting for them to complete. This issue specifically
affects:
- `COMM_COLL_NODE` nodes with `ncclKernel_` prefix in their names
- Nodes where `is_cpu_op` attribute is `false`
Example:
The node shown below (ID 1178) is a collective communication operation, but no
subsequent nodes reference it as a dependency:
```json
{
  "id": "1178",
  "name": "ncclKernel_AllGather_RING_LL_Sum_int8_t(ncclDevComm*, unsigned long, ncclWork*)",
  "type": "COMM_COLL_NODE",
  "ctrlDeps": ["1177"],
  "dataDeps": ["1179", "93", "95", "827", "829", "1174", "1176"],
  "durationMicros": "3898",
  "inputs": {
    "values": "[[226, 92, 0, 25297920, 2, 'cuda:0']]",
    "shapes": "[[25297920]]",
    "types": "['Tensor(c10::Half)']"
  },
  "outputs": {
    "values": "[]",
    "shapes": "[]",
    "types": "[]"
  },
  "attr": [
    {
      "name": "is_cpu_op",
      "boolVal": false
    },
    {
      "name": "tid",
      "int64Val": "84"
    },
    {
      "name": "comm_type",
      "int64Val": "2"
    },
    {
      "name": "comm_size",
      "int64Val": "50595840"
    },
    {
      "name": "involved_dim",
      "boolList": {
        "values": [true]
      }
    }
  ]
}
```
This suggests that the dependency graph may be incomplete. Because no node waits
on these `ncclKernel_` `COMM_COLL_NODE` nodes, Astra-Sim schedules other nodes in
parallel with them, so the collective operations appear to have no impact on
subsequent computations and the reported Job Completion Time (JCT) is wrong.
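The check described above can be sketched in Python. This is only a minimal sketch, assuming the jsonized nodes have already been loaded into a list of dicts that use the field names seen in the example (`id`, `type`, `ctrlDeps`, `dataDeps`). The function name `find_unreferenced_collectives` and the three-node sample graph are hypothetical and not part of Chakra or Astra-Sim.

```python
from typing import Iterable


def find_unreferenced_collectives(nodes: Iterable[dict]) -> list[str]:
    """Return IDs of COMM_COLL_NODE nodes that no other node lists as a dependency.

    Assumption: each node is a dict shaped like the jsonized example,
    with string IDs in "id", "ctrlDeps" and "dataDeps".
    """
    nodes = list(nodes)

    # Collect every ID that appears in some node's control or data dependencies.
    referenced = set()
    for node in nodes:
        referenced.update(str(dep) for dep in node.get("ctrlDeps", []))
        referenced.update(str(dep) for dep in node.get("dataDeps", []))

    # A collective that is never referenced has nothing waiting on it.
    return [
        str(node["id"])
        for node in nodes
        if node.get("type") == "COMM_COLL_NODE" and str(node["id"]) not in referenced
    ]


# Hypothetical three-node graph: node 1180 depends on 1177 but not on the
# collective 1178, so 1178 is reported.
example_nodes = [
    {"id": "1177", "type": "COMP_NODE", "ctrlDeps": [], "dataDeps": []},
    {"id": "1178", "type": "COMM_COLL_NODE", "ctrlDeps": ["1177"], "dataDeps": []},
    {"id": "1180", "type": "COMP_NODE", "ctrlDeps": ["1177"], "dataDeps": []},
]

print(find_unreferenced_collectives(example_nodes))  # prints ['1178']
```

A collective at the very end of a trace may legitimately have no dependents, so the result is a list of candidates to inspect rather than a list of confirmed errors.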
Has anyone else encountered similar dependency issues with the Llama2 Chakra
traces? Any insights into potential causes or solutions would be greatly
appreciated.
Best regards,
Sharon
Dear Sharon,
I saw the discussion regarding the missing dependencies.
It appears that the trace and Chakra version used for linking/conversion are quite outdated.
Path: MLC Public → Working Groups (Public) → Chakra → Chakra Traces → v0.0.4 → Llama2
File tested: llama_chakra.0.et
Since then, Chakra has been updated multiple times to improve its correctness.
Could you try using a more recent trace and the latest Chakra version?
You might find this PR helpful. There is the trace I used previously (Mixtral 8x3B):
https://github.com/mlcommons/chakra/pull/185
Best regards,
Joongun