Hi Joongun,
Thank you very much for looking into this!
I tried both approaches, but neither resolved the issue.
Here is what I found.
For approach 1, I commented out the code that uses HTA in src/trace_link/trace_linker.py in the chakra repo:
sync_deps = self.load_sync_dependencies(rank, chakra_device_trace)
self.enforce_sync_dep(
    kineto_external_id_to_kineto_op_map,
    sorted_kineto_cpu_ops,
    sorted_kineto_cpu_op_ts,
    kineto_tid_ops_map,
    sync_deps,
)
However, after this change, when I run all the commands in the toolchain to generate the final Chakra JSON trace, I don't see any Chakra nodes in the resulting JSON trace file (attached).
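For anyone who wants to check the attached file programmatically, here is a minimal sketch of how the node count could be verified. It assumes the file holds either a single JSON document or several JSON objects written back to back, and it treats any top-level object with both an "id" and a "name" key as a node. Those two key names are my assumption for illustration, not a confirmed part of the Chakra schema, and `count_node_like` is a hypothetical helper.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value found in text, in order.

    Handles one JSON document as well as several values concatenated
    with optional whitespace in between.
    """
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def count_node_like(objects, required_keys=("id", "name")):
    """Count dict objects carrying all required_keys (assumed node keys)."""
    return sum(
        1
        for obj in objects
        if isinstance(obj, dict) and all(key in obj for key in required_keys)
    )


# Tiny in-memory sample standing in for the real trace file.
sample = '{"version": "x"}\n{"id": 1, "name": "a"}{"id": 2, "name": "b"}'
sample_objects = list(iter_json_objects(sample))
sample_node_count = count_node_like(sample_objects)
```

On a real file, the text would come from `open(path).read()`, and `required_keys` would need adjusting to whatever keys the nodes actually carry.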
Then, instead of bypassing HTA, we tried different HTA versions from the last two years (May 18, 2023 to Jun 12, 2025) and did not find one that works. Here is a summary of the versions we tried:
Any version later than Nov 15, 2023 (d755d9940374f389018f9e4f09d94dbd0dca4d06 (v0.2.0)) gave us the following error:
  File "/data/userdata/dli/dev/chakra-dev-new/HolisticTraceAnalysis/hta/analyzers/critical_path_analysis.py", line 843, in _construct_graph_from_kernels
    .join(q[["queue_length"]], on="index_correlation")
TypeError: 'NoneType' object is not subscriptable
A version from Oct 23, 2023 (fc409a2a149f92c76345b933dd7f8148875fb81b (v0.2.0)) gave us this error:
  File "/home/dli/.conda/envs/llama3_trace_collection/lib/python3.9/site-packages/chakra/src/trace_link/trace_linker.py", line 122, in load_sync_dependencies
    cp_graph, success = trace_analysis.critical_path_analysis(
TypeError: cannot unpack non-iterable NoneType object
Any version older than Sep 7, 2023 (54bddd51ffd16f628040453d9b2f508e7d7a47f0 (v0.2.0)) gave us this error:
    from hta.analyzers.critical_path_analysis import CPEdgeType
ModuleNotFoundError: No module named 'hta.analyzers.critical_path_analysis'
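To make these three failure modes easier to tell apart, here is a rough sketch of two checks. The first tests whether a dotted module path can be found in the installed package, which separates the older HTA versions from the newer ones. The second guards the two-value unpacking that raised the TypeError above, on the assumption that the analysis call may return None instead of a pair. Both `has_module` and `unpack_pair_or_failure` are hypothetical helpers, not part of Chakra or HTA.

```python
import importlib.util


def has_module(dotted_name):
    """Return True if dotted_name can be located, without importing it fully.

    find_spec imports the parent package first, so a missing parent
    raises ModuleNotFoundError rather than returning None.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        return False


def unpack_pair_or_failure(result):
    """Turn a None result into (None, False) instead of a TypeError.

    Assumes a successful call returns a 2-tuple such as (graph, success).
    """
    if result is None:
        return None, False
    first, second = result
    return first, second


# Demonstration with the standard library only.
found_stdlib_submodule = has_module("json.decoder")
found_missing_submodule = has_module("json.no_such_submodule_xyz")
guarded = unpack_pair_or_failure(None)
```

In our setup, the name to probe would be `hta.analyzers.critical_path_analysis`, and the guard would wrap the value returned by `trace_analysis.critical_path_analysis(...)`. This only produces a cleaner failure; it does not fix the underlying None return.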
For approach 2, we tried the resnet50.py code you shared, but got the following error when capturing the PyTorch and Kineto traces on an instance with 8 A100 GPUs:
/home/ubuntu/miniconda3/envs/llama3_trace_collection/lib/python3.9/site-packages/torch/profiler/profiler.py:354: UserWarning: Profiler won't be using warmup, this can skew profiler results
  warn("Profiler won't be using warmup, this can skew profiler results")
Process Process-8:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/llama3_trace_collection/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/miniconda3/envs/llama3_trace_collection/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/chakra_trace_capturing_resnet50/resnet50_capture.py", line 149, in init_process
    fn(rank, size)
  File "/home/ubuntu/chakra_trace_capturing_resnet50/resnet50_capture.py", line 108, in example
    with torch.profiler.profile(
TypeError: __init__() got an unexpected keyword argument 'execution_trace_observer'
{<ProfilerActivity.CPU: 0>, <ProfilerActivity.CUDA: 2>}
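My reading of this TypeError is that the profiler class in the installed PyTorch build does not accept the `execution_trace_observer` keyword, though I have not confirmed which release introduced it. Below is a minimal sketch of how that can be checked by inspecting a callable's signature. `accepts_kwarg` is a hypothetical helper, and `OldProfile` and `NewProfile` are stand-in classes I made up so the example runs without torch.

```python
import inspect


def accepts_kwarg(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    try:
        params = inspect.signature(func).parameters
    except (TypeError, ValueError):
        # Some builtins expose no signature; treat that as "unknown, so no".
        return False
    if name in params:
        return params[name].kind is not inspect.Parameter.POSITIONAL_ONLY
    # A **kwargs catch-all accepts any keyword.
    return any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )


class OldProfile:
    """Stand-in for a profiler class without the keyword (hypothetical)."""

    def __init__(self, *, activities=None):
        self.activities = activities


class NewProfile:
    """Stand-in for a profiler class with the keyword (hypothetical)."""

    def __init__(self, *, activities=None, execution_trace_observer=None):
        self.activities = activities
        self.execution_trace_observer = execution_trace_observer


old_ok = accepts_kwarg(OldProfile, "execution_trace_observer")
new_ok = accepts_kwarg(NewProfile, "execution_trace_observer")
```

With torch installed, the same call on `torch.profiler.profile`, together with `torch.__version__`, should show whether the environment on the A100 instance is simply too old for the shared script.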
Could you please advise? Thank you in advance for any response!
Best regards,
Dawei Li
ScalaComputing