Having an issue with capturing/converting chakra execution traces


Dawei Li

Jul 2, 2025, 12:17:17 PM
to joon...@gmail.com, jinsun...@gmail.com, astrasi...@googlegroups.com, Mehryar Garakani
Hi Joongun and Jinsun,

I'm following the instructions on the wiki page to capture and convert chakra traces, using the matrix multiplication example provided in the wiki page:

I'm using PyTorch version 2.2.0, and the latest chakra tool set in the repo.

I've captured the Kineto and PyTorch traces. (The traces are attached to this email.)
However, when I try to use the chakra_trace_link tool to merge them into the PyTorch ET+ trace, it fails.
The command described in the wiki page is out of date: it does not include the rank argument.
Even after I add the rank parameter and use the following command:
chakra_trace_link --rank 0 \
--chakra-host-trace /data/userdata/dli/dev/chakra-dev-new/scala-chakra-toolkit/tmp/chakra_trace_capturing/original_captured_traces_dir/pytorch_et_rank_0.json \
--chakra-device-trace /data/userdata/dli/dev/chakra-dev-new/scala-chakra-toolkit/tmp/chakra_trace_capturing/original_captured_traces_dir/kineto_trace_rank_0.json \
--output-file /data/userdata/dli/dev/chakra-dev-new/scala-chakra-toolkit/tmp/chakra_trace_capturing/merged_host_and_device_json_traces/chakra.0.json
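For multi-rank captures, the same command can be generated per rank. The following is a minimal sketch, assuming the file naming used above (pytorch_et_rank_N.json, kineto_trace_rank_N.json, chakra.N.json); the helper name build_link_command and the directory arguments are hypothetical, and only the four flags shown in the command above are used.

```python
from pathlib import Path


def build_link_command(rank: int, traces_dir: Path, output_dir: Path) -> list:
    """Build the chakra_trace_link argument list for one rank.

    Assumes the per-rank file names used in this thread; adjust them
    if your capture script writes different names.
    """
    return [
        "chakra_trace_link",
        "--rank", str(rank),
        "--chakra-host-trace", str(traces_dir / f"pytorch_et_rank_{rank}.json"),
        "--chakra-device-trace", str(traces_dir / f"kineto_trace_rank_{rank}.json"),
        "--output-file", str(output_dir / f"chakra.{rank}.json"),
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems with long paths.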

I still get an error.
The error message is included in the attached chakra_trace_link_error.log.
Can you please advise? Or, do you have updated instructions for capturing and converting chakra traces that you can share with us?

Best regards,
Dawei Li, ScalaComputing

kineto_trace_rank_0.json
pytorch_et_rank_0.json
chakra_trace_link_error.log

Joongun Park

Jul 2, 2025, 2:20:12 PM
to Dawei Li, jinsun...@gmail.com, astrasi...@googlegroups.com, Mehryar Garakani
Hi Dawei, 

I was able to reproduce the error you encountered.
The error seems to come from HTA (Holistic Trace Analysis), a third-party module that Chakra uses.

I think you can either (1) disable HTA in the code or (2) try with resnet-50 (https://github.com/JoongunPark/kineto/blob/main/tb_plugin/examples/pace-ice-new_resnet50.py). 

Best regards, 
Joongun
image.png


Dawei Li

Jul 2, 2025, 6:03:29 PM
to Joongun Park, jinsun...@gmail.com, astrasi...@googlegroups.com, Mehryar Garakani
Hi Joongun,

Thank you very much for looking into this!
I tried both approaches, but neither has resolved the issue.
This is what I get.

For approach 1, I commented out the code that used HTA in the src/trace_link/trace_linker.py in the chakra repo:
        sync_deps = self.load_sync_dependencies(rank, chakra_device_trace)
        self.enforce_sync_dep(
            kineto_external_id_to_kineto_op_map,
            sorted_kineto_cpu_ops,
            sorted_kineto_cpu_op_ts,
            kineto_tid_ops_map,
            sync_deps,
        )
However, after this change, when I run the rest of the commands in the tool chain to generate the final chakra json trace, I don't see any chakra nodes in the json trace file. (The json trace file is attached.)
Then, instead of bypassing HTA, we tried different HTA versions from the last two years (May 18, 2023 to Jun 12, 2025) and did not find one that works. Here is a summary of the versions we tried:
Any version that is later than Nov 15, 2023 (d755d9940374f389018f9e4f09d94dbd0dca4d06 (v0.2.0)) gave us the following error:
  File "/data/userdata/dli/dev/chakra-dev-new/HolisticTraceAnalysis/hta/analyzers/critical_path_analysis.py", line 843, in _construct_graph_from_kernels
    .join(q[["queue_length"]], on="index_correlation")
TypeError: 'NoneType' object is not subscriptable

A version on Oct 23, 2023 (fc409a2a149f92c76345b933dd7f8148875fb81b (v0.2.0)) gave us this error:
  File "/home/dli/.conda/envs/llama3_trace_collection/lib/python3.9/site-packages/chakra/src/trace_link/trace_linker.py", line 122, in load_sync_dependencies
    cp_graph, success = trace_analysis.critical_path_analysis(
TypeError: cannot unpack non-iterable NoneType object

Any version that is older than Sep 7, 2023 (54bddd51ffd16f628040453d9b2f508e7d7a47f0 (v0.2.0)) gave us this error:
    from hta.analyzers.critical_path_analysis import CPEdgeType
ModuleNotFoundError: No module named 'hta.analyzers.critical_path_analysis'
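The Oct 23, 2023 failure happens because the call returned None while the caller expected a two-element result, and the later failures raise a TypeError inside the analysis itself. The following is a minimal sketch of a defensive wrapper that turns both cases into an explicit (None, False) result. The helper name run_critical_path is hypothetical and is not part of Chakra or HTA; it only makes the failure explicit and does not make the analysis succeed.

```python
from typing import Any, Callable, Optional, Tuple


def run_critical_path(
    analyze: Callable[..., Any], *args: Any, **kwargs: Any
) -> Tuple[Optional[Any], bool]:
    """Call an analysis function and normalize its result to (graph, success).

    `analyze` stands in for a call such as
    trace_analysis.critical_path_analysis; it is passed in so this
    sketch does not depend on any particular HTA version.
    """
    try:
        result = analyze(*args, **kwargs)
    except TypeError as exc:
        # Covers errors like "'NoneType' object is not subscriptable"
        # raised inside the analysis. Note that this also hides
        # TypeErrors caused by calling `analyze` with wrong arguments.
        print(f"critical path analysis failed: {exc}")
        return None, False
    if result is None:
        # Avoids "cannot unpack non-iterable NoneType object".
        return None, False
    graph, success = result
    return graph, bool(success)
```

A caller can then check the success flag and decide whether to continue without sync dependencies or to stop with a clear message.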


For approach 2, we tried the resnet50.py code you shared, but we got the following error while trying to capture the PyTorch and Kineto traces on an instance with 8 A100 GPUs.
/home/ubuntu/miniconda3/envs/llama3_trace_collection/lib/python3.9/site-packages/torch/profiler/profiler.py:354: UserWarning: Profiler won't be using warmup, this can skew profiler results
  warn("Profiler won't be using warmup, this can skew profiler results")
Process Process-8:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/llama3_trace_collection/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/miniconda3/envs/llama3_trace_collection/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/chakra_trace_capturing_resnet50/resnet50_capture.py", line 149, in init_process
    fn(rank, size)
  File "/home/ubuntu/chakra_trace_capturing_resnet50/resnet50_capture.py", line 108, in example
    with torch.profiler.profile(
TypeError: __init__() got an unexpected keyword argument 'execution_trace_observer'
{<ProfilerActivity.CPU: 0>, <ProfilerActivity.CUDA: 2>}

Can you please advise? Appreciate any response in advance!

Best regards,
Dawei Li
ScalaComputing

Joongun Park

Jul 2, 2025, 6:38:52 PM
to Dawei Li, jinsun...@gmail.com, astrasi...@googlegroups.com, Mehryar Garakani
Hi Dawei,

Thank you for sharing the results.

It seems your PyTorch version is a bit outdated, so in the second approach torch.profiler.profile does not recognize the execution_trace_observer keyword argument.

Can you update PyTorch to 2.5 or later?
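A quick way to confirm this before running a full capture is to inspect the profiler's signature. This is a minimal sketch using the standard inspect module; the helper name accepts_keyword is hypothetical, and the only PyTorch name it relies on is torch.profiler.profile from the traceback above.

```python
import inspect


def accepts_keyword(func, name: str) -> bool:
    """Return True if `func` can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        ok = accepts_keyword(torch.profiler.profile, "execution_trace_observer")
        print(f"torch {torch.__version__}: execution_trace_observer accepted: {ok}")
```

If the check prints False, the installed PyTorch is too old for the capture script and should be upgraded first.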

Best regards,
Joongun 

Dawei Li

Jul 3, 2025, 11:50:42 AM
to Joongun Park, jinsun...@gmail.com, astrasi...@googlegroups.com, Mehryar Garakani

Hi Joongun,


Thank you very much for your help! After updating the PyTorch version to 2.5.1, I was able to capture and generate chakra traces for the resnet50 training application. 


I know you guys are working on updating the official wiki page in the Chakra repo. Here are a few issues that I encountered in the process, which you may want to address while updating it. (Included here as this could be helpful to others who are interested.)

  1. The wiki page still lists PyTorch 2.1.2. However, with PyTorch 2.1.2, the captured host traces have schema "1.0.1", which is no longer supported by the recent chakra_trace_link tool. (The supported versions are "1.0.2-chakra.0.0.4", "1.0.3-chakra.0.0.4", "1.1.0-chakra.0.0.4", and "1.1.1-chakra.0.0.4", and we have to use at least PyTorch 2.2.0 for capturing.)

  2. The chakra_trace_link tool described on the wiki page doesn’t accept the rank parameter, but the current chakra_trace_link requires the rank parameter. So, we had to add the rank parameter when using it.

  3. When I tried the capturing and converting example with the simple matrix multiplication application, it still failed in the trace linking step with the issue related to HTA (Holistic Trace Analysis). I captured the traces with both PyTorch 2.2.0 and PyTorch 2.5.1 and got the same failure, so this does not seem to be related to the PyTorch version. (When I follow the capturing and converting steps with the resnet50 training application, it works fine and I can get the final chakra traces.)
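For anyone hitting item 1, the schema of a captured host trace can be checked before linking. This is a minimal sketch that assumes the host trace JSON stores its version in a top-level "schema" field; that field name is an assumption here, while the supported versions are the four listed above. The function names are hypothetical.

```python
import json

# Versions reported above as accepted by the current chakra_trace_link.
SUPPORTED_SCHEMAS = {
    "1.0.2-chakra.0.0.4",
    "1.0.3-chakra.0.0.4",
    "1.1.0-chakra.0.0.4",
    "1.1.1-chakra.0.0.4",
}


def host_trace_schema(path):
    """Return the schema string of a host trace, or None if it is absent.

    Assumes the version lives in a top-level "schema" key.
    """
    with open(path, encoding="utf-8") as f:
        trace = json.load(f)
    return trace.get("schema")


def is_supported(schema) -> bool:
    """Check a schema string against the supported versions."""
    return schema in SUPPORTED_SCHEMAS
```

For example, is_supported(host_trace_schema("pytorch_et_rank_0.json")) would be False for a trace captured with PyTorch 2.1.2, given the "1.0.1" schema described in item 1.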



Thanks again!


Best regards,

Dawei Li

ScalaComputing

Joongun Park

Jul 3, 2025, 12:34:26 PM
to Dawei Li, jinsun...@gmail.com, astrasi...@googlegroups.com, Mehryar Garakani
Hi Dawei, 

Thank you so much for the update and suggestions!
I believe HTA requires more complicated traces for Critical Path Analysis. 

I will update the wiki soon based on your observations.
Thank you again!

Best regards, 
Joongun

