Extra trace files and lower instruction counts in the trace files for certain applications

Vivek Govindasamy

Jun 22, 2026, 1:21:36 PM
to DynamoRIO Users
When generating traces for certain workloads such as stressapp, we see two problems. First, the number of trace files does not match the number of threads the application is run with: extra trace files are generated. Second, the total instruction count in the trace files is lower than the count reported by perf stat.

We currently observe this issue with stressapp and multiload. For most other applications, the number of trace files equals the number of threads and the instruction counts are correct. We use our own instruction counter tool to count the instructions in the trace files. The issue occurs on both x86 and ARM platforms.

The traces are generated using
/bin64/drrun -t drcachesim -offline -outdir . -- ./application
and preprocessed using
/bin64/drrun -t drcachesim -indir tracefile


Bin Wang

Jun 22, 2026, 3:43:19 PM
to DynamoRIO Users
Hi Vivek,

Thank you for reporting this.

1. My guess for the extra trace files is that transient or auxiliary threads were created while the workload was running and were captured by DynamoRIO. Can you share the exact flags and thread counts you are passing when executing stressapp and multiload? How many extra trace files do you observe?
2. Did you run `perf stat -e instructions`? Do you see the inconsistent instruction count for every workload or just for stressapp/multiload? I ask because `perf stat -e instructions` counts both user-space and kernel-space instructions (e.g., system call handling). DynamoRIO operates strictly in user space and has no visibility into the kernel. You could try running `perf stat -e instructions:u` (restricting perf to user-mode code), which eliminates the kernel contribution and should align closely with DR's instruction count.

Vivek Govindasamy

1:49 PM
to DynamoRIO Users
Thank you Bin for the quick response!

1. To generate the traces, we ran the workloads with these commands:

Stressapp - ./stressapptest -s 20 -M 256 -m 8
Multiload - ./multiload -n 2 -t 16 -m 512M
Multichase - ./multiload -n 5 -t 16 -m 512M -1 memcpy-line

We observed 27 trace files when running stressapp, where we expected 8. For multiload and multichase we observed one extra trace file: we ran 16 threads but got 17 trace files.

2. Yes, we ran perf stat in user mode with the :u modifier, i.e. perf stat -e instructions:u ./stressapptest -s 20 -M 256 -m 8.
We have tested many workloads from the SPEC CPU 2017 benchmarks, as well as AI workloads like Llama2, and did not observe this issue: the instruction count matches our profiler, and the number of trace files matches the number of threads.

Please let us know if you have any suggestions or if we can provide further information.

Bin Wang

5:05 PM
to DynamoRIO Users
Hi Vivek,

Thank you for providing the additional information.

1. When you run a workload and tell it to use n worker threads, that does not necessarily mean only n threads are created over the whole lifetime of the workload. A common pattern is that a main thread parses the input, starts n worker threads, then waits for all of them to finish (join). In that case you get n+1 threads, which is what happens with multiload. Stressapp is more complex because it creates additional auxiliary threads: for each worker thread, there is a fill thread that populates the work queue before the worker starts, and a verification thread that checks the results after it finishes. There is also a logger thread and an ErrorPollThread. So the total is 3 x 8 + 1 (main) + 1 (Logger) + 1 (ErrorPoll) = 27. You can check this by searching for SpawnThread() and pthread_create() in the stressapp codebase.

2. The instruction count difference could be explained by certain instructions being counted differently by perf and DR. For example, `rep movsb` might be counted multiple times by perf (once per iteration), while DR counts it as only 1 instruction. We need more information to be sure. How many instructions are counted by perf and by DR in your experiments? Can you also record INST_RETIRED.REP_ITERATION with perf in your experiments?
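
As a rough sketch, the thread accounting in point 1 and the counting difference suggested in point 2 can be written out in a few lines of Python. The function names and the toy trace are hypothetical and exist only for illustration. The per-worker and singleton thread counts are taken from the breakdown above, not from measuring stressapp, and the per-iteration counting of `rep movsb` is still only a hypothesis at this point in the thread.

```python
def expected_threads_join_pattern(workers: int) -> int:
    """Main thread plus n workers (the multiload case described above)."""
    return workers + 1


def expected_threads_stressapp(workers: int) -> int:
    """Thread total following the breakdown above (an assumption, not a measurement)."""
    per_worker = 3   # fill thread + worker thread + verification thread
    singletons = 3   # main + Logger + ErrorPoll
    return per_worker * workers + singletons


def count_instructions(trace, per_iteration: bool) -> int:
    """Count a toy trace of (mnemonic, iterations) pairs.

    per_iteration=False counts each instruction once.
    per_iteration=True counts a rep-prefixed instruction once per iteration,
    which models the hypothesized difference and nothing more.
    """
    return sum(iters if per_iteration else 1 for _, iters in trace)


if __name__ == "__main__":
    print(expected_threads_join_pattern(16))  # multiload run with -t 16
    print(expected_threads_stressapp(8))      # stressapp run with -m 8

    # Made-up trace: two ordinary instructions and one rep movsb of 64 iterations.
    toy_trace = [("mov", 1), ("rep movsb", 64), ("add", 1)]
    print(count_instructions(toy_trace, per_iteration=False))
    print(count_instructions(toy_trace, per_iteration=True))
```

With 16 and 8 workers the two thread formulas give 17 and 27, matching the trace file counts reported earlier. The toy trace shows how a single string instruction could open a large gap between two counters that disagree only on how to count repetitions.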