--
You received this message because you are subscribed to the Google Groups "DynamoRIO Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dynamorio-use...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/dynamorio-users/cdcef637-606c-4e03-aa80-b85001dd9131n%40googlegroups.com.
Hi Derek and Abhinav,
I'm looking to extend memtrace_x86 to build a cache simulator for multi-core CPUs. I know DR has a simulator, drcachesim, which is pretty powerful, but it uses a multi-process structure that can be slow. Since memtrace_x86 is efficient at gathering memory traces, I’m thinking about building a faster cache simulator by keeping everything in a single process.
I've also noticed that drcachesim reports fewer coherence-related cache misses than expected. Adding timestamps to the memory traces might help address this.
I have two approaches in mind, but I’m not sure if either of them would work effectively. Would you please give me some advice?
Approach 1: Use a Dedicated Thread for Cache Simulation
I want to use a specific thread to handle the cache simulation. This thread would pull memory traces from the private data of other threads and run the simulation independently. The problem is, I’m not sure how to create this thread in DR. It looks like pthread_create isn’t supported, so I’m wondering if dr_create_client_thread() could work instead.
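To make Approach 1 concrete, here is a minimal sketch of the hand-off pattern in plain Python, not DynamoRIO client code. The names (`producer`, `simulator`, `run_dedicated`) and the 64-byte line size are assumptions for illustration only. In a real client the simulation thread would presumably be started through dr_create_client_thread() and the queue replaced by a DR-safe buffer, which is exactly the open question above.

```python
import queue
import threading

LINE_SIZE = 64  # assumed cache-line size in bytes


def producer(tid, addrs, out_q):
    """Stand-in for an application thread: hands off its trace as one batch."""
    out_q.put((tid, list(addrs)))


def simulator(in_q, n_producers, touched):
    """Dedicated thread: drains batches and records the lines each thread touched."""
    done = 0
    while done < n_producers:
        tid, batch = in_q.get()  # blocks until a batch is available
        touched.setdefault(tid, set()).update(a // LINE_SIZE for a in batch)
        done += 1


def run_dedicated(traces):
    """traces: {thread id: [addresses]}. Returns {thread id: set of line numbers}."""
    q = queue.Queue()
    touched = {}
    sim = threading.Thread(target=simulator, args=(q, len(traces), touched))
    sim.start()
    workers = [
        threading.Thread(target=producer, args=(tid, addrs, q))
        for tid, addrs in traces.items()
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    sim.join()
    return touched
```

Only the simulation thread touches the simulator state here, so no lock is needed around it; the cost moves to the queue hand-off instead.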
Approach 2: Use Shared Data Between Threads
The cache simulator would live in shared memory, allocated with dr_global_alloc(). Threads would update this shared memory whenever they acquire a memory trace. To ensure consistency, I’d use locks, but I’m worried this might slow things down too much.
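Approach 2 could look roughly like the sketch below, again in plain Python rather than as a DR client. `SharedLineCounter` and `run_shared` are hypothetical names, and the per-line access counter only stands in for real cache state. The sketch takes the lock once per batch instead of once per access, which is one common way to limit the locking overhead I'm worried about.

```python
import threading


class SharedLineCounter:
    """Hypothetical shared simulator state guarded by a single lock."""

    def __init__(self, line_size=64):
        self.line_size = line_size  # assumed cache-line size in bytes
        self.lock = threading.Lock()
        self.access_counts = {}

    def record_batch(self, addrs):
        # Acquire the lock once per batch rather than once per access.
        with self.lock:
            for a in addrs:
                line = a // self.line_size
                self.access_counts[line] = self.access_counts.get(line, 0) + 1


def run_shared(traces, batch_size=2):
    """traces: list of per-thread address lists. Returns {line number: access count}."""
    sim = SharedLineCounter()

    def worker(addrs):
        for i in range(0, len(addrs), batch_size):
            sim.record_batch(addrs[i:i + batch_size])

    threads = [threading.Thread(target=worker, args=(a,)) for a in traces]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sim.access_counts
```

Larger batches reduce contention but also coarsen the interleaving the simulator sees, which matters for coherence modelling.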
Does either of these approaches make sense? Are there better ways to optimize this or avoid bottlenecks?
Looking forward to feedback and suggestions!
Best regards,
Jin
Hi Derek,
Thanks for your response! I’m from the University of Queensland. We are working on using cache simulation to improve data-centric performance debugging. An online simulator avoids having to keep large memory trace files, and if online simulation is fast enough, it can help in studying expected cache behaviour. Performance counters, by comparison, are fast but can produce confusing results.
To evaluate cache coherence traffic, I compared the L1D cache misses reported by drcachesim on a simple benchmark against those reported by PAPI. The benchmark runs multiple threads, each repeatedly updating a shared variable in a loop. Concurrent writes can optionally be serialized with a lock. With ideal scheduling, the number of cache misses generated by each thread should approximate its loop count, or a multiple of it, depending on whether the lock is enabled.
The tests were conducted on an Intel Sapphire Rapids node using the following command:
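The expectation above can be illustrated with a toy model of invalidation-based coherence on a single shared line. This is a simplification I am assuming for illustration (one owner at a time, every update treated as one write, no lock variable), not how drcachesim or the hardware actually works.

```python
def count_write_misses(schedule):
    """schedule: core ids in the global order of their writes to one shared line.

    A write misses whenever the line was last written by a different core,
    because that earlier write invalidated this core's copy.
    """
    owner = None
    misses = {}
    for core in schedule:
        misses.setdefault(core, 0)
        if owner != core:
            misses[core] += 1
            owner = core
    return misses


def round_robin(n_cores, loops):
    """Finest interleaving: cores take turns, one write each."""
    return [c for _ in range(loops) for c in range(n_cores)]


def back_to_back(n_cores, loops):
    """Coarsest interleaving: each core finishes all its writes before the next starts."""
    return [c for c in range(n_cores) for _ in range(loops)]
```

With `round_robin(3, 1000)` every core misses on each of its 1000 writes, while `back_to_back(3, 1000)` gives one miss per core. In this model the miss count depends entirely on how finely the writes from different threads are interleaved.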
drrun -t drcachesim -data_prefetcher none -sched_quantum 1000 -sched_time_units_per_us 100000 -cores 3 -L1D_size 48K -L1D_assoc 12 -LL_size 60M -LL_assoc 15 -coherence -- my_benchmark …
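For reference, the cache geometry implied by these sizes and associativities can be worked out as below. The 64-byte line size is an assumption (I did not pass a line-size option), and `num_sets` is just an illustrative helper.

```python
def num_sets(size_bytes, assoc, line_size=64):
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    return size_bytes // (assoc * line_size)


l1d_sets = num_sets(48 * 1024, 12)        # L1D: 48K, 12-way
ll_sets = num_sets(60 * 1024 * 1024, 15)  # LL: 60M, 15-way
```

Under that assumption the L1D has 64 sets and the last-level cache has 65536 sets.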
I adjusted -data_prefetcher, -sched_quantum and -sched_time_units_per_us, but the results didn’t change significantly. I ran the test multiple times to filter out outliers and focused on L1 data cache misses. PAPI reported an L1_DCM count much lower than expected, mainly due to modern CPU optimizations on shared cache lines. drcachesim reported only around 10% of the expected cache misses. I didn’t investigate how the simulator handles time, but I think preserving the relative order of memory operations across threads would allow coherence traffic to be replayed faithfully. Therefore, with your help, I used memtrace_x86 to add a timestamp to each memory trace entry. Checking the generated trace files shows cache misses close to expectation.
I’m also interested in augmenting drcachesim to improve its online-mode performance. Does your team have the bandwidth to explore this enhancement? If not, I’d be happy to try it with your guidance.
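The timestamp idea amounts to merging the per-thread traces into one global order before feeding them to the simulator. A minimal sketch, assuming a hypothetical record layout of `(timestamp, address, is_write)` per entry and per-thread traces that are already sorted by timestamp (this is not the actual memtrace_x86 output format):

```python
import heapq


def merge_by_timestamp(per_thread):
    """per_thread: {tid: [(timestamp, addr, is_write), ...]}, each list sorted by timestamp.

    Returns one list of (timestamp, tid, addr, is_write) in global timestamp order.
    Ties on timestamp are broken by thread id.
    """
    streams = [
        [(ts, tid, addr, is_write) for ts, addr, is_write in records]
        for tid, records in per_thread.items()
    ]
    return list(heapq.merge(*streams))


example = {
    0: [(1, 0x40, True), (4, 0x40, True)],
    1: [(2, 0x40, True), (3, 0x80, False)],
}
merged = merge_by_timestamp(example)
```

Here the merged order of thread ids is 0, 1, 1, 0, so the two writes to 0x40 from thread 0 are separated by a write from thread 1, and a coherence-aware simulator would see the invalidation between them.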
Finally, I hope I used drcachesim correctly. If you’d like, I can share my testing example and results.
Best regards,
Chao Jin