HELP on adding a timestamp to each memory access in memtrace

Chao Jin

Nov 19, 2024, 7:20:17 PM

to DynamoRIO Users

Dear DR friends,

I'm extending memtrace_x86.c to add a timestamp to each memory access using rdtsc. My changes cause an internal DR crash. I have attached the source code.

The following is the diff between my code and the original code:

80d79
< uint64 tsc_hi, tsc_lo;
367c366
< fprintf(data->logf, PIFX ",%c,%d," PIFX "," PIFX "\n", (ptr_uint_t)mem_ref->pc,
---
> fprintf(data->logf, PIFX ",%c,%d," PIFX "\n", (ptr_uint_t)mem_ref->pc,
369c368
< (ptr_uint_t)mem_ref->addr, (ptr_uint_t)(mem_ref->tsc_lo | (mem_ref->tsc_hi << 32)));
---
> (ptr_uint_t)mem_ref->addr);
466d464
< * buf_ptr->tsc = rdstc;
505,526d502
<
< /* Store tsc in memory ref */
< reg_id_t scratch1 = DR_REG_EDX;
< reg_id_t scratch2 = DR_REG_EAX;
< dr_save_reg(drcontext, ilist, where, scratch1, SPILL_SLOT_1);
< dr_save_reg(drcontext, ilist, where, scratch2, SPILL_SLOT_2);
<
< instr = INSTR_CREATE_rdtsc(drcontext);
< instrlist_meta_preinsert(ilist, where, instr);
<
< opnd1 = OPND_CREATE_MEMPTR(reg1, offsetof(mem_ref_t, tsc_hi));
< opnd2 = opnd_create_reg(scratch1);
< instr = INSTR_CREATE_mov_st(drcontext, opnd1, opnd2);
< instrlist_meta_preinsert(ilist, where, instr);
<
< opnd1 = OPND_CREATE_MEMPTR(reg2, offsetof(mem_ref_t, tsc_lo));
< opnd2 = opnd_create_reg(scratch2);
< instr = INSTR_CREATE_mov_st(drcontext, opnd1, opnd2);
< instrlist_meta_preinsert(ilist, where, instr);
<
< dr_restore_reg(drcontext, ilist, NULL, scratch1, SPILL_SLOT_1);
< dr_restore_reg(drcontext, ilist, NULL, scratch2, SPILL_SLOT_2);

Specifically, I did the following:

1) updated _mem_ref_t

typedef struct _mem_ref_t {
bool write;
void *addr;
size_t size;
app_pc pc;
uint64 tsc_hi, tsc_lo; //to keep EDX and EAX returned by rdtsc
} mem_ref_t;

2) updated instrument_mem(...) to add the following lines:

/* Store tsc in memory ref */
reg_id_t scratch1 = DR_REG_EDX;
reg_id_t scratch2 = DR_REG_EAX;
dr_save_reg(drcontext, ilist, where, scratch1, SPILL_SLOT_1);
dr_save_reg(drcontext, ilist, where, scratch2, SPILL_SLOT_2);

instr = INSTR_CREATE_rdtsc(drcontext);
instrlist_meta_preinsert(ilist, where, instr);

opnd1 = OPND_CREATE_MEMPTR(reg1, offsetof(mem_ref_t, tsc_hi));
opnd2 = opnd_create_reg(scratch1);
instr = INSTR_CREATE_mov_st(drcontext, opnd1, opnd2);
instrlist_meta_preinsert(ilist, where, instr);

opnd1 = OPND_CREATE_MEMPTR(reg2, offsetof(mem_ref_t, tsc_lo));
opnd2 = opnd_create_reg(scratch2);
instr = INSTR_CREATE_mov_st(drcontext, opnd1, opnd2);
instrlist_meta_preinsert(ilist, where, instr);

dr_restore_reg(drcontext, ilist, NULL, scratch1, SPILL_SLOT_1);
dr_restore_reg(drcontext, ilist, NULL, scratch2, SPILL_SLOT_2);
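
A note on the arithmetic in the fprintf change above: rdtsc returns the high 32 bits of the counter in EDX and the low 32 bits in EAX, and on x86-64 it clears the upper halves of RDX and RAX, so the two stored values can be combined with a shift and an OR. The following is a minimal standalone sketch of that step only. combine_tsc is a hypothetical helper name used for illustration; it is not part of memtrace_x86.c or the DR API.

```c
#include <stdint.h>

/* Hypothetical helper: combine the EDX (high) and EAX (low) halves
 * returned by rdtsc into a single 64-bit timestamp. */
static uint64_t
combine_tsc(uint64_t tsc_hi, uint64_t tsc_lo)
{
    /* Mask each half to 32 bits so that stale upper bits in either
     * stored slot cannot leak into the result. */
    return ((tsc_hi & 0xffffffffu) << 32) | (tsc_lo & 0xffffffffu);
}
```

For example, combine_tsc(0x12, 0x89abcdef) yields 0x1289abcdef. Note that printing the result through a cast to ptr_uint_t, as in the diff, preserves all 64 bits only on a 64-bit build.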

I tried my updates: bin64/drrun -debug -c xxx/api/bin/libmemtrace_x86_text.so -- mytest

drrun crashed with the following errors:

1)

<Application /home/jinchao/Work/MemProfiling/mine/benchmark/fs/pthreads/false_sharing.exe (945763) DynamoRIO usage error : dr_save_reg requires pointer-sized gpr>
<Usage error: dr_save_reg requires pointer-sized gpr (/home/jinchao/Work/tools/DynamoRIO/working/dynamorio/core/lib/instrument.c, line 5524)

2)

<Application /home/jinchao/Work/MemProfiling/mine/benchmark/fs/pthreads/false_sharing.exe (946640) DynamoRIO usage error : instr_encode error: no encoding found (see log)>
<Usage error: instr_encode error: no encoding found (see log) (/home/jinchao/Work/tools/DynamoRIO/working/dynamorio/core/ir/x86/encode.c, line 3195)

I guess this error is caused by INSTR_CREATE_rdtsc().

For my test, all log files contained nothing. How can I fix the above errors?

Thanks for your help!

Chao

memtrace_x86.c

Abhinav Sharma

Nov 20, 2024, 10:56:48 AM

to DynamoRIO Users

Hi,

> Usage error: dr_save_reg requires pointer-sized gpr (/home/jinchao/Work/tools/DynamoRIO/working/dynamorio/core/lib/instrument.c, line 5524

Your implementation invokes dr_save_reg with DR_REG_EDX, which is a 32-bit register. It looks like you are on a 64-bit machine, so you should use DR_REG_RDX instead. Better yet, simply use DR_REG_XDX, which is set to the correct reg (edx or rdx) based on the environment.

> Usage error: instr_encode error: no encoding found (see log)

You can find more information in the DR debug logfiles.

> I guess this error is caused by INSTR_CREATE_rdtsc().

You can confirm this from the log file. I think it may be the INSTR_CREATE_mov_st, which currently uses a 32-bit reg for opnd2 (see the edx vs. rdx point noted above).

> For my test, all log files contained nothing.

See https://dynamorio.org/page_logging.html#autotoc_md207 on how to get debug logs. Briefly: you'd need to build DR with -DDEBUG=ON, and specify "-debug -loglevel 1" when you invoke drrun.

Hope this helps.

Abhinav

Derek Bruening

Nov 20, 2024, 2:16:26 PM

to Abhinav Sharma, DynamoRIO Users

Another key point is that memtrace_x86.c is using drreg (https://dynamorio.org/page_drreg.html) to preserve registers. You should generally not mix the base dr_save_reg() API with drreg as you can too easily end up with conflicts. Using dr_save_reg() only in an inner scope might happen to work if you gave drreg enough of its own separate TLS slots, but it's fragile. Best to switch your code to use drreg instead.


Chao Jin

Nov 22, 2024, 10:35:27 AM

to DynamoRIO Users

Hi Abhinav and Derek,

Thank you for your quick response!

I have tried your suggestions, and made some progress. But my update still has some issues. Could you please help me with the following questions?

1) 64-bit registers DR_REG_RDX and DR_REG_RAX are used for INSTR_CREATE_rdtsc, but it generated the following error:

${DR_DEBUG}/bin64/drrun -debug -loglevel 4 -c libmemtrace_x86_text.so -- ls
<log dir=/home/jinchao/Work/tools/DynamoRIO/working/dynamorio/build/bin64/../logs/ls.1013981.00000000>
<Starting application /usr/bin/ls (1013981)>
<Initial options = -no_dynamic_options -loglevel 4 -client_lib '/home/jinchao/Work/tools/DynamoRIO/working/dynamorio/build/api/bin/libmemtrace_x86_text.so;0;' -client_lib64 '/home/jinchao/Work/tools/DynamoRIO/working/dynamorio/build/api/bin/libmemtrace_x86_text.so;0;' -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >

.

<(1+x) Handling our fault in a TRY at 0x00007fca3a8f0f3a>
<spurious rep/repne prefix @0x00007fc9f65f9a00 (f3 0f 1e fa): >
Client memtrace is running
Data file /home/jinchao/Work/tools/DynamoRIO/working/dynamorio/build/api/bin/memtrace.ls.1013981.0000.log created
<curiosity: rex.w on OPSZ_6_irex10_short4!>
<Application /usr/bin/ls (1013981). Application exception at PC 0x00007fca3a5e3e92.
Signal 11 delivered to application as default action.
Callstack:
0x00007fca3a5e3e92 </usr/lib/x86_64-linux-gnu/ld-2.31.so+0x1e92>
0x00007fca3a5e3108 </usr/lib/x86_64-linux-gnu/ld-2.31.so+0x1108>
>
<Stopping application /usr/bin/ls (1013981)>
Instrumentation results:
saw 13 memory references

<Application /usr/bin/ls (1013981). Internal Error: DynamoRIO debug check failure: /home/jinchao/Work/tools/DynamoRIO/working/dynamorio/core/heap.c:1975 IF_WINDOWS(doing_detach ||) vmh->num_free_blocks == vmh->num_blocks - unfreed_blocks || ((ever_beyond_vmm IF_WINDOWS(|| get_os_version() >= WINDOWS_VERSION_8_1)) && vmh->num_free_blocks >= vmh->num_blocks - unfreed_blocks)
(Error occurred @9 frags in tid 1013981)

If required, I'll send log files.

2) I tried to follow Derek's suggestion not to use dr_save_reg(), but I couldn't figure out which drreg APIs suit my case.

a. I need to call rdtsc to read the time-stamp counter, which returns its value in EDX:EAX. Therefore, I need to save RDX and RAX before the rdtsc and restore them afterwards.

Shall I use drreg_get_app_value()? Using drreg_get_app_value() to save DR_REG_RDX and DR_REG_RAX, how to restore them? Shall I use drreg_restore_app_values?

b. for my case, I need to check whether reserved registers (reg1 and reg2) that are used to store and calculate memory addresses conflicts with RDX:RAX. It is a bit verbose to compare whether the two sets of registers are the same. Is there a better approach?

I attached my updates; the diff against the original code is as follows:

80d79
< uint64 tsc_hi, tsc_lo;
367c366
< fprintf(data->logf, PIFX ",%c,%d," PIFX "," PIFX "\n", (ptr_uint_t)mem_ref->pc,
---
> fprintf(data->logf, PIFX ",%c,%d," PIFX "\n", (ptr_uint_t)mem_ref->pc,
369c368
< (ptr_uint_t)mem_ref->addr, (ptr_uint_t)(mem_ref->tsc_lo | (mem_ref->tsc_hi << 32)));
---
> (ptr_uint_t)mem_ref->addr);

433c432
< reg_id_t reg1, reg2, reg3;
---
> reg_id_t reg1, reg2;
437,439d435
< reg_id_t scratch1 = DR_REG_RDX;
< reg_id_t scratch2 = DR_REG_RAX;
<
449d444
< drreg_reserve_register(drcontext, ilist, where, NULL, &reg3) != DRREG_SUCCESS ||
470d464

< * buf_ptr->tsc = rdstc;

510,545d503

< /* Store tsc in memory ref */

< dr_log(NULL, DR_LOG_ALL, 1, "Client 'memtrace' reg1: %x, reg2: %x, reg3: %x, scratch1: %x, scratch2: %x\n", reg1, reg2, reg3, scratch1, scratch2);
<
< if(scratch1 != reg1 && scratch1 != reg3 && scratch1 != reg3)
< {
< dr_log(NULL, DR_LOG_ALL, 1, "Client 'memtrace' save scratch1: %x to reg1: %x\n", scratch1, reg1);
< drreg_get_app_value(drcontext, ilist, where, scratch1, reg1);
< }
<
< if(scratch2 != reg1 && scratch2 != reg2 && scratch2 != reg3)
< {
< if(scratch1 == reg1 || scratch1 == reg2)
< {
< dr_log(NULL, DR_LOG_ALL, 1, "Client 'memtrace' save scratch2: %x to reg3: %x\n", scratch2, reg3);
< drreg_get_app_value(drcontext, ilist, where, scratch2, reg3);
< }
< else
< {
< dr_log(NULL, DR_LOG_ALL, 1, "Client 'memtrace' save scratch2: %x to reg1: %x\n", scratch2, reg1);
< drreg_get_app_value(drcontext, ilist, where, scratch2, reg1);

< }
< }
<
< instr = INSTR_CREATE_rdtsc(drcontext);
< instrlist_meta_preinsert(ilist, where, instr);
<

< opnd1 = OPND_CREATE_MEMPTR(reg2, offsetof(mem_ref_t, tsc_hi));

< opnd2 = opnd_create_reg(scratch1);
< instr = INSTR_CREATE_mov_st(drcontext, opnd1, opnd2);
< instrlist_meta_preinsert(ilist, where, instr);
<
< opnd1 = OPND_CREATE_MEMPTR(reg2, offsetof(mem_ref_t, tsc_lo));
< opnd2 = opnd_create_reg(scratch2);
< instr = INSTR_CREATE_mov_st(drcontext, opnd1, opnd2);
< instrlist_meta_preinsert(ilist, where, instr);

<
607d564
< drreg_unreserve_register(drcontext, ilist, where, reg3) != DRREG_SUCCESS ||

memtrace_x86.c

Abhinav Sharma

Nov 22, 2024, 3:08:36 PM

to DynamoRIO Users

> I have tried your suggestions, and made some progress.

Good to know there's been progress.

> 64-bit registers DR_REG_RDX and DR_REG_RAX are used for INSTR_CREATE_rdtsc, but it generated the following error:

So the first error that you see is:

<Application /usr/bin/ls (1013981). Application exception at PC 0x00007fca3a5e3e92.
Signal 11 delivered to application as default action.

Could it be some issue in your instrumentation that's causing the SIGSEGV signal? Maybe run in gdb and check what's at 0x00007fca3a5e3e92; see https://dynamorio.org/page_debugging.html for tips on debugging using gdb.

A few other things that could help:

- does the app run properly without any client? (Just plain DynamoRIO without any client specified)

- if it does, then try adding your instrumentation in steps to see which one caused the crash.

- another possibility is that the crash is caused by mixing the drreg and base reg reservation APIs (as Derek suggested)

> Shall I use drreg_get_app_value()? Using drreg_get_app_value() to save DR_REG_RDX and DR_REG_RAX, how to restore them? Shall I use drreg_restore_app_values?

You just need drreg_reserve_register (https://dynamorio.org/group__drreg.html#gadc8a4ec5c9263b11c18bcd1bd2c6b104) and drreg_unreserve_register (https://dynamorio.org/group__drreg.html#ga3226af61d5322e93c97546e49d79d983) around your instrumentation; these APIs will handle spilling and restoring automatically. Remember to pass the required reg_allowed to drreg_reserve_register so you get the regs you want (rdx and rax).

If you need to see examples, take a look at the drreg unit tests (https://github.com/DynamoRIO/dynamorio/blob/5e6429b71801d2ed043074eaac0862f962ee9b93/suite/tests/client-interface/drreg-test.dll.c#L588); may be a bit difficult to read if you're not familiar with the code though. There are also some examples in our sample clients, like https://github.com/DynamoRIO/dynamorio/blob/5e6429b71801d2ed043074eaac0862f962ee9b93/api/samples/bbbuf.c#L82.

> for my case, I need to check whether reserved registers (reg1 and reg2) that are used to store and calculate memory addresses conflicts with RDX:RAX

As suggested above, you can specify reg_allowed to drreg_reserve_register to constrain reg selection and avoid conflicts.
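
To illustrate why constraining the allowed set removes the need for manual register comparisons, here is a toy model in plain C. It is only a sketch of the reserve/unreserve pattern: it does not use the DR API, the register ids and the toy_* names are hypothetical, and drreg's real selection logic is more involved. In drreg itself the allowed set is a drvector_t passed as reg_allowed, which can be built with drreg_init_and_fill_vector() and drreg_set_vector_entry().

```c
#include <stdint.h>

/* Toy register ids for this model only; these are not DR's reg_id_t values. */
enum { R_AX, R_CX, R_DX, R_BX, R_COUNT, R_NONE = -1 };

/* Bit i set means toy register i is currently reserved. */
typedef struct {
    uint32_t in_use;
} toy_pool_t;

/* Reserve the lowest-numbered free register whose bit is set in
 * `allowed`. Returns R_NONE if no allowed register is free. */
static int
toy_reserve(toy_pool_t *pool, uint32_t allowed)
{
    for (int r = 0; r < R_COUNT; r++) {
        uint32_t bit = 1u << r;
        if ((allowed & bit) != 0 && (pool->in_use & bit) == 0) {
            pool->in_use |= bit;
            return r;
        }
    }
    return R_NONE;
}

/* Release a register previously returned by toy_reserve(). */
static void
toy_unreserve(toy_pool_t *pool, int reg)
{
    if (reg >= 0 && reg < R_COUNT)
        pool->in_use &= ~(1u << reg);
}
```

In this model, reserving R_AX and R_DX first with single-register allowed masks means that any later unconstrained reservation (the analogue of reg1 and reg2) can never return either of them, so no explicit comparison against rax and rdx is needed.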

Chao Jin

Nov 24, 2024, 2:18:04 PM

to DynamoRIO Users

Thanks for your comments, Abhinav. Everything is working now.

Chao Jin

Dec 15, 2024, 6:39:16 AM

to DynamoRIO Users

Hi Derek and Abhinav,

I'm looking to extend memtrace_x86 to build a cache simulator for multi-core CPUs. I know DR has a simulator, drcachesim, which is pretty powerful, but it uses a multi-process structure that can be slow. Since memtrace_x86 is efficient at gathering memory traces, I’m thinking about building a faster cache simulator by keeping everything in a single process.

I've also noticed that drcachesim reports fewer coherence-related cache misses than expected. Maybe adding timestamps to memory traces could help address this issue.

I have two approaches in mind, but I’m not sure if either of them would work effectively. Would you please give me some advice?

Approach 1: Use a Dedicated Thread for Cache Simulation

I want to use a specific thread to handle the cache simulation. This thread would pull memory traces from the private data of other threads and run the simulation independently. The problem is, I’m not sure how to create this thread in DR. It looks like pthread_create isn’t supported, so I’m wondering if dr_create_client_thread() could work instead.

Approach 2: Use Shared Data Between Threads

The cache simulator would live in shared memory, allocated with dr_global_alloc(). Threads would update this shared memory whenever they acquire a memory trace. To ensure consistency, I’d use locks, but I’m worried this might slow things down too much.

Does either of these approaches make sense? Are there better ways to optimize this or avoid bottlenecks?

Looking forward to feedback and suggestions!

Best regards,

Jin

Derek Bruening

Dec 16, 2024, 10:57:23 PM

to Chao Jin, DynamoRIO Users

Generally drcachesim is used as an offline simulator these days. If your use case can record an offline trace and simulate it separately, then I would say it would be better to use drcachesim (and augment it if there are any missing features). Are there reasons you need online simulation? Even there I would consider augmenting drcachesim rather than building a whole new simulator as you'd likely just end up duplicating a lot of very similar code in the simulation itself (unless you had plans for very different key design points from drcachesim). It seems like drcachesim could be given a feature of avoiding the piped data and living in-process; the private loader should keep it isolated.

Re: coherence not looking right: could you elaborate? If you think you've found a bug or problem in the drcachesim code, we would like to know about it.


Derek Bruening

Dec 17, 2024, 6:01:36 PM

to Jin, Chao, DynamoRIO Users

Several comments here:

The -sched_* parameters only apply to dynamic rescheduling which is only supported in offline mode. Online mode is currently limited to a simple static mapping of threads to virtual cores.
Online mode batches up trace records and only sends them over the pipe to the simulator when a buffer fills up. Thus, probably what is happening is that each thread has a several thousand instruction sequence with the only interleaving at buffer boundaries, resulting in 9 out of 10 accesses being in a row from the same thread on the same virtual core, explaining your 10% result.
If you instead used offline mode with dynamic rescheduling via -core_serial (should be the default for offline analysis with recent builds), you would get fine-grained interleaving of threads and should see something matching your expectations. The offline scheduling is also much better than the very simple online "scheduler".
We would like to see online mode improved, but it is not a current use case for us so while we could consult and advise we have no plans for direct development in that area.
I would still think it would make more sense to start with drcachesim and add an in-process online mode or otherwise augment the existing online mode than to try to start with memtrace_x86. The drmemtrace framework has a lot of infrastructure to leverage for analysis tools vs starting from scratch; the drmemtrace tracer has many improvements beyond memtrace_x86; the drcachesim simulator operates on the drmemtrace record format; there are just many things it seems you'd end up duplicating trying to use a different tracer/format without any related infrastructure.
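
The batching effect described above can be reproduced with a toy model. The sketch below is an assumption-laden illustration, not drcachesim's algorithm: it tracks a single shared cache line with one owner, counts a miss whenever a write comes from a core that does not own the line, and interleaves the cores round-robin in runs of `batch` writes. The function name and parameters are hypothetical, and a batch of 10 is chosen only to mirror the 9-out-of-10 ratio mentioned above, not the real buffer size.

```c
#include <stddef.h>

/* Count write misses on one shared cache line when `threads` cores each
 * issue `writes_per_thread` writes, interleaved round-robin in runs of
 * up to `batch` consecutive writes per core. A write misses when the
 * line is owned by another core (or by nobody); the writer then becomes
 * the owner. Returns 0 if batch is 0. */
static size_t
count_coherence_misses(int threads, size_t writes_per_thread, size_t batch)
{
    size_t misses = 0;
    int owner = -1; /* no core owns the line initially */
    if (batch == 0)
        return 0;
    for (size_t done = 0; done < writes_per_thread; done += batch) {
        size_t left = writes_per_thread - done;
        size_t run = left < batch ? left : batch;
        for (int core = 0; core < threads; core++) {
            for (size_t i = 0; i < run; i++) {
                if (owner != core) {
                    misses++;
                    owner = core;
                }
            }
        }
    }
    return misses;
}
```

With 3 cores and 1000 writes each, a batch of 1 (fine-grained interleaving) gives 3000 misses, one per write, while a batch of 10 gives 300, which is 10% of that.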

On Tue, Dec 17, 2024 at 6:12 AM Jin, Chao <jin...@gmail.com> wrote:

Hi Derek,

Thanks for your response! I’m from the University of Queensland. We are working on using cache simulation to improve data-centric performance debugging. An online simulator avoids keeping large memory trace files. If online simulation is fast enough, it can help in studying expected cache behaviour. Performance counters, by comparison, are fast but can produce confusing results.

To evaluate cache coherence traffic, I tested L1D cache misses reported by drcachesim using a simple benchmark and compared the results with PAPI. The benchmark consists of multiple threads, with each thread repeatedly updating a shared variable in a loop. Concurrent writes can be serialized using a lock. With ideal scheduling, the expected number of cache misses generated by each thread should approximate its loop count, or a multiple of it, depending on whether the lock is enabled.

The tests were conducted on an Intel Sapphire Rapids node using the following cmd:

drrun -t drcachesim -data_prefetcher none -sched_quantum 1000 -sched_time_units_per_us 100000 -cores 3 -L1D_size 48K -L1D_assoc 12 -LL_size 60M -LL_assoc 15 -coherence -- my_benchmark …

I adjusted -data_prefetcher, -sched_quantum and -sched_time_units_per_us, but the result didn’t change significantly. The test was run multiple times to filter out outliers, and I focused on L1 data cache misses. PAPI reported an L1_DCM count much lower than expected, mainly due to modern CPU optimizations on shared cachelines. drcachesim reported around 10% of the expected cache misses. I didn’t investigate how the simulator handles time, but I think that maintaining the relative order of memory operations from different threads would allow coherence traffic to be replayed faithfully. Therefore, with your help, I used memtrace_x86 to add a timestamp to each memory trace entry. Checking the generated trace files shows cache misses close to expectation.
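
As an illustration of ordering memory operations from different threads by timestamp, the following sketch merges two per-thread traces into one global sequence. It is a minimal example under stated assumptions: each per-thread trace is already sorted by its timestamp, timestamps taken on different cores are assumed to be comparable, and trace_entry_t, its thread_id field, and merge_by_tsc are hypothetical names rather than memtrace_x86 or drmemtrace definitions.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t tsc;  /* timestamp recorded with the access */
    int thread_id; /* hypothetical field: which thread's trace it came from */
} trace_entry_t;

/* Merge two per-thread traces, each already sorted by tsc, into `out`,
 * which must have room for na + nb entries. Ties go to trace `a`.
 * Returns the number of entries written. */
static size_t
merge_by_tsc(const trace_entry_t *a, size_t na, const trace_entry_t *b, size_t nb,
             trace_entry_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i].tsc <= b[j].tsc) ? a[i++] : b[j++];
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
    return k;
}
```

A cache model fed from the merged sequence would then see accesses from different threads in timestamp order rather than in per-thread runs. More than two threads could be handled by repeated merging or a heap-based merge.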

I’m also interested in augmenting drcachesim to improve its online mode performance. Does your team have the bandwidth to explore this enhancement? If not, I’d be happy to try it with your guidance.

Finally, I hope I used drcachesim correctly. If you’d like, I can share my testing example and results.

Best regards,

Chao Jin
