Thread interleaving when using online drmemtrace


algra...@gmail.com

Feb 9, 2026, 8:30:14 PM
to DynamoRIO Users
We have a custom drmemtrace tool that processes memrefs online.
The typical workload is multithreaded, with heavy data sharing and
synchronization between threads; the purpose of the tool is to understand
these data-sharing patterns. Ideally we want to observe individual reads
and writes (and barriers) in an order as close as possible to the order
in which they actually occurred.
It seems that drmemtrace tools are designed to run in a single thread
and consume events from all traced threads, with memref.data.tid
indicating the originating thread. What we observe is a batch of
events from thread A, then a batch from thread B, and so on, rather than
individual events interleaved in the order we believe the workload issues them.
Is this something we can control? Is there any way to have events pushed
to the tool thread more frequently?
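
For reference, here is a minimal sketch of the single-threaded consumption
model described above, assuming the standard drmemtrace analysis_tool_t
interface; the sharing-pattern bookkeeping itself is hypothetical:

    #include "analysis_tool.h"

    // All threads' records arrive on one tool thread, distinguished
    // only by memref.data.tid.
    class sharing_tool_t : public analysis_tool_t {
    public:
        bool
        process_memref(const memref_t &memref) override
        {
            if (memref.data.type == TRACE_TYPE_READ ||
                memref.data.type == TRACE_TYPE_WRITE) {
                // In practice, records from thread A arrive here in
                // batches rather than interleaved with thread B's.
                record_access(memref.data.tid, memref.data.addr,
                              memref.data.type == TRACE_TYPE_WRITE);
            }
            return true;
        }
        bool
        print_results() override
        {
            return true;
        }
    private:
        void
        record_access(memref_tid_t tid, addr_t addr, bool is_write)
        {
            // Sharing-pattern bookkeeping elided.
        }
    };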

Derek Bruening

Feb 9, 2026, 10:24:35 PM
to algra...@gmail.com, DynamoRIO Users
I believe this comes from the tracer trying to write blocks that are as large as possible to the pipe.
If you reduce named_pipe_t::get_atomic_write_size() you should get finer-grained data, at the cost of more pipe writes.
If reducing the value works out nicely, perhaps a runtime option could be added to control this?
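
A minimal sketch of the kind of change suggested here, assuming
get_atomic_write_size() lives in named_pipe.cpp and currently returns the
OS atomic-pipe limit (the real signature and default may differ):

    // Sketch only; the real named_pipe.cpp may differ.
    ssize_t
    named_pipe_t::get_atomic_write_size() const
    {
        // A smaller value forces finer-grained pipe writes (better
        // interleaving) at the cost of more write() calls. It must
        // remain >= the largest indivisible group of trace records.
        return 512; // Experimental; PIPE_BUF is 4096 on Linux.
    }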

algra...@gmail.com

Feb 10, 2026, 4:35:53 AM
to DynamoRIO Users
Thanks - looking at the code (tracer.cpp), I have reservations about this
approach. The pipe atomic limit needs to be large enough that groups of
related records can always be written out atomically, so reducing the limit
below the maximum size of an atomic group would break drmemtrace. E.g.,
suppose an instruction can generate up to 20 records (maybe a SIMD gather
load or scatter store?), or there are other sequences that need atomicity
(a comment in output_buffer says a branch and its target need to be output
together). get_atomic_write_size() needs to be large enough that those can
be written atomically; at its default of 4096, it probably is. But I don't
want the tracer to hold on to 20 independent records (or 4096 bytes of
records) waiting to write them out as a chunk, or even as multiple atomic
writes. The idea would be to write minimal groups of records to the pipe as
soon as they are ready, using atomicity only to preserve the record grouping
mandated by the trace format, and not to batch up independent records simply
to reduce write() calls.
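
To make the size constraint concrete, here is a hypothetical guard (names
invented for illustration), assuming raw records are fixed-size
trace_entry_t structs:

    #include <cassert>
    #include <cstddef>

    // Hypothetical invariant: the atomic pipe limit must cover the
    // largest indivisible record group (e.g. ~20 records for a
    // gather/scatter, or a branch plus its target).
    static void
    check_atomic_limit(size_t atomic_write_size, size_t entry_size)
    {
        const size_t MAX_GROUP_RECORDS = 20; // Assumed worst case.
        assert(atomic_write_size >= MAX_GROUP_RECORDS * entry_size);
    }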

I'm having difficulty seeing where to change this in tracer.cpp.
It looks like output_buffer() writes out all available data in a buffer and
creates a new one. Reducing the buffer size below the maximum group size
to force more frequent pipe writes would break drmemtrace for the same
reason that reducing the pipe atomic write limit would. So instead we would
need to write out record groups as they become available and advance the
buffer pointer. I just don't know whether that is feasible without
completely re-engineering tracer.cpp.
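
To illustrate what I mean, here is the kind of restructuring I have in
mind; every name below is hypothetical, since tracer.cpp does not work
this way today:

    // Hypothetical group-granular flush: write each complete record
    // group as soon as it is finished and advance the read pointer,
    // rather than draining the whole buffer in output_buffer().
    // Atomicity then covers only one minimal group at a time.
    struct group_writer_t {
        unsigned char *group_start; // First unwritten record group.
        unsigned char *cur;         // End of valid data filled so far.

        void
        flush_ready_groups()
        {
            unsigned char *group_end;
            while ((group_end = find_group_end(group_start, cur)) != nullptr) {
                // One atomic pipe write per minimal group; each group
                // must fit within the pipe's atomic write limit.
                atomic_pipe_write(group_start, group_end - group_start);
                group_start = group_end;
            }
        }
        // Trace-format-aware scan for the end of the next complete
        // group; returns nullptr if no full group is ready yet.
        unsigned char *find_group_end(unsigned char *start, unsigned char *end);
        // Hypothetical wrapper around the atomic pipe write.
        void atomic_pipe_write(unsigned char *start, ptrdiff_t size);
    };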

Derek Bruening

Feb 10, 2026, 11:11:21 AM
to algra...@gmail.com, DynamoRIO Users
Actually, the pipe limit by itself would not help: the whole buffer has already been handed off by the time the output code splits it up for the pipe.
As you say, you'd need the tracer to send its buffer for output more often.
The tracer naturally completes an entire basic block at once, using a redzone to handle overflow.
An easy change is to output after each basic block: just always go to the callout instead of checking the redzone. All related records will then be grouped, except branches and their targets, but a flexible tool should be able to handle that.
To make output even more frequent you could lower -max_bb_instrs; no need to rewrite any code.
But overhead will go up.
The alternative would be to embed a nanosecond-granularity timestamp per block, or per instruction, in the current buffers and sort afterward.
That will add overhead too.
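
As a rough sketch of the sort-later option, assuming a hypothetical
per-block nanosecond timestamp were added to each record:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Hypothetical post-processing: each record carries the
    // CLOCK_MONOTONIC nanosecond timestamp of its enclosing block, so
    // per-thread streams can be merged into approximate global order.
    struct stamped_record_t {
        uint64_t timestamp_ns; // Per-block timestamp (hypothetical).
        int64_t tid;
        // Memref payload elided.
    };

    static void
    merge_by_time(std::vector<stamped_record_t> &records)
    {
        // A stable sort preserves program order within each block.
        std::stable_sort(records.begin(), records.end(),
                         [](const stamped_record_t &a,
                            const stamped_record_t &b) {
                             return a.timestamp_ns < b.timestamp_ns;
                         });
    }

Records sharing a timestamp keep their per-thread order under a stable
sort, so the resulting interleaving is only as fine as the timestamp
granularity.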
