Thread interleaving when using online drmemtrace


algra...@gmail.com

Feb 9, 2026, 8:30:14 PM
to DynamoRIO Users
We have a custom drmemtrace tool that processes memrefs online.
The typical workload is multithreaded, with heavy data sharing and
synchronization between threads; the purpose of the tool is to understand
these data-sharing patterns. Ideally we want to observe individual reads
and writes (and barriers) in an order as close as possible to the order
in which they actually occurred.
It seems that drmemtrace tools are designed to run in a single thread
and consume events from all traced threads, with memref.data.tid
indicating the originating thread. What we observe is a batch of
events from thread A, then a batch from thread B, and so on, rather than
individual events interleaved in the order we believe the workload issues them.
Is this something we can control? Is there any way to have events pushed
to the tool thread more frequently?
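
For reference, here is a minimal sketch of the single-threaded consumption
model described above, assuming the standard drmemtrace analysis_tool_t
interface; the sharing-pattern bookkeeping itself is hypothetical:

    #include "analysis_tool.h"

    // All threads' records arrive on one tool thread, distinguished
    // only by memref.data.tid.
    class sharing_tool_t : public analysis_tool_t {
    public:
        bool
        process_memref(const memref_t &memref) override
        {
            if (memref.data.type == TRACE_TYPE_READ ||
                memref.data.type == TRACE_TYPE_WRITE) {
                // In practice, records from thread A arrive here in
                // batches rather than interleaved with thread B's.
                record_access(memref.data.tid, memref.data.addr,
                              memref.data.type == TRACE_TYPE_WRITE);
            }
            return true;
        }
        bool
        print_results() override
        {
            return true;
        }
    private:
        void
        record_access(memref_tid_t tid, addr_t addr, bool is_write)
        {
            // Sharing-pattern bookkeeping elided.
        }
    };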

Derek Bruening

Feb 9, 2026, 10:24:35 PM
to algra...@gmail.com, DynamoRIO Users
I believe this comes from the tracer trying to write blocks that are as large as possible to the pipe.
If you reduce named_pipe_t::get_atomic_write_size() you should get finer-grained data, at the cost of more pipe writes.
If reducing the value works out nicely, perhaps a runtime option could be added to control this?
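
A minimal sketch of the kind of change suggested here, assuming
get_atomic_write_size() lives in named_pipe.cpp and currently returns the
OS atomic-pipe limit (the real signature and default may differ):

    // Sketch only; the real named_pipe.cpp may differ.
    ssize_t
    named_pipe_t::get_atomic_write_size() const
    {
        // A smaller value forces finer-grained pipe writes (better
        // interleaving) at the cost of more write() calls. It must
        // remain >= the largest indivisible group of trace records.
        return 512; // Experimental; PIPE_BUF is 4096 on Linux.
    }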

algra...@gmail.com

Feb 10, 2026, 4:35:53 AM
to DynamoRIO Users
Thanks - looking at the code (tracer.cpp), I have reservations about this
approach. The pipe atomic limit needs to be large enough that groups of
related records can always be written out atomically, so reducing the limit
below the maximum size of an atomic group would break drmemtrace. E.g.,
suppose an instruction can generate up to 20 records (maybe a SIMD gather
load or scatter store?), or there are other sequences that need atomicity
(a comment in output_buffer says a branch and its target need to be output
together). get_atomic_write_size() needs to be large enough that those can
be written atomically; at its default of 4096, it probably is. But I don't
want the tracer to hold on to 20 independent records (or 4096 bytes of
records) waiting to write them out as a chunk, or even as multiple atomic
writes. The idea would be to write minimal groups of records to the pipe as
soon as they are ready, using atomicity only to preserve the record grouping
mandated by the trace format, and not to batch up independent records simply
to reduce write() calls.
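
To make the size constraint concrete, here is a hypothetical guard (names
invented for illustration), assuming raw records are fixed-size
trace_entry_t structs:

    #include <cassert>
    #include <cstddef>

    // Hypothetical invariant: the atomic pipe limit must cover the
    // largest indivisible record group (e.g. ~20 records for a
    // gather/scatter, or a branch plus its target).
    static void
    check_atomic_limit(size_t atomic_write_size, size_t entry_size)
    {
        const size_t MAX_GROUP_RECORDS = 20; // Assumed worst case.
        assert(atomic_write_size >= MAX_GROUP_RECORDS * entry_size);
    }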

I'm having difficulty seeing where to change this in tracer.cpp.
It looks like output_buffer() writes out all available data in a buffer and
creates a new one. Reducing the buffer size below the maximum group size
to force more frequent pipe writes would break drmemtrace for the same
reason that reducing the pipe atomic write limit would. So instead we would
need to write out record groups as they become available and advance the
buffer pointer. I just don't know whether that is feasible without
completely re-engineering tracer.cpp.
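
To illustrate what I mean, here is the kind of restructuring I have in
mind; every name below is hypothetical, since tracer.cpp does not work
this way today:

    // Hypothetical group-granular flush: write each complete record
    // group as soon as it is finished and advance the read pointer,
    // rather than draining the whole buffer in output_buffer().
    // Atomicity then covers only one minimal group at a time.
    struct group_writer_t {
        unsigned char *group_start; // First unwritten record group.
        unsigned char *cur;         // End of valid data filled so far.

        void
        flush_ready_groups()
        {
            unsigned char *group_end;
            while ((group_end = find_group_end(group_start, cur)) != nullptr) {
                // One atomic pipe write per minimal group; each group
                // must fit within the pipe's atomic write limit.
                atomic_pipe_write(group_start, group_end - group_start);
                group_start = group_end;
            }
        }
        // Trace-format-aware scan for the end of the next complete
        // group; returns nullptr if no full group is ready yet.
        unsigned char *find_group_end(unsigned char *start, unsigned char *end);
        // Hypothetical wrapper around the atomic pipe write.
        void atomic_pipe_write(unsigned char *start, ptrdiff_t size);
    };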

Derek Bruening

Feb 10, 2026, 11:11:21 AM
to algra...@gmail.com, DynamoRIO Users
Actually, the pipe limit by itself would not help: the whole buffer has already been handed off by the time the output code splits it up for the pipe.
As you say, you'd need the tracer to send its buffer for output more often.
The tracer naturally completes an entire basic block at once, using a redzone to handle overflow.
An easy change is to output after each basic block: just always go to the callout instead of checking the redzone. All related records will then be grouped, except branches and their targets, but a flexible tool should be able to handle that.
To make output even more frequent you could lower -max_bb_instrs; no need to rewrite any code.
But overhead will go up.
The alternative would be to embed a nanosecond-granularity timestamp per block, or per instruction, in the current buffers and sort afterward.
That will add overhead too.
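
As a rough sketch of the sort-later option, assuming a hypothetical
per-block nanosecond timestamp were added to each record:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Hypothetical post-processing: each record carries the
    // CLOCK_MONOTONIC nanosecond timestamp of its enclosing block, so
    // per-thread streams can be merged into approximate global order.
    struct stamped_record_t {
        uint64_t timestamp_ns; // Per-block timestamp (hypothetical).
        int64_t tid;
        // Memref payload elided.
    };

    static void
    merge_by_time(std::vector<stamped_record_t> &records)
    {
        // A stable sort preserves program order within each block.
        std::stable_sort(records.begin(), records.end(),
                         [](const stamped_record_t &a,
                            const stamped_record_t &b) {
                             return a.timestamp_ns < b.timestamp_ns;
                         });
    }

Records sharing a timestamp keep their per-thread order under a stable
sort, so the resulting interleaving is only as fine as the timestamp
granularity.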
