memtrace.c example so slow


Wonjoon Song

Sep 13, 2012, 5:47:45 PM
to dynamor...@googlegroups.com
Hello All

I am trying to use DynamoRIO, Valgrind, and Pin to check every memory reference (somewhat like Dr. Memory or Memcheck, but tracking which part of memory, heap, stack, or global, the program is reading or writing) and to compare their performance. I am adding code to memtrace, but memtrace seems very slow. I've commented out the printing part, and it is still slow compared to Valgrind and Pin. I use bzip2 on a 10MB random file to check performance, and drmemory's memtrace seems to be about 5 times slower than the others (it takes about 5 minutes while the others take about 1 minute; natively it takes about 2 seconds). I've read the paper that says Dr. Memory is faster than Valgrind's Memcheck. Is memtrace not optimized? Am I doing something wrong?

Derek Bruening

Sep 13, 2012, 5:54:13 PM
to dynamor...@googlegroups.com
The sample client memtrace.c that comes with DynamoRIO has no relation to Dr. Memory.  The two have completely different goals.  Dr. Memory has no need to store the entire history of memory accesses: it checks each as it occurs, updating and propagating shadow values that reflect the current state of memory.  OTOH, memtrace wants a record of the full history, which it writes to a buffer which is written out to a file.
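
To make the contrast concrete, here is a minimal sketch in plain C. It is a model only, not DynamoRIO or Dr. Memory code, and every name in it is hypothetical. The checker-style handler overwrites a fixed-size shadow table in place, so its state does not grow with the length of the run. The tracer-style handler appends one record per access and has to drain the buffer each time it fills.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model; not DynamoRIO or Dr. Memory code. */

#define SHADOW_SLOTS 1024  /* fixed-size state, independent of run length */
#define TRACE_SLOTS  8     /* tiny on purpose so flushes are easy to see  */

typedef struct {
    uintptr_t addr;
    uint32_t  size;
    int       is_write;
} mem_ref_t;

static uint8_t   shadow[SHADOW_SLOTS]; /* current state only            */
static mem_ref_t trace[TRACE_SLOTS];   /* history, drained when full    */
static size_t    trace_pos;
static size_t    records_flushed;      /* stands in for file output     */

/* Checker style: update the shadow value for the touched location. */
static void on_access_shadow(uintptr_t addr, int is_write)
{
    if (is_write)
        shadow[addr % SHADOW_SLOTS] = 1; /* e.g. mark as "defined" */
}

/* Tracer style: one record per access, plus a flush when the buffer fills. */
static void on_access_trace(uintptr_t addr, uint32_t size, int is_write)
{
    assert(trace_pos < TRACE_SLOTS);
    trace[trace_pos].addr = addr;
    trace[trace_pos].size = size;
    trace[trace_pos].is_write = is_write;
    trace_pos++;
    if (trace_pos == TRACE_SLOTS) {
        records_flushed += trace_pos; /* a real tracer would write these out */
        trace_pos = 0;
    }
}
```

In this model the output of the tracer grows linearly with the number of accesses, while the checker only ever touches its fixed table.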

- Derek



--
You received this message because you are subscribed to the Google Groups "DynamoRIO Users" group.
To view this discussion on the web visit https://groups.google.com/d/msg/dynamorio-users/-/iTJY9kPLxnAJ.
To post to this group, send email to dynamor...@googlegroups.com.
To unsubscribe from this group, send email to dynamorio-use...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/dynamorio-users?hl=en.

Qin Zhao

Sep 13, 2012, 6:19:11 PM
to dynamor...@googlegroups.com
memtrace is an example that shows how to collect memory references and dump them to a file; it is not optimized.
You can tune the buffer size and see whether the performance changes.
I am not sure how you changed the code, but if you remove all the code in Pin's or Valgrind's callout, the instrumentation might be removed completely.
In contrast, memtrace inserts code to fill the buffer, which is always executed even if the clean call is optimized away.

Qin



Wonjoon Song

Sep 13, 2012, 6:22:00 PM
to dynamor...@googlegroups.com
Hello Derek

Thank you for the reply. I am using memtrace to track every memory reference and find out how many bytes the program reads or writes in each region (stack, heap, global). memtrace seems significantly slower than the similar examples that Valgrind and Pin provide, so I was wondering whether I'm doing something wrong, since DynamoRIO is regarded as the fastest DBI. Is there any optimization I could do?

- Wonjoon

Qin Zhao

Sep 13, 2012, 6:28:03 PM
to dynamor...@googlegroups.com
 
If you want to count the size of memory references, you should start from an example like instruction count: get each memory reference's size and update the counter once per basic block.
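
A minimal sketch of that idea in plain C, assuming a simplified model of a basic block (the `instr_info_t` and `block_totals_t` types and both function names are hypothetical, not the DynamoRIO API). The sizes are summed once when the block is analyzed, and each execution of the block then costs a single pair of additions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a basic block; not the DynamoRIO API. */
typedef struct {
    uint32_t read_size;  /* bytes read by this instruction, 0 if none    */
    uint32_t write_size; /* bytes written by this instruction, 0 if none */
} instr_info_t;

typedef struct {
    uint64_t read_bytes;
    uint64_t write_bytes;
} block_totals_t;

static uint64_t total_read;
static uint64_t total_written;

/* "Instrumentation time": runs once, when the block is first seen. */
static block_totals_t analyze_block(const instr_info_t *instrs, size_t n)
{
    block_totals_t t = {0, 0};
    assert(instrs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        t.read_bytes  += instrs[i].read_size;
        t.write_bytes += instrs[i].write_size;
    }
    return t;
}

/* "Execution time": the only work added per block execution. */
static void on_block_executed(const block_totals_t *t)
{
    total_read    += t->read_bytes;
    total_written += t->write_bytes;
}
```

This yields total bytes read and written, under the assumption that every instruction in the block executes and that each size is known statically.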

Qin
 


Wonjoon Song

Sep 14, 2012, 3:57:50 PM
to dynamor...@googlegroups.com
Hello Qin

Thank you for the reply. Can I also get the address of each memory reference, along with its size, if I use an example like instruction count?

Wonjoon

Qin Zhao

Sep 14, 2012, 6:16:58 PM
to dynamor...@googlegroups.com
No. Most of the time the memory reference size is statically determined (there are some special cases), so you can get the size directly.
However, the address can change every time the instruction is executed, so you cannot get the memory address by analyzing the code.
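
As an illustration, here is a small plain-C model of an x86-style memory operand of the form base + index*scale + disp. The `mem_operand_t` type and `effective_address` function are hypothetical names for this sketch. The size is a fixed field of the operand, but the address depends on register values that are only known when the instruction runs.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 16

/* Hypothetical model of a memory operand: base + index*scale + disp. */
typedef struct {
    int      base_reg;  /* register number used as base      */
    int      index_reg; /* register number used as index     */
    uint32_t scale;     /* 1, 2, 4 or 8                      */
    int32_t  disp;      /* constant displacement             */
    uint32_t size;      /* access size in bytes, fixed in the encoding */
} mem_operand_t;

/* Needs the register values at the moment the instruction executes. */
static uintptr_t effective_address(const mem_operand_t *op,
                                   const uintptr_t regs[NUM_REGS])
{
    assert(op->base_reg >= 0 && op->base_reg < NUM_REGS);
    assert(op->index_reg >= 0 && op->index_reg < NUM_REGS);
    return regs[op->base_reg]
         + regs[op->index_reg] * op->scale
         + (uintptr_t)(intptr_t)op->disp;
}
```

With the same operand, changing only the index register from 0 to 3 moves the address from 0x1008 to 0x1014 (given a base of 0x1000, scale 4, displacement 8), while `size` stays 4 throughout.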

Qin


Wonjoon Song

Sep 15, 2012, 12:58:25 PM
to dynamor...@googlegroups.com
Hello Qin

Thank you for the reply. If I need the address as well as the size of every reference, is memtrace the best I can do using DynamoRIO? Could some optimization be done by changing the buffer size?

Wonjoon

Qin Zhao

Sep 15, 2012, 7:30:40 PM
to dynamor...@googlegroups.com
Yes, enlarging the buffer to reduce the number of clean call invocations would be the easiest optimization.
Other possible optimizations include smart register stealing to avoid unnecessary register saves and restores, optimizing the buffer-filling code, etc.
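
As a rough worked example of the first point, the number of full-buffer callouts is simply the reference count divided by the buffer capacity. The helper below is a hypothetical plain-C sketch, and the figure of one billion references is an assumed workload, not a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Number of full-buffer callouts for a run with total_refs references
 * when the buffer holds refs_per_buffer entries. A final partial buffer
 * (flushed at exit) is not counted. */
static uint64_t full_buffer_callouts(uint64_t total_refs,
                                     uint64_t refs_per_buffer)
{
    assert(refs_per_buffer > 0);
    return total_refs / refs_per_buffer;
}
```

For 1,000,000,000 references, a buffer of 8192 entries gives 122070 callouts and a buffer of 65536 entries gives 15258. Note that this only reduces the number of callouts; the inserted buffer-filling code still runs once per reference.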

Qin


Derek Bruening

Sep 17, 2012, 9:59:00 AM
to dynamor...@googlegroups.com
Can you provide pointers to the examples you're using from the other frameworks?  Are they all storing the full history?

- Derek

Wonjoon Song

Sep 25, 2012, 2:44:48 PM
to dynamor...@googlegroups.com
Hello Derek

Sorry for the late reply. I'm trying to make the tools function as similarly as possible (and then break down the performance to find out how DynamoRIO, Pin, and Valgrind differ when they track memory references).

For Valgrind, I used the lackey tool (included in the Valgrind source) with the --trace-mem=yes option, which traces all memory accesses. Unmodified, lackey is a bit slower than DynamoRIO's memtrace, since it does more than just trace memory, such as counting instructions, jumps, etc. Valgrind is built on the VEX IR, and lackey iterates over the IR statements to find memory references. I modified lackey a little, commenting out the unrelated parts, and now it is much faster than memtrace.

For Pin, I added the call INS_AddInstrumentFunction(load_store_inst, 0); to instrument at the instruction level. Then in that function I look for memory references.

int main()
{
    /*
     * some source
     */
    INS_AddInstrumentFunction(load_store_inst, 0);

    /*
     * some source
     */
}

VOID load_store_inst(INS ins, VOID *v)
{
    UINT32 memOperands = INS_MemoryOperandCount(ins);

    /* Instrument every memory operand of the instruction. */
    for (UINT32 memOp = 0; memOp < memOperands; memOp++) {
        if (INS_MemoryOperandIsRead(ins, memOp)) {
            INS_InsertPredicatedCall(
                ins, IPOINT_BEFORE, (AFUNPTR)RecordMemRead, IARG_FAST_ANALYSIS_CALL,
                IARG_INST_PTR,
                IARG_MEMORYOP_EA, memOp,
                IARG_UINT32, INS_MemoryReadSize(ins),
                IARG_END);
        }
        if (INS_MemoryOperandIsWritten(ins, memOp)) {
            INS_InsertPredicatedCall(
                ins, IPOINT_BEFORE, (AFUNPTR)RecordMemWrite, IARG_FAST_ANALYSIS_CALL,
                IARG_INST_PTR,
                IARG_MEMORYOP_EA, memOp,
                IARG_UINT32, INS_MemoryWriteSize(ins),  /* size of the write, not the read */
                IARG_END);
        }
    }
}

Both tools are much faster than memtrace, so I'm trying to find memtrace's bottleneck, since DynamoRIO is considered the fastest DBI.

Thank you.

Wonjoon

Qin Zhao

Sep 25, 2012, 4:20:04 PM
to dynamor...@googlegroups.com
May I ask what RecordMemRead and RecordMemWrite do? They may be what causes the huge difference.

Qin


Derek Bruening

Sep 27, 2012, 11:12:53 PM
to dynamor...@googlegroups.com
If all you did was modify the memtrace() function, you did not remove the buffer filling.  The inserted code in instrument_mem() fills a buffer and only calls out to memtrace() when it's full (every 8K refs).  You should modify that code to increment global counters instead, to match the other tools.
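
As a sketch of the logic that such counter-based instrumentation would implement, here is a plain-C model with hypothetical names (`range_t`, `counter_t`, `classify`, `count_ref`). It is not DR instrumentation code: in a real client this work would be emitted inline by instrument_mem() rather than called as a C function per reference, and the region bounds would have to be discovered by the tool.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uintptr_t lower, upper; } range_t; /* half-open: [lower, upper) */
typedef struct { uint64_t read, write; } counter_t;

enum { REGION_STACK, REGION_HEAP, REGION_GLOBAL, REGION_OTHER, REGION_COUNT };

/* Bounds for stack, heap and global; assumed to be filled in by the tool. */
static range_t   ranges[REGION_OTHER];
static counter_t counts[REGION_COUNT];

/* Map an address to the first region that contains it. */
static int classify(uintptr_t addr)
{
    for (int r = 0; r < REGION_OTHER; r++)
        if (addr >= ranges[r].lower && addr < ranges[r].upper)
            return r;
    return REGION_OTHER;
}

/* Per-reference work: one classification and one counter update, no buffer. */
static void count_ref(uintptr_t addr, uint32_t size, int is_write)
{
    assert(size > 0); /* a reference touches at least one byte */
    counter_t *c = &counts[classify(addr)];
    if (is_write)
        c->write += size;
    else
        c->read += size;
}
```

Compared with the buffer-filling version, nothing is stored per reference, so there is no history to write out and no periodic callout to memtrace().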

- Derek




On Tue, Sep 25, 2012 at 4:37 PM, Wonjoon Song <kni...@gmail.com> wrote:
Hello Qin

Thank you for the reply. RecordMemWrite just adds up bytes to each region counter (heap, stack, global, other).

VOID PIN_FAST_ANALYSIS_CALL RecordMemWrite(VOID * ip, ADDRINT addr, UINT32 size)
{
    if (stack_range.upper > addr && addr > stack_range.lower) {
        stack_count.write += size;
    }
    else if (heap_range.upper > addr && addr > heap_range.lower) {
        heap_count.write += size;
    }
    /* Global region; global_range and global_count are assumed names,
     * following the stack/heap pattern. */
    else if (global_range.upper > addr && addr > global_range.lower) {
        global_count.write += size;
    }
    else {
        other_count.write += size;
    }
}

The Valgrind and DynamoRIO tools also have this part, since I am trying to do the same work with each tool. In DynamoRIO's memtrace.c I modified the memtrace function from

memtrace(void *drcontext)
{
/* some code
*/
    for (i = 0; i < num_refs; i++) {
        dr_fprintf(data->log, PFX",%c,%d,"PFX"\n",
                   mem_ref->pc, mem_ref->write ? 'w' : 'r', mem_ref->size, mem_ref->addr);
        ++mem_ref;
    }   
/*some code
*/
}

to
memtrace(void *drcontext)
{
/* some code
*/
    for (i = 0; i < num_refs; i++) {
        if (mem_ref->write) 
            RecordMemWrite(mem_ref->size, mem_ref->addr);
        else
            RecordMemRead(mem_ref->size, mem_ref->addr);
        ++mem_ref;
    }   
/*some code
*/
}

Even if I don't use RecordMemWrite in memtrace, it is still slow when I use bzip2 to evaluate performance (10MB random file): Pin takes about 75 seconds while DynamoRIO takes about 400 seconds.

Also, for Pin, I think I should use INS_InsertFillBufferPredicated instead of INS_InsertPredicatedCall, since DynamoRIO's memtrace uses a buffer. But even so, memtrace seems slower.

Wonjoon

Wonjoon Song

Oct 1, 2012, 12:35:03 PM
to dynamor...@googlegroups.com
Hello Derek.

Thank you for the reply. If I add a jmp to a RecordMemWrite or RecordMemRead code cache in instrument_mem() and remove the buffer filling, would it match the other tools?

- Wonjoon

Reid Kleckner

Oct 1, 2012, 1:21:45 PM
to dynamor...@googlegroups.com
It looks to me like the Pin memtrace and the DR memtrace are doing completely different things.

The DR sample is recording a trace of every memory access and its address, while the Pin tool is only maintaining a count.  The Pin tool produces far less data.  Pin may also be inlining RecordMem*, but I'd have to look more closely.

Furthermore, making a call from the code cache is much more complicated than a simple jmp instruction.  Consider how you return, how you prevent clobbering registers, etc.  You'd want to use dr_insert_clean_call(), but be aware that using it on every single memory access is prohibitively slow.


Derek Bruening

Oct 1, 2012, 1:39:09 PM
to dynamor...@googlegroups.com
On Mon, Oct 1, 2012 at 12:35 PM, Wonjoon Song <kni...@gmail.com> wrote:
Thank you for the reply. If I add a jmp to a RecordMemWrite or RecordMemRead code cache in instrument_mem() and remove the buffer filling, would it match the other tools?

I assume you mean a (clean) call rather than a jmp, unless you have a scheme in mind for jmp-and-link.

The strength of DynamoRIO ("DR") is enabling fine-grained control over the inserted instrumentation.  For a tool performing simple counter increments as you describe, the fastest DR client would use raw instruction sequences and would not call out to C/C++ code for the increments.  Using a clean call to high-level language code is simpler to write, and you would certainly use it for less-performance-critical parts of the tool, but for the core instrumentation that's being run on every single memory operation, you want carefully selected code.  DR will inline simple clean calls, but not callees with branches or many arguments.

I assume that when you're analyzing your DR and Pin client performance, you're checking whether the callee is inlined.  That will make a huge difference in performance (often an order of magnitude for the core instrumentation).  For guaranteed performance that won't suddenly become 10x slower due to a small change in a callee that thwarts the inliner, use raw instruction sequences.

- Derek

Wonjoon Song

Oct 1, 2012, 5:18:09 PM
to dynamor...@googlegroups.com
Hello Reid

Thank you for the reply. Yes, DR's memtrace is different from the Pin version. Maybe if I use the INS_InsertFillBufferPredicated() function in Pin, which fills a trace buffer, and call a memtrace() function when the buffer is full, the two would be the same.

I also implemented an alternative version of the Pin memtrace that just fills a buffer and flushes it when it is full.

VOID PIN_FAST_ANALYSIS_CALL RecordMemRead1(VOID * ip, ADDRINT addr, UINT32 size)
{
    buff[pos].addr = (void *)addr;
    buff[pos].write = false;
    buff[pos].size = size;

    pos++;
    if (pos == BUF_SIZE)
        memtrace();
}

Maybe this function is the same as the original DR memtrace? buff is a global array with 8192 entries.

Also, for making a call from the code cache, I first tried dr_insert_clean_call() but got the error "Basic block or trace instrumentation exceeded maximum size". I searched this forum and tried lowering the maximum basic block size using max_bb_instrs, but it didn't work. Maybe I should use a jmp to a code cache like the one in the memtrace example.

- Wonjoon

Wonjoon Song

Oct 1, 2012, 6:46:47 PM
to dynamor...@googlegroups.com
Hello Derek.

Thank you for the reply. I read a paper about Pin which claimed that Pin automatically inlines the call, so I should use raw instructions in DR. Also, I think TLS may be a source of slowdown. Since the other tools (Pin and Valgrind) use a global buffer, I am thinking of moving the buffer from TLS to a global and trying again to see if it helps.

- Wonjoon 