How to write an efficient DynamoRIO client for edge profiling?


wl mu

Mar 5, 2024, 10:25:48 AM
to DynamoRIO Users
Hi, @derekbruening and @AssadHashmi.

I wrote the DynamoRIO client below for edge profiling. We used it to collect branch information (the source and destination address of each branch) from the mcf benchmark in SPEC CPU.

Compared with native execution, however, our client causes a huge slowdown (more than 400x). In Derek's CGO'13 paper, the edge-profiling overhead for mcf is less than 5x.

[Attachment: DynamoRIO edge profiling overhead.png]

Could you tell me how to collect edge information with lower overhead using DynamoRIO, or how I can improve the current client?

Best regards, 

Wenlong.

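bool
tag_is_main_module(void *tag)
{
    module_data_t *main_module = dr_get_main_module();
    bool in_main = dr_module_contains_addr(main_module, dr_fragment_app_pc(tag));
    /* dr_get_main_module() returns a copy that the caller must free. */
    dr_free_module_data(main_module);
    return in_main;
}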
static void
at_br_cond(app_pc src, app_pc dst, int taken)
{
    if (taken != 0) {
        void *drcontext = dr_get_current_drcontext();
        file_t log = (file_t)(ptr_uint_t)drmgr_get_tls_field(drcontext, tls_idx);
        dr_fprintf(log, "%lx,%lx\n", src, dst);
    }
}
static void
at_br_no_cond(app_pc src, app_pc dst)
{
    void *drcontext = dr_get_current_drcontext();
    file_t log = (file_t)(ptr_uint_t)drmgr_get_tls_field(drcontext, tls_idx);
    dr_fprintf(log, "%lx,%lx\n", src, dst);
}
static dr_emit_flags_t
event_app_instruction(void *drcontext, void *tag, instrlist_t *bb, instr_t *instr,
                      bool for_trace, bool translating, void *user_data)
{
    if (tag_is_main_module(tag)) {
        if (instr_is_cti(instr)) {
            if (instr_is_ubr(instr)) {
                dr_insert_ubr_instrumentation(drcontext, bb, instr, (void *)at_br_no_cond);
            } else if (instr_is_cbr(instr)) {
                dr_insert_cbr_instrumentation(drcontext, bb, instr, (void *)at_br_cond);
            } else if (instr_is_call_direct(instr)) {
                dr_insert_call_instrumentation(drcontext, bb, instr,
                                               (void *)at_br_no_cond);
            } else if (instr_is_call_indirect(instr) || instr_is_return(instr) ||
                       instr_is_jmp_indirect(instr)) {
                dr_insert_mbr_instrumentation(drcontext, bb, instr, (void *)at_br_no_cond,
                                              SPILL_SLOT_1);
            }
        }
    }
    return DR_EMIT_DEFAULT;
}
static void
event_exit(void)
{
    dr_log(NULL, DR_LOG_ALL, 1, "Client 'branch_recorder' exiting");
    if (!drmgr_unregister_bb_insertion_event(event_app_instruction) ||
        !drmgr_unregister_tls_field(tls_idx))
        DR_ASSERT(false);
    drmgr_exit();
}
static void
event_thread_init(void *drcontext)
{
    file_t log;
    log = log_file_open(client_id, drcontext, NULL /* using client lib path */,
                        "branch_recorder", DR_FILE_ALLOW_LARGE);
    DR_ASSERT(log != INVALID_FILE);
    drmgr_set_tls_field(drcontext, tls_idx, (void *)(ptr_uint_t)log);
}
static void
event_thread_exit(void *drcontext)
{
    log_file_close((file_t)(ptr_uint_t)drmgr_get_tls_field(drcontext, tls_idx));
}
DR_EXPORT
void
dr_client_main(client_id_t id, int argc, const char *argv[])
{
    dr_set_client_name("DynamoRIO Sample Client 'branch_recorder'",
                       "http://dynamorio.org/issues");
    dr_log(NULL, DR_LOG_ALL, 1, "Client 'branch_recorder' initializing");
    drmgr_init();
    client_id = id;
    tls_idx = drmgr_register_tls_field();
    dr_register_exit_event(event_exit);
    if (!drmgr_register_thread_init_event(event_thread_init) ||
        !drmgr_register_thread_exit_event(event_thread_exit) ||
        !drmgr_register_bb_instrumentation_event(NULL, event_app_instruction, NULL))
        DR_ASSERT(false);
}


Derek Bruening

Mar 5, 2024, 2:27:21 PM
to wl mu, DynamoRIO Users
To get better performance you want to inline the performance-critical kernels of the tool and batch up the I/O. So instead of an expensive clean call to a function that makes an expensive call to write text to a file on every single branch instruction, you would want inlined instrumentation that records addresses to a buffer, with the buffer periodically written out to a file. See the Buffer Filling library (drx_buf), which helps with this, and its example usage in the memval_simple sample tool.
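
The batching idea can be sketched in plain C, independent of DynamoRIO. This is only an illustration under stated assumptions: the names edge_t, edge_buf_t, EDGE_BUF_ENTRIES and the edge_buf_* functions are hypothetical, a stdio FILE stands in for the per-thread log file, and nothing here uses the Buffer Filling library itself. A real client would let that library manage the buffer and emit the fast path as inlined instrumentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical buffer size; a real tool would tune this. */
#define EDGE_BUF_ENTRIES 4096

typedef struct {
    uint64_t src; /* address of the branch instruction */
    uint64_t dst; /* address that control transferred to */
} edge_t;

typedef struct {
    edge_t entries[EDGE_BUF_ENTRIES];
    size_t count;   /* number of valid entries */
    size_t flushes; /* number of bulk writes so far */
    FILE *out;      /* stands in for the per-thread log file */
} edge_buf_t;

static void
edge_buf_init(edge_buf_t *buf, FILE *out)
{
    buf->count = 0;
    buf->flushes = 0;
    buf->out = out;
}

/* Slow path: one bulk binary write for everything recorded so far. */
static void
edge_buf_flush(edge_buf_t *buf)
{
    if (buf->count == 0)
        return;
    size_t written = fwrite(buf->entries, sizeof(edge_t), buf->count, buf->out);
    assert(written == buf->count);
    (void)written; /* silence unused warning when NDEBUG is defined */
    buf->count = 0;
    buf->flushes++;
}

/* Fast path: two stores and an increment. This is the part worth inlining. */
static void
edge_buf_record(edge_buf_t *buf, uint64_t src, uint64_t dst)
{
    if (buf->count == EDGE_BUF_ENTRIES)
        edge_buf_flush(buf);
    buf->entries[buf->count].src = src;
    buf->entries[buf->count].dst = dst;
    buf->count++;
}
```

In this sketch the per-branch cost is a bounds check and two stores; text formatting is left to an offline pass over the binary records rather than done on every branch.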


wl mu

Mar 18, 2024, 10:43:14 PM
to DynamoRIO Users
Thank you. I'm attempting to profile edge information on the AArch64 platform. Conditional branch profiling accounts for a large share of the overhead because I need to know whether each conditional branch is taken.
As shown below, I have written an instr_cbr_taken function for AArch64, following the implementation for the ARM platform, but it adds noticeable overhead. Could you tell me how to implement such a function in a more lightweight way, or whether this part of the code can be inlined?

Best regards,
Wenlong.

bool
instr_cbr_taken(int opcode, uint pred, reg_id_t reg, uint64 val1)
{
    void *drcontext = dr_get_current_drcontext();
    dr_mcontext_t mcontext;
    mcontext.size = sizeof(mcontext);
    mcontext.flags = DR_MC_ALL;
    dr_get_mcontext(drcontext, &mcontext);
    if (opcode == OP_bcond) {
        switch (pred) {
        case DR_PRED_NONE: return false;
        case DR_PRED_EQ: /* Z == 1 */ return (TEST(EFLAGS_Z, mcontext.xflags));
        case DR_PRED_NE: /* Z == 0 */ return (!TEST(EFLAGS_Z, mcontext.xflags));
        case DR_PRED_CS: /* C == 1 */ return (TEST(EFLAGS_C, mcontext.xflags));
        case DR_PRED_CC: /* C == 0 */ return (!TEST(EFLAGS_C, mcontext.xflags));
        case DR_PRED_MI: /* N == 1 */ return (TEST(EFLAGS_N, mcontext.xflags));
        case DR_PRED_PL: /* N == 0 */ return (!TEST(EFLAGS_N, mcontext.xflags));
        case DR_PRED_VS: /* V == 1 */ return (TEST(EFLAGS_V, mcontext.xflags));
        case DR_PRED_VC: /* V == 0 */ return (!TEST(EFLAGS_V, mcontext.xflags));
        case DR_PRED_HI: /* C == 1 and Z == 0 */
            return (TEST(EFLAGS_C, mcontext.xflags) && !TEST(EFLAGS_Z, mcontext.xflags));
        case DR_PRED_LS: /* C == 0 or Z == 1 */
            return (!TEST(EFLAGS_C, mcontext.xflags) || TEST(EFLAGS_Z, mcontext.xflags));
        case DR_PRED_GE: /* N == V */
            return BOOLS_MATCH(TEST(EFLAGS_N, mcontext.xflags),
                               TEST(EFLAGS_V, mcontext.xflags));
        case DR_PRED_LT: /* N != V */
            return !BOOLS_MATCH(TEST(EFLAGS_N, mcontext.xflags),
                                TEST(EFLAGS_V, mcontext.xflags));
        case DR_PRED_GT /* Z == 0 and N == V */:
            return (!TEST(EFLAGS_Z, mcontext.xflags) &&
                    BOOLS_MATCH(TEST(EFLAGS_N, mcontext.xflags),
                                TEST(EFLAGS_V, mcontext.xflags)));
        case DR_PRED_LE: /* Z == 1 or N != V */
            return (TEST(EFLAGS_Z, mcontext.xflags) ||
                    !BOOLS_MATCH(TEST(EFLAGS_N, mcontext.xflags),
                                 TEST(EFLAGS_V, mcontext.xflags)));
        case DR_PRED_AL: return true;
        case DR_PRED_NV: return true; /* on AArch64, NV behaves the same as AL */
        }
    } else if (opcode == OP_cbnz || opcode == OP_cbz) {
        reg_t val;
        val = reg_get_value(reg, &mcontext);
        if (opcode == OP_cbnz)
            return (val != 0);
        else
            return (val == 0);
    } else if (opcode == OP_tbnz || opcode == OP_tbz) {

        reg_t val;
        val = reg_get_value(reg, &mcontext);
        uint a = 0x1;
        uint b = val >> val1;
        if (opcode == OP_tbnz) {
            return TEST(a, b);
        } else {
            return (!TEST(a, b));
        }
    }
    DR_ASSERT(false);
    return false;
}

Abhinav Sharma

Mar 19, 2024, 8:45:49 AM
to DynamoRIO Users
Hi Wenlong,

dr_insert_clean_call has an optimization that inlines the callee where possible (see [1]). Could you examine the basic block fragment after calling dr_insert_clean_call to see if the callee was inlined automatically in your case? You can do this either using the debug build and running with "-loglevel 3 -logmask 0x20" (see [2]; 0x20 is LOG_INTERP), or just print the instrlist_t in the instrumentation event in your client using instrlist_disassemble[3].

If it turns out that it is not automatically inlined, I guess you'll need to manually create the instrumentation using our instruction API (see [4] and [5] etc) and add that instead of the clean call. Clean calls can be expensive, so investing the effort to inline manually would be worthwhile.
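
Whichever route is taken, the taken/not-taken decision itself depends only on the four NZCV flag bits (for B.cond) or on a single register value (for CBZ/CBNZ/TBZ/TBNZ), so it does not need a full machine context. Below is a minimal plain-C sketch of that decision logic. The names cond_holds, cbz_taken, tbz_taken and the FLAG_* macros are hypothetical, and cond is the architectural 4-bit condition encoding, not DynamoRIO's DR_PRED_* enum values. Nothing here calls DynamoRIO, and whether a callee of this shape gets inlined automatically would still need to be checked as described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions of the condition flags in the AArch64 NZCV register. */
#define FLAG_N (1u << 31)
#define FLAG_Z (1u << 30)
#define FLAG_C (1u << 29)
#define FLAG_V (1u << 28)

/* Returns whether a B.cond with 4-bit condition field `cond` is taken. */
static bool
cond_holds(unsigned cond, uint32_t nzcv)
{
    bool n = (nzcv & FLAG_N) != 0;
    bool z = (nzcv & FLAG_Z) != 0;
    bool c = (nzcv & FLAG_C) != 0;
    bool v = (nzcv & FLAG_V) != 0;
    bool result;
    switch ((cond >> 1) & 0x7) {
    case 0: result = z; break;            /* EQ / NE */
    case 1: result = c; break;            /* CS / CC */
    case 2: result = n; break;            /* MI / PL */
    case 3: result = v; break;            /* VS / VC */
    case 4: result = c && !z; break;      /* HI / LS */
    case 5: result = n == v; break;       /* GE / LT */
    case 6: result = n == v && !z; break; /* GT / LE */
    default: return true;                 /* AL and NV both always execute */
    }
    /* The low bit of the encoding inverts the base condition. */
    return (cond & 1) ? !result : result;
}

/* CBZ is taken when the register is zero, CBNZ when it is nonzero. */
static bool
cbz_taken(bool is_cbnz, uint64_t reg_val)
{
    return (reg_val != 0) == is_cbnz;
}

/* TBZ is taken when the tested bit is clear, TBNZ when it is set. */
static bool
tbz_taken(bool is_tbnz, uint64_t reg_val, unsigned bit)
{
    return (((reg_val >> (bit & 63)) & 1) != 0) == is_tbnz;
}
```

Since the opcode, condition, and bit number are known at instrumentation time, only the flag or register read has to happen at run time; the rest can be specialized per branch.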

Derek Bruening

Mar 19, 2024, 11:14:32 AM
to Abhinav Sharma, DynamoRIO Users
If you insert manually inlined instrumentation, you might want something like a conditional move (e.g., CSEL) to help store your taken vs untaken bit into your buffer prior to a branch that depends on the flags.  In some ways inlined instrumentation can be simpler as the state you want to evaluate is the processor state and you can directly act on it rather than emulating; on the other hand you are directly modifying that state so be sure to use drreg to preserve the register values.
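
The shape of that fast path can be sketched in plain C as a branchless select, which is what a conditional-select instruction such as CSEL computes. This is an illustration only: edge_rec_t, select_u64, and record_cbr are hypothetical names, and a real client would emit the equivalent instructions through the instruction API with scratch registers reserved via drreg rather than call a C function.

```c
#include <stdint.h>

/* Hypothetical record layout: the branch address and where control went. */
typedef struct {
    uint64_t src;
    uint64_t next;
} edge_rec_t;

/* Branchless select: returns if_taken when bit 0 of taken is set, else if_not. */
static inline uint64_t
select_u64(uint64_t taken, uint64_t if_taken, uint64_t if_not)
{
    uint64_t mask = (uint64_t)0 - (taken & 1); /* all ones when taken */
    return (if_taken & mask) | (if_not & ~mask);
}

/* Appends one record at cursor and returns the advanced cursor. */
static inline edge_rec_t *
record_cbr(edge_rec_t *cursor, uint64_t src, uint64_t target,
           uint64_t fallthrough, uint64_t taken)
{
    cursor->src = src;
    cursor->next = select_u64(taken, target, fallthrough);
    return cursor + 1;
}
```

Both the target and the fall-through address are constants known when the block is instrumented, so the only run-time input in this sketch is the taken bit.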
