
Possible inconsistencies in attach/detach/reattach implementation on Linux


Alexei Budankov

Dec 26, 2024, 1:17:25 AM
to DynamoRIO Users

Hello,

I am trying to use the attach/detach/reattach feature on Ubuntu 20.04 AArch64 with a single-threaded application, and it currently crashes. DynamoRIO is built in debug mode with full logging enabled:
./bin64/drrun -attach 4042305 -ops "-logdir /home/abudanko/dynamorio_fixes/build/log_dir -logmask 0x0fffffff -loglevel 4" -c ./clients/lib64/debug/libdrmemtrace.so -offline
Below is part of the execution log from the signal handler processing the nudge detach signal:
...
86800:d_r_dispatch:
d_r_dispatch: target = 0x0000ffffab42755c
86802:check_thread_vm_area: check_thread_vm_area: pc = 0x0000ffffab42755c
86804:prepend_entry_to_fraglist: prepend_entry_to_fraglist: putting fragment @0x0000ffffab42755c (shared) on vmarea 0x0000ffffab427000-0
86806:check_thread_vm_area: check_thread_vm_area: check_stop = 0x0000ffffab428000
86808:build_bb_ilist:
interp: 86810:build_bb_ilist: start_pc = 0x0000ffffab42755c
0x0000ffffab42755c b98007e0 ldrsw +0x04(%sp)[4byte] -> %x0
0x0000ffffab427560 f94007e1 ldr +0x08(%sp)[8byte] -> %x1
0x0000ffffab427564 8b000020 add %x1 %x0 lsl $0x00 -> %x0
0x0000ffffab427568 f90007e0 str %x0 -> +0x08(%sp)[8byte]
0x0000ffffab42756c b94007e0 ldr +0x04(%sp)[4byte] -> %w0
0x0000ffffab427570 11000400 add %w0 $0x0001 lsl $0x00 -> %w0
0x0000ffffab427574 b90007e0 str %w0 -> +0x04(%sp)[4byte]
0x0000ffffab427578 b94007e1 ldr +0x04(%sp)[4byte] -> %w1
0x0000ffffab42757c 5290d3c0 movz $0x869e lsl $0x00 -> %w0
0x0000ffffab427580 72a00020 movk %w0 $0x0001 lsl $0x10 -> %w0
0x0000ffffab427584 6b00003f subs %w1 %w0 lsl $0x00 -> %wzr
0x0000ffffab427588 54fffead b.le $0x0000ffffab42755c
86812:build_bb_ilist: end_pc = 0x0000ffffab42758c

87048:dispatch_enter_fcache_stats: Entry into F2(0x0000ffffab42755c).0x0000aaaa929fb170
...
87086:dispatch_exit_fcache_stats: Exit from F2(0x0000ffffab42755c).0x0000aaaa929fb36c
...

87612:main_signal_handler_C:
main_signal_handler: thread=4042305, sig=32, xsp=0x0000fffda61c3da0, retaddr=0x1000000400000020
87614:main_signal_handler_C: siginfo: sig = 32, pid = 0, status = 0, errno = 268435460, si_code = -1
87616:dump_sigcontext: Signal context:
87618:dump_sigcontext: x0 = 0x000000000000a248
87620:dump_sigcontext: x1 = 0x0000ffffab425560
...
87682:dump_sigcontext: pc = 0x0000aaaa929fb208
...
87756:dump_sigcontext:
87758:is_signal_for_us: is_signal_for_us T4042305: sig=32 code=-1 errno=268435460
87760:is_signal_for_us: is_signal_for_us T4042305: sig=32 code=-1 errno=268435460
87762:handle_nudge_signal: received nudge version=1 flags=0x0 mask=0x4 id=0x00000000 arg=0x0000000000000000
87766:d_r_notify: SYSLOG_INFORMATION: received nudge mask=0x4 id=0x00000000 arg=0x0000000000000000
87768:unlink_fragment_for_signal: F2, f->tag = 0x0000ffffab42755c, f->start_pc = 0x0000aaaa929fb16c, pc = 0x0000aaaa929fb208
87770:unlink_fragment_for_signal: unlinking outgoing for interrupted F2
87772:unlink_fragment_outgoing: unlinking outgoing from frag F2(0x0000ffffab42755c)
87774:unlink_branch: unlinking branch F2.0x0000aaaa929fb368
87776:unlink_branch: unlinking branch F2.0x0000aaaa929fb36c
87778:unlink_fragment_for_signal: dcontext->nudge_pending = 0x0000000000000000
87780:enter_nolinking: entered function
87782:enter_nolinking: with CODE_CACHE
87784:flush_fragments_synch_unlink_priv: FLUSH STAGE 1: synch_unlink_priv(thread 4042305 flushtime 0): 0x0000000000000000-0xffffffffffff
87786:make_writable: make_writable: pc 0x0000aaaa929f5000 -> 0x0000aaaa929f5000-0x0000aaaa929f7000 0
87808:make_unwritable: make_unwritable: pc 0x0000aaaa929f5000 -> 0x0000aaaa929f5000-0x0000aaaa929f7000
87824:flush_fragments_synch_priv: considering thread #1/1 = 4042305
87826:flush_fragments_synch_priv: thread 4042305 synch not required
87828:flush_fragments_thread_unlink: thread 4042305 has no fragments in region to flush

I added code at the end of the unlink_fragment_for_signal() function in signal.c; that code unlinks the fragment that the nudge detach signal hits:

static bool
unlink_fragment_for_signal(dcontext_t *dcontext, fragment_t *f,
                           byte *pc /* interruption pc */)
{
    ...
    if (!waslinking)
        enter_nolinking(dcontext, NULL, false);

    /* Added here to get out to the dispatcher in a forced order */
    flush_fragments_synch_unlink_priv(dcontext, UNIVERSAL_REGION_BASE,
                                      UNIVERSAL_REGION_SIZE, false, false, false);
    flush_fragments_unlink_shared(dcontext, UNIVERSAL_REGION_BASE,
                                  UNIVERSAL_REGION_SIZE, NULL _IF_DGCDIAG(NULL));
    flush_fragments_end_synch(dcontext, false);

    return changed;
}

87830:flush_fragments_unlink_shared: FLUSH STAGE 2: unlink_shared(thread 4042305): flusher is 4042305
87832:flush_fragments_unlink_shared: flushing shared fragments
87836:add_process_lock: add_process_lock: 0 lock 0x00000000715f3e58: name=shared_delete_lock(mutex)@/home/abudanko/dynamorio_fixes/core/
rank=14 owner=4042305 owning_dc=0x0000fffda6188680 contended_event=0x7ffffffff prev=0x0000000000000000
lock count_times_acquired= 1 0 0 0
87838:vm_area_unlink_fragments: vm_area_unlink_fragments 0x0000000000000000..0xffffffffffffffff
87840:vm_area_unlink_fragments: marking region 0x0000ffffab427000..0x0000ffffab428000 for deletion & unlinking all its frags
87860:hashtable_fragment_add: hashtable_fragment_add(0x0000ffffab427578) mask=0x00000000000003ff offset=0 trying 0x000000000000032d:
87862:hashtable_fragment_add: hashtable_fragment_add: added 0x0000ffffab427578 to shared_future at table[813]
87864:hashtable_fragment_remove_helper_open_address: hashtable_fragment_remove_helper_open_address(table=0x0000fffda6180248, hindex=813)
87880:hashtable_fragment_add: hashtable_fragment_add(0x0000ffffab427554) mask=0x00000000000003ff offset=0 trying 0x000000000000022d:
87882:hashtable_fragment_add: hashtable_fragment_add: added 0x0000ffffab427554 to shared_future at table[557]
87884:hashtable_fragment_remove_helper_open_address: hashtable_fragment_remove_helper_open_address(table=0x0000fffda6180248, hindex=557)
87902:hashtable_fragment_add: hashtable_fragment_add(0x0000ffffab42758c) mask=0x00000000000003ff offset=0 trying 0x000000000000009e:
87904:hashtable_fragment_add: hashtable_fragment_add: added 0x0000ffffab42758c to shared_future at table[158]
87906:hashtable_fragment_remove_helper_open_address: hashtable_fragment_remove_helper_open_address(table=0x0000fffda6180248, hindex=158)
87922:hashtable_fragment_add: hashtable_fragment_add(0x0000ffffab42755c) mask=0x00000000000003ff offset=0 trying 0x00000000000001f4:
87924:hashtable_fragment_add: hashtable_fragment_add: added 0x0000ffffab42755c to shared_future at table[500]
87926:hashtable_fragment_remove_helper_open_address: hashtable_fragment_remove_helper_open_address(table=0x0000fffda6180248, hindex=500)
87942:hashtable_fragment_add: hashtable_fragment_add(0x0000ffffab42757c) mask=0x00000000000003ff offset=0 trying 0x0000000000000110:
87944:hashtable_fragment_add: hashtable_fragment_add: added 0x0000ffffab42757c to shared_future at table[272]
87946:hashtable_fragment_remove_helper_open_address: hashtable_fragment_remove_helper_open_address(table=0x0000fffda6180248, hindex=272)
87954:vm_area_unlink_fragments: Before removing vm area:
0x0000ffffab427000-0x0000ffffab428000 ---- ELF SO test_lib.so
87956:vm_area_unlink_fragments: Removing shared vm area 0x0000ffffab427000-0x0000ffffab428000
87962:vm_area_unlink_fragments: After removing vm area:
87964:vm_area_unlink_fragments: Unlinked 5 frags
87966:make_writable: make_writable: pc 0x0000aaaa929f5000 -> 0x0000aaaa929f5000-0x0000aaaa929f7000 0
87982:make_unwritable: make_unwritable: pc 0x0000aaaa929f5000 -> 0x0000aaaa929f5000-0x0000aaaa929f7000
87998:flush_fragments_unlink_shared: Flushed 5 fragments from 0x0000000000000000-0xffffffffffffffff
88000:flush_fragments_end_synch: FLUSH STAGE 3: end_synch(thread 4042305): flusher is 4042305
88002:main_signal_handler_C: main_signal_handler 32 returning now to 0x0000aaaa929fb208
...
88014:vm_area_check_shared_pending: thread 4042305 (flushtime 0) walking pending deletion list (was_I_flushed==F2)
88016:vm_area_check_shared_pending: Considering #0: 0x0000ffffab427000..0x0000ffffab428000 flushtime 1
88018:vm_area_check_shared_pending: dec => ref_count is now 0, flushtime diff is 0
88020:add_process_lock: add_process_lock: 0 lock 0x00000000715f3ed0: name=lazy_delete_lock(mutex)@/home/abudanko/dynamorio_fixes/core/vm
rank=15 owner=4042305 owning_dc=0x0000fffda6188680 contended_event=0xfffdffffffff prev=0x0000000000000000
lock count_times_acquired= 1 0 0 0
88022:check_lazy_deletion_list: checking lazy list @ timestamp 1
88024:vm_area_check_shared_pending: deleting all fragments in region 0x0000ffffab427000..0x0000ffffab428000 flushtime 1
88026:fragment_delete: fragment_delete: *0x0000fffda6151740 F5(0x0000ffffab427578).0x0000aaaa929fb6dc bb 0x47
88028:fcache_remove_fragment: fcache_remove_fragment: F5 from Basic block (shared) unit
88030:fragment_delete: fragment_delete: *0x0000fffda61516d0 F4(0x0000ffffab427554).0x0000aaaa929fb598 bb 0x47
88032:fcache_remove_fragment: fcache_remove_fragment: F4 from Basic block (shared) unit
88034:fragment_delete: fragment_delete: *0x0000fffda6151650 F3(0x0000ffffab42758c).0x0000aaaa929fb3c8 bb 0x47
88036:fcache_remove_fragment: fcache_remove_fragment: F3 from Basic block (shared) unit
88038:fragment_delete: fragment_delete: *0x0000fffda61515d0 F2(0x0000ffffab42755c).0x0000aaaa929fb16c bb 0x47
88040:fcache_remove_fragment: fcache_remove_fragment: F2 from Basic block (shared) unit
88042:fragment_delete: fragment_delete: *0x0000fffda6151520 F1(0x0000ffffab42757c).0x0000aaaa929fb008 bb 0x47
88044:fcache_remove_fragment: fcache_remove_fragment: F1 from Basic block (shared) unit
88046:add_process_lock: add_process_lock: 0 lock 0x00000000715f36a8: name=unit_flush_lock(mutex)@/home/abudanko/dynamorio_fixes/core/fca
rank=23 owner=4042305 owning_dc=0x0000fffda6188680 contended_event=0xfffdffffffff prev=0x0000000000000000
lock count_times_acquired= 1 0 0 0
88048:vm_area_check_shared_pending: thread 4042305 done walking pending list @flushtime 1
88050:vm_area_check_shared_pending: Flushed 5 frags
88052:dispatch_enter_dynamorio: Just flushed last_fragment

After returning from the signal handler, the thread gets back to DynamoRIO dispatch and starts instrumenting code and building the code cache again:

88054:d_r_dispatch:
d_r_dispatch: target = 0x0000ffffab42755c
88056:check_thread_vm_area: check_thread_vm_area: pc = 0x0000ffffab42755c
new shared vm area: 0x0000ffffab427000-0x0000ffffab428000 ---- ELF SO test_lib.so
88062:check_thread_vm_area: checking thread vmareas against executable_areas
88064:prepend_entry_to_fraglist: prepend_entry_to_fraglist: putting fragment @0x0000ffffab42755c (shared) on vmarea 0x0000ffffab427000-0
88066:check_thread_vm_area: check_thread_vm_area: check_stop = 0x0000ffffab428000
88068:build_bb_ilist:
interp: 88070:build_bb_ilist: start_pc = 0x0000ffffab42755c
0x0000ffffab42755c b98007e0 ldrsw +0x04(%sp)[4byte] -> %x0
0x0000ffffab427560 f94007e1 ldr +0x08(%sp)[8byte] -> %x1
0x0000ffffab427564 8b000020 add %x1 %x0 lsl $0x00 -> %x0
0x0000ffffab427568 f90007e0 str %x0 -> +0x08(%sp)[8byte]
0x0000ffffab42756c b94007e0 ldr +0x04(%sp)[4byte] -> %w0
0x0000ffffab427570 11000400 add %w0 $0x0001 lsl $0x00 -> %w0
0x0000ffffab427574 b90007e0 str %w0 -> +0x04(%sp)[4byte]
0x0000ffffab427578 b94007e1 ldr +0x04(%sp)[4byte] -> %w1
0x0000ffffab42757c 5290d3c0 movz $0x869e lsl $0x00 -> %w0
0x0000ffffab427580 72a00020 movk %w0 $0x0001 lsl $0x10 -> %w0
0x0000ffffab427584 6b00003f subs %w1 %w0 lsl $0x00 -> %wzr
0x0000ffffab427588 54fffead b.le $0x0000ffffab42755c
88072:build_bb_ilist: end_pc = 0x0000ffffab42758c
88074:instrument_basic_block:
instrument_basic_block ******************
88076:instrument_basic_block:
before instrumentation:
TAG 0x0000ffffab42755c
+0 L3 @0x0000fffda61a9448 b98007e0 ldrsw +0x04(%sp)[4byte] -> %x0
+4 L3 @0x0000fffda623b898 f94007e1 ldr +0x08(%sp)[8byte] -> %x1
+8 L3 @0x0000fffda61a8190 8b000020 add %x1 %x0 lsl $0x00 -> %x0
...

Should the detach code, i.e.

handle_nudge() -> call_switch_stack(dcontext, (byte *)((ptr_uint_t)d_r_detachstack + DYNAMORIO_STACK_SIZE), (void (*)(void *))detach_externally_on_new_stack, NULL, true);

execute before getting back to the DynamoRIO dispatcher, so that application code instrumentation would not happen until the next attach?


Thanks,
Alexei

Abhinav Sharma

Dec 27, 2024, 12:15:52 PM
to DynamoRIO Users
Hi Alexei,

I assume this is related to the "-attach <pid>" feature. I'm trying to understand the unexpected behavior you're observing; see below.

> Should detach code i.e.
> handle_nudge() -> call_switch_stack(dcontext, (byte *)((ptr_uint_t)d_r_detachstack + DYNAMORIO_STACK_SIZE), (void (*)(void *))detach_externally_on_new_stack, NULL, true);
> execute prior getting back to dynamorio dispatcher so application code instrumentation would not happen till the next attach?

I see that handle_nudge_signal -> nudge_add_pending adds the nudge signal to dcontext->nudge_pending, which should invoke enter_nolinking -> handle_nudge -> call_switch_stack.


> i added code at the end of unlink_fragment_for_signal() function at signal.c file and that code implements unlink of the fragment that nudge detach signal hits:
> /* Added here to get out to the dispatcher in a forced order */
> flush_fragments_synch_unlink_priv(dcontext, UNIVERSAL_REGION_BASE, UNIVERSAL_REGION_SIZE, false, false, false);
> flush_fragments_unlink_shared(dcontext, UNIVERSAL_REGION_BASE, UNIVERSAL_REGION_SIZE, NULL _IF_DGCDIAG(NULL));
> flush_fragments_end_synch(dcontext, false);

Can you clarify the goal behind "Added here to get out to the dispatcher in a forced order"?


> After return from signal handler thread gets back to dynamorio dispatch and starts instrumenting code and building code cache again:

Can you clarify: do you see the above unexpected behavior with your unlink_fragment_for_signal code changes or also without them? Do you mean DynamoRIO does not detach successfully on the nudge signal? If this is a bug in the current implementation, please file it separately https://github.com/DynamoRIO/dynamorio/issues/ also.

Abhinav







Alexei Budankov

Dec 27, 2024, 3:31:47 PM
to DynamoRIO Users
Hi Abhinav,

Thanks for your reply. Please see my answers below inline.

Thanks,
Alexei

On Friday, December 27, 2024 at 20:15:52 UTC+3, sharma...@google.com wrote:
Hi Alexei,

I assume this is related to the "-attach <pid>" feature. I'm trying to understand the unexpected behavior you're observing; see below.
Correct. I am trying to use the attach/detach feature, and I am willing to fix code here and there to eventually get the feature working.

> Should detach code i.e.
> handle_nudge() -> call_switch_stack(dcontext, (byte *)((ptr_uint_t)d_r_detachstack + DYNAMORIO_STACK_SIZE), (void (*)(void *))detach_externally_on_new_stack, NULL, true);
> execute prior getting back to dynamorio dispatcher so application code instrumentation would not happen till the next attach?

I see that handle_nudge_signal -> nudge_add_pending adds the nudge signal to dcontext->nudge_pending, which should invoke enter_nolinking -> handle_nudge -> call_switch_stack.

According to the implementation, handle_nudge_signal calls nudge_add_pending at the very end. Before that, handle_nudge_signal can call unlink_fragment_for_signal, which in turn calls enter_nolinking( , , cache_transition == false); because of that false parameter it never reaches handle_nudge() -> call_switch_stack() in the context of the nudge signal handler. At least this is what the logs tell me. If the nudge signal hits code running in the code cache, it looks like this:

87640:main_signal_handler_C:
main_signal_handler: thread=628615, sig=4, xsp=0x0000fffd959c3da0, retaddr=0x1000000400000004
87642:main_signal_handler_C: siginfo: sig = 4, pid = 0, status = 0, errno = 268435460, si_code = -1
87644:dump_sigcontext:  Signal context:
87646:dump_sigcontext:  x0     = 0x0000908acd0247fb
87648:dump_sigcontext:  x1     = 0x0000ffff9ac65f40
...
87708:dump_sigcontext:  sp     = 0x0000ffffdbcf3840
87710:dump_sigcontext:  pc     = 0x0000aaaa73e7b234
...
87786:is_signal_for_us: is_signal_for_us T628615: sig=4 code=-1 errno=268435460
87792:handle_suspend_signal: sig context has to be inited now
87794:handle_suspend_signal: handling nudge signal from handle_suspend_signal 8699
87796:is_signal_for_us: is_signal_for_us T628615: sig=4 code=-1 errno=268435460
87798:handle_nudge_signal: received nudge version=1 flags=0x0 mask=0x4 id=0x00000000 arg=0x0000000000000000
87802:d_r_notify: SYSLOG_INFORMATION: received nudge mask=0x4 id=0x00000000 arg=0x0000000000000000
87804:handle_nudge_signal: sc_xip == 0x0000aaaa73e7b234, sc_xsp == 0x0000ffffdbcf3840
87806:unlink_fragment_for_signal: f->tag = 0x0000ffff9ac6755c, pc = 0x0000aaaa73e7b234
87808:unlink_fragment_for_signal:       unlinking outgoing for interrupted F2
87810:unlink_fragment_outgoing: unlinking outgoing from frag F2(0x0000ffff9ac6755c)
87812:unlink_branch:     unlinking branch F2.0x0000aaaa73e7b368
87814:unlink_branch:     unlinking branch F2.0x0000aaaa73e7b36c
87816:enter_nolinking: entered function
87818:enter_nolinking: with CODE_CACHE
87820:handle_nudge_signal: unlink from cache (ret = 1) F2(0x0000ffff9ac6755c.0x0000aaaa73e7b16c)
87822:main_signal_handler_C:    main_signal_handler 4 returning now to 0x0000aaaa73e7b234

87824:d_r_dispatch: MAIN ENTER
87826:dispatch_enter_dynamorio: entering
87828:dispatch_exit_fcache_stats: Exit from F2(0x0000ffff9ac6755c).0x0000aaaa73e7b368
87830:dispatch_exit_fcache_stats:  (shared)
87832:dispatch_exit_fcache_stats:
87834:dispatch_exit_fcache_stats:  (UNKNOWN DIRECT EXIT F2.0x0000aaaa73e7b368->F2)
87836:dispatch_exit_fcache_stats:
87838:dispatch_exit_fcache_stats: dcontext mcontext:
priv_mcontext_t @0x0000fffd95988680
        r0  = 0x000000000001869e
...
87842:d_r_dispatch:
d_r_dispatch: target = 0x0000ffff9ac6755c
87844:enter_nolinking: entered function
87846:enter_nolinking: with CODE_CACHE
87848:enter_nolinking: cache_transition == true
87850:enter_nolinking: dcontext->nudge_pending = 0x0000fffd959a4d08
87852:enter_nolinking:  re-linking outgoing for interrupted F2
87854:link_fragment_outgoing:   linking outgoing links for F2(0x0000ffff9ac6755c)
87856:link_branch:     linking F2(0x0000ffff9ac6755c).0x0000aaaa73e7b368 -> F2(0x0000ffff9ac6755c)=0x0000aaaa73e7b170
87858:link_branch:     linking F2(0x0000ffff9ac6755c).0x0000aaaa73e7b36c -> F3(0x0000ffff9ac6758c)=0x0000aaaa73e7b3cc
87870:d_r_notify: SYSLOG_INFORMATION: Detaching from application /home/abudanko/python3/bin/python3.8 (628615)
87872:synch_with_all_threads: synch with all threads my id = 628615 Giving 4 permission and seeking 5 state
87874:add_process_lock: add_process_lock: 0 lock 0x00000000716050e8: name=all_threads_synch_lock(mutex)@/home/abudanko/dyna
morio_fixes/core/synch.c:94
rank=2 owner=628615 owning_dc=0x0000fffd95988680 contended_event=0xffffffffffff prev=0x0000000000000000

lock count_times_acquired=       1                              0                               0
    0                               0+2 all_threads_synch_lock(mutex)@/home/abudanko/dynamorio_fixes/core/synch.c:94
87876:synch_with_all_threads: Looping over all threads (1 threads)
87878:synch_with_all_threads: Skipping synch with thread 628615
87880:synch_with_all_threads: Finished synch with all threads: result=1
87882:synch_with_all_threads:   returning holding initexit_lock and all_threads_synch_lock
87888:add_process_lock: add_process_lock: 0 lock 0x0000fffd95946ba0: name=synch_lock(mutex)@/home/abudanko/dynamorio_fixes/
core/synch.c:144
rank=10 owner=628615 owning_dc=0x0000fffd95988680 contended_event=0x8ffffffff prev=0x0000000000000000

lock count_times_acquired=       1                              0                               0
    0                               0+2 synch_lock(mutex)@/home/abudanko/dynamorio_fixes/core/synch.c:144
87890:translate_mcontext: translate context, thread 628615 at pc_recreatable spot translating
87892:recreate_app_state: recreate_app_state -- translating from:
priv_mcontext_t @0x0000fffd95988680

> i added code at the end of unlink_fragment_for_signal() function at signal.c file and that code implements unlink of the fragment that nudge detach signal hits:
> /* Added here to get out to the dispatcher in a forced order */
> flush_fragments_synch_unlink_priv(dcontext, UNIVERSAL_REGION_BASE, UNIVERSAL_REGION_SIZE, false, false, false);
> flush_fragments_unlink_shared(dcontext, UNIVERSAL_REGION_BASE, UNIVERSAL_REGION_SIZE, NULL _IF_DGCDIAG(NULL));
> flush_fragments_end_synch(dcontext, false);

Can you clarify the goal behind "Added here to get out to the dispatcher in a forced order"?

Frankly, I am not sure this code is needed. It is now commented out and, as you can see from the log above, the thread still enters the dispatcher without problems.

enter_nolinking() -> handle_nudge() -> call_switch_stack() is called, but from the dispatcher code, as you can again see from the logs.
As part of the detach code, mcontext translation is performed (line 87890:). Which execution point should that mcontext contain? The point where the nudge signal hit the code in the fcache? The execution point on entry to the dispatcher? The execution point at the branch link closest to the point the signal hit? Some other point?



> After return from signal handler thread gets back to dynamorio dispatch and starts instrumenting code and building code cache again:

Can you clarify: do you see the above unexpected behavior with your unlink_fragment_for_signal code changes or also without them? Do you mean DynamoRIO does not detach successfully on the nudge signal? If this is a bug in the current implementation, please file it separately https://github.com/DynamoRIO/dynamorio/issues/ also.

I will clarify my answers to the questions above and report back.

Derek Bruening

Jan 6, 2025, 1:07:52 PM
to Alexei Budankov, DynamoRIO Users
If the nudge signal handling was delayed, the point where the nudge signal arrived is no longer relevant.  If the enter_nolinking is on dispatch entering the cache, the mcontext is the app state to restore on entering the cache.
 

Alexei Budankov

Jan 9, 2025, 4:36:46 AM
to DynamoRIO Users
Hello, see below inline.

On Friday, December 27, 2024 at 23:31:47 UTC+3, Alexei Budankov wrote:
So, the behavior mentioned above is observed regardless of the changes related to unlink_fragment_for_signal().
To me it looks like DynamoRIO immediately starts instrumenting code again after the nudge detach signal.
However, logically, I suppose that should happen only after the next attach signal.
Also, on my debug build, after detach "completes", I am now facing an assert like the one below:

Detach: thread 1046783 starting detach process
SYSLOG_INFORMATION: Detaching from application /home/abudanko/python3/bin/python3.8 (1046783)
Detach : unpatched ntdll.dll and fixed memory permissions
Detach: starting to translate contexts
mutex_delete lock 0x0000fffd9c623810
...
mutex_delete lock 0x0000fffd9c623810

recreate_app : looking for 0x0000aaaa97d3b5d8 in frag @ 0x0000aaaa97d3b5dc (tag 0x0000ffffa182755c)
recreate_app -- cache pc 0x0000aaaa97d3b5dc != 0x0000aaaa97d3b5d8, assuming a prefix instruction
Fragment 5, tag 0x0000ffffa182755c, flags 0x9000030, shared, size 596:
<------>
  -------- prefix entry: --------
  0x0000aaaa97d3b5d8  a9400780   ldp    (%x28)[16byte] -> %x0 %x1
  -------- normal entry: --------
  0x0000aaaa97d3b5dc  d2846900   movz   $0x2348 lsl $0x00 -> %x0
  0x0000aaaa97d3b5e0  f2b2fa40   movk   %x0 $0x97d2 lsl $0x10 -> %x0
  0x0000aaaa97d3b5e4  f2d55540   movk   %x0 $0xaaaa lsl $0x20 -> %x0
...

SYSLOG_WARNING: CURIOSITY : tdcontext != get_thread_private_dcontext() || (((OPTION_IS_INTERNAL_stress_recreate_pc)) ? (((
(void)(((dynamo_options.checklevel >= (1)) && !(!((OPTION_IS_STRING_stress_recreate_pc)) || ((atomic_aligned_read_int(&(&o
ptions_lock)->num_readers) > 0) || self_owns_write_lock(&options_lock)))) ? (d_r_internal_error("/home/abudanko/mica_dev_d
eb/core/translate.c", 1041, "!((OPTION_IS_STRING_stress_recreate_pc)) || READWRITE_LOCK_HELD(&options_lock)"), 0) : 0)), d
ynamo_options.stress_recreate_pc)) : ((((dynamo_options.checklevel >= (1)) && !((0))) ? (d_r_internal_error("non-internal
option argument " "stress_recreate_pc" " @" "/home/abudanko/mica_dev_deb/core/translate.c", 1041, "false"), (void)0) : (vo
id)0), (((void)(((dynamo_options.checklevel >= (1)) && !(!((OPTION_IS_STRING_stress_recreate_pc)) || ((atomic_aligned_read
_int(&(&options_lock)->num_readers) > 0) || self_owns_write_lock(&options_lock)))) ? (d_r_internal_error("/home/abudanko/m
ica_dev_deb/core/translate.c", 1041, "!((OPTION

SYSLOG_ERROR: Application /home/abudanko/python3/bin/python3.8 (1046783) DynamoRIO usage error : meta-instr faulted?  must set translation field and handle fault!
SYSLOG_ERROR: Usage error: meta-instr faulted?  must set translation field and handle fault! (/home/abudanko/mica_dev_deb/core/translate.c, line 1074)

I guess it might be because of the immediate reattach after the nudge detach.
Those meta instructions shouldn't be generated after detach and before the next reattach signal.

Could you please clarify?

Thanks,
Alexei 

Derek Bruening

Jan 9, 2025, 10:19:33 AM
to DynamoRIO Users
If you could describe how you are executing this: this is running an application in one shell and while it's running in another shell you say "drrun -attach <pid>" followed by "drconfig -detach pid" followed by "drrun -attach pid"?  Is there no time gap between the detach and 2nd attach?  Does it all work fine if there is a time gap?

Does the translation code not handle a prefix instruction on a synchall==detach translation? There are prefixes in x86 ibl targets, but they are rarer than on aarch64, so I wonder whether such a failure has always been possible. Are all your failure instances at a prefix translation?

Alexei Budankov

Jan 9, 2025, 1:25:51 PM
to DynamoRIO Users

On Thursday, January 9, 2025 at 18:19:33 UTC+3, Derek Bruening wrote:
If you could describe how you are executing this: this is running an application in one shell and while it's running in another shell you say "drrun -attach <pid>" followed by "drconfig -detach pid" followed by "drrun -attach pid"?  Is there no time gap between the detach and 2nd attach?  Does it all work fine if there is a time gap?

1st console:
 /home/abudanko/python/bin/python ./test.py
4042305
...

2nd console:
./bin64/drrun -attach 4042305 -ops "-logdir /home/abudanko/dynamorio_fixes/log_dir -logmask 0x0fffffff -loglevel 4 -ignore_assert_list *" -c ./clients/lib64/debug/libdrmemtrace.so -offline

3rd console:
./bin64/drconfig -detach 4042305

It has not worked well even after the first detach command from the 3rd console. I am now trying to bootstrap and figure out the reasons for the failures using a build of release 11.2.0. The code at signal.c:8684 looks incorrect, because the signal context has already been destroyed by the time detach is handled, and this is most likely the reason for the SEGV below:

Program received signal SIGSEGV, Segmentation fault.
memcpy () at /home/abudanko/dynamorio/core/arch/aarch64/memfuncs.asm:54
54              cbnz     ARG3, 1b
(gdb) bt
#0  memcpy () at /home/abudanko/dynamorio/core/arch/aarch64/memfuncs.asm:54
#1  0x000000007146a828 in sigcontext_to_mcontext (mc=0xffff8ff24620, sc_full=0xfffd8b0c3cb0, flags=DR_MC_ALL) at /home/abudanko/dynamorio/core/unix/signal.c:2892
#2  0x000000007144e144 in thread_get_nudged_mcontext (tr=0xfffd8b0422e8, mc=0xffff8ff24620) at /home/abudanko/dynamorio/core/unix/os.c:4022
#3  0x00000000711d9a08 in detach_externally_on_new_stack () at /home/abudanko/dynamorio/core/synch.c:2388
#4  0x0000000071446d0c in call_dispatch_alt_stack_no_free () at /home/abudanko/dynamorio/core/arch/aarch64/aarch64.asm:150
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
...
core/unix/signal.c:
static bool
handle_suspend_signal(dcontext_t *dcontext, kernel_siginfo_t *siginfo,
                      kernel_ucontext_t *ucxt, sigframe_rt_t *frame)
{
    os_thread_data_t *ostd = (os_thread_data_t *)dcontext->os_field;
    kernel_sigset_t prevmask;
    sig_full_cxt_t sc_full;
    ASSERT(ostd != NULL);

    if (!is_signal_for_us(siginfo, SUSPEND_SIGNAL))
        return true; /* pass to app */

    if (is_sigqueue_supported() && SUSPEND_SIGNAL == NUDGESIG_SIGNUM) {
        nudge_arg_t *arg = (nudge_arg_t *)siginfo;
        if (!TEST(NUDGE_IS_SUSPEND, arg->flags)) {
#ifdef LINUX
            sig_full_initialize(&sc_full, ucxt);
8684: !!!!!!===> ostd->nudged_sigcxt = &sc_full;
#endif
            return handle_nudge_signal(dcontext, siginfo, ucxt);
        }
    }
...

P.S.
Sorry for the inconsistent and mixed info in this thread so far. I still hope to get through and fix the attach/detach issues one after another.


Does the translation code not handle a prefix instruction on a synchall==detach translation?  There are prefixes in x86 ibl targets: but they are rarer than aarch64 so I wonder if it's possible that such a failure has always been the case...are all your failure instances with a prefix translation?

Let's postpone investigation of this case and get back to it after resolution of the one above.

Thanks,
Alexei

Alexei Budankov

Jan 9, 2025, 1:38:13 PM
to DynamoRIO Users


On Monday, January 6, 2025 at 21:07:52 UTC+3, Derek Bruening wrote:
Now, as seen from the message below, nudge signal handling is postponed to detach_externally_on_new_stack().
detach_externally_on_new_stack() tries to use the signal context data, a pointer to which is preserved in a buggy fashion; and even with that fixed it would still be incorrect, since the signal context data is invalid outside of the signal handler's execution time.

So where could the correct mcontext be captured, to be used later to perform the detach procedure?
Any ideas on how and where to look and investigate are more than welcome.

Thanks in advance, Alexei 

Derek Bruening

Jan 9, 2025, 1:50:14 PM
to DynamoRIO Users
You mean the nudge detach calls thread_get_nudged_mcontext() for the nudged thread, while the nudge signal is always delayed and it should use the mcontext stored in the dcontext?  Was the nudge signal for detach not delayed in the past or something, or is there some other invocation point where the signal just arrived?  It would be worth looking into the history of that code.  How do the automated tests of detach nudges pass then?  The nudge point is close enough to the next fragment exit point that it doesn't matter?

Note that we use internally-triggered detach (start/stop API) extensively and detach is pretty stable on large x86 and aarch64 applications there: so problems observed are much more likely to be related to the signal trigger and thus the detaching thread itself and not the other threads.

At this point I would suggest filing an issue and putting further details there.

Alexei Budankov

Jan 10, 2025, 3:03:58 AM
to DynamoRIO Users


On Thursday, January 9, 2025 at 21:50:14 UTC+3, Derek Bruening wrote:
You mean the nudge detach calls thread_get_nudged_mcontext() for the nudged thread, while the nudge signal is always delayed and it should use the mcontext stored in the dcontext?

I mean that detach_externally_on_new_stack() calls thread_get_nudged_mcontext(threads[i], &my_mcontext) at line 2388 of core/synch.c, and that call happens outside of the signal handler's execution context, according to my debug logs:

--- start of log fragment ---
received nudge version=1 flags=0x0 mask=0x4 id=0x00000000 arg=0x0000000000000000
SYSLOG_INFORMATION: received nudge mask=0x4 id=0x00000000 arg=0x0000000000000000
SYSLOG_INFORMATION: Detach requested in code cache.

        unlinking outgoing for interrupted F2
unlinking outgoing from frag F2(0x0000ffff8526755c)
    unlinking branch F2.0x0000aaaa74d1d3b0
    unlinking branch F2.0x0000aaaa74d1d3b4
        main_signal_handler 4 returning now to 0x0000aaaa74d1d3ac

Exit from F2(0x0000ffff8526755c).0x0000aaaa74d1d3b0 (shared)
 (UNKNOWN DIRECT EXIT F2.0x0000aaaa74d1d3b0->F2)

d_r_dispatch: target = 0x0000ffff8526755c

        re-linking outgoing for interrupted F2
  linking outgoing links for F2(0x0000ffff8526755c)
    linking F2(0x0000ffff8526755c).0x0000aaaa74d1d3b0 -> F2(0x0000ffff8526755c)=0x0000aaaa74d1d208
    linking F2(0x0000ffff8526755c).0x0000aaaa74d1d3b4 -> F3(0x0000ffff8526758c)=0x0000aaaa74d1d414
SYSLOG_INFORMATION: Detaching from application /home/abudanko/python3/bin/python3.8 (1192499)
synch with all threads my id = 1192499 Giving 4 permission and seeking 5 state
add_process_lock: 0 lock 0x00000000715960e8: name=all_threads_synch_lock(mutex)@/home/abudanko/dynamorio/core/synch.c:94
rank=2 owner=1192499 owning_dc=0x0000fffd7ff88680 contended_event=0xffffffff prev=0x0000000000000000

lock count_times_acquired=       1                              0                               0                              0
             0+2 all_threads_synch_lock(mutex)@/home/abudanko/dynamorio/core/synch.c:94

Looping over all threads (1 threads)
Skipping synch with thread 1192499

Finished synch with all threads: result=1
        returning holding initexit_lock and all_threads_synch_lock
add_process_lock: 0 lock 0x0000fffd7ff46ba0: name=synch_lock(mutex)@/home/abudanko/dynamorio/core/synch.c:144
rank=10 owner=1192499 owning_dc=0x0000fffd7ff88680 contended_event=0x8ffffffff prev=0x0000000000000000

lock count_times_acquired=       1                              0                               0                              0
             0+2 synch_lock(mutex)@/home/abudanko/dynamorio/core/synch.c:144
translate context, thread 1192499 at pc_recreatable spot translating
...
--- end of log fragment ---

Neither the signal context captured in the signal handler nor the mcontext extracted from the dcontext via get_mcontext(my_dcontext) can be used to successfully complete the detach procedure in detach_externally_on_new_stack(); it ends up in:
SYSLOG_ERROR: Application /home/abudanko/python3/bin/python3.8 (1192499) DynamoRIO usage error : meta-instr faulted?  must set translation field and handle fault!

  Was the nudge signal for detach not delayed in the past or something, or is there some other invocation point where the signal just arrived? 

I am currently investigating only the case where the nudge detach signal, sent from another process/console, hits a code cache fragment.
I have no information on the past behavior of the code yet.

It would be worth looking into the history of that code.  How do the automated tests of detach nudges pass then? 

I would be happy to run the attach/detach automated tests locally to see what happens and get back with results.
I would appreciate more references (beyond https://dynamorio.org/page_test_suite.html#autotoc_md306) to docs on how to set up and run the automated tests.
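
For what it's worth, the test suite described on that page is driven by CTest from the build directory; a hedged sketch of commands (the test name patterns here are assumptions, so listing first with `ctest -N` to see the actual names is advisable):

```shell
# From a DynamoRIO build directory configured with tests enabled:
ctest -N | grep -i detach      # list available detach-related tests by name
ctest -R detach -V             # run them with verbose output
ctest -R 'attach|nudge' -V     # similarly for attach/nudge-related tests
```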
 
The nudge point is close enough to the next fragment exit point that it doesn't matter?

The nudge point is inside a code cache fragment; see the log fragment above.
 

Note that we use internally-triggered detach (start/stop API) extensively and detach is pretty stable on large x86 and aarch64 applications there:

I would be happy to see examples of such API usage, if they are available as part of the DynamoRIO repository code.
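
The internally-triggered path Derek mentions is exposed through DynamoRIO's app API (dr_app_setup()/dr_app_start()/dr_app_stop_and_cleanup() in dr_app.h). A minimal sketch of that flow, shown untested since it requires linking against the DynamoRIO SDK libraries:

```c
/* Sketch only: requires the DynamoRIO SDK (dr_app.h and its libraries). */
#include "dr_app.h"

void
run_region_under_dr(void (*workload)(void))
{
    dr_app_setup();            /* initialize DynamoRIO in this process */
    dr_app_start();            /* take over: code now runs from the code cache */
    workload();                /* the instrumented region */
    dr_app_stop_and_cleanup(); /* internally-triggered detach back to native */
}
```

The detach performed by dr_app_stop_and_cleanup() is the start/stop-API path, as opposed to the signal-triggered nudge path discussed above.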

so problems observed are much more likely to be related to the signal trigger and thus the detaching thread itself and not the other threads.
 
The issue might be down to some specifics of fully asynchronous nudge signal delivery (from a console), and could require some additional dedicated handling in the code.
It should be possible to support this case as well.


At this point I would suggest filing an issue and putting further details there.

OK. I will sort things out and submit issues one by one.