Detach on Linux does not work on the first attempt (second mail thread)


Artem Shcherbak

Mar 26, 2024, 5:56:35 AM
to DynamoRIO Users
I'm sorry, but my replies were removed from my previous thread (apparently a failure in Google Groups), so I had to create a second thread.

As a reminder, I asked about detach on Linux:
"""
I detach by running "/bin64/drconfig -detach pid", which sends a signal to the application (I had already attached to this app).

The detach usually does not work on the first attempt (sometimes it does, but in general the number of attempts needed is unpredictable). After debugging, it turned out that the detach succeeds only when the signal arrives at a safe nolinking spot. See the code below. If these checks do not pass, the detach does not occur and the signal has to be sent again.

I plan to make the detach work on the first attempt and send a PR. Could you tell me how to implement this? Is it possible to check some marker that shows exactly when the detach can be performed?
"""

And Derek Bruening's reply:

"""
The code you show is unlinking to ensure the thread goes back to dispatch and sees the pending nudge.  But if the thread is not in the cache it should end up at dispatch within a reasonable time frame on its own: so the question is why didn't that happen in your cases?  Where did the signal arrive?  Was it successfully marked pending but the thread never reached dispatch before you re-sent it?  Did you end up with multiple pending nudges?  Did it arrive at fcache_enter or other glue code? I thought the pending signal code had handling for all of those corner cases: but maybe only for app signals and not nudge signals?
"""

And now my new message:

I've tested this on AArch64 with several examples that do CPU-intensive work, such as an infinite loop with a simple arithmetic operation. The detach usually succeeds after 3-5 attempts. If I insert a 0.1s sleep in the loop, the detach reliably succeeds on the first attempt.
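A minimal sketch of the kind of CPU-bound test loop described above. The name toy_busy_loop and its parameters are made up for illustration and are not the actual test app; an iteration cap is added so the loop can also be run in a bounded way. One plausible reading of the sleep observation is that each sleep is a system call that regularly takes the thread out of the code cache, but that is an assumption, not something confirmed in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical test workload: simple arithmetic in a loop.
 * iterations == 0 means "loop forever" (the case discussed in the thread).
 * sleep_ms > 0 inserts a sleep in every iteration (e.g. 100 for 0.1s).
 * Returns the accumulated value when the loop is bounded. */
uint64_t toy_busy_loop(uint64_t iterations, long sleep_ms)
{
    volatile uint64_t acc = 0; /* volatile so the loop is not optimized away */
    assert(sleep_ms >= 0);
    for (uint64_t i = 0; iterations == 0 || i < iterations; i++) {
        acc += i * 3 + 1;
        if (sleep_ms > 0) {
            struct timespec ts;
            ts.tv_sec = sleep_ms / 1000;
            ts.tv_nsec = (sleep_ms % 1000) * 1000000L;
            nanosleep(&ts, NULL);
        }
    }
    return acc;
}
```

Calling toy_busy_loop(0, 0) from main reproduces the pure CPU-bound case, and toy_busy_loop(0, 100) the variant with a 0.1s sleep.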

 

Debug log for a simple app (an infinite loop with a simple arithmetic operation):

 

dispatch.c:374 dispatch_enter_fcache()
fragment.c:5686 enter_nolinking() 0  tag -1307965664

--- send signal, first attempt --- (bin64/drconfig -detach `pidof simple`)

Start signal handler:
  signal.c:6098 main_signal_handler_C() sig 4 call handle_suspend_signal
  signal.c:8507 handle_suspend_signal()
  signal.c:8672 handle_nudge_signal()
  safe_is_in_fcache() check on fcache_fragment_pclookup (not found pc in the table)
  signal.c:8748 call nudge_add_pending()
  nudge.c:474 nudge_add_pending pending -1220444792, version=1 flags=0x0 mask=0x4 id=0x00000000
  nudge.c:489 nudge_add_pending change pending -1220444792
  signal.c:6110 main_signal_handler_C() sig 4 after call handle_suspend_signal

--- send signal, second attempt --- (bin64/drconfig -detach `pidof simple`)

Start signal handler:
  signal.c:6098 main_signal_handler_C() sig 4 call handle_suspend_signal
  signal.c:8507 handle_suspend_signal()
  signal.c:8672 handle_nudge_signal()
  safe_is_in_fcache() check on fcache_fragment_pclookup (not found pc in the table)
  signal.c:8748 call nudge_add_pending()
  nudge.c:474 nudge_add_pending pending -1220425635, version=1 flags=0x0 mask=0x4 id=0x00000000
  signal.c:6110 main_signal_handler_C() sig 4 after call handle_suspend_signal

--- send signal, third attempt (successful) --- (bin64/drconfig -detach `pidof simple`)

Start signal handler:
  signal.c:6098 main_signal_handler_C() sig 4 call handle_suspend_signal
  signal.c:8507 handle_suspend_signal()
  signal.c:8672 handle_nudge_signal()
  safe_is_in_fcache() check on fcache_fragment_pclookup (found pc in the table)
  signal.c:4480 unlink_fragment_for_signal()       (this does not happen in previous attempts)
  signal.c:8748 call nudge_add_pending
  nudge.c:474 nudge_add_pending pending -1220457864, version=1 flags=0x0 mask=0x4 id=0x00000000
  signal.c:6110 main_signal_handler_C() sig 4 after call handle_suspend_signal

d_r_dispatch()
dispatch.c:374 dispatch_enter_fcache()
fragment.c:5686 enter_nolinking -1220444792  tag -1307965664
fragment.c:5693 dcontext->interrupted_for_nudge != NULL -1220457864
fragment.c:5706 call handle_nudge()
nudge.c:291 handle_nudge()
synch.c:2304 detach_externally_on_new_stack()
fcache_fragment_pclookup

 

Should I add anything else to the debug log?

BR, 

Artem




Derek Bruening

Mar 26, 2024, 12:01:49 PM
to Artem Shcherbak, DynamoRIO Users
Sometimes things are flagged by the Groups spam rule and put into a pending queue, and we are only notified a day later: so only today do we notice 5 different messages from you in the pending queue from the last 2 days.  I assume those can be just removed at this point with this other thread here.

Yes the key is to figure out why the first attempt did not go back to dispatch: where was that thread when the detach signal interrupted it?
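The "mark pending, then handle at dispatch" pattern discussed here can be illustrated with a small standalone sketch. This is not DynamoRIO code: the toy_ names are hypothetical, SIGUSR1 stands in for the real nudge signal, and the single flag stands in for the pending-nudge state. It only shows why a nudge that is marked pending has no effect until the thread next reaches its dispatch point.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the per-thread pending-nudge state. */
static volatile sig_atomic_t toy_nudge_pending = 0;

/* The handler does no real work: it only records that a nudge arrived. */
static void toy_nudge_handler(int sig)
{
    (void)sig;
    toy_nudge_pending = 1;
}

/* Installs the toy handler on SIGUSR1. Returns 0 on success. */
int toy_install_nudge_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = toy_nudge_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Called whenever the toy "thread" comes back to its dispatcher.
 * Returns true if a pending nudge was found and consumed. If the thread
 * never comes back here, the pending nudge is never acted upon. */
bool toy_dispatch_check_nudge(void)
{
    if (toy_nudge_pending) {
        toy_nudge_pending = 0;
        return true;
    }
    return false;
}
```

In this model, raising SIGUSR1 only sets the flag; the detach-like action would happen on the next call to toy_dispatch_check_nudge(), which is why the time until the thread returns to dispatch matters.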


Derek Bruening

Mar 27, 2024, 3:50:18 PM
to Artem Shcherbak, DynamoRIO Users
If the application was in the code cache, why did safe_is_in_fcache() say it was not?  Was it in some code in between like the indirect branch lookup routine?  You would want the precise PC where the failed signal interrupted it: where is that and why is that not returning true for safe_is_in_fcache().  See the earlier comments about gencode corner cases.  Regular signals destined for the app go through all these corner cases; but the nudge signals do not?  Is that the difference?

On Wed, Mar 27, 2024 at 3:26 PM Artem Shcherbak <artem.sh...@gmail.com> wrote:
The application is in the code cache (there is a simple infinite loop with an arithmetic operation). Therefore, the dispatcher is not called. 

Now we need to understand where that thread was when the detach signal successfully interrupted it.

On Tuesday, March 26, 2024 at 19:01:49 UTC+3, Derek Bruening wrote:

Derek Bruening

Mar 29, 2024, 11:19:32 PM
to Artem Shcherbak, DynamoRIO Users
There is a known case for signals that is not well-handled today: a signal arriving during a client's clean call out of the cache can result in unbounded delivery, filed as https://github.com/DynamoRIO/dynamorio/issues/569.  The same thing would happen on detach.  Yet you listed the ones that failed as being in libdynamorio.so itself: I would expect those to work since if it's in DR itself it should go back to dispatch before re-entering the cache as DR avoids using clean calls itself.

What are the function names in libdynamorio.so that failed to go back to dispatch?

On Fri, Mar 29, 2024 at 2:39 PM Artem Shcherbak <artem.sh...@gmail.com> wrote:

Yes, you were right: the detach succeeds when the check for execution in the code cache passes, and when the detach does not go through we are in client code or in dynamo_dll.

This check in safe_is_in_fcache() is triggered for drcachesim client:

                  is_in_client_lib(pc)

This check in safe_is_in_fcache() is triggered while the trace buffer is being written to a file:

                  is_in_dynamo_dll(pc)

 

Cache entry at this location: entry 0x0000aaaa944f50c0 pc 0x0000aaaa944fb008

    Not success detach.   in "dynamo_dll"   pc 0x0000000071203484
    Not success detach.   in "dynamo_dll"   pc 0x000000007120349c
    Not success detach.   in "dynamo_dll"   pc 0x000000007120349c
    Success detach.       in "fcache"       pc 0x0000aaaa944fb08c
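To make the classification above concrete, here is a toy sketch of sorting an interrupted PC into regions. It is not how safe_is_in_fcache() is implemented; the toy_ names and the region bounds in the example table are assumptions chosen only so that the PCs reported above fall into the regions they were reported in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    TOY_IN_FCACHE,
    TOY_IN_DR_LIB,
    TOY_ELSEWHERE
} toy_where_t;

typedef struct {
    uint64_t start; /* inclusive */
    uint64_t end;   /* exclusive */
    toy_where_t where;
} toy_region_t;

/* Hypothetical address ranges, for illustration only. */
static const toy_region_t toy_example_regions[] = {
    { 0x0000aaaa944f0000ULL, 0x0000aaaa94500000ULL, TOY_IN_FCACHE },
    { 0x0000000071000000ULL, 0x0000000071400000ULL, TOY_IN_DR_LIB },
};

/* Returns the region kind that contains pc, or TOY_ELSEWHERE. */
toy_where_t toy_classify_pc(const toy_region_t *regions, size_t count, uint64_t pc)
{
    assert(regions != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (pc >= regions[i].start && pc < regions[i].end)
            return regions[i].where;
    }
    return TOY_ELSEWHERE;
}

/* In this toy model only a PC inside the code cache leads to an unlink,
 * which matches the behavior observed in the log above. */
bool toy_nudge_would_unlink(toy_where_t where)
{
    return where == TOY_IN_FCACHE;
}
```

With this table, 0x0000000071203484 classifies as TOY_IN_DR_LIB (no unlink) and 0x0000aaaa944fb08c as TOY_IN_FCACHE (unlink).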

 

If I run drrun without any client on the same app (a simple infinite loop), the detach succeeds every time.

So, as I understand it, we need to add handling of the nudge for the case where the application is not in the code cache.

Could you help with determining the appropriate place to add this feature?

Are regular signals also handled in main_signal_handler_C()?

On Wednesday, March 27, 2024 at 22:50:18 UTC+3, Derek Bruening wrote:

Artem Shcherbak

Apr 1, 2024, 8:25:28 AM
to DynamoRIO Users

Added function names from libdynamorio.so.

And what if, after sending a nudge signal through "drconfig -detach", the sender waited for a response? If the detach did not happen (for example, because the handler found that the thread was not in the code cache), the target would signal back that it was not detached, and the sender would resend the signal until it succeeds.
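The resend-until-acknowledged idea can be sketched as a bounded retry loop. Everything here is hypothetical: drconfig has no such acknowledgment mechanism according to this thread, and the toy_ names, the callback type, and the fake target are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: sends one detach request and reports whether
 * the target acknowledged that the detach actually happened. */
typedef bool (*toy_detach_attempt_fn)(void *ctx);

/* Retries until the attempt succeeds or max_attempts is reached.
 * Returns the number of attempts used, or 0 if none succeeded. */
int toy_detach_with_retry(toy_detach_attempt_fn attempt, void *ctx, int max_attempts)
{
    assert(attempt != NULL);
    for (int i = 1; i <= max_attempts; i++) {
        if (attempt(ctx))
            return i;
    }
    return 0;
}

/* Fake target for demonstration: acknowledges on the Nth request,
 * mimicking a thread that is only sometimes inside the code cache. */
typedef struct {
    int calls;
    int succeed_on;
} toy_fake_target_t;

bool toy_fake_attempt(void *ctx)
{
    toy_fake_target_t *target = (toy_fake_target_t *)ctx;
    target->calls++;
    return target->calls >= target->succeed_on;
}
```

A bounded retry like this only hides the symptom; the unbounded delay inside the target, which the rest of the thread focuses on, would remain.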

 

dispatch.c:550 enter_fcache()  entry 0x0000aaaa888850c0 pc 0x0000aaaa8888b008

…..

Not success detach. pc 0x00000000712036d0     /data/disk5/artemshc/dynamorio_mica/build/lib64/release/libdynamorio.so: memcpy + 8

Not success detach. pc 0x000000007120157c     /data/disk5/artemshc/dynamorio_mica/build/lib64/release/libdynamorio.so: dynamorio_syscall + 44

Not success detach. pc 0x00000000712036d4     /data/disk5/artemshc/dynamorio_mica/build/lib64/release/libdynamorio.so: memcpy + 12

Not success detach. pc 0x00000000712036f4     /data/disk5/artemshc/dynamorio_mica/build/lib64/release/libdynamorio.so: memset + 16

Not success detach. pc 0x00000000712036d4     /data/disk5/artemshc/dynamorio_mica/build/lib64/release/libdynamorio.so: memcpy + 12

Not success detach. pc 0x0000fffd81275458     /usr/lib/aarch64-linux-gnu/liblz4.so.1.9.2

Not success detach. pc 0x0000fffd8127544c     /usr/lib/aarch64-linux-gnu/liblz4.so.1.9.2

Success detach.    pc 0x0000aaaa8888b038   in code cache.


Artem

On Saturday, March 30, 2024 at 06:19:32 UTC+3, Derek Bruening wrote:

Derek Bruening

Apr 1, 2024, 11:29:13 AM
to Artem Shcherbak, DynamoRIO Users
The callstack is needed to understand the memcpy, memset, and dynamorio_syscall cases.  Are they all called from client clean calls?  The lz4 certainly is.  If they all are, then that is good news: everything falls under the aforementioned issue #569.  That is what needs to be implemented, if you would like to try to tackle that.

Artem Shcherbak

Apr 3, 2024, 9:00:56 AM
to DynamoRIO Users

The call stacks (see below) show that memcpy and memset were called from the client. dynamorio_syscall was not caught under the debugger.

So, I think we can try fixing the nudge part of bug #569 and see what happens with detach.

As I understand from the bug description, we need to record the address of the current fragment each time before calling a clean call, and reset this value on return from the clean call. If a nudge signal arrives while we are not in the code cache, we then unlink the recorded fragment. Am I getting it right?
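The bookkeeping proposed above can be sketched as a small state machine. This is a toy model, not DynamoRIO code: the toy_ types and functions are hypothetical, and "unlinking" is reduced to clearing a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy fragment: linked means its exits may bypass the dispatcher. */
typedef struct {
    int tag;
    bool linked;
} toy_fragment_t;

/* Toy per-thread state. */
typedef struct {
    toy_fragment_t *clean_call_origin; /* fragment that made the current clean call */
    bool nudge_pending;
} toy_thread_t;

/* Record the calling fragment just before a clean call starts. */
void toy_clean_call_enter(toy_thread_t *thread, toy_fragment_t *caller)
{
    assert(thread != NULL && caller != NULL);
    thread->clean_call_origin = caller;
}

/* Reset the record when the clean call returns to the code cache. */
void toy_clean_call_exit(toy_thread_t *thread)
{
    assert(thread != NULL);
    thread->clean_call_origin = NULL;
}

/* Nudge arrival. `interrupted` is the fragment containing the interrupted
 * PC, or NULL if the PC was outside the code cache. Returns the fragment
 * that was unlinked, or NULL if there was nothing to unlink. */
toy_fragment_t *toy_on_nudge(toy_thread_t *thread, toy_fragment_t *interrupted)
{
    toy_fragment_t *target;
    assert(thread != NULL);
    target = interrupted != NULL ? interrupted : thread->clean_call_origin;
    thread->nudge_pending = true;
    if (target != NULL)
        target->linked = false;
    return target;
}
```

The cost of this approach is a store before and after every clean call, which is presumably why locating the return address on the stack is raised as an alternative later in the thread.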

 


#0  memset () at dynamorio/core/arch/aarch64/memfuncs.asm:71
#1  0x0000ffffb3fb99e4 in dynamorio::drmemtrace::online_instru_t::append_thread_header (this=0x10218b3b409e140,
      buf_ptr=0xfffdb4097000 "Hp\t\264\375\377", tid=65533, file_type=3019990656)
      at dynamorio/clients/drcachesim/tracer/instru_online.cpp:188
#2  0x0000ffffb3fa7f84 in dynamorio::drmemtrace::event_post_syscall (drcontext=0x714847d0 <get_dr_tls_base_addr+16>,
      sysnum=0) at dynamorio/clients/drcachesim/tracer/tracer.cpp:1624

 

 

#0  memcpy () at dynamorio/core/arch/aarch64/memfuncs.asm:55
#1  0x000000007120316c in d_r_memmove (dst=0xfffdb40b1050, src=0xfffdb40f1050, n=65536)
    at dynamorio/core/string.c:179
#2  0x0000fffff7d734a4 in encode_opndsgen_6594a000_00001fff (pc=0xfffff7aac610 <extend_unit_end+1420> " \a",
      instr=0xfffdb40988b0, enc=65535, di=0xfffdb402eee0)
      at dynamorio/build_debug/opnd_encode_funcs.h:16398
#3  0x0000fffff7d771a8 in encode_opndsgen_8540c000_003f1fff (pc=0x0, instr=0xfffdb402ef80, enc=1, di=0x6610)
      at dynamorio/build_debug/opnd_encode_funcs.h:16839
#4  0x0000ffffb3fb7058 in dynamorio::drmemtrace::offline_instru_t::get_modoffs (this=0xfffdb401f708,
      drcontext=0xfffff3f8b610, pc=0x0, modidx=0xfffff3f8b730)
      at dynamorio/clients/drcachesim/tracer/instru_offline.cpp:491
#5  0x0000ffffb3fb7f70 in dynamorio::drmemtrace::offline_instru_t::instr_has_multiple_different_memrefs (this=0x0, instr=0x0)
      at dynamorio/clients/drcachesim/tracer/instru_offline.cpp:731


Artem


On Monday, April 1, 2024 at 18:29:13 UTC+3, Derek Bruening wrote:

Derek Bruening

Apr 3, 2024, 11:24:21 AM
to Artem Shcherbak, DynamoRIO Users
An alternative would be to obtain the return address in the code cache, if it's simple and reliable to locate: which it might well be, given that get_priv_mcontext_from_dstack() easily locates the mcontext laid out on the stack in a standard way.

Artem Shcherbak

Apr 5, 2024, 8:39:19 AM
to DynamoRIO Users

The way to find the return address from the stack, as I understand, should be something like this (taken from the find_next_fragment_from_gencode function):

 

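    cache_pc retaddr = NULL;
    /* Locate the return-address slot relative to the top of the dstack. */
    byte *ra_slot =
        dcontext->dstack - get_clean_call_switch_stack_size() - sizeof(retaddr);
    /* Adjust for the temp stack space used while still in the clean call save code. */
    if (in_clean_call_save(dcontext, dcontext->interrupted_pc)) {
        ra_slot -= get_clean_call_temp_stack_size();
    }
    /* Read the slot safely in case the computed address is not readable. */
    if (d_r_safe_read(ra_slot, sizeof(retaddr), &retaddr)) {
        dr_printf("RETADDR %p\n", retaddr);
    }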

 

At the same time, it is not clear how to use it. As I understand it, the unlink should be done while we are in the code cache, but we cannot do a delayed signal check and unlink from inside the code cache, because the code there is the application's own code. I also tried unlinking just before exiting the clean call, and unlinking in the signal handler itself by passing the last fragment to the unlink_fragment_for_signal function. In both cases the detach works, but the application crashes with SIGSEGV.

Where should we use the return address and do the unlink?


On Wednesday, April 3, 2024 at 18:24:21 UTC+3, Derek Bruening wrote:

Artem Shcherbak

Apr 5, 2024, 9:14:01 AM
to DynamoRIO Users

BTW, what happens when we unlink? The code shows that the unlink is applied to each branch in the code cache, but it is not clear what unlinking means for a branch.


On Friday, April 5, 2024 at 15:39:19 UTC+3, Artem Shcherbak wrote:

Derek Bruening

Apr 5, 2024, 11:54:28 AM
to Artem Shcherbak, DynamoRIO Users
For the concepts of linking and unlinking: see the tutorial slides https://github.com/DynamoRIO/dynamorio/releases/download/release_7_0_0_rc1/DynamoRIO-tutorial-feb2017.pdf slides 36-41 on linking.  Unlinking modifies the exit branches so the block goes back to the dispatcher instead of to another block.

I think we want to call find_next_fragment_from_gencode() for detach too to handle the clean call save/restore code.  If it's not in those and not in_fcache, I think what you would do is see whether whereami is DR_WHERE_CLEAN_CALLEE.  If that's the case, that should indicate it's in the initial callee or something it called.  Sanity check by confirming it's on the dstack and not an app stack.  Then you know there's an mcontext laid out on the dstack and I believe the return address slot should be in a constant location (IIRC clean call optimizations that only save some regs still use a full mcontext layout?).  Then you pclookup and unlink that fragment and when the clean call returns it will go back to dispatch.  There can still be delays if the clean call blocks for i/o or something but at least it's not unbounded.
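The decision sequence described above can be laid out as a small sketch. It is only a model of the reasoning, under the assumption that the stack grows down and that the dstack is identified by its top address and size; the toy_ enums and functions are hypothetical and do not correspond to actual DynamoRIO declarations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Where the thread was when the nudge signal arrived (toy version). */
typedef enum {
    TOY_WHERE_FCACHE,          /* interrupted PC is inside a fragment */
    TOY_WHERE_CLEAN_CALL_GLUE, /* in clean call save/restore gencode */
    TOY_WHERE_CLEAN_CALLEE,    /* in the clean call target or its callees */
    TOY_WHERE_OTHER
} toy_whereami_t;

/* Which fragment, if any, to unlink. */
typedef enum {
    TOY_UNLINK_INTERRUPTED_FRAGMENT,
    TOY_UNLINK_VIA_GENCODE_LOOKUP,
    TOY_UNLINK_VIA_DSTACK_RETADDR,
    TOY_NO_UNLINK
} toy_action_t;

/* Sanity check: is sp within (dstack_top - dstack_size, dstack_top]? */
bool toy_on_dstack(uintptr_t sp, uintptr_t dstack_top, size_t dstack_size)
{
    assert(dstack_size <= dstack_top);
    return sp <= dstack_top && sp > dstack_top - dstack_size;
}

toy_action_t toy_choose_unlink(toy_whereami_t where, uintptr_t sp,
                               uintptr_t dstack_top, size_t dstack_size)
{
    switch (where) {
    case TOY_WHERE_FCACHE:
        return TOY_UNLINK_INTERRUPTED_FRAGMENT;
    case TOY_WHERE_CLEAN_CALL_GLUE:
        return TOY_UNLINK_VIA_GENCODE_LOOKUP;
    case TOY_WHERE_CLEAN_CALLEE:
        /* Only trust the saved return address if we really are on the dstack. */
        return toy_on_dstack(sp, dstack_top, dstack_size)
            ? TOY_UNLINK_VIA_DSTACK_RETADDR
            : TOY_NO_UNLINK;
    default:
        return TOY_NO_UNLINK;
    }
}
```

In the third case the real code would then read the return address slot, look up the containing fragment, and unlink it, so that the clean call's return leads back to dispatch.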

Artem Shcherbak

Apr 10, 2024, 10:10:49 AM
to DynamoRIO Users

Thank you for reminding us of the unlink concept. Now it's clear why we should call unlink.


On Friday, April 5, 2024 at 18:54:28 UTC+3, Derek Bruening wrote:

Artem Shcherbak

Apr 10, 2024, 10:20:21 AM
to DynamoRIO Users
My message wasn't posted here, so I posted it in the issue: https://github.com/DynamoRIO/dynamorio/issues/569

On Wednesday, April 10, 2024 at 17:10:49 UTC+3, Artem Shcherbak wrote:

Artem Shcherbak

Apr 12, 2024, 11:30:55 AM
to DynamoRIO Users
Now I'm trying to replace the signal mcontext with the one we switch to after exiting the clean call, and to pass it in the detach procedure to the translate_mcontext(threads[i], &my_mcontext, true, NULL) and thread_set_self_mcontext(&my_mcontext, true) functions.
How can I get the mcontext for the first instruction in the code cache after the clean call exits?

On Wednesday, April 10, 2024 at 17:20:21 UTC+3, Artem Shcherbak wrote:

Derek Bruening

Apr 15, 2024, 10:27:51 AM
to Artem Shcherbak, DynamoRIO Users
The mcontext for the code cache state while in the clean call is obtained in dr_get_mcontext() from the dstack: we would want the same mcontext.
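The idea of recovering saved machine state from a fixed position on a private stack can be sketched as follows. This is a toy layout, not DynamoRIO's: the toy_mcontext_t fields and the assumption that the context sits directly below the stack top are invented for illustration, and the real layout used by dr_get_mcontext() should be taken from the DynamoRIO source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy machine context: a few registers plus stack pointer and PC. */
typedef struct {
    uint64_t regs[4];
    uint64_t sp;
    uint64_t pc;
} toy_mcontext_t;

/* Assumed layout: on clean call entry the full context is stored
 * immediately below the top of the private stack. */
void toy_save_mcontext(unsigned char *stack_top, const toy_mcontext_t *mc)
{
    assert(stack_top != NULL && mc != NULL);
    memcpy(stack_top - sizeof(*mc), mc, sizeof(*mc));
}

/* Because the offset is constant, the saved context can be recovered
 * later knowing only the stack top, e.g. from a signal handler that
 * interrupted the clean callee. */
void toy_get_mcontext_from_stack(const unsigned char *stack_top, toy_mcontext_t *out)
{
    assert(stack_top != NULL && out != NULL);
    memcpy(out, stack_top - sizeof(*out), sizeof(*out));
}
```

The recovered context describes the code cache state at the point of the clean call, which is the state a detach would need to translate back to application state.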
