Locking in clean calls and drwrap_wrap pre/post callbacks

Michael Flanders

Nov 24, 2025, 1:23:53 PM
to DynamoRIO Users
Hello!

I am running DynamoRIO and my DR client on a Windows 10 IA-32 application in debug mode. This is all running in a 4-core, 64 GB, 64-bit Windows 10 VM.

I am inserting clean calls before nearly every instruction and have used `drwrap_wrap` to wrap some file-manipulation and memory-allocation functions such as `operator new`, `malloc`, `CreateFileA`, etc. I am aware that this many clean calls is not ideal, but this is a prototype and performance is not much of a concern at the moment.

Error
My `drwrap_wrap` callbacks and clean call functions both access the same global arrays and vectors using `drvector_lock` and `dr_mutex_lock`. I have triple checked that every lock has a corresponding unlock on all code paths, but I am running into the following error:
```
Finished synch with all threads: result=0
SYSLOG_WARNING: Failed to suspend for ProcessTlsInformation
SYSLOG_ERROR: Application ... (5644).  Internal Error: DynamoRIO debug check failure: C:\Users\name\Desktop\dynamorio\core\dispatch.c:545 dc == NULL || OWN_NO_LOCKS(dc)
(Error occurred @260272 frags in tid 1804)
version 11.3.20098, custom build
```

From the logs, it looks like `synch_with_all_threads` is called from `presys_SetInformationProcess`, and the number of times DR loops while waiting for all threads to suspend exceeds `max_loops`, so DR stops attempting to suspend all threads. It seems that when execution then tries to enter the code cache again, I get this OWN_NO_LOCKS failure. I also get this assert curiosity:
```
SYSLOG_WARNING: CURIOSITY : loop_count < max_loops in file C:\Users\name\Desktop\dynamorio\core\synch.c line 1488
```
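To make the `max_loops` behavior described above concrete, here is a toy model in plain C. It is not DR's code: `toy_thread_t`, `loops_until_safe`, and `toy_synch_all` are hypothetical names, and the model only assumes what the logs suggest, namely that the synch loop polls a bounded number of times and gives up if some thread never reaches a suspendable point in that window.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each thread needs this many more polling rounds
 * before it reaches a point where it can be suspended (for example,
 * because it is still inside a long clean call). */
typedef struct {
    unsigned loops_until_safe;
} toy_thread_t;

/* Polls all threads up to max_loops times. Returns true if a round is
 * reached in which every thread is suspendable, false if the loop
 * budget runs out first. */
bool toy_synch_all(toy_thread_t *threads, size_t n, unsigned max_loops)
{
    for (unsigned loop = 0; loop < max_loops; loop++) {
        bool all_safe = true;
        for (size_t i = 0; i < n; i++) {
            if (threads[i].loops_until_safe > 0) {
                threads[i].loops_until_safe--; /* thread makes progress */
                all_safe = false;
            }
        }
        if (all_safe)
            return true;
    }
    return false; /* gave up: analogous to the loop_count curiosity */
}
```

In this model, a thread that needs 3 more rounds fails with a budget of 2 and succeeds with a budget of 4. That matches the observation that raising `synch_all_threads_max_loops` can hide the problem for a slightly heavier workload without fixing it.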

With loglevel=3, the log shows that the threads that are not at a safe spot to suspend are commonly at the following PCs (among others), which I resolved with symquery:
```
thread 8092 not at safe spot (pc=0x6d2463d2) for 4
PS C:\Users\name\Desktop> .\drmemory\bin\symquery.exe -e ..\dynamorio\build32\lib32\debug\dynamorio.dll  -f -a 0x3253d2
d_r_print_encoding_first_line_to_buffer+0x62
```
and this one in ntdll:
```
thread 6376 not at safe spot (pc=0x7743324c) for 4
```
which looks like syscall or code cache code?
```
interp: start_pc = 0x774332a0
  0x774332a0  b8 0e 00 07 00       mov    eax, 0x0007000e
  0x774332a5  ba 50 91 44 77       mov    edx, 0x77449150
  0x774332aa  ff d2                call   edx
end_pc = 0x774332ac
find_syscall_num: found syscall number write: 458766
syscall # is 458766
found optimizable system call 0x7000e
ending bb at syscall & NOT removing the interrupt itself
setting cur_pc (for fall-through) to 0x774332ac
exit_branch_type=0x0 bb->exit_target=0x18e1bc00
emit_fragment: bb use ibl <0x18e1bc00>
exit_branch_type=0x0 target=0x18e1bc00 l->flags=0x9002
Fragment 8713, tag 0x774332a0, flags 0x1801030, shared, size 26, must end trace:
        [ntdll.dll!ZwSetEvent]
Entry into F8713(0x774332a0).0x1db85920 (shared)
        [ntdll.dll!ZwSetEvent]
Exit from sourceless ibl: trace jmp*    []
 (target 0x774332ac not in cache)
fragment_add_ibl_target tag 0x774332ac, branch 2, F0

d_r_dispatch: target = 0x774332ac

interp: start_pc = 0x774332ac
  0x774332ac  c2 08 00             ret    0x0008
end_pc = 0x774332af
```
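As a small aid for reading the log above: it prints the same syscall number once in decimal (458766) and once in hex (0x7000e), matching the `mov eax, 0x0007000e` immediate. The sketch below checks that equivalence and splits the value into its 16-bit halves. The helper names are hypothetical, and reading the upper half as something separate from the service index (as is often described for WOW64 processes) is an assumption, not something the log states.

```c
#include <stdint.h>

/* Low 16 bits of the value loaded into eax before the syscall. */
uint32_t sys_low16(uint32_t eax)
{
    return eax & 0xffffu;
}

/* High 16 bits of the value loaded into eax before the syscall. */
uint32_t sys_high16(uint32_t eax)
{
    return eax >> 16;
}
```

For the value in the log, `sys_low16(0x0007000e)` is `0x000e` and `sys_high16(0x0007000e)` is `0x0007`, since 7 * 65536 + 14 = 458766.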

I also have a lot of SYSLOG warnings about memory consumption:
```
program.exe.0.1804.html:85949:SYSLOG_WARNING: Application program.exe (5644). Out of contiguous memory. Alloc type: 0x22.
program.exe.0.1804.html:85950:SYSLOG_WARNING: Application program.exe (5644). Out of contiguous memory. Alloc type: 0x41.
```

Debugging so far
While debugging this, I removed instructions added by my instrumentation (such as those computing args to clean calls) to try to figure out whether a particular instruction caused this. I noticed that after removing a certain number of added instructions, I no longer got this error and execution succeeded reliably. After adding back one more instrumentation instruction that does not touch locks, I started reliably getting the error again, regardless of which single instruction I added back. I can add this one instruction without getting the error by bumping `synch_all_threads_max_loops`, but this doesn't work when I add back all my instrumentation instructions, and it doesn't seem like a proper solution. I am guessing that at a certain basic block size, or total memory size of basic blocks, things get moved around, triggering a suspend-all-threads that fails?

Solution?
I am sort of stuck as far as pinpointing the exact cause of this problem, but I was thinking the problem could be:
  1. using the same locks in clean calls (app code?) and `drwrap_wrap` callbacks (DR code).
  2. the amount of generated instrumentation code, since the problem appears only after a certain amount of instrumentation code is added (see above).
For 1), I was reading through the docs, and it seems like a fix might be some combination of the functions `dr_app_recurlock_lock`, `dr_mark_safe_to_suspend`, and/or `dr_redirect_execution`. The locks used in the clean calls and `drwrap_wrap` callbacks are pretty much only for synchronizing use of a custom hashtable and a couple of arena allocators.


Derek Bruening

Nov 24, 2025, 1:53:25 PM
to Michael Flanders, DynamoRIO Users
This may be https://github.com/DynamoRIO/dynamorio/issues/569 where, as you noted, DR is not able to safely relocate a thread in a clean call.
In many cases this doesn't matter, as no relocation is needed during app runs, particularly on Linux with the "reset" feature turned off; but on Windows there are cases such as the one you hit, with the app doing SetInformationProcess.ProcessTlsInformation on another process.
There are ideas for relocating in that issue, but that is not necessarily simple to implement. It probably relies on clients not changing contexts, so it probably needs some client API to handle corner cases.
A probably simpler option is to reduce the number of clean calls you're using, which would help with performance and other things.
Another option, which just tries to improve ProcessTlsInformation handling, would be to use a less strict synchall target: we don't really need a valid relocation context; we just need to update the app TEB pointer of each thread. We'd have to study this operation to figure out a relaxation. We might have to add a new target that disallows DR code itself but allows client code; or maybe simple locks could be relied on, but I don't remember the synch model for the app TEB fields: likely locks are not used for a thread's own fields today.
