Hello!
I am running DynamoRIO and my DR client, in debug mode, on an IA-32 application on Windows 10. This is all running in a 4-core, 64 GB, 64-bit Windows 10 VM.
I am inserting clean calls before nearly every instruction, and have wrapped (using `drwrap_wrap`) some file-manipulation and memory-allocation functions such as `operator new`, `malloc`, `CreateFileA`, etc. I am aware that this many clean calls is far from ideal, but this is a prototype and performance is not much of a concern at the moment.
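The wrapping itself is straightforward; schematically it looks like this (a minimal sketch: the module-load lookup and the `wrap_pre`/`wrap_post` callbacks are simplified placeholders for my actual code):

```c
#include "dr_api.h"
#include "drwrap.h"

static void wrap_pre(void *wrapcxt, void **user_data) { /* ... */ }
static void wrap_post(void *wrapcxt, void *user_data) { /* ... */ }

static void
event_module_load(void *drcontext, const module_data_t *mod, bool loaded)
{
    /* Wrap malloc wherever it is exported; CreateFileA etc. are analogous. */
    app_pc pc = (app_pc)dr_get_proc_address(mod->handle, "malloc");
    if (pc != NULL)
        drwrap_wrap(pc, wrap_pre, wrap_post);
}
```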
## Error
My `drwrap_wrap` callbacks and clean-call functions both access the same global arrays and vectors, guarded by `drvector_lock` and `dr_mutex_lock`. I have triple-checked that every lock has a corresponding unlock on every code path, but I am running into the following error:
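To make the sharing concrete, here is a minimal sketch of the pattern (names and bodies are simplified placeholders for my actual code):

```c
#include "dr_api.h"
#include "drwrap.h"

static void *g_table_lock; /* created with dr_mutex_create() at init */

/* Clean call inserted before nearly every instruction. */
static void
per_instr_clean_call(app_pc pc)
{
    dr_mutex_lock(g_table_lock);
    /* ... update the shared hashtable keyed by pc ... */
    dr_mutex_unlock(g_table_lock);
}

/* drwrap_wrap() pre-callback for e.g. malloc: takes the same lock. */
static void
wrap_pre_malloc(void *wrapcxt, void **user_data)
{
    size_t sz = (size_t)drwrap_get_arg(wrapcxt, 0);
    dr_mutex_lock(g_table_lock);
    /* ... record the pending allocation of sz bytes ... */
    dr_mutex_unlock(g_table_lock);
}
```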
```
Finished synch with all threads: result=0
SYSLOG_WARNING: Failed to suspend for ProcessTlsInformation
SYSLOG_ERROR: Application ... (5644). Internal Error: DynamoRIO debug check failure: C:\Users\name\Desktop\dynamorio\core\dispatch.c:545 dc == NULL || OWN_NO_LOCKS(dc)
(Error occurred @260272 frags in tid 1804)
version 11.3.20098, custom build
```
From the logs, it looks like `sync_with_all_threads` is called from `presys_SetInformationProcess`, and the number of times DR loops waiting for all threads to suspend exceeds `max_loops`, at which point DR stops attempting to suspend them. When execution then tries to re-enter the code cache, I get this `OWN_NO_LOCKS` failure. I also get this curiosity assert:
```
SYSLOG_WARNING: CURIOSITY : loop_count < max_loops in file C:\Users\name\Desktop\dynamorio\core\synch.c line 1488
```
On loglevel=3, I see that the threads that are not at a safe spot to suspend are commonly at these PCs (along with some others, which I resolved with symquery):
```
thread 8092 not at safe spot (pc=0x6d2463d2) for 4
PS C:\Users\name\Desktop> .\drmemory\bin\symquery.exe -e ..\dynamorio\build32\lib32\debug\dynamorio.dll -f -a 0x3253d2 d_r_print_encoding_first_line_to_buffer+0x62
```
and this one in ntdll:
```
thread 6376 not at safe spot (pc=0x7743324c) for 4
```
which looks like it is in a syscall sequence in the code cache:
```
interp: start_pc = 0x774332a0
  0x774332a0  b8 0e 00 07 00  mov  eax, 0x0007000e
  0x774332a5  ba 50 91 44 77  mov  edx, 0x77449150
  0x774332aa  ff d2           call edx
end_pc = 0x774332ac
find_syscall_num: found syscall number write: 458766
syscall # is 458766
found optimizable system call 0x7000e
ending bb at syscall & NOT removing the interrupt itself
setting cur_pc (for fall-through) to 0x774332ac
exit_branch_type=0x0 bb->exit_target=0x18e1bc00
emit_fragment: bb use ibl <0x18e1bc00> exit_branch_type=0x0 target=0x18e1bc00 l->flags=0x9002
Fragment 8713, tag 0x774332a0, flags 0x1801030, shared, size 26, must end trace: [ntdll.dll!ZwSetEvent]
Entry into F8713(0x774332a0).0x1db85920 (shared) [ntdll.dll!ZwSetEvent]
Exit from sourceless ibl: trace jmp* [] (target 0x774332ac not in cache)
fragment_add_ibl_target tag 0x774332ac, branch 2, F0
d_r_dispatch: target = 0x774332ac
interp: start_pc = 0x774332ac
  0x774332ac  c2 08 00  ret  0x0008
end_pc = 0x774332af
```
I also see a lot of SYSLOG warnings about memory consumption:
```
program.exe.0.1804.html:85949:SYSLOG_WARNING: Application
program.exe(5644). Out of contiguous memory. Alloc type: 0x22.
program.exe.0.1804.html:85950:SYSLOG_WARNING: Application
program.exe (5644). Out of contiguous memory. Alloc type: 0x41.
```
## Debugging so far
While debugging this, I removed instrumentation-added instructions (such as those computing arguments to clean calls) to try to figure out whether a particular instruction was causing this. I noticed that after removing a certain number of added instructions, I no longer got the error and execution succeeded reliably. After adding back one more instrumentation instruction, one that does not touch locks, I started reliably getting the error again, regardless of which instruction I added back. I can add that one instruction without hitting the error by bumping `synch_all_threads_max_loops`, but that doesn't help once I add back all of my instrumentation, and it doesn't seem like a proper solution anyway. My guess is that past a certain basic-block count or total code size, things get moved around, triggering a suspend-all-threads that then fails?
## Solution?
I am sort of stuck as far as pinpointing the exact cause of this problem, but I was thinking it could be:
1. using the same locks in clean calls (app code?) and in `drwrap_wrap` callbacks (DR code), or
2. the sheer amount of generated instrumentation code, given that the problem only appears after adding a certain amount of it (above).
For 1), reading through the docs, it seems like a fix might be some combination of `dr_app_recurlock_lock`, `dr_mark_safe_to_suspend`, and/or `dr_redirect_execution`. The locks used in the clean calls and `drwrap_wrap` callbacks are pretty much only for synchronizing use of a custom hashtable and a couple of arena allocators.
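If the lock sharing is indeed the issue, my understanding of the docs (which I may be misreading) is that the blocking path would look something like this, so that a thread stuck waiting on the lock can still be suspended. This is only a sketch; `g_table_lock` and `table_lock_safely` are placeholder names:

```c
#include "dr_api.h"

static void *g_table_lock; /* created with dr_mutex_create() at init */

/* Acquire the shared lock from a clean call, marking the thread safe
 * to suspend only while it is blocked waiting.  Per the documented
 * dr_mark_safe_to_suspend() contract, no DR or application state may
 * be touched while the thread is marked safe. */
static void
table_lock_safely(void *drcontext)
{
    if (!dr_mutex_trylock(g_table_lock)) {
        dr_mark_safe_to_suspend(drcontext, true);
        dr_mutex_lock(g_table_lock);
        dr_mark_safe_to_suspend(drcontext, false);
    }
    /* ... critical section on the hashtable/arenas ... */
    dr_mutex_unlock(g_table_lock);
}
```

Is this the intended usage, or should I be reaching for `dr_app_recurlock_lock` instead?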