I'd investigate the possibility of something before the wait statement putting the system into a bad state.
In OS design there is a concept called thrashing. The idea is that performance degrades sharply, not gradually, once you cross a threshold. For example, if I have a 4-core computer with between 1 and 4 processes running, things are fine. However, the moment I add a fifth process, things slow down. If there is not enough RAM for everything, the OS will occasionally have to save the state of one process to disk and then restore another process to memory. This disk activity causes timing to switch from RAM (nanoseconds) to disk (milliseconds). The end result is that running 5 processes at once takes disproportionately longer than running 4 processes and then 1 more. For example, all 5 processes together might take 30 minutes to complete, but 4 + 1 processes will take 20 minutes + 5 minutes for a total of 25 minutes.
The generalized concept is: don't try to do too much at once, and things get done faster. As a real-life example, if you have 20 things which need to be fixed on a project, trying to fix all 20 at once usually fails to fix anything properly. However, picking the top 3 things, fixing them, picking the next top 3, fixing them, and so on will get everything fixed faster and better.
Maybe your code is doing too much too quickly. Maybe waiting for certain events earlier in the cycle will give the system time to get things done. Conceptually you might have:
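- do #1 something
- do #2 something
- do #3 something
- do #4 something
- do #5 something
- do #6 something
- wait for a condition
- do #7 something
- do #8 something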
It could be that the system is still dealing with steps #6, #5 and possibly #4 when you reach the wait. So the wait condition isn't really waiting only for #6 to finish. I'd look at some of the earlier steps and ask: in theory, could they occasionally take some time to complete? If I notice that #2 might actually take a while to complete, I might need to wait for it even if #3 doesn't depend on it. The reason is not a direct dependency between #2 and #3 but that they will both be competing for a resource. So I might change the code to be:
- do #1 something
- do #2 something
- do #3 something
- do #4 something
- wait for a condition which indicates #2 is complete
- do #5 something
- do #6 something
- wait for a condition
- do #7 something
- do #8 something
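The reordered steps above could be sketched roughly as follows. This is only an illustration, assuming step #2 kicks off background work and can signal when it is done; `run_steps`, `step2_background` and the `log` list are hypothetical names, and the `time.sleep` stands in for whatever slow work #2 really does.

```python
import threading
import time


def run_steps(log):
    """Run the steps in order, waiting for #2 to finish before starting #5."""
    step2_done = threading.Event()

    def step2_background():
        # Hypothetical slow work started by step #2.
        time.sleep(0.05)
        log.append("step 2 finished")
        step2_done.set()

    log.append("do #1")
    worker = threading.Thread(target=step2_background)
    worker.start()  # do #2: returns immediately, the work continues in the background
    log.append("do #2 started")
    log.append("do #3")
    log.append("do #4")

    # Wait for a condition which indicates #2 is complete.
    # Fail loudly on timeout instead of hanging forever.
    if not step2_done.wait(timeout=5.0):
        raise TimeoutError("step #2 did not complete in time")

    log.append("do #5")
    log.append("do #6")
    worker.join()
    return log
```

The point of the explicit wait is that #5 and #6 never start while #2 is still competing for the same resource, even though nothing in #5 or #6 uses the result of #2.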
Where you wait, what you wait for, etc. requires you to have a really good feel for what is happening in the system. The worst part is my example has been HUGELY simplified. It is rare for the code to be linear. In practice the timing depends on a call to a call to a subroutine to a library, all inside a loop with if statements.
Sometimes I'll test a theory by just profiling the CPU, memory and disk I/O to see if a failure in my test suite correlates with a spike in one of the things I'm profiling. Unfortunately, sometimes the act of profiling changes timing just enough to make the problem go away. You stop profiling and the problem comes back.
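One lightweight way to look for that kind of correlation is to sample a metric in the background while the tests run and then pull out the samples around each failure. The sketch below is an assumption of mine, not a specific tool: `MetricSampler`, `samples_near` and the `read_metric` callable are hypothetical names, and you would plug in whatever CPU, memory or disk reading your platform provides.

```python
import threading
import time


class MetricSampler:
    """Record (timestamp, value) pairs from read_metric in a background thread."""

    def __init__(self, read_metric, interval=0.5):
        self.read_metric = read_metric  # hypothetical callable returning a number
        self.interval = interval
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append((time.time(), self.read_metric()))
            # Sleep for the interval, but wake early if asked to stop.
            self._stop.wait(self.interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()


def samples_near(samples, failure_time, window=2.0):
    """Return the samples taken within `window` seconds of a failure timestamp."""
    return [(t, v) for t, v in samples if abs(t - failure_time) <= window]
```

On a Unix-like system, `read_metric` could be something like `lambda: os.getloadavg()[0]`. Record `time.time()` when a test fails, then compare `samples_near(sampler.samples, failure_time)` against the samples from passing runs. A coarse sampling interval keeps the overhead low, which may reduce the chance that the measurement itself hides the problem.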
Sometimes if the profiling shows spikes in certain areas I'll add a sleep statement just before the spike and see if the spike goes away. If it does then I'll look at the code near the sleep and see if I can change the sleep to a wait-for-event statement.
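Turning a sleep into a wait-for-event often comes down to polling for a condition with a timeout. Here is a minimal sketch, assuming the condition can be checked cheaply; `wait_until` and its parameters are hypothetical names, not part of any particular test framework.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.1):
    """Poll predicate() until it returns a truthy value or the timeout expires.

    Returns True if the condition was met, False if we gave up.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

So a fixed `time.sleep(2)` placed before the spike could become something like `wait_until(lambda: job_queue_is_empty(), timeout=5)`, where `job_queue_is_empty` is a stand-in for whatever check tells you the earlier step has really finished. The wait then returns as soon as the system is ready, and it fails visibly when the system takes longer than expected.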
The best thing I can suggest is that knowledge is power. Time-box your attempt to fix the problem. Set two goals: the first is to see if you can make the problem go away, and the second is to learn something about the system and how it works. The more you learn about the system, the more you can apply to your next project. So trying to fix this problem might not be successful, but hopefully it will add to your experience and pay off in future projects.