On our cluster we configure preemption to cancel jobs rather than suspend them, because that makes more sense for our situation, so I have no first-hand experience with jobs resuming after being suspended. That said, I can think of two possible reasons for what you're seeing:
- one is memory (have you checked your memory logs to see whether there is a correlation between node memory occupation and jobs not resuming correctly? see the sacct sketch after this list)
- the second is some resource disappearing (temp files? maybe in some circumstances slurm wipes out /tmp when the second job starts -- if so, that would be a slurm bug, obviously)
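If you want to chase the memory angle and you have accounting enabled, one quick way to gather data is to pull per-job memory figures out of sacct and eyeball whether the badly-resuming jobs cluster on memory-starved nodes. A minimal Python sketch, assuming sacct accounting is on and that you already have a list of suspect job IDs (nothing here is site-specific beyond that):

```python
#!/usr/bin/env python3
"""Rough sketch: pull State/MaxRSS/NodeList for a list of suspect job IDs
via sacct, so you can eyeball whether jobs that fail to resume cluster on
memory-starved nodes. Assumes sacct accounting is enabled."""
import subprocess
import sys

def job_memory_info(jobid: str) -> list[str]:
    # -n: no header, -P: pipe-separated output (one line per job step)
    out = subprocess.run(
        ["sacct", "-j", jobid, "-n", "-P",
         "--format=JobID,State,MaxRSS,NodeList"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()

if __name__ == "__main__":
    # job IDs of the jobs that did not resume correctly, passed as arguments
    for jobid in sys.argv[1:]:
        for line in job_memory_info(jobid):
            print(line)
```

Run it with the suspect job IDs as arguments and cross-reference the NodeList column against your node memory monitoring.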
Assuming that you're stuck without finding a root cause you can address, I guess it depends on what "doesn't recover" means. It's one thing if the job crashes immediately. It's another if it just stalls without ever restarting while slurm still thinks it's running and the users are charged their allocation -- even worse if your cluster does not enforce a wallclock limit (or has a very long one). Depending on the frequency of the issue, the size of your cluster, and other conditions, you may want to consider writing a watchdog script that searches for these jobs and cancels them, something like the sketch below.
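For what it's worth, here is a minimal sketch of such a watchdog. The detection heuristic is entirely my assumption: a job that slurm reports as running (R) but whose AveCPU, per sstat, is still zero after a good while is treated as stalled. The MIN_ELAPSED_MIN threshold and the DRY_RUN knob are made-up; you'd want to tune them and probably whitelist job types that legitimately idle before letting it scancel anything:

```python
#!/usr/bin/env python3
"""Watchdog sketch: find jobs that slurm thinks are running but that have
consumed essentially no CPU time, and (optionally) cancel them.
Heuristic only -- tune thresholds for your site before trusting it."""
import subprocess

MIN_ELAPSED_MIN = 30   # ignore jobs younger than this (made-up threshold)
DRY_RUN = True         # set to False to actually scancel

def parse_slurm_time(s: str) -> int:
    """Convert slurm's [D-]HH:MM:SS / MM:SS time format to whole minutes."""
    days = 0
    if "-" in s:
        d, s = s.split("-", 1)
        days = int(d)
    parts = [int(float(p)) for p in s.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)          # pad missing fields (MM:SS case)
    h, m, _sec = parts
    return days * 24 * 60 + h * 60 + m

def running_jobs():
    """Yield (jobid, elapsed_minutes) for all jobs in R state."""
    out = subprocess.run(
        ["squeue", "-h", "-t", "R", "-o", "%A %M"],
        capture_output=True, text=True, check=True,
    )
    for line in out.stdout.splitlines():
        jobid, elapsed = line.split()
        yield jobid, parse_slurm_time(elapsed)

def cpu_minutes(jobid: str) -> int:
    """Sum AveCPU over the job's steps via sstat. Note: sstat can fail or
    return nothing for jobs with no steps, which reads as zero here."""
    out = subprocess.run(
        ["sstat", "-j", jobid, "-n", "-P", "--format=AveCPU"],
        capture_output=True, text=True,
    )
    total = 0
    for line in out.stdout.splitlines():
        line = line.strip()
        if line:
            total += parse_slurm_time(line)
    return total

if __name__ == "__main__":
    for jobid, elapsed in running_jobs():
        if elapsed < MIN_ELAPSED_MIN:
            continue
        if cpu_minutes(jobid) == 0:   # running for a while, no CPU used
            print(f"job {jobid}: {elapsed} min elapsed, ~0 CPU -- suspect")
            if not DRY_RUN:
                subprocess.run(["scancel", jobid], check=True)
```

You'd run it from cron under an account allowed to scancel other users' jobs, and keep DRY_RUN on until you're comfortable with the false-positive rate (note that jobs which never launch a step report no AveCPU at all and would be flagged).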
As I said, not really an answer, just my $0.02 (or even less)