Turns out it might be a problem in our own code :)
We have some code that makes sure users can ssh into the machines they are running jobs on. Once the last job on a node finishes, we have to kill all of the user's running processes that were started through ssh, along with leftover screen sessions etc. Very rarely it seems that too much gets killed, but it happens so rarely that we have a hard time testing it.
We have a fix that we think works, and it will be rolled out shortly, so hopefully that solves the problem. But since we can't reliably reproduce the problem, we can't really be sure.
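
For illustration only, here is a minimal Python sketch of the kind of cleanup described above: picking out a user's processes that hang off an ssh login or a screen session once their last job is done. This is not the actual code. The `Proc` record, the `pids_to_kill` helper, the `job_pids` exclusion set, and the rule "has an ancestor named `sshd` or `screen`" are all assumptions made up for the example, and it works on a plain list of process records rather than reading the real process table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    """Hypothetical snapshot of one process-table entry."""
    pid: int
    ppid: int
    uid: int
    name: str


# Assumption: sessions to clean up are rooted at a process with one of these names.
SESSION_ROOTS = {"sshd", "screen"}


def pids_to_kill(procs, uid, job_pids=frozenset()):
    """Return sorted PIDs owned by `uid` that are, or descend from, an
    sshd or screen process, skipping anything listed in `job_pids`."""
    by_pid = {p.pid: p for p in procs}

    def in_session(p):
        # Walk up the parent chain, starting with the process itself.
        seen = set()
        while p is not None and p.pid not in seen:
            seen.add(p.pid)  # guard against a malformed, cyclic table
            if p.name.lower() in SESSION_ROOTS:
                return True
            p = by_pid.get(p.ppid)
        return False

    return sorted(
        p.pid
        for p in procs
        if p.uid == uid and p.pid not in job_pids and in_session(p)
    )


# Made-up process table: user 1000 has an ssh login, a detached screen,
# and one unrelated process; user 1001 has an ssh login of their own.
EXAMPLE = [
    Proc(1, 0, 0, "init"),
    Proc(100, 1, 0, "sshd"),
    Proc(101, 100, 1000, "sshd"),
    Proc(102, 101, 1000, "bash"),
    Proc(103, 102, 1000, "vim"),
    Proc(200, 1, 1000, "SCREEN"),
    Proc(201, 200, 1000, "bash"),
    Proc(300, 1, 1000, "job_step"),
    Proc(400, 100, 1001, "sshd"),
    Proc(401, 400, 1001, "bash"),
]

if __name__ == "__main__":
    print(pids_to_kill(EXAMPLE, uid=1000))
```

In this sketch the selection is kept separate from the actual killing, so the rule can be tested against a fake process table without touching a live node.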