Did you have any luck finding the cause of this issue?
I have two nearly identical 4-node clusters on Linux (each node has 4 cores). One of the clusters has been running stably for a while, but on the second cluster I get the same error as you about 2-3 days into running a very large job (about 20 million cells, each 0.2 m in all dimensions). What is surprising is that the error says "... timed out for MP process 2 on node1" even though node1 is the main node that starts the simulation in the cluster.