I had similar problems in the past.
The 2 most common issues were:
1. Controller load - if slurmctld was under heavy use, it sometimes didn't respond in a timely manner, exceeding the timeout limit.
2. Topology, combined with message forwarding and aggregation.
For 2 - it would seem the nodes designated for forwarding are statically assigned based on topology. I could be wrong, but that's my observation, as I would get the socket timeout error when they had issues, even though other nodes in the same topology 'zone' were ok and could be used instead.
It took debug3 to observe this in the logs, I think.
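If it helps, these are the slurm.conf settings I would look at first for the two cases above. This is only a sketch: the values are illustrative assumptions, not taken from my setup or recommended for yours, so check each one against the slurm.conf man page for your Slurm version.

```
# slurm.conf fragment (illustrative values only, assumed for this example)

# Case 1: give a busy slurmctld more time to answer before a timeout is reported (seconds).
MessageTimeout=30

# Case 2: fanout of the tree the slurmd daemons use to forward messages.
TreeWidth=16

# Raise log verbosity to debug3 on the controller and on the nodes while investigating.
SlurmctldDebug=debug3
SlurmdDebug=debug3
```

The controller log level can also be raised temporarily with "scontrol setdebug debug3", and sdiag gives a rough picture of how loaded slurmctld is. Remember to lower the debug level again afterwards, as debug3 logs grow quickly.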
HTH
--Dani_L.