I'm seeing a lot of messages like this:
2016/04/26 20:39:01 [WARN] raft: Failed to contact 10.2.7.25:8300 in 513.653662ms
2016/04/26 20:39:01 [WARN] raft: Failed to contact 10.2.8.90:8300 in 624.19334ms
2016/04/26 20:39:01 [WARN] raft: Failed to contact 10.2.8.25:8300 in 513.600026ms
2016/04/26 20:39:01 [WARN] raft: Failed to contact 10.2.8.215:8300 in 513.586349ms
2016/04/26 20:39:01 [WARN] raft: Failed to contact quorum of nodes, stepping down
This is causing leadership to churn constantly, with brief periods where there is no leader.
I'm running this in my own datacenter, and our network guy has yet to find any reason why RTT between these nodes should ever be more than 0.5ms. Does the reported duration reflect only network RTT, or does it also include any transactions which might bring with them contention for mutexes, I/O, etc.?
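
To line these warnings up against measured RTT, it may help to tally them per peer. Below is a minimal sketch in Python, assuming each warning sits on a single line in the format quoted above. The `summarize` helper and its regex are illustrative only and not part of Consul, and compound durations such as "1m2.5s" are not handled.

```python
import re
from collections import defaultdict

# Assumed pattern, derived only from the log lines quoted in this post.
# It handles durations printed in "ms" or "s"; other units are skipped.
LINE_RE = re.compile(
    r"\[WARN\] raft: Failed to contact "
    r"(\d+\.\d+\.\d+\.\d+:\d+) in (\d+(?:\.\d+)?)(ms|s)\b"
)


def summarize(lines):
    """Return {peer: (warning_count, max_duration_ms)} for the given log lines."""
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            # Lines without a peer address (for example the
            # "Failed to contact quorum of nodes" line) are ignored.
            continue
        peer = match.group(1)
        value = float(match.group(2))
        # Normalize seconds to milliseconds so all values are comparable.
        duration_ms = value * 1000.0 if match.group(3) == "s" else value
        stats[peer][0] += 1
        stats[peer][1] = max(stats[peer][1], duration_ms)
    return {peer: (count, max_ms) for peer, (count, max_ms) in stats.items()}
```

Calling `summarize(open("consul.log"))` (the file name is a placeholder) gives a per-peer count and worst-case duration, which can then be compared with ping or other RTT measurements taken over the same time window.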